Implementing code coverage for a custom programming language

I want to implement code coverage for my programming language. The coverage mapping format is explained here: LLVM Code Coverage Mapping Format — LLVM 13 documentation

But when I compile a sample file with -fprofile-instr-generate -fcoverage-mapping, I also get some other globals named __profd_<name>, which are of type { i64, i64, i64*, i8*, i8*, i32, [2 x i16] }. Could you explain what values I have to pass here?

I already figured some out by looking at the source code: llvm-project/InstrProfData.inc at main · llvm/llvm-project · GitHub

But some others make no sense to me. For example, the two i8* fields seem to always be null, and the [2 x i16] seems to always be zero-initialized.

Another global I found no documentation on is __llvm_prf_nm. What do I need to set it to?

Code coverage leverages the PGO infrastructure to gather the actual coverage at runtime.

The way PGO usually works is that clang generates calls to a few intrinsics, like @llvm.instrprof.increment. Those calls are then lowered by an LLVM IR pass; see llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp. I don’t know if the data structures generated by that pass are documented anywhere.

I have no experience working with code coverage, but to answer your question about the purpose of the individual members in { i64, i64, i64*, i8*, i8*, i32, [2 x i16] }, I think a better place to look is compiler-rt’s profile component, which implements the runtime needed by PGO.
First, compiler-rt also has a copy of InstrProfData.inc (the two are identical), which is used by files in compiler-rt/lib/profile. So, for instance, the first i8* field is __llvm_profile_data::FunctionPointer, which is used in InstrProfilingValue.c:

/* This method is only used in value profiler mock testing.  */
COMPILER_RT_VISIBILITY void *
__llvm_get_function_addr(const __llvm_profile_data *Data) {
  return Data->FunctionPointer;
}

And the second i8* field is __llvm_profile_data::Values, which is used by the value profiling infrastructure in various places, such as InstrProfilingValue.c.
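For reference, the field order of the { i64, i64, i64*, i8*, i8*, i32, [2 x i16] } type can be sketched as a Python ctypes structure. The field names below are taken from a reading of the INSTR_PROF_DATA entries in InstrProfData.inc around LLVM 13 and should be checked against the version you build with. The ProfData class is only an illustration, not something LLVM provides, and the sample values are the ones from the __profd_foo global quoted further down in this thread.

```python
import ctypes

# Hypothetical mirror of __llvm_profile_data. Field names are assumed from
# InstrProfData.inc (LLVM 13 era) and may differ in other releases.
class ProfData(ctypes.Structure):
    _fields_ = [
        ("NameRef", ctypes.c_uint64),            # i64: hash of the function name
        ("FuncHash", ctypes.c_uint64),           # i64: structural hash from the frontend
        ("CounterPtr", ctypes.c_void_p),         # i64*: points at __profc_<name>
        ("FunctionPointer", ctypes.c_void_p),    # i8*: null in the output shown in this thread
        ("Values", ctypes.c_void_p),             # i8*: null in the output shown in this thread
        ("NumCounters", ctypes.c_uint32),        # i32: number of counters
        ("NumValueSites", ctypes.c_uint16 * 2),  # [2 x i16]: zero in the output shown
    ]

# Sample values copied from the __profd_foo global in this thread.
foo = ProfData(NameRef=6699318081062747564, FuncHash=24, NumCounters=1)
print(hex(foo.NameRef), foo.FuncHash, foo.NumCounters, list(foo.NumValueSites))
```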

Are these intrinsics documented anywhere? So, according to you, I don’t need to generate what -fprofile-instr-generate produces myself, but can let LLVM do this for me?

https://llvm.org/docs/LangRef.html#llvm-instrprof-increment-intrinsic

Assuming your frontend generates LLVM IR, yes.

Thanks for the reply. I messed around with manually adding @llvm.instrprof.increment to my functions and running the result through opt --instrprof and, what do you know, it generated the required globals, yay.

Some things are beyond weird, though. It chooses the names for the globals by literally chopping off the first 8 characters of the name of the global passed as the first parameter of @llvm.instrprof.increment. I mean, I can work with that, but it’s very strange.
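This behaviour looks consistent with the pass assuming that the name variable starts with the 8-character prefix __profn_ and swapping that prefix for __profd_, __profc_ and so on, without checking that the prefix is really there. A rough Python sketch of that assumed behaviour follows; derived_names is a made-up helper for illustration, not an LLVM function.

```python
NAME_PREFIX = "__profn_"  # 8 characters; assumed to be what the pass strips

def derived_names(name_var: str) -> dict:
    """Mimic the assumed renaming: drop the first 8 characters, add new prefixes."""
    base = name_var[len(NAME_PREFIX):]  # no check that the prefix is actually present
    return {"data": "__profd_" + base, "counters": "__profc_" + base}

print(derived_names("__profn_foo"))
print(derived_names("my_name_for_foo"))  # a differently named global loses 8 chars too
```

Under that assumption, naming the global __profn_<function name> in the frontend gives the expected __profd_<function name> and __profc_<function name> globals.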

The next problem I have is that the hash generated by this isn’t md5 as described in the documentation for the coverage mapping. For a function called foo I get the hash 5cf8c24cdb18bdac and the md5 for bar is 37b51d194a7513e45b56f6524f2d51f2. So it’s neither the low nor the high bits. What hashing algorithm does LLVM use? I assume I need to use the same hash for the code coverage mapping format, so I need to calculate it somehow.

The hash is just a unique identifier for the function, generated by your frontend. It’s there to avoid using mismatched profile data: if a user modifies their source code, any existing profile is invalid. clang generates this by hashing the AST of the function.

It doesn’t matter what hash algorithm you use: the hash is only going to be compared against other hashes generated by your frontend.

You are right, that’s the structural hash and I can just set that to 0 if I please. However, it also hashes the name of the function:

@__covrec_5CF8C24CDB18BDACu = linkonce_odr hidden constant <{ i64, i32, i64, i64, [9 x i8] }> <{ i64 6699318081062747564, i32 9, i64 24, i64 3850114258649334376, [9 x i8] c"\01\01\00\01\01\01\0B\02\02" }>, section "__llvm_covfun", comdat, align 8
...
@__profd_foo = private global { i64, i64, i64*, i8*, i8*, i32, [2 x i16] } { i64 6699318081062747564, i64 24, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_foo, i32 0, i32 0), i8* null, i8* null, i32 1, [2 x i16] zeroinitializer }, section "__llvm_prf_data", comdat($__profc_foo), align 8

Note the 6699318081062747564 in both. As __profd_foo gets generated by opt, my frontend needs to generate the __covrec global, which uses the hash of the function name as its name. I assume this is how it finds the function record? Please correct me if I’m wrong.

Oh. It looks like that hash is IndexedInstrProf::ComputeHash. From the implementation, it looks like that’s supposed to be MD5, but no idea if it actually is.
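A quick check suggests it is MD5 after all, just truncated and read in little-endian byte order. The MD5 digest of "foo" is acbd18db4cc2f85cedef654fccc4a4d8 (the digest 37b51d194a7513e45b56f6524f2d51f2 quoted earlier is the one for "bar"), and reading its first 8 bytes as a little-endian integer gives 5cf8c24cdb18bdac. A minimal Python sketch, where name_hash is a hypothetical helper rather than an LLVM API:

```python
import hashlib

def name_hash(func_name: str) -> int:
    """First 8 bytes of the MD5 digest, read as a little-endian 64-bit integer."""
    digest = hashlib.md5(func_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

h = name_hash("foo")
print(f"{h:016x}", h)  # 5cf8c24cdb18bdac 6699318081062747564
```

This matches the i64 6699318081062747564 in both __profd_foo and the __covrec record above. Whether the name needs extra decoration before hashing in some cases (for example for functions with local linkage) is something to verify against clang's output.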