[RFC] Adding Binary ID into LLVM Profiles

Motivation

There is no direct way to associate binaries with their corresponding profiles in LLVM. As a result, source code coverage processing requires an additional post-processing step to match executables to their associated profiles. To improve this, we propose embedding binary IDs into profiles, so that we can uniquely identify a profile and easily find the relevant binary.

Background

Binary ID

We use the name binary ID to refer to the unique identifiers used in binaries in different file formats. Build ID is a unique identifier for the build that is included in the ELF file format. It was originally introduced in the GNU toolchain, and is used for various purposes, such as associating binaries with core dumps. Build ID is optional, and can be enabled by using the -Wl,--build-id option. To the best of our knowledge, similar unique identifiers are used in other file formats. For example, a unique identifier called LC_UUID is used in Mach-O, and similarly a GUID (Globally Unique Identifier) is used in COFF.
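To make the ELF case concrete, here is a minimal sketch of how a build ID could be extracted from raw ELF note data. It relies on the standard note record layout (three 32-bit words for name size, descriptor size, and type, followed by the name and the descriptor) and on the GNU build ID note using the name "GNU" with type 3. The function names find_gnu_build_id and make_note are hypothetical helpers for illustration, and the sketch assumes 4-byte note alignment and that the caller has already located the note bytes.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build IDs


def _align4(n):
    """Round n up to the next multiple of 4 (assumed note field alignment)."""
    return (n + 3) & ~3


def find_gnu_build_id(note_data, little_endian=True):
    """Return the build ID bytes found in raw ELF note data, or None."""
    fmt = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(note_data):
        namesz, descsz, note_type = struct.unpack_from(fmt, note_data, offset)
        offset += 12
        name = note_data[offset:offset + namesz]
        offset += _align4(namesz)
        desc = note_data[offset:offset + descsz]
        offset += _align4(descsz)
        if note_type == NT_GNU_BUILD_ID and name == b"GNU\x00":
            # Reject a descriptor that runs past the end of the data.
            return desc if len(desc) == descsz else None
    return None


def make_note(name, note_type, desc):
    """Build one little-endian note record (hypothetical helper for testing)."""
    name_bytes = name + b"\x00"
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    padded_name = name_bytes.ljust(_align4(len(name_bytes)), b"\x00")
    padded_desc = desc.ljust(_align4(len(desc)), b"\x00")
    return header + padded_name + padded_desc
```

The scan skips unrelated notes, so it works when the build ID note is not the first record in the data.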

Profiling

Clang supports profiling with instrumentation for two main purposes:

  1. Front-end instrumentation, where the compiler front-end inserts instrumentation for collecting source code coverage.

  2. IR-level instrumentation, where LLVM inserts instrumentation during optimizations for PGO (Profile-Guided Optimization).

Profiling inserts instrumentation code into binaries, which is used by compiler-rt (the compiler runtime) during execution. When the instrumented binary executes, it writes a raw profile (.profraw). Multiple raw profiles are merged together using the llvm-profdata tool. At the end, a single indexed profile (.profdata) is created, which is used to generate source code coverage reports.

The profile format consists of two major parts:

  1. The profile header includes the version and magic, as well as the paddings and sizes of each section in the raw profile.

  2. The profile data includes the function name and hash, and pointers to three sections: counters, names, and value profiling counters per function.

Proposal

We propose adding build ID, which is the unique binary ID in ELF, into profiles to improve the source code coverage post-processing step. Although we target the ELF file format, we are proposing a design that can be leveraged and extended for other file formats, such as Mach-O and COFF.

Extending profile format

We need to extend both the raw and indexed profile formats to include build ID. Since build ID does not have a fixed length, we will add a variable-length byte array at the end of the profile formats. We will also change the compiler-rt profiling runtime for ELF platforms to read build IDs from ELF data in memory and write them into the raw profile.
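As an illustration of what a variable-length trailing section could look like, the sketch below serializes each ID as a 64-bit little-endian length followed by the ID bytes, zero-padded to an 8-byte boundary. This layout is only an assumption made for the example, not the format this RFC commits to, and encode_binary_ids and decode_binary_ids are hypothetical names.

```python
import struct


def _pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def encode_binary_ids(binary_ids):
    """Serialize IDs as repeated (uint64 length, ID bytes, zero padding).

    Assumed layout for illustration only.
    """
    out = bytearray()
    for binary_id in binary_ids:
        out += struct.pack("<Q", len(binary_id))
        out += binary_id
        out += b"\x00" * (_pad8(len(binary_id)) - len(binary_id))
    return bytes(out)


def decode_binary_ids(section):
    """Parse the assumed layout back into a list of ID byte strings."""
    ids = []
    offset = 0
    while offset < len(section):
        if offset + 8 > len(section):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from("<Q", section, offset)
        offset += 8
        if offset + length > len(section):
            raise ValueError("truncated binary ID")
        ids.append(section[offset:offset + length])
        offset += _pad8(length)
    return ids
```

Prefixing each entry with its length lets a reader handle IDs of different sizes (for example, a 20-byte build ID next to a shorter one) without a fixed-width field, and the padding keeps every length field 8-byte aligned.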

Extending profiling tools

Since the profile format changes, we also need to extend the tools that process profiles. We need to extend the ProfileData library functions that the llvm-profdata tool uses to operate on profiles, and add support for printing the binary IDs in profiles.

Future Work

Embedding binary IDs into profiles would also enable implementing debuginfod support in llvm-cov, where the tool would automatically download the binaries corresponding to an input profile.
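For context, a debuginfod server serves files by build ID under the path /buildid/<hex build ID>/executable, so a tool that holds the ID from a profile could form the request roughly as follows. This is a sketch only: the helper name and the server URL in the usage line are hypothetical, and no llvm-cov interface is implied.

```python
def debuginfod_executable_url(server_url, build_id):
    """Return the debuginfod URL for the executable with the given build ID."""
    if not build_id:
        raise ValueError("build ID must not be empty")
    return f"{server_url.rstrip('/')}/buildid/{build_id.hex()}/executable"


# Hypothetical server URL, used only to show the resulting path shape.
url = debuginfod_executable_url(
    "https://debuginfod.example.com/", bytes.fromhex("01ab")
)
```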

Please let us know if you have any suggestions or questions.

Thanks,

Gülfem

Hi Gulfem, the current profile matching scheme supports function-level mismatch detection, which is at a finer level of granularity than the executable-level build ID. What is the use case for this level of identification?

David