RFC: Machine IR Profile

Machine IR Profile (MIP): https://reviews.llvm.org/D104060

The full branch can be found at https://github.com/ellishg/llvm-project

tl;dr;

This is a proposal to introduce a new instrumentation pass that can produce optimization profiles with a focus on binary size and runtime performance of the instrumented binaries.

Our instrumented binaries record machine function call counts, machine function timestamps, machine basic block coverage, and a subset of the dynamic call graph. There is also a more lightweight mode that collects only machine function coverage data; it has negligible runtime overhead and increases the size of instrumented binaries by 2-5%.

Motivation

In the mobile space, increasing binary size has an outsized impact on both runtime performance and download speed. Current instrumentation implementations such as XRay and GCov produce binaries that are too slow and too large to run on real mobile devices. We propose a new pass that injects instrumentation code at the Machine IR level. At runtime, we write profile data to our custom __llvm_mipraw section, which is eventually dumped to a .mipraw file. At build time, we emit a .mipmap file, which we use to map function information to data in the .mipraw file. As a result, no redundant function info is stored in the binary, which keeps the size overhead of our instrumentation minimal.
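To make the split between the two files concrete, here is a minimal sketch of how a post-processing step could join a map with raw data. It is not the actual MIP file format: the MapEntry and FunctionCoverage types, the rawOffset field, and the convention that a slot holds 0xFF until the function is first called are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical build-time record. It would live in the .mipmap file,
// so the function name never has to be stored in the shipped binary.
struct MapEntry {
  std::string name;    // mangled function name
  uint32_t rawOffset;  // assumed: byte offset of this function's slot in the raw data
};

// Hypothetical post-processed record.
struct FunctionCoverage {
  std::string name;
  bool covered;
};

// Join the map with the raw bytes dumped at runtime.
// Assumed convention: a slot is 0xFF until the function runs, then 0x00.
std::vector<FunctionCoverage> joinProfile(const std::vector<MapEntry> &map,
                                          const std::vector<uint8_t> &raw) {
  std::vector<FunctionCoverage> out;
  for (const auto &entry : map) {
    assert(entry.rawOffset < raw.size() && "map entry points outside raw data");
    out.push_back({entry.name, raw[entry.rawOffset] == 0x00});
  }
  return out;
}
```

With a map of {"_Z3fooi" at offset 0, "_Z3bari" at offset 1} and raw bytes {0xFF, 0x00}, this sketch reports _Z3fooi as not covered and _Z3bari as covered. The binary itself only carries the raw bytes.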

MIP has been implemented on ELF and Mach-O targets for x86_64, AArch64, and Armv7 with Thumb and Thumb2.

Performance

Our focus for now is on the performance and size of binaries that have injected instrumentation instead of binaries that have been optimized with instrumentation profiles. We collected some basic results from MultiSource/Benchmarks in llvm-test-suite for both MIP and clang’s instrumentation using the -fprofile-generate flag. It should be noted that this comparison is not fair because clang’s instrumentation collects much more data than just function coverage. However, we expect fully-featured MIP to have similar metrics.

Instrumented Binary Size

At the moment, we have implemented function coverage which injects one x86_64 instruction (7 bytes) and one byte of global data for each instrumented function, which should have minimal impact on binary size and runtime performance. In fact, our results show that we should expect MIP instrumented binaries to be only 2-5% larger. We contrast this with clang’s instrumentation, which can increase the binary size by 500-900%.
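As a rough source-level picture of what function coverage costs, the sketch below models the probe as a single byte store at function entry. The names foo_covered, foo, and instrumentationOverhead are hypothetical, and the 0xFF/0x00 convention is an assumption. On x86_64, a store of an immediate byte to a RIP-relative address (for example movb $0, sym(%rip)) encodes in 7 bytes, which matches the size quoted above; whether MIP emits exactly this instruction is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-byte coverage slot for foo.
// Assumed convention: 0xFF = never called, 0x00 = called at least once.
static volatile uint8_t foo_covered = 0xFF;

int foo(int x) {
  foo_covered = 0;  // the single store a function-coverage probe would perform
  return x + 1;
}

// Per-function cost taken from the text: one 7-byte instruction plus one data byte.
constexpr unsigned kProbeBytes = 7;
constexpr unsigned kDataBytes = 1;

// Total bytes added for a given number of instrumented functions.
constexpr unsigned long long instrumentationOverhead(unsigned long long numFunctions) {
  return numFunctions * (kProbeBytes + kDataBytes);
}

// Small usage example: calling foo once flips its slot.
void demoProbe() {
  foo(41);
  assert(foo_covered == 0x00);
}
```

Under these numbers, a binary with 10,000 instrumented functions would grow by 80,000 bytes on x86_64, independent of function size.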

Instrumented Execution Time

We found that MIP-instrumented binaries had negligible execution time regressions. Again, we can (unfairly) contrast this with -fprofile-generate, which increased execution time by 1-40%.

Usage

We use the -fmachine-profile-generate clang flag to produce an instrumented binary and then use llvm-objcopy to extract the .mipmap file.

$ clang -g -fmachine-profile-generate main.cpp
$ llvm-objcopy --dump-section=__llvm_mipmap=default.mipmap a.out /dev/null
$ llvm-strip -g a.out -o a.out.stripped

This will produce the instrumented binary a.out, a stripped copy a.out.stripped, and a map file default.mipmap.

When we run the binary, it will produce a default.mipraw file containing the profile data for that run.

$ ./a.out.stripped
$ ls
a.out    default.mipmap    default.mipraw    main.cpp

Then we use our custom tool to postprocess the raw profile and produce the final profile default.mip.

$ llvm-mipdata create -p default.mip default.mipmap
$ llvm-mipdata merge -p default.mip default.mipraw
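The merge step can be run once per raw profile, so data from several runs accumulates in the same .mip file. The sketch below shows one plausible way to combine per-function data from two runs. It is not the llvm-mipdata implementation: the FunctionProfile type and the merge rules (call counts are summed, a block is covered if any run covered it) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory profile for one function.
struct FunctionProfile {
  uint64_t callCount = 0;
  std::vector<bool> blockCovered;  // one flag per machine basic block
};

// Fold one run's data into the accumulated profile.
// Assumed rules: sum call counts, OR block coverage.
void mergeInto(FunctionProfile &acc, const FunctionProfile &run) {
  // Both profiles describe the same function from the same binary,
  // so the block counts must agree unless acc is still empty.
  assert(acc.blockCovered.empty() ||
         acc.blockCovered.size() == run.blockCovered.size());
  if (acc.blockCovered.empty())
    acc.blockCovered.resize(run.blockCovered.size(), false);

  acc.callCount += run.callCount;
  for (std::size_t i = 0; i < run.blockCovered.size(); ++i)
    acc.blockCovered[i] = acc.blockCovered[i] || run.blockCovered[i];
}
```

Merging a run with call count 1 and blocks {covered, not covered, not covered} into a run with call count 2 and blocks {not covered, not covered, covered} gives call count 3 with the first and last blocks covered.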

If our binary has debug info, we can use it to report source information along with the profile data.

$ llvm-mipdata show -p default.mip --debug a.out
_Z3fooi
  Source Info: /home/main.cpp:9
  Call Count: 0
  Block Coverage:
     COLD COLD COLD COLD COLD

_Z3bari
  Source Info: /home/main.cpp:16
  Call Count: 1
  Block Coverage:
     HOT  HOT  COLD HOT  HOT

Finally, we can consume the profile using the clang flag -fmachine-profile-use= to produce a profile-optimized binary.

$ clang -fmachine-profile-use=default.mip main.cpp