PGO for cross compilation

Background

I’ve looked into building my llvm-mingw toolchain distribution with PGO (Profile Guided Optimization), at llvm-mingw PR #503. Here I’m resharing the main takeaways regarding how much performance one can gain with PGO, and what potential tradeoffs one can make that give most of the gains at a reasonable cost.

Building a toolchain with PGO usually requires compiling all of LLVM/Clang several more times; compared with just shipping a toolchain built with a preexisting (e.g. Linux-provided) toolchain, it requires two or three times the compilation work. The llvm-mingw toolchain currently has 7 release packages for different OSes/architectures; I wouldn’t want to multiply the total build effort by 2-3x.

Usually, with PGO, one builds an instrumented compiler, runs it on some representative workload to gather a profile, and uses that profile to rebuild an optimized compiler. However, when cross compiling, one can’t run the newly built compiler.
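For reference, the usual single-machine two-stage flow looks roughly like this. This is a sketch: the paths and the training workload are placeholders, while LLVM_BUILD_INSTRUMENTED and LLVM_PROFDATA_FILE are the actual LLVM CMake options involved.

```shell
# Stage 1: build an instrumented clang (use IR instead of Frontend
# for IR-level instrumentation).
cmake -G Ninja -S llvm -B build-instrumented \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_BUILD_INSTRUMENTED=Frontend
ninja -C build-instrumented clang

# Training run: compile something representative; each invocation of
# the instrumented compiler writes out a .profraw file.
build-instrumented/bin/clang -O2 -c sqlite3.c -o /dev/null

# Merge the raw profiles into one .profdata file (the directory where
# the .profraw files end up may differ depending on configuration).
llvm-profdata merge -output=clang.profdata build-instrumented/profiles/*.profraw

# Stage 2: rebuild clang, optimized using the merged profile.
cmake -G Ninja -S llvm -B build-optimized \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_PROFDATA_FILE=$PWD/clang.profdata
ninja -C build-optimized clang
```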

One central design aspect of how the llvm-mingw toolchain is built is that it is cross compiled; all toolchains for running on Windows are built on Linux. And it is built for platforms that might not even be available for execution in the build environment.

Potential gains of PGO (and LTO)

First off, I’ve measured how much the performance of Clang can be improved by building it with PGO, in a normal setup on Linux; doing both the instrumented and optimized build using a preexisting, older distro-provided version of Clang. I evaluated the build configurations LLVM_BUILD_INSTRUMENTED=Frontend and LLVM_BUILD_INSTRUMENTED=IR.

I benchmarked this with a build setup on GitHub Actions; see the full results here.

As a benchmark, I built Clang 20.1.6 with the distro compilers provided by Ubuntu 24.04 (GCC 13 and Clang 18), and measured how long the result took to compile sqlite.

Configuration               time (s)   speedup
GCC                           20.541        0%
Clang                         20.213        1%
Clang, LTO                    18.535       10%
Clang, PGO(Frontend)          15.758       30%
Clang, PGO(IR)                15.040       36%
Clang, LTO+PGO(Frontend)      14.625       40%
Clang, LTO+PGO(IR)            13.753       49%
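The speedup percentages appear to be relative to the GCC baseline, computed as baseline/time - 1 and truncated to whole percent. A small sketch reproducing the column from the times in the table:

```python
# Compile times (seconds) from the table above; speedup is relative
# to the GCC baseline, truncated to a whole percentage.
times = {
    "GCC": 20.541,
    "Clang": 20.213,
    "Clang, LTO": 18.535,
    "Clang, PGO(Frontend)": 15.758,
    "Clang, PGO(IR)": 15.040,
    "Clang, LTO+PGO(Frontend)": 14.625,
    "Clang, LTO+PGO(IR)": 13.753,
}

baseline = times["GCC"]
for config, t in times.items():
    speedup = int((baseline / t - 1) * 100)
    print(f"{config}: {speedup}%")  # matches the speedup column above
```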

Here, we note that LLVM_BUILD_INSTRUMENTED=Frontend gives decent speedups, while LLVM_BUILD_INSTRUMENTED=IR gives even better performance.

PGO in cross compilation

When cross building an LLVM toolchain, one can’t execute the cross-built binaries. Can one reuse a preexisting profile from a build for one OS (or architecture) in a build for another OS or architecture? I’m not sure if this is really documented anywhere (although I didn’t look very far, TBH).

I built an instrumented Clang for Linux, profiled it, then used that profile for cross compiling Clang for Windows (with the same version of Clang as used for the instrumented build). See this commit for the test script and this actions run for the results.

Configuration          time (s)   speedup
Regular                  20.265        0%
LTO                      18.834        7%
PGO(Frontend)            16.140       25%
PGO(IR)                  17.494       15%
LTO+PGO(Frontend)        14.991       35%
LTO+PGO(IR)              16.347       23%

Here, we can see that LLVM_BUILD_INSTRUMENTED=Frontend gives almost the same level of speedup as it did in the regular case, while LLVM_BUILD_INSTRUMENTED=IR performs much worse. This is probably intuitive and to be expected, but it is still worth noting.

Thus, sharing/reusing a profile for cross compilation works quite well with LLVM_BUILD_INSTRUMENTED=Frontend. It doesn’t give the very best potential results, but it does give quite a decent speedup without needing to gather a new profile.
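Concretely, reusing the Linux-gathered profile in the cross build comes down to pointing the optimized cross build at the already merged profile. A sketch, where the toolchain file and profile paths are placeholders but LLVM_PROFDATA_FILE is the real CMake option:

```shell
# Cross compile clang for Windows, reusing a profile gathered by
# running an instrumented clang on Linux.
cmake -G Ninja -S llvm -B build-mingw \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=/path/to/mingw-toolchain.cmake \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_PROFDATA_FILE=$PWD/clang.profdata
ninja -C build-mingw clang
```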

PGO with mixed compiler versions

I also considered whether one could do the instrumentation/profiling build with one version of Clang, and use that profile when doing an optimized build with a different version of Clang. (E.g. profiling on Linux with a distro-provided version of Clang, but using that profile for cross compilation with a different version of Clang from llvm-mingw. Or using it for an optimized build with Apple Clang.) In the end, I didn’t go down this route, but the results are still worth noting.

For benchmarks of this scenario, see this commit for the benchmark and the results. Here I did single stage builds with the distro compilers from Ubuntu 22.04 (GCC 11 and Clang 14), and tested doing instrumentation/profiling with Clang 14 and optimized builds with Clang 20.

Configuration                     time (s)   speedup
GCC 11                              19.664        0%
Clang 14                            20.008       -1%
Clang 14, LTO                       18.547        6%
Clang 14+20, PGO(Frontend)          15.444       27%
Clang 14+20, PGO(IR)                17.872       10%
Clang 14+20, LTO+PGO(Frontend)      14.468       35%
Clang 14+20, LTO+PGO(IR)            16.754       17%

The results are essentially the same as in the cross PGO case above; LLVM_BUILD_INSTRUMENTED=Frontend gives pretty much the same level of speedup as before, while LLVM_BUILD_INSTRUMENTED=IR gives worse results.

Summary

All in all: LLVM_BUILD_INSTRUMENTED=Frontend gives around 6% less speedup than LLVM_BUILD_INSTRUMENTED=IR. But it gives roughly the same level of speedup even when the profile is reused for compiling for a different OS/architecture, and/or with a different version of Clang - while LLVM_BUILD_INSTRUMENTED=IR gives worse results than Frontend if the compilation environments don’t match exactly.


I’m guessing the performance regression in IRPGO is simply due to mismatched profiles. You can build with -stats to see how many functions have matching profiles. This should be the relevant statistic.

Yes, I would guess so. And even with Frontend instrumentation, I would expect some amount of mismatches around the edges, in anything that interacts directly with the system or arch-specific headers. But the majority of the LLVM/Clang internal code should match, at least.

Hmm, right. I wasn’t familiar with this area; after a bit of poking around, I see that by building with -mllvm -stats, I can get such numbers printed out. It requires having the compiler built with assertions (or LLVM_FORCE_ENABLE_STATS enabled in cmake). So I guess e.g. the ratio of (NumOfPGOMismatch+NumOfPGOMissing)/NumOfPGOFunc would be interesting to gather. To properly get a sense of things, one would need to aggregate those numbers from the whole build.
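A rough sketch of how such aggregation could look. The assumption here is that `-mllvm -stats` prints per-invocation lines of the form "value component - description" to stderr; the component and description strings in the example data are placeholders, not the real counter output, so they would need adjusting against actual logs.

```python
import re
from collections import Counter

# Match statistic lines of the assumed form:
#   "<value> <component> - <description>"
STAT_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+-\s+(.+)$")

def aggregate_stats(log_lines):
    """Sum each statistic across all compiler invocations in a build,
    keyed on (component, description)."""
    totals = Counter()
    for line in log_lines:
        m = STAT_LINE.match(line)
        if m:
            totals[(m.group(2), m.group(3))] += int(m.group(1))
    return totals

# Example with made-up log lines (component/description are placeholders):
logs = [
    "  100 pgo-instr-use - Number of functions with profile",
    "    7 pgo-instr-use - Number of mismatched functions",
    "   50 pgo-instr-use - Number of functions with profile",
    "    3 pgo-instr-use - Number of mismatched functions",
]
totals = aggregate_stats(logs)
mismatched = totals[("pgo-instr-use", "Number of mismatched functions")]
with_profile = totals[("pgo-instr-use", "Number of functions with profile")]
print(f"{mismatched}/{with_profile} mismatched")
```

From totals like these, one could then compute a build-wide ratio such as (NumOfPGOMismatch+NumOfPGOMissing)/NumOfPGOFunc once the exact counter descriptions are known.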