Background
I’ve looked into building my llvm-mingw toolchain distribution with PGO (Profile Guided Optimization), in llvm-mingw PR #503. Here I’m resharing the main takeaways: how much performance one can gain with PGO, and which tradeoffs give most of the gains at a reasonable cost.
Building a toolchain with PGO usually requires compiling all of LLVM/Clang several times; compared with just shipping a toolchain built with a preexisting toolchain (e.g. on Linux), it requires two or three times the amount of compilation work. The llvm-mingw toolchain currently has 7 release packages for different OSes/architectures; I wouldn’t want to multiply the total build effort by 2-3x.
Usually, with PGO, one would build the instrumented compiler for profiling, run that, and use the result for rebuilding an optimized compiler. However, when cross compiling, one can’t run the newly built compiler.
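The usual flow described above can be sketched as three command lines. This is only a minimal sketch, not the llvm-mingw build scripts: the directory names, the profile paths and the helper `pgo_commands` are hypothetical. Only the CMake options `LLVM_BUILD_INSTRUMENTED` and `LLVM_PROFDATA_FILE` and the `llvm-profdata merge` command come from LLVM itself. The function just assembles the commands; it doesn't run anything.

```python
def pgo_commands(mode, src="llvm-project/llvm"):
    """Return the command lines for a basic PGO build of Clang.

    mode is "Frontend" or "IR". All paths are hypothetical placeholders.
    """
    if mode not in ("Frontend", "IR"):
        raise ValueError("mode must be 'Frontend' or 'IR'")
    common = [
        "cmake", "-G", "Ninja", "-S", src,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DLLVM_ENABLE_PROJECTS=clang",
        # PGO needs Clang as the host compiler for both builds.
        "-DCMAKE_C_COMPILER=clang",
        "-DCMAKE_CXX_COMPILER=clang++",
    ]
    return {
        # 1. Build a compiler that records execution counts when it runs.
        "instrumented": common + ["-B", "build-instr",
                                  f"-DLLVM_BUILD_INSTRUMENTED={mode}"],
        # 2. After running the instrumented compiler on a training workload,
        #    merge the raw profiles into one indexed profile. The glob is a
        #    placeholder and needs a shell (or glob.glob) to expand it.
        "merge": ["llvm-profdata", "merge", "-output=clang.profdata",
                  "build-instr/profiles/*.profraw"],
        # 3. Rebuild, letting the optimizer use the merged profile.
        "optimized": common + ["-B", "build-opt",
                               "-DLLVM_PROFDATA_FILE=clang.profdata"],
    }


for step, cmd in pgo_commands("Frontend").items():
    print(f"{step}: {' '.join(cmd)}")
```

Step 2 is the one that breaks down when cross compiling, since the instrumented compiler from step 1 can't be executed on the build machine.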
One central design choice in how the llvm-mingw toolchain is built is that it is cross compiled; all toolchains for running on Windows are built on Linux. It is also built for platforms that might not even be available for execution in the build environment.
Potential gains of PGO (and LTO)
First off, I’ve measured how much the performance of Clang can be improved by building it with PGO, in a normal setup on Linux; doing both the instrumented and optimized build using a preexisting older distro provided version of Clang. I evaluated the build configurations LLVM_BUILD_INSTRUMENTED=Frontend and LLVM_BUILD_INSTRUMENTED=IR.
I benchmarked this with a build setup on GitHub Actions; see the full results here.
As a benchmark, I built Clang 20.1.6 with the distro compilers provided by Ubuntu 24.04, GCC 13 and Clang 18, and timed the resulting Clang compiling sqlite.
| | time | speedup |
|---|---|---|
| GCC | 20.541 | 0% |
| Clang | 20.213 | 1% |
| Clang, LTO | 18.535 | 10% |
| Clang, PGO(Frontend) | 15.758 | 30% |
| Clang, PGO(IR) | 15.040 | 36% |
| Clang, LTO+PGO(Frontend) | 14.625 | 40% |
| Clang, LTO+PGO(IR) | 13.753 | 49% |
Here, we note that LLVM_BUILD_INSTRUMENTED=Frontend gives decent speedups, while LLVM_BUILD_INSTRUMENTED=IR gives even better performance.
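The speedup column isn't defined explicitly. The numbers in the first table are consistent with `baseline / time - 1`, truncated to a whole percent, with the GCC build as the baseline. Treat that formula as my reading of the table rather than a documented definition; the helper name `speedup_percent` is made up for this sketch.

```python
def speedup_percent(baseline, time):
    """Speedup of `time` relative to `baseline`, in whole percent.

    Assumed formula: baseline / time - 1, truncated toward zero.
    """
    return int((baseline / time - 1) * 100)


# Times copied from the first table; the GCC build is the baseline.
times = {
    "GCC": 20.541,
    "Clang": 20.213,
    "Clang, LTO": 18.535,
    "Clang, PGO(Frontend)": 15.758,
    "Clang, PGO(IR)": 15.040,
    "Clang, LTO+PGO(Frontend)": 14.625,
    "Clang, LTO+PGO(IR)": 13.753,
}

baseline = times["GCC"]
for name, t in times.items():
    print(f"{name}: {speedup_percent(baseline, t)}%")
```

With this definition, a speedup of 49% means the baseline compiler takes about 1.49 times as long as the optimized one, not that the optimized one takes half the time.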
PGO in cross compilation
When cross building an LLVM toolchain, one can’t execute the cross built binaries. Can one reuse a preexisting profile from a build for one OS (or architecture) in a build for another architecture? I’m not sure if this really is documented anywhere (although I didn’t look very far TBH).
I built an instrumented Clang for Linux, profiled it, then used that profile for cross compiling Clang for Windows (with the same version of Clang as used for the instrumented build). See this commit for the test script and this actions run for the results.
| | time | speedup |
|---|---|---|
| Regular | 20.265 | 0% |
| LTO | 18.834 | 7% |
| PGO(Frontend) | 16.140 | 25% |
| PGO(IR) | 17.494 | 15% |
| LTO+PGO(Frontend) | 14.991 | 35% |
| LTO+PGO(IR) | 16.347 | 23% |
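Here, we can see that LLVM_BUILD_INSTRUMENTED=Frontend gives almost the same level of speedup as it did in the regular case, while LLVM_BUILD_INSTRUMENTED=IR performs much worse. This is probably intuitive and to be expected, presumably because IR profiles are matched against the compiler’s intermediate representation, which differs more between targets than the source-level structure that Frontend profiles are matched against. It is still worth noting.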
Thus, sharing/reusing a profile for cross compilation works quite well with LLVM_BUILD_INSTRUMENTED=Frontend. It doesn’t give the very best potential results, but it does give a quite decent speedup without needing to generate a new profile.
PGO with mixed compiler versions
I also considered whether one could do the instrumentation/profiling build with one version of Clang, and use that profile when doing an optimized build with a different version of Clang. (E.g. profiling on Linux with a distro provided version of Clang, but using that profile for cross compilation with a different version of Clang from llvm-mingw. Or using it for doing an optimized build with Apple Clang.) In the end, I didn’t do this, but the results are still worth noting.
For benchmarks of this scenario, see this commit for the benchmark and the results. Here I did single stage builds with the distro compilers from Ubuntu 22.04 (GCC 11 and Clang 14), and tested doing instrumentation/profiling with Clang 14 and optimized builds with Clang 20.
| | time | speedup |
|---|---|---|
| GCC 11 | 19.664 | 0% |
| Clang 14 | 20.008 | -1% |
| Clang 14, LTO | 18.547 | 6% |
| Clang 14+20, PGO(Frontend) | 15.444 | 27% |
| Clang 14+20, PGO(IR) | 17.872 | 10% |
| Clang 14+20, LTO+PGO(Frontend) | 14.468 | 35% |
| Clang 14+20, LTO+PGO(IR) | 16.754 | 17% |
The results are essentially the same as in the cross PGO case above; LLVM_BUILD_INSTRUMENTED=Frontend gives pretty much the same level of speedup as before, while LLVM_BUILD_INSTRUMENTED=IR gives worse results.
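To put the three tables side by side, here is a small sketch that computes how many percentage points of speedup each configuration loses when the profile is reused. The percentages are copied from the tables in this post; the helper `points_lost` is just for illustration. The three benchmarks were separate runs with different baselines, so the differences are only indicative.

```python
# Speedup percentages copied from the three tables in this post.
matched = {"PGO(Frontend)": 30, "PGO(IR)": 36,
           "LTO+PGO(Frontend)": 40, "LTO+PGO(IR)": 49}
cross = {"PGO(Frontend)": 25, "PGO(IR)": 15,
         "LTO+PGO(Frontend)": 35, "LTO+PGO(IR)": 23}
mixed = {"PGO(Frontend)": 27, "PGO(IR)": 10,
         "LTO+PGO(Frontend)": 35, "LTO+PGO(IR)": 17}


def points_lost(reference, reused):
    """Percentage points of speedup lost per configuration with a reused profile."""
    return {name: reference[name] - reused[name] for name in reference}


print("cross compiled:", points_lost(matched, cross))
print("mixed versions:", points_lost(matched, mixed))
```

Frontend loses 3 to 5 points in both scenarios, while IR loses 21 to 32 points.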
Summary
All in all, when the build environment matches, LLVM_BUILD_INSTRUMENTED=Frontend gives around 6 to 9 percentage points less speedup than LLVM_BUILD_INSTRUMENTED=IR (30% vs. 36% without LTO, 40% vs. 49% with LTO). But it can give roughly the same level of speedup while reusing the profile when compiling for a different OS/architecture, and/or compiling with a different version of Clang, while LLVM_BUILD_INSTRUMENTED=IR gives worse results than Frontend if the compilation environments don’t match exactly.