PGO for cross compilation

Background

I’ve looked into building my llvm-mingw toolchain distribution with PGO (Profile Guided Optimization), at llvm-mingw PR #503. Here I’m resharing the main takeaways regarding how much performance one can gain with PGO, and what potential tradeoffs one can make that give most of the gains at a reasonable cost.

Building a toolchain with PGO usually requires compiling all of LLVM/Clang several more times; compared with just shipping a toolchain built with a preexisting (e.g. Linux-provided) toolchain, it requires two or three times the compilation work. The llvm-mingw toolchain currently has 7 release packages for different OSes/architectures; I wouldn’t want to multiply the total build effort by 2-3x.

Usually, with PGO, one builds an instrumented compiler, runs it on some representative workload to gather a profile, and uses that profile to rebuild an optimized compiler. However, when cross compiling, one can’t run the newly built compiler.
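For reference, the usual single-machine two-stage flow looks roughly like this. This is a sketch: the paths and the training workload are placeholders, while LLVM_BUILD_INSTRUMENTED and LLVM_PROFDATA_FILE are the actual LLVM CMake options involved.

```shell
# Stage 1: build an instrumented clang (use IR instead of Frontend
# for IR-level instrumentation).
cmake -G Ninja -S llvm -B build-instrumented \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_BUILD_INSTRUMENTED=Frontend
ninja -C build-instrumented clang

# Training run: compile something representative; each invocation of
# the instrumented compiler writes out a .profraw file.
build-instrumented/bin/clang -O2 -c sqlite3.c -o /dev/null

# Merge the raw profiles into one .profdata file (the directory where
# the .profraw files end up may differ depending on configuration).
llvm-profdata merge -output=clang.profdata build-instrumented/profiles/*.profraw

# Stage 2: rebuild clang, optimized using the merged profile.
cmake -G Ninja -S llvm -B build-optimized \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_PROFDATA_FILE=$PWD/clang.profdata
ninja -C build-optimized clang
```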

One central design aspect of how the llvm-mingw toolchain is built is that it is cross compiled; all toolchains for running on Windows are built on Linux. And it is built for platforms that might not even be available for execution in the build environment.

Potential gains of PGO (and LTO)

First off, I’ve measured how much the performance of Clang can be improved by building it with PGO, in a normal setup on Linux; doing both the instrumented and optimized build using a preexisting, older distro-provided version of Clang. I evaluated the build configurations LLVM_BUILD_INSTRUMENTED=Frontend and LLVM_BUILD_INSTRUMENTED=IR.

I benchmarked this with a build setup on GitHub Actions; see the full results here.

As a benchmark, I built Clang 20.1.6 with the distro compilers provided by Ubuntu 24.04 (GCC 13 and Clang 18), and measured how long the result took to compile sqlite.

Configuration               time (s)   speedup
GCC                           20.541        0%
Clang                         20.213        1%
Clang, LTO                    18.535       10%
Clang, PGO(Frontend)          15.758       30%
Clang, PGO(IR)                15.040       36%
Clang, LTO+PGO(Frontend)      14.625       40%
Clang, LTO+PGO(IR)            13.753       49%
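The speedup percentages appear to be relative to the GCC baseline, computed as baseline/time - 1 and truncated to whole percent. A small sketch reproducing the column from the times in the table:

```python
# Compile times (seconds) from the table above; speedup is relative
# to the GCC baseline, truncated to a whole percentage.
times = {
    "GCC": 20.541,
    "Clang": 20.213,
    "Clang, LTO": 18.535,
    "Clang, PGO(Frontend)": 15.758,
    "Clang, PGO(IR)": 15.040,
    "Clang, LTO+PGO(Frontend)": 14.625,
    "Clang, LTO+PGO(IR)": 13.753,
}

baseline = times["GCC"]
for config, t in times.items():
    speedup = int((baseline / t - 1) * 100)
    print(f"{config}: {speedup}%")  # matches the speedup column above
```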

Here, we note that LLVM_BUILD_INSTRUMENTED=Frontend gives decent speedups, while LLVM_BUILD_INSTRUMENTED=IR gives even better performance.

PGO in cross compilation

When cross building an LLVM toolchain, one can’t execute the cross-built binaries. Can one reuse a preexisting profile from a build for one OS (or architecture) in a build for another OS or architecture? I’m not sure if this is really documented anywhere (although I didn’t look very far, TBH).

I built an instrumented Clang for Linux, profiled it, then used that profile for cross compiling Clang for Windows (with the same version of Clang as used for the instrumented build). See this commit for the test script and this actions run for the results.

Configuration          time (s)   speedup
Regular                  20.265        0%
LTO                      18.834        7%
PGO(Frontend)            16.140       25%
PGO(IR)                  17.494       15%
LTO+PGO(Frontend)        14.991       35%
LTO+PGO(IR)              16.347       23%

Here, we can see that LLVM_BUILD_INSTRUMENTED=Frontend gives almost the same level of speedup as it did in the regular case, while LLVM_BUILD_INSTRUMENTED=IR performs much worse. This is probably intuitive and to be expected, but it is still worth noting.

Thus, sharing/reusing a profile for cross compilation works quite well with LLVM_BUILD_INSTRUMENTED=Frontend. It doesn’t give the very best potential results, but it does give quite a decent speedup without needing to gather a new profile.
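Concretely, reusing the Linux-gathered profile in the cross build comes down to pointing the optimized cross build at the already merged profile. A sketch, where the toolchain file and profile paths are placeholders but LLVM_PROFDATA_FILE is the real CMake option:

```shell
# Cross compile clang for Windows, reusing a profile gathered by
# running an instrumented clang on Linux.
cmake -G Ninja -S llvm -B build-mingw \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=/path/to/mingw-toolchain.cmake \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_PROFDATA_FILE=$PWD/clang.profdata
ninja -C build-mingw clang
```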

PGO with mixed compiler versions

I also considered whether one could do the instrumentation/profiling build with one version of Clang, and use that profile when doing an optimized build with a different version of Clang. (E.g. profiling on Linux with a distro-provided version of Clang, but using that profile for cross compilation with a different version of Clang from llvm-mingw. Or using it for an optimized build with Apple Clang.) In the end, I didn’t go down this route, but the results are still worth noting.

For benchmarks of this scenario, see this commit for the benchmark and the results. Here I did single stage builds with the distro compilers from Ubuntu 22.04 (GCC 11 and Clang 14), and tested doing instrumentation/profiling with Clang 14 and optimized builds with Clang 20.

Configuration                     time (s)   speedup
GCC 11                              19.664        0%
Clang 14                            20.008       -1%
Clang 14, LTO                       18.547        6%
Clang 14+20, PGO(Frontend)          15.444       27%
Clang 14+20, PGO(IR)                17.872       10%
Clang 14+20, LTO+PGO(Frontend)      14.468       35%
Clang 14+20, LTO+PGO(IR)            16.754       17%

The results are essentially the same as in the cross PGO case above; LLVM_BUILD_INSTRUMENTED=Frontend gives pretty much the same level of speedup as before, while LLVM_BUILD_INSTRUMENTED=IR gives worse results.

Summary

All in all: LLVM_BUILD_INSTRUMENTED=Frontend gives around 6% less speedup than LLVM_BUILD_INSTRUMENTED=IR. But it gives roughly the same level of speedup even when the profile is reused for compiling for a different OS/architecture, and/or with a different version of Clang - while LLVM_BUILD_INSTRUMENTED=IR gives worse results than Frontend if the compilation environments don’t match exactly.


I’m guessing the performance regression in IRPGO is simply due to mismatched profiles. You can build with -stats to see how many functions have matching profiles. This should be the relevant statistic.

Yes, I would guess so. And even with Frontend instrumentation, I would expect some amount of mismatches around the edges, in anything that interacts directly with the system or arch-specific headers. But the majority of the LLVM/Clang internal code should match, at least.

Hmm, right. I wasn’t familiar with this area; after a bit of poking around, I see that by building with -mllvm -stats, I can get such numbers printed out. It requires having the compiler built with assertions (or LLVM_FORCE_ENABLE_STATS enabled in cmake). So I guess e.g. the ratio of (NumOfPGOMismatch+NumOfPGOMissing)/NumOfPGOFunc would be interesting to gather. To properly get a sense of things, one would need to aggregate those numbers from the whole build.
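A rough sketch of how such aggregation could look. The assumption here is that `-mllvm -stats` prints per-invocation lines of the form "value component - description" to stderr; the component and description strings in the example data are placeholders, not the real counter output, so they would need adjusting against actual logs.

```python
import re
from collections import Counter

# Match statistic lines of the assumed form:
#   "<value> <component> - <description>"
STAT_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+-\s+(.+)$")

def aggregate_stats(log_lines):
    """Sum each statistic across all compiler invocations in a build,
    keyed on (component, description)."""
    totals = Counter()
    for line in log_lines:
        m = STAT_LINE.match(line)
        if m:
            totals[(m.group(2), m.group(3))] += int(m.group(1))
    return totals

# Example with made-up log lines (component/description are placeholders):
logs = [
    "  100 pgo-instr-use - Number of functions with profile",
    "    7 pgo-instr-use - Number of mismatched functions",
    "   50 pgo-instr-use - Number of functions with profile",
    "    3 pgo-instr-use - Number of mismatched functions",
]
totals = aggregate_stats(logs)
mismatched = totals[("pgo-instr-use", "Number of mismatched functions")]
with_profile = totals[("pgo-instr-use", "Number of functions with profile")]
print(f"{mismatched}/{with_profile} mismatched")
```

From totals like these, one could then compute a build-wide ratio such as (NumOfPGOMismatch+NumOfPGOMissing)/NumOfPGOFunc once the exact counter descriptions are known.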