Hi!
I am researching the state of PGO across the industry (all current results can be found in my repo). During the investigation I ran into several PGO-related questions that I could not find answers to. I already asked in the LLVM Discord (#profiling channel) but didn't get a response, so it may be better to ask the questions here.
The first question is about how the PGO approaches differ in practice. According to the Clang documentation (Clang Compiler User's Manual, Clang 18.0.0git documentation), there are two ways to implement PGO: via `-fprofile-instr-generate` (frontend-level instrumentation) and via `-fprofile-generate` (IR-level instrumentation). Are there any comparisons between them from different perspectives? I mean instrumentation performance overhead, binary size overhead, PGO optimization opportunities, resistance of the profiles to changes in the program's source code, and maybe something else. Right now it's not clear which PGO approach is recommended with Clang. For example, `cargo-pgo` (a Cargo plugin that implements PGO for the Rust ecosystem) uses the IR-level approach: https://github.com/Kobzol/cargo-pgo/blob/main/src/pgo/instrument.rs#L64
I found a thread about this question: Status of IR vs. frontend PGO (fprofile-generate vs fprofile-instr-generate). However, that thread lacks real-life benchmarks. Can anyone give more insight into this question? Has anything changed in this area over the last several years?
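To make the comparison concrete, here is a minimal Python sketch that prints the four steps I run for each mode. The flags are the ones from the Clang manual; the file names (`app.c`, `app.profraw`, `app.profdata`) and the `-O2` default are placeholders I made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical file names, used only for illustration.
SOURCE = "app.c"
RAW = "app.profraw"
MERGED = "app.profdata"

# (instrumentation flag, profile-use flag prefix) for each mode.
MODES: Dict[str, Tuple[str, str]] = {
    "frontend": ("-fprofile-instr-generate", "-fprofile-instr-use="),
    "ir": ("-fprofile-generate", "-fprofile-use="),
}


def pgo_commands(mode: str, opt: str = "-O2") -> List[List[str]]:
    """Return the instrument / train / merge / optimize commands for a mode."""
    gen_flag, use_flag = MODES[mode]
    return [
        # 1. Instrumented build.
        ["clang", opt, gen_flag, SOURCE, "-o", "app-instrumented"],
        # 2. Training run; LLVM_PROFILE_FILE selects the raw profile path.
        ["env", f"LLVM_PROFILE_FILE={RAW}", "./app-instrumented"],
        # 3. Convert the raw profile into an indexed profile.
        ["llvm-profdata", "merge", "-o", MERGED, RAW],
        # 4. Optimized build that consumes the indexed profile.
        ["clang", opt, use_flag + MERGED, SOURCE, "-o", "app-optimized"],
    ]


if __name__ == "__main__":
    for mode in MODES:
        print(f"# {mode}-level instrumentation")
        for cmd in pgo_commands(mode):
            print(" ".join(cmd))
```

The only difference between the two command sequences is the flag pair, which is why I would like to see them compared on the same workload.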
The second question is about PGO profile compatibility between compiler versions. As far as I know, the profile version is stored in the profile header. But I cannot find any guarantees about forward or backward PGO profile compatibility between compiler versions (or even compiler commits). I have already run into issues like "`__llvm_profile_raw_version` doesn't prevent version mismatches hard enough · Issue #52683 · llvm/llvm-project · GitHub", where `llvm-profdata` fails to process profiles with different versions. Understanding such guarantees is important for us because we want to cache PGO profiles in our storage and try to reuse them even after a compiler upgrade. We do not want to regenerate all our PGO profiles every time the compiler is upgraded.
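For reference, this is roughly how I check the version of a cached raw profile. It is a sketch under my own assumptions about the `.profraw` layout (taken from my reading of `InstrProfData.inc`, not from any documented guarantee): the file starts with a 64-bit magic followed by a 64-bit version word, with the version number in the low bits and variant flags in the high bits. The function name is hypothetical.

```python
import struct
from typing import Tuple

# Assumed 64-bit raw profile magic: 0xff 'l' 'p' 'r' 'o' 'f' 'r' 0x81.
RAW_MAGIC_64 = (0xFF << 56) | (int.from_bytes(b"lprofr", "big") << 8) | 0x81


def read_raw_header(data: bytes) -> Tuple[int, int]:
    """Return (version, variant_flags) from the first 16 bytes of a raw profile.

    Assumption: the version number fits in the low 32 bits and the
    remaining high bits carry variant flags.
    """
    if len(data) < 16:
        raise ValueError("too short to contain a raw profile header")
    # Raw profiles are written in the byte order of the instrumented program,
    # so try both orders and accept the one where the magic matches.
    for order in ("<", ">"):
        magic, version_word = struct.unpack(order + "QQ", data[:16])
        if magic == RAW_MAGIC_64:
            return version_word & 0xFFFFFFFF, version_word >> 32
    raise ValueError("not a 64-bit raw profile")


if __name__ == "__main__":
    # Synthetic header: version 8 with one high flag bit set.
    header = struct.pack("<QQ", RAW_MAGIC_64, (1 << 56) | 8)
    print(read_raw_header(header))
```

A check like this only tells me which version produced a cached profile; it says nothing about whether a newer `llvm-profdata` will accept it, which is exactly the guarantee I am asking about.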
The third question is about reusing PGO profiles between compilers. Right now each major compiler has its own PGO format (`.gcda` for GCC-based compilers, `.profraw`/`.profdata` for LLVM-based ones, `.pgc`/`.pgd` files for MSVC; I don't know about other compilers). These profiles are incompatible with each other. However, my assumption is that it should be possible to convert PGO profiles from one format to another (probably with lower profile precision or some missing details) and, for example, reuse PGO profiles from GCC when optimizing the application with Clang. The use case is the following. We build our application with N compilers (Clang, GCC, MSVC) and want to perform the PGO optimization step for each of them. Right now we need to prepare an instrumented build with each compiler, run each instrumented version, collect the profiles for each compiler, and use them for the optimization step. It would be easier for us to manage only one profile format (honestly, we don't care which one, but since this is an LLVM forum, let's choose `.profraw`/`.profdata`) and reuse it for each compiler. I tried to find tooling for that, or at least some notes about whether such an idea is possible, but found nothing. If anyone can say more about this, I would be glad to hear it.
The fourth question is about the resistance of PGO profiles to changes in compiler options. How are PGO profiles influenced by using different compiler flags in the instrumentation phase (if we are talking about instrumentation PGO)? I mean, how is the `llvm-profdata overlap` metric affected by different optimization levels (like "O2 vs O3"), different inlining budgets, different LTO settings, and so on? The use case is the following. There is an idea of sharing PGO profiles between different operating systems and package managers, so that each distribution does not need to maintain its own PGO profile storage and can instead reuse a shared one (this idea was mentioned on "LLVM Now Using PGO For Building x86_64 Windows Release Binaries: ~22% Faster Builds - Phoronix Forums"). The problem is that every distribution can use slightly different compiler flags, like "-O2 vs -O3", different LTO approaches (no LTO, Thin LTO, Fat LTO), different `-march` defaults, etc. If PGO profiles for the same training workload differ a lot because of these flag differences, the idea cannot be implemented. Maybe someone has already researched this topic. If the different PGO approaches (frontend-level, IR-level, sampling) have different properties from this perspective, I would be happy to know the answer for each of them.
According to my tests, enabling or disabling LTO, or switching between "-O3" and "-O2", completely breaks PGO profile reuse: the `llvm-profdata overlap` metric immediately drops to 0%. If someone can explain which other options affect it, that would be great.
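To illustrate how a score of exactly 0% could come about, here is a simplified Python model of the overlap score as I understand it. It is my own assumption about the metric, not the `llvm-profdata` implementation: functions are matched by name plus a structural hash, and each matched counter contributes the minimum of its relative weights in the two profiles. All function names, hashes, and counts are made up, and I have not verified that flag changes actually alter the hashes.

```python
from typing import Dict, List, Tuple

# A profile is modelled as {(function name, structural hash): [counter values]}.
Profile = Dict[Tuple[str, int], List[int]]


def overlap(base: Profile, test: Profile) -> float:
    """Simplified overlap score in [0, 1] between two profiles."""
    base_sum = sum(sum(counts) for counts in base.values())
    test_sum = sum(sum(counts) for counts in test.values())
    if base_sum == 0 or test_sum == 0:
        return 0.0
    score = 0.0
    for key, base_counts in base.items():
        test_counts = test.get(key)
        # A function whose (name, hash) is missing from the other profile,
        # or whose counter layout differs, contributes nothing.
        if test_counts is None or len(test_counts) != len(base_counts):
            continue
        for b, t in zip(base_counts, test_counts):
            score += min(b / base_sum, t / test_sum)
    return score


if __name__ == "__main__":
    first = {("main", 0x1): [10, 90], ("parse", 0x2): [100, 0]}
    # Same workload and same counts, but every hash is different.
    rehashed = {("main", 0xA): [10, 90], ("parse", 0xB): [100, 0]}
    print(overlap(first, first))
    print(overlap(first, rehashed))
```

In this model, identical counts still give zero overlap as soon as every hash differs, which would match what I observe. I don't know whether that is the real cause.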
The fifth question is about PGO profile compatibility between different operating systems for the same program built with the same compiler everywhere. Let's imagine a fairly normal situation where we build our application with Clang on all target platforms (Linux-based, macOS, Windows, *BSD). Right now we perform an instrumented build for each platform, collect the corresponding PGO profiles, and then use the platform-specific profiles to perform PGO optimization for each platform. The idea is to reduce the number of instrumented builds and use PGO profiles from one platform for PGO optimization on all platforms. Our application workloads are the same across all platforms, we have only a little platform-specific code behind conditional compilation, and we have no runtime dispatching. So I would expect the profiles to be reusable across platforms in this case. However, maybe I am missing some platform-specific details of PGO profiles. The question was raised in the issue about enabling PGO for `pydantic-core` on macOS (build with PGO on macOS arm · Issue #732 · pydantic/pydantic-core · GitHub). Due to GitHub Actions limitations, it is difficult or impossible to prepare a PGO-enabled build for macOS. The idea is simple: try to reuse PGO profiles collected on Linux x86-64 for the macOS ARM build. Of course, platform-specific code (like special `IFDEF`s for OS or architecture) will not be PGO-optimized, but the vast majority of the remaining code could be, since it is common to both platforms. If anyone has already tried something similar or has implementation insights, it would be great to discuss it. Exactly the same question applies to different target architectures. My assumption is that PGO profiles from x86-64 should be usable for other targets like ARM. If I am wrong, please tell me.
According to my tests (and the tests of the Rust dev team), profiles from different operating systems are not compatible at all. I tested it locally on different applications: the `llvm-profdata overlap` comparison between Linux and macOS profiles shows 0% overlap.
Thank you in advance for your answers!