Profile-Guided Optimization (PGO) related questions and suggestions

Hi!

I am researching the state of PGO across the industry (all current results can be found in my repo). During the investigation I ran into multiple PGO-related questions for which I could not find answers. I already asked in the LLVM Discord (#profiling channel) but didn't get a response, so maybe it is better to ask the questions here.

The first question is about the differences between the PGO approaches in practice. According to the Clang documentation (Clang Compiler User's Manual — Clang 18.0.0git documentation), there are two ways to do PGO: via -fprofile-instr-generate (frontend-level instrumentation) and via -fprofile-generate (IR-level instrumentation). Are there any available comparisons between them from different perspectives: instrumentation performance overhead, binary size overhead, optimization opportunities, resistance of PGO profiles to changes in the program's source code, maybe something else? Right now it's not clear which PGO approach is recommended with Clang. E.g., cargo-pgo (a Cargo plugin that implements PGO for the Rust ecosystem) uses the IR-level approach: https://github.com/Kobzol/cargo-pgo/blob/main/src/pgo/instrument.rs#L64
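For reference, here is a minimal sketch of the two workflows as I understand them from the documentation (the toy program, file names, and profile paths below are made up for illustration):

```c
/* Hypothetical training program, prog.c.
 *
 * Front-end instrumentation (sketch):
 *   clang -O2 -fprofile-instr-generate prog.c -o prog_fe
 *   ./prog_fe                                    # writes default.profraw
 *   llvm-profdata merge -o fe.profdata default.profraw
 *   clang -O2 -fprofile-instr-use=fe.profdata prog.c -o prog_fe_opt
 *
 * IR-level instrumentation (sketch):
 *   clang -O2 -fprofile-generate prog.c -o prog_ir
 *   ./prog_ir                                    # writes default_*.profraw
 *   llvm-profdata merge -o ir.profdata default_*.profraw
 *   clang -O2 -fprofile-use=ir.profdata prog.c -o prog_ir_opt
 */
#include <stdio.h>

int main(void) {
    /* A branchy loop so the training run produces non-trivial counters. */
    unsigned long sum = 0;
    for (int i = 0; i < 1000000; ++i) {
        if (i % 7 == 0)
            sum += 3; /* cold arm */
        else
            sum += 1; /* hot arm */
    }
    printf("%lu\n", sum);
    return 0;
}
```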

I found a thread about this question: Status of IR vs. frontend PGO (fprofile-generate vs fprofile-instr-generate). However, that thread lacks real-life benchmarks. Can anyone give more insight into the question? Has anything changed in this area over the last several years?

The second question is about PGO profile compatibility between compiler versions. As far as I know, the profile version is stored in the profile header, but I cannot find any guarantees about forward/backward profile compatibility between compiler versions (or even compiler commits). I have already hit issues like __llvm_profile_raw_version doesn't prevent version mismatches hard enough · Issue #52683 · llvm/llvm-project · GitHub, where llvm-profdata fails to process profiles with different versions. Understanding these guarantees is important for us since we want to cache PGO profiles in our storage and try to reuse them even after a compiler upgrade; we do not want to regenerate all our PGO profiles every time the compiler is upgraded.

The third question is about reusing PGO profiles across compilers. Right now each major compiler has its own profile format (.gcda for GCC-based compilers, .profraw/.profdata for LLVM-based ones, .pgc/.pgd for MSVC; I don't know about other compilers), and these formats are incompatible with each other. However, my assumption is that it should be possible to convert PGO profiles from one format to another (probably with worse precision or some missing details) and, for example, reuse PGO profiles collected with GCC to optimize the application with Clang. The use case is the following: we build our application with N compilers (Clang, GCC, MSVC) and we want to perform the PGO optimization step for each of them. Right now we need to prepare an instrumented build with each compiler, run each instrumented version, collect the profiles for each compiler, and use them in the optimization step. It would be easier for us to manage only one profile format (honestly, we don't care which one exactly, but since we are asking in an LLVM venue, let's choose .profraw/.profdata) and reuse it with each compiler. I tried to find tooling for that, or at least some notes on whether the idea is feasible at all, but found nothing. If anyone can say more about this question, I'd be glad to hear it.

The fourth question is about the resistance of PGO profiles to changes in compiler options. How are PGO profiles influenced by different compiler flags during the instrumentation phase (for instrumentation-based PGO)? Specifically, how is the llvm-profdata overlap metric affected by different optimization flags at the instrumentation stage: optimization levels ("-O2 vs. -O3"), different inlining budgets, different LTO settings, etc.? The use case is the following. There is an idea of sharing PGO profiles between different operating systems/package managers, so that each distribution would not need to prepare its own PGO profile storage and could reuse a shared one instead (this idea was mentioned on LLVM Now Using PGO For Building x86_64 Windows Release Binaries: ~22% Faster Builds - Phoronix Forums). The problem is that every distribution can use slightly different compiler flags: "-O2 vs. -O3", different LTO approaches (no LTO, Thin LTO, Fat LTO), different -march defaults, etc. If PGO profiles for the same training workload differ a lot because of these slightly different compiler flags, the idea cannot be implemented. Maybe someone has already done research on this topic. If different PGO approaches (frontend-level, IR-level, sampling) have different properties from this perspective, I would be happy to know the answer for each of them.

According to my tests, enabling/disabling LTO or switching from "-O3" to "-O2" (and vice versa) completely breaks PGO profile reuse - the llvm-profdata overlap metric instantly drops to 0%. If someone can explain which other options affect this, that would be great.
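For anyone who wants to reproduce such a comparison, one way to run it looks roughly like this (a sketch reusing the hypothetical prog.c from the first question; directory and file names are made up):

```c
/* Build and train the same program twice with different flags, then compare
 * the resulting indexed profiles:
 *
 *   clang -O2       -fprofile-generate=o2dir prog.c -o prog_o2  && ./prog_o2
 *   clang -O3 -flto -fprofile-generate=o3dir prog.c -o prog_o3  && ./prog_o3
 *   llvm-profdata merge -o o2.profdata o2dir/
 *   llvm-profdata merge -o o3.profdata o3dir/
 *   llvm-profdata overlap o2.profdata o3.profdata   # prints the overlap report
 */
```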

The fifth question is about PGO profile compatibility between operating systems for the same program built with the same compiler on all of them. Let's imagine a fairly normal situation where we build our application with Clang on all target platforms (Linux-based, macOS, Windows, *BSD). Right now we do an instrumented build for each platform, collect the corresponding PGO profiles, and then use the platform-specific profiles to perform the PGO optimization for each platform. The idea is to reduce the number of instrumented builds and use the PGO profiles from one platform to perform PGO optimization on all platforms. Our application workloads are the same across all platforms, we have only a little platform-specific code behind conditional compilation, and we have no runtime dispatching, so my expectation is that profiles should be reusable across platforms in this case. However, maybe I am missing some platform-specific details of PGO profiles. The question was raised in the issue about enabling PGO for pydantic-core on macOS (build with PGO on macOS arm · Issue #732 · pydantic/pydantic-core · GitHub). Due to GitHub Actions limitations, it's difficult or impossible to prepare a PGO-enabled build for macOS. The idea is simple: reuse the PGO profiles collected on Linux x86-64 for the macOS ARM build. Of course, platform-specific code (e.g. behind OS/architecture ifdefs) will not be PGO-optimized, but the vast majority of the code could be, since it is common to both platforms. If anyone has already tried the same thing or has implementation insights, it would be great to discuss them. Exactly the same question applies to different target architectures: my assumption is that PGO profiles from x86-64 should be usable for other targets like ARM. If I am wrong, please tell me.

According to my tests (and the tests of the Rust dev team), profiles from different operating systems are not compatible at all. I tested it locally on several applications: an llvm-profdata overlap comparison between Linux and macOS profiles reports 0% overlap.

Thank you in advance for your answers!

First question:

Answer: IR PGO uses a minimum spanning tree (MST) algorithm, helped by static branch prediction, to pick instrumentation points and minimize instrumentation overhead. It also runs an early cleanup pipeline, including an early inliner, which further reduces overhead. The early inliner also enables a more precise context-dependent profile, so the runtime performance of the profile-use build is better as well. You can easily benchmark this yourself. In summary, for performance, IR PGO should be used; front-end instrumentation is better suited for coverage testing. IR PGO and frontend PGO have similar resistance to source changes, but IR PGO is also sensitive to compiler pipeline changes – e.g., a CFG change at an instrumentation point can lead to mismatches.

Second question:
Answer: instrumentation PGO has two profile formats: raw and indexed. For the raw profile format there is no backward compatibility, but for the indexed format it is guaranteed that an older version of a profile can be consumed by a newer profile reader.

Third question:
Answer: the goal is attractive, but unrealistic to implement. Assuming IR PGO, the control flow produced by different compilers can be different and the way instrumentation points are selected can be very different, unless all compilers also use debug information for profile-matching purposes.

Fourth question:
IR PGO can be sensitive to optimization-related options, especially options that change the pass pipeline or the inlining thresholds.

Fifth question:
For different platforms, platform-specific parameters can make this hard – e.g., Arm and x86 model call overhead differently, which affects inlining decisions.

If there is not much target-specific code, you may consider using sample-based (PMU) PGO, a.k.a. AutoFDO. An LBR-based profile collected on an Intel platform can be used to optimize for Arm.

David


Thanks a lot for the answers!

Regarding FE PGO vs. IR PGO: can we add a recommendation to use IR PGO by default to the Clang PGO guide (Clang Compiler User's Manual — Clang 18.0.0git documentation)? Having this information in the guide would be helpful for users. By the way, if you have existing FE PGO vs. IR PGO benchmarks, could you please share them? Of course I can run benchmarks on my own, but having as many numbers as possible is always a good idea.

For the raw profile format there is no backward compatibility, but for the indexed format it is guaranteed that an older version of a profile can be consumed by a newer profile reader.

Is it documented somewhere in the user documentation? If not, could we add this information somewhere?

Yes, we can add documentation on the compatibility guarantees and the IR PGO recommendation (it is common practice these days). Regarding benchmarking, it really depends on the workload. For large server workloads the performance difference can be large (probably also because the front-end-instrumented binary is so slow that it hits error paths/timeouts and loses profile coverage).

I just found an interesting detail about the instrumentation-based PGO in LLVM: Clang Compiler User’s Manual — Clang 18.0.0git documentation

According to the documentation, a non-thread-safe counter update mode is used by default. This was introduced in ⚙ D87737 Add -fprofile-update={atomic,prefer-atomic,single}. According to the discussion, this choice was made to reduce performance overhead at the cost of possible profile miscounts. According to ⚙ D34085 [PGO] Register promote profile counter updates, this could be a problem (however, I cannot find estimates of how critical it is in practice with the current instrumentation PGO implementation in LLVM).
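To make the trade-off concrete, here is a small standalone illustration (my own toy example, not LLVM's instrumentation code) of how plain, non-atomic increments lose updates under contention - the same effect that makes non-atomic profile counters imprecise in hot multithreaded code:

```c
/* Build: clang -O2 -pthread counters.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define THREADS 8
#define ITERS   1000000

static unsigned long plain_counter = 0;          /* like the default, non-atomic updates */
static _Atomic unsigned long atomic_counter = 0; /* like -fprofile-update=atomic */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; ++i) {
        plain_counter += 1; /* racy read-modify-write: updates can be lost */
        atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[THREADS];
    for (int i = 0; i < THREADS; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; ++i)
        pthread_join(threads[i], NULL);

    printf("expected: %lu\n", (unsigned long)THREADS * ITERS);
    printf("plain:    %lu\n", plain_counter);                 /* typically lower */
    printf("atomic:   %lu\n", (unsigned long)atomic_counter); /* exact, but slower */
    return 0;
}
```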

GCC by default uses a thread-safe approach (Instrumentation Options (Using the GNU Compiler Collection (GCC))) but has some performance problems with it: 89307 – -fprofile-generate binary may be too slow in multithreaded environment due to cache-line conflicts on counters. I found more information in some fairly old mailing-list discussions - https://lists.llvm.org/pipermail/llvm-dev/2014-April/072172.html (but since they are quite old, things may have changed over the years).

As far as I understand, this problem can be quite important if we are talking about reproducible PGO profiles (e.g., for some ecosystems like NixOS, reproducibility is one of the biggest concerns).

Is it really a problem in practice for instrumentation PGO? Are both FE PGO and IR PGO affected by it? If yes, are they affected equally from the profile-quality perspective? Are there ways to mitigate it other than switching to AutoFDO?

A question regarding PGO profile reproducibility.

GCC has a special compiler flag for making PGO profiles reproducible (see Instrumentation Options (Using the GNU Compiler Collection (GCC))), but I cannot find anything similar in Clang/LLVM. Are LLVM-based PGO profiles reproducible by default in the same way that GCC declares in the -fprofile-reproducible description above? What is the default LLVM behavior?

The level of profile quality loss depends on how heavily the counter updates are contended by different threads. For instance, if a counter update sits in a hot loop executed simultaneously by many threads, you will see not only a large slowdown (due to cache ping-pong) but also lost profile counts.

Atomic updates are one way to solve it, but they slow down the instrumented binary even more. Currently, with counter promotion, it is not a big problem.

For a lot of workloads (e.g., servers), expecting exact profile reproducibility even with atomic updates is not realistic. What matters is a high level of profile overlap (a small diff).


Could you please explain a bit more about counter promotion? What is it? Is it enabled by default in Clang/LLVM? I see mentions of such a feature only in LLVM (lib/Transforms/Instrumentation/InstrProfiling.cpp Source File) but cannot quickly trace it to the Clang side. How does it address the problem with atomic profile counter updates?

For a lot of workloads (e.g., servers), expecting exact profile reproducibility even with atomic updates is not realistic.

If our application has no time-based or randomness-based logic, I think achieving PGO profile reproducibility is a doable goal, provided we have a predefined deterministic training workload (and do the PGO training in an isolated environment). What else can introduce profile drift between PGO training runs in this case?

Counter promotion basically promotes the static counters in a loop into register counters, and the profile data is synced back to the original counter copies at the loop exits.
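In source-level terms, that transformation can be pictured roughly like this (a hand-written sketch for illustration, not actual compiler output; the counter variable is hypothetical):

```c
#include <stdio.h>

/* Hypothetical profile counter slot for the loop body below. */
static unsigned long pgo_counter_loop_body = 0;

/* Without promotion: the shared counter is updated on every iteration. */
static unsigned long work_unpromoted(int n) {
    unsigned long sum = 0;
    for (int i = 0; i < n; ++i) {
        pgo_counter_loop_body += 1; /* a memory update on each iteration */
        sum += i;
    }
    return sum;
}

/* With promotion: the count is kept in a local (register) value and written
 * back to the original counter once at the loop exit. */
static unsigned long work_promoted(int n) {
    unsigned long sum = 0;
    unsigned long local_count = 0;
    for (int i = 0; i < n; ++i) {
        local_count += 1; /* stays in a register */
        sum += i;
    }
    pgo_counter_loop_body += local_count; /* one update at the loop exit */
    return sum;
}

int main(void) {
    printf("%lu %lu\n", work_unpromoted(1000) + work_promoted(1000),
           pgo_counter_loop_body);
    return 0;
}
```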

Yes, the case you described can/should achieve profile reproducibility when the atomic-update option is used – though that is not a stated goal of the option.


Is it currently possible to enable functionality like -fprofile-update=<method> for other LLVM-based compilers such as Rustc? There is an idea to enable it for the Rustc compiler too (at least temporarily) and test its effects.

I am looking at the commit Add -fprofile-update={atomic,prefer-atomic,single} · llvm/llvm-project@3681be8 · GitHub and honestly cannot tell whether it is a Clang-frontend-specific feature or whether it can somehow be enabled for other LLVM-based compilers.

The middle-end option is language-independent, so you can use -mllvm -instrprof-atomic-counter-update-all=true to turn it on. The -fprofile-update= option is available in Clang; I am not sure about rustc's support in its compiler driver.

A question regarding reuse of .profdata files between compiler versions.

As you said above, one of the biggest influences on profile reuse is inlining decisions. Let's imagine we collected a .profdata PGO profile for our application with Clang 15 and saved it somewhere. Later Clang 16 is released and we want to use the saved profile to optimize our program with Clang 16 (so we skip collecting a new PGO profile with Clang 16).

Is there a chance that the PGO profile will not be helpful for optimization purposes in this case? Say a new optimization pass added in Clang 16 somehow influences the inlining decisions: inlining changes, and our profile from Clang 15 is no longer useful (although it is still compatible with Clang 16, as you said above). Is this scenario realistic? If yes, what are the ways to mitigate it other than regenerating PGO profiles with each compiler update?

The answer really depends on the changes between compiler releases. The profile mismatch can range from none to a lot (the compiler will emit function-level mismatch warnings). In general, I would expect the old profile's effectiveness to gradually degrade over future compiler releases. The mitigation depends on the nature of the change. For instance, if the difference is in default parameter settings (e.g., for inlining), using the old settings may solve the problem.

Yeah, I thought the same thing. The problem here is that the compiler defaults are unknown to the usual compiler user; release notes usually describe new language features. And I don't think it's possible to estimate the impact except by counting the "missing profile for a function" warnings and comparing that number to the count from the previous compiler version (here I assume that with the previous compiler version we didn't have 100% PGO profile coverage either - the usual situation in real life).

@davidxl is it possible to enable Context-Sensitive IR PGO (CSIR PGO) by passing options to the LLVM part, without modifying a frontend? I want to run tests for the Rustc compiler (GitHub issue for that - Add Context-Sensitive IR PGO (CSIR PGO) · Issue #118562 · rust-lang/rust · GitHub). It would be great if you could suggest how we can test CSIR PGO with Rustc in this case. If it shows improvements for Rustc as well, I think we can consider enabling it for the Rustc compiler too.

By the way, in ⚙ D54175 [PGO] context sensitive PGO there is a note "We performance-tested this patch with a few large Google benchmarks and saw good performance improvements.", but without actual numbers. Could you please share the actual performance results (if that's possible without violating an NDA, of course)? With some numbers it's much easier to consider enabling CSIR PGO as part of a PGO infrastructure.

Also, I want to clarify the current status of the following PGO enhancements: Flow-Sensitive AutoFDO (FS-AFDO, [llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)) and Context-Sensitive Sampling PGO (CSS PGO, https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s/m/iJjcmUS7AwAJ). I read about these kinds of PGO in The many faces of LLVM PGO and FDO. However, their current status in LLVM is not clear to me. Are they already integrated into LLVM? Is it possible to use them? Are they well battle-tested in Google/Meta environments?

If any of these approaches is good enough to use in real production and is well maintained in LLVM, do we need to update the PGO documentation in Clang?

There is currently no way to do that. The feature relies on the frontend to pass the option so that the pipeline builder can set up the passes properly.

The improvement varies depending on the application. It is common to see a 1 to 3% improvement.

Both FS-AFDO and CSSPGO are based on PMU (sample-based) profiles. They are both integrated in LLVM. I need to update the documentation to reflect that.

Question about CSIR PGO (Clang Compiler User’s Manual — Clang 18.0.0git documentation).

In the current Clang documentation, CSIR PGO is used only as an additional step after IR PGO. So it means a three-step compilation model (IR instrumentation, CSIR instrumentation, optimized build), as sketched below. Is it possible to use CSIR PGO without IR PGO?
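As I read the documentation, that three-step flow looks roughly like this (a sketch reusing the hypothetical prog.c from above; directory and profile names are made up):

```c
/* Step 1: IR instrumentation, training run, merge.
 *   clang -O2 -fprofile-generate=irdir prog.c -o prog_ir
 *   ./prog_ir
 *   llvm-profdata merge -o ir.profdata irdir/
 *
 * Step 2: rebuild with the IR profile applied plus CSIR instrumentation,
 * train again, and merge the CS profile together with the first profile.
 *   clang -O2 -fprofile-use=ir.profdata -fcs-profile-generate=csdir prog.c -o prog_cs
 *   ./prog_cs
 *   llvm-profdata merge -o combined.profdata csdir/ ir.profdata
 *
 * Step 3: final optimized build with the combined profile.
 *   clang -O2 -fprofile-use=combined.profdata prog.c -o prog_opt
 */
```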

From the documentation and the related patch (⚙ D54175 [PGO] context sensitive PGO) it's unclear why CSIR PGO should be used only as a second step after the usual IR PGO. Could you please elaborate a bit more? Right now someone could decide to skip IR PGO and use only CSIR PGO.

I think some clarifications are needed in the documentation.

In the Clang command-line reference (Clang command line argument reference — Clang 18.0.0git documentation) I found several PGO-related flags that are missing from the official Clang PGO documentation (Clang Compiler User's Manual — Clang 18.0.0git documentation): -fprofile-sample-accurate, -fauto-profile-accurate, -fno-profile-sample-accurate, and -fauto-profile.

Is there a difference between -fsample-use and -fauto-profile? If yes, what is the difference? If not, why do we have two flags for the same thing?

Is there a difference between -fprofile-sample-accurate and -fauto-profile-accurate? If yes, what is the difference? If not, why do we have two flags for the same thing? Also, according to their descriptions, these flags can be important in some PGO scenarios, so they should be documented in the official Clang PGO guide: what they do, in which cases they should be used in practice, etc.