How to solve the problem of stale Profile data when Bolt is used with pgo?

I’m currently trying to use PGO and BOLT to optimize our services, and they have an additive effect if I follow these steps: code → org_bin → perf → pgo1 → perf → pgo_bolt1. That is: based on sampling data from the original binary, perform PGO optimization; then sample pgo1 and perform BOLT optimization on it.
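For concreteness, the two-round pipeline above might be sketched as follows (the binary names, perf events, and the use of clang's sample-based PGO are my assumptions, not the exact commands used):

```shell
# Round 1: sample the original binary, then rebuild with PGO (pgo1).
perf record -e cycles:u -o perf.orig.data -- ./org_bin
# Convert the samples into a compiler-readable profile (e.g. AutoFDO)
# and rebuild; the flag below is clang's sample-profile option.
clang -O2 -fprofile-sample-use=prof.orig.afdo app.c -o pgo1

# Round 2: sample pgo1 (with LBR) and apply BOLT on top (pgo_bolt1).
perf record -e cycles:u -j any,u -o perf.pgo1.data -- ./pgo1
perf2bolt pgo1 -p perf.pgo1.data -o pgo1.fdata
llvm-bolt pgo1 -o pgo_bolt1 -data pgo1.fdata
```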
However, this brings some extra work to the deployment of services, so I tried to sample pgo_bolt1 and use that same perf data to perform both PGO and BOLT. The goal is that I only need to deploy pgo_bolt to perform iterative optimization.
The procedure is as follows:

  1. perf pgo_bolt1 → pgo2; the effect of pgo2 is close to that of pgo1.
  2. perf pgo_bolt1 and perform BOLT based on pgo2. At this point BOLT has no optimization effect, and it emits warning logs: 40% of functions have an invalid (possibly stale) profile.

I searched for the reason (it may not be correct; please correct me): since the perf data for pgo1 and pgo2 are different, the binaries they produce are inconsistent. This causes the instruction offsets output by perf2bolt to not match pgo2.
Is it possible to solve this problem?

Is there a bolt2source/source2bolt tool to reduce the impact of this instruction address change?

Couple of things to unpack here:

However, this brings some extra work to the deployment of services, so I tried to sample pgo_bolt1 and use that same perf data to perform both PGO and BOLT. The goal is that I only need to deploy pgo_bolt to perform iterative optimization.

There’s BAT mode, which allows sampling the BOLTed binary. BAT is enabled by the -enable-bat flag. You can then sample the BOLTed binary to collect a BOLT profile. You need to pass the BOLTed binary to perf2bolt, and it automatically detects BAT.

You can also sample the BOLTed binary to collect a PGO profile if you update the debug information used to match the profile back to the source. Use -update-debug-sections.
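A minimal sketch combining both suggestions (binary and file names are placeholders):

```shell
# BOLT the PGO binary while keeping both re-profiling paths open:
#   -enable-bat             embeds the BOLT Address Translation table so
#                           samples from the BOLTed binary can be mapped
#                           back by perf2bolt for the next BOLT round
#   -update-debug-sections  updates debug info so the same samples can
#                           also be matched back to source for PGO
llvm-bolt app.pgo -o app.pgo.bolt \
    -data app.fdata \
    -enable-bat \
    -update-debug-sections

# Sample the deployed BOLTed binary (LBR recommended):
perf record -e cycles:u -j any,u -o perf.data -- ./app.pgo.bolt

# perf2bolt detects BAT automatically when given the BOLTed binary:
perf2bolt app.pgo.bolt -p perf.data -o app.fdata.next
```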

This causes the instruction offsets output by perf2bolt to not match pgo2.

BOLT has a stale profile matching feature that can partially mitigate binary differences. Stale matching requires the use of a YAML profile (produced with perf2bolt binary -p perf.data -o fdata -w yaml) and is enabled by the -infer-stale-profile flag passed to BOLT at optimization time.

However, the mode to produce a YAML profile in BAT mode is very recent and experimental, so you can try today’s trunk version (at least containing commit 385e3e2, “[BOLT] Set EntryDiscriminator in YAML profile for indirect calls”) if you’re willing to dig deeper and help with debugging. Otherwise you can collect an fdata profile in BAT mode and convert it into YAML in two steps:

  • perf2bolt binary.bolt.bat -p perf.data -o fdata
  • llvm-bolt binary.orig -data fdata -w yaml -o /dev/null
    and then use the YAML profile with the new binary, adding -infer-stale-profile.
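Putting the pieces together, the single-sample iteration could look like this sketch (file names are placeholders; the YAML conversion uses the two-step variant described above):

```shell
# One sampling run of the deployed PGO+BOLT binary feeds both steps.
perf record -e cycles:u -j any,u -o perf.data -- ./app.pgo.bolt

# Step 1: fdata profile via BAT (perf2bolt detects BAT automatically).
perf2bolt app.pgo.bolt -p perf.data -o prof.fdata

# Step 2: convert the fdata to YAML against the pre-BOLT binary.
llvm-bolt app.pgo -data prof.fdata -w prof.yaml -o /dev/null

# Step 3: after rebuilding the next PGO binary (app.pgo2) from the same
# samples, BOLT it with stale matching to tolerate the binary drift.
llvm-bolt app.pgo2 -o app.pgo2.bolt \
    -data prof.yaml \
    -infer-stale-profile \
    -enable-bat
```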

Is there a bolt2source/source2bolt tool to reduce the impact of this instruction address change?

BOLT doesn’t map the profile using source information. The profile-matching accuracy requirements are much higher for BOLT to be effective, which makes the use of source information impractical. BOLT uses either an address/offset-based profile (fdata) or a binary basic-block-based profile (YAML).

Thank you for your answer. Here I will add some information.

  1. I am currently using BAT mode.
  2. -infer-stale-profile: this option sounds like the functionality I want, and I will try it.
  3. Regarding the bolt2source/source2bolt tool: it is a simple idea of mine that has not been fully investigated. My idea is: since the binaries of pgo2 and pgo1 are different, I hope to use a tool to convert the fdata so that its addresses match pgo2. For example:
  • perf2bolt binary.pgo.bolt.bat -p perf.data -o fdata
  • bolt2source -binary binary.pgo.bolt.bat perf -o fdata.source
  • source2bolt -binary pgo2 -p fdata.source -o fdata.convert
  • llvm-bolt pgo2 -data fdata.convert

But according to what you said, using source information is impractical.

Thanks. I tested it, and infer-stale-profile can eliminate the profile staleness caused by the two different PGO binaries; I still gain performance benefits. I want to briefly understand the principle behind infer-stale-profile. Is there any documentation on this?

Sure, there’s a paper published at Compiler Construction’24:

And slides used to present the work at the conference:
stale-matching.pdf (338.3 KB)

Regarding reference [9], “Profile inference revisited,” cited in the paper, I have a question.

The assumption is that vertex weights are a result of profiling an actual binary, while branch probabilities are likely coming from a predictive model, which is arguably less trustworthy.

The paper does not mention how the predictive model makes its predictions. Isn’t this probability exactly what needs to be inferred? If you already know the actual counts at some points and the probabilities on the edges, don’t you just need to allocate the required traffic to the matched points, and then allocate the remaining traffic according to probability?
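As a toy illustration of the question being asked (this is not BOLT’s actual inference algorithm; the function and numbers are made up): allocating a block’s measured count to its successors by predicted probability works only until the successors’ own measured counts disagree with the prediction, which is exactly the conflict profile inference has to reconcile:

```python
# Toy sketch: distribute a block's measured count to its successors
# according to predicted branch probabilities.
def allocate(block_count, branch_probs):
    """branch_probs: successor -> predicted probability (sums to 1)."""
    return {succ: block_count * p for succ, p in branch_probs.items()}

# Block A executed 1000 times; the model predicts a 70/30 split.
edges = allocate(1000, {"B": 0.7, "C": 0.3})
print(edges)  # {'B': 700.0, 'C': 300.0}

# The catch: successors B and C also carry *measured* counts (say 900
# and 100). The predicted split disagrees with the measurements, so a
# pure "allocate by probability" pass cannot satisfy both. Profile
# inference instead solves a flow problem that keeps the trusted vertex
# counts fixed and adjusts the less-trusted edge probabilities.
measured = {"B": 900, "C": 100}
mismatch = {s: measured[s] - edges[s] for s in measured}
print(mismatch)  # {'B': 200.0, 'C': -200.0}
```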