How to solve the problem of stale Profile data when Bolt is used with pgo?

I’m currently trying to use PGO and BOLT to optimize our services, and they have an additive effect if I follow these steps: code → org_bin → perf → pgo1 → perf → pgo_bolt1. That is: based on sampling data from the original binary, perform PGO optimization; then sample pgo1 and perform BOLT optimization on it.
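For concreteness, the two-round pipeline above might be sketched as follows (the binary names, perf events, and the use of clang's sample-based PGO are my assumptions, not the exact commands used):

```shell
# Round 1: sample the original binary, then rebuild with PGO (pgo1).
perf record -e cycles:u -o perf.orig.data -- ./org_bin
# Convert the samples into a compiler-readable profile (e.g. AutoFDO)
# and rebuild; the flag below is clang's sample-profile option.
clang -O2 -fprofile-sample-use=prof.orig.afdo app.c -o pgo1

# Round 2: sample pgo1 (with LBR) and apply BOLT on top (pgo_bolt1).
perf record -e cycles:u -j any,u -o perf.pgo1.data -- ./pgo1
perf2bolt pgo1 -p perf.pgo1.data -o pgo1.fdata
llvm-bolt pgo1 -o pgo_bolt1 -data pgo1.fdata
```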
However, this brings some extra work to the deployment of services, so I tried to sample pgo_bolt1 and use that same perf data to perform both PGO and BOLT. The goal is that I only need to deploy pgo_bolt to perform iterative optimization.
The procedure is as follows:

  1. perf pgo_bolt1 → pgo2; the effect of pgo2 is close to that of pgo1.
  2. perf pgo_bolt1 and perform BOLT based on pgo2. At this point BOLT has no optimization effect, and it emits warning logs: 40% of functions have an invalid (possibly stale) profile.

I searched for the reason (it may not be correct; please correct me): since the perf data for pgo1 and pgo2 are different, the binaries they produce are inconsistent. This causes the instruction offsets output by perf2bolt to not match pgo2.
Is it possible to solve this problem?

Is there a bolt2source/source2bolt tool to reduce the impact of this instruction address change?

Couple of things to unpack here:

However, this brings some extra work to the deployment of services, so I tried to sample pgo_bolt1 and use that same perf data to perform both PGO and BOLT. The goal is that I only need to deploy pgo_bolt to perform iterative optimization.

There’s BAT mode, which allows sampling the BOLTed binary. BAT is enabled by the -enable-bat flag. You can then sample the BOLTed binary to collect a BOLT profile. You need to pass the BOLTed binary to perf2bolt, and it automatically detects BAT.

You can also sample the BOLTed binary to collect a PGO profile if you update the debug information used to match the profile back to the source. Use -update-debug-sections.
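A minimal sketch combining both suggestions (binary and file names are placeholders):

```shell
# BOLT the PGO binary while keeping both re-profiling paths open:
#   -enable-bat             embeds the BOLT Address Translation table so
#                           samples from the BOLTed binary can be mapped
#                           back by perf2bolt for the next BOLT round
#   -update-debug-sections  updates debug info so the same samples can
#                           also be matched back to source for PGO
llvm-bolt app.pgo -o app.pgo.bolt \
    -data app.fdata \
    -enable-bat \
    -update-debug-sections

# Sample the deployed BOLTed binary (LBR recommended):
perf record -e cycles:u -j any,u -o perf.data -- ./app.pgo.bolt

# perf2bolt detects BAT automatically when given the BOLTed binary:
perf2bolt app.pgo.bolt -p perf.data -o app.fdata.next
```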

This causes the instruction offsets output by perf2bolt to not match pgo2.

BOLT has a stale profile matching feature that can partially mitigate binary differences. Stale matching requires the use of a YAML profile (produced with perf2bolt binary -p perf.data -o fdata -w yaml) and is enabled by the -infer-stale-profile flag passed to BOLT at optimization time.

However, the mode to produce a YAML profile in BAT mode is very recent and experimental, so you can try today’s trunk version (at least containing commit 385e3e2, “[BOLT] Set EntryDiscriminator in YAML profile for indirect calls”) if you’re willing to dig deeper and help with debugging. Otherwise you can collect an fdata profile in BAT mode and convert it into YAML in two steps:

  • perf2bolt binary.bolt.bat -p perf.data -o fdata
  • llvm-bolt binary.orig -data fdata -w yaml -o /dev/null
    and then use the YAML profile with the new binary, adding -infer-stale-profile.
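Putting the pieces together, the single-sample iteration could look like this sketch (file names are placeholders; the YAML conversion uses the two-step variant described above):

```shell
# One sampling run of the deployed PGO+BOLT binary feeds both steps.
perf record -e cycles:u -j any,u -o perf.data -- ./app.pgo.bolt

# Step 1: fdata profile via BAT (perf2bolt detects BAT automatically).
perf2bolt app.pgo.bolt -p perf.data -o prof.fdata

# Step 2: convert the fdata to YAML against the pre-BOLT binary.
llvm-bolt app.pgo -data prof.fdata -w prof.yaml -o /dev/null

# Step 3: after rebuilding the next PGO binary (app.pgo2) from the same
# samples, BOLT it with stale matching to tolerate the binary drift.
llvm-bolt app.pgo2 -o app.pgo2.bolt \
    -data prof.yaml \
    -infer-stale-profile \
    -enable-bat
```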

Is there a bolt2source/source2bolt tool to reduce the impact of this instruction address change?

BOLT doesn’t map the profile using source information. The profile-matching accuracy requirements are much higher for BOLT to be effective, which makes the use of source information impractical. BOLT uses either an address/offset-based profile (fdata) or a binary basic-block-based profile (YAML).

Thank you for your answer. Here I will add some information.

  1. I am currently using BAT mode.
  2. -infer-stale-profile: this option sounds like the functionality I want, and I will try it.
  3. Regarding the bolt2source/source2bolt tool: it is a simple idea of mine that has not been fully investigated. My idea is: since the binaries of pgo2 and pgo1 are different, I hope to use a tool to convert the fdata so that its addresses match pgo2. For example:
  • perf2bolt binary.pgo.bolt.bat -p perf.data -o fdata
  • bolt2source -binary binary.pgo.bolt.bat perf -o fdata.source
  • source2bolt -binary pgo2 -p fdata.source -o fdata.convert
  • llvm-bolt pgo2 -data fdata.convert

But according to what you said, using source information is impractical.

Thanks. I tested it, and infer-stale-profile can eliminate the profile staleness caused by the two different PGO binaries; I still gain performance benefits. I want to briefly understand the principle behind infer-stale-profile. Is there any documentation on this?

Sure, there’s a paper published at Compiler Construction’24:

And slides used to present the work at the conference:
stale-matching.pdf (338.3 KB)

Regarding reference [9], “Profile inference revisited,” cited in the paper, I have a question.

The assumption is that vertex weights are a result of profiling an actual binary, while branch probabilities are likely coming from a predictive model, which is arguably less trustworthy.

The paper does not mention how the predictive model makes its predictions. Isn’t this probability exactly what needs to be inferred? If you already know the actual counts at some points and the probabilities on the edges, don’t you just need to allocate the required traffic to the matched points, and then allocate the remaining traffic according to probability?
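As a toy illustration of the question being asked (this is not BOLT’s actual inference algorithm; the function and numbers are made up): allocating a block’s measured count to its successors by predicted probability works only until the successors’ own measured counts disagree with the prediction, which is exactly the conflict profile inference has to reconcile:

```python
# Toy sketch: distribute a block's measured count to its successors
# according to predicted branch probabilities.
def allocate(block_count, branch_probs):
    """branch_probs: successor -> predicted probability (sums to 1)."""
    return {succ: block_count * p for succ, p in branch_probs.items()}

# Block A executed 1000 times; the model predicts a 70/30 split.
edges = allocate(1000, {"B": 0.7, "C": 0.3})
print(edges)  # {'B': 700.0, 'C': 300.0}

# The catch: successors B and C also carry *measured* counts (say 900
# and 100). The predicted split disagrees with the measurements, so a
# pure "allocate by probability" pass cannot satisfy both. Profile
# inference instead solves a flow problem that keeps the trusted vertex
# counts fixed and adjusts the less-trusted edge probabilities.
measured = {"B": 900, "C": 100}
mismatch = {s: measured[s] - edges[s] for s in measured}
print(mismatch)  # {'B': 200.0, 'C': -200.0}
```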