RFC: HWPGO, i.e., adding new SPGO feedback types


We (Intel) are interested in upstreaming initial work toward supplementary sample-based feedback types.

We refer to SPGO with supplementary profiles from hardware performance counters as “Hardware-based PGO” (HWPGO). HWPGO feedback is opt-in, provides additional profiles on top of current SPGO, and re-uses existing formats/tooling whenever possible.

The usage model is currently a bit ad-hoc, with only minimal tooling. This is an area we are particularly interested in feedback on.

While the sections below focus on Linux/perf, HWPGO also works on Windows using Intel SEP as a profiler.

A talk on HWPGO was given at EuroLLVM 2024, with slides available here:

https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf

Proposal

In the short-term, we propose contributing:

  1. llvm-profgen support for generating arbitrary performance counter-based profiles, and,
  2. a new pass which annotates IR with !unpredictable metadata based on a profile of branch mispredicts.

These changes are opt-in and do not affect current SPGO users uninterested in HWPGO. They demonstrate the general technique of collecting a secondary PMU profile and using it toward optimization.

Longer-term we’d like to extend the concept to additional PMU metrics while keeping the feedback process manageable for users. We’d like to gauge interest in the general idea, and see if there are comments on how best to go about it.

Motivation

A sampling profiler typically leverages hardware counters to sample execution of instructions. The hardware can also simultaneously produce profiles of other events such as branch mispredicts and cache misses in addition to the usual retired branches and LBRs used for execution frequency feedback. Often a very large number of metrics are available.

Furthermore, some PMUs can sample with enough precision that the profiles can be used to compute high-level metrics such as branch- or cache-mispredict ratios for individual instructions.
This is not Intel-specific as other hardware certainly exposes similar events. However, we can’t speak to the precision of the sampling. (Feedback appreciated here.)

We believe there is value in providing a number of these supplementary profiles alongside the execution frequency profile.
These profiles can provide information not available via static analyses or even instrumentation.
For example, there may be utility in enabling data cache miss feedback, instruction cache miss feedback, SIMD utilization feedback, etc.
Also, target-specific feedback for backends, such as DSB events, frequency licensing, etc., is possible.

Branch Mispredict Feedback

One new feedback type we have explored is branch mispredict profiles. The intent is to identify genuinely unpredictable branch conditions, not target-specific issues.
A high branch mispredict ratio is a strong hint to the compiler that additional speculation in order to eliminate control flow may be profitable.
This kind of feedback is uniquely available from hardware counters; it cannot be obtained even via instrumentation.
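
To make this concrete, here is a small illustrative example of ours (not from the RFC). A branch whose taken/not-taken counts look balanced can still be perfectly predictable (e.g., strictly alternating), so execution frequency alone cannot prove unpredictability; only a mispredict profile distinguishes the genuinely random case:

/* Illustrative only: with random input this branch is ~50% taken and
 * mispredicts heavily, making if-conversion to CMOV attractive. With a
 * strictly alternating input it has the same 50/50 edge counts but is
 * perfectly predictable, so edge counts alone cannot tell the two apart. */
long sum_even(const int *v, long n) {
  long sum = 0;
  for (long i = 0; i < n; ++i) {
    if ((v[i] & 1) == 0) /* data-dependent condition */
      sum += v[i];
  }
  return sum;
}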

The compilation process (on Intel hardware) looks something like:

# First compilation:
clang -O2 -gline-tables-only -fdebug-info-for-profiling app.c -o app

# A single profiling run with additional events:
perf record -o app.perf.data -b -c 1000003 -e br_inst_retired.near_taken:uppp,br_misp_retired.all_branches:upp -- ./app

# Generate multiple source-level profiles from the single binary-level profile:
llvm-profgen --perfdata app.perf.data --binary app --output app.freq.prof --sample-period 1000003 --perf-event br_inst_retired.near_taken:uppp
llvm-profgen --perfdata app.perf.data --binary app --output app.misp.prof --sample-period 1000003 --perf-event br_misp_retired.all_branches:upp --leading-ip-only

# Finally, feedback to next compilation:
clang -O2 -fprofile-sample-use=app.freq.prof -mllvm -unpredictable-hints-file=app.misp.prof app.c -o app.2

Only a single perf run is needed, but we run llvm-profgen twice:

  • Once to generate a typical LBR-based execution frequency profile using only the br_inst_retired samples.
  • Again to form a new profile of mispredicted branches using only the br_misp_retired samples.

Three new llvm-profgen options are used to form profiles suitable for HWPGO:

  1. --perf-event: choose the event to form the profile from.
  2. --leading-ip-only: use only the leading (sampled) IP, rather than the LBR trace, to contribute to the profile. Only the instruction at the sampled IP is known to have mispredicted.
  3. --sample-period: provides the sampling period for this event so that the two profiles have comparable magnitudes. This is important for the compiler to be able to compute branch mispredict ratios (see the sketch below).
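
As a rough sketch of the arithmetic this enables (our illustration, not the actual UnpredictableProfileLoader code), scaling each profile's raw sample counts by its sampling period yields comparable event-count estimates, from which a per-branch mispredict ratio follows:

#include <stdint.h>

/* Sketch only: estimate a branch's mispredict ratio from samples of two
 * events collected with known sampling periods. */
double mispredict_ratio(uint64_t misp_samples, uint64_t misp_period,
                        uint64_t exec_samples, uint64_t exec_period) {
  /* samples * period approximates the total number of events observed */
  double mispredicts = (double)misp_samples * (double)misp_period;
  double executions  = (double)exec_samples * (double)exec_period;
  return executions > 0.0 ? mispredicts / executions : 0.0;
}

With both periods set to 1000003 as in the commands above, the periods cancel and the ratio reduces to misp_samples / exec_samples.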

These options are only required when using HWPGO; existing SPGO workflows do not require any of them and are completely unaffected.
Both output profiles use today's unchanged formats, and app.freq.prof is consumed exactly as it is today, with no changes to the SampleProfileLoader path.

When recompiling, the branch mispredict profile is used in combination with the execution frequency profile to compute mispredict ratios and add !unpredictable metadata.
This is implemented in an “UnpredictableProfileLoader” pass.
!unpredictable metadata is an existing concept in LLVM, today produced mainly by Clang's __builtin_unpredictable.
LLVM already uses this metadata in both IR and Machine IR to make optimization decisions, such as promoting conditional instructions over control flow.
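
For reference, the existing manual route to the same metadata is Clang's __builtin_unpredictable; HWPGO effectively automates this annotation from measured mispredicts. A minimal example:

/* Clang lowers this hint to !unpredictable metadata on the resulting
 * branch/select, encouraging if-conversion (e.g., CMOV) over a jump. */
int clamp_low(int x, int lo) {
  if (__builtin_unpredictable(x < lo))
    x = lo;
  return x;
}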

This functionality has been in our downstream OneAPI compiler (“icx”) targeting both Windows and Linux since 2024.0. An article describing an example is also available.

Performance Results

Branch mispredict feedback has a narrower impact than fundamental execution frequency feedback.

Typically, improvement comes from the identification of a genuinely unpredictable branch condition which neither static analysis nor instrumentation feedback can identify.

As a result, improvement appears in a smaller number of cases, with no effect in the others.

Branch mispredict feedback was developed in the OneAPI compiler to improve the performance of a very popular codec application on Windows. The application-level performance improvement from branch mispredict feedback was 8% in that case.

In another suite of benchmarks based on real-world applications and datasets, we observe improvements of 1% and 5% in 2 out of 16 tests, with no significant impact in the other 14.

Finally, in CoreMark-PRO we have observed 10% improvement from CMOV conversion of one particular branch with the OneAPI compiler. (See EuroLLVM slides above.)

These improvements are in addition to any improvements from DWARF/LBR-based SPGO.

As more optimizations are developed to take advantage of !unpredictable metadata, these results may improve. For example, see llvm-project 3d494bfc.

Usability and Managing Multiple Profiles

Longer term, we are exploring extending HWPGO beyond branch mispredict feedback, i.e., cache misses, SIMD utilization, etc.

The usage shown above becomes tedious and error-prone as more profile types are added. Each new metric requires another llvm-profgen run, maintenance of another profile, and a new option to name the profile file for recompilation.

To simplify usage, we propose establishing some kind of “profile bundle” concept. For example, this might look like a directory containing well-known profile types, which the compiler driver could use to identify all of the profile types.

It could also be useful to provide tooling to orchestrate the perf and/or llvm-profgen runs to simplify this for the user. For example, a tool which can identify any known events present in perf output and invoke llvm-profgen runs with appropriate options to create the “profile bundle.”

We haven’t worked on this tooling yet as the usage is still just about manageable for expert users, but any thoughts are greatly appreciated.


@WenleiHe


Thanks for the proposal. This is generally an interesting direction to explore. We had similar thoughts around using branch mispredicts to guide CMOV conversion, but we weren't sure that the extra benefit on top of today's PGO warrants the added complexity. So I really appreciate the effort to help get clearer answers in this space. :) Some questions below.

On results:

The application-level performance improvement from branch mispredict feedback was 8% in that case.
In another suite of benchmarks based on real-world applications and datasets, we observe improvements of 1% and 5% in 2 out of 16 tests, with no significant impact in the other 14.

Generally we want to make sure the optimization added has practical benefit. Benchmarking a prototype is a way to answer that question. In this case, the “practical” aspect comes down to 1) benchmarking being somewhat typical, 2) baseline implementation being reasonably good.

For 1), how big are these benchmarks? Are they closer to micro-kernels (extracted from real applications) or to actual real-world applications? For 2), is this all done on ICX, and with PGO on in the baseline?

Given that you have a downstream implementation finished E2E, would you be able to get results on, say, full SPEC 2017, comparing HWPGO+SPGO vs. SPGO, to give others a better idea of the practicality and generality of the optimization?

On design/implementation:

I'd suggest focusing on an E2E optimization solution, rather than exposing everything available. Currently there is only one optimization that leverages branch mispredicts and !unpredictable, so we should probably expose just that from llvm-profgen and the profile as well, while making sure the design is future-proof. The llvm-profgen changes, especially the switches, may need some tweaks, but we can probably work that out during patch review.

It'd be great if we could use cache-miss data to guide optimizations that have code locality/size implications (e.g., inlining). Wondering if you have thought about that?

Since you provide separate profiles (regardless of using a profile bundle or not), it would be good to make the separate profiles independent, without needing to calibrate/normalize by sampling rate. If -unpredictable-hints-file actually contains hints that something is unpredictable, it wouldn't need to be calibrated against the execution count profile.

In the case of continuous PGO, how do we make sure a CMOV stays a CMOV for unpredictable source branches after many iterations of PGO? Once we have turned a branch into a CMOV, we may lose its branch mispredict profile if we collect a profile again. Any thoughts on mitigation?

Regarding benchmarking, Meta open sourced DCPerf: An open source benchmark suite for hyperscale compute applications - Engineering at Meta.

Regarding losing profile data in continuous PGO, it is a common issue to be resolved and we need a shared mechanism for it (e.g., for vtable profiling). @mingmingl-llvm brought up the idea of recording the profile data of the previous IR construct (line+discriminator) in the binary so that the information can be ingested into the refreshed profile data.

+1 on what @WenleiHe said about benchmarking and baselines.

In particular, the profile-guided CMOV conversion pass is disabled by default. With it turned on, @apostolakis noted a 1% improvement on a clang bootstrap on top of instrumentation PGO+ThinLTO. Reduced, but still measurable, improvements were noted when using sample-based profiles. Does the addition of unpredictable branch data improve beyond this baseline?

Regarding the events used for sampling, wouldn’t br_inst_retired.conditional and br_misp_retired.conditional be more accurate? Also LBR has metadata bits which includes mispredict information, using this means we wouldn’t need an additional profile collection step. Any reason why this was not considered?

Regarding usability and managing profiles, we should consider extending the sample profile extbinary format to hold additional profile data instead of new profile files.

Overall, I’m excited about the prospect of new profile types and eager to see how we can improve beyond the state of the art.

@snehasish any reasons the select optimization is off by default btw?

Only a single perf run is needed, but we run llvm-profgen twice:

  • Once to generate a typical LBR-based execution frequency profile using only the br_inst_retired samples.
  • Again to form a new profile of mispredicted branches using only the br_misp_retired samples.
  • --sample-period: provides the sampling period for this event so that the two profiles have comparable magnitudes. This is important for the compiler to be able to compute branch mispredict ratios.

One question on the details: I wonder if it makes sense to relax the requirement of making profile counters comparable outside of the compiler, specifically in the broader context of profiling different types of events.

To elaborate on my question with an example: block frequencies are derived from LBR events and known in the compiler today. The absolute counters from other events might be recorded as they are in the SPGO profiles, and compilers can be taught to derive percentages using the basic block as an anchor.

In the case of continuous PGO, how do we make sure a CMOV stays a CMOV for unpredictable source branches after many iterations of PGO? Once we have turned a branch into a CMOV, we may lose its branch mispredict profile if we collect a profile again. Any thoughts on mitigation?

Regarding losing profile data in continuous PGO, it is a common issue to be resolved and we need a shared mechanism for it (e.g., for vtable profiling). @mingmingl-llvm brought up the idea of recording the profile data of the previous IR construct (line+discriminator) in the binary so that the information can be ingested into the refreshed profile data.

Hope I won't derail the discussion too much. To make profiles stable in SPGO iterative compilation, the essence of what I have in mind is to record the instruction or basic block position (e.g., a position represented by the function MCSymbol and the inst/bb address offset within the function) along with the metadata in an (unloaded) ELF section. When profiling the SPGO-optimized binary, the ELF section and new hardware events are used together to keep profiles stable and as fresh as possible.

While this ELF section idea is in the design phase and prototyping work is needed before I have an RFC, I agree it should be as general as possible (e.g., reusable by CMOV conversion, if-conversion, or other optimizations that need profile stability in SPGO iterative compilation).

I'm wondering if it's really necessary to use absolute counters in this case. As far as I can see, the goal is to annotate !unpredictable, so maybe we can compute this early and save it in the profile. We could use an optional bit for each sample LineLocation, done by either extending the metadata field or extending the SampleRecord(VP) field. In this way, we can move all the computation in Introduce UnpredictableProfileLoader for PMU branch-miss profiles by tcreech-intel · Pull Request #99027 · llvm/llvm-project · GitHub from the compiler to llvm-profgen; it just needs some small changes in the existing sample loader for IR annotation (we don't need the additional pass).

I don’t think there is a particular reason to keep it turned off by default on x86 (though @apostolakis should confirm). It has been used in production internally for >1y now. Separately @mingmingl-llvm pointed out that it has already been made the default for ARM (⚙ D138990 [AArch64] Enable the select optimize pass for AArch64 and ⚙ D143162 [AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have FeatureEnableSelectOptimize).

Thanks for all the responses and comments so far. It’s great to see that there is some interest. I’ll try to respond to everything, perhaps in multiple comments.

The latter. The benchmarks are based on real-world applications and datasets which are meant to be representative of typical use.

Yes, in all cases the baseline is icx with SPGO. In other words, the improvements observed are solely from the additional !unpredictable metadata, on top of a fully-optimized baseline.

We don’t have the results on hand, but I will check on this.

We have done some early work on leveraging both data and instruction cache miss feedback. Evaluation is ongoing. In the case of icache misses, I don't think we looked into influencing inlining, however; that's an interesting thought.

Data cache miss feedback is interesting because it needs the source-level profile to have more precision to identify specific loads within a basic block. Line+discriminator locations (and I believe pseudo-probes, too) intentionally identify only distinct control-flow paths because that’s what’s needed for execution frequency profiles, but some kinds of profiles have interesting features within control-flow paths.
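
As a hypothetical illustration of why intra-block precision matters, all three loads below typically share one source line (and one discriminator), so a line-level profile cannot attribute a cache miss to a specific load:

/* Hypothetical example: three distinct memory operations on one source
 * line; line+discriminator granularity cannot separate their misses. */
double dot3(const double *a, const double *b, const double *c, int i) {
  return a[i] + b[i] * c[i]; /* three loads, one source location */
}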

Our first implementation added !unpredictable above some minimum absolute mispredict sample count. The problem we saw is that the sample count is just not very meaningful without execution count as a denominator to determine the mispredict ratio. We also considered making an !unpredictable decision early, in llvm-profgen, as @wlei suggests, but we feel this has a couple of longer-term drawbacks:

  • While the mispredict profile is hopefully portable to different hardware, profitability thresholds may vary across targets. Exposing the ratio to the compiler could allow per-target tuning with a common profile.
  • Long term we'd like to expose other orthogonal feedback types to the compiler in a way which allows it to compute similar ratios, for example, to effect per-target and per-transform profitability thresholds. Some optimizations may even benefit from metrics derived from more than two PMU counters; we don't have any in mind yet, but we'd like to keep the flexibility to derive such metrics.

We could accomplish this without requiring the user to give --sample-period by improving our changes to get the sample periods directly from perf output. (Likely disabled by default, and only enabled if the user wants HWPGO.)

We haven’t taken any special precautions here, though we are interested in the continuous PGO model. Others have noted that this sort of oscillation is a general SPGO problem. I suspect it can even happen today with only execution frequency feedback: branch probabilities can be used to estimate that a branch is unpredictable, and this may cause the branch to disappear from the next execution frequency profile.

We haven’t tried turning on SelectOptimize, meaning I believe we’re using X86CmovConversion. My understanding is that both benefit from unpredictable metadata, so in theory both should benefit.

The execution frequency profile is formed not from sample IPs, but from LBRs forming a control-flow trace leading up to each sample. LBRs are captured only on taken branches, and so we use br_inst_retired.near_taken essentially as a proxy for an “LBR inserts” event. The desired effect is, “give me an LBR-based trace every N times we update the LBRs.” (I believe this is what's recommended for AutoFDO, and so it's not specific to the branch mispredict work.)

For the mispredicts we use br_misp_retired.all_branches so that we can collect indirect branches too, but both all_branches and conditional should be comparable to the LBR-based execution frequencies.

We actually did consider this. The issue is that LBRs only record taken branches, with not-taken branches implied. This means you only know when taken branches were mispredicted. We found that the branch mispredict ratio of only the taken branches did not accurately represent the overall branch mispredict ratio, and so we decided to use a separate event to include all branches. Given that we hope to extend the idea to other events which do not have special LBR bits, this seemed reasonable.

Thanks. We would like to explore this idea if there seems to be community support. A single file would certainly improve usability.

I’m interested in ideas here. Ultimately we don’t need profiles with absolute values – this is just one way to make them comparable in the face of varying sampling periods and the fact that one is LBR-based while the other is not.

Are you suggesting that we add a way to compute the absolute execution counts from today’s block frequencies to avoid adjusting and re-reading the execution profile? (Sorry if I’ve misunderstood.)

After reading Introduce UnpredictableProfileLoader for PMU branch-miss profiles by tcreech-intel · Pull Request #99027 · llvm/llvm-project · GitHub, I understand the motivation for profiling/annotating unpredictable branches on the IR better. I agree that llvm-profgen is a better place to know whether a branch is predictable or not (one bit of information).

To clarify my original comment, I’m not a big fan of keeping two profiles comparable and requiring users to configure --sample-period to achieve that, and my question is around whether we can improve on this (usability/flexibility) aspect.

The example I gave in the original comment does not apply to !unpredictable annotation though.

Got it. Thank you for the clarification.
Can I ask how you feel about the idea in the last paragraph below, copied from above?

This would get rid of --sample-period=..., as llvm-profgen would obtain sample periods automatically, and it doesn't require any special maintenance of the profiles afterward. Tools like llvm-profdata merge --sample would work as expected on profiles created this way.

For memprof (profiling of memory allocations) we are using line offset + column for this reason.

Ah, that’s interesting. We’re actually doing the same in our prototyping.

I think the impact on the application will differ, though. It would be great if select optimize could be used as part of the evaluation since it is a profile-guided approach, unlike the prior pass. I'm curious whether the unpredictability of the branch (which can be approximated by PGO counts) matters more, or the criticality of the decision (dependent instructions). As far as I know, the select optimize pass approximates the former and places emphasis on the latter.

With columns you can get better profile association with expressions within a statement, but individual expressions may generate multiple memory operations. For example, I believe something like res = a * mat[1][2][3] will have only one column number for mat but multiple memory operations. We investigated data cache miss feedback internally a while ago and solved this with an additional discriminator inserted at the machine function stage. See the code in llvm-project/llvm/lib/Target/X86/X86DiscriminateMemOps.cpp at main · llvm/llvm-project · GitHub

I'm very interested in this direction, though, since we spent some time evaluating it a few years ago but were unable to scale it beyond small applications/microbenchmarks. A paper similar to our internal experiments was published: APT-GET | Proceedings of the Seventeenth European Conference on Computer Systems.

Thanks for this. We're familiar with X86InsertPrefetch and X86DiscriminateMemOps, and some of our prototypes are certainly influenced by that scheme, especially the idea of maintaining an auxiliary profile which re-uses the SPGO formats.

We’ve tended not to use X86DiscriminateMemOps (for these other prototypes) mainly because it doesn’t help with profiling non-memory instructions, and we are interested in establishing very general techniques. I hadn’t realized it was so easy to show an example where multiple loads share a column, however, so we may need to revisit. (I’ve confirmed that your example does result in 3 loads in a single column and that enabling X86DiscriminateMemOps allows discrimination as expected.)

Agreed. I’ll see if I can find any evidence of SelectOptimize yielding different results with branch mispredict feedback.

Thanks for following up! I replied inline.

While both the compiler and standalone LLVM tools can access TargetTransformInfo to get backend information, I agree the compiler is generally a better place for per-target fine-tuning than standalone tools (not limited to branch predictability).

I'm in favor of the direction of saving the compiler user the effort of setting --sample-period correctly, especially if the engineering cost of doing so is reasonable (and justified by better usability).

I ran a similar experiment bootstrapping our own compiler, then evaluating time (in seconds) to compile TraMP-3d:

x Execution frequency feedback (baseline)
+ Execution frequency, SelectOptimize enabled
* Execution frequency, branch mispredicts
% Execution frequency, branch mispredicts, SelectOptimize enabled
+------------------------------------------------------------------------+
|         *         %                   +                                |
|%        *  %  *  *O *  %    + ++    + +     %       x   x xx       x  x|
x                                                       |___M_A______|   |
+                              |____A___|                                |
*          |____AM___|                                                   |
%     |_____________MA______________|                                    |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   6     24.301482     24.328502     24.310511     24.313672    0.01028013
+   6      24.26375      24.27977     24.272773     24.272727  0.0068773174
Difference at 99.5% confidence
        -0.0409452 +/- 0.0209197
        -0.168404% +/- 0.0859407%
        (Student's t, pooled s = 0.00874582)
*   6      24.23344     24.252322      24.24495     24.243147  0.0078242445
Difference at 99.5% confidence
        -0.0705245 +/- 0.0218508
        -0.290061% +/- 0.0897054%
        (Student's t, pooled s = 0.00913509)
%   6     24.219657     24.288794     24.248967     24.250278    0.02278254
Difference at 99.5% confidence
        -0.0633935 +/- 0.042275
        -0.260732% +/- 0.173797%
        (Student's t, pooled s = 0.0176738)

The output above is from ministat(1). You may need to scroll the pre-formatted box to see all of it.

The baseline (“x”) is optimized with SPGO via our typical profile training workloads, so it’s ~10% faster than without any feedback. The data shown evaluates the effects of adding mispredict feedback and/or enabling SelectOptimize on top of this.

The results show a small improvement (-0.17%) from SelectOptimize alone.
Branch mispredict feedback shows a greater improvement (-0.29%).
Combining the two exhibits more run-to-run variation for some reason, but the improvement is similar (-0.26%).

Overall the results suggest that branch mispredict feedback can improve even beyond profile-driven SelectOptimize.