We (Intel) are interested in upstreaming initial work toward supplementary sample-based feedback types.
We refer to SPGO with supplementary profiles from hardware performance counters as “Hardware-based PGO” (HWPGO). HWPGO feedback is opt-in, provides additional profiles on top of current SPGO, and re-uses existing formats/tooling whenever possible.
The usage model is currently a bit ad-hoc, with only minimal tooling. This is an area we are particularly interested in feedback on.
While the sections below focus on Linux/perf, HWPGO also works on Windows using Intel SEP as a profiler.
A talk on HWPGO was given at EuroLLVM 2024, with slides available here:
https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf
Proposal
In the short-term, we propose contributing:
- llvm-profgen support for generating arbitrary performance counter-based profiles, and,
- a new pass which annotates IR with !unpredictable metadata based on a profile of branch mispredicts.
These changes are opt-in and do not affect current SPGO users uninterested in HWPGO. They demonstrate the general technique of collecting a secondary PMU profile and using it toward optimization.
Longer-term we’d like to extend the concept to additional PMU metrics while keeping the feedback process manageable for users. We’d like to gauge interest in the general idea, and see if there are comments on how best to go about it.
Motivation
A sampling profiler typically leverages hardware counters to sample execution of instructions. The hardware can also simultaneously produce profiles of other events such as branch mispredicts and cache misses in addition to the usual retired branches and LBRs used for execution frequency feedback. Often a very large number of metrics are available.
Furthermore, some PMUs can sample with enough precision that the profiles can be used to compute high-level metrics such as branch mispredict or cache miss ratios for individual instructions.
This is not Intel-specific, as other hardware certainly exposes similar events. However, we can’t speak to the precision of sampling on other hardware. (Feedback appreciated here.)
We believe there is value in providing a number of these supplementary profiles alongside the execution frequency profile.
These profiles can provide information not available via static analyses or even instrumentation.
For example, there may be utility in enabling data cache miss feedback, instruction cache miss feedback, SIMD utilization feedback, etc.
Also, target-specific feedback for backends, such as DSB events, frequency licensing, etc., is possible.
Branch Mispredict Feedback
One new feedback type we have explored is branch mispredict profiles. The intent is to identify genuinely unpredictable branch conditions, not target-specific issues.
A high branch mispredict ratio is a strong hint to the compiler that additional speculation in order to eliminate control flow may be profitable.
This kind of feedback is uniquely available from hardware counters, not available even via instrumentation.
The compilation process (on Intel hardware) looks something like:
# First compilation:
clang -O2 -gline-tables-only -fdebug-info-for-profiling app.c -o app
# A single profiling run with additional events:
perf record -o app.perf.data -b -c 1000003 -e br_inst_retired.near_taken:uppp,br_misp_retired.all_branches:upp -- ./app
# Generate multiple source-level profiles from the single binary-level profile:
llvm-profgen --perfdata app.perf.data --binary app --output app.freq.prof --sample-period 1000003 --perf-event br_inst_retired.near_taken:uppp
llvm-profgen --perfdata app.perf.data --binary app --output app.misp.prof --sample-period 1000003 --perf-event br_misp_retired.all_branches:upp --leading-ip-only
# Finally, feedback to next compilation:
clang -O2 -fprofile-sample-use=app.freq.prof -mllvm -unpredictable-hints-file=app.misp.prof app.c -o app.2
Only a single perf run is needed, but we run llvm-profgen twice:
- Once to generate a typical LBR-based execution frequency profile using only the br_inst_retired samples.
- Again to form a new profile of mispredicted branches using only the br_misp_retired samples.
Three new llvm-profgen options are used to form profiles suitable for HWPGO:
- --perf-event: choose the event to form the profile from.
- --leading-ip-only: do not use the LBR trace but only the leading IP to contribute to the profile. Only the instruction at the sampled IP is known to have mispredicted.
- --sample-period: provides the sampling period for this event so that the two profiles have comparable magnitudes. This is important for the compiler to be able to compute branch mispredict ratios.
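To illustrate why --sample-period matters, here is a minimal arithmetic sketch. It assumes a simplified model in which each sample stands for roughly one sampling period's worth of events, and it ignores the extra counts an LBR trace contributes to the frequency profile. The function name and the sample counts are hypothetical, and this is not a description of how llvm-profgen scales counts internally.

```python
def estimated_events(sample_count: int, sample_period: int) -> int:
    """Each sample is taken once per `sample_period` events, so it stands
    for roughly that many occurrences (simplified model)."""
    return sample_count * sample_period


# Hypothetical sample counts for one branch, collected with different periods.
taken_samples, taken_period = 50, 1_000_003
misp_samples, misp_period = 200, 100_003

# Dividing raw sample counts is meaningless when the periods differ.
naive_ratio = misp_samples / taken_samples

# Scaling by the period first puts both counts in the same unit (events).
scaled_ratio = (estimated_events(misp_samples, misp_period)
                / estimated_events(taken_samples, taken_period))

print(f"naive ratio:  {naive_ratio:.2f}")   # greater than 1, not a valid ratio
print(f"scaled ratio: {scaled_ratio:.2f}")
```

In the commands above both events use the same period (1000003), so the raw counts would already be comparable in this simplified model; the scaling matters once the periods differ.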
These options are required only when using HWPGO. Existing SPGO workflows do not need any of them and are completely unaffected.
Both output profile formats are unchanged from today, and app.freq.prof is used exactly as before, i.e., there are no changes to the SampleProfileLoader path.
When recompiling, the branch mispredict profile is used in combination with the execution frequency profile to compute mispredict ratios and add !unpredictable metadata.
This is implemented in an “UnpredictableProfileLoader” pass.
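As a rough illustration of combining the two profiles, the sketch below computes a per-branch mispredict ratio and selects the branches that exceed a threshold. The location key, the dictionary representation, and both threshold values are assumptions made for this example; the actual heuristics of the pass are not described here.

```python
from typing import Dict, List, Tuple

# Hypothetical key for a branch: (function name, line offset).
Location = Tuple[str, int]


def unpredictable_branches(freq: Dict[Location, int],
                           misp: Dict[Location, int],
                           min_ratio: float = 0.2,
                           min_executions: int = 1000) -> List[Location]:
    """Return locations whose mispredict ratio is at least `min_ratio`.

    Both thresholds are illustrative values, not taken from the pass.
    """
    selected = []
    for loc, mispredicts in misp.items():
        executions = freq.get(loc, 0)
        # Skip cold or unknown branches: the ratio would be too noisy.
        if executions < min_executions:
            continue
        if mispredicts / executions >= min_ratio:
            selected.append(loc)
    return sorted(selected)


# Made-up counts, already scaled to comparable magnitudes.
freq_profile = {("f", 3): 100_000, ("g", 7): 100_000, ("h", 1): 500}
misp_profile = {("f", 3): 45_000, ("g", 7): 1_000, ("h", 1): 400}

candidates = unpredictable_branches(freq_profile, misp_profile)
print(candidates)  # branches that would receive !unpredictable in this model
```

Here only the branch in `f` qualifies: the branch in `g` is well predicted, and the branch in `h` is too cold for its ratio to be trusted.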
!unpredictable metadata is an existing concept in LLVM, produced today mainly by Clang’s __builtin_unpredictable.
LLVM already uses this metadata in both IR and Machine IR to make optimization decisions, such as promoting conditional instructions over control flow.
This functionality has been in our downstream OneAPI compiler (“icx”) targeting both Windows and Linux since 2024.0. An article describing an example is available here:
Performance Results
Branch mispredict feedback has less broad impact when compared with fundamental execution frequency feedback.
Typically improvement comes from the identification of a genuinely unpredictable branch condition which static analysis or instrumentation feedback cannot identify.
As a result, improvement is typically in a smaller number of cases, with no effect in other cases.
Branch mispredict feedback was developed in the OneAPI compiler to improve the performance of a very popular piece of codec software on Windows. The application-level performance improvement from branch mispredict feedback was 8% in that case.
In another suite of benchmarks based on real-world applications and datasets we observe 1% and 5% improvement in 2 out of 16 tests, with no significant impact in the other 14 tests.
Finally, in CoreMark-PRO we have observed 10% improvement from CMOV conversion of one particular branch with the OneAPI compiler. (See EuroLLVM slides above.)
These improvements are in addition to any improvements from DWARF/LBR-based SPGO.
As more optimizations are developed to take advantage of !unpredictable metadata, these results may improve. For example, see llvm-project 3d494bfc.
Usability and Managing Multiple Profiles
Longer term, we are exploring extending HWPGO beyond branch mispredict feedback, e.g., to cache misses, SIMD utilization, etc.
The usage shown above becomes tedious and error-prone as more profile types are added. Each new metric requires another llvm-profgen run, maintenance of another profile, and a new option to name the profile file for recompilation.
To simplify usage, we propose establishing some kind of “profile bundle” concept. For example, this might look like a directory containing well-known profile types, which the compiler driver could use to identify all of the profile types.
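One possible shape for this, sketched under stated assumptions: a bundle is a directory, each profile type has a well-known file name, and a driver-like helper maps the files it finds to the corresponding options. The file names `freq.prof` and `misp.prof` and the helper itself are hypothetical; only the two compiler options come from the usage shown earlier.

```python
import tempfile
from pathlib import Path
from typing import List

# Hypothetical well-known file names mapped to the options that consume them.
KNOWN_PROFILES = {
    "freq.prof": lambda p: [f"-fprofile-sample-use={p}"],
    "misp.prof": lambda p: ["-mllvm", f"-unpredictable-hints-file={p}"],
}


def clang_args_for_bundle(bundle_dir: str) -> List[str]:
    """Collect compiler options for every known profile present in the bundle."""
    args: List[str] = []
    for name, make_args in KNOWN_PROFILES.items():
        path = Path(bundle_dir) / name
        if path.is_file():  # unknown or missing files are simply ignored
            args.extend(make_args(path))
    return args


# Usage: build a throwaway bundle and see which options it yields.
with tempfile.TemporaryDirectory() as bundle:
    for name in ("freq.prof", "misp.prof", "notes.txt"):
        (Path(bundle) / name).touch()
    args = clang_args_for_bundle(bundle)
    print(args)
```

With this layout, adding a new profile type means adding one table entry rather than asking users to learn another option.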
It could also be useful to provide tooling to orchestrate the perf and/or llvm-profgen runs to simplify this for the user. For example, a tool which can identify any known events present in perf output and invoke llvm-profgen runs with appropriate options to create the “profile bundle.”
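A minimal sketch of the llvm-profgen half of such a tool follows. It assumes the caller has already determined which events are present in the perf data (that step is not shown), and it uses a hypothetical table of known events; the llvm-profgen options are the ones from the usage shown earlier. The helper only builds command lines and does not run anything.

```python
from typing import Dict, List

# Hypothetical table: known events, the profile they produce, and whether
# only the leading IP should contribute to the profile.
KNOWN_EVENTS: Dict[str, Dict[str, object]] = {
    "br_inst_retired.near_taken:uppp": {"suffix": "freq", "leading_ip_only": False},
    "br_misp_retired.all_branches:upp": {"suffix": "misp", "leading_ip_only": True},
}


def profgen_commands(events: List[str], perfdata: str, binary: str,
                     sample_period: int) -> List[List[str]]:
    """Build one llvm-profgen command line per known event; skip the rest."""
    commands = []
    for event in events:
        info = KNOWN_EVENTS.get(event)
        if info is None:
            continue  # event not understood by this (hypothetical) tool
        cmd = ["llvm-profgen",
               "--perfdata", perfdata,
               "--binary", binary,
               "--output", f"{binary}.{info['suffix']}.prof",
               "--sample-period", str(sample_period),
               "--perf-event", event]
        if info["leading_ip_only"]:
            cmd.append("--leading-ip-only")
        commands.append(cmd)
    return commands


cmds = profgen_commands(
    ["br_inst_retired.near_taken:uppp", "cycles:u", "br_misp_retired.all_branches:upp"],
    perfdata="app.perf.data", binary="app", sample_period=1000003)
for cmd in cmds:
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` as-is; the output files it names would then form the contents of the bundle.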
We haven’t worked on this tooling yet as the usage is still just about manageable for expert users, but any thoughts are greatly appreciated.