In an ideal scenario, the compiler would have perfect knowledge of expected runtime behavior. PGO approximates expected behavior to the compiler via runtime profiles. The profile and its forms within the compiler should line up with runtime behavior of the program. As such, it would be beneficial to cross check the compiler profile analysis against actual execution traces.
Our aim is to develop tools and systems for analyzing the quality of PGO for end-to-end compilation. To perform the comparison, we need ways to emit the PGO related analysis (BPI, BFI, etc.) after compilation, ways to correlate the compiler analysis with the runtime traces, and tools and metrics that can perform the comparison.
Our proposed additions to LLVM consist of two parts: emitting PGO-related data and tools for generating metrics. We plan to emit the analysis data into an ELF section similar to BBAddrMap. To consume the compiler analysis, we propose a new utils directory, PGOAccuracy, with scripts for these metrics.
Emitting PGO Related Data
We need to emit the compiler information to compare compiler PGO information against runtime traces. The metrics need a way to map BranchProbabilityInfo back to individual branches and map BlockFrequencyInfo back to basic blocks. Ideally this would be emitted as a section of the ELF so that it can be combined from each compilation unit.
To implement this behavior, we are suggesting an extended version of BBAddrMap, emitted as its own ELF section. Our extension, which we will call PGOBBAddrMap, would add fields for block weights and for branch probabilities on blocks with multiple successors. The base BBAddrMap would allow mapping the block information to its program locations.
PGOBBAddrMap
The following snippet demonstrates what PGOBBAddrMap would look like when output by a tool such as llvm-objdump. PGOBBAddrMap is a superset of BBAddrMap; fields that are not shared with BBAddrMap are denoted by +.
Function {
  At: 0xXXXX
  Name: ccc
+ Entry Count: NN
  BB entries [
    {
      ID: NN
      Offset: 0xXX
      Size: 0xXX
      Bitfield: ...
+       | 2 Bits Jump Type
+     Weight: NN
+     SuccessorCount: (Optional Number Depending on Jump Type)
+     Successors: (Not used on single successor blocks) [
+       {
+         ID: NN
+         Prob: FF
+       },
+       ...
+     ]
    },
    ...
  ]
}
To handle successor information for branch probabilities, the bitfield would contain 2 bits for encoding three jump types: unconditional successor, two successors, and many successors. The vast majority of blocks end with an unconditional successor or two successors via a conditional jump, allowing Successor Count to be omitted. For jump tables and other multiple successor blocks, we would need to include a Successor Count to know the variable length.
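As a sketch, the 2-bit jump type could drive a small decoder like the following. The numeric encoding values and the helper name here are assumptions for illustration, not part of the proposed format.

```python
# Illustrative encoding for the proposed 2-bit jump-type field.
# The concrete values are assumptions, not the final format.
UNCONDITIONAL = 0  # one implicit successor, no count stored
CONDITIONAL = 1    # exactly two successors, no count stored
MANY = 2           # jump table / indirect branch, explicit count follows

def successor_count(jump_type, explicit_count=None):
    """Return how many successor records follow a BB entry."""
    if jump_type == UNCONDITIONAL:
        return 1
    if jump_type == CONDITIONAL:
        return 2
    if jump_type == MANY:
        if explicit_count is None:
            raise ValueError("many-successor blocks must encode a count")
        return explicit_count
    raise ValueError("unknown jump type")
```

Because the common unconditional and conditional cases carry an implicit count, only jump-table-style blocks pay for the extra SuccessorCount field.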
The underlying implementation of PGOBBAddrMap would be an adapter over BBAddrMap. Existing consumers of BBAddrMap should not need to change if they do not care about the PGO extension.
Scripts and Tools for Generating Metrics
To generate the accuracy metrics, we need accurate runtime traces plus tools that can consume and compare the traces and compiler output. We are suggesting adding a directory under utils called PGOAccuracy to contain scripts and tracing tools that support these metrics.
To generate accuracy metrics, we want to add a Python script that consumes the compiler data and target-dependent traces, then emits the relevant statistics. We have an existing implementation covering just branch data at D158890 that will be extended to include block frequencies. The script generates a CSV containing stats (compiler prediction, runtime execution, derived metrics, and more) for individual branches, and eventually blocks as well. It also generates overall stats (weighted average error, branch direction match, and more) for the whole program. These metrics were inspired by some of the work done in profi [1] and Google's ML PGO prediction work [2] for comparing loaded profiles.
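To make the overall stats concrete, here is a minimal sketch of how weighted average error and direction match could be computed from per-branch records. The formulas are an illustrative reading of the metrics described above, not the exact D158890 implementation.

```python
def branch_metrics(branches):
    """Compute overall accuracy stats from per-branch records.

    branches: list of (predicted_taken_prob, observed_taken_frac, exec_count)
    Returns (weighted_average_error, direction_match_fraction).
    Assumes a non-empty list with at least one executed branch.
    """
    total = sum(count for _, _, count in branches)
    # Each branch's prediction error is weighted by how often it executed.
    wae = sum(abs(p - f) * count for p, f, count in branches) / total
    # Direction matches when compiler and runtime bias the same way
    # (both taken-biased or both not-taken-biased).
    matches = sum(1 for p, f, _ in branches if (p >= 0.5) == (f >= 0.5))
    return wae, matches / len(branches)
```

For example, a branch predicted 90% taken that runs fully taken contributes a small error, while a branch predicted 30% taken that runs 60% taken both adds error and counts as a direction mismatch.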
Runtime tracing is, to an extent, target dependent. For x86, we have a Pin tool plugin (consisting of a single .cpp file and makefiles) that can collect the relevant data. An initial version covering just branch traces can be seen in patch D158890. This is one way to collect the accurate traces necessary on x86, but by no means the only way. Other targets would have to find or implement their own tracing methods.
The metric script should try to encapsulate the consumption of runtime traces so that other targets could reuse the metric collection and compiler emitted analysis, needing only to re-implement consumption for the target traces.
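One way to achieve that encapsulation is a small reader interface that the metric script depends on, so a new target only implements one reader for its trace format. The class and method names below, and the CSV example format, are hypothetical.

```python
import abc

class TraceReader(abc.ABC):
    """Target-specific trace consumption behind a common interface.

    The metric script would depend only on this interface; supporting a
    new target means implementing one reader for its trace format.
    """

    @abc.abstractmethod
    def branch_records(self):
        """Yield (branch_address, taken_count, not_taken_count) tuples."""

class CsvTraceReader(TraceReader):
    """Hypothetical reader for a simple 'addr,taken,not_taken' CSV dump."""

    def __init__(self, lines):
        self.lines = lines

    def branch_records(self):
        for line in self.lines:
            addr, taken, not_taken = line.strip().split(",")
            yield int(addr, 16), int(taken), int(not_taken)
```

A Pin-based reader for x86 and readers for other targets' tracing mechanisms would then plug into the same metric pipeline.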
Preliminary Analysis of BranchProbabilityInfo
At MediaTek, we implemented ways to do this for just BranchProbabilityInfo. We emitted the probabilities into the ELF with links to their originating conditional branches and collected traces from our benchmarks. Our scripts mapped individual branches to compare the compiler probability against execution taken percent. For metrics on these conditional branches, we collected the difference in the values and whether they bias in the same direction (taken or non-taken). We also generated overall statistics judging the weighted average error and overall percent whose direction matched. Implementation of the prototype emitting process for X86 and script for analysis can be seen in patch D158889 and D158890 respectively.
We performed an initial analysis of Sample PGO on four benchmarks from llvm-test-suite. Sample PGO was picked over instrumentation-based PGO since we use it internally and it is growing increasingly popular due to its lower profiling overhead.
The following plot shows the overall metrics for direction match and weighted average error. Baseline execution with no PGO is shown on the left (blue) and Sample PGO execution is shown on the right (orange). There is still room to increase PGO direction match and decrease average error.
Using this analysis on our internal compiler, we have been able to find and fix some sources of inaccuracies and evaluate other changes:
- In our backend, we found a case where we dropped profile supplied branch weights even though the transformation could have preserved them.
- In our backend, we confirmed accuracy improvements via better handling of debug information in the sample compilation.
- These metrics detected patterns of inaccuracy caused by loop rotation duplicating the guard and latch branch probabilities. (We have not yet attempted a fix for this.)
- When optimizing for size (a common case for our target), constant-iteration loops would sometimes bias the exiting edge even though the loop was statically known to repeat multiple times. This was caused by skewing common in Sample PGO. We experimented with a pass that adjusts the weights for this case.
In addition to finding improvements, this analysis is helpful in our CI as it provides more context on top of the existing metrics.
Complementary PGO Work
This RFC is complementary to the recent "[RFC] Profile Information Propagation Unittesting", which aims to improve PGO metadata unit testing. Our work focuses on end-to-end accuracy while their work focuses on pass-by-pass metadata handling.
We welcome any feedback on the proposal. Thanks!
[1] W. He, J. Mestre, S. Pupyrev, L. Wang, and H. Yu, "Profile Inference Revisited," Proceedings of the ACM on Programming Languages, https://dl.acm.org/doi/10.1145/3498714 (accessed Aug. 22, 2023).
[2] E. Raman and X. D. Li, "Learning branch probabilities in compiler from Datacenter workloads," arXiv.org, https://arxiv.org/abs/2202.06728 (accessed Aug. 22, 2023).