[RFC] PGO Accuracy Metrics: Emitting and Evaluating Branch and Block Analysis

In an ideal scenario, the compiler would have perfect knowledge of expected runtime behavior. PGO approximates expected behavior to the compiler via runtime profiles. The profile and its forms within the compiler should line up with runtime behavior of the program. As such, it would be beneficial to cross check the compiler profile analysis against actual execution traces.

Our aim is to develop tools and systems for analyzing the quality of PGO for end-to-end compilation. To perform the comparison, we need ways to emit the PGO related analysis (BPI, BFI, etc.) after compilation, ways to correlate the compiler analysis with the runtime traces, and tools and metrics that can perform the comparison.

Our proposed additions to LLVM consists of two parts: emitting PGO related data and tools for generating metrics. We plan to emit the analysis data into an ELF section like BBAddrMap. As for tools to consume the compiler analysis, we present a new utils folder PGOAccuracy with related scripts for these metrics.

Emitting PGO Related Data

We need to emit the compiler information to compare compiler PGO information against runtime traces. The metrics need a way to map BranchProbabilityInfo back to individual branches and map BlockFrequencyInfo back to basic blocks. Ideally this would be emitted as a section of the ELF so that it can be combined from each compilation unit.

To implement this behavior, we are suggesting creating an extended version of BBAddrMap, existing as a unique ELF section. Our extension, which we will call PGOBBAddrMap, would add additional fields for block weights and branch probabilities on blocks with multiple successors. The base BBAddrMap would allow for mapping the block information to their program locations.

PGOBBAddrMap

The following snippet demonstrates what PGOBBAddrMap would look like when output by a tool such as llvm-objdump. PGOBBAddrMap is a superset of BBAddrMap. Fields that are not shared with BBAddrMap are denoted by +.

Function {
  At: 0xXXXX
  Name: ccc
+ Entry Count: NN
  BB entries [
    {
      ID: NN
      Offset: 0xXX
      Size: 0xXX
      Bitfield: ... 
+             | 2 Bits Jump Type
+     Weight: NN
+     SuccessorCount: (Optional Number Depending on Jump Type)
+     Successors: (Not used on single successor blocks) [
+       {
+         ID: NN
+         Prob: FF
+       },
+       ...
+     ]
    },
    ...
  ]
}

To handle successor information for branch probabilities, the bitfield would contain 2 bits for encoding three jump types: unconditional successor, two successors, and many successors. The vast majority of blocks end with an unconditional successor or two successors via a conditional jump, allowing Successor Count to be omitted. For jump tables and other multiple successor blocks, we would need to include a Successor Count to know the variable length.

The underlying implementation of PGOBBAddrMap would be an adapter over BBAddrMap. Existing consumers of BBAddrMap should not need to change if they do not care about the PGO extension.

Scripts and Tools for Generating Metrics

To generate the accuracy metrics, we need runtime accurate runtime traces and tools that can consume and compare the traces and compiler output. We are suggesting adding a directory under ‘utils’ called PGOAccuracy to contain scripts and tracing tools that are helpful to the metrics.

To generate accuracy metrics, we want to add a python script that consumes the compiler data and target-dependent traces, then emits relevant stats. We have an existing implementation for just branch data at D158890 that will be extended to include block frequencies. The script generates a CSV containing stats (compiler prediction, runtime execution, derived metrics, and more) for individual branches and eventually blocks as well. It also generates some overall stats (weighted average error, branch direction match, and more) for the whole program. These metrics were inspired by some of the work done in Profi [1] and Google’s ML PGO prediction work [2] for comparing loaded profiles. The script would be extended to include block frequencies as well.

For runtime tracing, this is to an extent target dependent. For x86, we have a Pin Tool plugin (consisting of a single cpp file and makefiles) that can collect relevant data from X86. An initial version for just branch traces can be seen patch D158890. This is one way to collect the accurate traces necessary on x86 but by no means the only way. Other targets would have to find or implement their own tracing methods.

The metric script should try to encapsulate the consumption of runtime traces so that other targets could reuse the metric collection and compiler emitted analysis, needing only to re-implement consumption for the target traces.

Preliminary Analysis of BranchProbabilityInfo

At MediaTek, we implemented ways to do this for just BranchProbabilityInfo. We emitted the probabilities into the ELF with links to their originating conditional branches and collected traces from our benchmarks. Our scripts mapped individual branches to compare the compiler probability against execution taken percent. For metrics on these conditional branches, we collected the difference in the values and whether they bias in the same direction (taken or non-taken). We also generated overall statistics judging the weighted average error and overall percent whose direction matched. Implementation of the prototype emitting process for X86 and script for analysis can be seen in patch D158889 and D158890 respectively.

Initial analysis of Sample PGO on four benchmarks from llvm-test-suite. Sample PGO was picked over Instrumentation since we use that internally and it is growing increasingly popular due to the lower profiling overhead.

The following plot shows the overall metrics for direction match and weighted average error. Baseline execution with no PGO is shown on the left (blue) and Sample PGO execution is shown on the right (orange). There is still room to increase PGO direction match and decrease average error.

image
image

Using this analysis on our internal compiler, we have been able to find and fix some sources of inaccuracies and evaluate other changes:

  • In our backend, we found a case where we dropped profile supplied branch weights even though the transformation could have preserved them.
  • In our backend, we confirmed accuracy improvements via better handling of debug information in the sample compilation.
  • These metrics detected patterns of inaccuracy caused by loop rotation duplicating the guard and latch branch probabilities. (We didn’t get around to trying to fix or correct this).
  • When optimizing for size (a common case for our target), constant iteration loops would sometimes bias the exiting edge even though it was statically known to repeat multiple times. This was caused by skewing common in Sample PGO. We experimented with a pass that adjusts the weights for this case.

In addition to finding improvements, this analysis is helpful in our CI as it provides more context on top of the existing metrics.

Complimentary PGO Work

This RFC is complementary to other efforts in the recent “[RFC] Profile Information Propagation Unittesting” to improve PGO metadata unit testing. Our work focuses on end-to-end accuracy while their work focuses on pass-by-pass metadata handling.


We welcome any feedback on the proposal. Thanks!

[1] W. He, J. Mestre, S. Pupyrev, L. Wang, and H. Yu, “Profile Inference Revisited,” Proceedings of the ACM on Programming Languages, https://dl.acm.org/doi/10.1145/3498714 (accessed Aug. 22, 2023).

[2] E. Raman and X. D. Li, “Learning branch probabilities in compiler from Datacenter workloads,” arXiv.org, https://arxiv.org/abs/2202.06728 (accessed Aug. 22, 2023).

2 Likes

This is amazing work! Thank you for working on this and also for contributing this upstream.

The approach of placing everything in a separate ELF section makes a lot of sense for a variety of reasons:

  1. Like you mentioned, it allows for easily combining information from multiple compilation units.
  2. It avoids ambiguity when mapping information in and out of the executable. Just capturing function names can cause issues with things like functions that only have internal linkage.
  3. Certain build systems (bazel, buck) are mostly modeled on the one compilation artifact per compile step model which makes emitting a separate file (like a CSV) a little bit troublesome in certain applications. This is sort of an artificial limitation, but still one that exists currently.

I only have one comment on the details of the proposed section format, but others might have some comments/suggestions. It would probably be a good idea to add machine basic block frequency information into the format now rather than down the line to avoid churn in the section format. It should be a relatively simple addition.

Looking again at the original basic block address section patches, it seems like they landed incrementally. The section was added, renamed, and adjusted upstream before any object file tooling (llvm-objdump, llvm-readelf) gained support for the section. I’m not sure what others think should be necessary, but just landing a patch to emit the section and then later adding on additional tooling would probably be ideal. It allows for incremental development while getting something out there that is usable by everyone who is interested. Others (mainly reviewers) would have to agree on that though.

Hopefully we can reuse some of the infrastructure from the basic block address map implementation (especially so in tools like llvm-objdump where the code seems more complex), but in trivial cases, code sharing might not be possible/useful.

Pinging @rlavaee on this as he did the implementation of the basic block address map section and might have some comments/ideas.

Regarding the patch containing the analysis scripts/instrumentation tooling, I’m not sure those are useful to have in upstream LLVM. They’re definitely useful and having them in the open is good, but the components within LLVM similar to this are normally focused on enabling a paradigm (like evaluating profile accuracy) rather than specific utility scripts that depend upon third party libraries to do the implementation. This is especially true with the fact that everyone seems to have a different way they want to implement systematic profile evaluation (DynamoRIO, Pin, maybe even something like BOLT instrumentation).

Also, if you need/want someone to help write up patches to help get parts of this implemented, I’m happy to assist, but want to make sure I’m not stepping on any toes, so just let me know if you’re interested.

Looking forward to patches getting posted on Github!

Awesome! Summarizing some stuff we talked about at the dev meeting:

  • can MBFI be part of the proposed changes - we’d love to drop -mbb-profile-dump and use this instead. I guess it’d just be a matter of having a new field?
  • (assuming the above) can the format tolerate opting in/out of MBPI (same for MBFI)?
  • to make sure I understand, and/or stressing out a concern: this has 0 additional cost on binary size if it’s not enabled (incl. if bb addr map of today is enabled), correct?

Is the work X86-specific, or is it essentially ELF-specific - I imagine the latter, and the only reason D158889 is x86-only is because Pin (your current data consumer) is x86-only.

Would you be opposed to focusing this RFC to just the “production” of the data (i.e. the PGOBBAddrMap part), and not tie it to an architecture?

The Pin scripts could be on a separate git repo and referenced from patches (and/or in comments) - that way folks that use Pin can relatively easily discover and use them, and others that don’t use Pin (we use dynamorio) can take inspiration from them.

I think the Pin part of things opens a can of worms which isn’t worth opening, or at least not worth tying to PGOBBAddrMap, like: why makefile support (llvm is cmake); how is it tested; or which build bots will do that (because Pin isn’t a standard dependency).

Yes, this will be the plan. The “first” PGOBBAddrMap will contain BB counts, Branch Probs, and Function Entry Count.

This all makes sense. Will definitely do an incremental updates (at least for the PR). I’m attempting to reuse as much existing code from BBAddrMap (avoiding duplication) without modifying original BBAddrMap code too much.

There might be some room to delegate things once the initial PGOBBAddrMap is implemented, like getting the encoding/decoding into the readelf/objdump tools. Will update once I have an initial version of PGOBBAddrMap up.

It should be reasonable to opt in/out of these analysis in the ELF. I already started coming up with ways to do this so pending implementation.

Yes, all our proposed changes should not interfere with existing consumers of BBAddrMap. On the ELF side, it will be opt-in to an independent feature. On the code side, I’m trying to keep minimal changes to BBAddrMap but some changes may be done in encoding/decoding to promote reuse and avoid duplication.

This work should be ELF-specific where possible. With the redesign around BBAddrMap, it should avoid some of the target changes that were needed in D158889. The goal is to make this feature accessible to any target with minimal changes on the target side.

The RFC can definitely focus on the data production side. After hear about how others would want to consume the data, I see how the Pin tool is not needed. That can definitely be left out.

I’m working on PGOBBAddrMap now and will try to get up a forked branch or draft PR soon.

Thanks for the feedback :slight_smile:

I don’t have any concerns. The proposed change looks good to me. Regarding the implementation, you can use the existing feature byte (which is currently unused) in the BBAddrMap data to denote if PGO counts are included for one function or not. This can also save some space for cold functions.

Thanks for putting together this RFC. I think it more or less addressed some of the feedbacks on the original patch: 1) include block counts in addition to branch probabilities, 2) make it general purpose.

For #2, I second what others mentioned - many people don’t use PIN tool, so that piece is a bit too specific to be included in llvm tree. Instead, I think that support in llvm-objdump/llvm-readelf is all we need. The dumper/consumer part can be implemented incrementally in separate patches, but I’d love to see that going together in a stack with the producer.

Two more implementation related questions:

  • If we emit sections at the end of compilation, do we account for linker’s GC and ICF? I assume that linker won’t touch that extra ELF section, other than concatenating all input sections, even though the associated function maybe removed.

  • is At: 0xXXXX going to be the final address in ELF binary (linker’s output), which means it requires relocation/fixup at static link time.

FWIW: LLVM has a tradition of dumping things into llvm/utils even if they are not useful for everyone, have horrible python-coding style and end up being unmaintained… It’s the directory where this sort of things is allowed (as long as it’s somewhat related to things LLVM). Having some reference implementation of a tool using the information should be helpful, so having the PIN support there as an example should be fine.

There is an effort to clean up the python style there afaik, and at least the test tools have tests… wouldn’t treat it as a dumping place. Plus, I think in the face of broken glass we should rather clean than break more :slight_smile:

Why not its own separate repo for the pin stuff, it costs nothing, and it can grow its own life.

I may have slightly exaggerated my wording and the discussion may belong to a separate thread. But generally I think it is a good thing to have a place for mixed-quality things. Some of the stuff in llvm/utils sure has become load-bearing in the meantime (FileCheck, update_*_checks.py ) and maybe we should move them to llvm/tools instead.

A lot of others things in llvm/utils just sits around there for a niche audience and may have bitrotted in the meantime. I still think it’s good to allow that sort of thing as long as the expectations and level of support is clear, and to me anyway that is an important distinction between llvm/utils and llvm/tools

LLVM has a tradition of dumping things into llvm/utils even if they are not useful for everyone, have horrible python-coding style and end up being unmaintained…

You mentioned a bunch of not so good things there… I don’t see (bad) precedence as enough of a reason to justify it. :slight_smile: Though the more important thing is to have a general purpose proper consumer/dumper in tree.

The point I tried to make, is that I think it’s perfectly fine for some “bad” things to be in llvm/utils as long as it is just that directory. It does enable to publish some scripts and helpers that otherwise may have never been shared or published when they didn’t hit quality or usefulenes bars.

I may have slightly exaggerated my wording and the discussion may belong to a separate thread.

Ack - sorry, I kind of latched on that, panicking :slight_smile:

I… guess I could see it both ways, if there’s more consensus on “ok to have things w/o test, as reference” and especially if the MediaTek folks felt it’d help them to have it there, that would pretty much make it maintained by someone, and wouldn’t be quite “broken glass”. It’d be nice if we had some metric of “is anyone using it?” for things there that don’t have a test.

Still not sure what’s wrong with a different repo, discoverability-wise it’d be the same (I think?) because I’d still imagine there’d be a reference in the “producing” code, can be a path, can be a url.

(I also don’t want to dig a rathole, and I begin to fear I am :slight_smile: )

A separate discussion thread about tools and utils and maybe a third category could be productive. tools is a mix of things like the LLVM versions of binutils, which would be deliverable to end users, as well as developer-only tools like llc and opt. utils includes lit, and things generally used only by lit (e.g. FileCheck, not, split-file), plus random other stuff. The dividing line is not entirely clear in my mind, and it seems I’m not the only one.

Regardless of the outcome from that side discussion on tools vs utils, my main point is we need proper dumper support. Even if we end up having PIN tool scripts in utils (I’m not convinced yet), we still need dumper support for the new ELF section.

I second that. I kind of assumed it to be the case, tbh, since the dumper goes hand in hand (in my mind) with writing unit tests.

I believe that we tie each MachineFunction PGOBBAddrMap section back to the MachineFunction’s section. I’ll try to keep an eye out for this as I’m testing the implementation.

Yes, OutStreamer inserts the symbol which should handle the relocation/fixup.

I agree that the Pin specific parts may not be relevant enough to warrant upstreaming. Some of the python script may be helpful a base for consuming the PGOBBAddrMap data and associating with traces but it also may not need to leave in upstream.

I’m getting close to a first patch for PGOBBAddrMap which should cover most of the implementation in lib/include (Object/ObjectYaml/AsmPrinter) with tests. We are going to be doing some quick internal review of the code this week. After that first patch, there will be a small amount of work to add it into readobj and objdump.

We have an initial implementation that is split into 5 patches. I have posted the first PR with implementation in Object and ObjectYAML on GitHub. I also have draft PR’s on my repo if you want to see the diff the next two patches AsmPrinter and llvm-readobj. These three are useful for minimal testing.

The PRs and branches so far:

  1. Object and ObjectYAML - PR: Implements PGOBBAddrMap in Object and ObjectYAML with tests [1/5] by red1bluelost · Pull Request #71750 · llvm/llvm-project · GitHub Branch: GitHub - red1bluelost/llvm-project at pgo-bb-addr-map--object
  2. AsmPrinter - Fork PR: Implements PGOBBAddrMap in AsmPrinter with tests [2/5] by red1bluelost · Pull Request #1 · red1bluelost/llvm-project · GitHub Branch: GitHub - red1bluelost/llvm-project at pgo-bb-addr-map--asm-printer
  3. llvm-readobj - Fork PR: Implements llvm-readobj with tests. [3/5] by red1bluelost · Pull Request #2 · red1bluelost/llvm-project · GitHub Branch: GitHub - red1bluelost/llvm-project at pgo-bb-addr-map--llvm-readobj
  4. llvm-objdump - Branch: GitHub - red1bluelost/llvm-project at pgo-bb-addr-map--llvm-objdump
  5. llvm obj2yaml - Branch: GitHub - red1bluelost/llvm-project at pgo-bb-addr-map--llvm-obj2yaml

Notably on PR-1 Object/ObjectYAML, we avoided changing the existing API and BBAddrMap data structure. This comes at some cost in complexity when sharing encoding/decoding code. Alternatively, the extra data fields could be added directly to BBAddrMap which would simplify the shared code but make some of the exist API behavior change. Let us know if you feel that the alternative where fields are added directly to BBAddrMap would be preferred.

We look forward to feedback on the patches!

Side Note:
Does anyone know good ways to do a patch series through the GitHub PR system? In phabricator you could stack patches so that you don’t get unnecessary diffs. Not sure if GitHub has the same.

Nothing has been documented about the procedure yet, but it is possible. Essentially you’ll push a branch for each patch under the users namespace (so you’ll name it users/<your github username>/<name of the branch>) and then you’ll open up PRs that merge dependent branches into the branches that they depend on (rather than just main). Github will then automatically handle changing merge targets as PRs get merged. I think merging with rebase still needs to get enabled for this to work fully, but it should allow everything to be put up for review on llvm/llvm-project.

1 Like

@mtrofin and anyone interested. If you want to try the PGOBBAddrMap, I would recommend cloning this branch from my fork which contains all 5 patches so far: GitHub - red1bluelost/llvm-project at pgo-bb-addr-map.

To compile with PGOBBAddrMap on, the following flags should produces it. You can remove elements in the comma separated list for pgo-bb-addr-map if you don’t want branch probabilities or other analyses.

`-fbasic-block-sections=labels -mllvm --pgo-bb-addr-map=func-entry-count,bb-freq,br-prob`

I also made a script that shows one way to comsume the PGOBBAddrMap. If given an ELF with the PGOBBAddrMap including block frequencies and function entry counts, it would emit the same CSV as mbb-profile-dump. Here is a link to that script: https://github.com/red1bluelost/scripts-for-sharing/blob/main/pgobbaddrmap2mbbprofile.py .