[PGO] Details of IR representation of profile data

I have some questions regarding some of the details of (mostly instrumentation-based) PGO in LLVM, which I can’t seem to find the answers to on the internet or in the source code:

  1. In general, why do BlockFrequencyInfo and BranchProbabilityInfo use integers for representing frequencies/probabilities, instead of floating-point numbers?

    • For BlockFrequencyInfo, the BlockFrequency Terminology section of the docs says:

      Since downstream users need integers (not floating point), this initial frequency assignment is shifted as necessary into the range of uint64_t.

      What does this “need” refer to, and why are integer frequencies needed?

    • For BranchProbabilityInfo, it has used unsigned (i.e., uint32_t) since the beginning.

    Is this maybe related to simplicity/performance/precision/determinism? In some places, ScaledNumber and APFloat are used or referred to as well.
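    To illustrate what such a fixed-point encoding looks like, here is a minimal sketch in the spirit of BranchProbability (which stores a uint32_t numerator over an implicit denominator of 2^31); the class is simplified and not the actual LLVM implementation:

    ```cpp
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    // Simplified sketch of a fixed-point probability: a 32-bit numerator
    // over an implicit denominator of 2^31. Not the actual LLVM code.
    class FixedProb {
      static constexpr uint32_t D = 1u << 31; // implicit denominator
      uint32_t N;                             // numerator, 0 <= N <= D

    public:
      explicit FixedProb(double P)
          : N(static_cast<uint32_t>(std::round(P * D))) {}

      // Scaling an integer count stays entirely in integer arithmetic, so
      // the result is bit-identical on every host. (__int128 is a
      // GCC/Clang extension, used here only to avoid overflow.)
      uint64_t scale(uint64_t Count) const {
        return static_cast<uint64_t>(
            (static_cast<unsigned __int128>(Count) * N) / D);
      }
    };

    int main() {
      FixedProb P(0.53);
      std::printf("%llu\n", (unsigned long long)P.scale(1000000)); // ~530000
    }
    ```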

  2. Continuation of the first question: these integer representations in BlockFrequencyInfo can sometimes hide differences between blocks with frequencies of, for example, 0.47 and 0.53, which both become 8/16 when they are scaled here. This integer frequency is then used to compute profile counts in BlockFrequencyInfoImplBase::getProfileCountFromFreq in the same file. So this loss of accuracy is considered acceptable, right? I have also found some discussion of BlockFrequencyInfo accuracy in https://groups.google.com/g/llvm-dev/c/DElc7PHn5VM. Is the information there still relevant these days?
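    A small standalone demo of this quantization, using a hypothetical scale of 16 to mirror the 0.47/0.53 example (not the actual BFI code):

    ```cpp
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // Two relative frequencies scaled into a small integer range.
      const uint64_t Scale = 16;
      for (double F : {0.47, 0.53})
        std::printf("%.2f -> %llu/%llu\n", F,
                    (unsigned long long)std::llround(F * Scale),
                    (unsigned long long)Scale);
      // Both print 8/16: the difference between the two blocks is lost.

      // getProfileCountFromFreq then computes, roughly,
      //   count = entry_count * block_freq / entry_freq,
      // so the lost bits propagate into the reported profile counts.
      uint64_t EntryCount = 1000, EntryFreq = 16, BlockFreq = 8;
      std::printf("count = %llu\n",
                  (unsigned long long)(EntryCount * BlockFreq / EntryFreq));
    }
    ```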

  3. A question about patch D61540 ([PGO] Use sum of count values to fix func entry count and add a check to verify BFI counts), which is enabled by default. It changes the function entry count metadata in such a way that the sum of the computed BFI counters (which scales linearly with the entry count) equals the sum of the profile data counters. The stated reason for this change is:

    number pre[c]ision, inconsistent profile, or bugs in BFI.

    However, inlining actually subtracts the computed BFI count of the callsite from the callee’s entry count here. This means that a BFI count, hopefully matched to the profile data, is subtracted from a function entry count that has been scaled and no longer matches the profile data. Is this considered acceptable as well?
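    To make the concern concrete, a rough sketch of the kind of saturating update described above (names are hypothetical, not LLVM’s actual API):

    ```cpp
    #include <cstdint>
    #include <cstdio>

    // Illustrative sketch of the entry-count update performed after
    // inlining one callsite. CallSiteCount is the BFI-derived count of the
    // inlined callsite; if it overshoots the rescaled callee entry count,
    // the result saturates at zero even though other live callsites remain.
    uint64_t updatedEntryCount(uint64_t CalleeEntryCount,
                               uint64_t CallSiteCount) {
      return CalleeEntryCount > CallSiteCount
                 ? CalleeEntryCount - CallSiteCount
                 : 0;
    }

    int main() {
      // Hypothetical numbers: the entry count was rescaled to 1000 while
      // the callsite's BFI count stayed at 1200, so the subtraction
      // saturates and the remaining callsites now see an entry count of 0.
      std::printf("%llu\n",
                  (unsigned long long)updatedEntryCount(1000, 1200));
    }
    ```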

Too much precision is actually not a blessing but a curse for optimizations: it can introduce unnecessary performance instability due to slight profile changes. In fact, there is a pending patch to tolerate small diffs: 67197. @MatzeB

One big source of profile precision loss is race conditions on profile counters, so you cannot really rely on a super-precise profile in optimization decisions – what is important is that we don’t flip the direction of biased branches.
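To illustrate, here is a deliberately racy sketch of why non-atomic counter updates (the default for IR instrumentation unless atomic counter updates are requested) lose increments under contention:

```cpp
#include <cstdint>
#include <cstdio>
#include <thread>

// Plain, non-atomic increments on a shared counter, like default PGO
// instrumentation. This is a deliberate data race: increments are lost,
// so the final value is typically well below the expected 2000000.
uint64_t Counter = 0;

void hot() {
  for (int I = 0; I < 1000000; ++I)
    Counter = Counter + 1; // racy load-add-store
}

int main() {
  std::thread T1(hot), T2(hot);
  T1.join();
  T2.join();
  std::printf("counter = %llu (expected 2000000)\n",
              (unsigned long long)Counter);
}
```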

Regarding question 3 – it is about special-case handling. Since IRPGO does not always instrument the entry block, the entry count has to be computed during the annotation stage. When the collected profile is not consistent (e.g., with an infinite serving loop), the entry count computation can be wrong.
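A hypothetical, much-simplified illustration of how an inferred entry count can go wrong when the profile is dumped while a serving loop is still running (the numbers and the inference rule here are illustrative, not the actual annotation code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Suppose the entry block carries no counter and the entry count is
  // recovered from flow conservation at the loop header:
  //   header_count = entry_count + backedge_count
  //   => entry_count = header_count - backedge_count

  // With a consistent profile (loop ran 1000000 times, entered once):
  uint64_t Header = 1000000, Backedge = 999999;
  std::printf("entry = %llu\n", (unsigned long long)(Header - Backedge)); // 1

  // If the profile is written while the serving loop is still running, the
  // two counters may be captured at inconsistent points, e.g. the backedge
  // counter already bumped but the header counter not yet:
  Header = 1000000;
  Backedge = 1000000;
  std::printf("entry = %llu\n", (unsigned long long)(Header - Backedge)); // 0
}
```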

Thanks for the answer!

Interesting – that link also contains a reference to PR #66285 (“Bfi precision” by MatzeB), which is related to question 2 as well. Shouldn’t more precise internal representations of profile data help with instability anyway, if tolerances to small differences are introduced only where the representation is used?

For question 3 – I’ve had the inliner subtract too much (because of the BFI scaling) from the entry count, so that it became 0 while the function was still called from non-inlined callsites as well. The program was single-threaded, so there shouldn’t have been any race conditions, and there were no infinite serving loops, so presumably the instrumentation profile was OK. But maybe this doesn’t happen all the time, and I guess distinguishing between profile inconsistency and possible BFI bugs would be hard here anyway.

Also wanted to clarify that question 1 was not really related to precision, just questioning the type choice.

Using a more precise internal representation will likely just carry more (false) precision, without increasing stability.

BFI also has other limitations – for instance, handling multi-entry loops. When that happens, block counts cannot be fully reconstructed from branch weights. If a callsite sits in such a block, you may end up with a zero entry count while there are still remaining live callsites.
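A contrived example of such a multi-entry (irreducible) cycle: per-branch probabilities alone cannot recover the absolute counts of the blocks inside it, because the cycle is entered at two different points:

```cpp
#include <cstdio>
#include <cstdlib>

// The cycle {a, b} can be entered either at 'a' (falling through the if)
// or at 'b' (via the goto), so it has no single header. Branch weights on
// the individual branches underdetermine how often 'a' and 'b' ran.
int run(int N) {
  int X = 0;
  if (N & 1)
    goto b; // second entry into the cycle
a:
  ++X;
b:
  ++X;
  if (X < N)
    goto a; // backedge closing the cycle
  return X;
}

int main(int Argc, char **Argv) {
  std::printf("%d\n", run(Argc > 1 ? std::atoi(Argv[1]) : 10));
}
```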

Regarding the type choice – using floating point can increase compile time, and it may also introduce differences depending on the host platform (branch weights are part of the IR).
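One classic illustration of the host-dependence concern: floating-point addition is not associative, so any host-dependent evaluation order (or excess precision, FMA contraction, libm differences) can change derived values bit-for-bit, while integer arithmetic on weights is exact:

```cpp
#include <cstdio>

int main() {
  // The same three "weights" summed in two orders give different results,
  // so a float-valued branch weight baked into the IR could depend on the
  // host that produced it.
  double A = 1e16, B = -1e16, C = 1.0;
  std::printf("(A + B) + C = %.17g\n", (A + B) + C); // prints 1
  std::printf("A + (B + C) = %.17g\n", A + (B + C)); // prints 0

  // Integer weights have no such issue: (a + b) + c == a + (b + c) exactly.
}
```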

Ah, alright, thanks. I think I understand now.

Also, is the false-precision comment related to BFI only? That is, is making BPI as close as possible to the actual profile data still a valid goal, or not?

Yes – making BPI and the reconstructed block counts as close as possible to the actual profile data is a valid goal (to maintain the ‘shape’ of the profile).
