[llvm-commits] Branch Probability

I just found a small bug. Fixed version attached.

<kuba_bp3.patch>

Committed as r132613 (and r132616). Thanks Jakub!

To help reviewers understand the patch and where we're headed with it, I've prepared the following design documentation. For now I'm sending it to llvm-dev until we have an official design doc for the profiling work. Please note that Jakub's first patch only pertains to the "Branch Probability" section below. More patches will follow.

One of the goals of the branch profile framework is to keep the final output of compilation independent of floating point imprecision. In other words, we want LLVM to be a deterministic cross-compiler. A sure way to do this is to represent the original profile in fixed point and use only fixed point arithmetic to compute derived forms of the profile. Representing branch profile and block frequency as fixed point values, however, leads to confusing terminology. This document defines the important terms, which should be understood before reviewing the LLVM profiling framework.

** Branch Profile

Profile information derived from the compiler front end or an external source. At the IR level, this conceptually represents probabilities on CFG edges such that a block's successor probabilities sum to 1.0.

Representation: 32-bit unsigned int stored in IR meta-data.

A single integer per profiled conditional branch would be sufficient. But for more convenient incremental updates, and for consistency with Branch Probability Analysis, we may want an integer per branch target, such that

  profiledEdgeProbability(src, dest) ==
    profileWeight(src, dest)
    / (sum profileWeight(src, succ) for succ in src.successors())

Of course, that takes care of switches too.

Note that profiled edge weights are *not* raw sample counts provided by a profiler. The weights are adjusted to express branch probability when compared with the other weights in a predecessor block's successor list. The same probability can be expressed multiple ways by scaling the values of all successor edges. Successor edge weights need not sum to any particular value. Edge weights have no meaning when compared across different branches.
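
To make the scaling property concrete, here is a tiny standalone illustration (not proposed LLVM code) that turns a successor weight list into a probability expressed as a numerator/denominator pair. The weight lists {1, 3} and {25, 75} produce the same probabilities.

#include <cstdint>
#include <utility>
#include <vector>

// Probability of taking successor `idx`, as a fixed-point ratio over the sum
// of the block's successor weights. {1, 3} and {25, 75} both give 1/4 and 3/4.
static std::pair<uint64_t, uint64_t>
profiledEdgeProbability(const std::vector<uint32_t> &weights, unsigned idx) {
  uint64_t sum = 0;
  for (uint32_t w : weights)
    sum += w;                 // 64-bit sum avoids overflowing 32-bit weights
  return {weights[idx], sum};
}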

The details of this representation aren't important because it is not the primary interface used by the optimizer. In fact, we could easily support new meta-data formats without changing anything except the branch probability analysis pass.

Note: Runtime profiling implementers may choose to embed additional information, such as profile counts, in meta-data for the purpose of validating the profiler. But such data is opaque to the optimizer and code generator, so does not need to be addressed by the framework design.

** Branch Probability

The result of Branch Probability Analysis. This is the mid-level IR pass that interprets meta-data and performs static prediction when needed.

Representation: Map of 32-bit unsigned int "edge weight" per CFG edge

Conceptually this is similar to Branch Profile:

  edgeProbability(src, dest) ==
    edgeWeight(src, dest)
    / (sum edgeWeight(src, succ) for succ in src.successors())

As with the branch profile, these edge weights express branch probability as the ratio between a successor edge and the sum of all successors originating from the same branch. They do not need to sum to a fixed value, so all successor edge weights can be scaled without changing the probability. Edge weights associated with different branches cannot be meaningfully compared.

It's easy and inexpensive to recompute branch probability when needed. But it's worth noting that it's also not hard to update incrementally. Removing a CFG edge requires no update, unless we want to clean up stale entries in the map. Adding an edge requires adding a map entry, which may require scaling down the weights of the sibling successors. To avoid integer overflow, we ensure that the sum of all successor edge weights from a single branch is less than UINT_MAX.
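
As a sketch of the incremental case (illustrative only, with made-up helper names): before appending a new successor weight, uniformly halve the existing sibling weights until the per-branch sum stays below UINT_MAX. Because probabilities are ratios, uniform scaling preserves them up to rounding.

#include <algorithm>
#include <climits>
#include <cstdint>
#include <vector>

// Toy addSuccessorWeight: keep the sum of a branch's successor weights below
// UINT_MAX by halving all weights (never below 1) before adding the new one.
static void addSuccessorWeight(std::vector<uint32_t> &weights, uint32_t newWeight) {
  auto sum = [&] {
    uint64_t s = newWeight;
    for (uint32_t w : weights)
      s += w;
    return s;
  };
  while (sum() >= UINT_MAX) {   // assumes far fewer than UINT_MAX successors
    newWeight = std::max(newWeight / 2, 1u);
    for (uint32_t &w : weights)
      w = std::max(w / 2, 1u);
  }
  weights.push_back(newWeight);
}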

This is redundant with branch profile meta-data when the meta-data exists. If that is a concern, we have the option of omitting the branch probability for branches with meta-data and doing on-the-fly lookup.

Complications arise when clients want to make use of branch probability. Since we don't want to expose floating point values (except for generating diagnostic output), and we don't want to expose edge weights (because they have no independent meaning), we need a richer API.

Example:

BasicBlock *getHotSucc(BasicBlock *BB);
bool isEdgeHot(BasicBlock *pred, BasicBlock *succ);

The API will be implemented entirely using fixed point arithmetic.

Example:

class BranchProbability {
  friend class BranchProbabilityInfo;

  unsigned numerator;
  unsigned denominator;

  BranchProbability(...); // no public ctor
  ...
};

class BranchProbabilityInfo {
  static BranchProbability HotProb;
  static void initProb();
  ...
};

void BranchProbabilityInfo::initProb() {
  HotProb = BranchProbability(4, 5);
}

bool isEdgeHot(src, dest) {
  return (uint64_t(HotProb.denominator) * edgeWeight(src, dest))
       > (uint64_t(HotProb.numerator)
          * (sum edgeWeight(src, succ) for succ in src.successors()));
}
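
For the other entry point, getHotSucc could plausibly be layered on the same comparison. The following is only my guess at the shape, in the same pseudocode style (edgeWeight and successors stand in for whatever accessors the analysis ends up providing):

BasicBlock *BranchProbabilityInfo::getHotSucc(BasicBlock *BB) {
  // sum of all successor edge weights for this branch
  sum = (sum edgeWeight(BB, succ) for succ in BB.successors())
  for (succ in BB.successors())
    if (uint64_t(HotProb.denominator) * edgeWeight(BB, succ)
        > uint64_t(HotProb.numerator) * sum)
      return succ;      // this successor alone exceeds HotProb of the total
  return NULL;          // no single successor is hot
}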

We could make the API richer as follows:

bool edgeProbExceeds(BasicBlock *pred, BasicBlock *succ, BranchProb prob);

Where BranchProb is an opaque type with various predefined levels of "hotness" (e.g. Unexpected, Unlikely, Likely, Expected...).
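
One possible shape for that opaque type, sketched under the assumption that each level is just a predefined ratio and that the comparison reuses the isEdgeHot cross-multiplication (the particular ratios below are placeholders, not a proposal):

class BranchProb {
  unsigned numerator;
  unsigned denominator;
  BranchProb(unsigned n, unsigned d) : numerator(n), denominator(d) {}
  friend class BranchProbabilityInfo;  // only the analysis sees the ratio
public:
  static BranchProb Unexpected() { return BranchProb(1, 20); }
  static BranchProb Unlikely()   { return BranchProb(1, 5); }
  static BranchProb Likely()     { return BranchProb(4, 5); }
  static BranchProb Expected()   { return BranchProb(19, 20); }
};

bool BranchProbabilityInfo::edgeProbExceeds(pred, succ, prob) {
  return (uint64_t(prob.denominator) * edgeWeight(pred, succ))
       > (uint64_t(prob.numerator)
          * (sum edgeWeight(pred, s) for s in pred.successors()));
}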

A low-level interface that exposes edge weights will be used only by passes that need to incrementally update the result or perform computations using branch probability.

** Block Frequency

The result produced by BlockFrequencyAnalysis for mid-level IR and by MBBFrequencyAnalysis for MachineInstrs, with a mostly shared implementation across IR and MI.

Conceptually, this is a floating point value that satisfies the closed form:

frequency(block) == (sum edgeFrequency(pred, block) for pred in block.preds())

Where "edge frequency" is defined as:
  edgeFrequency(src, dest) == frequency(src) * edgeProbability(src, dest)
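
A tiny worked example, using a made-up diamond CFG where the entry block branches to blocks a and b with probabilities 0.75 and 0.25, and both fall through to exit:

  frequency(entry) == 1.0
  frequency(a) == edgeFrequency(entry, a) == 1.0 * 0.75 == 0.75
  frequency(b) == edgeFrequency(entry, b) == 1.0 * 0.25 == 0.25
  frequency(exit) == edgeFrequency(a, exit) + edgeFrequency(b, exit)
                  == 0.75 + 0.25 == 1.0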

Representation: 32-bit unsigned int "scaled frequency" per block

For the purpose of diagnostics, the entry block always has a frequency of 1.0. Consequently, we define scaled frequency as a fixed-point value with the lower bits corresponding to the fractional component of frequency. For example, if we want 8 bits of fractional frequency, a frequency of 1.0 has a scaled frequency of 2^8.

To avoid floating-point imprecision, multiplication by branch probability will be implemented as fixed point arithmetic directly using the edge weights provided by BranchProbabilityInfo.

Example:

edgeFrequency(src, dest) ==
  uint64_t(frequency(src)) * edgeWeight(src, dest)
  / (sum edgeWeight(src, succ) for succ in src.successors())

...with some logic for saturating on overflow and underflow.
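
A minimal standalone sketch of what that fixed-point scaling and saturation could look like (the function names and the exact saturation policy are my assumptions):

#include <cstdint>
#include <limits>

static const uint32_t MaxScaledFreq = std::numeric_limits<uint32_t>::max();

// Scale a block's frequency by weight/sumOfWeights using 64-bit arithmetic.
// Saturate to 1 so a nonzero frequency never rounds all the way down to zero.
static uint32_t scaleFrequency(uint32_t freq, uint32_t weight, uint64_t sumOfWeights) {
  uint64_t scaled = uint64_t(freq) * weight / sumOfWeights;
  if (scaled == 0 && freq != 0 && weight != 0)
    return 1;               // saturate on underflow
  return uint32_t(scaled);  // <= freq when weight <= sumOfWeights, so it fits
}

// Sum incoming edge frequencies for a block, saturating on overflow.
static uint32_t addFrequency(uint32_t a, uint32_t b) {
  uint64_t sum = uint64_t(a) + b;
  return sum > MaxScaledFreq ? MaxScaledFreq : uint32_t(sum);
}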

As with BranchProbabilityInfo, BlockFrequencyAnalysis would have a rich API for use by optimization passes to interpret the scaled frequencies as meaningful values. Typically, a pass will compare the frequency of two blocks to determine their relative hotness.

Scaled frequencies can be wrapped in a simple class that supports only comparison and addition to ensure that clients don't misuse them. Subtraction is not supported.

class ScaledFreq {
public:
  static ScaledFreq unity();
  friend bool operator<(ScaledFreq, ScaledFreq); // and friends
  friend ScaledFreq operator+(ScaledFreq, ScaledFreq);
  ScaledFreq scaled(unsigned num, unsigned denom) const;
};
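
A hypothetical client use, just to show the intent of the restricted interface (edgeFreq is an assumed accessor, and scaled() is assumed to return the receiver multiplied by num/denom):

// Sum the frequencies flowing into a join block, then ask whether the result
// is at least half the entry frequency. Only +, <, and scaling are exposed.
ScaledFreq JoinFreq = edgeFreq(Left, Join) + edgeFreq(Right, Join);
if (!(JoinFreq < ScaledFreq::unity().scaled(1, 2))) {
  // the join block is reasonably hot relative to the function entry
}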

A high-level API for comparing frequencies may look something like this:

// return true if the probability of B1 relative to B2 is at least Prob.
// e.g. compareFreq(B1, B2, Likely) is true if B1 is likely executed
// relative to B2.
bool compareFreq(B1, B2, Prob) {
  return (uint64_t(B1) * (Prob.denominator - Prob.numerator))
      >= (uint64_t(B2) * Prob.numerator);
}
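
And a purely illustrative caller (the block, value, and helper names are invented):

// Only sink the computation into the loop body if the loop header is not
// executed much more often than the preheader.
if (!compareFreq(LoopHeader, Preheader, Likely))
  sinkIntoLoop(I);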

Hi Andrew,

> Representation: 32-bit unsigned int stored in IR meta-data.

why store it in the IR, and not just have it be an analysis that you can query
for branch probabilities? If you store it in the IR then it may well get out
of date (e.g. when an optimizer realizes that some branch is dead and deletes
it, resulting in probabilities that don't add up to 1 on the other branches).
This happens all the time in GCC and is kind of annoying.

Ciao, Duncan.

> Hi Andrew,
>
>> Representation: 32-bit unsigned int stored in IR meta-data.
>
> why store it in the IR, and not just have it be an analysis that you can query
> for branch probabilities? If you store it in the IR then it may well get out
> of date (e.g. when an optimizer realizes that some branch is dead and deletes
> it, resulting in probabilities that don't add up to 1 on the other branches).
> This happens all the time in GCC and is kind of annoying.

We need to store it in the IR to support __builtin_expect and dynamic profiling.

Evan

Sorry, my response may not be timely if I'm not copied on the message.

At the IR level, the only things in meta-data are those that cannot be recomputed because they originate from an external source, such as builtin expect.

Nonetheless, probabilities will always sum to one because they are always computed on the fly from the sum of branch weights. So removing a branch target automatically increases the probability of the remaining targets.
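
To make that concrete with made-up weights, suppose a switch has successor weights {4, 4, 2}:

  edgeProbability(src, b1) == 4 / (4 + 4 + 2) == 0.4
  edgeProbability(src, b3) == 2 / (4 + 4 + 2) == 0.2

After the edge to b3 is deleted, the same on-the-fly computation gives:

  edgeProbability(src, b1) == 4 / (4 + 4) == 0.5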

Andy