In LLVM IR, a loop’s latch block can have branch weight metadata that encodes an estimated trip count that is derived from application profile data. Initially, the loop body’s block frequencies agree with the estimated trip count, as expected. However, sometimes loop transformations adjust those branch weights in a way that correctly maintains the estimated trip count but that corrupts the block frequencies.
This RFC seeks the LLVM community’s feedback on our efforts to improve loop transformations to maintain more accurate block frequencies. Our assumption is that more accurate block frequencies would benefit any optimization that makes decisions based on them. Moreover, they would improve PGO-based tools we are developing to report estimates of, for example, FLOP counts.
Corrupt Block Frequencies
Example Loop
Consider the loop in this input LLVM IR:
declare void @f(i32)
define void @test(i32 %n) {
entry:
br label %do.body
do.body:
%i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
%inc = add i32 %i, 1
call void @f(i32 %i)
%c = icmp sge i32 %inc, %n
br i1 %c, label %do.end, label %do.body, !prof !0
do.end:
ret void
}
!0 = !{!"branch_weights", i32 1, i32 9}
Given the above branch weights, once any loop iteration is actually reached, the probability of the loop exiting at the iteration’s end is 1/(1+9). That is, the loop is likely to exit every 10 iterations and thus has an estimated trip count of 10.
opt -passes='print<block-freq>'
shows that 10 is indeed the frequency of the loop body:
Printing analysis results of BFI for function 'test':
block-frequency-info: test
- entry: float = 1.0, int = 1801439852625920
- do.body: float = 10.0, int = 18014398509481984
- do.end: float = 1.0, int = 1801439852625920
To simplify this discussion, we normalize all our frequency calculations to one encounter of the original loop.
Key Observation
The frequency of any particular loop iteration is less than for the previous iteration because the previous iteration has a non-zero probability of exiting the loop. This observation holds even though every loop iteration, once actually reached, has exactly the same probability of exiting and thus exactly the same branch weights.
LoopPeel
Consider what happens to the original loop above when we apply opt -unroll-force-peel-count=2 -passes=loop-unroll
to peel 2 iterations and insert them before the remaining loop. Running opt -passes='print<block-freq>'
reveals a problem:
Printing analysis results of BFI for function 'test':
block-frequency-info: test
- entry: float = 1.0, int = 2814749767070151
...
- do.body.peel: float = 1.0, int = 2814749767070151
...
- do.body.peel2: float = 0.9, int = 2533274790100992
...
- do.body: float = 6.4, int = 18014398509481984
...
- do.end: float = 1.0, int = 2814749767070151
The block frequency becomes 1.0 for the first peeled iteration, 0.9 for the second, and 6.4 for the remaining loop body. The sum of the original loop body’s frequencies is then 8.3 instead of the original 10. The new branch weights suggest why:
!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 8}
!2 = !{!"branch_weights", i32 1, i32 7}
The exit probability is now 1/10 for the first peeled iteration, 1/9 for the second, and 1/8 for the remaining loop iterations. It seems LoopPeel is trying to ensure a decreasing block frequency across iterations. However, as in the key observation above, that happens correctly in the original loop without decreasing the branch weights across iterations.
LoopUnroll
Starting again with the original loop above, consider what happens when we apply opt -unroll-count=4 -passes=loop-unroll
. The resulting unrolled loop contains the original loop body 4 times. At the end of each is a conditional branch that either continues to the next iteration or exits the loop. In the case of the fourth, the branch is the unrolled loop’s latch. Again, running opt -passes='print<block-freq>'
reveals a problem:
Printing analysis results of BFI for function 'test':
block-frequency-info: test
- entry: float = 1.0, int = 11448150254814232
- do.body: float = 1.5736, int = 18014398509481984
- do.body.1: float = 1.4162, int = 16212958656856064
- do.body.2: float = 1.2746, int = 14591662789660508
- do.body.3: float = 1.1471, int = 13132496509335502
- do.end: float = 1.0, int = 11448150254814232
The sum of the original loop body’s frequencies is 5.41 instead of the original 10. Again, the new branch weights suggest why:
!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 1}
The exit probability for each of the first three unrolled iterations is 1/10, as in the original loop. But LoopUnroll has changed it to 1/2 for the fourth in order to indicate an estimated trip count of 2.
(Aside: The estimated trip count for the unrolled loop should not be 10//4=2, as set by the current implementation. A third iteration of the unrolled loop is required to execute the remainder, 10%4=2. Even so, fixing that is not sufficient to produce correct block frequencies.)
How do we fix block frequencies?
Let p
be the probability that the original loop will continue from one iteration to the next. There is only one set of branch weights for the original loop’s latch, so p
is uniform across all its iterations. Like any probability, 0 <= p <= 1
. In our original loop:
p = 9/(1+9) = 0.9
We treat the loop as having a theoretically infinite set of possible iterations. The frequency of any such loop iteration is just the probability that iteration will be reached (when normalizing our calculations to one encounter of the original loop, as mentioned earlier). To compute the total loop body frequency, we can sum the reaching probabilities of all the loop’s possible iterations:
bodyFreq = p^0 + p^1 + p^2 + ... + p^inf = 1/(1-p) = 1/(1-0.9) = 10
Notice how this formula captures our key observation from earlier: every iteration’s frequency has another factor of p
due to the previous iteration.
If we peel 2 iterations, the above series should still give the total frequency of all occurrences of the original loop body. The only change is that p^0
and p^1
are now the frequencies of the peeled iterations, and so the infinite series starting at p^2
gives the frequencies of the remaining loop’s iterations.
If we instead unroll by 4, the above series should still work. The only change from the original loop is that each iteration of the unrolled loop contains a series of 4 terms.
In other words, in these cases, we should leave the original loop’s branch weights in place uniformly across all occurrences of the original loop body, just as they were in the original loop. That would ensure that p
and the frequency of each of the original loop’s iterations are the same as they were in the original loop. The total frequency of the original loop body, summed across all its occurrences after the transformation, would thus remain 10.
Proposal
We propose that, when possible, loop transformations maintain branch weights consistent with the original loop for the sake of preserving the total frequency of the original loop body.
Corrupt Estimated Trip Counts
LLVM computes a loop’s estimated trip count as the loop body frequency derived from the loop’s latch branch weights considered in isolation. Thus, the above bodyFreq
formula gives the estimated trip count, 10, for the original loop.
If LoopPeel or LoopUnroll were to keep the original loop’s branch weights in place for its latch as we have proposed, then bodyFreq
computed for that latch in isolation would maintain the same estimated trip count, 10. However, peeling 2 iterations or unrolling by 4 should instead reduce the estimated trip count to 8 or 3, respectively. It seems that correcting the block frequencies has corrupted the estimated trip counts.
In other words, branch weights currently have conflicting roles. To maintain accurate block frequencies, branch weights must be set while considering the context of those branch weights, as revealed by our key observation. In contrast, estimated trip counts are computed from a loop latch’s branch weights in isolation. An effect of loop transformations like LoopPeel and LoopUnroll is essentially to extract parts of the loop from the latch into context, thus exposing this conflict.
So how can we accurately maintain both block frequencies and estimated trip counts?
Hack the context?
One possibility is to go ahead and set the branch weight’s on a transformed loop’s latch to produce its estimated trip count, but then we would need to hack other branch weights in its context to compensate so that block frequencies remain accurate.
Consider the case of peeling 2 iterations from our original loop. The remaining loop must have an estimated trip count of 8. If we were to set its latch’s branch weights to produce that estimated trip count, then we would have to set the peeled iterations’ branch weights so that their p
is maximized: p=1
. Any smaller p
will produce less than the desired total frequency of 10.
Consider the case of unrolling our original loop by 4, but this time let’s assume LoopUnroll produces an epilogue remainder loop. The unrolled loop and the remainder loop each must have an estimated trip count of 2. If we were to set their latches’ branch weights to produce those estimated trip counts, then we would have to set branch weights for the guards of the unrolled loop and remainder loop so that their p=1
. Again, any smaller p
will produce less than the desired total frequency of 10.
Setting such branch weights to encode p=1
seems wrong. Profile data has not necessarily provided any evidence that those branches are unconditional. Given the limits of the data that we have, it seems more realistic to distribute the frequency across iterations as in the original loop.
Proposal
We propose that a loop transformation store the new estimated trip count as separate loop metadata named llvm.loop.estimated_trip_count
. We also propose to extend llvm::getLoopEstimatedTripCount
to return it, if present, instead of computing the estimated trip count from the latch’s branch weights.
The Path Forward
In summary, we propose that loop transformations store estimated trip counts in separate metadata so that they are free to set branch weights as needed to maintain accurate block frequencies.
We have created a draft PR to implement our proposal for LoopPeel:
We are also developing a series of PRs for LoopUnroll. We have identified LoopUnroll cases (e.g., runtime unrolling) that are not as simple as the main LoopUnroll example in this RFC. They require computing new branch weights, but they do not violate this RFC’s proposal. To avoid making this RFC even longer, we will explain those cases in their associated PRs.
We also plan to continue to look for additional LLVM transformations that could benefit from similar improvements.
Before we proceed much further, we would appreciate the LLVM community’s feedback:
- Do you agree the issue we have pointed out is worth addressing?
- Does our proposal so far sound reasonable for upstream LLVM?
- We have investigated only LoopPeel and LoopUnroll. What other LLVM transformations could benefit from similar improvements?