[RFC] Fix Loop Transformations to Preserve Block Frequencies

In LLVM IR, a loop’s latch block can have branch weight metadata that encodes an estimated trip count that is derived from application profile data. Initially, the loop body’s block frequencies agree with the estimated trip count, as expected. However, sometimes loop transformations adjust those branch weights in a way that correctly maintains the estimated trip count but that corrupts the block frequencies.

This RFC seeks the LLVM community’s feedback on our efforts to improve loop transformations to maintain more accurate block frequencies. Our assumption is that more accurate block frequencies would benefit any optimization that makes decisions based on them. Moreover, they would improve PGO-based tools we are developing to report estimates of, for example, FLOP counts.

Corrupt Block Frequencies

Example Loop

Consider the loop in this input LLVM IR:

declare void @f(i32)

define void @test(i32 %n) {
entry:
  br label %do.body

do.body:
  %i = phi i32 [ 0, %entry ], [ %inc, %do.body ]
  %inc = add i32 %i, 1
  call void @f(i32 %i)
  %c = icmp sge i32 %inc, %n
  br i1 %c, label %do.end, label %do.body, !prof !0

do.end:
  ret void
}

!0 = !{!"branch_weights", i32 1, i32 9}

Given the above branch weights, once any loop iteration is actually reached, the probability of the loop exiting at the iteration’s end is 1/(1+9). That is, the loop is likely to exit every 10 iterations and thus has an estimated trip count of 10.
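For concreteness, this arithmetic can be sketched in a few lines of Python (illustrative only; the variable names are ours, not LLVM's):

```python
# Deriving the exit probability and estimated trip count from the
# latch's !prof branch weights (exit weight 1, backedge weight 9).
exit_weight, backedge_weight = 1, 9

# Probability of exiting at the end of any reached iteration.
p_exit = exit_weight / (exit_weight + backedge_weight)  # 1/10

# The loop is expected to run until the exit edge is taken once,
# giving an estimated trip count of 1/p_exit.
estimated_trip_count = 1 / p_exit  # 10.0
```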

opt -passes='print<block-freq>' shows that 10 is indeed the frequency of the loop body:

Printing analysis results of BFI for function 'test':
block-frequency-info: test
 - entry: float = 1.0, int = 1801439852625920
 - do.body: float = 10.0, int = 18014398509481984
 - do.end: float = 1.0, int = 1801439852625920

To simplify this discussion, we normalize all our frequency calculations to one encounter of the original loop.

Key Observation

The frequency of any particular loop iteration is less than for the previous iteration because the previous iteration has a non-zero probability of exiting the loop. This observation holds even though every loop iteration, once actually reached, has exactly the same probability of exiting and thus exactly the same branch weights.

LoopPeel

Consider what happens to the original loop above when we apply opt -unroll-force-peel-count=2 -passes=loop-unroll to peel 2 iterations and insert them before the remaining loop. Running opt -passes='print<block-freq>' reveals a problem:

Printing analysis results of BFI for function 'test':
block-frequency-info: test
 - entry: float = 1.0, int = 2814749767070151
   ...
 - do.body.peel: float = 1.0, int = 2814749767070151
   ...
 - do.body.peel2: float = 0.9, int = 2533274790100992
   ...
 - do.body: float = 6.4, int = 18014398509481984
   ...
 - do.end: float = 1.0, int = 2814749767070151

The block frequency becomes 1.0 for the first peeled iteration, 0.9 for the second, and 6.4 for the remaining loop body. The sum of the original loop body’s frequencies is then 8.3 instead of the original 10. The new branch weights suggest why:

!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 8}
!2 = !{!"branch_weights", i32 1, i32 7}

The exit probability is now 1/10 for the first peeled iteration, 1/9 for the second, and 1/8 for the remaining loop iterations. It seems LoopPeel is trying to ensure a decreasing block frequency across iterations. However, as in the key observation above, that happens correctly in the original loop without decreasing the branch weights across iterations.
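The 8.3 total can be reproduced by hand from those new weights. As a sanity check (an illustrative Python sketch, not anything LLVM runs):

```python
# Reproduce BFI's post-peel frequencies from the weights LoopPeel emits
# (!0 = 1:9, !1 = 1:8, !2 = 1:7), normalized to one loop encounter.
def p_continue(exit_w, backedge_w):
    """Probability of taking the backedge rather than exiting."""
    return backedge_w / (exit_w + backedge_w)

f_peel1 = 1.0                              # first peeled iteration is always reached
f_peel2 = f_peel1 * p_continue(1, 9)       # 0.9
f_loop_entry = f_peel2 * p_continue(1, 8)  # 0.8
p = p_continue(1, 7)                       # 7/8 within the remaining loop
f_loop_body = f_loop_entry / (1 - p)       # 0.8 * 8 = 6.4

total = f_peel1 + f_peel2 + f_loop_body    # 8.3, not the original 10
```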

LoopUnroll

Starting again with the original loop above, consider what happens when we apply opt -unroll-count=4 -passes=loop-unroll. The resulting unrolled loop contains the original loop body 4 times. At the end of each is a conditional branch that either continues to the next iteration or exits the loop. In the case of the fourth, the branch is the unrolled loop’s latch. Again, running opt -passes='print<block-freq>' reveals a problem:

Printing analysis results of BFI for function 'test':
block-frequency-info: test
 - entry: float = 1.0, int = 11448150254814232
 - do.body: float = 1.5736, int = 18014398509481984
 - do.body.1: float = 1.4162, int = 16212958656856064
 - do.body.2: float = 1.2746, int = 14591662789660508
 - do.body.3: float = 1.1471, int = 13132496509335502
 - do.end: float = 1.0, int = 11448150254814232

The sum of the original loop body’s frequencies is 5.41 instead of the original 10. Again, the new branch weights suggest why:

!0 = !{!"branch_weights", i32 1, i32 9}
!1 = !{!"branch_weights", i32 1, i32 1}

The exit probability for each of the first three unrolled iterations is 1/10, as in the original loop. But LoopUnroll has changed it to 1/2 for the fourth in order to indicate an estimated trip count of 2.

(Aside: The estimated trip count for the unrolled loop should not be 10//4=2, as set by the current implementation. A third iteration of the unrolled loop is required to execute the remainder, 10%4=2. Even so, fixing that is not sufficient to produce correct block frequencies.)
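The aside's arithmetic, as a small sketch (illustrative Python; ceiling rather than floor division gives the unrolled loop's iteration count):

```python
# After unrolling by 4, a loop with estimated trip count 10 needs
# ceil(10/4) = 3 unrolled iterations, not 10//4 = 2; the third
# iteration executes the remainder, 10 % 4 = 2.
orig_tc, unroll = 10, 4

floor_tc = orig_tc // unroll     # 2 (what the current implementation sets)
ceil_tc = -(-orig_tc // unroll)  # 3
remainder = orig_tc % unroll     # 2
```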

How do we fix block frequencies?

Let p be the probability that the original loop will continue from one iteration to the next. There is only one set of branch weights for the original loop’s latch, so p is uniform across all its iterations. Like any probability, 0 <= p <= 1. In our original loop:

p = 9/(1+9) = 0.9

We treat the loop as having a theoretically infinite set of possible iterations. The frequency of any such loop iteration is just the probability that iteration will be reached (when normalizing our calculations to one encounter of the original loop, as mentioned earlier). To compute the total loop body frequency, we can sum the reaching probabilities of all the loop’s possible iterations:

bodyFreq = p^0 + p^1 + p^2 + ... + p^inf = 1/(1-p) = 1/(1-0.9) = 10

Notice how this formula captures our key observation from earlier: every iteration’s frequency has another factor of p due to the previous iteration.
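The series is easy to check numerically. The following Python sketch (illustrative only) truncates the infinite sum at a large bound:

```python
# Numerically check bodyFreq = p^0 + p^1 + p^2 + ... = 1/(1-p)
# for p = 0.9, truncating the infinite series at a large bound.
p = 0.9
body_freq = sum(p**i for i in range(10_000))  # ~ 1/(1-p) = 10
```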

If we peel 2 iterations, the above series should still give the total frequency of all occurrences of the original loop body. The only change is that p^0 and p^1 are now the frequencies of the peeled iterations, and so the infinite series starting at p^2 gives the frequencies of the remaining loop’s iterations.

If we instead unroll by 4, the above series should still work. The only change from the original loop is that each iteration of the unrolled loop contains a series of 4 terms.

In other words, in these cases, we should leave the original loop’s branch weights in place uniformly across all occurrences of the original loop body, just as they were in the original loop. That would ensure that p and the frequency of each of the original loop’s iterations are the same as they were in the original loop. The total frequency of the original loop body, summed across all its occurrences after the transformation, would thus remain 10.
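A small numeric check (illustrative Python) that peeling and unrolling merely regroup the same series:

```python
# Peel 2:   p^0 + p^1  +  (p^2 + p^3 + ...)
# Unroll 4: the same terms, grouped four at a time per unrolled iteration.
p = 0.9
N = 10_000  # truncation bound for the "infinite" series

total = sum(p**i for i in range(N))

# Peeling: the first two terms move out of the loop; the tail remains.
peeled = p**0 + p**1
remaining = sum(p**i for i in range(2, N))

# Unrolling: group terms four at a time; the grouping changes nothing.
unrolled = sum(sum(p**(4 * j + k) for k in range(4)) for j in range(N // 4))
```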

Proposal

We propose that, when possible, loop transformations maintain branch weights consistent with the original loop for the sake of preserving the total frequency of the original loop body.

Corrupt Estimated Trip Counts

LLVM computes a loop’s estimated trip count as the loop body frequency derived from the loop’s latch branch weights considered in isolation. Thus, the above bodyFreq formula gives the estimated trip count, 10, for the original loop.

If LoopPeel or LoopUnroll were to keep the original loop’s branch weights in place for its latch as we have proposed, then bodyFreq computed for that latch in isolation would maintain the same estimated trip count, 10. However, peeling 2 iterations or unrolling by 4 should instead reduce the estimated trip count to 8 or 3, respectively. It seems that correcting the block frequencies has corrupted the estimated trip counts.

In other words, branch weights currently have conflicting roles. To maintain accurate block frequencies, branch weights must be set while considering the context of those branch weights, as revealed by our key observation. In contrast, estimated trip counts are computed from a loop latch’s branch weights in isolation. An effect of loop transformations like LoopPeel and LoopUnroll is essentially to extract parts of the loop from the latch into context, thus exposing this conflict.

So how can we accurately maintain both block frequencies and estimated trip counts?

Hack the context?

One possibility is to go ahead and set the branch weights on a transformed loop's latch to produce its estimated trip count, but then we would need to hack other branch weights in its context to compensate so that block frequencies remain accurate.

Consider the case of peeling 2 iterations from our original loop. The remaining loop must have an estimated trip count of 8. If we were to set its latch’s branch weights to produce that estimated trip count, then we would have to set the peeled iterations’ branch weights so that their p is maximized: p=1. Any smaller p will produce less than the desired total frequency of 10.

Consider the case of unrolling our original loop by 4, but this time let’s assume LoopUnroll produces an epilogue remainder loop. The unrolled loop and the remainder loop each must have an estimated trip count of 2. If we were to set their latches’ branch weights to produce those estimated trip counts, then we would have to set branch weights for the guards of the unrolled loop and remainder loop so that their p=1. Again, any smaller p will produce less than the desired total frequency of 10.

Setting such branch weights to encode p=1 seems wrong. Profile data has not necessarily provided any evidence that those branches are unconditional. Given the limits of the data that we have, it seems more realistic to distribute the frequency across iterations as in the original loop.
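The arithmetic behind the peeling case above can be sketched as follows (illustrative Python; `total_freq` is our own helper, not an LLVM API):

```python
# If the remaining loop's latch weights encode trip count 8 (continue
# probability 7/8), the remaining body contributes p_peel^2 * 8, where
# p_peel is the continue probability of the peeled iterations.  Only
# p_peel = 1 recovers the original total frequency of 10.
def total_freq(p_peel, remaining_tc):
    return 1 + p_peel + p_peel**2 * remaining_tc

full = total_freq(1.0, 8)      # 1 + 1 + 8 = 10.0
lossy = total_freq(0.9, 8)     # 1 + 0.9 + 6.48 = 8.38 < 10
```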

Proposal

We propose that a loop transformation store the new estimated trip count as separate loop metadata named llvm.loop.estimated_trip_count. We also propose to extend llvm::getLoopEstimatedTripCount to return it, if present, instead of computing the estimated trip count from the latch’s branch weights.

The Path Forward

In summary, we propose that loop transformations store estimated trip counts in separate metadata so that they are free to set branch weights as needed to maintain accurate block frequencies.

We have created a draft PR to implement our proposal for LoopPeel:

We are also developing a series of PRs for LoopUnroll. We have identified LoopUnroll cases (e.g., runtime unrolling) that are not as simple as the main LoopUnroll example in this RFC. They require computing new branch weights, but they do not violate this RFC’s proposal. To avoid making this RFC even longer, we will explain those cases in their associated PRs.

We also plan to continue to look for additional LLVM transformations that could benefit from similar improvements.

Before we proceed much further, we would appreciate the LLVM community’s feedback:

  • Do you agree the issue we have pointed out is worth addressing?
  • Does our proposal so far sound reasonable for upstream LLVM?
  • We have investigated only LoopPeel and LoopUnroll. What other LLVM transformations could benefit from similar improvements?

IIUC, the issue is the difference between the loop having 10 times 10 iterations (i.e., average: 10) vs 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 iterations (also average 10). LoopUnroll/Peel assumes the latter, assigning each latch execution an equal probability of exiting the loop, while in reality it is zero except for the last.

I don’t immediately see how estimated_trip_count solves that. We still cannot distinguish between the distributions with average 10, i.e., when exactly the loop exits. The example with 10 iterations and 2 peeled-off iterations could also be 9 executions with 1 iteration (so control leaves after the first peeled-off iteration) and 1 execution with 91 iterations (two of which are peeled-off BBs), leaving the average trip count of the remaining loop at 89.

Without recording the entire trip count distribution, I think the only reliable technique is to re-profile with the new CFG.

Hi Michael. Thanks for your feedback.

I agree that a single estimated trip count (which determines the original loop body’s block frequencies and latch branch weights) alone cannot distinguish between those scenarios. This RFC is not trying to address that issue.

Even before LoopUnroll and LoopPeel get involved, the original latch branch weights imply a distribution of frequencies across an infinite set of theoretically possible loop iterations. Yes, it is based on equal exit probabilities at all iterations. The average trip count from that distribution is 10 in the examples.

LoopUnroll and LoopPeel continue with that approach but mangle the block frequencies in the process. That’s what this RFC is trying to address.

We don’t necessarily have enough data to know that, in reality, the exit probability is zero at any particular iteration. Even if we were to maintain a record of every trip count observed during profiling, that doesn’t mean earlier iterations have zero exit probability.

I’m thinking of commits like e8d5db206c2f5c4afac8288ab96193a5638270ae, which expressed concern about deficient profiling data leading to underestimated trip counts and resulting performance issues with the register allocator. It seems safer to assume a distribution across all possible loop iterations that averages to the estimated trip count.

It’s not trying to. It’s trying to continue to record estimated trip counts without the unnecessary corruption of block frequencies.

I agree that would give more precise info. Before doing that, why must we corrupt the info we already had?

Honestly, this was what convinced me something was off here.

As I see it, we try to do two things with one number, and that works just fine at first:

  1. We use branch profiles for block frequency calculation. Follow the paths, multiply the probabilities, and accumulate your block frequency. Nice.
  2. We can read estimated loop trip counts from block frequencies (or branch profiles).

It is important to note that:
Block frequencies are, by design, a “path-oriented” metric that takes all paths from the entry to the block into account.
Estimated loop trip counts are “localized” metrics that only care about the loop (backedge/exit taken probabilities).
If the profiled CFG is unmodified, the difference doesn’t matter. However, once you modify the loop, you might start to interleave the path to the loop with the “loop path” itself. That’s where the issues start.

The example shows nicely how the current logic tries to repair the estimated loop trip count after unrolling/peeling. It mostly works, but it destroys the branch probabilities since the repair focuses on the remainder loop, ignoring the iterations that became part of the path. If we remove the “hacks” to “repair” the estimated loop trip count, we would naturally preserve the branch probabilities (yay!). This is already a strong argument for me.
Again, no extra logic is required to preserve the branch probabilities in the unroll/peel case, and all the complicated special handling destroys this fundamental profile property.

Now, keeping branch probabilities alive is a win, but we would pessimize the estimated loop trip count if we rip out the special handling. To get the best of both worlds, I’d suggest recording the estimated loop trip count eagerly, that is, at profile ingestion time. We read the profile data, for all loops we add the estimated trip count, and all transformations that modify loop trip counts modify those.

This is mostly in line with the draft PR, IIRC.

Thanks for commenting, Johannes.

Agreed for these simple cases. There are more complicated LoopUnroll cases (e.g., runtime unrolling) where new branch weights must be computed. I have solutions, but I suggest we review them in PRs and keep this RFC high level.

One distinction: in the PR now, the original loop does not store llvm.loop.estimated_trip_count. LoopPeel stores it because it is difficult to create latch branch weights that accurately maintain both the estimated trip count and the block frequencies. The PR makes llvm::getLoopEstimatedTripCount smart enough to return the right one in either case. I anticipate this approach will make it a little easier to migrate loop transformations incrementally, but I’m certainly open to alternatives.

(checking my understanding) - you also need to change BlockFrequencyAnalysis to be aware of llvm.loop.estimated_trip_count, correct?

Separately, I also have a preference for @jdoerfert 's suggestion to record llvm.loop.estimated_trip_count at the time of profile ingestion. The main reason is maintainability: we can assert that, if profiles are present, loop latches are expected to have this metadata; so if I naively write a new loop optimization and forget to recalculate it, it’s easy to find my pass as the culprit.

@jdenny-ornl correct me if I’m wrong but you don’t, right?
This decouples block frequencies and estimated loop trip counts. Branch profiles will initially inform both, but then only represent the former. The latter is encoded purely with the new metadata. All we are doing in computing frequencies from probabilities will stay as is, but we can focus on maintaining probabilities accurately, without keeping an eye on trip counts.

Thanks for commenting, Mircea.

I have not seen a need so far.

@jdoerfert and I chatted more about this offline. I agree it is a reasonable plan. I was thinking of implementing it later after migrating more loop transformations. However, I could implement it now if it’s a selling point for getting the patches approved.

I’m confused. In the loop peeling example, isn’t it that your branch_weights stay 9 : 1 for both loops, but the llvm.loop.estimated_trip_count values are 2 and 8, respectively (if the original unpeeled loop’s trip count was 10)? Insofar as BFI, the BFI for the block clone in the peeled loop should be 2, and the one in the remainder 8. Correct?

Which example do you mean? The main LoopPeel and LoopUnroll examples each produce only one loop. I did later mention a LoopUnroll case with an epilogue remainder loop. Is that what you meant?

I think we need godbolt links here: Compiler Explorer

I’ll try to run through the example.

The first source is the one above.
The second is the peeled version, but with the profile set to 9 : 1 (so keeping the original probability w/o any modification).

The BFI results for the current upstream (shown above as well) look like:

 - entry: float = 1.0, int = 2814749767070151
 - do.body.peel: float = 1.0, int = 2814749767070151
 - do.body.peel2: float = 0.9, int = 2533274790100992
 - do.body: float = 6.4, int = 18014398509481984
 - do.end: float = 1.0, int = 2814749767070151

The estimated loop trip count is “forced” to ~8. (In this case: 9 : 1 → 10 iterations, so subtract 2 and you get 7 : 1 → 8 iterations.) However, the compounded executions of the loop body (computed via BFI) sum up to 1 + 0.9 + 6.4 = 8.3, not 10. This is because the “adjustment” of the loop trip count loses real iterations as it ignores the path probability.

The BFI results for the modified version, which keeps the correct 9 : 1 ratio everywhere, look as follows:

 - entry: float = 1.0, int = 2223999818516971
 - do.body.peel: float = 1.0, int = 2223999818516971
 - do.body.peel2: float = 0.9, int = 2001599836458148
 - do.body: float = 8.1, int = 18014398509481984
 - do.end: float = 1.0, int = 2223999818516971

The number of compound executions of the loop body is 1 + 0.9 + 8.1 = 10. Also, the probability of exiting the loop has not changed. This makes sense. However, computing the trip count now would yield 10, which is obviously not great. Hence, the extra metadata to keep the trip count around without disturbing the probabilities (and BFI).
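These numbers also follow from the closed form: with p = 0.9 kept everywhere, the remaining loop is entered with probability p^2 and its body frequency per entry is 1/(1-p), so it contributes p^2/(1-p) = 8.1. A small Python check (illustrative only):

```python
# Verify the modified version's BFI numbers when the 9:1 ratio (p = 0.9)
# is kept everywhere after peeling 2 iterations.
p = 0.9
f_peel1 = 1.0              # first peeled iteration is always reached
f_peel2 = p                # 0.9
f_loop = p**2 / (1 - p)    # 8.1: entered with probability p^2, then 1/(1-p)

total = f_peel1 + f_peel2 + f_loop  # 10, the original loop body frequency
```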

We could have made this 1 + 1 + 8 = 10, which might be what @mtrofin was thinking. I explained here why I think that is the wrong approach.

OK, I see now where I made my reasoning mistake - I forgot that the BFI values are relative to the context. Thanks!

Thanks for the question. Happy to address any others you might have.

@nikic @fhahn @arsenm Thoughts on the new llvm.loop.estimated_trip_count metadata in order to preserve both BFI and trip count estimates at the same time?

I still have to go through this long discussion in detail, so this is mostly gut/early reaction.

So far this feels like just a bug in the loop peeling branch weight updating? Why would we need this extra metadata; what is it good for? Are we prepared to patch all relevant passes to keep it up to date (we seem to be struggling enough already with keeping the branch weights updated as we transform things)…

Thanks for the questions, Matthias.

No, it is a more general conflict between maintaining accurate block frequencies and accurate estimated trip counts when both are encoded in branch weights.

Loop transformations would store estimated trip counts in that metadata so that they are free to set branch weights as needed to maintain accurate block frequencies.

PR #128785 adjusts the existing llvm::setLoopEstimatedTripCount and llvm::getLoopEstimatedTripCount to set/get that metadata. If some passes are not already using that to set/get estimated trip counts, shouldn’t they be?

The proposed changes should not make it harder to keep branch weights updated. They should make it easier to get them right for the sake of block frequencies.

The core issue is that if you only use branch profiles, you have a choice: you can keep BFI up to date and correct, or you keep loop trip counts correct. I think my last posts explained this. One takes path into account and one doesn’t. This cannot fit together. We need a second thing to encode two things.

I do agree with Matthias that this feels more like an issue with branch weight updating in the pass, rather than a “conflict” between BFI and loop counts. Although adding extra metadata might solve the particular bug, I am not convinced it’s the right approach in the long run.

The core issue is that if you only use branch profiles, you have a choice: you can keep BFI up to date and correct, or you keep loop trip counts correct.

I don’t understand what that means. Given any branch weights (probabilities), one can compute in a unique way the corresponding block frequencies, and hence, loop trip counts. Thus, if the branch weights are set correctly, then there shouldn’t be a problem with loop counts and we don’t need extra metadata.
I think your examples above are nice, but they operate with BFI, which by nature is an approximate data structure. Thus, it is hard to see where the actual problem is and where there is noise due to approximation. I find it much easier to reason about the actual block and branch counts, and how they look before and after the passes. Can you come up with a toy example illustrating the problem using actual counts?

  1. Focus on weights + block frequencies (but not on trip count)

The thing that feels odd to me is why we argue so much about estimated loop trip counts. It’s an estimation you can derive from the branch weights. The actual information encoded is the branch weights, and I’d argue that by far the most important use of those branch weights in LLVM is computing block frequency information, which is used in various places: making inlining decisions, code placement (including regalloc spill/fill placement), and code layout are all based on the frequencies. However, I don’t see yet why the trip count (which is a derived and estimated value) is worth focusing on. If I am missing some important use case in LLVM, please mention it in the discussion. Inventing new metadata and IR constructs comes with maintenance costs, and so far I miss the motivation for why it is worth it here.

  2. Currently loop peeling is doing a bad job updating branch weights.

You are making a very good point here and clearly show that whatever loop peeling is doing today is not correct. And I think we all agree that we should fix it and maintain the property you hint at, where the frequency of the loop body before peeling is the same as the sum of the two peeled loop bodies and the remaining loop.

  3. I’m not sure I follow the “How do we fix block frequencies?” paragraph

I can follow your argument how peeling twice removes the p^0 and p^1 terms from your equation and “so the infinite series starting at p^2 gives the frequencies of the remaining loop’s iterations”.

I do NOT follow the next argument: “In other words, in these cases, we should leave the original loop’s branch weights in place uniformly across all occurrences of the original loop body, just as they were in the original loop.” Yes, the remaining frequency would be “p^2 + p^3 + …”, but a loop with weight p does not give you that, does it? As far as I can tell there must be some different value p1 for the new loop so that p1^0 + p1^1 + ... ~= p^2 + p^3 + p^4 + ...

Unfortunately I am lost in a lot of the remaining discussion about conflicting roles, context hacking, …

  4. Peeling twice does not mean subtracting two from the trip count

To me the most interesting insight from this discussion is that you cannot easily compute with the trip count value. Peeling twice does not mean the trip count is exactly two smaller. But I think you and others already point this out above. For the given example, assuming an average trip count of 10, most possible inputs would reduce the trip count to 8, but there are a few extreme inputs: calling the function 10 times with %n being [1, 1, 1, 1, 1, 1, 1, 1, 1, 91] gives you a trip count of 10 before peeling, but after peeling twice the resulting loop has a trip count of 89 for this particular input. With that I intuitively agree that the average trip count for the remaining loop is not just the result of subtracting two but slightly higher.

  5. Metadata induces maintenance costs

Just to reiterate this point: introducing new metadata induces maintenance costs and increases complexity (others now have to understand why the metadata exists, how to update it correctly, …). So I would like to see some good justification for what that metadata will be used for, and why it is good to have it.