Unrolling loop with known upper bound on trip count into a switch

Hello,

I was wondering if it would be beneficial to unroll a loop like

__builtin_assume(n <= 10);
do{
 // do something
} while(--n >=0)

in which the upper bound on n is known at compile time into something like

switch i32 %n, label %default.unreachable43 [
    i32 10, label %sw.10
    i32 9, label %sw.9
    ...
    i32 0, label %sw.0
  ]
sw.10:
  %n.10 = phi i32 [ 10, %entry ]
  ...
  br label %sw.9
sw.9:
  %n.9 = phi i32 [ 9, %entry ], [ %n.10, %sw.10 ]
  ...
  br label %sw.8
...
sw.0:
  %n.0 = phi i32 [ 0, %entry ], [ %n.1, %sw.1]
  ...

This question is motivated by #111773 issue.
The downside I see for the general case is increased code size, and it may prevent the code from being vectorized.

Please let me know your thoughts on this type of optimization and whether it would be useful to implement it.

Unrolling by design increases code size, so that’s not an issue.

Currently, LoopUnroll consideres loop with small constant tripcounts for “bounded unrolling”. That is, it optimistically assumes that the loop will run MaxTripCount times and fully unrolls that, and otherwise falls back to the non-unrolled cases.

I think a jump table iwould probably an improvement for many ISAs, but I am not sure what the cost model would be. GPUs for instance are not good with jump tables. Another benefit of unrolling usually is enabling other optimizations on straight-line code (E.g. SLPVectorize), but here every iteration still has its own BasicBlock. Both can be mitigated by having entirely separate unrolled cases, at the cost of quadratic code blowup, so it is only feasible for very small trip counts.

But I would also be interested in doing this with remainder loops in general (which by definition is a loop with known upper bound). Even without cost model, it can always be applied when #pragma clang loop unroll(enable) is present.

1 Like

That is a good point about remainder loops.

There’s a funny interaction with loop indexing here… if you have a loop where the index counts up, from 0 to n, and you unroll a loop this way, the index isn’t a constant in each unrolled iteration, like it would be for normal unrolling. On the other hand, if your loop index counts from n to 0, unrolling the proposed way makes the index a constant, where it wouldn’t be a constant with normal bounded unrolling. Given that vast majority of loops count up, the heuristics get sort of complicated… does the index being a constant matter for a given loop? Can you transform a loop that counts up to a loop that counts down?

I think, it is possible to generalize even if we don’t know starting and end values of induction variable at compile time, but have an upper bound on the number of iterations. For example, in the code below:

void foo(int start, int end){
do{
// do body
} while(++start < end);
}

assume we know that “body” is not going to be executed more than N times. Then we can chain N copies of the body and jump to (N - (end - start))-th block from the switch in the pre-header.

Whether we can constant-fold doesn’t matter for correctness, sure; the indexing is a cost-modeling issue (i.e. whether the transform is actually profitable, and whether we want to execute the loop iterations in a different order from the original source order).

Yeah, it is not clear how to estimate the cost. However, we don’t need to change the order. Let’s assume for simplicity that header block = latch block. If the upper bound on the number of iterations is N, we will create N symmetrical body blocks. If the run time iteration count is n (< N), it does not matter wether we execute first n blocks and jump to exit or last n.