[RFC] Changing LoopUnrollAndJamPass to a function pass.

LoopUnrollAndJamPass is currently a loop pass. It is added in a LPM with only itself.
Notice that loops are traversed in an inner to outer order in a LPM.

The current implementation of LoopUnrollAndJamPass supports only loop nest with one inner loop (L->getSubLoops().size() == 1).
Consider the example below:
Before loop unroll and jam:

for i
for j
for k
A[I][j][k] = 0;

After loop unroll and jam loop-j with a factor of 2:

for i
for j += 2
for k
A[I][j][k] = 0;
A[I][j+1][k] = 0;
for j’=j
for k
A[I][j][k] = 0;

Notice that LoopUnrollAndJamPass can no longer unroll and jam loop-i at the next invocation of LoopUnrollAndJamPass, since there exists two inner loops in loop-i.
If LoopUnrollAndJamPass is a function pass, then it can control the order of the loops being considered. By doing the transformation from outer to inner, both loop-i and loop-j can be unroll and jammed.

In conclusion, I propose to change LoopUnrollAndJamPass from loop to function pass, with the reasons below:

  1. There is no obvious reason why LoopUnrollAndJamPass need to be a loop pass
  2. More loops can be transformed by traversing in a outer to inter order
  3. Less remaining loops are needed if we consider the whole loop nest together
  4. Better cost model can be created by considering the whole loop nest together

Whitney Tsang

This makes sense to me.


Sounds interesting, and it would be great to see Unroll and Jam get some
love. I'm maybe being unimaginative, but I don't know of many cases
where Unroll-and-Jamming the outer loops would be more beneficial than
the code bloat it gave (a lot like normal unrolling). But making this a
function pass sounds useful anyway, it would avoid the awkward case of
accidentally unrolling the inner loop before the outer loop has been UnJ'd!

> 3. Less remaining loops are needed if we consider the whole loop nest

This one is interesting. UnJ was written for things like matrix multiply
on microcontroller's without vectorization available. It helps quite a
lot there. In our experience, we often end up unroll-and-jamming the
middle(j) iteration, then unrolling the inner(k) one, creating a tile of
sorts. It can create quite large code, but that code is quick on these
little cpus.

Oh, and IIRC UnJ was written in a way that it would potentially be
expanded to handle multiple inner loops. Or maybe I just had the idea
that it could be expanded like that, providing that you can work out all
the control flow and dependencies correctly. This might help with
unroll-and-jamming vectorized code too (although if outer loop
vectorization and interleaving is something that can already be handled
in the vectorizer, it may be better to leave it to do that job and have
it included in all it's cost modelling. Reverse-engineering that the
runtime checks are loop-invariant doesn't sound very reliable. If the
vectorizer can already do Unroll and Jam, let it do it on its own).

I look forward to seeing patches!

I am happy to see positive feedback from multiple parties.
I will start working on my first NFCI patch which changes LoopUnrollAndJamPass to a function pass, but keeps traversing the loops from inner to outer.

Whitney Tsang

graycol.gifDavid Green —2020/01/03 08:42:17 AM—Hello Sounds interesting, and it would be great to see Unroll and Jam get some