Replacing a repetitive sequence of code with a loop

Hey guys, in an HPC project I am working on, I am given an LLVM program consisting of a linear sequence of repetitive chunks of code with a uniform memory access pattern. Each code chunk does the following: 1) loads some memory, 2) performs some arithmetic operations, 3) stores the result back to memory. The memory stride between consecutive chunks is constant over the whole program, so the whole program could be replaced with a single loop whose body contains a generic version of said code chunk. Here's an example (a short one; the real-world program would be much longer):

define void @vec_plus_vec(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entrypoint:
   %0 = bitcast float* %arg1 to <4 x float>*
   %1 = bitcast float* %arg2 to <4 x float>*
   %2 = load <4 x float>* %0, align 16
   %3 = load <4 x float>* %1, align 16
   %4 = fadd <4 x float> %3, %2
   %5 = bitcast float* %arg0 to <4 x float>*
   store <4 x float> %4, <4 x float>* %5, align 16
   %6 = getelementptr float* %arg1, i64 4
   %7 = getelementptr float* %arg2, i64 4
   %8 = getelementptr float* %arg0, i64 4
   %9 = bitcast float* %6 to <4 x float>*
   %10 = bitcast float* %7 to <4 x float>*
   %11 = load <4 x float>* %9, align 16
   %12 = load <4 x float>* %10, align 16
   %13 = fadd <4 x float> %12, %11
   %14 = bitcast float* %8 to <4 x float>*
   store <4 x float> %13, <4 x float>* %14, align 16
   %15 = getelementptr float* %arg1, i64 8
   %16 = getelementptr float* %arg2, i64 8
   %17 = getelementptr float* %arg0, i64 8
   %18 = bitcast float* %15 to <4 x float>*
   %19 = bitcast float* %16 to <4 x float>*
   %20 = load <4 x float>* %18, align 16
   %21 = load <4 x float>* %19, align 16
   %22 = fadd <4 x float> %21, %20
   %23 = bitcast float* %17 to <4 x float>*
   store <4 x float> %22, <4 x float>* %23, align 16
   %24 = getelementptr float* %arg1, i64 12
   %25 = getelementptr float* %arg2, i64 12
   %26 = getelementptr float* %arg0, i64 12
   %27 = bitcast float* %24 to <4 x float>*
   %28 = bitcast float* %25 to <4 x float>*
   %29 = load <4 x float>* %27, align 16
   %30 = load <4 x float>* %28, align 16
   %31 = fadd <4 x float> %30, %29
   %32 = bitcast float* %26 to <4 x float>*
   store <4 x float> %31, <4 x float>* %32, align 16
   ret void
}

The stride between two consecutive chunks is 4*sizeof(float), so the whole sequence could be condensed into a single loop. (The scenario with a constant stride between repetitive code chunks is the first, simplest version. Later I will have to deal with arbitrary strides between chunks, which I guess requires an index array stored in the code as static data...)
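To illustrate, the rolled-up result I am after would look roughly like this (a hand-written sketch in the same IR syntax as the example above; the trip count of 16 and the register names are placeholders that fit only this small example):

```llvm
define void @vec_plus_vec_rolled(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entry:
  br label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %p1 = getelementptr float* %arg1, i64 %i
  %p2 = getelementptr float* %arg2, i64 %i
  %p0 = getelementptr float* %arg0, i64 %i
  %v1p = bitcast float* %p1 to <4 x float>*
  %v2p = bitcast float* %p2 to <4 x float>*
  %v0p = bitcast float* %p0 to <4 x float>*
  %v1 = load <4 x float>* %v1p, align 16
  %v2 = load <4 x float>* %v2p, align 16
  %sum = fadd <4 x float> %v2, %v1
  store <4 x float> %sum, <4 x float>* %v0p, align 16
  %i.next = add i64 %i, 4        ; the constant stride between chunks
  %done = icmp eq i64 %i.next, 16
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```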

For this simple scenario, which existing LLVM pass would be closest to what I am trying to achieve?

Thanks,
Frank

There's a loop reroll pass in LLVM trunk that should do exactly this transformation.

- Ben

Though that's a loop pass (runOnLoop). What you could do is add a
previous pass that would recognize the pattern and create a loop of 1
iteration around the code and then run the reroll pass.

If your pattern recognition is too good, it'll be redundant with the
reroll pass, so I'm not sure how to do that without duplicating effort
or destroying the IR (in case you're wrong). Adding 1-iteration loops
to all functions won't work either. You'll have to balance the
heuristics and make the pass optional, maybe with a pragma to help the
compiler.

cheers,
--renato

> There's a loop reroll pass in LLVM trunk that should do exactly this transformation.
>
> Though that's a loop pass (runOnLoop). What you could do is add a
> previous pass that would recognize the pattern and create a loop of 1
> iteration around the code and then run the reroll pass.

This sounds like we need a separate pass which forms loops from straight-line code. I would strongly suggest only forming loops when you can find at least two iterations. We can probably share most of the code with the existing reroller, but the wrapper pass needs a slightly different interface.

Ok, I will do some code reading; the reroll pass seems a good starting point.

Renato, thanks for the hint that this is actually a loop pass! You're right that it's not always beneficial to roll things up. However, in this particular case I don't worry about insisting on rolling the code into a loop no matter what, i.e. without looking at heuristics etc. In the given application it will always be beneficial to roll it; this is an a priori assumption. (The example I posted is far smaller than what the application generates.)

All functions are generated at runtime with the builder (I am not coming from a higher-level language). Thus, I could embed the whole function in a loop with one iteration. I will probably try this, just to see what the reroll pass thinks about it. It will not solve everything. In real-world code the code chunks can become quite large (over 1000 instructions) and I doubt that the reroller can handle that in its current state.

There's one more thing. The resulting loops will have high iteration counts, so it would make sense to use all available CPU cores and distribute the iterations among them. As far as I can see, threading must be taken care of 'outside of the LLVM IR', right? Starting from the shown code, I would need a mechanism that modifies the starting and ending iteration of the loop and exposes those as arguments to the function, right? I.e.,

define void @vec_plus_vec(i64 %start_iteration, i64 %end_iteration, float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entrypoint:

and from the calling code (OpenMP probably) call the function from each thread with start/end parameters matching the thread number. Does anyone see any other solution to threading than this?

Thanks,
Frank

Sorry for the spam. I think my mailer screws up badly.

Frank

> Ok, I will do some code reading; the reroll pass seems a good
> starting point.
>
> Renato, thanks for the hint that this is actually a loop pass! You're
> right that it's not always beneficial to roll things up. However, in this
> particular case I don't worry about insisting on rolling the code into a
> loop no matter what, i.e. without looking at heuristics etc. In the given
> application it will always be beneficial to roll it; this is an a priori
> assumption. (The example I posted is far smaller than what the
> application generates.)
>
> All functions are generated at runtime with the builder (I am not coming

In that case, why not generate the loop yourself, instead of effectively performing a tonsillectomy from the wrong end?

> from a higher-level language). Thus, I could embed the whole function in
> a loop with one iteration. I will probably try this, just to see what
> the reroll pass thinks about it. It will not solve everything. In
> real-world code the code chunks can become quite large (over 1000
> instructions) and I doubt that the reroller can handle that in its
> current state.

> There's one more thing. The resulting loops will have high iteration
> counts, so it would make sense to use all available CPU cores and
> distribute the iterations among them. As far as I can see, threading
> must be taken care of 'outside of the LLVM IR', right?

Yes.

Jon

> Ok, I will do some code reading; the reroll pass seems a good starting point.

Philip is right that we should try to do this in a generic way, but we don't want to run it on everything. If you can annotate the IR, maybe we could start with some metadata, and use pragmas later.
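For illustration, one way such an annotation could look is loop metadata attached to the latch branch of the wrapping 1-iteration loop. The key name below is made up purely for this sketch; only the !llvm.loop attachment mechanism itself is real:

```llvm
loop:
  ; ... loop body ...
  br i1 %done, label %exit, label %loop, !llvm.loop !0

!0 = metadata !{metadata !0, metadata !"myproject.reroll.force"}
```

The loop-forming/reroll pass could then restrict itself to loops carrying such a marker.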

> All functions are generated at runtime with the builder (I am not coming from a higher-level language). Thus, I could embed the whole function in a loop with one iteration. I will probably try this, just to see what the reroll pass thinks about it. It will not solve everything. In real-world code the code chunks can become quite large (over 1000 instructions) and I doubt that the reroller can handle that in its current state.

Probably not, but I can’t see it being too hard to do it and still have the old behaviour.

> There's one more thing. The resulting loops will have high iteration counts, so it would make sense to use all available CPU cores and distribute the iterations among them. As far as I can see, threading must be taken care of 'outside of the LLVM IR', right?

Not necessarily. You can use OpenMP.

Clang supports a lot of OpenMP pragmas and there's also a runtime library. If you don't use clang, you might have to tweak things a bit, though.

Cheers,
Renato