loop fusion

Is there a transformation in LLVM that will perform loop fusion?
http://en.wikipedia.org/wiki/Loop_fusion

I have the following program, in which I would like the 2 loops (iterating the same number of times) to be merged into 1, after which other nice optimizations such as mem2reg will apply:

; ModuleID = 'test'

define void @vector([16 x float]* nocapture %arg, [16 x float]* nocapture %ret) {
bb.nph12:
   %0 = alloca [16 x float], align 4 ; <[16 x float]*> [#uses=2]
   br label %loop

loop: ; preds = %loop, %bb.nph12
   %indvar13 = phi i64 [ 0, %bb.nph12 ], [ %indvar.next14, %loop ] ; <i64> [#uses=3]
   %gep = getelementptr [16 x float]* %ret, i64 0, i64 %indvar13 ; <float*> [#uses=1]
   %gep1 = getelementptr [16 x float]* %0, i64 0, i64 %indvar13 ; <float*> [#uses=1]
   %load = load float* %gep ; <float> [#uses=1]
   store float %load, float* %gep1
   %indvar.next14 = add i64 %indvar13, 1 ; <i64> [#uses=2]
   %exitcond15 = icmp eq i64 %indvar.next14, 16 ; <i1> [#uses=1]
   br i1 %exitcond15, label %loop3, label %loop

loop3: ; preds = %loop, %loop3
   %indvar = phi i64 [ %indvar.next, %loop3 ], [ 0, %loop ] ; <i64> [#uses=3]
   %gep6 = getelementptr [16 x float]* %0, i64 0, i64 %indvar ; <float*> [#uses=1]
   %gep8 = getelementptr [16 x float]* %arg, i64 0, i64 %indvar ; <float*> [#uses=1]
   %load7 = load float* %gep6 ; <float> [#uses=1]
   store float %load7, float* %gep8
   %indvar.next = add i64 %indvar, 1 ; <i64> [#uses=2]
   %exitcond = icmp eq i64 %indvar.next, 16 ; <i1> [#uses=1]
   br i1 %exitcond, label %end4, label %loop3

end4: ; preds = %loop3
   ret void
}

I did find this note from 2004 that references a LoopFusion pass, but I can't find it in the source code:
http://nondot.org/sabre/LLVMNotes/LoopOptimizerNotes.txt

A little background - I'm working on a SIMD runtime, where I have a scalar program but need to execute it on multiple independent data elements. One approach is to generate a loop for each operation, then rely on the optimizer to merge the loops so that I get good locality. Does this sound like a reasonable approach in LLVM? If you're familiar with CUDA, I'm after something quite similar where the program is scalar but the runtime is vectorized. Any pointers in the right direction would be appreciated!

Andrew

Andrew,

There is not any transformation in LLVM that does loop fusion. I do not of anyone who is working on this. If you're interested to work on it then it'd be great!

Hi Devang,

Do you know if any pass is taking metadata to avoid un-optimizing?
Loop fusion can make it worse if you have strong locality (two
completely separate big memory access) in terms of cache miss in
small-memory platforms. So, would be good if the front-end could pass
on some information down the codegen? That could also help
implementing union types...

I know it's not good to depend on metadata (as for unions), but it
could help some architectures to generate better code, rather than
just "correct" code.

Only thing I know is Dan's recent work on TBAA where FE is using metadata to communicate type info to type based alias analyzer.

Andrew,

There is not any transformation in LLVM that does loop fusion. I do not of anyone who is working on this. If you're interested to work on it then it'd be great!

Hi Devang,

Do you know if any pass is taking metadata to avoid un-optimizing?

Not yet, but it would be valid.

Loop fusion can make it worse if you have strong locality (two
completely separate big memory access) in terms of cache miss in
small-memory platforms.

Agreed. And it can compound register pressure and oversubscribe other
execution resources. A general-purpose loop fusion pass should
consider such things.

So, would be good if the front-end could pass
on some information down the codegen?

Perhaps, although most front-ends don't have the kind of information
that a loop fusion pass would need, unless you're going to venture
into pragma territory.

That could also help
implementing union types...

I don't see the connection here.

I know it's not good to depend on metadata (as for unions), but it
could help some architectures to generate better code, rather than
just "correct" code.

The main constraint on metadata is that it can't be necessary for
correctness.

Dan

Sorry, that was a long jump. I was talking about using metadata to
help union implementation, but you can't depend on it for correctness,
so not a go.