Auto-vectorization and phi nodes

Hi all,

Sorry if this is a dumb or FAQ or the wrong list!

I'm currently investigating LLVM vectorization of my generated code. My codegen emits a lot of recursions that step through arrays via pointers. The recursions are nicely optimized into loops, but the loop vectorization can't seem to work on them because of phi nodes that point to gep nodes.

Some simple IR to demonstrate; it vectorizes nicely with opt -O3 -vectorize-loops -force-vector-width until I uncomment the phi/gep nodes.

define void @add_vector(float* noalias %a, float* noalias %b, float* noalias %c, i32 %num)
{
Top:
         br label %Loop
Loop:
         %i = phi i32 [0,%Top],[%i.next,%Loop]

; phi and gep - won't vectorize
; %a.ptr = phi float* [%a,%Top],[%a.next,%Loop]
; %b.ptr = phi float* [%b,%Top],[%b.next,%Loop]
; %c.ptr = phi float* [%c,%Top],[%c.next,%Loop]

; %a.next = getelementptr float* %a.ptr, i32 1
; %b.next = getelementptr float* %b.ptr, i32 1
; %c.next = getelementptr float* %c.ptr, i32 1

; induction variable as index - will vectorize
         %a.ptr = getelementptr float* %a, i32 %i
         %b.ptr = getelementptr float* %b, i32 %i
         %c.ptr = getelementptr float* %c, i32 %i

         %a.val = load float* %a.ptr
         %b.val = load float* %b.ptr
         %sum = fadd float %a.val, %b.val
         store float %sum, float* %c.ptr

         %i.next = add i32 %i, 1
         %more = icmp slt i32 %i.next, %num
         br i1 %more, label %Loop, label %End
End:
         ret void
}

So it seems that the loop vectorizer would like the pointer stepping to be converted to base+index. However as expected, clang doesn't care whether C code is written as pointer arithmetic or table index.

Is there a pass that converts simple pointer arithmetic to base+index? If not, should I write one (shouldn't be too hard for my limited use case) or try to emit more vector-friendly code from the front end?

Thanks a bunch!
Vesa Norilo

From: "Vesa Norilo" <vnorilo@siba.fi>
To: llvmdev@cs.uiuc.edu
Sent: Tuesday, February 19, 2013 4:40:26 AM
Subject: [LLVMdev] Auto-vectorization and phi nodes

Hi all,

Sorry if this is a dumb or FAQ or the wrong list!

I'm currently investigating LLVM vectorization of my generated code.
My
codegen emits a lot of recursions that step through arrays via
pointers.
The recursions are nicely optimized into loops, but the loop
vectorization can't seem to work on them because of phi nodes that
point
to gep nodes.

Some simple IR to demonstrate; it vectorizes nicely with opt -O3
-vectorize-loops -force-vector-width until I uncomment the phi/gep
nodes.

define void @add_vector(float* noalias %a, float* noalias %b, float*
noalias %c, i32 %num)
{
Top:
         br label %Loop
Loop:
         %i = phi i32 [0,%Top],[%i.next,%Loop]

; phi and gep - won't vectorize
; %a.ptr = phi float* [%a,%Top],[%a.next,%Loop]
; %b.ptr = phi float* [%b,%Top],[%b.next,%Loop]
; %c.ptr = phi float* [%c,%Top],[%c.next,%Loop]

; %a.next = getelementptr float* %a.ptr, i32 1
; %b.next = getelementptr float* %b.ptr, i32 1
; %c.next = getelementptr float* %c.ptr, i32 1

; induction variable as index - will vectorize
         %a.ptr = getelementptr float* %a, i32 %i
         %b.ptr = getelementptr float* %b, i32 %i
         %c.ptr = getelementptr float* %c, i32 %i

         %a.val = load float* %a.ptr
         %b.val = load float* %b.ptr
         %sum = fadd float %a.val, %b.val
         store float %sum, float* %c.ptr

         %i.next = add i32 %i, 1
         %more = icmp slt i32 %i.next, %num
         br i1 %more, label %Loop, label %End
End:
         ret void
}

So it seems that the loop vectorizer would like the pointer stepping
to
be converted to base+index. However as expected, clang doesn't care
whether C code is written as pointer arithmetic or table index.

Is there a pass that converts simple pointer arithmetic to
base+index?

As I recall, loop strength reduction can do this; but that happens only very late in the compilation process (well after vectorization). It would probably be better to update the loop vectorizer to deal with this directly. Nadav?

-Hal

Hi Vesa,

The pass IndVars changes the induction variables to allow SCEV to analyze them and enable other optimizations. This is the canonicalization phase. Later on, LSR lowers the canonicalized induction variables to induction variables that map nicely to the target's addressing modes. In many cases it can remove some of the induction variables.

I suspect that the loop vectorizer does not vectorize the code because SCEV fails to detect the induction variable. Can you run the loop vectorizer with the '-debug' option and check why it fails ?

Thanks,
Nadav

Hi Nadav and Hal and thanks for the help!

To the best of my understanding, indvars doesn't complain and an induction variable is detected. However, the loop vectorizer says:

LV: Checking a loop in "add_vector"
LV: Found a loop: Loop
LV: Found an induction variable.
LV: Found an unidentified PHI. %a.ptr = phi float* [ %a, %Top ], [ %a.next, %Loop ]
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing.

Should indvars have changed this phi node into something else earlier during optimization?

In case I missed something important from the log, I put it all up here:
http://pastebin.com/yBS5RhVN

The IR fed into opt is at the end of the log, functionally unchanged. My command line is opt -S -O3 -vectorize-loops -force-vector-width=8 -debug

Thanks again!
Vesa

19.2.2013 19:22, Nadav Rotem kirjoitti:

Yes. SCEV does not identify this PHI. You can add a breakpoint in "isInductionVariable" to verify this. How are you generating this code ? Do you have a C example for it ?

Thanks,
Nadav

19.2.2013 20:16, Nadav Rotem kirjoitti:

LV: Found an unidentified PHI. %a.ptr = phi float* [ %a, %Top ], [ %a.next, %Loop ]

Yes. SCEV does not identify this PHI. You can add a breakpoint in "isInductionVariable" to verify this. How are you generating this code ? Do you have a C example for it ?

Thanks,
Nadav

Sorry, no C example. I'm writing a frontend for a functional language that generates a bunch of tail calls that iterate through arrays by passing incremented pointers and a decrementing induction var as arguments to tail recursion. LLVM makes very neat scalar loop code out of this, but fails to vectorize.

I wrote the IR I posted by hand to get a minimal example. The reason why it's quite tricky for me to emit [baseptr+indvar] style IR is that base pointer isn't local to the function body until after tail call optimization. Any suggestions are welcome!

I actually wrote a quick pass that moves those phi nodes out of a loop body and replaces them with base+index, but the resulting machine code is rather bad (full of gather/scatter), probably because I needed to screw with the order of llvm optimization passes without a deep enough understanding of how they interact.

All the best,
Vesa