Vectorizing alloca instructions

Hi,

I've been playing around with the SLPVectorizer trying to get it to
vectorize this simple program:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  store i32 0, i32* %x
  store i32 1, i32* %y
  store i32 2, i32* %z
  store i32 3, i32* %w
  %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index
  %2 = load i32* %1
  store i32 %2, i32 addrspace(1)* %out
  ret void
}

My goal is to have this program transformed to the following:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
  store i32 %0, i32 addrspace(1)* %out
}

I've slightly modified the SLPVectorizer (see the attached patch) so
that it will vectorize small trees, and I've also fixed a crash in the
BoUpSLP::Gather() function when it is passed a list of store
instructions. With this patch, the command:

opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll -S -slp-threshold=-20

Produces the following output and the program remains unchanged:

slp-vectorize-alloc.patch (1.09 KB)

Hi Tom,

Thanks for working on this. The SLP-vectorizer thinks that %X %Y %Z and %W alias, so it tries to perform 4 scalar store operations (which is a bad idea). We need to figure out why AA thinks that X and Y may alias. Maybe there is a problem with the code that uses AA.

Thanks,
Nadav

Hi,

I've been playing around with the SLPVectorizer trying to get it to
vectorize this simple program:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  store i32 0, i32* %x
  store i32 1, i32* %y
  store i32 2, i32* %z
  store i32 3, i32* %w
  %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index
  %2 = load i32* %1
  store i32 %2, i32 addrspace(1)* %out
  ret void
}

My goal is to have this program transformed to the following:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
  store i32 %0, i32 addrspace(1)* %out
}

I've slightly modified the SLPVectorizer

Just a note, I don't think you should or need to vectorize the actual
alloca stuff. If you can simply transform the dynamically indexed load:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  store i32 0, i32* %x
  store i32 1, i32* %y
  store i32 2, i32* %z
  store i32 3, i32* %w
  %1 = bitcast [4 x i32]* %0 to <4 x i32>*
  %2 = load <4 x i32>* %1
  %3 = extractelement <4 x i32> %2, i32 %index
  store i32 %3, i32 addrspace(1)* %out
  ret void
}

Then running SROA and InstCombine will mop up the rest. So its mostly about
getting the SLPVectorizer to handle the dynamic GEP. As soon as it does
that, everything else will fall away.

Not sure how much this helps, just wanted to point it out.

Just a note, I don’t think you should or need to vectorize the actual alloca stuff. If you can simply transform the dynamically indexed load:

Then running SROA and InstCombine will mop up the rest. So its mostly about getting the SLPVectorizer to handle the dynamic GEP. As soon as it does that, everything else will fall away.

I don’t think that Tom wants the SLP-vectorizer to handle dynamic GEPS. What he wants is for the SLP-vectorizer to convert the first part of the code:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
%0 = alloca [4 x i32]
%x = getelementptr [4 x i32]* %0, i32 0, i32 0
%y = getelementptr [4 x i32]* %0, i32 0, i32 1
%z = getelementptr [4 x i32]* %0, i32 0, i32 2
%w = getelementptr [4 x i32]* %0, i32 0, i32 3
store i32 0, i32* %x
store i32 1, i32* %y
store i32 2, i32* %z
store i32 3, i32* %w

Into this: Store <i32 0, i32 1, i32 2, i32 3> …

> Just a note, I don't think you should or need to vectorize the actual alloca stuff. If you can simply transform the dynamically indexed load:
>
> Then running SROA and InstCombine will mop up the rest. So its mostly about getting the SLPVectorizer to handle the dynamic GEP. As soon as it does that, everything else will fall away.
>

I don?t think that Tom wants the SLP-vectorizer to handle dynamic GEPS. What he wants is for the SLP-vectorizer to convert the first part of the code:

My goal is to eliminate all alloca instructions, so vectorizing the
stores is just the first step. I do want the dynamic GEPs removed and
replaced with dynamic extractelement instructions. Is the
SLP-vectorizer the right place to do this?

-Tom

Hi Tom,

Thanks for working on this. The SLP-vectorizer thinks that %X %Y %Z and %W alias, so it tries to perform 4 scalar store operations (which is a bad idea). We need to figure out why AA thinks that X and Y may alias. Maybe there is a problem with the code that uses AA.

Thanks for the tip. opt was using NoAliasAnalysis by default, so passing
-basicaa to opt gives me this now:

SLP: Analyzing blocks in vector.
SLP: Found 5 stores to vectorize.
SLP: Analyzing a store chain of length 4.
SLP: Analyzing a store chain of length 4
SLP: Analyzing 4 stores at offset 0
SLP: Checking users of store i32 0, i32* %x.
SLP: Checking users of store i32 1, i32* %y.
SLP: Checking users of store i32 2, i32* %z.
SLP: Checking users of store i32 3, i32* %w.
SLP: We are able to schedule this bundle.
SLP: added a vector of stores.
SLP: Gathering due to C,S,B,O.
SLP: Calculating cost for tree of size 2.
SLP: Check whether the tree with height 2 is fully vectorizable .
SLP: Adding cost -3 for bundle that starts with store i32 0, i32* %x .
SLP: Adding cost 0 for bundle that starts with i32 0 .
SLP: Total Cost -3.
SLP: Found cost=-3 for VF=4
SLP: Decided to vectorize cost=-3
SLP: Extracting 0 values .
SLP: Erasing scalar: store i32 0, i32* %x.
SLP: Erasing scalar: store i32 1, i32* %y.
SLP: Erasing scalar: store i32 2, i32* %z.
SLP: Erasing scalar: store i32 3, i32* %w.
SLP: Optimizing 0 gather sequences instructions.
SLP: vectorized "vector"
; ModuleID = 'vector-alloca.ll'
target datalayout =
"e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-v2048:2048:2048-n32:64"
target triple = "r600--"

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  %1 = bitcast i32* %x to <4 x i32>*
  store <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32>* %1
  %2 = getelementptr [4 x i32]* %0, i32 0, i32 %index
  %3 = load i32* %2
  store i32 %3, i32 addrspace(1)* %out
  ret void
}

The next step for me is to figure out how to replace the GEP + load with an
extractelement.

-Tom