Hi Hal,

This is one of the first test cases for which I would love to have improved vectorizer support. I sent it out earlier, but now that the vectorizer has been committed, I think it is a good time to look into it again.

The basic example is a set of scalar loads that load four consecutive elements and store them right back. For me this is an obvious case where vectorization is beneficial (scalar.ll):

define i32 @main() nounwind {
  %V1 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0), align 16
  %V2 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 1), align 4
  %V3 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 2), align 8
  %V4 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 3), align 4
  store float %V1, float* getelementptr ([1024 x float]* @B, i64 0, i64 0), align 16
  store float %V2, float* getelementptr ([1024 x float]* @B, i64 0, i64 1), align 4
  store float %V3, float* getelementptr ([1024 x float]* @B, i64 0, i64 2), align 8
  store float %V4, float* getelementptr ([1024 x float]* @B, i64 0, i64 3), align 4
  ret i32 0
}

opt -O3 -vectorize cannot vectorize this out of the box, because the load/store chain is shorter than the default required chain depth (req-chain-depth).

Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the code:

define i32 @main() nounwind {
  %V1 = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*), align 16
  store <4 x float> %V1, <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
  ret i32 0
}
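
For reference, the two runs look roughly like the sketch below. The file names are taken from the attachments; the -S and -o options (textual IR output to a file) and the name of the first output file are my assumptions about the exact invocation, and the vectorizer flags only exist in an opt build that contains the basic-block vectorizer.

```shell
# Default settings: the scalar loads and stores are left untouched.
# (scalar.default.ll is a hypothetical output name.)
opt -O3 -vectorize scalar.ll -S -o scalar.default.ll

# Required chain depth lowered to 2: the four loads and the four stores
# are each merged into a single <4 x float> operation.
opt -O3 -vectorize -bb-vectorize-req-chain-depth=2 scalar.ll -S -o vector.ll
```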

Is there any way we can make this case work by default? Maybe we could decrease the required chain depth to 2 and increase the cost for loads and stores that are not stride-one?

Another, probably unrelated, point: I also tried a run with -bb-vectorize-req-chain-depth=1. The generated code is full of shufflevector instructions and eight-element vectors, which I did not expect at all. Do you have any idea what is going on here?

Tobi

scalar.ll (1.1 KB)

vector.ll (586 Bytes)