I have matrix multiplication and stencil code, which I vectorise with the following command:
opt -S -O3 -force-vector-width=2048 stencil.ll -o stencil_o3.ll
In both examples (matrix multiplication and stencil), the code vectorises fine when the loop iteration count is greater than 2048. But if I set both the iteration count and the vector width to 2048,
it produces scalar IR instead of vectorising the loop.
Is this an LLVM bug?
Please help me.
Hi Ahmed,
Can you show us your code?
I tried this example:
void foo(int *a, int *b, int *c) {
  for (int i=0; i<2048; i++)
    a[i] = b[i] + c[i];
}
I then ran Clang to produce IR, ran your opt line above on it, and got a vectorised loop:
vector.body:                                      ; preds = %vector.body.preheader
  %0 = bitcast i32* %b to <2048 x i32>*
  %wide.load = load <2048 x i32>, <2048 x i32>* %0, align 4, !alias.scope !1
  %1 = bitcast i32* %c to <2048 x i32>*
  %wide.load17 = load <2048 x i32>, <2048 x i32>* %1, align 4, !alias.scope !4
  %2 = add nsw <2048 x i32> %wide.load17, %wide.load
  %3 = bitcast i32* %a to <2048 x i32>*
  store <2048 x i32> %2, <2048 x i32>* %3, align 4, !alias.scope !6, !noalias !8
  br label %for.end
So, this seems to be either a bug in your code (off-by-one, loop
dependencies, etc) or some missing optimisation in Clang, which we'll
only know when we can actually see the code.
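
To illustrate what "loop dependencies" means here, the following is a minimal sketch, not taken from your code. The function names `add_independent` and `running_sum` are hypothetical. The first loop has independent iterations, like the example above. The second reads a value written by the previous iteration, which is the kind of loop-carried dependence that typically stops a loop from being vectorised as written. Whether a given LLVM version vectorises either loop depends on the version and flags used.

```c
#include <stddef.h>

/* Independent iterations: a[i] depends only on b[i] and c[i],
 * so several iterations can in principle be computed at once. */
void add_independent(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: a[i] needs the a[i-1] that was written
 * by the previous iteration, so iterations cannot simply be
 * executed side by side. a[0] is left as provided by the caller. */
void running_sum(int *a, const int *b, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```
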
cheers,
--renato
Right, that explains it: the trip count of your tail loop doesn't reach 2048 iterations:
#define N 2048
for (i = 1; i <= N-2; i++)
  for (j = 1; j <= N-2; j++)
    a[i][j] = b[i][j];
That'll be 2046 iterations (i runs from 1 to 2046 inclusive), which is fewer than the forced vector width of 2048.
Artificially playing with the ranges (N+1, etc) yields vector code, as expected.
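
The arithmetic can be sketched as below. This is only a back-of-the-envelope check, not how the vectoriser itself decides. The helper names `trip_count_inclusive`, `full_vector_iterations` and `scalar_remainder` are made up for illustration.

```c
/* Trip count of a loop of the form: for (i = lo; i <= hi; i++) */
long trip_count_inclusive(long lo, long hi) {
    return hi >= lo ? hi - lo + 1 : 0;
}

/* How many complete vector iterations of width vf fit in the trip count. */
long full_vector_iterations(long trip_count, long vf) {
    return trip_count / vf;
}

/* Iterations left over for a scalar remainder loop. */
long scalar_remainder(long trip_count, long vf) {
    return trip_count % vf;
}
```

With N = 2048, `trip_count_inclusive(1, N - 2)` is 2046, so `full_vector_iterations(2046, 2048)` is 0 and every iteration falls into the scalar remainder. Raising the inclusive upper bound to 2048 gives a trip count of 2048, which is exactly one full vector iteration.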
Same for the main loop:
float con=0.2;
for (k = 0; k < N; k++) {
  for (i = 1; i <= N-2; i++)
    for (j = 1; j <= N-2; j++)
      b[i][j] = con * (a[i][j] + a[i-1][j] + a[i+1][j] +
                       a[i][j-1] + a[i][j+1]);
cheers,
--renato