Jacobi 5 Point Stencil Code not Vectorizing

Hello,

I am trying to vectorize following stencil code;

#include <stdio.h>
#define N 100351

// This function computes 2D-5 point Jacobi stencil
void stencil(int a[restrict][N])
{
int i, j, k;
for (k = 0; k < 100; k++)
{ for (i = 1; i <= N-2; i++)
{ for (j = 1; j <= N-2; j++)
{ a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
}
}
}}

I have used the following commands

clang -S -emit-llvm stencil.c -march=knl -O3 -mllvm -disable-llvm-optzns -o stencil.ll

opt -S -O3 stencil.ll -o stencil_o3.ll

llc -x86-asm-syntax=intel stencil_o3.ll -o stencil.s

But the code is not vectorized. It still uses the scalar instructions;

Please correct me.

Thank You

I even tried polly but still my llvm IR does not contain vector instructions. i used the following command;

clang -S -emit-llvm stencil.c -march=knl -O3 -mllvm -polly -mllvm -polly-vectorizer=stripmine -o stencil_poly.ll

Please specify what is wrong with my code?

Does it happen due to loop carried dependence? if yes what is the solution to vectorize such codes?

please reply. i m waiting.

further i modified the code to the following;

#include <stdio.h>
#define N 100351

// This function computes 2D-5 point Jacobi stencil
void stencil(int a[restrict][N], int b[restrict][N])
{
int i, j, k;
for (k = 0; k < N; k++) {
for (i = 1; i <= N-2; i++)
for (j = 1; j <= N-2; j++)
b[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

for (i = 1; i <= N-2; i++)
for (j = 1; j <= N-2; j++)
a[i][j] = b[i][j];

}
}

but still no vectorization in IR. Also, when I set vector width explicitly to 64, it gives the following error:

remark: :0:0: loop not vectorized: call instruction cannot be vectorized
remark: :0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop

I need serious help on this. Please guide me.

I am able to vectorize it with the following code;

#include <stdio.h>
#define N 100351

// This function computes 2D-5 point Jacobi stencil
void stencil(int a[][N], int b[][N])
{
int i, j, k;
for (k = 0; k < N; k++) {
for (i = 1; i <= N-2; i++)
for (j = 1; j <= N-2; j++)
b[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

for (i = 1; i <= N-2; i++)
for (j = 1; j <= N-2; j++)
a[i][j] = b[i][j];

}
}

I removed restrict over here.

That seems odd. Can you please file a bug report? -Hal

but now it works fine with mentioned code.

If adding “restrict” causes us to stop vectorizing, that’s a bug. Also, bugs should be filed for desired enhancements. -Hal

Hello,

This is not equivalent rewrite: your original code definitely shouldn't vectorize because it has backward cross-iteration (loop-carried) dependency: your value on iteration j+1 depend on value from iteration j you've just written. In case of vectorization you need to do load-operation-store on multiple consecutive values at once and it is impossible in this case.

In your new code all values of a[] in right-hand side of the main loop are from previous k iteration (because on current k iteration you're writing to 'b'). So there is no way to vectorize loop in its original form, but new form is definitely vectorizable.

I am second to recommend you filing a bug over 'restrict' behavior. And you may in fact save some memory by making 'b' 1D array (this is not equivalent rewrite once again though)

// This function computes 2D-5 point Jacobi stencil
void stencil(int a[][N], int b[N])
{
int i, j, k;
for (k = 0; k < N; k++) {
for (i = 1; i <= N-2; i++) {
for (j = 1; j <= N-2; j++)
          b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
            for (j = 1; j <= N-2; j++)
                 a[i][j] = b[j];
        }
    }
}

There is a way to vectorize your code even more efficient. This will look like more low-level, but probably closest to what you would expect from vectorization.

void stencil(int a[][N])
{
    int i, j, k;
    for (k = 0; k < 100; k++) {
         for (i = 1; i <= N-2; i++) {
            for (j = 1; j <= N-2; j+=16) {
               int b[16]; // This should become a register on KNL
               for (v = 0; v < 16; v++) {
           b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] + a[i][j-1+v] + a[i][j+1+v]);
               }
               for (v = 0; v < 16; v++) { // This will be a single store operation
                   a[i][j+v] = b[v];
               }
           }
           // You should explicitly take care about the tail of j-loop
#if !MASKED_SHORT_LOOP_VECTORIZATION_SUPPORTED // This is not an actual name, just a designator
           for (;j <= N-2; j++) {
               a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
           }
#else
          for (v = 0; v < 16-(N-2-j); v++) { // This would become masked non-loop
      b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] + a[i][j-1+v] + a[i][j+1+v]);
          }
          for (v = 0; v < 16-(N-2-j); v++) { // This will be a single masked store operation
              a[i][j+v] = b[v];
          }
#endif
    }
}

Unfortunalely compiler cannot do this for you: this is not equivalent transformation of original code. I am also not aware of any way to express this desired behavior less explicitly (e.g. OpenMP SIMD pragma won't work in this case).

Minor note: You're using 'int' for data, than multiply by 0.25 (divide by 4) and than write it back to 'int'. This will cost you 2 conversion to/from double while you may just place (...) / 4 which should be optimized to simple sequecnce with shifts (not to single shift due to signedness, but still better than conversions with changes of element size 4->8->4 and data size INT->FP->INT).

And by the way why do you divide by 4, not by 5 as number of points suggest?

Serge Preis

Hello,

The following stencil function that you wrote few months ago (for computing 2D-5 point Jacobi stencil) is vectorizable.

void stencil(int a[][N], int b[N])
{
int i, j, k;
for (k = 0; k < N; k++) {
for (i = 1; i <= N-2; i++) {
for (j = 1; j <= N-2; j++)
b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
for (j = 1; j <= N-2; j++)
a[i][j] = b[j];
}
}
}

but when i write the above code in main i am getting error

opt -S -O3 stencil.ll -o stencil_o3.ll
remark: :0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop

Why is that so?

my code is follows:
int a[N][N], b[N];

int main(void)
{
int i, j, k;
for (k = 0; k < N; k++) {
for (i = 1; i <= N-2; i++) {
for (j = 1; j <= N-2; j++)
b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
for (j = 1; j <= N-2; j++)
a[i][j] = b[j];
}
}
}

Please help.

Thank You.

Hello,

To me this is an issue in llvm loop vectorizer (if N is large enough to prevent complete unrolling of j-loop).

Woud you mind to share stencil.ll than I would say more definitely what the issue is.

Regards,
Serge.

#include <stdio.h>
#define N 66

// This function computes 2D-5 point Jacobi stencil
int a[N][N] b[N];

int main(void)
{
int i, j, k;
for (k = 0; k < N; k++) {
for (i = 1; i <= N-2; i++) {
for (j = 1; j <= N-2; j++)
b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
for (j = 1; j <= N-2; j++)
a[i][j] = b[j];
}
}
}

stencil.ll is attached here.

stencil.ll (7.11 KB)

Your problem is due to GVN partial reduction elimination (PRE) which
introduces a PHI node the current loop vectorizer cannot handle:

opt -O3 stencil.ll -pass-remarks=loop-vectorize
-pass-remarks-missed=loop-vectorize
-pass-remarks-analysis=loop-vectorize
remark: <unknown>:0:0: loop not vectorized: value that could not be
identified as reduction is used outside the loop
remark: <unknown>:0:0: loop not vectorized

The message is not entirely accurate. The PHI is not used outside of
the loop, but is in the loop header, therefore cannot be if-converted
and is not a reduction. Polly also had difficulties with this
construction as well.

As presented at the dev meeting, VPlan will be able to insert a
shuffle instruction when encountering this situation.

Michael

Thank you.

But how to resolve it, my work depends on it.

But how to resolve it, my work depends on it.

You're doing a PhD, which pretty much by definition that involves
independent thought. How do *you* think it should be resolved?

Tim.

yes i am doing phd. but my actual work is to use the optimized/ vectorized IR for working on backend. my work deals with backend not optimizer. and i need to use correct IR.

$ opt -O3 stencil.ll -pass-remarks=loop-vectorize -enable-load-pre=0

remark: <unknown>:0:0: vectorized loop (vectorization width: 16,
interleaved count: 1)

Michael

Thank you.

your command is able to vectorize the code