Hello,

This is not equivalent rewrite: your original code definitely shouldn't vectorize because it has backward cross-iteration (loop-carried) dependency: your value on iteration j+1 depend on value from iteration j you've just written. In case of vectorization you need to do load-operation-store on multiple consecutive values at once and it is impossible in this case.

In your new code all values of a[] in right-hand side of the main loop are from previous k iteration (because on current k iteration you're writing to 'b'). So there is no way to vectorize loop in its original form, but new form is definitely vectorizable.

I am second to recommend you filing a bug over 'restrict' behavior. And you may in fact save some memory by making 'b' 1D array (this is not equivalent rewrite once again though)

// This function computes 2D-5 point Jacobi stencil

void stencil(int a[][N], int b[N])

{

int i, j, k;

for (k = 0; k < N; k++) {

for (i = 1; i <= N-2; i++) {

for (j = 1; j <= N-2; j++)

b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

for (j = 1; j <= N-2; j++)

a[i][j] = b[j];

}

}

}

There is a way to vectorize your code even more efficient. This will look like more low-level, but probably closest to what you would expect from vectorization.

void stencil(int a[][N])

{

int i, j, k;

for (k = 0; k < 100; k++) {

for (i = 1; i <= N-2; i++) {

for (j = 1; j <= N-2; j+=16) {

int b[16]; // This should become a register on KNL

for (v = 0; v < 16; v++) {

b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] + a[i][j-1+v] + a[i][j+1+v]);

}

for (v = 0; v < 16; v++) { // This will be a single store operation

a[i][j+v] = b[v];

}

}

// You should explicitly take care about the tail of j-loop

#if !MASKED_SHORT_LOOP_VECTORIZATION_SUPPORTED // This is not an actual name, just a designator

for (;j <= N-2; j++) {

a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

}

#else

for (v = 0; v < 16-(N-2-j); v++) { // This would become masked non-loop

b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] + a[i][j-1+v] + a[i][j+1+v]);

}

for (v = 0; v < 16-(N-2-j); v++) { // This will be a single masked store operation

a[i][j+v] = b[v];

}

#endif

}

}

Unfortunalely compiler cannot do this for you: this is not equivalent transformation of original code. I am also not aware of any way to express this desired behavior less explicitly (e.g. OpenMP SIMD pragma won't work in this case).

Minor note: You're using 'int' for data, than multiply by 0.25 (divide by 4) and than write it back to 'int'. This will cost you 2 conversion to/from double while you may just place (...) / 4 which should be optimized to simple sequecnce with shifts (not to single shift due to signedness, but still better than conversions with changes of element size 4->8->4 and data size INT->FP->INT).

And by the way why do you divide by 4, not by 5 as number of points suggest?

Serge Preis