extra loads in nested for-loop

I was looking at the code generated from the following c code and noticed extra loads in the inner-loop of these nested for-loops:

#define DIM 8
#define UNROLL_DIM DIM
typedef double InArray[DIM][DIM];

void f1( InArray c, InArray a, InArray b ) {
#pragma clang loop unroll_count(UNROLL_DIM)
for( int i=0;i<DIM;i++)
#pragma clang loop unroll_count(UNROLL_DIM)
for( int j=0;j<DIM;j++)
#pragma clang loop unroll_count(UNROLL_DIM)
for( int k=0;k<DIM;k++) {
c[i][k] = c[i][k] + a[i][j]*b[j][k];
}
}

In the inner-most loop there, the generated code (and in the .ll as well) loads a[i][j] every time. In this case I’ve unrolled the loops, but it’s the same situation if they’re not unrolled.

Using -O3 to compile, this is the .ll that results (just showing 2 iterations of the unrolled inner loop):

define void @f1([8 x double]* nocapture %c, [8 x double]* nocapture readonly %a, [8 x double]* nocapture readonly %b) #0 {
entry:
%arrayidx8 = getelementptr inbounds [8 x double]* %c, i64 0, i64 0
%arrayidx12 = getelementptr inbounds [8 x double]* %a, i64 0, i64 0
%0 = load double* %arrayidx8, align 8, !tbaa !1
%1 = load double* %arrayidx12, align 8, !tbaa !1
%arrayidx16 = getelementptr inbounds [8 x double]* %b, i64 0, i64 0
%2 = load double* %arrayidx16, align 8, !tbaa !1
%mul = fmul double %1, %2
%add = fadd double %0, %mul
store double %add, double* %arrayidx8, align 8, !tbaa !1
%arrayidx8.1 = getelementptr inbounds [8 x double]* %c, i64 0, i64 1
%3 = load double* %arrayidx8.1, align 8, !tbaa !1
%4 = load double* %arrayidx12, align 8, !tbaa !1 #EXTRA LOAD, could reuse %1!
%arrayidx16.1 = getelementptr inbounds [8 x double]* %b, i64 0, i64 1
%5 = load double* %arrayidx16.1, align 8, !tbaa !1
%mul.1 = fmul double %4, %5
%add.1 = fadd double %3, %mul.1
store double %add.1, double* %arrayidx8.1, align 8, !tbaa !1

Note the line with the comment at the end: #EXTRA LOAD, could reuse %1

This loading from a[i][j] happens again for each iteration and seems quite inefficient.

I changed the C code to explicitly do the load of a[i][j] outside of the innermost loop and that (as would be expected) eliminates the extra load:

void f1( InArray c, InArray a, InArray b ) {
int a_i_j;
#pragma clang loop unroll_count(UNROLL_DIM)
for(int i=0;i<DIM;i++){
#pragma clang loop unroll_count(UNROLL_DIM)
for(int j=0;j<DIM;j++) {
a_i_j = a[i][j];
#pragma clang loop unroll_count(UNROLL_DIM)
for(int k=0;k<DIM;k++) {
c[i][k] = c[i][k] + a_i_j*b[j][k];
}
}
}
}

I guess I’m a bit surprised that -O3 wouldn’t automatically do what I’ve done in the second version of the C code when generating code from the first version? Is there a specific optimization that can be called to do this?

(we’re using LLVM 3.6 - maybe this is something that’s done in later versions?)

Phil

This is most likely because a, b, and c are assumed to be aliased.

-Krzysztof

I guess I'm a bit surprised that -O3 wouldn't automatically do what I've
done in the second version of the C code when generating code from the
first version?

This is most likely because a, b, and c are assumed to be aliased.

In the context of this particular function wouldn't alias analysis be able
to figure out that there is no alias to a? Although, I guess considering
the possibility of multi-threading it might be difficult to make that
determination in all other parts of the program.

Phil

They are function parameters, so there is no way to tell if they are aliased or by looking at the function's body. You can try using "restrict" to tell the compiler that they are not, if that is the case in your program.

-Krzysztof

Right, it’s literally not possible to say these are not aliased without help :slight_smile:

I see there is a noalias attribute introduced in LLVM 3.8 for that very purpose.