# Unrolling an arithmetic expression inside a loop

Hello,

I'm investigating how optimized code clang generates, and have come across such an example:
I have two procedures:

void exec0(const int *X, const int *Y, int *res, const int N) {
int t1[N],t2[N],t3[N],t4[N],t5[N],t6[N];
for(int i = 0; i < N; i++) {
t1[i] = X[i]+Y[i]; // X+Y
t2[i] = t1[i]*Y[i]; // XY + Y^2
t3[i] = X[i]*Y[i]; // XY
t4[i] = Y[i]*Y[i]; // Y^2
t5[i] = t3[i]+t4[i]; // XY + Y^2
t6[i] = t2[i]-t5[i]; // 0
res[i] = t6[i] + t3[i]; // XY
}
}
vs
void exec1(const int *X, const int *Y, int *res, const int N) {
for(int i = 0; i < N; i++) {
int t1,t2,t3,t4,t5,t6;
t1 = X[i]+Y[i]; // X+Y
t2 = t1*Y[i]; // XY + Y^2
t3 = X[i]*Y[i]; // XY
t4 = Y[i]*Y[i]; // Y^2
t5 = t3+t4; // XY + Y^2
t6 = t2-t5; // 0
res[i] = t6 + t3; // XY
}
}
where I perform some arithmetic operations on a vector. The expression in written in an obfuscated way, but it actually boils down to just res = X*Y. In the first case I create tables to store temporaries, in the second case I have just temporary variables to store intermediate results.

I compile with
clang -S -emit-llvm -std=gnu99 -O3
clang version 2.9 (trunk 118238)
Target: x86_64-unknown-linux-gnu

What I get from clang is:
1) In exec0 it recognizes that it doesn't need the tables and doesn't create them and also, it simplifies the expression, so that the generated code inside the loop is just a multiplication:
%scevgep = getelementptr i32* %X, i64 %indvar
%scevgep2 = getelementptr i32* %Y, i64 %indvar
%scevgep9 = getelementptr i32* %res, i64 %indvar
%3 = load i32* %scevgep, align 4, !tbaa !0
%4 = load i32* %scevgep2, align 4, !tbaa !0
%5 = mul nsw i32 %4, %3
store i32 %5, i32* %scevgep9, align 4, !tbaa !0
2) In exec1 however it fails to recognize that the temporary variables are not reused anywhere and fails to simplify the arithmetic expression, producing:
%scevgep = getelementptr i32* %X, i64 %indvar
%scevgep4 = getelementptr i32* %Y, i64 %indvar
%scevgep5 = getelementptr i32* %res, i64 %indvar
%3 = load i32* %scevgep, align 4, !tbaa !0
%4 = load i32* %scevgep4, align 4, !tbaa !0
%5 = add nsw i32 %4, %3
%6 = mul nsw i32 %5, %4
%7 = mul nsw i32 %4, %4
%8 = sub i32 %6, %7
store i32 %8, i32* %scevgep5, align 4, !tbaa !0

It is interesting, that I get better compilation output when inputting clang with worse input source (creating redundant temporary tables on stack).

Cheers,
Juliusz Sompolski