Unrolling an arithmetic expression inside a loop

Hello,

I'm investigating how optimized code clang generates, and have come across such an example:
I have two procedures:

void exec0(const int *X, const int *Y, int *res, const int N) {
     int t1[N],t2[N],t3[N],t4[N],t5[N],t6[N];
     for(int i = 0; i < N; i++) {
         t1[i] = X[i]+Y[i]; // X+Y
         t2[i] = t1[i]*Y[i]; // XY + Y^2
         t3[i] = X[i]*Y[i]; // XY
         t4[i] = Y[i]*Y[i]; // Y^2
         t5[i] = t3[i]+t4[i]; // XY + Y^2
         t6[i] = t2[i]-t5[i]; // 0
         res[i] = t6[i] + t3[i]; // XY
     }
}
vs
void exec1(const int *X, const int *Y, int *res, const int N) {
     for(int i = 0; i < N; i++) {
         int t1,t2,t3,t4,t5,t6;
         t1 = X[i]+Y[i]; // X+Y
         t2 = t1*Y[i]; // XY + Y^2
         t3 = X[i]*Y[i]; // XY
         t4 = Y[i]*Y[i]; // Y^2
         t5 = t3+t4; // XY + Y^2
         t6 = t2-t5; // 0
         res[i] = t6 + t3; // XY
     }
}
where I perform some arithmetic operations on a vector. The expression in written in an obfuscated way, but it actually boils down to just res = X*Y. In the first case I create tables to store temporaries, in the second case I have just temporary variables to store intermediate results.

I compile with
clang -S -emit-llvm -std=gnu99 -O3
clang version 2.9 (trunk 118238)
Target: x86_64-unknown-linux-gnu
Thread model: posix

What I get from clang is:
1) In exec0 it recognizes that it doesn't need the tables and doesn't create them and also, it simplifies the expression, so that the generated code inside the loop is just a multiplication:
   %scevgep = getelementptr i32* %X, i64 %indvar
   %scevgep2 = getelementptr i32* %Y, i64 %indvar
   %scevgep9 = getelementptr i32* %res, i64 %indvar
   %3 = load i32* %scevgep, align 4, !tbaa !0
   %4 = load i32* %scevgep2, align 4, !tbaa !0
   %5 = mul nsw i32 %4, %3
   store i32 %5, i32* %scevgep9, align 4, !tbaa !0
2) In exec1 however it fails to recognize that the temporary variables are not reused anywhere and fails to simplify the arithmetic expression, producing:
   %scevgep = getelementptr i32* %X, i64 %indvar
   %scevgep4 = getelementptr i32* %Y, i64 %indvar
   %scevgep5 = getelementptr i32* %res, i64 %indvar
   %3 = load i32* %scevgep, align 4, !tbaa !0
   %4 = load i32* %scevgep4, align 4, !tbaa !0
   %5 = add nsw i32 %4, %3
   %6 = mul nsw i32 %5, %4
   %7 = mul nsw i32 %4, %4
   %8 = sub i32 %6, %7
   store i32 %8, i32* %scevgep5, align 4, !tbaa !0

It is interesting, that I get better compilation output when inputting clang with worse input source (creating redundant temporary tables on stack).

Cheers,
Juliusz Sompolski