Unable to emit vectorized code in LLVM IR

Hello,
I have written the following code. When I try to vectorize it with opt, I do not get vectorized instructions.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int sum = 0;
    int a = atoi(argv[1]);
    int b = atoi(argv[2]);
    for (int i = 0; i < 1000; i++) {
        sum += a + b;
    }

    printf("sum: %d\n", sum);
    return 0;
}

I use the following commands:

clang -S -emit-llvm sum-main.c -march=knl -O3 -mllvm -disable-llvm-optzns -o sum-main.ll

opt -S -O3 -force-vector-width=64 sum-main.ll -o sum-main03.ll

Why is that? What mistake am I making? I get scalar operations rather than vectorized ones.

Please help.

Thank You

Regards

I’m not sure what you expect to have vectorized here. If you look at the emitted code, there’s no loop. It’s just an add and a multiply as you might expect when adding a loop-invariant sum 1000 times in a loop.
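To make that concrete, here is a minimal sketch of the simplification. The function names loop_sum and folded_sum are made up for this example, and folded_sum is a hand-written equivalent of what the optimizer is free to produce, not actual compiler output.

```c
/* Sketch: adding a loop-invariant value 1000 times needs one add and
   one multiply. Both function names are hypothetical. */

/* The loop as written in the original program. */
int loop_sum(int a, int b) {
    int sum = 0;
    for (int i = 0; i < 1000; i++) {
        sum += a + b;   /* a + b never changes inside the loop */
    }
    return sum;
}

/* What the loop reduces to: there is no loop left to vectorize. */
int folded_sum(int a, int b) {
    return (a + b) * 1000;
}
```

For inputs 5 and 2, both functions return 7000, which is why the optimized IR contains no loop.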

I want to vectorize operations on user-supplied inputs. When opt vectorizes the code, the user-supplied inputs (from a text file) should be added using AVX vector instructions.

Following your point, I changed my code to the following:

int main(int argc, char** argv) {
    int a[1000], b[1000], c[1000];
    int aa = atoi(argv[1]), bb = atoi(argv[2]);
    for (int i = 0; i < 1000; i++) {
        a[i] = aa, b[i] = bb;
        c[i] = a[i] + b[i];
        printf("sum: %d\n", c[i]);
    }
    return 0;
}

I am getting the following diagnostic:

remark: :0:0: loop not vectorized: call instruction cannot be vectorized

I am running the following commands:
clang -S -emit-llvm sum-vec.c -march=knl -O3 -mllvm -disable-llvm-optzns -o sum-vec.ll

opt -S -O3 -force-vector-width=64 sum-vec.ll -o sum-vec03.ll

How can I achieve this? Please help.

Move the printf out of the loop and it should vectorize just fine.

Move the printf call out of the loop. A call to printf cannot be vectorized, so it forces the whole loop to stay sequential.

int a[1000], b[1000], c[1000];
int aa = atoi(argv[1]);
int bb = atoi(argv[2]);
for (int i=0; i<1000; i++) {
a[i] = aa;
b[i] = bb;
c[i] = aa + bb;
}

François

I removed printf from the loop. Now I get no error, but the IR does not contain vectorized code. The IR output is as follows:

; ModuleID = 'sum-vec.ll'
source_filename = "sum-vec.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind readnone uwtable
define i32 @main(i32, i8** nocapture readnone) local_unnamed_addr #0 {
ret i32 0
}

attributes #0 = { norecurse nounwind readnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="knl" "target-features"="+adx,+aes,+avx,+avx2,+avx512cd,+avx512er,+avx512f,+avx512pf,+bmi,+bmi2,+cx16,+f16c,+fma,+fsgsbase,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+prefetchwt1,+rdrnd,+rdseed,+rtm,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 4.0.0 (tags/RELEASE_400/final)"}

What should I do? Please help.

Did you remove the printf completely? Meaning that nothing accesses ‘c’ after the loop? If so it got removed as dead code because it had no visible effect.
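A minimal sketch of one way to keep the result alive while still keeping the call out of the computing loop: do the arithmetic in one loop and the printing in a second one. The helper names fill_sums and print_sums are invented for this example. Because c comes from the caller, the stores to it are visible outside the function and should not be removed as dead code.

```c
#include <stdio.h>

/* Hypothetical helper: the computing loop contains no call, so the
   call no longer blocks vectorization of this loop. */
void fill_sums(int *c, int n, int aa, int bb) {
    for (int i = 0; i < n; i++) {
        c[i] = aa + bb;
    }
}

/* Hypothetical helper: printing happens in its own scalar loop.
   Reading c here is what gives the stores above a visible effect. */
void print_sums(const int *c, int n) {
    for (int i = 0; i < n; i++) {
        printf("sum: %d\n", c[i]);
    }
}
```

In main this would be called as fill_sums(c, 1000, aa, bb) followed by print_sums(c, 1000).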

Even if I change my code as follows, vectorized instructions are not emitted. What should I do?

int main(int argc, char** argv) {
    int a[1000], b[1000], c[1000]; int g = 0;
    int aa = atoi(argv[1]), bb = atoi(argv[2]);
    for (int i = 0; i < 1000; i++) {
        a[i] = aa, b[i] = bb;
        c[i] = a[i] + b[i];
        g += c[i];
    }

    printf("sum: %d\n", g);

    return 0;
}

LLVM is very smart about deleting useless code.

If you want to see vectorized code, I suggest writing a function like this:

int f(int* a, int* b, int n) {
int ans = 0;
for (int i = 0; i < n; ++i) {
ans += a[i] * b[i];
}
return ans;
}

That way, the compiler knows nothing about the values of a and b and cannot optimize the loop away. Also, beware of pointer aliasing, which might prevent vectorization or create two different paths: one vectorized and one not.
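The aliasing concern matters once the loop stores through a pointer; the function above only reads a and b, so it is not affected. As a hedged sketch, assuming C99 and a caller who can guarantee the arrays do not overlap, the restrict qualifier passes that guarantee to the compiler. The name add_restrict is made up for this example.

```c
/* Sketch: element-wise add where the caller promises that a, b and c
   do not overlap. With 'restrict' the compiler does not have to assume
   that a store to c[i] might change a later a[j] or b[j]. */
void add_restrict(const int *restrict a, const int *restrict b,
                  int *restrict c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Calling this with overlapping arrays is undefined behaviour, so the promise has to be true. Without restrict, the compiler may emit a run-time overlap check and keep both a vectorized and a scalar version of the loop.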

François Fayard

Making your arrays global will work as well because they’re accessible externally and changes to them cannot be optimized out.
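A minimal sketch of the global-array variant, with hypothetical names (N, g_a, g_b, g_c, add_globals). Because the arrays have external linkage, another translation unit could read them, so the compiler has to keep the stores.

```c
#define N 1000

/* Globals with external linkage: their final contents are observable
   from outside this file, so the stores below cannot be dropped. */
int g_a[N], g_b[N], g_c[N];

void add_globals(int aa, int bb) {
    for (int i = 0; i < N; i++) {
        g_a[i] = aa;
        g_b[i] = bb;
        g_c[i] = g_a[i] + g_b[i];
    }
}
```

The compiler may still notice that every element gets the same value, but the stores themselves remain and can be emitted as vector stores.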

Why is this happening? Is there any way to solve it?

I assume the compiler knows that you only have 2 input values that you just added together 1000 times.

Despite the fact that you stored to a[i] and b[i] here, nothing reads them other than the addition in the same loop iteration. So the compiler easily removed the a and b arrays. Same with ‘c’, it’s not read outside the loop so it doesn’t need to exist. So the compiler turned your loop body back into g+= aa + bb; And since the loop is 1000 iterations and aa and bb never change this got further simplified to (aa+bb)*1000.

When I change my code to the following, I get: remark: :0:0: loop not vectorized: call instruction cannot be vectorized

int main(int argc, char** argv) {
    int a[1000], b[1000], c[1000]; int g = 0;
    for (int i = 0; i < 1000; i++) {
        a[i] = atoi(argv[1]), b[i] = atoi(argv[2]);
        c[i] = a[i] + b[i];
        g += c[i];
    }

    printf("sum: %d\n", g);

    return 0;
}

My main goal is to use the JIT to perform operations, using vector instructions, on a user input file supplied at run time. Is this possible and achievable through the JIT?
Please help.

Try this:

void f(int* a, int* b, int* c, int n) {
for (int i = 0; i < n; ++i) {
c[i] = a[i] + b[i];
}
}

and compile with: clang++ -S -O3 -mavx2 a.cpp -o a.assembly
Then look at the a.assembly file. You’ll get something like:

LBB0_12: ## =>This Inner Loop Header: Depth=1
vmovdqu -96(%rax), %ymm0
vmovdqu -64(%rax), %ymm1
vmovdqu -32(%rax), %ymm2
vmovdqu (%rax), %ymm3
vpaddd -96(%r11), %ymm0, %ymm0
vpaddd -64(%r11), %ymm1, %ymm1
vpaddd -32(%r11), %ymm2, %ymm2
vpaddd (%r11), %ymm3, %ymm3
vmovdqu %ymm0, -96(%rbx)
vmovdqu %ymm1, -64(%rbx)
vmovdqu %ymm2, -32(%rbx)
vmovdqu %ymm3, (%rbx)
subq $-128, %r11
subq $-128, %rax
subq $-128, %rbx
addq $-32, %r9
jne LBB0_12

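That’s vectorized code, unrolled by 4. Each ymm register is 32 bytes wide and holds 32 / 4 = 8 ints, so you get 4 * (32 / 4) = 32 elements processed per loop iteration. The ymm registers show that you are using the 256-bit vector registers available on AVX CPUs (the 256-bit integer add used here needs AVX2, which the compile line above enables). With AVX-512, you would get zmm registers.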
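The same arithmetic can be mirrored in plain C. This is a scalar sketch of the loop shape, not the vectorizer's output, and it assumes a 4-byte int; the names LANES, UNROLL, BLOCK and add_blocked are invented for this example.

```c
enum {
    LANES  = 32 / 4,          /* 8 ints fit in one 32-byte ymm register */
    UNROLL = 4,               /* four registers handled per iteration */
    BLOCK  = LANES * UNROLL   /* 32 elements per main-loop iteration */
};

void add_blocked(const int *a, const int *b, int *c, int n) {
    int i = 0;
    /* Main loop: each pass covers one full block of 32 elements, as one
       trip through the unrolled vector loop does. */
    for (; i + BLOCK <= n; i += BLOCK) {
        for (int j = 0; j < BLOCK; ++j) {
            c[i + j] = a[i + j] + b[i + j];
        }
    }
    /* Remainder loop: the last n % 32 elements are handled one by one. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

With n = 70, the main loop runs twice (64 elements) and the remainder loop handles the last 6.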

François Fayard

By accessing only argv[1] and argv[2], you only took 2 numbers from the command line as input and added them together over and over again. You need to open a file and read numbers from it, or access more command line parameters.
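A minimal sketch of the file-based approach, assuming the input is a text stream of whitespace-separated integers. The helper names read_ints and add_arrays are hypothetical. The reading loop calls fscanf and stays scalar; the separate add loop works on data unknown at compile time and is the one the vectorizer can target.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to max integers from an open stream
   into out[] and returns how many were read. */
int read_ints(FILE *in, int *out, int max) {
    int n = 0;
    while (n < max && fscanf(in, "%d", &out[n]) == 1) {
        ++n;
    }
    return n;
}

/* Hypothetical helper: c[i] = a[i] + b[i] on values that only exist at
   run time, so nothing can be folded away at compile time. */
void add_arrays(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

A caller would open the file with fopen, fill two arrays with read_ints, and pass them to add_arrays together with an output array.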

OK. I have managed to vectorize the second loop in the following code, but the first loop is still not vectorized. Why?

int main(int argc, char** argv) {
    int a[1000], b[1000], c[1000]; int g = 0;
    int aa = atoi(argv[1]), bb = atoi(argv[2]);

    for (int i = 0; i < 1000; i++) {
        a[i] = aa + i, b[i] = bb + i;
    }

    for (int i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
        g += c[i];
    }

    printf("sum: %d\n", g);

    return 0;
}

When I execute the optimized IR through the JIT (lli sum-vec03.ll 5 2), I get the following error:

#0 0x00000000013f965c llvm::sys::PrintStackTrace(llvm::raw_ostream&) /lib/Support/Unix/Signals.inc:402:11
#1 0x00000000013f9b49 PrintStackTraceSignalHandler(void*) /lib/Support/Unix/Signals.inc:466:1
#2 0x00000000013f7ec3 llvm::sys::RunSignalHandlers() /lib/Support/Signals.cpp:0:5
#3 0x00000000013f9ea4 SignalHandler(int) /lib/Support/Unix/Signals.inc:256:1
#4 0x00007fcdece96d10 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10d10)
#5 0x00007fcded2c3038
#6 0x0000000000f4a8fb llvm::MCJIT::runFunction(llvm::Function*, llvm::ArrayRef<llvm::GenericValue>) /lib/ExecutionEngine/MCJIT/MCJIT.cpp:538:31
#7 0x0000000000eaff23 llvm::ExecutionEngine::runFunctionAsMain(llvm::Function*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, char const* const*) /lib/ExecutionEngine/ExecutionEngine.cpp:471:10
#8 0x00000000007be4e9 main /tools/lli/lli.cpp:627:18
#9 0x00007fcdebe2fa40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a40)
#10 0x00000000007bc169 _start (/bin/lli+0x7bc169)
Stack dump:
0. Program arguments:lli sum-vec03.ll 5 2
Illegal instruction (core dumped)

What is wrong here? Please help.

What was your lli command line? Is this based on your code where you created 2048-bit instructions in the x86 backend?

lli sum-vec03.ll 5 2
#0 0x0000000000c1f818 (lli+0xc1f818)
#1 0x0000000000c1d90e (lli+0xc1d90e)
#2 0x0000000000c1da5c (lli+0xc1da5c)
#3 0x00007f987c2c3d10 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10d10)
#4 0x00007f987c6f0038
#5 0x0000000000989f8c (lli+0x989f8c)
#6 0x00000000009383dc (lli+0x9383dc)
#7 0x000000000057eedd (lli+0x57eedd)
#8 0x00007f987b464a40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x20a40)
#9 0x00000000005a5b49 (lli+0x5a5b49)
Stack dump:
0. Program arguments: lli sum-vec03.ll 5 2
Illegal instruction (core dumped)

No, this is not yet connected to the new 2048-element instructions that I discussed earlier. I will need to link the JIT with those later. At present, I am exploring the JIT in general to learn its capabilities.

My main goal is to use the JIT to perform operations, using vector instructions, on a user input file supplied at run time. Is this possible and achievable through the JIT?

So your clang command line says -march=knl. Are you using a Xeon Phi to run your code? If not, that’s why it’s failing. Try changing it to -march=native.
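One way to check the host before running KNL code is to ask the CPU whether it supports AVX-512F, which is among the target features -march=knl enables (the IR attributes earlier in the thread list +avx512f). This sketch assumes GCC or Clang on x86, where the __builtin_cpu_init and __builtin_cpu_supports builtins are available; the function name host_has_avx512f is made up. It is a necessary check only, since KNL also enables other features such as avx512er and avx512pf.

```c
/* Returns 1 if the host CPU reports AVX-512F and 0 otherwise.
   On compilers or targets without the x86 builtins it returns 0. */
int host_has_avx512f(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();   /* initialise the CPU feature data */
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;               /* cannot tell on this compiler or target */
#endif
}
```

If this returns 0, code generated for -march=knl is expected to die with "Illegal instruction" on that machine, as in the trace above.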

Thank you. You are a genius. That solved it. Actually, I am working on a compiler project for my PhD using LLVM. I have some basic questions:

My main goal is to use the JIT to perform operations, using vector instructions, on a user input file supplied at run time. Is this possible and achievable through the JIT?

Also, is there any way to vectorize the reading of run-time input, or of a run-time user input file, using the JIT in LLVM? In my tested example sum-vec.c, the loop that reads the user input is not vectorized. Why?

I would be grateful if you answer my questions. It would help me to proceed in my project.

Thank You again