No SSE instructions

Hello.
I have compiled this simple program:

#include <stdio.h>
#include <stdlib.h>

int v1[10000];

int main()
{
    int i;

    for (i = 0; i < 10000; i++) {
        v1[i] = i;
    }

    for (i = 0; i < 10000; i++) {
        printf("%d ", v1[i]);
    }

    return 0;
}

Next, I disassembled the executable and did not find any SSE instructions.
I know that LLVM supports SSE.
So my questions are:

  1. Does this happen only on my computer?
  2. If it is not just my problem, are there no SSE optimizations in LLVM?
  3. Has anyone already worked on this problem?

This program has no floating point, and no vector data types, and no
vector intrinsics.
AFAIK those are the only situations where LLVM would produce SSE code.

GCC indeed produces some SSE instructions at -O3, because unlike LLVM it
has auto-vectorization support.

I doubt that for this particular loop the difference would be
significant though.

Best regards,
--Edwin
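To illustrate what "vector data types" means here, the following is a minimal sketch using the `vector_size` attribute, a GCC extension that Clang also accepts. The function name `add_arrays` and the type name `v4si` are made up for this example, and the sketch assumes the element count is a multiple of 4. Whether a given compiler version turns the vector add into a single SSE instruction depends on the target and options, so treat this as an illustration rather than a statement about generated code.

```c
#include <stddef.h>
#include <string.h>

/* Four 32-bit ints packed into one 16-byte vector (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical helper: out[i] = a[i] + b[i].
 * Assumes n is a multiple of 4 to keep the sketch short. */
void add_arrays(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        v4si va, vb, vsum;

        memcpy(&va, a + i, sizeof va);    /* load four ints, no alignment assumed */
        memcpy(&vb, b + i, sizeof vb);
        vsum = va + vb;                   /* one vector add covers four elements */
        memcpy(out + i, &vsum, sizeof vsum);
    }
}
```

Code written this way hands the compiler an explicit vector operation, so it does not rely on any auto-vectorization pass.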

Hello.
I have compiled the simple program:

#include <stdio.h>
#include <stdlib.h>

int v1[10000];

int main()
{
    int i;

    for (i = 0; i < 10000; i++) {
        v1[i] = i;
    }

This loop is not really vectorizable, even if LLVM had an auto-vectorizer. You need the same operation (floating-point or integer) applied to contiguous elements in a vector. An example of a vectorizable loop body would be “v1[i] = v1[i] * v1[i]”. Then, you could use SSE (or any other vector instruction set) to get a substantial speed improvement.

This is vectorizable. Just start out with a vector of constants <0, 1, 2, 3> and do a store of it every time through the loop, adding <4,4,4,4> as you go.

-Chris
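The scheme described in this reply can be written by hand with SSE2 intrinsics. This is only a sketch: the function name `fill_indices` and the macro `N` are invented for the example, it assumes an x86 target with SSE2 available, and it relies on the array length being a multiple of 4.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

#define N 10000         /* same size as in the original program; a multiple of 4 */

int v1[N];

/* Hypothetical helper: fills v1[i] = i, four elements per iteration. */
void fill_indices(void)
{
    /* _mm_set_epi32 takes the highest lane first, so the lanes hold <0, 1, 2, 3>. */
    __m128i values = _mm_set_epi32(3, 2, 1, 0);
    const __m128i step = _mm_set1_epi32(4);          /* <4, 4, 4, 4> */

    for (int i = 0; i < N; i += 4) {
        _mm_storeu_si128((__m128i *)&v1[i], values); /* store four ints at once */
        values = _mm_add_epi32(values, step);        /* advance every lane by 4 */
    }
}
```

The unaligned store is used so the sketch does not depend on how `v1` is aligned; with a 16-byte-aligned array, `_mm_store_si128` would work as well.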

Serg Anohovsky.


LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Thanks,

Justin Holewinski



Thanks for your notes. In my opinion, there is no difference. So here is another example:

#include <stdio.h>
#include <stdlib.h>

int v0[10000];

int v1[10000];

int main()
{
    int i;

    for (i = 0; i < 10000; i++) {
        v0[i] = i;
    }

    for (i = 0; i < 10000; i++) {
        v1[i] = v0[i] * v0[i] * 4;
    }

    for (i = 0; i < 10000; i++) {
        printf("%d ", v1[i]);
    }

    return 0;
}

This should be optimized, but LLVM has not optimized this program. My questions
were not about this specific example. I want to understand what vector optimizations LLVM has.
How well is this area implemented in LLVM?

When asking this type of question, you should be specific about how
you built the program, ie did you use clang, llvm-gcc, or dragonegg,
and which options did you use. From your message, I can't tell if you
built at O0 or O3.

In this case, no, LLVM does not have any auto-vectorization
optimizations. However, LLVM does have good support for vector
intrinsics, so if you use xmmintrin.h you should be able to get good
performance.

Reid
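As a rough sketch of the intrinsics route, here is the loop body from the second example written against xmmintrin.h. It uses float rather than int on purpose, because that header covers single-precision operations only. The function name `square_times_four` is made up for the example, and the sketch assumes an x86 target and an element count that is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE single-precision intrinsics */

/* Hypothetical helper: dst[i] = src[i] * src[i] * 4.0f.
 * Assumes n is a multiple of 4 to keep the sketch short. */
void square_times_four(const float *src, float *dst, size_t n)
{
    const __m128 four = _mm_set1_ps(4.0f);           /* <4.0, 4.0, 4.0, 4.0> */

    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(src + i);            /* load four floats */
        __m128 y = _mm_mul_ps(_mm_mul_ps(x, x), four);
        _mm_storeu_ps(dst + i, y);                   /* store four results */
    }
}
```

A real version would also need a scalar tail loop for lengths that are not a multiple of 4.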

LLVM does not have an autovectorizer.

-Chris

2011/5/22 Reid Kleckner <reid.kleckner@gmail.com>

When asking this type of question, you should be specific about how
you built the program, ie did you use clang, llvm-gcc, or dragonegg,
and which options did you use. From your message, I can’t tell if you
built at O0 or O3.

In this case, no, LLVM does not have any auto-vectorization
optimizations. However, LLVM does have good support for vector
intrinsics, so if you use xmmintrin.h you should be able to get good
performance.

Reid

Thanks for your explanation. It is very useful to me.

Hi Serg,

Next, I disassembled the executable and did not find any SSE instructions.
I know that LLVM supports SSE.
So my questions are:
   1. Does this happen only on my computer?
   2. If it is not just my problem, are there no SSE optimizations in LLVM?
   3. Has anyone already worked on this problem?

the gcc-4.5 tree vectorizer vectorizes this (see LLVM IR below) but LLVM does
not yet have an auto-vectorizer that can do this.

Ciao, Duncan.

IR produced by dragonegg using -O3 and -fplugin-arg-dragonegg-enable-gcc-optzns:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

module asm "\09.ident\09\22GCC: (GNU) 4.5.4 20110506 (prerelease) LLVM: 131851M\22"

@v1 = common global [10000 x i32] zeroinitializer, align 32
@.cst = private constant [4 x i8] c"%d \00", align 8

define i32 @main() nounwind {
entry:
   br label %"<bb 3>"

"<bb 3>": ; preds = %"<bb 3>", %entry
   %indvar2 = phi i64 [ %indvar.next3, %"<bb 3>" ], [ 0, %entry ]
   %vect_vec_iv_.8_10 = phi <4 x i32> [ %vect_vec_iv_.8_24, %"<bb 3>" ], [ <i32 0, i32 1, i32 2, i32 3>, %entry ]
   %tmp6 = shl i64 %indvar2, 2
   %scevgep7 = getelementptr [10000 x i32]* @v1, i64 0, i64 %tmp6
   %scevgep78 = bitcast i32* %scevgep7 to <4 x i32>*
   %vect_vec_iv_.8_24 = add nsw <4 x i32> %vect_vec_iv_.8_10, <i32 4, i32 4, i32 4, i32 4>
   store <4 x i32> %vect_vec_iv_.8_10, <4 x i32>* %scevgep78, align 16
   %indvar.next3 = add i64 %indvar2, 1
   %exitcond4 = icmp eq i64 %indvar.next3, 2500
   br i1 %exitcond4, label %"<bb 5>", label %"<bb 3>"

"<bb 5>": ; preds = %"<bb 3>", %"<bb 5>"
   %indvar = phi i64 [ %indvar.next, %"<bb 5>" ], [ 0, %"<bb 3>" ]
   %scevgep = getelementptr [10000 x i32]* @v1, i64 0, i64 %indvar
   %D.3943_6 = load i32* %scevgep, align 4
   %0 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.cst, i64 0, i64 0), i32 %D.3943_6) nounwind
   %indvar.next = add i64 %indvar, 1
   %exitcond = icmp eq i64 %indvar.next, 10000
   br i1 %exitcond, label %"<bb 6>", label %"<bb 5>"

"<bb 6>": ; preds = %"<bb 5>"
   ret i32 0
}

declare i32 @printf(i8* nocapture, ...) nounwind

2011/5/22 Chris Lattner <clattner@apple.com>

LLVM does not have an autovectorizer.

-Chris

Could you please tell me whether you are going to implement an autovectorizer in LLVM in the near future?

I’m confident it will happen but have no idea on what timeline.

-Chris

Thank you all for your explanations. This is a really interesting topic for me.

Hi Serg,

there is some preliminary work on autovectorization in the Polly project[1], though we mainly work on loop transformations that expose more vectorization opportunities. If you are interested in doing research in this area, Polly may be a good start.

Cheers
Tobi

[1] http://polly.grosser.es

Hi,

Intel’s OpenCL SDK 1.1 has an LLVM-based vectorizer which vectorizes every OpenCL kernel. (This is possible due to the nature of OpenCL, where the outer loop is implicit.)

Our SDK comes with a nifty GUI tool which allows you to inspect the vectorized LLVM IR.

http://software.intel.com/en-us/articles/opencl-sdk/

Cheers,

Nadav