SLP vectorizer on AVX feature

I am having trouble getting the SLP vectorizer to use all 8 floats available in a SIMD vector on a Sandy Bridge CPU with AVX. The function is attached; the CPU flags are:

flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid

I use LLVM 3.6, checked out yesterday:

~/toolchain/install/llvm-3.6/bin/opt -datalayout -basicaa -slp-vectorizer -instcombine < func_4x4x4_scalar_p_scalar.ll -S

The output looks like this:

; ModuleID = '<stdin>'

define void @main(i64 %lo, i64 %hi, float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entrypoint:
   %0 = bitcast float* %arg1 to <4 x float>*
   %1 = load <4 x float>* %0, align 4
   %2 = bitcast float* %arg2 to <4 x float>*
   %3 = load <4 x float>* %2, align 4
   %4 = fadd <4 x float> %3, %1
   %5 = bitcast float* %arg0 to <4 x float>*
   store <4 x float> %4, <4 x float>* %5, align 4
....

So it could make use of the <8 x float> vectors available on that machine, but it doesn't. I then thought that the YMM registers might get used when the IR is lowered to machine code. However, the generated assembly doesn't support that assumption :(

main:
     .cfi_startproc
     xorl %eax, %eax
     xorl %esi, %esi
     .align 16, 0x90
.LBB0_1:
     vmovups (%r8,%rax), %xmm0
     vaddps (%rcx,%rax), %xmm0, %xmm0
     vmovups %xmm0, (%rdx,%rax)
     addq $4, %rsi
     addq $16, %rax
     cmpq $61, %rsi
     jb .LBB0_1
     retq

I played with -mcpu and -march switches without success. In any case, the target architecture should be detected with the -datalayout pass, right?

Any idea what I am missing?

Frank

func_4x4x4_scalar_p_scalar.ll (15.3 KB)

I realized that the function parameters had no alignment attributes. However, even adding an alignment suitable for aligned loads on YMM registers, i.e. 32 bytes, didn't convince the vectorizer to use <8 x float>.

define void @main(i64 %lo, i64 %hi, float* noalias align 32 %arg0, float* noalias align 32 %arg1, float* noalias align 32 %arg2) {
...

still results in code using only <4 x float>.

Thanks,
Frank

Frank,

It sounds like the SLP vectorizer thinks that it is more profitable to use 128-bit wide operations (because 256-bit operations are double pumped on Sandy Bridge). Did you see a different result on Haswell?

Thanks,
Nadav

Nadav,

I can check whether we have a Haswell CPU running somewhere.

In the meantime, here is a link to the debug output of the SLP vectorizer. I don't understand all of it yet, but it doesn't seem to mention the 8-wide vectorization opportunity. (It is 150KB, slightly over the list attachment limit of 100KB, so please find it here: https://www.dropbox.com/s/aarivrzees30zrj/SLP.txt?dl=0)

Also, in an earlier version of my application I saw the SLP vectorizer use <8 x float> for similar functions on the same hardware (Sandy Bridge). In those versions I used LLVM 3.4 or 3.5 (trunk).

Thanks,
Frank

128-bit wide vectorization is the limit for the SLP vectorizer:
https://llvm.org/bugs/show_bug.cgi?id=17170#c8

Is it possible that the cases where you saw 256-bit ops were transformed by the loop vectorizer rather than the SLP vectorizer?

Sanjay,

You're right! I used the loop vectorizer in the earlier version.

Increasing the magic number in the SLP vectorizer solved the issue. Now the code is vectorized with AVX instructions :)

Thanks,
Frank

Hi Frank,

What does --debug-only=vectorize say?

You could try putting the datalayout and the triple in the IR header,
just to make sure you got everything right. LLVM will honour those,
and front-ends should create them correctly.

--renato

Hi Renato,

There were two follow-up emails. The issue is solved. The SLP vectorizer has a magic number built into the code that determines the maximum vector width to search for. It was set to 128 bits; increasing it to 256 bits solved the issue.
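To make the arithmetic behind that bound concrete, here is a minimal sketch. The function names are hypothetical and are not taken from the LLVM sources; the sketch only shows why a 128-bit cap yields <4 x float> even on hardware with 256-bit registers.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of LLVM): number of elements of
// elementBits each that fit into a vector of vectorBits.
constexpr unsigned lanesPerVector(unsigned vectorBits, unsigned elementBits) {
  return elementBits == 0 ? 0 : vectorBits / elementBits;
}

// Hypothetical helper: lanes actually usable when the vectorizer caps
// the vector width at capBits, regardless of what the hardware offers.
unsigned lanesUnderCap(unsigned hardwareBits, unsigned capBits,
                       unsigned elementBits) {
  assert(elementBits > 0 && "element width must be non-zero");
  return lanesPerVector(std::min(hardwareBits, capBits), elementBits);
}
```

With 32-bit floats, lanesUnderCap(256, 128, 32) gives 4 lanes and lanesUnderCap(256, 256, 32) gives 8, which matches the <4 x float> output before the change and the AVX-width result after raising the bound.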

Because the debug types are named inconsistently, it has to be '--debug-only=SLP'. The output can be found in one of the follow-up emails.

Thanks,
Frank

> there were two follow-up emails.

I only got one... weird...

> The issue is solved. The SLP vectorizer has
> a magic number built into the code which determines the max. vector length
> to search for. That was set to 128 bits. Increasing it to 256 bits solved
> the issue.

That looks like a simple fix. Is it upstream yet? :)

> For inconsistency reasons it must be '--debug-only=SLP' and the output can
> be found in one of the follow-up emails.

Of course. Maybe "vectorize" should cover all of them? Anyway,
that's unrelated.

cheers,
--renato

Is there a patch that will get upstreamed?

> That looks like a simple fix. Is it upstream yet? :)

That's not up to me. There were concerns raised about an increased compile time.
https://llvm.org/bugs/show_bug.cgi?id=17170#c8

Frank

From: "Renato Golin" <renato.golin@linaro.org>
To: "Frank Winter" <fwinter@jlab.org>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, July 1, 2015 3:29:25 PM
Subject: Re: [LLVMdev] SLP vectorizer on AVX feature

> That looks like a simple fix. Is it upstream yet? :)

The main concern here, as noted in the bug report, is compile-time cost. We need to characterize that.

-Hal

From: "Frank Winter" <fwinter@jlab.org>
To: "Renato Golin" <renato.golin@linaro.org>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, July 1, 2015 3:33:18 PM
Subject: Re: [LLVMdev] SLP vectorizer on AVX feature

> That's not up to me. There were concerns raised about an increased
> compile time.
> https://llvm.org/bugs/show_bug.cgi?id=17170#c8

The first step, likely, is to transform this bound into a command-line option for ease of benchmarking. I think that a patch that does that (along with a test case) would be nice.
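As a rough illustration of that idea, here is a self-contained sketch of replacing a hard-coded bound with a value read from the command line. Inside LLVM such an option would presumably be declared with the `cl::opt` facility from the CommandLine library; the flag name `-max-vector-bits=` and the function `parseMaxVectorBits` below are hypothetical and do not correspond to an existing opt flag.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical flag name, for illustration only; not an actual opt flag.
static const std::string kFlagPrefix = "-max-vector-bits=";

// Returns the bound requested in args, or defaultBits (the old
// hard-coded value) when the flag is absent or malformed.
// If the flag appears more than once, the last valid occurrence wins.
unsigned parseMaxVectorBits(const std::vector<std::string> &args,
                            unsigned defaultBits = 128) {
  unsigned result = defaultBits;
  for (const std::string &arg : args) {
    // Skip arguments that do not start with the flag prefix.
    if (arg.compare(0, kFlagPrefix.size(), kFlagPrefix) != 0)
      continue;
    const std::string value = arg.substr(kFlagPrefix.size());
    // Require a leading digit so that signs and empty values are rejected.
    if (value.empty() || !std::isdigit(static_cast<unsigned char>(value[0])))
      continue;
    char *end = nullptr;
    const unsigned long parsed = std::strtoul(value.c_str(), &end, 10);
    // Accept only fully numeric, non-zero values.
    if (*end == '\0' && parsed > 0)
      result = static_cast<unsigned>(parsed);
  }
  return result;
}
```

Keeping 128 as the default would leave existing behaviour unchanged, while a benchmark run could pass 256 to measure the compile-time cost mentioned in the bug report.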

-Hal

In those days, SLP was a lot more aggressive, maybe? :slight_smile:

You could use OpenMP vectorize pragmas to specify the width of that
loop, which works across compilers. I'm not sure SLP checks that,
though.
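For reference, a minimal sketch of what such a pragma could look like on a simple loop. It assumes a compiler that supports the `simdlen` clause on `#pragma omp simd` (OpenMP 4.5) and OpenMP SIMD support enabled at compile time (e.g. `-fopenmp-simd`); without that the pragma is ignored and the loop still computes the same result. The function name `addArrays` is made up for the example, and the clause is a hint to the loop vectorizer, not something the SLP vectorizer is known to honour.

```cpp
// Hypothetical example function: element-wise out[i] = a[i] + b[i].
void addArrays(float *out, const float *a, const float *b, int n) {
  // Ask for 8 lanes per vector iteration (8 x 32-bit float = 256 bits).
  #pragma omp simd simdlen(8)
  for (int i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}
```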

cheers,
--renato