Help understanding why SLP vectorization does not apply

I’m trying to generate vectorized IR for simple, linear (i.e. no control flow) functions, but the slp-vectorizer pass seems to have no effect on my code. I suspect I’m misunderstanding something, but I can’t see why nothing auto-vectorizes.

More concretely, I have the following example IR, saved as foo.ll:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
define double @foo(double %a1, double %a2, double %a3, double %a4, double %a5, double %a6, double %a7, double %a8) local_unnamed_addr {
entry:
  %mul1 = fmul double %a1, %a2
  %mul2 = fmul double %a3, %a4
  %mul3 = fmul double %a5, %a6
  %mul4 = fmul double %a7, %a8
  %add1 = fadd double %mul1, %mul2
  %add2 = fadd double %mul3, %mul4
  %add3 = fadd double %add1, %add2
  ret double %add3
}

(this is greatly simplified from my actual generated IR, which I’ve posted on Hastebin)
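For context, the IR above is equivalent to this plain C function (a sketch; the original source that produced the IR isn’t shown here, and note that without `-ffast-math` a C compiler wouldn’t be free to reassociate these additions):

```c
/* Scalar C analogue of foo.ll: four independent multiplies,
 * then a pairwise reduction of the products. */
double foo(double a1, double a2, double a3, double a4,
           double a5, double a6, double a7, double a8) {
    double mul1 = a1 * a2;
    double mul2 = a3 * a4;
    double mul3 = a5 * a6;
    double mul4 = a7 * a8;
    return (mul1 + mul2) + (mul3 + mul4);
}
```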

I run opt -mcpu=native -passes='default<O3>,slp-vectorizer' -S foo.ll on this, and the output shows the IR unchanged:

❯ opt -mcpu=native -passes='default<O3>,slp-vectorizer' -S foo.ll
; ModuleID = 'foo.ll'
source_filename = "foo.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

; Function Attrs: mustprogress nofree norecurse nosync nounwind readnone willreturn
define double @foo(double %a1, double %a2, double %a3, double %a4, double %a5, double %a6, double %a7, double %a8) local_unnamed_addr #0 {
entry:
  %mul1 = fmul double %a1, %a2
  %mul2 = fmul double %a3, %a4
  %mul3 = fmul double %a5, %a6
  %mul4 = fmul double %a7, %a8
  %add1 = fadd double %mul1, %mul2
  %add2 = fadd double %mul3, %mul4
  %add3 = fadd double %add1, %add2
  ret double %add3
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone willreturn "target-cpu"="tigerlake" "target-features"="+sse2,-tsxldtrk,+cx16,+sahf,-tbm,+avx512ifma,+sha,+gfni,-fma4,+vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,-amx-tile,-uintr,+popcnt,+widekl,+aes,+avx512bitalg,+movdiri,+xsaves,-avx512er,-avxvnni,+avx512vnni,-amx-bf16,+avx512vpopcntdq,-pconfig,+clwb,+avx512f,+xsavec,-clzero,+pku,+mmx,-lwp,+rdpid,-xop,+rdseed,-waitpkg,+kl,+movdir64b,-sse4a,+avx512bw,+clflushopt,+xsave,+avx512vbmi2,+64bit,+avx512vl,-serialize,-hreset,+invpcid,+avx512cd,+avx,+vaes,-avx512bf16,+cx8,+fma,-rtm,+bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+fxsr,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,+shstk,+cmov,+avx512vbmi,-amx-int8,+movbe,+avx512vp2intersect,+xsaveopt,+avx512dq,+adx,-avx512pf,+sse3" }

It seems as though my CPU’s vector instructions are correctly detected, and I would (naively?) expect the block of fmuls to vectorize. My best guess is that the overhead of inserting into/extracting from a vector register outweighs the benefit of vectorization here, but I also see no change in my more complex “real” IR, which contains longer contiguous regions of data-dependency-free fmuls, fadds, and so on.

Does anyone have any idea what I might be doing wrong or missing here? Thanks!

Try adding the -slp-threshold=-100 option; most probably the SLP vectorizer just finds the code unprofitable to vectorize.


Ah, thanks! I didn’t realize that the threshold could have negative values.

Out of curiosity, is the default threshold typically reliable, or should I experiment to see if my specific use case benefits from aggressive vectorization even if that requires a lowered threshold?

Doesn’t the SLP vectorizer start from GEP/store operations to build its vectorizable chains of operations? Does it do anything in the absence of stores?

Edit: indeed, it works fine with just scalar code; bad memory on my side :)

I thought (from the “Auto-Vectorization in LLVM” docs, LLVM 15.0.0git) that it didn’t just look at GEP/store instructions, but maybe I misunderstood.
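For reference, the classic store-rooted SLP seed looks like this: two stores to adjacent memory locations, which the vectorizer can use as the root of a tree. (A sketch; `bar` and its operand layout are illustrative, not from my actual code.)

```c
/* Adjacent stores like these are the traditional SLP seed, so a
 * pattern like this typically vectorizes at the default threshold. */
void bar(double *restrict out, const double *restrict a,
         const double *restrict b) {
    out[0] = a[0] * b[0];
    out[1] = a[1] * b[1];
}
```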

With -slp-threshold=-100, I get:

❯ opt -mcpu=native -passes='default<O3>,slp-vectorizer,loop-vectorize'  -slp-threshold=-100 -S foo.ll
; ModuleID = 'foo.ll'
source_filename = "foo.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

; Function Attrs: mustprogress nofree norecurse nosync nounwind readnone willreturn
define double @foo(double %a1, double %a2, double %a3, double %a4, double %a5, double %a6, double %a7, double %a8) local_unnamed_addr #0 {
entry:
  %0 = insertelement <2 x double> poison, double %a1, i32 0
  %1 = insertelement <2 x double> %0, double %a5, i32 1
  %2 = insertelement <2 x double> poison, double %a2, i32 0
  %3 = insertelement <2 x double> %2, double %a6, i32 1
  %4 = fmul <2 x double> %1, %3
  %5 = insertelement <2 x double> poison, double %a3, i32 0
  %6 = insertelement <2 x double> %5, double %a7, i32 1
  %7 = insertelement <2 x double> poison, double %a4, i32 0
  %8 = insertelement <2 x double> %7, double %a8, i32 1
  %9 = fmul <2 x double> %6, %8
  %10 = fadd <2 x double> %4, %9
  %shift = shufflevector <2 x double> %10, <2 x double> poison, <2 x i32> <i32 1, i32 undef>
  %11 = fadd <2 x double> %10, %shift
  %add3 = extractelement <2 x double> %11, i32 0
  ret double %add3
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone willreturn "target-cpu"="tigerlake" "target-features"="+sse2,-tsxldtrk,+cx16,+sahf,-tbm,+avx512ifma,+sha,+gfni,-fma4,+vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,-amx-tile,-uintr,+popcnt,+widekl,+aes,+avx512bitalg,+movdiri,+xsaves,-avx512er,-avxvnni,+avx512vnni,-amx-bf16,+avx512vpopcntdq,-pconfig,+clwb,+avx512f,+xsavec,-clzero,+pku,+mmx,-lwp,+rdpid,-xop,+rdseed,-waitpkg,+kl,+movdir64b,-sse4a,+avx512bw,+clflushopt,+xsave,+avx512vbmi2,+64bit,+avx512vl,-serialize,-hreset,+invpcid,+avx512cd,+avx,+vaes,-avx512bf16,+cx8,+fma,-rtm,+bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+fxsr,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,+shstk,+cmov,+avx512vbmi,-amx-int8,+movbe,+avx512vp2intersect,+xsaveopt,+avx512dq,+adx,-avx512pf,+sse3" }

which is at least vectorized, although the overhead is almost certainly not worth it for this trivial example.
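To make the vectorized IR a bit more concrete, here is a rough C-with-intrinsics analogue of it (a sketch assuming an x86-64 target with SSE2, not what the backend literally emits): two `<2 x double>` multiplies, one vector add, then the horizontal add expressed as the shufflevector/fadd pair.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes x86-64 */

double foo_vec(double a1, double a2, double a3, double a4,
               double a5, double a6, double a7, double a8) {
    /* _mm_set_pd(hi, lo): lane 0 holds the second argument. */
    __m128d lhs1 = _mm_set_pd(a5, a1);          /* <a1, a5> */
    __m128d rhs1 = _mm_set_pd(a6, a2);          /* <a2, a6> */
    __m128d lhs2 = _mm_set_pd(a7, a3);          /* <a3, a7> */
    __m128d rhs2 = _mm_set_pd(a8, a4);          /* <a4, a8> */
    __m128d sums = _mm_add_pd(_mm_mul_pd(lhs1, rhs1),
                              _mm_mul_pd(lhs2, rhs2));
    /* The shufflevector+fadd pair: move lane 1 down and add. */
    __m128d shift = _mm_unpackhi_pd(sums, sums);
    return _mm_cvtsd_f64(_mm_add_pd(sums, shift));
}
```

The four insertelement pairs in the IR correspond to the `_mm_set_pd` calls, which is exactly the gather overhead that makes the cost model reject this at the default threshold.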


Not only from stores; it also seeds from rets, phis, void calls, buildvectors, etc., and in some other cases as well.