Improving SLPVectorizer for Julia

I’m working on some small improvements to SLPVectorizer.cpp so that it can deal with some tuple operations arising from Julia code. Being fairly new to LLVM, I could use some advice, particularly from those familiar with the internals of SLPVectorizer.

The motivation can be found in the Julia discussion “Do we want fixed-size arrays?” (Issue #5857, JuliaLang/julia on GitHub). Here is an example of the kind of LLVM code I wish to vectorize.

SLPVectorizer.cpp.patch (4.67 KB)

Hi Arch,

Thanks for looking at this.

The reason the SLPVectorizer bails out on many cases that seem vectorizable is scheduling. It needs to produce a legal schedule. The way it does this is by making sure that it can move all vectorized instructions to the last instruction in a bundle. (Alternatively, you could build a DAG, make sure that you don’t create cycles, and then produce a topological sort, but this was not done out of compile-time concerns.)
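To make the check concrete, here is a minimal sketch of it on a toy model. It is not the SLPVectorizer code: instructions are represented only by their position in the basic block, and the names UserLists and canScheduleBundle are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model, not LLVM's data structures: an instruction is identified by its
// position in the basic block, and Users[i] lists the positions of the
// instructions that use the value defined at position i.
using UserLists = std::vector<std::vector<std::size_t>>;

// Hypothetical helper. Returns true if every member of Bundle can be sunk to
// the position of the bundle's last member. In this model that holds when no
// user outside the bundle sits above that last position.
bool canScheduleBundle(const UserLists &Users,
                       const std::vector<std::size_t> &Bundle) {
  if (Bundle.empty())
    return true;
  const std::size_t Last = *std::max_element(Bundle.begin(), Bundle.end());
  for (std::size_t Member : Bundle) {
    for (std::size_t User : Users[Member]) {
      const bool InBundle =
          std::find(Bundle.begin(), Bundle.end(), User) != Bundle.end();
      if (!InBundle && User < Last)
        return false; // This user would end up above its operand.
    }
  }
  return true;
}
```

With four fadds that each feed an insertelement placed right after them, the first three insertelements sit above the last fadd, so this check rejects the bundle. That is the situation the patch is trying to work around.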

If I understand your patch correctly, you are disabling the above-mentioned check if the vectorizer starts at an insertelement instruction? What about other users? You still need to detect that you can schedule them correctly.

define <4 x float> @julia_foo111(<4 x float>, <4 x float>) {
top:
  %2 = extractelement <4 x float> %0, i32 0
  %3 = extractelement <4 x float> %1, i32 0
  %4 = fadd float %2, %3
  %5 = insertelement <4 x float> undef, float %4, i32 0
  %6 = extractelement <4 x float> %0, i32 1
  %7 = extractelement <4 x float> %1, i32 1
  %8 = fadd float %6, %7

  ; %foo = some operation that uses %8. It potentially feeds %12, but even if it does not, all of its users now need to be moved below %16, and we need to check all their users recursively …

  %9 = insertelement <4 x float> %5, float %8, i32 1
  %10 = extractelement <4 x float> %0, i32 2
  %11 = extractelement <4 x float> %1, i32 2
  %12 = fadd float %10, %11
  %13 = insertelement <4 x float> %9, float %12, i32 2
  %14 = extractelement <4 x float> %0, i32 3
  %15 = extractelement <4 x float> %1, i32 3
  %16 = fadd float %14, %15
  %17 = insertelement <4 x float> %13, float %16, i32 3
  ret <4 x float> %17
}

For your case of insertelements that start a vector tree, you could get away with keeping a set of the “insertelement” instructions from which tryToVectorizeList below started.

if (InsertElementInst *IE = dyn_cast<InsertElementInst>(it)) {
      SmallVector<Value *, 8> Ops;
      if (!findBuildVector(IE, Ops))
        continue;
      // Add the insertelements to InsertVectorRoot. You would need to make sure that all ‘other’ uses of those insertelements are below the last insert.
      if (tryToVectorizeList(Ops, R))

Instead of checking “buildsVector”, you could check this set.

      if (RdxOps && RdxOps->count(UI))
         continue;

+ // This user is part of building a vector
+ if (buildsVector) // use something like: if (InsertVectorRoot.count(UI)) instead.
+ continue;
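As a rough illustration of the suggestion, the sketch below runs the user scan on a toy model where instructions are just positions in the basic block. It is not the SLPVectorizer code: the type UserLists and both function names are hypothetical, and InsertVectorRoot here is a plain std::set standing in for the proposed set of insertelement roots.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Toy model: Users[i] lists the positions of the users of instruction i.
using UserLists = std::vector<std::vector<std::size_t>>;

// Hypothetical helper. Like the plain schedulability check, but users that
// are recorded in InsertVectorRoot are skipped instead of blocking the bundle.
bool canScheduleBundleIgnoring(const UserLists &Users,
                               const std::vector<std::size_t> &Bundle,
                               const std::set<std::size_t> &InsertVectorRoot) {
  if (Bundle.empty())
    return true;
  const std::size_t Last = *std::max_element(Bundle.begin(), Bundle.end());
  for (std::size_t Member : Bundle) {
    for (std::size_t User : Users[Member]) {
      if (InsertVectorRoot.count(User))
        continue; // This user is part of building the vector.
      const bool InBundle =
          std::find(Bundle.begin(), Bundle.end(), User) != Bundle.end();
      if (!InBundle && User < Last)
        return false; // Some other user sits above the last bundle member.
    }
  }
  return true;
}

// Hypothetical helper. The extra condition from the comment above: every
// 'other' use of a recorded insertelement must be below the last insert.
bool rootUsersAreBelowLastInsert(const UserLists &Users,
                                 const std::set<std::size_t> &InsertVectorRoot) {
  if (InsertVectorRoot.empty())
    return true;
  const std::size_t LastInsert = *InsertVectorRoot.rbegin();
  for (std::size_t Insert : InsertVectorRoot)
    for (std::size_t User : Users[Insert])
      if (!InsertVectorRoot.count(User) && User < LastInsert)
        return false;
  return true;
}
```

The point of the set is that it only exempts the insertelements that were actually found by findBuildVector. Any other user, such as the %foo line in the IR above, still goes through the position check.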

Thanks for the detailed explanation and alerting me to the possibility of other user instructions. I'll first take a stab at adapting the RdxOps logic.

- Arch