Vectorizing with gather/scatter intrinsics

On PowerPC, we do not have instructions for gather/scatter. However, we would like to enable the intrinsic for the purpose of vectorization. After enabling the intrinsic, we can turn on the forceScalarizeMaskedGather hook so that ScalarizeMaskedMemIntrin.cpp scalarizes the intrinsic for us.
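For context, with an all-true mask the expansion produced by ScalarizeMaskedMemIntrin.cpp is, roughly, one extract/load/insert sequence per lane of the gather (the value names below, such as %Ptrs and %Passthru, are made up for illustration):

  %Ptr0 = extractelement <4 x ptr> %Ptrs, i64 0
  %Load0 = load i32, ptr %Ptr0, align 4
  %Res0 = insertelement <4 x i32> %Passthru, i32 %Load0, i64 0
  %Ptr1 = extractelement <4 x ptr> %Ptrs, i64 1
  %Load1 = load i32, ptr %Ptr1, align 4
  %Res1 = insertelement <4 x i32> %Res0, i32 %Load1, i64 1
  ; ... lanes 2 and 3 follow the same pattern ...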
However, we have noticed an issue with the way this intrinsic is vectorized.

For example, consider this IR after vectorizing:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %lsr.iv14 = phi i64 [ %lsr.iv.next15, %vector.body ], [ 0, %vector.ph ]
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %uglygep16 = getelementptr i8, ptr %bb, i64 %lsr.iv14
  %wide.load = load <4 x i32>, ptr %uglygep16, align 4, !tbaa !5
  %0 = sext <4 x i32> %wide.load to <4 x i64>
  %1 = getelementptr inbounds i32, ptr %aa, <4 x i64> %0
  %wide.masked.gather = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %1, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), !tbaa !5
  %uglygep17 = getelementptr i8, ptr %cc, i64 %lsr.iv14
  %wide.load11 = load <4 x i32>, ptr %uglygep17, align 4, !tbaa !5
  %2 = add nsw <4 x i32> %wide.load11, %wide.masked.gather
  store <4 x i32> %2, ptr %uglygep17, align 4, !tbaa !5
  %index.next = add nuw i64 %index, 4
  %lsr.iv.next15 = add nuw nsw i64 %lsr.iv14, 16
  %3 = icmp eq i64 %index.next, %n.vec
  br i1 %3, label %middle.block, label %vector.body, !llvm.loop !9

Because the first argument to llvm.masked.gather.v4i32.v4p0 is a <4 x ptr>, and the entire loop is vectorized, there is overhead from inserting the addresses needed by the gather intrinsic into vector registers and then extracting them back out. Ideally we would like 4 separate scalar loads rather than the single wide load of <4 x i32>, and to use those 4 loads to compute the 4 separate addresses that are passed individually to llvm.masked.gather.v4i32.v4p0.
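As a rough sketch (the value names that do not appear in the block above are made up, not taken from the actual hand-modified IR), the %wide.load, the vector sext, and the vector getelementptr would instead become scalar loads and scalar GEPs, so that when the gather is later scalarized each lane's extractelement folds straight back to a scalar pointer:

  %idxaddr.1 = getelementptr i8, ptr %uglygep16, i64 4
  %idxaddr.2 = getelementptr i8, ptr %uglygep16, i64 8
  %idxaddr.3 = getelementptr i8, ptr %uglygep16, i64 12
  %idx.0 = load i32, ptr %uglygep16, align 4
  %idx.1 = load i32, ptr %idxaddr.1, align 4
  %idx.2 = load i32, ptr %idxaddr.2, align 4
  %idx.3 = load i32, ptr %idxaddr.3, align 4
  %ext.0 = sext i32 %idx.0 to i64
  %ext.1 = sext i32 %idx.1 to i64
  %ext.2 = sext i32 %idx.2 to i64
  %ext.3 = sext i32 %idx.3 to i64
  %addr.0 = getelementptr inbounds i32, ptr %aa, i64 %ext.0
  %addr.1 = getelementptr inbounds i32, ptr %aa, i64 %ext.1
  %addr.2 = getelementptr inbounds i32, ptr %aa, i64 %ext.2
  %addr.3 = getelementptr inbounds i32, ptr %aa, i64 %ext.3
  %ptrs.0 = insertelement <4 x ptr> poison, ptr %addr.0, i64 0
  %ptrs.1 = insertelement <4 x ptr> %ptrs.0, ptr %addr.1, i64 1
  %ptrs.2 = insertelement <4 x ptr> %ptrs.1, ptr %addr.2, i64 2
  %ptrs.3 = insertelement <4 x ptr> %ptrs.2, ptr %addr.3, i64 3
  %wide.masked.gather = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs.3, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)

The rest of the loop body (the load through %uglygep17, the add, and the store) would stay vectorized as before.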

Three ideas we have for implementing this scalarization of the address calculation are:

  1. Having complete custom lowering in the backend to scalarize what we need
  2. Having a target hook, called from ScalarizeMaskedMemIntrin.cpp, which would ask the target whether it wants to scalarize both the gather intrinsic and the address calculations
  3. Having the vectorizer partially vectorize this in a way that keeps the address calculations scalar

The third option seems the cleanest, but I don’t have much experience with the LoopVectorizer. Is this something that can be implemented well in the LoopVectorizer? Other targets might also benefit from such a partial vectorization, for example any architecture that does not have memory access instructions whose addresses reside in the vector unit. The benefit would come from being able to perform SIMD operations on data that sits in non-consecutive memory even without HW support for gather/scatter memory access instructions.

@fhahn Hi Florian, could you please provide some guidance/feedback regarding how this may be vectorized to have scalar address loads?

Hi, can you share a small but buildable IR example? Ideally on https://llvm.godbolt.org

Here is an IR example showing the original way it is vectorized: Compiler Explorer
And here is a hand-modified IR example where the addresses for the gather are kept scalar: Compiler Explorer

In the first example, we have the additional overhead of building the vector register with the addresses and then extracting the addresses back out of vector registers in order to load.
In the second example, we avoid this overhead since the addresses are already scalar.

@fhahn Hi Florian, please see above comment for the buildable IR example.

I meant an example before vectorization where the vectorizer could/should do something differently.

Here is an example IR before loop vectorization: Compiler Explorer
When vectorizing this, the vectorizer is able to use llvm.masked.gather. However, I would like to replace the %wide.load = load <4 x i32>, ptr %2, align 4, !tbaa !0 with 4 separate scalar loads, whose results can then be used to build the vector of addresses passed to llvm.masked.gather, as shown in the IR in Compiler Explorer.
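For reference, a reduced scalar loop of this shape (reconstructed from the vectorized IR above rather than copied from the Compiler Explorer links, so the names are illustrative) is what produces the gather:

define void @foo(ptr %aa, ptr %bb, ptr %cc, i64 %n) {
entry:
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %bb.addr = getelementptr inbounds i32, ptr %bb, i64 %iv
  %idx = load i32, ptr %bb.addr, align 4          ; bb[i]
  %idx.ext = sext i32 %idx to i64
  %aa.addr = getelementptr inbounds i32, ptr %aa, i64 %idx.ext
  %gathered = load i32, ptr %aa.addr, align 4     ; aa[bb[i]], the non-consecutive access
  %cc.addr = getelementptr inbounds i32, ptr %cc, i64 %iv
  %c = load i32, ptr %cc.addr, align 4
  %sum = add nsw i32 %c, %gathered
  store i32 %sum, ptr %cc.addr, align 4           ; cc[i] += aa[bb[i]]
  %iv.next = add nuw nsw i64 %iv, 1
  %exit.cond = icmp eq i64 %iv.next, %n
  br i1 %exit.cond, label %exit, label %for.body

exit:                                             ; preds = %for.body
  ret void
}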