On PowerPC, we do not have instructions for gather/scatter. However, we would like to enable the intrinsic for the purpose of vectorization. After enabling the intrinsic, we could enable the hook forceScalarizeMaskedGather so that the intrinsic is scalarized by ScalarizeMaskedMemIntrin.cpp.
However, we are seeing an issue with the way this intrinsic is vectorized.
For example, consider this IR after vectorizing:
vector.body: ; preds = %vector.body, %vector.ph
%lsr.iv14 = phi i64 [ %lsr.iv.next15, %vector.body ], [ 0, %vector.ph ]
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%uglygep16 = getelementptr i8, ptr %bb, i64 %lsr.iv14
%wide.load = load <4 x i32>, ptr %uglygep16, align 4, !tbaa !5
%0 = sext <4 x i32> %wide.load to <4 x i64>
%1 = getelementptr inbounds i32, ptr %aa, <4 x i64> %0
%wide.masked.gather = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %1, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), !tbaa !5
%uglygep17 = getelementptr i8, ptr %cc, i64 %lsr.iv14
%wide.load11 = load <4 x i32>, ptr %uglygep17, align 4, !tbaa !5
%2 = add nsw <4 x i32> %wide.load11, %wide.masked.gather
store <4 x i32> %2, ptr %uglygep17, align 4, !tbaa !5
%index.next = add nuw i64 %index, 4
%lsr.iv.next15 = add nuw nsw i64 %lsr.iv14, 16
%3 = icmp eq i64 %index.next, %n.vec
br i1 %3, label %middle.block, label %vector.body, !llvm.loop !9
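For reference, the IR above appears to correspond to a source loop of roughly the following shape. This is a sketch inferred from the IR, not the original source: the function name gather_add and the parameter name n are assumptions, while aa, bb and cc are taken from the IR value names.

```cpp
#include <cstdint>

// Sketch inferred from the IR: bb holds 32-bit indices into aa, and cc is
// updated in place. The load of aa[bb[i]] is what becomes the masked gather.
void gather_add(int *cc, const int *aa, const int *bb, std::int64_t n) {
  for (std::int64_t i = 0; i < n; ++i)
    cc[i] += aa[bb[i]];
}
```

The accesses to bb and cc are consecutive and become wide loads and stores, while aa is indexed indirectly, so its addresses end up in a <4 x ptr> vector.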
Because the first argument to llvm.masked.gather.v4i32.v4p0 is <4 x ptr> and the entire loop is vectorized, the addresses are computed in vector registers. Once the gather is scalarized, each address has to be extracted from the vector again before it can be used by a scalar load, and the index values had to be inserted into the vector in the first place. Ideally we would like to have 4 separate scalar loads of the indices rather than one wide.load of <4 x i32>, and use those 4 loads to compute the 4 addresses in scalar registers, so that each can be used directly for the corresponding element of the gather.
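To make the desired shape concrete, here is a minimal C++ sketch of such a partially vectorized loop, assuming the loop computes cc[i] += aa[bb[i]] as the IR suggests and that the arrays do not overlap (the vectorizer would normally guard this with runtime checks). The names gather_add_partial and V4 are hypothetical; V4 is only a stand-in for a <4 x i32> register, not a real vector type.

```cpp
#include <array>
#include <cstdint>

using V4 = std::array<int, 4>; // stand-in for a <4 x i32> vector register

// Assumes cc does not overlap aa or bb.
void gather_add_partial(int *cc, const int *aa, const int *bb, std::int64_t n) {
  std::int64_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // Address computation stays scalar: four scalar index loads and four
    // scalar pointers, so no <4 x ptr> value is ever built.
    const int *p0 = aa + bb[i + 0];
    const int *p1 = aa + bb[i + 1];
    const int *p2 = aa + bb[i + 2];
    const int *p3 = aa + bb[i + 3];

    // Four scalar loads; only the loaded data is inserted into a vector.
    V4 gathered = {*p0, *p1, *p2, *p3};

    // The consecutive part stays vectorized: wide load, vector add, wide store.
    V4 wide;
    for (int lane = 0; lane < 4; ++lane)
      wide[lane] = cc[i + lane];
    for (int lane = 0; lane < 4; ++lane)
      wide[lane] += gathered[lane];
    for (int lane = 0; lane < 4; ++lane)
      cc[i + lane] = wide[lane];
  }
  // Scalar remainder for trip counts that are not a multiple of 4.
  for (; i < n; ++i)
    cc[i] += aa[bb[i]];
}
```

In this form the only vector inserts are for the gathered data; the index values and addresses never enter the vector unit, which is the overhead the scalarized gather currently pays.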
Three ideas we have for implementing this scalarization of the address calculation are:
- Having complete custom lowering in the backend to scalarize what we need
- Having a target hook called from ScalarizeMaskedMemIntrin.cpp which would ask the target whether it wants to scalarize both the gather intrinsic and the address calculations
- Having the vectorizer partially vectorize the loop so that the address calculations are kept scalar
The third option seems the cleanest, but I don’t have much experience with the LoopVectorizer. Is this something that can be implemented well with the LoopVectorizer? Other targets might also benefit from such a partial vectorization, for example any architecture that has no memory access instructions that take their addresses from the vector unit. The benefit would come from being able to perform SIMD operations on data that is in non-consecutive memory even without HW support for gather/scatter memory access instructions.