Hello all,
This is kind of a follow-up to a discussion here on this forum about outer-loop vectorization. I did work on this a bit and wanted to get early feedback. Tagging some people involved in that thread: @PeixinQiao , @fhahn , @nikolaypanchenko , @Shamanth . fhahn already mentioned that the VPlan-native path needs end-to-end tests in the llvm-test-suite and I am willing to propose patches for that as well in the not-so-distant future. A WIP diff for the changes proposed in this thread can be found here.
Introduction
This is a request for comments on improving outer-loop vectorization in LLVM. LLVM only supports outer-loop vectorization in the VPlan-native path, which has to be explicitly enabled using the flag `-enable-vplan-native-path`. LLVM's normal vectorization path produces very good code, is very mature, and builds on the VPlan infrastructure as well, but it is incapable of outer-loop vectorization. The VPlan-native path is an alternative vectorization code path that is purely pragma/metadata driven, performs no memory-dependency checks, has no cost model, and generates bad code: it vectorizes everything (even instructions that compute uniform values), uses gathers/scatters for every memory access, and lacks features such as reductions or recurrences. In return, it supports outer-loop vectorization.
LLVM's VPlan infrastructure has improved immensely since the VPlan-native path was introduced, thanks to the amazing work of the contributors and maintainers of LLVM's vectorizer. The changes proposed with this RFC address some of the code-quality issues in the IR generated by the VPlan-native path, but they do not add memory-dependence checks or similar legality analyses. They do this by adding a new `VPlanTransforms` transformation for the VPlan-native path that finds consecutive or uniform memory accesses in the VPlan produced by `VPlanTransforms::VPInstructionsToVPRecipes`, and then replaces the gather/scatter memory-access recipes with uniform or consecutive ones (where possible/allowed). Afterwards, it replaces all recipes of which only the first lane's value is used (e.g. the address calculation of a consecutive load/store) with non-widening, scalar recipes.
Motivation
I think everybody who has looked at it can agree that the code quality of the current VPlan-native path is far from ideal. LLVM already has a very well-optimized vectorizer for innermost loops, and in the large majority of cases, vectorizing the innermost loop of a loop nest is the best thing to do. However, there are some cases where outer-loop vectorization leads to better performance (e.g. matrix multiplication), as discussed here.
Another example is presented here, where the scalar input IR to the vectorizer looks like this:
```llvm
define void @foo_scalar() {
entry:
  br label %outer_loop

outer_loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %outer_loop.latch ]
  %a_addr = getelementptr inbounds [1000 x float], ptr @A, i64 0, i64 %i
  %a_load = load float, ptr %a_addr, align 4, !tbaa !6
  br label %inner_loop

inner_loop:
  %j = phi i64 [ 0, %outer_loop ], [ %j.next, %inner_loop ]
  %sum = phi float [ %a_load, %outer_loop ], [ %sum.next, %inner_loop ]
  %b_addr = getelementptr inbounds [1000 x float], ptr @B, i64 0, i64 %j
  %b_load = load float, ptr %b_addr, align 4, !tbaa !6
  %sum.next = fadd float %sum, %b_load
  %j.next = add nuw nsw i64 %j, 1
  %inner_exitcond = icmp eq i64 %j.next, 1000
  br i1 %inner_exitcond, label %outer_loop.latch, label %inner_loop, !llvm.loop !14

outer_loop.latch:
  store float %sum.next, ptr %a_addr, align 4, !tbaa !6
  %i.next = add nuw nsw i64 %i, 1
  %outer_exitcond = icmp eq i64 %i.next, 1000
  br i1 %outer_exitcond, label %exit, label %outer_loop, !llvm.loop !10

exit:
  ret void
}
```
The current output IR after vectorization (and some cleanup passes) looks like this:
```llvm
define void @foo() {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %outer_loop.latch4 ]
  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %outer_loop.latch4 ]
  %0 = getelementptr inbounds [1000 x float], ptr @A, i64 0, <4 x i64> %vec.ind
  %wide.masked.gather = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> %0, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> poison), !tbaa !6
  br label %inner_loop1

inner_loop1:
  %vec.phi = phi <4 x i64> [ zeroinitializer, %vector.body ], [ %3, %inner_loop1 ]
  %vec.phi2 = phi <4 x float> [ %wide.masked.gather, %vector.body ], [ %2, %inner_loop1 ]
  %1 = getelementptr inbounds [1000 x float], ptr @B, i64 0, <4 x i64> %vec.phi
  %wide.masked.gather3 = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> %1, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> poison), !tbaa !6
  %2 = fadd <4 x float> %vec.phi2, %wide.masked.gather3
  %3 = add nuw nsw <4 x i64> %vec.phi, <i64 1, i64 1, i64 1, i64 1>
  %4 = extractelement <4 x i64> %3, i64 0
  %5 = icmp eq i64 %4, 1000
  br i1 %5, label %outer_loop.latch4, label %inner_loop1

outer_loop.latch4:
  call void @llvm.masked.scatter.v4f32.v4p0(<4 x float> %2, <4 x ptr> %0, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>), !tbaa !6
  %index.next = add nuw i64 %index, 4
  %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
  %6 = icmp eq i64 %index.next, 1000
  br i1 %6, label %exit, label %vector.body, !llvm.loop !10

exit:
  ret void
}
```
The problems here are:
- The `gather` and `scatter` in `vector.body` and `outer_loop.latch4` are unnecessary because the offset vector used by the GEP will always look like `<IV + 0, IV + 1, IV + 2, IV + 3>` (where `IV` is the induction variable of the vectorized loop); a consecutive memory access using normal `load`/`store` instructions would be better.
- The `gather` in `inner_loop1` is unnecessary because the pointer vector will always contain the same value in all lanes; a simple scalar load followed by a splat would be better.
- Knowing that, the calculations of `%vec.ind` and `%vec.phi` (the outer- and inner-loop induction variables) do not need to be vectorized, as only the value in lane zero is of interest.
Proposal
The normal vectorizer uses `llvm::LoopAccessInfo::isInvariant(Value *V)` and `llvm::getPtrStride(...)` (both implemented in `LoopAccessAnalysis.cpp`). Neither of these functions is usable for outer-loop vectorization, because they only work if the loop for which they check invariance or compute the stride is an innermost one. The proposed changes add a helper function to `VPlanTransforms.cpp` that performs an analysis similar to the two functions mentioned above, but with outer-loop vectorization in mind. The reason for not integrating this functionality directly into the LoopAccessAnalysis is that it does not support non-innermost loops in general. Also, this way the bulk of the changes stays in `VPlanTransforms.cpp` and does not leak into the LoopAccessAnalysis. If others prefer modifying `LoopAccessAnalysis.cpp` instead, that would not be a problem.
In a second step, an algorithm similar to the one in `LoopVectorizationCostModel::collectLoopUniforms()` is used to identify recipes of which only the value of lane zero of the first part is used; those widening recipes are then replaced by uniform `VPReplicateRecipe`s. The main differences between the two algorithms are that `collectLoopUniforms()` operates on LLVM IR rather than VPlan recipes, and that the function proposed here has to be able to handle inner-loop PHIs (which cause def-use cycles). The VPlan-native path already has the `VPWidenPHIRecipe` for inner-loop PHIs; the changes proposed here add a new `VPScalarPHIRecipe` for the related cases where the incoming values are all uniform/scalar.
With the proposed patches applied, the example above would instead produce the following (using the same cleanup passes):
```llvm
define void @foo() {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %outer_loop.latch2 ]
  %0 = getelementptr inbounds [1000 x float], ptr @A, i64 0, i64 %index
  %wide.load = load <4 x float>, ptr %0, align 4, !tbaa !6
  br label %inner_loop1

inner_loop1:
  %scalar.phi = phi i64 [ 0, %vector.body ], [ %4, %inner_loop1 ]
  %vec.phi = phi <4 x float> [ %wide.load, %vector.body ], [ %3, %inner_loop1 ]
  %1 = getelementptr inbounds [1000 x float], ptr @B, i64 0, i64 %scalar.phi
  %2 = load float, ptr %1, align 4, !tbaa !6
  %broadcast.splatinsert = insertelement <4 x float> poison, float %2, i64 0
  %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
  %3 = fadd <4 x float> %vec.phi, %broadcast.splat
  %4 = add nuw nsw i64 %scalar.phi, 1
  %5 = icmp eq i64 %4, 1000
  br i1 %5, label %outer_loop.latch2, label %inner_loop1

outer_loop.latch2:
  store <4 x float> %3, ptr %0, align 4, !tbaa !6
  %index.next = add nuw i64 %index, 4
  %6 = icmp eq i64 %index.next, 1000
  br i1 %6, label %exit, label %vector.body, !llvm.loop !10

exit:
  ret void
}
```
The two main differences:
- No more `gather`/`scatter` intrinsics, but plain `load`/`store` instructions instead, because all memory accesses in this example are consecutive or uniform.
- The induction variables of both the vectorized outer loop and the inner loop are computed using scalar instructions.
Final Words
If there is interest in this RFC, I would be very happy to propose more patches with improvements to outer-loop vectorization later on. For example, I already have prototypes of `DependenceInfo`-based legality checks, tail-loop folding using `VPActiveLaneMaskPHIRecipe`, use of `VPReductionRecipe`/`VPFirstOrderRecurrencePHIRecipe`, … in the VPlan-native path. All of this, of course, was only possible thanks to the amazing improvements to the VPlan infrastructure used by the standard vectorization path!
Some references:
- The proposed changes (WIP, would be split into several patches if the general approach is approved): D159288
- Vectorization Plan from the LLVM Docs.
- A separate patch for supporting scalable vectors in the VPlan-native path: D157484
- A separate patch for supporting interleaving in the VPlan-native path: D157371
Feedback would be very much appreciated!
Best regards,
Lou