RFC: Improved VPlan-native Path

Hello all,

This is a follow-up to a discussion here on this forum about outer-loop vectorization. I have been working on this a bit and wanted to get early feedback. Tagging some people involved in that thread: @PeixinQiao , @fhahn , @nikolaypanchenko , @Shamanth . fhahn already mentioned that the VPlan-native path needs end-to-end tests in the llvm-test-suite, and I am willing to propose patches for that as well in the not-so-distant future. A WIP diff for the changes proposed in this thread can be found here.

Introduction

This is a request for comments on improving outer-loop vectorization in LLVM. LLVM only supports outer-loop vectorization in the VPlan-native path, which has to be explicitly enabled using the flag -enable-vplan-native-path. LLVM's normal vectorization path produces very good code, is very mature, and also uses the VPlan infrastructure, but it is incapable of outer-loop vectorization. The VPlan-native path is an alternative vectorization code-path that is purely pragma/metadata driven, has no memory-dependency checks and no cost-model, and generates poor code: it vectorizes everything (even instructions that compute uniform values), uses gathers/scatters for every memory access, and lacks features such as reductions or recurrences. In return, it supports outer-loop vectorization.

LLVM’s VPlan infrastructure has improved immensely since the VPlan-native path was introduced, thanks to the amazing work of the contributors and maintainers of LLVM’s vectorizer. The changes proposed with this RFC address some of the code-quality issues in the IR generated by the VPlan-native path, but they do not add memory-dependence checks or the like. They do this by adding a new VPlanTransforms transformation for the VPlan-native path that finds consecutive or uniform memory accesses in the VPlan produced by VPlanTransforms::VPInstructionsToVPRecipes and replaces the gather/scatter memory-access recipes with uniform or consecutive ones (where possible/allowed). It then replaces all recipes of which only the value of the first lane is used (e.g. the address calculation of a consecutive load/store) with non-widening, scalar recipes.

Motivation

I think everybody who has looked at it can agree that the code quality of the current VPlan-native path is far from ideal. LLVM already has a very well-optimized vectorizer for innermost loops, and in the large majority of cases, vectorizing the innermost loop of a loop nest is the best thing to do. However, there are some cases where outer-loop vectorization leads to better performance (e.g. matrix multiplication), as discussed here.

Another example is presented here, where the scalar input IR to the vectorizer looks like this:

define void @foo_scalar() {
entry:
  br label %outer_loop

outer_loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %outer_loop.latch ]
  %a_addr = getelementptr inbounds [1000 x float], ptr @A, i64 0, i64 %i
  %a_load = load float, ptr %a_addr, align 4, !tbaa !6
  br label %inner_loop

inner_loop:
  %j = phi i64 [ 0, %outer_loop ], [ %j.next, %inner_loop ]
  %sum = phi float [ %a_load, %outer_loop ], [ %sum.next, %inner_loop ]
  %b_addr = getelementptr inbounds [1000 x float], ptr @B, i64 0, i64 %j
  %b_load = load float, ptr %b_addr, align 4, !tbaa !6
  %sum.next = fadd float %sum, %b_load
  %j.next = add nuw nsw i64 %j, 1
  %inner_exitcond = icmp eq i64 %j.next, 1000
  br i1 %inner_exitcond, label %outer_loop.latch, label %inner_loop, !llvm.loop !14

outer_loop.latch:
  store float %sum.next, ptr %a_addr, align 4, !tbaa !6
  %i.next = add nuw nsw i64 %i, 1
  %outer_exitcond = icmp eq i64 %i.next, 1000
  br i1 %outer_exitcond, label %exit, label %outer_loop, !llvm.loop !10

exit:
  ret void
}
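
For readability, the loop nest above corresponds to C source along these lines (a reconstruction from the IR, not taken from the original post; the array sizes follow the IR constants):

```c
#include <stddef.h>

#define N 1000

float A[N], B[N];

/* Each outer iteration loads A[i], accumulates every element of B
   into it, and stores the result back to A[i]. */
void foo_scalar(void) {
    for (size_t i = 0; i < N; i++) {
        float sum = A[i];              /* %a_load   */
        for (size_t j = 0; j < N; j++)
            sum += B[j];               /* %sum.next */
        A[i] = sum;                    /* store in %outer_loop.latch */
    }
}
```

Note that the inner loop carries a dependence through `sum`, while the outer iterations are independent of each other, which is exactly what makes the outer loop the attractive one to vectorize.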

The current output IR after vectorization (and some cleanup passes) looks like this:

define void @foo() {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %outer_loop.latch4 ]
  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %outer_loop.latch4 ]
  %0 = getelementptr inbounds [1000 x float], ptr @A, i64 0, <4 x i64> %vec.ind
  %wide.masked.gather = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> %0, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> poison), !tbaa !6
  br label %inner_loop1

inner_loop1:
  %vec.phi = phi <4 x i64> [ zeroinitializer, %vector.body ], [ %3, %inner_loop1 ]
  %vec.phi2 = phi <4 x float> [ %wide.masked.gather, %vector.body ], [ %2, %inner_loop1 ]
  %1 = getelementptr inbounds [1000 x float], ptr @B, i64 0, <4 x i64> %vec.phi
  %wide.masked.gather3 = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> %1, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> poison), !tbaa !6
  %2 = fadd <4 x float> %vec.phi2, %wide.masked.gather3
  %3 = add nuw nsw <4 x i64> %vec.phi, <i64 1, i64 1, i64 1, i64 1>
  %4 = extractelement <4 x i64> %3, i64 0
  %5 = icmp eq i64 %4, 1000
  br i1 %5, label %outer_loop.latch4, label %inner_loop1

outer_loop.latch4:
  call void @llvm.masked.scatter.v4f32.v4p0(<4 x float> %2, <4 x ptr> %0, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>), !tbaa !6
  %index.next = add nuw i64 %index, 4
  %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
  %6 = icmp eq i64 %index.next, 1000
  br i1 %6, label %exit, label %vector.body, !llvm.loop !10

exit:
  ret void
}

The problems here are:

  • The gather and scatter in vector.body and outer_loop.latch4 are unnecessary because the offset vector used by the GEP will always look like < IV + 0, IV + 1, IV + 2, IV + 3 > (where IV is the induction variable of the vectorized loop); a consecutive memory access using normal load/store instructions would be better.
  • The gather in inner_loop1 is unnecessary because the pointer vector will always contain the same value for all lanes (a simple scalar load followed by a splat would be better).
  • Knowing that, the calculations of %vec.ind (the outer-loop induction vector) and %vec.phi (the inner-loop induction variable) do not need to be vectorized, as only the value in lane zero is used.
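
Expressed in scalar C, the lowering these bullets ask for processes VF consecutive outer iterations per vector iteration, keeps the inner induction variable scalar, and broadcasts the single B[j] value across all lanes. This is a hand-written sketch of the intended semantics (the names are illustrative, not compiler output):

```c
#define N  1000
#define VF 4                           /* vectorization factor */

float A[N], B[N];

void foo_vectorized_sketch(void) {
    for (int i = 0; i < N; i += VF) {
        float sum[VF];                 /* models %vec.phi of the sums */
        for (int l = 0; l < VF; l++)
            sum[l] = A[i + l];         /* one consecutive wide load   */
        for (int j = 0; j < N; j++) {  /* scalar inner IV             */
            float b = B[j];            /* one scalar load ...         */
            for (int l = 0; l < VF; l++)
                sum[l] += b;           /* ... splat across the lanes  */
        }
        for (int l = 0; l < VF; l++)
            A[i + l] = sum[l];         /* one consecutive wide store  */
    }
}
```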

Proposal

The normal vectorizer uses llvm::LoopAccessInfo::isInvariant(Value *V) and llvm::getPtrStride(...) (also implemented in LoopAccessAnalysis.cpp). Neither of these functions is usable for outer-loop vectorization, because they only work if the loop for which they check invariance or stride is an innermost one. The proposed changes add a helper function to VPlanTransforms.cpp that performs an analysis similar to the two functions mentioned above, but with outer-loop vectorization in mind. The reason for not integrating this functionality directly into LoopAccessAnalysis is that it does not support non-innermost loops in general. This way, the bulk of the changes also stays in VPlanTransforms.cpp and does not leak into LoopAccessAnalysis. If others would prefer to modify LoopAccessAnalysis.cpp instead, that would not be a problem.
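
To make the intent concrete, the decision such a helper has to make can be sketched on a toy address model (everything here is illustrative; the actual patch works on VPlan recipes and SCEV expressions, not on a representation like this):

```c
/* Toy address model: byte address = base + i*coeff_outer + j*coeff_inner,
   where i is the IV of the loop being vectorized (the outer loop) and
   j is the IV of a nested loop. */
typedef struct {
    long base;
    long coeff_outer;  /* bytes the address advances per outer iteration */
    long coeff_inner;  /* bytes the address advances per inner iteration */
} AddrExpr;

typedef enum { ACC_UNIFORM, ACC_CONSECUTIVE, ACC_STRIDED } AccessKind;

/* Classify an access relative to the vectorized (outer) loop:
   uniform     -> all lanes use the same address (scalar access + splat),
   consecutive -> adjacent lanes touch adjacent elements (wide load/store),
   strided     -> anything else (keep the gather/scatter). */
AccessKind classify(AddrExpr a, long elem_size) {
    if (a.coeff_outer == 0)
        return ACC_UNIFORM;
    if (a.coeff_outer == elem_size)
        return ACC_CONSECUTIVE;
    return ACC_STRIDED;
}
```

In the example above, the address of A[i] advances by 4 bytes (one float) per outer iteration, so it is consecutive, while the address of B[j] does not depend on the outer IV at all, so it is uniform.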

In a second step, an algorithm similar to the one in LoopVectorizationCostModel::collectLoopUniforms() is used to identify recipes of which only the value of lane zero of the first part is used; those widening recipes are then replaced by uniform VPReplicateRecipes. The main differences between the two algorithms are that collectLoopUniforms() operates on LLVM IR rather than VPlan recipes, and that the function here has to be able to handle inner-loop PHIs (which cause def-use cycles). The VPlan-native path already has the VPWidenPHIRecipe for inner-loop PHIs; the changes proposed here add a new VPScalarPHIRecipe for the same purpose, but for cases where the incoming values are all uniform/scalar.
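
The shape of that propagation can be sketched as an optimistic fixed-point iteration over a toy def-use graph (purely illustrative; the real algorithm works on VPlan recipes and also treats uses such as the address operand of a consecutive access as lane-zero-only):

```c
#include <stdbool.h>

#define MAX_OPS 2

/* Toy recipe: up to two operand indices, -1 = unused. must_widen marks
   recipes that inherently produce per-lane values (e.g. a wide load or
   the wide fadd of the running sums). */
typedef struct {
    int ops[MAX_OPS];
    bool must_widen;
} Recipe;

/* Optimistically assume every non-widened recipe needs only lane zero,
   then demote any recipe that has a user which was not proven uniform.
   Starting optimistic is what makes def-use cycles through inner-loop
   PHIs work out: a PHI and its increment can keep each other uniform as
   long as no external user demands a full vector. (Simplification: the
   real transform can also keep a scalar value and broadcast it on
   demand for widened users.) */
void collect_uniforms(const Recipe *r, int n, bool *uniform) {
    for (int v = 0; v < n; v++)
        uniform[v] = !r[v].must_widen;
    bool changed = true;
    while (changed) {                     /* iterate to a fixed point */
        changed = false;
        for (int v = 0; v < n; v++) {
            if (!uniform[v]) continue;
            for (int u = 0; u < n && uniform[v]; u++)
                for (int k = 0; k < MAX_OPS; k++)
                    if (r[u].ops[k] == v && !uniform[u]) {
                        uniform[v] = false;  /* a vector user needs v */
                        changed = true;
                        break;
                    }
        }
    }
}
```

On a graph modeling the inner loop of the example (PHI, increment, exit compare, wide fadd), the PHI/increment cycle stays uniform once the gather that consumed the IV vector has been turned into a uniform load; with the gather still present, the cycle is demoted.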

Using the example above, the proposed patches produce this instead (after the same cleanup passes):

define void @foo() {
entry:
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %outer_loop.latch2 ]
  %0 = getelementptr inbounds [1000 x float], ptr @A, i64 0, i64 %index
  %wide.load = load <4 x float>, ptr %0, align 4, !tbaa !6
  br label %inner_loop1

inner_loop1:
  %scalar.phi = phi i64 [ 0, %vector.body ], [ %4, %inner_loop1 ]
  %vec.phi = phi <4 x float> [ %wide.load, %vector.body ], [ %3, %inner_loop1 ]
  %1 = getelementptr inbounds [1000 x float], ptr @B, i64 0, i64 %scalar.phi
  %2 = load float, ptr %1, align 4, !tbaa !6
  %broadcast.splatinsert = insertelement <4 x float> poison, float %2, i64 0
  %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
  %3 = fadd <4 x float> %vec.phi, %broadcast.splat
  %4 = add nuw nsw i64 %scalar.phi, 1
  %5 = icmp eq i64 %4, 1000
  br i1 %5, label %outer_loop.latch2, label %inner_loop1

outer_loop.latch2:
  store <4 x float> %3, ptr %0, align 4, !tbaa !6
  %index.next = add nuw i64 %index, 4
  %6 = icmp eq i64 %index.next, 1000
  br i1 %6, label %exit, label %vector.body, !llvm.loop !10

exit:
  ret void
}

The two main differences:

  • No more gather/scatter intrinsics but normal load/store instructions, because all memory accesses in this example are consecutive or uniform.
  • The induction variables of both the outer loop (along which we vectorize) and the inner loop are computed using scalar instructions!

Final Words

If there is interest in this RFC, I would be very happy to propose more patches with improvements to outer-loop vectorization later on. For example, I already have prototypes of DependenceInfo-based legality checks, tail-folding using VPActiveLaneMaskPHIRecipe, use of VPReductionRecipe/VPFirstOrderRecurrencePHIRecipe, and more in the VPlan-native path. All of this, of course, was only possible thanks to the amazing improvements to the VPlan infrastructure used by the standard vectorization path!

Some references:

  • The proposed changes (WIP, would be split into several patches if the general approach is approved): D159288
  • Vectorization Plan from the LLVM Docs.
  • A separate patch for supporting scalable vectors in the VPlan-native path: D157484
  • A separate patch for supporting interleaving in the VPlan-native path: D157371

Feedback would be very much appreciated!
Best regards,
Lou


Hi @iamlouk,

Thanks for putting this RFC together.

Together with my colleague Lorenzo, we have been looking at making more use of the VPlan-native path, and we also wanted to reduce the use of gathers/scatters in the inner loop. We were thinking about how to address this; it seems you already have a workable way to solve it.

Our idea was more along the lines of sketching a cost model (or “VPlanSomething”) that can answer analysis/cost questions on top of the VPlan itself, as built by the different VPlanTransforms. (Maybe this is already planned and we just failed to find a reference here?)

We are definitely interested in work to further the VPlan-native path, especially for outer-loop vectorization. Our interest mainly revolves around using the vector-length feature of RISC-V in the loop vectorizer.

Kind regards,

Hi @rofirrim,

Thank you very much for your input!

The way I implemented more advanced features (if-conversion, tail-folding, live-out values like reductions, …) that require more analysis (not part of this RFC, as those would be much more invasive changes) was mostly by extending LoopVectorizationLegality (e.g. using DependenceInfo). The checks on memory accesses etc. are all done before the first VPlanTransform that modifies the CFG or invalidates IR-based analysis results (like if-conversion), so I just use the analysis results obtained directly on the original LLVM IR. After if-conversion, I have added some other VPlanTransforms, but those did not turn out to need any advanced analysis results (the algorithm for selecting recipes to widen/scalarize is purely VPlan-recipe based).

A VPlan-based cost model for deciding whether it is better to vectorize an inner loop or the outer one would be nice, but I have nothing to show on that front currently.

I guess by that you mean using the @llvm.vp.* intrinsics, which take a mask and an explicit vector length parameter? I thought about that, and my approach would have been to add an optional mask (and EVL) operand to VPWidenRecipe (and other, similar recipes). Then, in the execute() methods of the optionally masked widening recipes, the vectorizer could generate @llvm.vp.fadd instead of fadd, and so on. This would be independent of the VPlan-native path; the execute() methods are 100% shared. This is just an idea, though; I have no PoC, and I have not thought about the EVL parameter yet.
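
For illustration, the lane-wise semantics such a recipe would have to emit for @llvm.vp.fadd can be modeled in scalar code like this (in LLVM, lanes that are masked off or beyond the EVL yield poison; this sketch passes the first operand through instead, purely so it stays deterministic):

```c
#include <stdbool.h>

#define VF 4

/* Scalar model of @llvm.vp.fadd.v4f32: lane l is a[l] + b[l] when
   mask[l] is set and l < evl. LLVM defines the remaining lanes as
   poison; passing a[l] through here is only a modeling choice. */
void vp_fadd_model(const float *a, const float *b,
                   const bool *mask, int evl, float *out) {
    for (int l = 0; l < VF; l++)
        out[l] = (l < evl && mask[l]) ? a[l] + b[l] : a[l];
}
```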

Kind regards,
Lou

Hi Lou,

Thanks a lot for the insights.

The way I implemented more advanced features (if-conversion, tail-loop-folding, live-out values like reductions, …) that require more Analysis (not part of this RFC as those would be much more invasive changes) was mostly by extending the LoopVectorizationLegality (e.g. using DependenceInfo). The checks on memory accesses etc. are all done before I do the first VPlanTransform that modifies the CFG or invalidates IR-based analysis results (like if-conversion), so I just use the analysis results I got directly on the original LLVM-IR.

Oh, this is very interesting, because I was looking at the “journey” through VPlan transformations as “views” over the LLVM IR (so we can avoid modifying the LLVM IR until the last moment, when we execute the chosen plan).

But your comment makes me realise this approach may not be entirely workable, as transforms might change the plan in a way that makes keeping the relationship to the original IR difficult or impractical.

It is good to see that you didn’t need that for the later transformations, though. Thanks again for sharing your experience.

I guess by that you mean using the @llvm.vp.* Intrinsics which take a mask and a explicit vector length parameter? I thought about that, and my approach would have been to add a optional mask (and EVL) operand to the VPWidenRecipe (and other similar ones). Then, in the execute() methods of the optionally masked widening recipes, the vectorizer could generate @llvm.vp.fadd instead of fadd and so on. This would be independent of the VPlan-native path, the execute() methods are 100% shared. This is just an idea though, I have no PoC and I did not think about the EVL parameter yet.

Indeed, this is the ultimate goal. In our fork we added a bunch of recipes with an explicit vector length. Our approach works, but it may be a bit too invasive. In that sense, I believe @alexey-bataev 's approach in https://reviews.llvm.org/D99750 is more sustainable in the long term.

Kind regards,
Roger


cc: @alexey-bataev @idbaev