What is the status of VPlan integration with loop vectorization?
I am definitely not the most competent person to answer this, so don’t expect everything I write to be completely correct. I have not contributed to any of this (not yet, hopefully):
When the LoopVectorizationPlanner builds a VPlan, it makes all decisions about which instructions will be replicated/uniform/widened before building the first VPlan. It does use VPlan to model the result of its decisions. I can’t think of any vectorization decision not represented in the VPlan.
The LoopVectorizer uses VPlan Recipes for applying/“executing” vectorizations. This seems complete to me. Some recipes’ execute methods call the methods of the (Inner)LoopVectorizer, but as an external observer I do not see that as a problem.
The cost-model runs on LLVM-IR, not VPlans! This means that one cannot look at the VPlan to estimate costs, but instead must use the complicated decision tracking of the LoopVectorizationPlanner to know which instructions stay scalar, are replicated, are widened, etc.
So all in all, except for the cost model, I’d say (again, as someone not involved in active development) it’s mostly done.
Note that separate from general VPlan integration into the LoopVectorizer, there also is the “VPlan native-path” recently mentioned here. It is not enabled by default though, very restricted, and generates bad and possibly illegal (no legality checks) code. I started working on improvements to it, but have not submitted any patches yet as it’s still a work in progress.
Thank you for the response,
So as far as I know, the VPlan model produces vectorized/modified LLVM-IR with some VPlan-based opcodes. And it’s developed in collaboration with Intel, so is it compatible with other machines or only with Intel? Please correct me if I’m wrong, as I’m new to this topic.
Yeah, I have tried using this patch but didn’t observe any improvement.
VPlan is target neutral. It should work with any target that supports vector registers.
No, sorry if I did not make that clear: LLVM-IR is not modified until a VPlan is applied/executed. The LoopVectorizationPlanner just collects metadata about every LLVM-IR instruction so that, when it builds the VPlan, a widening (or uniform, …) “Recipe” (Recipes are kind of the instructions of a VPlan) is used for it. Not modifying LLVM-IR while deciding how to vectorize is useful because if the cost-model decides that vectorization is not profitable, the LLVM-IR is unchanged.
And as already mentioned, VPlan is target independent. It must be, as the LoopVectorizer always uses VPlan; it’s not an optional thing.
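To make the plan-then-execute split described above concrete, here is a minimal toy sketch. None of these types are LLVM classes: `ToyRecipe`, `ToyPlan`, `buildPlan`, `executePlan` and `vectorizeIfProfitable` are hypothetical names, the "instructions" are plain strings, and the rule for picking a recipe kind is invented for illustration. It only shows that planning reads the scalar code without touching it, and that new code is produced only when a plan is executed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins, NOT LLVM classes. A recipe records what should happen to
// one scalar instruction once the plan is executed.
enum class RecipeKind { Widen, Replicate, Uniform };

struct ToyRecipe {
  RecipeKind kind;
  std::string scalarOp; // the scalar instruction this recipe was built from
};

struct ToyPlan {
  unsigned vf; // vectorization factor this plan models
  std::vector<ToyRecipe> recipes;
};

// Planning only reads the scalar code and records decisions.
ToyPlan buildPlan(const std::vector<std::string> &scalarCode, unsigned vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  ToyPlan plan{vf, {}};
  for (const std::string &op : scalarCode) {
    // Hypothetical rule: address computations stay uniform, calls are
    // replicated per lane, everything else is widened.
    RecipeKind kind = RecipeKind::Widen;
    if (op == "gep")
      kind = RecipeKind::Uniform;
    else if (op == "call")
      kind = RecipeKind::Replicate;
    plan.recipes.push_back({kind, op});
  }
  return plan;
}

// Executing the plan is the only step that produces new code.
std::vector<std::string> executePlan(const ToyPlan &plan) {
  std::vector<std::string> out;
  for (const ToyRecipe &r : plan.recipes) {
    switch (r.kind) {
    case RecipeKind::Widen: // one vector instruction
      out.push_back(r.scalarOp + ".v" + std::to_string(plan.vf));
      break;
    case RecipeKind::Replicate: // one scalar copy per lane
      for (unsigned lane = 0; lane < plan.vf; ++lane)
        out.push_back(r.scalarOp + ".lane" + std::to_string(lane));
      break;
    case RecipeKind::Uniform: // a single scalar instruction for all lanes
      out.push_back(r.scalarOp);
      break;
    }
  }
  return out;
}

// The profitability verdict comes from outside the plan in this sketch; if
// it is negative, the scalar code is returned unchanged.
std::vector<std::string>
vectorizeIfProfitable(const std::vector<std::string> &scalarCode, unsigned vf,
                      bool profitable) {
  ToyPlan plan = buildPlan(scalarCode, vf);
  return profitable ? executePlan(plan) : scalarCode;
}
```

Calling `vectorizeIfProfitable` with `profitable == false` hands back the input untouched, which mirrors the point that nothing is modified when vectorization is rejected.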
Thanks @iamlouk for a great summary!
Just a few minor additions.
For the inner vector loop, I think almost all decisions are materialized directly in the plans (there’s been some good progress in that direction over the last year or so).
What’s not yet modeled completely is the scaffolding around the vector loop, like runtime checks or code to compute the final reduction value after the vector loop. This is actively being worked on.
At the moment, legality checks are also done mostly on the LLVM IR; but there is also work in progress to start moving parts of legality checking to VPlan directly.
Thank you for the clarification…
I’ve been checking the VPlan native path, but for integer and long data types I’m not observing any performance boost despite some changes in the assembly (more instructions, more z registers when SVE is used, more uzp1 instructions for interleaved accesses, etc.). With single and double precision data types, however, a decent performance improvement can be seen. I’m testing with a matrix multiplication program using different data types on a local ARM cluster, and I’m inserting a compiler directive on top of the nested loops.
Regarding outer loop vectorization, I think it’s still ignoring the outer loop before vectorization. To confirm this, I used the -fsave-optimization-record flag, and it shows that the outer loop is deleted. Is that because VPlan is not taking part in cost-modelling?
I recently started looking into all of this. Any suggestions or directions are appreciated, and I also want to help with the VPlan native path if possible.
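For readers trying to reproduce this kind of experiment, a minimal sketch of an annotated matrix multiplication might look like the following. This is an assumption about the setup, not the poster's actual program: the function name `matmul`, the loop order and the `vectorize_width(4)` value are hypothetical. The hint sits on the `j` loop, which is an outer loop relative to `k`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C += A * B for square N x N row-major matrices.
void matmul(const std::vector<float> &A, const std::vector<float> &B,
            std::vector<float> &C, std::size_t N) {
  assert(A.size() == N * N && B.size() == N * N && C.size() == N * N);
  for (std::size_t i = 0; i < N; ++i) {
    // Vectorization hint for the j loop; only emitted for Clang so that
    // other compilers do not warn about an unknown pragma.
#ifdef __clang__
#pragma clang loop vectorize(enable) vectorize_width(4)
#endif
    for (std::size_t j = 0; j < N; ++j) {
      for (std::size_t k = 0; k < N; ++k)
        C[i * N + j] += A[i * N + k] * B[k * N + j];
    }
  }
}
```

With Clang, the native path is switched on with `-mllvm -enable-vplan-native-path`, and `-Rpass=loop-vectorize -Rpass-missed=loop-vectorize` prints remarks about which loops were or were not vectorized. Whether this particular loop nest is accepted by the native path depends on the LLVM version, so treat the hint as a starting point for experiments.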
Hi @Shamanth,
A big problem with the current outer-loop vectorization is that it generates gather/scatter loads/stores for all memory accesses, even uniform or contiguous ones. Because the address operand of a gather/scatter is itself a vector, the address calculation is also done using vectors, and that’s why you see a lot more vector registers being used (even though they are not needed).
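To illustrate the distinction, here is a toy classifier that is independent of LLVM. It assumes each element index is an affine function of the loop counters `i`, `j`, `k` and that the `j` loop is the one being vectorized; `AffineIndex`, `AccessKind`, `classifyForVectorizedJ` and `matmulAccesses` are hypothetical names. An access whose index does not depend on `j` is uniform across lanes, a coefficient of 1 means consecutive lanes touch consecutive elements, and only the remaining cases would need a gather/scatter.

```cpp
#include <cassert>
#include <cstdint>

enum class AccessKind { Uniform, Consecutive, Gather };

// Element index modeled as coeffI * i + coeffJ * j + coeffK * k.
struct AffineIndex {
  std::int64_t coeffI, coeffJ, coeffK;
};

// Classify an access with respect to the vectorized loop counter j.
AccessKind classifyForVectorizedJ(const AffineIndex &idx) {
  if (idx.coeffJ == 0)
    return AccessKind::Uniform; // same address in every lane: scalar access
  if (idx.coeffJ == 1)
    return AccessKind::Consecutive; // adjacent lanes, adjacent elements
  return AccessKind::Gather; // any other stride in this toy model
}

// The three accesses of C[i*n + j] += A[i*n + k] * B[k*n + j].
struct MatmulAccesses {
  AffineIndex a, b, c;
};

MatmulAccesses matmulAccesses(std::int64_t n) {
  assert(n > 1 && "row stride must differ from the unit stride");
  return {{n, 0, 1}, {0, 1, n}, {n, 1, 0}};
}
```

In this model, none of the three matrix multiplication accesses needs a gather when `j` is vectorized: `A` is uniform, while `B` and `C` are consecutive. A transposed access such as `B[j*n + k]` would fall into the gather case.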
Give me two to three more weeks and I can submit a patch for that (if maybe @fhahn would be willing to review it or recommend a reviewer?). It would be quite a heavy one though, as it also necessitates adding a new non-widening PHI recipe (for the induction variable of the inner loop, which can stay scalar as it is only used for accessing consecutive/uniform memory in matrix multiplication), basically a complete rewrite of VPlanTransforms::VPInstructionsToVPRecipes(), and modifications in LoopAccessAnalysis (which is currently disabled via an assert for non-innermost loops) so that functions like LoopVectorizationLegality::isConsecutivePtr can be used. (Alternatively, ScalarEvolution could be used directly; would that be preferred?) Maybe it would be better if I posted an NFC first, detailing the approach, and separated all of this into several smaller patches?
Advice would be very much appreciated!
Once you go to more complex outer loop vectorisation, are you going to reimplement SCEV and CSE on the VPlan ISA?
I do not currently see the need to re-implement SCEV; one can use regular Scalar Evolution on the “underlying value” (an llvm::Value) of a VPValue. Of course, that SE will still see the loop as it was before vectorization (where a canonical IV increases by 1 instead of UF * VF), and some VPValues do not have an underlying llvm::Value, so maybe someone more involved like fhahn disagrees (especially as he mentioned some legality checks will move to VPlan). I managed to get a decent PoC with what is already there.
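The difference between the two views of the induction variable can be sketched with a small standalone example. This is plain C++ rather than LLVM code, and `scalarLoopIndices` and `vectorLoopIndices` are hypothetical helpers: the first steps by 1 like the original loop, the second steps by `UF * VF` and expands each step into parts and lanes, followed by a scalar remainder loop.

```cpp
#include <cassert>
#include <vector>

// Indices visited by the original scalar loop: the IV steps by 1.
std::vector<unsigned> scalarLoopIndices(unsigned n) {
  std::vector<unsigned> out;
  for (unsigned iv = 0; iv < n; ++iv)
    out.push_back(iv);
  return out;
}

// Indices visited after vectorizing by vf and unrolling (interleaving) by
// uf: the canonical IV steps by uf * vf, and each step covers uf parts of
// vf lanes. Leftover iterations run in a scalar remainder loop.
std::vector<unsigned> vectorLoopIndices(unsigned n, unsigned vf, unsigned uf) {
  assert(vf > 0 && uf > 0 && "factors must be positive");
  std::vector<unsigned> out;
  const unsigned step = vf * uf;
  const unsigned vectorTripEnd = n - n % step;
  unsigned iv = 0;
  for (; iv < vectorTripEnd; iv += step)
    for (unsigned part = 0; part < uf; ++part)
      for (unsigned lane = 0; lane < vf; ++lane)
        out.push_back(iv + part * vf + lane);
  for (; iv < n; ++iv) // scalar remainder
    out.push_back(iv);
  return out;
}
```

Both functions return the same sequence of indices, which is why an analysis that reasons about the pre-vectorization loop can still describe the memory touched by the vector loop, as long as the step difference is accounted for.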
I always thought that VPlan, once it has read the IR, keeps no references to it, performs optimizations, and writes back to IR, so that there are no references to IR in the optimization phase. Do you need such tools in that phase?
If I’m not wrong, the PHI nodes in VPlanTransforms::VPInstructionsToVPRecipes() have something to do with induction variables, right? So are you planning to modify that?
Again, I am no expert, but:
You are mostly right, VPlan is very careful not to change the underlying IR, but it does keep a reference to it (see VPRecipeBase::getUnderlyingInstr() for example). Not all VPlan recipes have an underlying instruction though (most notably those introduced during VPlan transformations).
On a side note: As LoopAccessAnalysis cannot be used to check legality of outer-loop vectorization, I experimented with using LLVM’s DependenceAnalysis (which supports nested loops very well). It does not support Predicated SE or runtime assumptions/runtime pointer checking though. If you are experienced in PSE, this might be a topic that could use your help.
Yes, the recipes for induction variables are special PHI nodes (well, there can also be derived IVs where the IV is not a PHI directly, but the vplan-native path does not use these as of now).
You just highlighted the challenge. DependenceAnalysis works on LLVM IR and not on VPlan ISA. If you need dependence analysis for outer-loop vectorisation after transforms, you would need to re-implement some kind of dependence analysis on VPlan ISA.
I still don’t understand the reason for this. Even after applying the VPlan patch, outermost-loop vectorization is not happening. Any views?