Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

Hi all,

Hi Vineet,

Thanks for sharing! I haven’t looked at the code yet, just read the README file you have and it has already answered a lot of questions that I initially had. Some general comments…

I’m very happy to see that Simon’s predication changes were useful to your work. It’s a nice validation of their work and hopefully will help SVE, too.

Your main approach to strip-mine + fuse tail loop is what I was going to propose for now. It matches well with the bite-sized approach VPlan has and could build on existing vector formats. For example, you always try to strip-mine (for scalable and non-scalable) and then, only for scalable, you try to fuse the scalar loops, which would improve the solution and give RVV/SVE an edge over the other extensions on the same hardware.

There were also proposals in the past to vectorise the tail loop, which could be a similar step. For example, in case the main vector body is 8-way or 16-way, the tail loop would be 7-way or 15-way, which is horribly inefficient. The idea was to further vectorise the 7-way as 4+2+1 ways, and the same for 15. If those loops are then unrolled, you end up with a nice scaling-down pattern. On scalable vectors, this becomes a no-op.
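As a rough illustration of the 4+2+1 idea, here is a minimal Python sketch that splits a scalar tail into power-of-two vector widths. The function name `tail_widths` is made up for this example, and it assumes the main vector width is a power of two and the remainder is smaller than that width; it is not taken from any of the proposals mentioned.

```python
def tail_widths(remainder, max_width):
    """Split `remainder` leftover iterations into power-of-two vector widths.

    Assumes `max_width` (the main vector body width) is a power of two
    and that `remainder` < `max_width`.
    """
    widths = []
    width = max_width // 2
    while width >= 1:
        if remainder >= width:
            widths.append(width)   # one vector step of this width
            remainder -= width
        width //= 2
    return widths


print(tail_widths(7, 8))    # [4, 2, 1]
print(tail_widths(15, 16))  # [8, 4, 2, 1]
print(tail_widths(5, 8))    # [4, 1]
```

With a scalable or length-controlled vector body, the remainder is handled by the last vector iteration itself, so this decomposition is never needed.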

There is a separate thread on the vectorisation cost model [1] which talks about some of the challenges there. I think we need to take scalable vectors into consideration when thinking about it.

The NEON vs RISCV register shadowing is interesting. It is true we mostly ignored 64-bit vectors in the vectoriser, but LLVM can still generate them with the (SLP) region vectoriser. IIRC, support for that kind of aliasing is not trivial (and why GCC’s description of NEON registers sucked for so long), but the motivation of register pressure inside hot loops is indeed important. I’m adding Arai Masaki in CC as this is something he was working on.

Otherwise, I think working with the current folks on VPlan and scalable extensions will be a good way to upstream all the ideas you guys had in your work.

Thanks!
–renato

[1] http://lists.llvm.org/pipermail/llvm-dev/2020-October/146236.html

Hi Renato,

Thanks a lot for your comments!

(more inline.)

Thanks and Regards,

Vineet

Simon’s vector predication ideas fit really nicely with our approach to predicated vectorization, especially the support for the EVL parameter. We look forward to more discussions around it.

While our implemented approach with tail folding and predication is guided by the research interests of the EPI project, I agree that for a more general implementation your proposed approach makes more sense for now, before moving on to better predication support and exploring other approaches.

Agreed. It would be very useful to think about a scalable-vector-aware cost model right from the beginning, now that there is effort already underway to integrate it into VPlan. There was also a discussion around it in the latest SVE/SVE2 sync-up meeting and I think almost everyone was in agreement.

That’s the plan!

Fold the epilog loop into the vector body.

  • This is done by setting the vector length in each iteration. This induces a predicate/mask over all the vector instructions of the loop (any other predicates/masks in the vector body are needed for control flow).
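To make the quoted README point concrete, here is a small Python model of a loop whose epilogue is folded into the vector body by choosing the vector length on every iteration. `VLMAX` and `folded_vector_add` are names invented for this sketch, and the inner `for` stands in for a single vector instruction; it only models the behaviour and is not the prototype's code.

```python
VLMAX = 8  # hypothetical maximum vector length, in elements


def folded_vector_add(a, b, vlmax=VLMAX):
    """Compute out[i] = a[i] + b[i] with no separate scalar tail loop."""
    n = len(a)
    out = [0] * n
    lengths = []
    i = 0
    while i < n:
        vl = min(vlmax, n - i)        # vector length set for this iteration
        lengths.append(vl)
        for lane in range(vl):        # lanes at or beyond vl stay inactive
            out[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return out, lengths


result, lengths = folded_vector_add(list(range(19)), [1] * 19)
print(lengths)  # [8, 8, 3]
```

The last iteration simply runs with a shorter vector length, which is what acts as the implicit predicate over every vector instruction in the body.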

That’s what we do for Arm MVE using the get.active.lane.mask intrinsic (*), which is emitted in the vectoriser. It generates a predicate that is used by the masked loads/stores. That’s the current state of the art; long term, that should indeed be using the VP intrinsics. I just wanted to point you at get.active.lane.mask, because it would also be nice to get confirmation that this works not only for fixed vectors but also for scalable vectors, which I think should be the case…

(*) https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics

Cheers,
Sjoerd.

Hi Sjoerd,

thanks for pointing us to this intrinsic.

I see it returns a mask/predicate type. My understanding is that VPred intrinsics have both a vector length operand and a mask operand. It looks to me that a “popcount” of get.active.lane.mask would correspond to the vector length operand. Then additional “control flow” mask of predicated code would correspond to the mask operand.

My interpretation was that get.active.lane.mask allowed targets that do not have a concept of vector length (such as SVE or MVE) to represent it as a mask. For those targets, the vector length operand can be given a value that means “use the whole register” and then only the mask operand is relevant to them.

But maybe my interpretation is wrong.
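The “popcount” reading above can be checked with a small Python model. It assumes the lane-mask semantics stated in the LangRef, where lane i is active when base + i < n, and that base does not exceed n; the helper names are made up for this sketch.

```python
def get_active_lane_mask(base, n, lanes):
    """Model of the LangRef semantics: %m[i] = icmp ult (%base + i), %n."""
    return [base + i < n for i in range(lanes)]


def popcount(mask):
    """Number of active lanes in the mask."""
    return sum(mask)


LANES = 8  # hypothetical fixed vector width
for base, n in [(0, 19), (8, 19), (16, 19)]:
    mask = get_active_lane_mask(base, n, LANES)
    # The number of active lanes matches the vector length a
    # length-controlled target would use for the same iteration.
    assert popcount(mask) == min(LANES, n - base)
    print(base, popcount(mask))
```

In this model the mask carries exactly the same information as a vector length, just in a form that targets without a length register can consume.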

@Simon: what is VE going to do here?

Kind regards,

Message from Sjoerd Meijer via llvm-dev <llvm-dev@lists.llvm.org> on Thu, 5 Nov 2020 at 10:00:

For RISC-V V and VE, being explicit about %evl is important for performance & correctness and that is what VP does.

The get.active.lane.mask intrinsic is used as a hint for the MVE, SVE backends to use hardware tail-predication (the backends reverse engineer that hint by pattern matching for get.active.lane.mask in the mask parameter of “some” masked intrinsics). IMHO, it’s more of a hot fix to get some tail-predication working quickly with the existing infrastructure. It is still useful by itself, e.g. the ExpandVPIntrinsic pass uses it to expand the %evl parameter in VP intrinsics for scalable vector types.

VE uses VP-style SDNodes in the isel layer (upstream patch on Phabricator to follow soon-ish). We simply translate both VP and regular SIMD SDNodes into these custom SDNodes as an intermediate layer. Even the VE machine instructions still have an explicit %evl operand. We have a machine function pass that inserts code to re-configure the VL register in-between vector instructions that have a different %evl value (we had a poster on that at the LLVM US DevMtg '19). This isel strategy has been working well for us.

The goal is to teach LV, VPlan to emit VP intrinsics with a convenient builder class (VPBuilder in the reference patch).

- Simon
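The VL re-configuration step described above could look roughly like the following Python sketch. It is a simplified model over one straight-line block, not the actual VE pass: the `setvl` opcode name, the tuple representation of instructions, and the function name are all assumptions made for illustration.

```python
def insert_vl_config(instructions):
    """Insert a ("setvl", evl) pseudo-instruction whenever %evl changes.

    `instructions` is a list of (opcode, evl) pairs for one basic block.
    """
    out = []
    current = None  # VL register contents are unknown at block entry
    for opcode, evl in instructions:
        if evl != current:
            out.append(("setvl", evl))  # hypothetical re-configuration op
            current = evl
        out.append((opcode, evl))
    return out


block = [("vld", "%a"), ("vfadd", "%a"), ("vst", "%b"), ("vmul", "%a")]
for inst in insert_vl_config(block):
    print(inst)
```

Consecutive instructions that share an %evl value need only one configuration, which is why keeping %evl explicit on each instruction until late in the pipeline is convenient.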

For RISC-V V and VE being explicit about %evl is important for performance & correctness and that is what VP does. The get.active.lane.mask intrinsic is used as a hint for the MVE, SVE backends to use hardware tail-predication (the backends reverse engineer that hint by pattern matching for get.active.lane.mask in the mask parameter of “some” masked intrinsics). IMHO, it’s more of a hot fix to get some tail-predication working quickly with the existing infrastructure. It is still useful by itself, eg the ExpandVPIntrinsic pass uses it to expand the %evl parameter in VP intrinsics for scalable vector types.

So I don’t think that makes it a hot fix, but agreed with the general picture here.

VE uses VP-style SDNodes in the isel layer (upstream patch on Phabricator to follow soon-ish). We simply translate both VP and regular SIMD SDNodes into these custom SDNodes as an intermediate layer. Even the VE machine instructions still have an explicit %evl operand. We have a machine function pass that inserts code to re-configure the VL register in-between vector instructions that have a different %evl value (we had a poster on that at the LLVM US DevMtg '19). This isel strategy has been working well for us.

The goal is to teach LV, VPlan to emit VP intrinsics with a convenient builder class (VPBuilder in the reference patch).

Trying to remember how everything fits together here, but could get.active.lane.mask not create the %mask of the VP intrinsics? Or in other words, in the vectoriser, who’s producing the %mask and %evl that is consumed by the VP intrinsics?

Cheers,
Sjoerd.

Hi Sjoerd,

Trying to remember how everything fits together here, but could get.active.lane.mask not create the %mask of the VP intrinsics? Or in other words, in the vectoriser, who’s producing the %mask and %evl that is consumed by the VP intrinsics?

I’m not sure what would be the best way here. I am thinking about the Loop Vectorizer. I imagine at some point we can teach LV to emit VPred for the widening. VPred IR needs two additional operands, as you mentioned, %evl and %mask.

One option is to make %evl the max-vector-length of the type being operated on and %mask (that is, the “outer block mask” in this context) be get.active.lane.mask. This maps well to SVE and MVE, but not so much to VE and RISC-V (I don’t think it is incorrect, but it is not an efficient thing to do). Perhaps VE and RISC-V can work in this scenario if at some point they replace the %evl with something like the “%n - %base” operands of get.active.lane.mask, and %mask (the outer block mask) is replaced with a splat of “i1 1”.

Another option here is to make “%n - %base” be the %evl (or at least an operand of some target hook, because “computing” the %evl is target-specific; targets without evl could compute the identity here) and %mask (the outer block mask) be a splat of “i1 1”. This maps well to VE and RISC-V but makes life harder for AVX-512, SVE and MVE (in general, any target where TargetTransformInfo::hasActiveVectorLength returns false). Those targets could replace the %evl with the max-vector-length of the operated type and then use get.active.lane.mask(0, %evl) as the outer block mask. My understanding is that Simon used this approach in https://reviews.llvm.org/D78203 but in a more general setting that would be independent of what the Loop Vectorizer does.

Looks to me the second option makes a more effective use of vpred and D78203 shows that we can always soften vpred into a shape that is reasonable for lowering in targets without active vector length.
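A small Python model may help compare the two options. It assumes that a lane of a VP operation is enabled only when its index is below %evl and its %mask bit is set, and that the lane mask follows base + i < n; all function names are invented for this sketch, and `soften` only imitates the idea of folding %evl into the mask rather than reproducing D78203.

```python
LANES = 8  # hypothetical max-vector-length of the operated type


def lane_mask(base, n, lanes=LANES):
    """Lane i is active when base + i < n."""
    return [base + i < n for i in range(lanes)]


def enabled(mask, evl):
    """Lanes that actually execute: index below evl and mask bit set."""
    return [i < evl and m for i, m in enumerate(mask)]


def option_mask(base, n):
    """Option 1: %evl covers the whole register, %mask does the predication."""
    return lane_mask(base, n), LANES


def option_evl(base, n):
    """Option 2: %mask is all-true, %evl does the predication."""
    return [True] * LANES, min(LANES, n - base)


def soften(mask, evl):
    """Fold %evl into %mask for a target without active vector length."""
    folded = [m and a for m, a in zip(mask, lane_mask(0, evl))]
    return folded, LANES


for base in (0, 8, 16):
    first = enabled(*option_mask(base, 19))
    second = enabled(*option_evl(base, 19))
    softened = enabled(*soften(*option_evl(base, 19)))
    assert first == second == softened
```

All three forms enable the same lanes in this model, so the choice between them is about which one each target can lower efficiently.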

Thoughts?

Kind regards,

Basically, we would extend TTI to let the targets choose how to use the %mask and %evl operands in the VP intrinsics. So, an ‘fadd’ would turn into an ‘llvm.vp.fadd’ for all predicating targets. However, whether get.active.lane.mask() is used for %mask or whether tail predication is done with a (splat i1 1) for the mask and setting %evl would be target dependent.

For VE, we set %evl = min(max_vector_width, %n - %base) … that’s the same idiom that the non-LLVM NEC compilers are emitting for tail predication.

Basically, the LV flow could look something like this:

The whole point about VP is to make sure there is one set of vector-predicated instructions/intrinsics that everybody is using while giving people the freedom to use these as it fits their targets. We can then concentrate on optimizing VP intrinsic code and all targets benefit.

- Simon

*: VE’s packed mode (512 x 32bit elements) is a use case for a non-trivial setting of %mask and %evl at the same time (%evl for packs of two 32bit elements (i.e. %evl must be even for 32bit lanes), %mask for masking out inside packages).
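One possible reading of the packed-mode footnote, sketched in Python: %evl is rounded up to an even number of 32-bit lanes so that it covers whole packs, and %mask switches off the unused half of the last pack. This is an interpretation for illustration only; the function name and the rounding rule are assumptions, not the VE backend's logic.

```python
PACKED_LANES = 512  # 32-bit elements per vector in packed mode (from the footnote)


def packed_predication(base, n, lanes=PACKED_LANES):
    """Return (mask, evl) for the iteration that starts at element `base`."""
    remaining = min(lanes, n - base)
    evl = remaining + (remaining % 2)             # keep %evl even: whole packs only
    mask = [i < remaining for i in range(lanes)]  # disable the odd leftover lane
    return mask, evl


mask, evl = packed_predication(0, 5)
active = sum(1 for i, m in enumerate(mask) if i < evl and m)
print(evl, active)  # 6 5
```

Here neither operand alone describes the active lanes: %evl selects three packs and %mask removes the sixth element.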

Agreed!

Hi Simon,

Looks to me the second option makes a more effective use of vpred and D78203 shows that we can always soften vpred into a shape that is reasonable for lowering in targets without active vector length.

The whole point about VP is to make sure there is one set of vector-predicated instructions/intrinsics that everybody is using while giving people the freedom to use these as it fits their targets. We can then concentrate on optimizing VP intrinsic code and all targets benefit.

This is even better than I imagined, then. Thanks for the examples and clarification.

Kind regards,

Hello Simon,

Thanks for your replies, very useful. And yes, thanks for the example and making the target differences clear:

; Some examples:
; RISC-V V & VE(*):
; %mask = (splat i1 1)
; %evl = min(256, %n - %i)
; MVE/SVE :
; %mask = get.active.lane.mask(%i, %n)
; %evl = call @llvm.vscale()
; AVX:
; %mask = icmp (%i + (seq <8 x i32> 0,1,2,...)), %n
; %evl = i32 8

Unless I miss something, the AVX example is semantically the same as get.active.lane.mask:

%m[i] = icmp ult (%base + i), %n

with i = 8.

I am just saying this to see if we can have “1 interface” for generating the mask (which is perhaps what I was expecting). If you just want an all-true mask for VE, and if we can merge AVX with the other 2, we just have:

; RISC-V V & VE(*):
; %mask = get.active.lane.mask(%i, %i)
; %evl = min(256, %n - %i)
; MVE/SVE/AVX :
; %mask = get.active.lane.mask(%i, %n)
; %evl = call @llvm.vscale()

I am not sure why MVE (or AVX) would need the vscale(). But if it does, I am wondering if it could be something like:

; RISC-V V & VE(*):
; %mask = get.active.lane.mask(%i, %i)
; %evl = call @llvm.vscale(256, %n - %i)
; MVE/SVE/AVX :
; %mask = get.active.lane.mask(%i, %n)
; %evl = call @llvm.vscale(… ,…)

Cheers,
Sjoerd.

Correct (llvm.get.active.lane.mask.v8i1.i32).

For VE, we want to do as much predication as possible through %evl and as little as possible with %mask. This has performance implications on VE and RISC-V - VE does not generate a mask from %evl but %evl is directly mapped to hardware, and passing the all-true mask is free.

So for VE, the %evl does all the predication and there is no reason to have anything other than a (splat i1 1) %mask here. On SVE/MVE you may want to use get.active.lane.mask instead and on RISC-V V, AFAIU, the %evl parameter will have to be computed by some RISC-V specific setvl intrinsic. Both of these are okay because VP gives you that flexibility.

The vscale is only necessary with scalable types, e.g. you can inactivate the %evl parameter like so:

The VPIntrinsic class upstream already has the functionality to check whether the %evl parameter is inactivated in this way.

- Simon

; RISC-V V & VE(*):
; %mask = get.active.lane.mask(%i, %i)
; %evl = min(256, %n - %i)
; MVE/SVE/AVX :
; %mask = get.active.lane.mask(%i, %n)
; %evl = call @llvm.vscale()

For VE, we want to do as much predication as possible through %evl and as little as possible with %mask. This has performance implications on VE and RISC-V - VE does not generate a mask from %evl but %evl is directly mapped to hardware, passing the all-true mask is free.

So for VE, the %evl does all the predication and there is no reason to have anything other than a (splat i1 1) %mask here.

Okay, got it. One way to look at this is that (splat i1 1) is just a special case of get.active.lane.mask, for example get.mask(%i, 0) can trivially be expanded/lowered to a (splat i1 1). This is not terribly important, but shows that get.active.lane.mask could be used for all targets I think; we don't need many cases. And kind of similarly, vscale can be a no-op or do something.

Cheers,
Sjoerd.

I disagree. It doesn’t make sense to substitute the constant splat with something complicated only to get get.active.lane.mask into the picture. - Simon