LV: predication

Hello,

We are working on predication for our vector extension (MVE). Since quite a few people are working on predication and different forms of it (e.g. SVE, RISC-V, NEC), I thought I would share what we would like to add to the loop vectoriser. Hopefully it is just a minor and non-intrusive addition, but it could be interesting and useful for others, and feedback on this is of course welcome.

TL;DR:

We would like the loop vectoriser to emit a new IR intrinsic for certain loops:

void @llvm.set.loop.elements.i32(i32)

This represents the number of data elements processed by a vector loop, and will be emitted in the preheader block of the vector loop after querying TTI that the backend understands this intrinsic and that it should be emitted for that loop. The vectoriser patch is available in D79100, and we pick this intrinsic up in the ARM backend here in D79175.

Context:

We are working on a form of predication that we call tail-predication: a vector hardware loop has an implicit form of predication that sets active/inactive lanes for the last iteration of the vector loop. Thus, the scalar epilogue loop (if there is one) is tail-folded and tail-predicated into the main vector body. To support this, we need to know the number of data elements processed by the loop, which is used in the setup of a tail-predicated vector loop. This new intrinsic communicates this information from the vectoriser to the codegen passes where we further lower these loops. In our case, we essentially use @llvm.set.loop.elements.i32 to record the trip count of the scalar loop, which represents the number of data elements processed. Thus, we let the vectoriser emit both the scalar and the vector loop trip counts.

Although it happens at a different stage of the optimisation pipeline, this is exactly what the generic HardwareLoops pass does to communicate its information to target-specific codegen passes; it emits a few intrinsics to mark a hardware loop. To illustrate this, and also the new intrinsic, here is the flow and life of a tail-predicated vector loop, using some heavily edited/reduced examples. First, the vectoriser emits the number of elements processed, and the loads/stores are masked because tail-folding is applied:

vector.ph:
  call void @llvm.set.loop.elements.i32(i32 %N)
  br label %vector.body

vector.body:
  call <4 x i32> @llvm.masked.load
  call <4 x i32> @llvm.masked.load
  call void @llvm.masked.store
  br i1 %12, label %.*, label %vector.body

After the HardwareLoops pass, this is transformed into the following, which adds the hardware loop intrinsics:

vector.ph:
  call void @llvm.set.loop.elements.i32(i32 %N)
  call void @llvm.set.loop.iterations.i32(i32 %5)
  br label %vector.body

vector.body:
  call <4 x i32> @llvm.masked.load
  call <4 x i32> @llvm.masked.load
  call void @llvm.masked.store
  call i32 @llvm.loop.decrement.reg
  br i1 %12, label %.*, label %vector.body

We then pick this up in our tail-predication pass, remove the @llvm.set.loop.elements intrinsic, and add @vctp, which is our intrinsic that generates the mask of active/inactive lanes:

vector.ph:
  call void @llvm.set.loop.iterations.i32(i32 %5)
  br label %vector.body

vector.body:
  call <4 x i1> @llvm.arm.mve.vctp32
  call <4 x i32> @llvm.masked.load
  call <4 x i32> @llvm.masked.load
  call void @llvm.masked.store
  call i32 @llvm.loop.decrement.reg
  br i1 %12, label %.*, label %vector.body

And this is then further lowered to a tail-predicated loop, or reverted to a 'normal' vector loop if some restrictions are not met.

Cheers,
Sjoerd.

The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don’t impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn’t related to anything optimizations would normally look for: it’s a random intrinsic in the middle of nowhere.

Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it from the masks used in the llvm.masked.load/store instructions in the loop.
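For reference, the mask the vectorizer computes for a tail-folded loop looks roughly like the reduced snippet below (a VF of 4 and illustrative value names assumed); the element count would have to be recovered by matching back from the compare that feeds the masked operations:

  %btc.splatinsert = insertelement <4 x i32> undef, i32 %btc, i32 0
  %btc.splat = shufflevector <4 x i32> %btc.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %iv.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %iv.splat = shufflevector <4 x i32> %iv.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %iv.splat, <i32 0, i32 1, i32 2, i32 3>
  %active.mask = icmp ule <4 x i32> %induction, %btc.splat
  %wide.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr, i32 4, <4 x i1> %active.mask, <4 x i32> undef)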

-Eli

Hi Eli,

> The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don’t impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn’t related to anything optimizations would normally look for: it’s a random intrinsic in the middle of nowhere.

I do see that point. But is that also not the beauty of it? It just sits in the preheader; if it gets removed, then so be it. And if it is not recognised, then also no harm done?

> Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it from the masks used in the llvm.masked.load/store instructions in the loop.

This is what we are currently doing, and it works excellently for the simpler cases. For the more complicated cases that we now want to handle as well, the pattern matching just becomes a bit too horrible, and it is fragile too. All we need is the information that the vectoriser already has, and to pass this on somehow.

As I am really keen to simplify our backend pass, would there be another way to pass this information on? If emitting an intrinsic is a blocker, could this be done with a loop annotation?

Cheers,
Sjoerd.

I agree. Requiring a loose intrinsic to have meaning in the CFG is a
non-starter. That's why we have loop annotations.

To me, this looks like what MLIR has in the Loop or Affine dialects.
It would be great to have that in LLVM as well, for the cases where we
know it from the front-end (for ex. when lowering from MLIR), but this
has to be a loop annotation of some form.

A pass that generates such annotations could run just before loop
optimisations, or it could come from the front-end and hope it doesn't
get removed by some pass in between.

Fortran is bound to have more semantically rich loop information, and
they'll use MLIR. It would be interesting to know how that will be
done, and you could get the ground work done beforehand by working
with them to carry the annotations in the right way.

cheers,
--renato

> Hi Eli,
>
>> The problem with your proposal, as written, is that the vectorizer is producing the intrinsic. Because we don't impose any ordering on optimizations before codegen, every optimization pass in LLVM would have to be taught to preserve any @llvm.set.loop.elements.i32 whenever it makes any change. This is completely impractical because the intrinsic isn't related to anything optimizations would normally look for: it's a random intrinsic in the middle of nowhere.
>
> I do see that point. But is that also not the beauty of it? It just sits in the preheader; if it gets removed, then so be it. And if it is not recognised, then also no harm done?

The harm comes if the intrinsic ends up with the wrong value, or attached to the wrong loop.

>> Probably the simplest path to get this working is to derive the number of elements in the backend (in HardwareLoops, or your tail predication pass). You should be able to figure it from the masks used in the llvm.masked.load/store instructions in the loop.
>
> This is what we are currently doing, and it works excellently for the simpler cases. For the more complicated cases that we now want to handle as well, the pattern matching just becomes a bit too horrible, and it is fragile too. All we need is the information that the vectoriser already has, and to pass this on somehow.
>
> As I am really keen to simplify our backend pass, would there be another way to pass this information on? If emitting an intrinsic is a blocker, could this be done with a loop annotation?

If the problem is specifically figuring out the underlying element count given a predicate, maybe we could attack it from that angle? For example, introduce a special intrinsic for deriving the mask (sort of like the SVE whilelo).
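For reference, the SVE counterpart exists as an AArch64 target intrinsic; simplified and with the type mangling omitted, it is used roughly like this, producing a predicate whose lane i is active while %index + i is unsigned-less-than %n:

  %p = call <vscale x 4 x i1> @llvm.aarch64.sve.whilelo(i64 %index, i64 %n)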

-Eli

> The harm comes if the intrinsic ends up with the wrong value, or attached to the wrong loop.

The intrinsic is marked as IntrNoDuplicate, so I wasn’t worried about it ending up somewhere else. Also, it is a property of a specific loop, a tail-folded vector loop, that holds even after it is transformed I think. I.e. unrolling a vector loop is probably not what you want, but even if you do the element count would remain the same. But yes, I agree that a future whacky optimisation on vector loops could invalidate this, which you can then skip but then you lose out on it… So, I really like this:

> If the problem is specifically figuring out the underlying element count given a predicate, maybe we could attack it from that angle? For example, introduce a special intrinsic for deriving the mask (sort of like the SVE whilelo).

That would be an excellent way of doing it, and it would also map very well to MVE, where we have a VCTP intrinsic/instruction that creates the mask/predicate (Vector Create Tail-Predicate). So I will go for this approach. Such an intrinsic was actually also proposed in Sam’s original RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html), but we hadn’t implemented it yet. This intrinsic will probably look something like this:

@llvm.loop.get.active.mask(AnyInt, AnyInt)

It produces a predicate based on its two arguments, the number of elements and the vector trip count, and it will be used by the masked load/store instructions in the vector body. I will start drafting an implementation for this and continue with this in D79100.
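As a rough sketch of how it would be consumed in the vector body (operands as described above, type mangling omitted, a VF of 4 assumed; the exact shape is still being worked out in D79100):

vector.body:
  %active.mask = call <4 x i1> @llvm.loop.get.active.mask(i32 %elems, i32 %vector.trip.count)
  %wide.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr, i32 4, <4 x i1> %active.mask, <4 x i32> undef)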

Thanks,
Sjoerd.

>> The harm comes if the intrinsic ends up with the wrong value, or attached to the wrong loop.
>
> The intrinsic is marked as IntrNoDuplicate, so I wasn't worried about it ending up somewhere else. Also, it is a property of a specific loop, a tail-folded vector loop, that holds even after it is transformed I think. I.e. unrolling a vector loop is probably not what you want, but even if you do the element count would remain the same. But yes, I agree that a future whacky optimisation on vector loops could invalidate this, which you can then skip but then you lose out on it.... So, I really like this:

This approach really doesn't work. Not unless you're willing to impose legality restrictions on optimization passes to preserve the information.

It's helpful to think of the optimizer as being adversarial. The question is not "will the optimizer break this?"; it's "can a malicious optimizer break this?". Unless you can reason from the spec (LangRef) that the answer is no, then the answer is yes.

In your particular example, consider what might happen if loop fission runs on your vectorized loop, and we then recognize that iterations N through M of the first loop (after fission) were nops and split it into two loops over narrower ranges. You'd have real trouble matching your intrinsic to anything meaningful in the backend, and getting it wrong would be a correctness bug.
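To make that concrete with a reduced illustration (hypothetical fission output, block names made up): after the split, it is unclear which loop the preheader intrinsic describes, and whether its value is still accurate for either:

vector.ph:
  call void @llvm.set.loop.elements.i32(i32 %N)   ; which of the two loops below does %N describe now?
  br label %loop1.body

loop1.body:                                       ; first loop after fission
  ...
  br i1 %done1, label %loop2.ph, label %loop1.body

loop2.ph:                                         ; no element count attached to this loop at all
  br label %loop2.body

loop2.body:                                       ; second loop after fission
  ...
  br i1 %done2, label %exit, label %loop2.body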

Hi Sjoerd,

> That would be an excellent way of doing it, and it would also map very well to MVE, where we have a VCTP intrinsic/instruction that creates the mask/predicate (Vector Create Tail-Predicate). So I will go for this approach. Such an intrinsic was actually also proposed in Sam’s original RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html), but we hadn’t implemented it yet. This intrinsic will probably look something like this:
>
> @llvm.loop.get.active.mask(AnyInt, AnyInt)
>
> It produces a predicate based on its two arguments, the number of elements and the vector trip count, and it will be used by the masked load/store instructions in the vector body. I will start drafting an implementation for this and continue with this in D79100.

I’m curious about this, because this looks to me very similar to the code that -prefer-predicate-over-epilog is already emitting for the “outer mask” of a tail-folded loop.

The following code

void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
  #pragma clang loop vectorize(enable) interleave(disable)
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }
}

compiled with clang --target=x86_64 -mavx512f -mllvm -prefer-predicate-over-epilog -emit-llvm -O2 emits the following IR

vector.body: ; preds = %vector.body, %for.body.preheader.new
%index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
%niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
%broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
%broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
%induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
%4 = getelementptr inbounds i32, i32* %b, i64 %index
%5 = icmp ule <16 x i64> %induction, %broadcast.splat

%wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, <16 x i1> %5, <16 x i32> undef), !tbaa !2

I understand %5 is not the same as what your proposed llvm.loop.get.active.mask would compute, is that correct? Can you elaborate on the difference here?

Thanks a lot,

Roger

Hi Roger,

That’s a good example that shows most of the moving parts involved here. In a nutshell, the difference, and what we would like to make explicit, is the vector trip count versus the scalar loop trip count. In your IR example, the loads/stores are predicated on a mask that is calculated from a splat induction variable, which is compared with the vector trip count. Illustrated with your example simplified, and with some pseudo-code: if we tail-fold and vectorize this scalar loop:

  for i = 0 to 10
    a[i] = b[i] + c[i];

the vector loop trip count is rounded up to 12, the next multiple of 4, and lanes are predicated on i < 10:

  for i = 0 to 12
    a[i:4] = b[i:4] + c[i:4], if i < 10;

what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:

  hwloop 10
    a[i:4] = b[i:4] + c[i:4]

This is implicit since the instructions don’t produce/consume a mask, but it is generated and used under the hood by the “hwloop” construct. Your observation that the information in the IR is mostly there is correct, but rather than pattern matching and reconstructing this in the backend, we would like to make it explicit. In this example, the scalar iteration count 10 is the number of elements processed by this loop, which is what we want to pass on from the vectoriser to the backend passes.
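Tying this back to IR, the tail-folded vector body for this example is along the lines of the reduced sketch below (illustrative names; %induction and %btc.splat are the splat induction vector and splat backedge-taken count, as in your example). The scalar element count 10, equivalently the backedge-taken count 9 used in the compare, is exactly the value we want to hand on:

vector.body:
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %active.mask = icmp ule <4 x i32> %induction, %btc.splat          ; lanes where i < 10
  %b.val = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %b.addr, i32 4, <4 x i1> %active.mask, <4 x i32> undef)
  %c.val = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %c.addr, i32 4, <4 x i1> %active.mask, <4 x i32> undef)
  %sum = add <4 x i32> %b.val, %c.val
  call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %sum, <4 x i32>* %a.addr, i32 4, <4 x i1> %active.mask)
  %index.next = add i32 %index, 4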

Hope this helps.
Cheers,
Sjoerd.

Hi Sjoerd,

thanks a lot for the clarification. Makes sense.

Kind regards,


Why couldn’t you use VP intrinsics and scalable types for this? I see three issues with the llvm.set.loop.elements approach:

1) It is conceptually broken: as others have pointed out, optimization can move the intrinsic around, since the intrinsic doesn’t have any dependencies that would naturally keep it in place.

2) The whole proposed set of intrinsics is vendor specific: this causes fragmentation, and I don’t see why we would want to emit vendor-specific intrinsics in a generic optimization pass. Soon we would see reports à la “your optimization caused regressions for MVE - add a check that the transformation must not touch llvm.set.loop.* or llvm.active.mask intrinsics when compiling for MVE…”. I doubt that you would tolerate it if that intrinsic were somehow removed in performance-critical code, which would then remain scalar as a result… so, I do not see the “beauty of the approach”.

3) We need a reliable solution to properly support vector ISAs such as the RISC-V V extension and SX-Aurora, and also MVE… I don’t see that reliability in this proposal.

If, for whatever reason, the above does not work and seems too far away from your proposal, here is another idea to make more explicit hwloops work with the VP intrinsics - in a way that does not break with optimizations: note that the way VP intrinsics are designed, it is not possible to break this code by hoisting the VP calls out of the loop: passing “%evl >= the operation’s vector size” constitutes UB (see ). We can use attributes to do the same for sinking (e.g. don’t move VP across hwloop.decrement).

- Simon
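For illustration, with VP intrinsics a reduced version of the earlier example would carry the mask and %evl on every operation in the body (intrinsic names and signatures follow the VP proposal/D78203 at the time of writing, so the exact form may differ):

vector.body:
  ; %evl = the number of elements still to be processed in this iteration
  %b.val = call <4 x i32> @llvm.vp.load.v4i32.p0v4i32(<4 x i32>* %b.addr, <4 x i1> %m, i32 %evl)
  %c.val = call <4 x i32> @llvm.vp.load.v4i32.p0v4i32(<4 x i32>* %c.addr, <4 x i1> %m, i32 %evl)
  %sum = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %b.val, <4 x i32> %c.val, <4 x i1> %m, i32 %evl)
  call void @llvm.vp.store.v4i32.p0v4i32(<4 x i32> %sum, <4 x i32>* %a.addr, <4 x i1> %m, i32 %evl)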

Hi,
I abandoned that approach and followed Eli’s suggestion (see earlier in this thread): we now emit an intrinsic that represents/calculates the active mask. I’ve just uploaded a new revision of D79100 that implements this.
Cheers.

You have similar problems with https://reviews.llvm.org/D79100. Since there are no masked operations, except for load/store… how are LLVM optimizations supposed to know that they must not hoist/sink operations with side-effects out of the hwloop? These operations have an implicit dependence on the iteration variable.

What will you do if there are no masked intrinsics in the hwloop body? This can happen once you generate vector code beyond trivial loops or have a vector IR generator other than LV.

And I am curious why you couldn’t use the %evl parameter of VP intrinsics to get the tail predication you are interested in?

- Simon

> You have similar problems with https://reviews.llvm.org/D79100

The new revision of D79100 solves your comment 1), and I don’t think your comments 2) and 3) apply, as there are no vendor-specific intrinsics involved at all here. Just to quickly discuss the optimisation pipeline: D79100 is a small extension of the vectoriser, and nothing here is related to hardware loops or target-specific constructs. The vectoriser tail-folds the loop and creates masked loads/stores; this is existing functionality, and nothing has changed here. The generic hardware loop codegen pass inserts hardware loop intrinsics. Very late in the pipeline, e.g. in the PPC and ARM backends, this is picked up and turned into an actual hardware loop, in our case possibly predicated, or it is reverted.

> What will you do if there are no masked intrinsics in the hwloop body?

Nothing. I.e., it can become a hardware loop, but not one with implicit predication.

> And I am curious why you couldn’t use the %evl parameter of VP intrinsics to get the tail predication you are interested in?

In D79100, the get.active.mask intrinsic makes the backedge-taken count of the scalar loop explicit. I will look again, but I don’t think the VP intrinsics were able to provide this. But to be honest, I have no preference at all about what this intrinsic is; that is not relevant, as long as we can make this explicit.

Cheers.

Thanks for explaining it (possibly once again). I wasn’t aware that this will also be used for PPC. Point 3) still stands.

Are all vector instructions in the hwloop implicitly predicated or only the masked load/store ops? If not, then the issue is that the predicate parameter of masked load/store basically affects the semantics of all other vector ops in the loop that do not have an explicit mask parameter: VP intrinsics explicitly make every vector instruction in the loop dependent on the %evl. You would have:

My previous mail had an example of how %evl could be tied to the scalar trip count. Re-posting that here:

- Simon

Hi Simon,

Thanks for reposting the example. Looking at it more carefully, I think it is very similar to my first proposal. This was met with some resistance here because it dumps loop information in the vector preheader. Doing it this early (we want to emit this in the vectoriser) puts a restriction on (future) optimisations that transform vector loops to honour/update/support this intrinsic and loop information. In D79100, it is an integral part of the vector body and has proper semantics (I will update it today), and thus doesn’t have these disadvantages. Also, the vectoriser isn’t using the VP intrinsics yet, so using them is a bridge too far for me at this point. But we should definitely re-evaluate at some point whether we should use or transition to them in our backend passes.

> Are all vector instructions in the hwloop implicitly predicated or only the masked load/store ops?

In a nutshell, when a vector loop with (explicitly) predicated masked loads/stores hits the backend, we translate the generic intrinsic get.active.mask to a target-specific one. All predication remains explicit, and this remains the case. Only at the very end do we use this intrinsic to instruction-select a specific variant of the hardware loop with some implicit predication.
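Schematically, and just to show the direction of that final step (operand details omitted, names illustrative), the rewrite in the ARM backend is roughly:

  %active.mask = call <4 x i1> @llvm.get.active.mask( ... )
    -->
  %active.mask = call <4 x i1> @llvm.arm.mve.vctp32(i32 %elts.remaining)

where VCTP sets the lanes for the elements that still need to be processed, and the masked loads/stores keep using %active.mask unchanged.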

Cheers,
Sjoerd.

The difference is that in the VP version there is an explicit dependence of every vector operation in the loop on the intrinsic. This dependence is obscured in the hwloop proposals (more on that below).

I understand that you are looking to get hwloops working quickly somehow - but any proposal should be designed in a forward-looking way or we could get stuck in a place that is hard to get out of. I am looking forward to seeing the semantics for this spelled out.

I’d very much like to see LV use VP intrinsics. I invite everybody to collaborate on VP to make it functional and useful quickly! Specifically, I am hoping we can collaborate on masked reduction intrinsics and implement them in the VP namespace. There is also the VP expansion pass on Phabricator right now (D78203 - it says ‘work-in-progress’ in the summary, which probably was a mistake: this is the real thing).

I do not see an answer to my question here. If the vectorized loop, prepared for hwloop, looks like this: will the sdiv execute with implicit hwloop predication? It makes no difference to the semantics of the intrinsic at which point you lower it, but how.

- Simon

Hi Simon,

> I’d very much like to see LV use VP intrinsics. I invite everybody to collaborate on VP to make it functional and useful quickly! Specifically, I am hoping we can collaborate on masked reduction intrinsics and implement them in the VP namespace. There is also the VP expansion pass on Phabricator right now (D78203 - it says ‘work-in-progress’ in the summary, which probably was a mistake: this is the real thing).

At BSC, my colleague Vineet (in CC) has been working for the last few months on making LV use VPred intrinsics via new VPRecipes so we can target the RISC-V V extension. It is still early work and a few things are not correct with respect to VPred semantics yet (or they go and crash our backend :). So far it seems promising. We haven’t posted anything upstream yet because we still want to address the problems encountered. We’d be happy to collaborate in this area, though.

You can see an example here https://repo.hca.bsc.es/epic/z/MJGQKC

Kind regards,

Invitation accepted, I am happy to help out with reviews, like I did with the previous VP patches.

And of course agreed that things should be well defined, and that we shouldn’t paint ourselves into a corner, but I don’t think that this is the case. And it’s not that I am in a rush, but I don’t think this change needs to be predicated on a big change landing first, like the LV switching to VP intrinsics.

> The difference is that in the VP version there is an explicit dependence of every vector operation in the loop on the set.num.elements intrinsic. This dependence is obscured in the hwloop proposals (more on that below).

This discussion is getting complicated, because I think we are discussing 3 topics at the same time now: predication, hardware loops, and a new set of intrinsics, the VP intrinsics.
For the change that kicked off this thread, i.e. 1 new intrinsic to get the active lanes, I think we can eliminate the hardware loops from this story. For us, that is just the context of this, and so I think we can just focus on predication. And if we only talk about predication, I think this new intrinsic can nicely coexist with the VP intrinsics.

And please note again I am not proposing a set.num.elements intrinsic. Well, I first kind of did, but again, abandoned that approach after push back. Correct me if I am wrong, but there’s no difference in your example whether all instructions consume some predicate or only masked loads/stores:

vector.preheader:
  %init.evl = call i32 @llvm.hwloop.set.elements(i32 %n)

vector.body:
  %evl = phi i32 [ %init.evl, %vector.preheader ], [ %next.evl, %vector.body ]
  %aval = call @llvm.vp.load(Aptr, .., %evl)
  call @llvm.vp.store(Bptr, %aval, ..., %evl)
  %next.evl = call i32 @llvm.hwloop.decrement(i32 %evl)

No difference in that the problem remains that we have a random intrinsic sitting in the preheader describing a loop property that needs to be maintained.

So, eliminating the hardware loops and the intrinsic that defines the number of elements processed, I am proposing:

vector.body:
  %mask = call @llvm.get.active.lane.mask(%IV, %BTC)
  .. = call @llvm.masked.load(.., %mask)

where IV is the induction step and BTC the backedge-taken count.

This completely piggy-backs on everything that is already there in the vectoriser, and nothing is fundamentally changed here. Now, this seems very generic, and it doesn't seem to conflict with the VP intrinsics.

Cheers,
Sjoerd.

That’s great! Ok.

My question (the example at the end) was asking whether hwloops imply predication (and by that I mean logically - if the hwloop implies that a SIMD instruction may not execute for all lanes in the tail, then that is predication as well).

Yes, and that is the point: it’s about making the SIMD instructions dependent on the mask … and all of them. The difference is that the intrinsic is connected to every SIMD instruction in the vector loop through data flow. It does not just sit there… in fact it does not matter where it is placed, as long as those def-use edges are visible to the hwloop transformation.

I see it the other way round: right now you seem to have an implicit dependence from syntactically unmasked SIMD instructions (e.g. a regular SIMD sdiv) on the predicate of nearby masked intrinsics (masked.load) - that’s on shaky ground semantically. VP intrinsics already define clean semantics for tail predication - so why not piggyback on that? You should define the hwloop support in a way that will not just peacefully coexist with VP but leverage it eventually. I’ll continue in that direction in the review.

One specific request (since I have got your attention now :wink: ): we need a (generic) IR primitive to express %lane_id < %n for scalable vector types, to expand VP intrinsics for targets with SVE support but no tail predication.

- Simon