Vector.create_mask for scalable vectors

Hi all,

I’m trying to figure out the best way to generate masks for scalable vectors. Right now, something like:

    %0 = vector.create_mask %n : vector<4xi1>

Lowers to:

    %cst = arith.constant dense<[0, 1, 2, 3]> : vector<4xi32>
    %0 = arith.index_cast %n : index to i32
    %1 = splat %0 : vector<4xi32>
    %2 = arith.cmpi slt, %cst, %1 : vector<4xi32>

Which is very much vector-length-dependent. I’ve created a patch ([mlir][Vector] Enable create_mask for scalable vectors) that adds an alternative lowering, based on the llvm.get.active.lane.mask.* intrinsics, for cases where the mask needs to be scalable:

    %0 = llvm.mlir.constant(0 : i32) : i32
    %1 = arith.index_cast %n : index to i32
    %2 = llvm.intr.get.active.lane.mask %0, %1 : i32, i32 to vector<[4]xi1>

I’m finding a couple of issues with this approach, and I’d like some feedback and/or ideas.

First, I’m creating a dependency between Vector and LLVMIR. All the other operations that depend on LLVMIR live in VectorToLLVM conversions, which makes sense. I see two solutions to this problem:

  1. Move VectorCreateMaskOpConversion with all the other lowerings that target LLVM specifically. I don’t quite like that because, as it is, this conversion doesn’t need to depend on LLVMIR for fixed-length vectors. Now, I’m not sure if there is any current or planned consumer of this behaviour.
  2. Create a different lowering for CreateMaskOp in VectorToLLVM to address scalable vectors specifically.

The second issue is a bit more problematic. The current lowering and the scalable-compatible lowering do not share the exact same semantics. If the index of the operation is negative, get.active.lane.mask wraps around, while the current lowering clamps to 0. So vector.create_mask -1 will lower to all 0s for fixed-length vectors and all 1s for scalable vectors. Not sure this is necessarily a problem in practice, but it’s definitely there.
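
To make that concrete, here is a small sketch based on the two lowerings above (the constant name is just for illustration):

    %c = arith.constant -1 : index
    // Fixed-length lowering: [0, 1, 2, 3] "slt" -1 is false in every lane,
    // so the mask is all-false.
    %m0 = vector.create_mask %c : vector<4xi1>
    // Scalable lowering: get.active.lane.mask compares unsigned, so -1
    // wraps around to the largest unsigned value and the mask is all-true.
    %m1 = vector.create_mask %c : vector<[4]xi1>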

One way I might be able to avoid the inconsistency would be to use llvm.experimental.stepvector with an approach similar to the fixed-length one, e.g.:

    %cst = llvm.intr.experimental.stepvector : vector<[4]xi32>
    %0 = arith.index_cast %n : index to i32
    %1 = splat %0 : vector<[4]xi32>
    %2 = arith.cmpi slt, %cst, %1 : vector<[4]xi32>

But it’s an experimental intrinsic right now, and it’s not even in the LLVM Dialect, so I’d rather avoid this.

Any ideas or suggestions are greatly appreciated.

Thank you,
Javier


Hi all,

I’ve updated the patch implementing option 2 here. I think it makes more sense like this, even if we have effectively split the lowering of this operation across two places. The next step is to handle the issue with negative indices. If nobody has anything to say about it, I think it’s safe to keep the two different behaviors. I’ll add a note in the documentation so that if people trip on this in the future, there is an easy-to-find explanation (and then we can come back to the issue). Other than that, the patch would be ready for review.

Thank you,
Javier

I don’t see a problem with having multiple patterns that convert an operation, especially if they can be made mutually exclusive by checking the applicability condition upfront.

I do however find it problematic to define the semantics of an op depending on how it is lowered. Op semantics should be clear regardless of specific lowerings (one doesn’t have to go to LLVM IR from the vector dialect, or at least to LLVM IR vector instructions; there is some work on a vector programming model for GPUs, for example). It may be possible to get away with the “undefined behavior” trick: the behavior is undefined for negative operands, so both the clamping and the wrapping-around lowerings are fine. If somebody already relies on the clamping behavior, the trick no longer applies. What would it take to implement clamping for scalable vectors and/or wrapping-around for fixed vectors + indicate the version by an op attribute?
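
For illustration only, such an attribute could look something like this (hypothetical syntax; the attribute name and values are made up):

    // Hypothetical attribute, not an existing op definition:
    %m0 = vector.create_mask %n {out_of_bounds = "clamp"} : vector<4xi1>
    %m1 = vector.create_mask %n {out_of_bounds = "wrap"} : vector<[4]xi1>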

I do however find it problematic to define the semantics of an op depending on how it is lowered.

+1. We need well-defined semantics before the lowering since that might impact how this op is optimized/canonicalized.

If somebody already relies on the clamping behavior, the trick no longer applies.

We do have a user of this behavior and had some discussion related to this problem here: D116069 [mlir][vector] Allow values outside of [0; dim-size] in create_mask

indicate the version by an op attribute?

That could work. We would also have to decide if clamping/wrapping-around would make sense for the upper bound. Not sure if @sergei-grechanik had a use case for this one.

We definitely depend on the clamping behavior for indices greater than the vector size (this is used in the affine vectorizer for reductions), but I think we don’t depend on clamping negative values to zero.
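
For example, with the fixed-length lowering shown earlier, a bound larger than the vector length simply saturates (a small illustration):

    // [0, 1, 2, 3] "slt" 7 holds in every lane, so the mask is all-true
    // even though 7 > 4.
    %c7 = arith.constant 7 : index
    %m = vector.create_mask %c7 : vector<4xi1>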

However, the wrapping behavior just feels weird to me, whereas the clamping behavior seems natural. The ideal solution, in my opinion, would be to check whether the index can be negative and avoid lowering to llvm.intr.get.active.lane.mask unless it’s provable that it can’t (or insert a clamping operation in this case). But it sounds unrealistic, so I’m ok with changing the behavior of negative values to UB (the description of the operation should be changed in this case, and the test case for negative values I added in D116069 [mlir][vector] Allow values outside of [0; dim-size] in create_mask should be removed).

cc @aartbik

My two cents. Similar to @ftynse, I am perfectly okay with two lowering patterns, which seems very much in line with MLIR’s progressive lowering philosophy to “bring the IR closer and closer to the target ISA”.

As for the wrap/clamp difference, this is probably a result of the fact that in MLIR the operation (and not the type) usually determines whether operands are interpreted as signed or unsigned, and we historically just picked one interpretation when we implemented this op a long time back. I am fine with either, as long as we document the behavior very well and then are consistent across all implementations (and none of the current interpretations in the tests and code examples break, of course). Perhaps UB is needed, but if it is really just used to allow one behavior along one lowering and another along the other, I would be very reluctant.
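
For reference, this is the convention I mean: in arith.cmpi, signedness is selected by the predicate on the op, not by the integer type (operand names are made up here):

    %s = arith.cmpi slt, %a, %b : i32   // signed comparison
    %u = arith.cmpi ult, %a, %b : i32   // unsigned comparison, same types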

Apologies for the long radio silence; I hit a weird bug while trying to run tests for create_mask and wanted to make sure I wasn’t saying the wrong thing.

I agree, just wanted to make sure I wasn’t overlooking something. This is the kind of decision that might have repercussions beyond Arm-specific stuff and I thought it was worth discussing options publicly.

The claim is a bit stronger and better defined than that. The semantics won’t depend on how it’s lowered, but on the operation type (how it’s lowered depends on the type of the operation). If it’s a fixed-length operation (returns a fixed-length vector), the semantics will be one; if it’s a scalable operation (returns a scalable vector), they will be the other. You can’t go from one to the other, so that semantic distinction is clear in the operation. That said, if we ever want to explore the possibility of converting fixed-length code into scalable code (or vice versa), this might become an issue, so I think it’s worth trying to find a way to preserve common semantics.

Indeed, I’ve been able to verify this. Because vector.create_mask takes an index but llvm.intr.get.active.lane.mask takes an integer, I’m injecting an index_cast that’s doing the “wrong” thing. I believe this can be fixed, so I will try to do that with the next update (which will also include the appropriate tests).
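
For the record, a minimal sketch of the mismatch, assuming the usual 64-bit index type:

    %n = arith.constant -1 : index
    // The cast preserves the value -1 (all bits set), and
    // get.active.lane.mask then treats it as a very large *unsigned*
    // bound, so every lane comes out active.
    %n32 = arith.index_cast %n : index to i32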

One easy solution that would prevent inconsistencies in any possible scenario is to follow the second lowering I proposed, based on llvm.intr.experimental.stepvector. This lowering parallels the one for fixed-length vectors, but works for scalable vectors. I mentioned that I had noticed we don’t have experimental intrinsics in the LLVM Dialect; is it worth looking into this as a possibility in the near(-ish) future? Not only for this specific case, but in general. Many of these intrinsics provide a cleaner way to handle “particular” cases (like scalable vectors), but I don’t know if there is an unwritten policy against including experimental LLVM IR in the LLVM Dialect and/or what the rationale behind that decision might be.

Thank you, Sergei and Diego, for your comments and clarifications as well; they’re greatly appreciated! :smiley: I followed the review of D116069; it’s what prompted me to look into out-of-bounds behavior :slight_smile:

Thank you all, I’ll get back with an update when it’s done.

Just rephrasing for correctness: it does not depend on the operation type (that isn’t really a thing in MLIR, though in LLVM IR an Instruction can be seen as having the type of the value it defines); it depends on the type of the result of said operation. This is quite confusing. Your description makes it sound like two distinct operations are a better fit here. Then you’ll have different “operation types/kinds” based on the operation name. My suggestion of having an attribute comes from how we handle signed/unsigned comparisons with an attribute rather than through the operand type.

MLIR doesn’t distinguish between experimental and non-experimental intrinsics, any more than it does between intrinsics and operations. Just define them as normal operations with the appropriate translation and rename/update them if they change. We had matrix intrinsics and vector reductions while they were still experimental; now they are not, and we still have them.


I think I somehow managed to find the most confusing way to express myself, my apologies. I was trying to say that I don’t see how this very particular difference might cause trouble down the line; but just because I don’t see an immediate problem doesn’t mean it won’t become one in the future, so I’d rather do something about it.

That was a bad assumption on my part. I thought LLVM Dialect intrinsics were automatically extracted from LLVM, and not seeing any of the experimental ones made me wonder. I’m going to give that option a go; I believe it’s cleaner and less problematic than adding a maxsi before llvm.intr.get.active.lane.mask, which is what I was doing until now.
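
Roughly, that workaround looked like this (a sketch, assuming i32 operands):

    %c0 = arith.constant 0 : i32
    %n32 = arith.index_cast %n : index to i32
    // Clamp negative trip counts to zero before the unsigned comparison
    // inside get.active.lane.mask.
    %1 = arith.maxsi %n32, %c0 : i32
    %2 = llvm.intr.get.active.lane.mask %c0, %1 : i32, i32 to vector<[4]xi1>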

Thanks again for your advice!

It was alright; I am mostly pointing out that such op semantics would be very uncommon in MLIR, and it’s better to follow our design patterns when possible. Adding operations is cheap.

We do have tooling that generates the “first draft” of an intrinsic definition (mlir-tblgen -gen-llvmir-intrinsics), but the intended use for it is to take that first draft, clean it up, add a test, and commit it as code.


Hi!

I’ve updated the patch with the new lowering; the behavior is now the same for both types of vectors, and the patch is ready for final review.

The code LLVM generates for the stepvector+splat+cmpi sequence is different from (and likely somewhat less efficient than) the code for get.active.lane.mask, but that’s something we can either try to fix during instruction selection in the back-end or, failing that, address by revisiting this lowering and going back to a max + get.active.lane.mask approach.

Thank you,
Javier