Hi all,
In Linalg there are two ways (as far as I know) to deal with vectorization at the boundaries: padding and peeling.

• Padding “extends” the region of memory so that it aligns with the vector length
• Peeling extracts a main loop whose trip count aligns with the vector length and then creates a remainder loop to do the rest
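
Peeling, for a trip count of 33 and vector length 4, would look schematically like this (illustrative sketch, loop bodies elided):

```mlir
// Main loop up to the largest multiple of the vector length (32),
// plus a scalar remainder loop for the last element.
scf.for %i = %c0 to %c32 step %c4 {
  // vectorized body
}
scf.for %i = %c32 to %c33 step %c1 {
  // scalar remainder
}
```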

However, there is a third option, which I am not able to trigger from Linalg: masking. Consider a simple saxpy:

```mlir
#map0 = affine_map<(d0) -> ()>
#map1 = affine_map<(d0) -> (d0)>
module {
  func @saxpy_linalg(%arg0: f32, %arg1: tensor<33xf32>, %arg2: tensor<33xf32>) -> tensor<33xf32> {
    %0 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel"]}
        ins(%arg0, %arg1 : f32, tensor<33xf32>) outs(%arg2 : tensor<33xf32>) {
    ^bb0(%arg5: f32, %arg6: f32, %arg7: f32):
      %3 = arith.mulf %arg5, %arg6 : f32
      %4 = arith.addf %3, %arg7 : f32
      linalg.yield %4 : f32
    } -> tensor<33xf32>
    return %0 : tensor<33xf32>
  }
}
```

When I tile by 4, this is what I get:

```mlir
#map0 = affine_map<(d0) -> (-d0 + 33, 4)>
#map1 = affine_map<(d0) -> ()>
#map2 = affine_map<(d0) -> (d0)>
module {
  func @saxpy_linalg(%arg0: f32, %arg1: tensor<33xf32>, %arg2: tensor<33xf32>) -> tensor<33xf32> {
    %c4 = arith.constant 4 : index
    %c33 = arith.constant 33 : index
    %c0 = arith.constant 0 : index
    %0 = scf.for %arg3 = %c0 to %c33 step %c4 iter_args(%arg4 = %arg2) -> (tensor<33xf32>) {
      %1 = affine.min #map0(%arg3)
      %2 = tensor.extract_slice %arg1[%arg3] [%1] [1] : tensor<33xf32> to tensor<?xf32>
      %3 = tensor.extract_slice %arg4[%arg3] [%1] [1] : tensor<33xf32> to tensor<?xf32>
      %4 = linalg.generic {indexing_maps = [#map1, #map2, #map2], iterator_types = ["parallel"]}
          ins(%arg0, %2 : f32, tensor<?xf32>) outs(%3 : tensor<?xf32>)
          attrs = {iree_linalg_transform.matched} {
      ^bb0(%arg5: f32, %arg6: f32, %arg7: f32):
        %6 = arith.mulf %arg5, %arg6 : f32
        %7 = arith.addf %6, %arg7 : f32
        linalg.yield %7 : f32
      } -> tensor<?xf32>
      %5 = tensor.insert_slice %4 into %arg4[%arg3] [%1] [1] : tensor<?xf32> into tensor<33xf32>
      scf.yield %5 : tensor<33xf32>
    }
    return %0 : tensor<33xf32>
  }
}
```

I would like to lower that to emit `transfer_read` with masks and subsequently `masked_load` instructions. Is there a way to do so directly from linalg?
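
Concretely, a hypothetical sketch of what I would like the vectorized body of one tile to look like (assuming the transfer ops accept a mask operand; value names are illustrative):

```mlir
// Masked read of one tile (illustrative):
%rem  = affine.min #map0(%arg3)                 // remaining elements, <= 4
%mask = vector.create_mask %rem : vector<4xi1>
%x    = vector.transfer_read %arg1[%arg3], %pad, %mask
        : tensor<33xf32>, vector<4xf32>
```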

I see that the vectorization does not even kick in because of the dynamic shapes. Before I throw out some ideas, is there any way I can mask the loop? Has someone already done this?

Thanks,
Giuseppe

We have a `vector.predicate` op in the IREE Sandbox that we plan to bring to MLIR to model masking. It will allow us to model masking/predication in a generic way and implement different masking strategies. So far, I only have a basic lowering implemented, and the integration with the Linalg vectorizer is missing. Bringing this to MLIR is on my priority list, but it may take me a few weeks to get there and send an RFC. In the meantime, maybe you could take a first look at what we have in the Sandbox? Happy to collaborate on the Linalg integration, if you are interested!

Hi @dcaballe ,
Thanks a lot for the pointer!

Yes, I would like to cooperate on the Linalg integration, which is my major pain point right now (how do I emit that from Linalg vectorization?)

Should we use this thread to agree on a general high-level design (and then maybe chat on IREE for the technical hurdles)?

What I am trying for now (but I am just exploring) is:

• I am adding a vectorization pass that sort-of-peels the loop (if it is dynamic) without adding the loop tail
• Then I wanted to convert all the `transfer_read` ops to use masks, but I think your `vector.predicate` is a much better option (applicable to the whole loop body)

What do you think? Any better idea?

Thanks,
Giuseppe

Something like that could work for a first experiment. You could introduce the `vector.predicate` op for the whole loop body before vectorization (or after, if the vectorizer doesn’t like it). Then, this masking strategy should generate the masked xfer ops that you want.

In the long run, we should extend the vectorizer so that it’s able to deal with dynamic shapes and insert the `vector.predicate` op automatically. I’ll try to send an RFC soon so that we can focus on the long term implementation. The simple and more constrained version of the op should be ready to go.

+1. This line of work started a while back but other things got in the way. I am quite optimistic about the approach @dcaballe proposes and about lifting it up to Linalg.

Btw, inserting the `vector.predicate` op can be a bit challenging in terms of rewiring the def-use chains so you may want to use this. We can extend it as needed!

For Arm SVE, there are data and predicate registers, and ops can take both as inputs:

```
foo = mul data1, data2, pred0
```

You just need a way to create predicates.

Hi all,
I have been able to get somewhere.

What I did:

• I created a `sortofpeeling` transformation where there is no remainder loop (as I mentioned)
• Then I apply a mask to the whole loop through the `vector.predicate` op that @dcaballe mentioned

As for the generic case, this is fine. The loop is masked and everything else flows.

But I am trying to use the masking instructions from the SVE ISA, in particular `whilelt`. This involved some additional steps:

1. I created a `vector.whilelt` operation (which is more powerful than the simple `create_mask`). This lowers either to an `affine.min` + `create_mask` or directly to the `sve.whilelt` SVE intrinsic (I added that as well).
2. To lower to an SVE instruction, I need scalable vectors in MLIR, so I created a conversion pass that simply replaces fixed-size vectors with scalable vectors (throughout the module). Maybe we can localize that with this proposal.
3. Now everything works and the assembly produced (for the saxpy) is as follows:

```
whilelt p1.s, x8, x3
ld1w    { z1.s }, p1/z, [x1, x8, lsl #2]
ld1w    { z2.s }, p1/z, [x6, x8, lsl #2]
st1w    { z1.s }, p1, [x6, x7, lsl #2]
cmp     x8, x9
```
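
To make step 1 concrete, the generic (non-SVE) lowering of such a whilelt-style op could look roughly like this (the map and value names are illustrative):

```mlir
// whilelt(%iv, %ub) for a vector length of 4, without SVE:
#min  = affine_map<(d0)[s0] -> (s0 - d0, 4)>
%n    = affine.min #min(%iv)[%ub]               // elements left, clamped to 4
%mask = vector.create_mask %n : vector<4xi1>
```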

Which is pretty much what I would write manually. Now, could we agree on a good way to push this forward? I can start working on how to insert `vector.predicate` from linalg.

On this last point, I have a question for @dcaballe . Previously, you said:

> In the long run, we should extend the vectorizer so that it’s able to deal with dynamic shapes and insert the `vector.predicate` op automatically

Can you shed some more light on this? My concerns are the following:

• The vectorization framework in `linalg` vectorizes the `linalg.generic` itself, not the loop. But if I don’t peel or pad, the `linalg.generic` will have dynamic shapes and cannot be vectorized (because vectors need a static shape).
• We can introduce the `sortofpeeling` transformation, BUT without a mask this will be illegal (because the reads will go out-of-boundaries).
• So maybe we could add a `tensor.predicate` instead of `vector.predicate`, along with a `PredicateOp` transformation that would be in the same realm as `PeelOp` and `PadOp`.

EDIT

• Another possibility would be to add a `scf.if` and then lower that to a mask when vectorizing.
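
For the `scf.if` idea, I imagine something like guarding the body and then folding the condition into a mask during vectorization (purely a sketch; names illustrative):

```mlir
// Guard the body; the comparison would become the vector mask
// when this gets vectorized.
%in_bounds = arith.cmpi slt, %idx, %ub : index
scf.if %in_bounds {
  // original loop-body ops
}
```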

What do you think?

Thanks,
Giuseppe

As a side note, when I first implemented a lowering for `create_mask`, I was using `llvm.intr.get.active.lane.mask`. The instruction selector was picking `whilelt` without issues. In the end, for the reasons discussed in “Vector.create_mask for scalable vectors”, I switched to `llvm.intr.experimental.stepvector` and a lowering strategy analogous to the one for fixed-length vectors.

What I’m saying is that this can probably be fixed by either altering how we lower `create_mask` or making the instruction selection a bit smarter. I think we should avoid hw-specific lowerings as much as possible. I know you’re trying to get the best possible performance now, but we should at least keep in mind that, in the long term, there are probably better ways to deal with this than adding an `arm_sve.whilelt`.

Hi Javier,
Thanks for the answer! Well, yes, I am trying to get the best performance, but I am happy with any general solution that allows that.

> In the end, for reasons discussed in “Vector.create_mask for scalable vectors”, I switched to `llvm.intr.experimental.stepvector` and a lowering strategy analogous to fixed-length vectors.

Yes, that was exactly the pass I was trying to fix. From what I understand, `llvm.intr.get.active.lane.mask` is a `whilelt` intrinsic that also accepts fixed-length types. Is this correct? If it is, that’s great, because I would not need to convert my IR into scalable vectors!

Could you explain a bit more why `llvm.intr.experimental.stepvector` was preferred over it? Is it because it is closer to what `create_mask` does?

My point is exactly that if the ISA has an instruction that is higher level than `create_mask`, we could create a new `vector.whilelt` (or a `vector.while<CMP>`) which can be lowered to `create_mask` + `affine.min` or directly to `llvm.intr.get.active.lane.mask`.

The alternative of improving the LLVM instruction selector seems harder, and I am not sure it is worth it, since we know exactly how to lower the mask creation. What do you think?

It’s because of how `create_mask` was being lowered to llvm. It was creating a constant vector (0, 1, 2, 3, …), broadcasting the operand of `create_mask`, and comparing both vectors. This has the side effect of having a clamping behaviour (i.e.: if the operand is negative, the resulting vector is all 0s). On the other hand, if you use `get.active.lane.mask`, the behaviour is a wrap-around (negative values get bitcasted to unsigned), which was decided to be undesirable in previous discussions (referenced in the one I linked).
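
For reference, that clamping lowering looks roughly like this for a fixed-length `vector<4xi1>` (illustrative; assumes the `create_mask` operand has already been cast to `i32`):

```mlir
%indices = arith.constant dense<[0, 1, 2, 3]> : vector<4xi32>
%bound   = vector.broadcast %n : i32 to vector<4xi32>
// Signed compare: a negative %n yields an all-false mask (the clamping behaviour)
%mask    = arith.cmpi slt, %indices, %bound : vector<4xi32>
```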

This is fixable by adding a pre-clamping operation on the operand of `create_mask`, but it’s a bit messier than just switching to a `stepvector`, and there’s something to be said in favour of doing things the same way for both scalable and fixed-length vectors.

`llvm.intr.get.active.lane.mask` works for both fixed-length and scalable vectors, but `create_mask` can be multi-ranked, which complicates the lowering. I thought about it very early on, but stopped the moment it got a bit complicated. The `stepvector` approach was straightforward, so I went for that. It might be a good time to revisit the `get.active.lane.mask` option?

There’s certainly a lot to improve there.

The `vector.predicate` op is orthogonal to the way the mask is generated. We can compute the mask that we want, fixed or scalable, and use it there.

I’m already making progress in this direction. Hopefully we can coordinate so that we don’t duplicate work. The main problem is that getting something working for simple cases is easy, but we have to support multi-dimensional vectorization and named ops, and have a proper lowering for those. We need to iterate a bit before we have something that we can land. Happy to share my progress on GitHub when I have something stable. I’ll keep you posted.

I don’t totally follow the `sortofpeeling` approach, but the `vector.predicate` op is a high-level abstraction aimed at preventing vector memory operations from going out of bounds, if that’s what you are trying to do. It shouldn’t matter whether you are dealing with dynamic shapes, as long as we compute the mask properly.

I think I’m not aware of the `PeelOp`. Is that from the Transform dialect? `PadOp` is a different case because it’s actually transforming a tensor. It should be fine for vector operations to interact with tensors; we already have operations that do so (the vector transfer ops, for example).

+1 to minimizing hw-specific lowerings and, I would also add, hw-specific abstractions in the Vector dialect. We have to make sure that whatever we come up with also works for RISC-V and does not interfere with the existing support for fixed vector-length vectors.

I think it’s great if you can prototype something quickly! We can learn from it and apply it to the long-term approach. We can use the IREE Sandbox for that or share patches on GitHub.