Linalg and masking

Hi all,
In Linalg there are two ways (as far as I know) to deal with vectorization at the boundaries: padding and peeling.

  • Padding “extends” the region of memory to align with the vector length
  • Peeling extracts a partial loop up to a bound that aligns with the vector length and then creates a remainder loop to handle the rest
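To make the two options concrete, here is a scalar C model of peeling (a hypothetical sketch, not actual compiler output): the main loop trip count is rounded down to a multiple of the vector length (4 here), so it vectorizes cleanly, and a remainder loop handles the tail.

```c
#include <stddef.h>

/* Peeling, sketched in scalar C: main loop over full groups of 4 lanes,
   remainder loop over the at-most-3 leftover elements. */
static void saxpy_peeled(float a, const float *x, float *y, size_t n) {
    size_t i = 0;
    /* main loop: trip count is a multiple of 4, so it vectorizes cleanly */
    for (; i + 4 <= n; i += 4)
        for (size_t j = 0; j < 4; ++j)
            y[i + j] += a * x[i + j];
    /* remainder loop: handles n % 4 leftover elements */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```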

However, there is a third option, which I have not been able to trigger from Linalg: masking. Consider a simple saxpy:

#map0 = affine_map<(d0) -> ()>
#map1 = affine_map<(d0) -> (d0)>
module {
  func @saxpy_linalg(%arg0: f32, %arg1: tensor<33xf32>, %arg2: tensor<33xf32>) -> tensor<33xf32> {

    %0 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel"]} ins(%arg0, %arg1 : f32, tensor<33xf32>) outs(%arg2 : tensor<33xf32>) {
    ^bb0(%arg5: f32, %arg6: f32, %arg7: f32):
      %3 = arith.mulf %arg5, %arg6 : f32
      %4 = arith.addf %3, %arg7 : f32
      linalg.yield %4 : f32
    } -> tensor<33xf32>

    return %0 : tensor<33xf32>
  }
}

When I tile by 4, this is what I get:

#map0 = affine_map<(d0) -> (-d0 + 33, 4)>
#map1 = affine_map<(d0) -> ()>
#map2 = affine_map<(d0) -> (d0)>
module {
  func @saxpy_linalg(%arg0: f32, %arg1: tensor<33xf32>, %arg2: tensor<33xf32>) -> tensor<33xf32> {
    %c4 = arith.constant 4 : index
    %c33 = arith.constant 33 : index
    %c0 = arith.constant 0 : index
    %0 = scf.for %arg3 = %c0 to %c33 step %c4 iter_args(%arg4 = %arg2) -> (tensor<33xf32>) {
      %1 = affine.min #map0(%arg3)
      %2 = tensor.extract_slice %arg1[%arg3] [%1] [1] : tensor<33xf32> to tensor<?xf32>
      %3 = tensor.extract_slice %arg4[%arg3] [%1] [1] : tensor<33xf32> to tensor<?xf32>
      %4 = linalg.generic {indexing_maps = [#map1, #map2, #map2], iterator_types = ["parallel"]} ins(%arg0, %2 : f32, tensor<?xf32>) outs(%3 : tensor<?xf32>) attrs =  {iree_linalg_transform.matched} {
      ^bb0(%arg5: f32, %arg6: f32, %arg7: f32):
        %6 = arith.mulf %arg5, %arg6 : f32
        %7 = arith.addf %6, %arg7 : f32
        linalg.yield %7 : f32
      } -> tensor<?xf32>
      %5 = tensor.insert_slice %4 into %arg4[%arg3] [%1] [1] : tensor<?xf32> into tensor<33xf32>
      scf.yield %5 : tensor<33xf32>
    }
    return %0 : tensor<33xf32>
  }
}

I would like to lower that to emit transfer_read with masks and subsequently masked_load instructions. Is there a way to do so directly from linalg?

I see that vectorization does not even kick in because of the dynamic shapes. Before I throw out some ideas, is there any way I can mask the loop? Has someone already done this?
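For reference, this is what a masked version would compute, modeled in scalar C (a hypothetical sketch of the semantics, not what the lowering would emit): a single loop steps by the vector length and each lane is guarded by the condition `i + j < n`, so the final iteration touches only the valid `33 % 4` elements.

```c
#include <stddef.h>

/* Masking, sketched in scalar C: one loop, no tail; each lane is
   guarded by a per-lane predicate so out-of-bounds lanes do nothing. */
static void saxpy_masked(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < 4; ++j)
            if (i + j < n)                 /* lane j active under the mask */
                y[i + j] += a * x[i + j];  /* masked load + masked store */
}
```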

Thanks,
Giuseppe

cc @javiersetoain

We have a vector.predicate op in the IREE Sandbox that we plan to bring to MLIR to model masking. It will allow us to model masking/predication in a generic way and implement different masking strategies. So far, I only have a basic lowering implemented, and the integration with the Linalg vectorizer is missing. Bringing this to MLIR is on my priority list, but it may take me a few weeks to get there and send an RFC. In the meantime, maybe you could take a first look at what we have in the Sandbox? Happy to collaborate on the Linalg integration, if you are interested!

Hi @dcaballe ,
Thanks a lot for the pointer!

Yes, I would like to cooperate on the Linalg integration, which is my major pain point now (how do I emit that from Linalg vectorization?)

Should we use this thread to agree on a general high-level design (and then maybe chat on IREE for technical hurdles)?

What I am trying for now is (but I am just exploring):

  • I am adding a vectorization pass that sort-of-peels the loop (if it is dynamic) without adding the loop tail
  • Then I wanted to change all the transfer_read ops to have masks, but I think your vector.predicate is a much better option (applicable to the whole loop body) :slight_smile:

What do you think? Any better idea?

Thanks,
Giuseppe

Something like that could work for a first experiment. You could introduce the vector.predicate op for the whole loop body before vectorization (or after, if the vectorizer doesn’t like it). Then this masking strategy should generate the masked xfer ops that you want :slight_smile:.

In the long run, we should extend the vectorizer so that it’s able to deal with dynamic shapes and insert the vector.predicate op automatically. I’ll try to send an RFC soon so that we can focus on the long term implementation. The simple and more constrained version of the op should be ready to go.

+1. This line of work started a while back but other things got in the way; I am quite optimistic about the approach @dcaballe proposes and about lifting it up to Linalg.

Btw, inserting the vector.predicate op can be a bit challenging in terms of rewiring the def-use chains, so you may want to use this. We can extend it as needed!

For Arm SVE, there are data and predicate registers, and ops can take both as inputs:
foo = mul data1, data2, pred0
You just need a way to create the predicates.
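A small C model of the predicate creation (hypothetical sketch of the SVE `whilelt` semantics, not an intrinsic signature): lane j of the predicate is set while `base + j < bound`.

```c
#include <stdbool.h>
#include <stddef.h>

/* whilelt-style predicate: lane j is active while base + j < bound. */
static void whilelt(bool pred[], size_t lanes, size_t base, size_t bound) {
    for (size_t j = 0; j < lanes; ++j)
        pred[j] = base + j < bound;
}
```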

Hi all,
I have been able to get somewhere.

What I did:

  • I created a sortofpeeling transformation where there is no remainder loop (as I mentioned)
  • Then I apply a mask to the whole loop through the vector.predicate op that @dcaballe mentioned
  • When lowering, I propagate the mask to the read/write instructions

As for the generic case, this is fine. The loop is masked and everything else flows.

But I am trying to use the masking instructions from the SVE ISA, in particular whilelt. This involved additional steps.

  1. I created a vector.whilelt operation (which is more powerful than a simple create_mask). This lowers either to affine.min + create_mask or directly to the sve.whilelt SVE intrinsic (I added that as well).
  2. To lower to an SVE instruction, I need scalable vectors in MLIR, so I created a conversion pass that simply replaces fixed-size vectors with scalable vectors (throughout the module). Maybe we can localize that with this proposal.
  3. Now everything works, and the assembly produced (for the saxpy) is as follows:
	whilelt	p1.s, x8, x3
	ld1w	{ z1.s }, p1/z, [x1, x8, lsl #2]
	ld1w	{ z2.s }, p1/z, [x6, x8, lsl #2]
	add	x8, x8, #4
	fmad	z1.s, p0/m, z0.s, z2.s
	st1w	{ z1.s }, p1, [x6, x7, lsl #2]
	add	x7, x7, x10
	cmp	x8, x9
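The two lowerings in step 1 produce the same predicate; here is a small C model of the equivalence (hypothetical sketch, not the actual MLIR lowering) for a 4-lane tile: `create_mask(min(n - i, 4))` sets lane j iff `j < min(n - i, 4)`, which for `0 <= i < n` is exactly the whilelt condition `i + j < n`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Lowering 1: affine.min + create_mask. */
static void mask_via_min(bool p[4], size_t i, size_t n) {
    size_t m = n - i < 4 ? n - i : 4;      /* affine.min(-i + n, 4) */
    for (size_t j = 0; j < 4; ++j)
        p[j] = j < m;                      /* create_mask(m) */
}

/* Lowering 2: whilelt-style compare against the loop bound. */
static void mask_via_whilelt(bool p[4], size_t i, size_t n) {
    for (size_t j = 0; j < 4; ++j)
        p[j] = i + j < n;                  /* whilelt(i, n) */
}
```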

Which is pretty much what I would write manually. Now, could we agree on a good way to push this forward? I can start working on how to insert vector.predicate from Linalg.

On this last point, I have a question for @dcaballe . Previously, you said:

In the long run, we should extend the vectorizer so that it’s able to deal with dynamic shapes and insert the vector.predicate op automatically

Can you shed some more light on this? My doubts are the following:

  • The vectorization framework in Linalg vectorizes the linalg.generic itself and not the loop. But if I don’t peel or pad, the linalg.generic will have dynamic shapes and cannot be vectorized (because vectors need a static shape).
  • We can introduce the sortofpeeling transformation, BUT without a mask this will be illegal (because the reads will go out of bounds).
  • So maybe we could add a tensor.predicate instead of vector.predicate, and we could add a PredicateOp transformation which would be in the same realm as PeelOp and PadOp.

EDIT

  • Another possibility would be to add an scf.if and then lower that to a mask when vectorizing.

What do you think?

Thanks,
Giuseppe

As a side note, when I first implemented a lowering for create_mask, I was using llvm.intr.get.active.lane.mask. The instruction selector was picking whilelt without issues. In the end, for the reasons discussed in “Vector.create_mask for scalable vectors”, I switched to llvm.intr.experimental.stepvector and a lowering strategy analogous to the one for fixed-length vectors.

What I’m saying is, this can probably be fixed either by altering how we lower create_mask or by making the instruction selection a bit smarter. I think we should avoid hw-specific lowerings as much as possible. I know you’re trying to get the best possible performance now, but we should at least keep in mind that, long term, there are probably better ways to deal with this than adding an arm_sve.whilelt :slight_smile:

Hi Javier,
Thanks for the answer! Well, yes, I am trying to get the best performance, but I am happy with any general solution that allows that :slight_smile:

In the end, for reasons discussed in “Vector.create_mask for scalable vectors”, I switched to llvm.intr.experimental.stepvector and a lowering strategy analogous to fixed-length vectors.

Yes, that was exactly the pass I was trying to fix :slight_smile: From what I understand, llvm.intr.get.active.lane.mask is a whilelt-like intrinsic that also accepts fixed types. Is this correct? If it is, that’s great, because I would not need to convert my IR to scalable vectors!

Could you explain a bit more why llvm.intr.experimental.stepvector was preferred over it? Is it because it is closer to what create_mask does?

My point is exactly that if the ISA has an instruction that is higher level than create_mask, we could create a new vector.whilelt (or a vector.while<CMP>) which can be lowered to create_mask + affine.min or directly to llvm.intr.get.active.lane.mask.

The alternative of improving the LLVM selector seems harder, and I am not sure it is worth it, since we know exactly how to lower the mask creation. What do you think?

It’s because of how create_mask was being lowered to LLVM. It was creating a constant vector (0, 1, 2, 3, …), broadcasting the operand of create_mask, and comparing the two vectors. This has the side effect of a clamping behaviour (i.e., if the operand is negative, the resulting mask is all 0s). On the other hand, if you use get.active.lane.mask, the behaviour is a wrap-around (negative values get bitcast to unsigned), which was decided to be undesirable in previous discussions (referenced in the one I linked).

This is fixable by adding a pre-clamping operation on the operand of create_mask, but it’s a bit messier than just switching to a stepvector, and there’s something to be said in favour of doing things the same way for both scalable and fixed-length vectors :slight_smile:
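The clamping vs wrap-around difference can be modeled in a few lines of C (a hypothetical sketch of the two lowerings, not the actual LLVM IR): the stepvector lowering compares signed, so a negative operand clamps to an all-false mask, while get.active.lane.mask compares unsigned, so a negative bound wraps around and yields an all-true mask instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stepvector-style lowering: p[j] = (j < m) with a signed compare,
   so a negative m clamps to an all-false mask. */
static void mask_stepvector(bool p[4], int64_t m) {
    for (int64_t j = 0; j < 4; ++j)
        p[j] = j < m;
}

/* get.active.lane.mask-style: unsigned compare, so a negative bound
   wraps around to a huge value and makes every lane active. */
static void mask_active_lane(bool p[4], int64_t base, int64_t n) {
    for (int64_t j = 0; j < 4; ++j)
        p[j] = (uint64_t)(base + j) < (uint64_t)n;
}
```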

llvm.intr.get.active.lane.mask works both for fixed-length and scalable vectors, but create_mask can be multi-ranked, which complicates the lowering. I thought about it very early on but stopped the moment it got a bit complicated. The stepvector approach was straightforward, so I went for that. It might be a good time to revisit the get.active.lane.mask option?

There’s certainly a lot to improve there :smiley:

The vector.predicate op is orthogonal to the way the mask is generated. We can compute the mask that we want, fixed or scalable, and use it there.

I’m already making progress in this direction. Hopefully we can coordinate so that we don’t duplicate work. The main problem is that having something working for simple cases is easy, but we have to support multi-dimensional vectorization and named ops, and have a proper lowering for those. We need to iterate a bit before we have something we can land. Happy to share my progress on GitHub when I have something stable. I’ll keep you posted.

I don’t totally follow the sortofpeeling approach, but the vector.predicate op is a high-level abstraction aimed at preventing vector memory operations from going out of bounds, if that’s what you are trying to do. It shouldn’t matter whether you are dealing with dynamic shapes, as long as we compute the mask properly.

I don’t think I’m aware of the PeelOp. Is that from the Transform dialect? PadOp is a different case because it actually transforms a tensor. It should be OK for vector operations to interact with tensors; we already have operations that do so (vector transfer ops, for example).

+1 to minimizing hw-specific lowerings and, I would also add, to minimizing hw-specific abstractions in the Vector dialect. We have to make sure that whatever we come up with also works for RISC-V and does not interfere with the existing support for fixed-length vectors.

I think it’s great if you can prototype something quickly! We can learn from it and apply it to the long-term approach. We can use the IREE Sandbox for that or share patches on GitHub.