MLIR for Arm SME: Further development suggestions

This thread is here to discuss suggestions for improving the Arm SME lowering in MLIR. To keep each discussion to a single topic, as suggested by @banach-space, it is a follow-up to that thread and aims to address some of the Arm SME challenges.
FYI: @MacDue

Extending the transform dialect for scalable sizes

I think it would be nice if we did not have to hardcode the scalable size in the vectorize transform op depending on the precision, e.g.:

// for f32
%tiledmatmul = transform.structured.tile_using_for %matmul [[4], [4], 1] [...]
%vectorized = transform.structured.vectorize %tiledmatmul vector_sizes [[4], [4], 1] [...]
// for f16
%tiledmatmul = transform.structured.tile_using_for %matmul [[8], [8], 1] [...]
%vectorized = transform.structured.vectorize %tiledmatmul vector_sizes [[8], [8], 1]
// And so on for any precision. 

For instance, we could query the precision of an op in the payload from the transform sequence, do some simple arithmetic to compute the static scalable size, and use the result as an attribute of the tile and vectorize ops. Such suggestions were discussed at EuroLLVM with @rolfmorel and @ftynse, and similarly an RFC targeting comparable features for PDLL has been proposed.

We would then have something like:

// VL can be CSE-ed at compile-time
%p = transform.get_precision_of_op %matmul : index
%VL = arith.divui %registerSize, %p : index
%tiledmatmul = transform.structured.tile_using_for %matmul [[%VL], [%VL], 1] [...]
%vectorized = transform.structured.vectorize %tiledmatmul vector_sizes [[%VL], [%VL], 1]

Setting vscale as a constant and VLS

VLS has been a subject discussed upstream before. A known vscale value offers more constant-propagation possibilities. I would like to propose a transformation that sets vscale to a constant value. It is a bit of a weird transformation, I agree, since we go to the trouble of keeping this whole pipeline VLA, but being able to set vscale to a constant still allows generating both VLA and VLS code (as is currently the case) and SME FMOPA, while also enabling optimizations such as pipelining the inner loops to fit a fixed tile size, removing the peeled part of the loop generated in this use case, hoisting out transpositions, etc. I wonder whether this transform could be of interest to the community and be an upstream candidate. For now, canonicalization at the scf level does not propagate constants far enough to get rid of the peeled loop.
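To make this concrete, here is a minimal sketch of what the transform would do (illustrative IR only; the constant 2, i.e. 256-bit registers, is an assumption, not a value taken from the actual pipeline):

// Before the transform: the vector length is an opaque runtime value.
%vscale = vector.vscale
%c4 = arith.constant 4 : index
%vl = arith.muli %vscale, %c4 : index // some unknown multiple of 4
// After pinning vscale to 2 (assuming 256-bit registers):
%vscale_cst = arith.constant 2 : index
%vl = arith.muli %vscale_cst, %c4 : index // now folds to the constant 8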

This suggestion has been supported by @dcaballe.
Code example: On Improving Arm SME Lowering Resilience in MLIR - #16 by nujaa

@ftynse outlined the trickiness of propagating this kind of not-so-constant value here: On Improving Arm SME Lowering Resilience in MLIR - #19 by ftynse

I am happy for this thread to be a place to find solutions for those suggestions, or to be a list of proposals. Or both :slightly_smiling_face:

Kiss,
-Hugo

On setting vscale as constant:

Great, what would make it generic and reusable enough for you?
IIUC, in the RVV context, you have downstream some vector.set_vl and vector.get_vl ops which behave quite like vector.vscale and compute the scalable vector sizes of a vector op. How could we tie these together?

To be honest, get_vl could also be the exact answer to the transform dialect suggestion.

+1 Though I don’t think that we need the actual runtime register size; we only need the base size:

%p = transform.get_precision_of_op %matmul : index
// @Hugo - we only need to know the vector register base size
%c_128 = arith.constant 128 : index
%tile_size = arith.divui %c_128, %p : index
%tiled_matmul = transform.structured.tile_using_for %matmul [[%tile_size], [%tile_size], 1] [...]
%vectorized = transform.structured.vectorize %tiled_matmul vector_sizes [[%tile_size], [%tile_size], 1]

We’ve not had the spare cycles to look into this. @ftynse, any suggestions? Would that ^^^ and transform.get_precision_of_op make much sense?

As for …

@nujaa, this comment is key:

In your original input the tensors are dynamic and the missing information is the sizes:

func.func @matmul_transpose_a(
  %arg0: tensor<?x?xf32>,
  %arg1: tensor<?x?xf32>,
  %arg2: tensor<?x?xf32>)

So the fact that the peeled loop cannot be folded away is due to the input shapes being dynamic. I imagine that in practice your pipelines include tensors with static sizes, right? In that case the compiler should have all the information it needs. Could you try that and let us know what you get?
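For reference, a static-shape variant of the same signature would look like the one below (the 64x64 sizes are made up for illustration):

func.func @matmul_transpose_a(
  %arg0: tensor<64x64xf32>,
  %arg1: tensor<64x64xf32>,
  %arg2: tensor<64x64xf32>)

With static sizes, the compiler has the extent information that was missing in the dynamic case.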

Finally …

So far we have been focusing our effort on VLA and we haven’t hit an issue where having vector.vscale prevented us from optimising something (well, we have been able to fix all such issues). I don’t want to prevent folks from using VLS instead (if that’s what works better for them), but I would also appreciate help fixing VLA issues. Bug reports would be enough - that would give us a better idea of what’s missing. Your example with loop peeling is great, but not all hope is lost :sweat_smile: Do you have more?

-Andrzej

I think taking vector.set/get_vl into account and making sure that both approaches are well aligned should be enough from the reusability standpoint. vector.get/set_vl would live on top of scalable vectors and the constant instantiation that you are looking for, so both things should compose well.

You may not have to remove it if you follow an approach that provides a bit more optionality. IIUC, you would like to do something like:

original scalar loop -> main scalar loop + remainder loop -> main scalable vector loop + remainder loop -> main fixed-length vector loop + remainder loop -> main fixed-length vector loop

However, you could instead do:

original scalar loop -> main masked scalable vector loop -> main masked fixed-length vector loop -> main fixed-length vector loop [+ remainder loop]

This approach introduces masks and gets rid of them if the number of iterations is a multiple of the final fixed vector length. Also, this approach would allow us to choose between continuing with masked operations or removing them and generating a remainder loop instead. It’s relatively simple to remove redundant masks (not only in MLIR; LLVM is also able to do this), whereas it would be really tricky to go from a peeling-first approach to a masked approach if the user wanted to go with masks instead of peeling.
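As a rough illustration (the sizes below and the values %t, %i and %pad are chosen for the example, not taken from the actual pipeline): once the vector length is fixed and the trip count is a known multiple of it, the mask folds to an all-true constant and the masked form can be dropped; otherwise a remainder loop can still be generated.

// Masked fixed-length form; %c8 is the fixed vector length and divides the trip count.
%c8 = arith.constant 8 : index
%mask = vector.create_mask %c8 : vector<8xi1> // folds to an all-true constant mask
%v = vector.mask %mask {
  vector.transfer_read %t[%i], %pad : tensor<?xf32>, vector<8xf32>
} : vector<8xi1> -> vector<8xf32>
// Once the mask is known to be all-true, the vector.mask wrapper is redundant
// and a plain unmasked transfer_read remains.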

+1!

(Sorry if I go silent for a while. I’m starting some time off today :palm_tree:)

As Diego hints, there are many options here and we haven’t really experimented that much (so far I’ve focused on removing masks).

One thing that I always try to highlight: ATM, scalable vectorisation requires masking. We have some ideas on how to avoid that, but it’s just not that high on the list of priorities. In the meantime, we follow this:

Also, @nujaa, if your input tensors perfectly match your vector sizes, the peeled loop will never run, so you shouldn’t really experience any perf issues. However, the binary will be a bit larger. Is that something that you’d like to minimise as well?

-Andrzej

The first part is rather straightforward: just generalize the logic already available for tile_using_forall (llvm-project/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td at 6904e0e8852a587b49a673055997e88855f219ea · llvm/llvm-project · GitHub) to other ops. I believe @rolfmorel wanted to do that, but short of a reply/PR from his side, anybody can pick this up.

The integer part is a bit trickier; there are two design points:

  • storing (integer) values in the interpreter without adding a lot of cost of boxing/unboxing them as parameters;
  • whether we have the interpreter process arith directly by adding interfaces to it and the index type (meh), or duplicate the arithmetic operations elsewhere (also meh).

@pablo and I indeed want to do this for Transform ops where we find we need their parameters to depend on the payload/be determined at interpreter runtime. The usage highlighted by @nujaa is one such case.

Regarding integer calculation during interpreter runtime, there is also the question of how to deal with param handles associated with multiple values. In effect, each such handle holds a dynamically-sized vector of integer values. Hence it seems most useful to me that these arithmetic computations be vectorized as well. Can we leverage existing code for this, e.g. arith ops on 1D (dynamically sized) tensors? I am aware this would incur unboxing/boxing costs, though as we would be taking values in transform.param form and producing derived values in transform.param form, can this really be avoided?

Would you mind opening a github issue with a list of ops where you think it’s necessary? Then we can all coordinate who is currently working on what.

We can at least minimize the cost by limiting unboxing to the edges, along the following lines:

%0 = transform... : () -> !transform.param<i64>
// unbox here
%1 = transform.process_param %0 {
^bb0(%param_i64: tensor<?xi64>):
  arith.add
  arith.bar // no intermediate boxing
  arith.foo
  yield
}
// box here

But we can also consider storing integers in the value mapping instead of pointers to attributes, which would entirely remove the cost of (un)boxing for integer parameters, with little to no additional cost for other cases (we are storing pointers and already have a dispatch). I haven’t thought about this deeply because I don’t currently have a use case, but if somebody is driving this, I’m willing to help.

:+1: Will coordinate with @pablo when he’s back and make a GitHub issue listing the ops we will probably need/want to work on.

Re. arithmetic in the interpreter:

A transform.process_param op like that would indeed address the intermediate boxing & unboxing overhead. As for how the params are stored while the interpreter runs, could I just check that this is currently fully abstracted away from users of the transform dialect? That is, the C++ implementation of ops knows about params being attributes under the hood, but this is not reflected in the IR itself, right?

For now we won’t be pushing for such functionality, as our use cases are limited enough that we can make do with one or two custom ops. When our needs regarding arithmetic become more involved (and the need to iterate quickly becomes more important), we will be sure to open a new thread/RFC.

The storage is currently a per-kind dense map Value->SmallVector<(Operation * | Value | Attribute)> with a dispatch based on which kind of interface the type of the value implements. We can easily add a Value->SmallVector<int64_t> there. I’m also considering a more unified storage, but the trade-off is unclear to me in the absence of any meaningful feedback from larger-scale users.