This thread is here to discuss suggestions for improving the Arm SME lowering in MLIR. To keep discussions to one topic, as suggested by @banach-space, it is a follow-up of that thread and aims to address some of the Arm SME challenges raised there.
FYI : @MacDue
Extending the transform dialect for scalable sizes
I think it would be nice if we did not have to hardcode the scalable sizes in the tile and vectorize transform ops for each element precision (e.g. 4 elements per 128-bit SVE granule for f32, 8 for f16).
e.g.:
// for f32
%tiledmatmul = transform.structured.tile_using_for %matmul [[4], [4], 1] [...]
%vectorized = transform.structured.vectorize %tiledmatmul vector_sizes [[4], [4], 1] [...]
// for f16
%tiledmatmul = transform.structured.tile_using_for %matmul [[8], [8], 1] [...]
%vectorized = transform.structured.vectorize %tiledmatmul vector_sizes [[8], [8], 1]
// And so on for any precision.
For instance, we could query the precision of a payload op from the transform sequence, do some simple arithmetic to compute the static part of the scalable size, and use the result as an attribute of the tile and vectorize ops. Such suggestions have been discussed at EuroLLVM with @rolfmorel and @ftynse, and similarly an RFC targeting related features for PDLL has been proposed.
We would then have something like:
// VL can be CSE-ed at compile-time
%p = transform.get_precision_of_op %matmul : index
%VL = arith.divui %registerSize, %p : index
%tiledmatmul = transform.structured.tile_using_for %matmul [[%VL], [%VL], 1] [...]
%vectorized = transform.structured.vectorize %tiledmatmul vector_sizes [[%VL], [%VL], 1]
A "set vscale as constant" transform and VLS
VLS has been a subject discussed upstream. A known vscale value offers more constant-propagation possibilities. I would like to propose a transformation that sets vscale to a constant value. I agree it is a bit of a weird transformation, since we go to some trouble to keep this whole pipeline VLA. However, being able to set vscale to a constant still allows generating both VLA and VLS code, as is currently the case, and generating SME FMOPA, while also enabling optimizations such as pipelining the inner loops to fit a fixed tile size, removing the peeled part of the loop generated in this use case, hoisting out transpositions, and so on. I wonder if this transform could be of interest to the community and be an upstream candidate. For now, canonicalization at the scf level does not propagate constants far enough to get rid of it.
This suggestion has been supported by @dcaballe.
Code example: On Improving Arm SME Lowering Resilience in MLIR - #16 by nujaa
@ftynse outlined the trickiness of propagating this kind of not-quite-constant value here: On Improving Arm SME Lowering Resilience in MLIR - #19 by ftynse
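To make the intended effect a bit more concrete, here is a minimal sketch of the IR impact such a transform could have. The value vscale = 2 (i.e. 256-bit SVE vectors) is only an example, and the exact spelling of the transform op performing the substitution is left open:
// Before: the scalable tile size is opaque to the compiler.
%vscale = vector.vscale
%c4 = arith.constant 4 : index
%tile = arith.muli %vscale, %c4 : index  // unknown at compile time, so masking/peeling remains
// After substituting vscale = 2 and canonicalizing, the size folds to a constant,
// which lets later passes drop the peeled loop and pipeline for a fixed tile size.
%tile = arith.constant 8 : index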
I am happy for this thread to be used to find solutions for those suggestions, or to be a list of proposals. Or both.
Kiss,
-Hugo