MLIR for Arm SME: vectorizing matmul-like ops as part of a broader program

This thread is here to discuss the best way to incorporate the Arm SME lowering of a matmul as part of another pipeline. As requested by @banach-space, it is a follow-up to that thread, aiming to solve some of the Arm SME challenges.

Goal
I would like to vectorize linalg.matmul (and its transposed flavours) with scalable vectors and the Arm SME lowering, as part of a larger program. To compile and optimize the rest of the program, I would like to vectorize it too, though not necessarily with scalable vectors, for instance using transform.structured.vectorize_children_and_apply_patterns. In most cases, when vectorizing the module this way, the vectorizer also tries to vectorize the contents of vector.mask ops, creating more ops inside their region, which is limited to a single op.

Steps to Reproduce: add a vectorizable op, such as an elementwise operation, to the payload below, and add a call to vectorize_children_and_apply_patterns after the masked vectorize of the matmul; a sketch of such a transform sequence is shown after the payload.

#map = affine_map<(d0, d1) -> ()>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
func.func @matmul(%A : tensor<?x?xf32>, %B : tensor<?x?xf32>, %C : tensor<?x?xf32>) {
  %cst = arith.constant 0.200000e+00 : f32
  // Elementwise op scaling every element of %A by 0.2; this is the kind of op we
  // would like vectorize_children_and_apply_patterns to pick up.
  %0 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel"]} ins(%cst : f32) outs(%A : tensor<?x?xf32>) {
  ^bb0(%in: f32, %out: f32):
    %3 = arith.mulf %in, %out : f32
    linalg.yield %3 : f32
  } -> tensor<?x?xf32>
  // Matmul to be tiled and vectorized with scalable vectors for Arm SME.
  %res = linalg.matmul ins(%0, %B: tensor<?x?xf32>, tensor<?x?xf32>)
                       outs(%C: tensor<?x?xf32>) -> tensor<?x?xf32>
  %xf = tensor.cast %res : tensor<?x?xf32> to tensor<*xf32>
  call @printMemrefF32(%xf) : (tensor<*xf32>) -> ()
  return
}
// Declaration of the print helper, assumed to be provided by the MLIR runner utils.
func.func private @printMemrefF32(%ptr : tensor<*xf32>)
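
For reference, here is a minimal sketch of a transform sequence that triggers the problem. It assumes recent upstream op names (transform.structured.tile_using_for, transform.structured.vectorize with vector_sizes, transform.structured.vectorize_children_and_apply_patterns); the exact syntax and the tile/vector sizes are only illustrative and may need adjusting to your MLIR version:

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
    // Tile the matmul and vectorize it with scalable vector sizes (SME-style).
    // This wraps the resulting vector ops in vector.mask regions.
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %module
      : (!transform.any_op) -> !transform.any_op
    %tiled, %loops:3 = transform.structured.tile_using_for %matmul tile_sizes [[4], [4], 1]
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
    transform.structured.vectorize %tiled vector_sizes [[4], [4], 1] : !transform.any_op
    // Now vectorize the rest of the function. The vectorizer also walks into the
    // regions of the vector.mask ops created above and fails, because a
    // vector.mask region may only contain a single op.
    %func = transform.structured.match ops{["func.func"]} in %module
      : (!transform.any_op) -> !transform.any_op
    %vectorized = transform.structured.vectorize_children_and_apply_patterns %func
      : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}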

Current solutions
In the TrA use case, running vectorize_children_and_apply_patterns on the module works if it happens after lower_masks. In the matmul case, this would hence require bufferizing before vectorizing the rest of the program, which is far from ideal. The user can obviously target specifically the ops they want to vectorize, but this is not very handy and reduces optimization opportunities (see the sketch at the end of this section).
An option I haven’t tried, and which just occurred to me, is to tile and vectorize the whole module except the matmul, and then run the SME pipeline specifically targeting matmuls, keeping the handles that way.
There is another solution suggested by the documentation: outlining / inlining the matmuls or their inner kernels. I personally find it a rather ugly solution that works, but yet again reduces optimization opportunities.
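
To illustrate the "target specific ops" workaround, a sketch under the same assumptions as the sequence above (op names are version-dependent, and the fixed vector sizes are only illustrative): instead of running vectorize_children_and_apply_patterns on the whole function, match the elementwise ops by name and vectorize only those handles, keeping the matmul handle for the SME path.

// Inside the same named sequence as above: only vectorize explicitly matched ops,
// so the vectorizer never walks into the vector.mask regions created for the matmul.
%elementwise = transform.structured.match ops{["linalg.generic"]} in %module
  : (!transform.any_op) -> !transform.any_op
// Fixed-width (non-scalable) masked vectorization of just these handles.
transform.structured.vectorize %elementwise vector_sizes [4, 4] : !transform.any_op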

Proposed solution
To sort out the issue with vector.mask: should we prevent vectorization from iterating inside vector.mask ops, considering their contents should already be vectorized since vector.mask is a vector op? I am working on such a fix at the moment, following this suggestion:

Once this is fixed, we might encounter more problems, which I will be happy to share. :slightly_smiling_face:
Kiss,
-Hugo.


Correct. This was also pointed out by @dcaballe, who architected the masking support in the vectorizer (On Improving Arm SME Lowering Resilience in MLIR - #15 by dcaballe):

You probably need to update vectorizeOpPrecondition.

Great, please add us as reviewers and/or reach out if you experience any issues with this.

-Andrzej

Posting my answer also in this thread:

It looks like you are decomposing the vector.transfer_read into simpler operations. You should run these patterns (or the pass built on top of them) to simplify the vector.mask ops before decomposing the vector.transfer_* ops.
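
Assuming "these patterns" refers to the vector.mask lowering patterns, and assuming the transform.apply_patterns.vector.* ops available in recent upstream MLIR (names may differ between versions), the ordering could be sketched roughly as follows, with %module being a handle to the payload module:

%func = transform.structured.match ops{["func.func"]} in %module
  : (!transform.any_op) -> !transform.any_op
transform.apply_patterns to %func {
  // Apply the vector.mask lowering patterns first...
  transform.apply_patterns.vector.lower_masks
  // ...and only afterwards lower / decompose the vector.transfer_* ops.
  transform.apply_patterns.vector.lower_transfer
} : !transform.any_op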