On Improving Arm SME Lowering Resilience in MLIR

Hi, I have been using the Arm SME lowering pipeline for quite some time now, and I am deeply grateful for the work done by the community. The end-to-end examples provided have been of immense help, so I'll be referring to them throughout.

I recently had a discussion with @banach-space , and here are a few things I found challenging or that could benefit from some improvements. I would like to gather some opinions on how to best address them.

Canonicalization Patterns

If canonicalize is run between the scalable vectorize and reduction_to_contract steps, the masked vector.multi_reduction <add> over the unit-sized reduction dimension is folded into an arith.addf, and no FMOPA instructions are generated. The issue with this is potential missed optimizations since, for instance, in the matmul use case, we have to run bufferize before lowering to outer products.

Steps to Reproduce: In any of the examples, add a call to canonicalize between the vectorize op and the apply_patterns op.

Solutions: Should we match broadcast(+transpose) + mulf + addf to generate a contraction op, or is it the developer's responsibility to use cse / canonicalize with care? I have not found such a transformation even for non-scalable vectors. To be honest, I'll probably implement it downstream if needed. Please let me know if this is the kind of transformation we may want to see upstream and whether it belongs to a specific set of patterns. In my opinion, it would belong in vector.reduction_to_contract, but since the op is not a reduction anymore, I am unsure the semantics would be right.
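To make the issue concrete, here is roughly what happens (a simplified sketch with illustrative SSA names, based on the IR shown later in the thread):

// Before canonicalization: a masked reduction over the unit (K) dimension,
// which reduction_to_contract can still turn into a vector.contract.
%red = vector.mask %mask3d {
  vector.multi_reduction <add>, %mul, %acc [2]
    : vector<[4]x[4]x1xf32> to vector<[4]x[4]xf32>
} : vector<[4]x[4]x1xi1> -> vector<[4]x[4]xf32>

// After canonicalization: the unit-dim reduction is folded away and the
// structure needed to recover a contraction is gone.
%cast = vector.shape_cast %mul : vector<[4]x[4]x1xf32> to vector<[4]x[4]xf32>
%add  = arith.addf %acc, %cast : vector<[4]x[4]xf32>
%res  = arith.select %mask2d, %add, %cast : vector<[4]x[4]xi1>, vector<[4]x[4]xf32>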

The Difference in Handling of Transposed / Non-Transposed Cases

Due to the problem mentioned above, I suspect we would prefer a schedule as in the matmul-transpose-a use case, lowering computation ops directly after scalable vectorization to avoid potential transformations in between. I understand that lowering a multi-dimensional scalable vector transposition is tricky, but I do not understand why running bufferize before lowering the contraction helps sort out the issue. What are the expected behavior and limitations? If this is such a pain, maybe we could consider rewriting matmul as matmul_transpose_a (plus a transpose of A), and respectively with transpose_b, which would allow a fixed-size transpose combined with a scalable matmul.

EDIT: such a transform was merged yesterday.
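For reference, the rewrite is along these lines (an illustrative sketch; the shapes and the %Ainit destination are assumptions, not the exact upstream implementation):

// Hypothetical rewrite: linalg.matmul ins(%A, %B) outs(%C) becomes an explicit
// transpose of A followed by a matmul_transpose_a.
%At  = linalg.transpose ins(%A : tensor<?x?xf32>)
                        outs(%Ainit : tensor<?x?xf32>) permutation = [1, 0]
%res = linalg.matmul_transpose_a ins(%At, %B : tensor<?x?xf32>, tensor<?x?xf32>)
                                 outs(%C : tensor<?x?xf32>) -> tensor<?x?xf32>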

Vectorization and Scalable Vectorization

Another challenge I found was scalably vectorizing the matmul as part of a larger program and vectorizing the rest of the program with vectorize_children_and_apply_patterns. In most cases, when vectorizing the module, it tries to vectorize the contents of vector.mask ops, creating more ops inside a region that is limited to a single op.

Steps to Reproduce: Add a vectorizable op, such as an elementwise operation, to the payload and add a call to vectorize_children_and_apply_patterns after vectorize.

Solutions
In the TrA use case, vectorizing the rest of the module works if vectorization happens after lower_masks. In the "standard" case, it would hence require bufferizing before vectorizing the rest of the program, which is far from ideal. The user can obviously target specifically the ops they want to vectorize, but this is not too handy and again reduces optimization opportunities. Outlining / inlining works, but it is a rather ugly solution in my opinion and, yet again, reduces optimization opportunities.

Should we prevent vectorization from descending into vector.mask ops, considering their contents should already be vectorized since they are vector ops? It might require more changes to vectorization.
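For illustration, re-running vectorization over already-vectorized code produces IR of roughly this shape (names are illustrative): the vector.mask region, which may only contain a single maskable op, ends up with several ops inside and fails verification.

%r = vector.mask %m {
  %a = vector.transfer_read %t[%c0, %c0], %cst : tensor<?x?xf32>, vector<[4]x1xf32>
  %b = vector.broadcast %a : vector<[4]x1xf32> to vector<[4]x[4]x1xf32>
  vector.yield %b : vector<[4]x[4]x1xf32>
} : vector<[4]x1xi1> -> vector<[4]x[4]x1xf32>
// error: 'vector.mask' op expects only one operation to mask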

Further development suggestions:

Extending transform dialect for scalable size

I think it would be nice if we didn't have to hardcode the scalable size in the vectorize transform op depending on the element precision. For instance, being able to get the precision of a payload op from the transform sequence, do a simple arithmetic computation to obtain the static scalable size, and use it as an attribute for the tile and vectorize ops. Such suggestions were discussed at EuroLLVM with @rolfmorel and @ftynse, and similarly, an RFC targeting similar features for PDLL has been proposed.
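For instance, something along these lines; transform.param.constant exists today, but transform.get_element_bitwidth and transform.param.div are made-up names for the missing pieces:

// Hypothetical: derive the scalable tile size from the payload element type
// instead of hardcoding [4] for f32, [8] for f16, etc. (assuming a minimum
// SVL of 128 bits).
%bits = transform.get_element_bitwidth %matmul
    : (!transform.any_op) -> !transform.param<i64>
%svl  = transform.param.constant 128 : i64 -> !transform.param<i64>
%n    = transform.param.div %svl, %bits
    : (!transform.param<i64>, !transform.param<i64>) -> !transform.param<i64>
// %n would then be used as the scalable size in tile_using_for / vectorize.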

A transform to set vscale to a constant, and VLS

VLS has been a subject discussed upstream before. A known vscale value offers more constant-propagation possibilities. I would like to propose a transformation setting vscale to a constant value. It is a bit of a weird transformation, I agree, since we take care to keep this whole pipeline VLA, but being able to set vscale to a constant allows generating both VLA and VLS code (as is currently the case) and generating SME FMOPA, and it also enables optimizations such as pipelining the inner loops to fit a fixed tile size, removing the peeled part of the loop generated in this use case, hoisting out transpositions, etc. I wonder if this transform could be of interest to the community and be an upstream candidate. For now, canonicalize at the scf level does not propagate constants enough to get rid of the peeled loops.
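A minimal sketch of the rewrite I have in mind (illustrative IR, not an existing transform):

// Before: VLA form, the step depends on the runtime vscale.
%vscale = vector.vscale
%c4     = arith.constant 4 : index
%step   = arith.muli %vscale, %c4 : index

// After a hypothetical "assume vscale = 2" transform: VLS form, which constant
// folding can reduce to %step = arith.constant 8 : index.
%vscale = arith.constant 2 : index
%c4     = arith.constant 4 : index
%step   = arith.muli %vscale, %c4 : index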


Hi Hugo,

Thank you for creating this issue - there’s a lot of really “good” problems to discuss :slight_smile: Also, it’s great to see that people are finding the ArmSME work helpful :pray:

There’s a few separate issues here. Could I suggest that we limit this one to “canonicalization” (your first issue) and open separate threads for the other things? There’s a lot to discuss :sweat_smile:

Let me reply to your first question.

To be perfectly honest, we very rarely contribute to the cse/canonicalize patterns and often work around them. The problem with those patterns is that they are introduced at a certain point in time with a certain context in mind. Put differently, how do we decide that a pattern is a canonicalization? One can argue that given your experience, that pattern shouldn’t be a canonicalization.

For reference, you are most likely using this (note “canonicalization” at the end of the TD sequence):

func.func @matmul(%A : tensor<?x?xf32>, %B : tensor<?x?xf32>, %C : tensor<?x?xf32>) -> tensor<?x?xf32> {
  %res = linalg.matmul ins(%A, %B: tensor<?x?xf32>, tensor<?x?xf32>)
                       outs(%C: tensor<?x?xf32>) -> tensor<?x?xf32>
  return %res : tensor<?x?xf32>
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%module : !transform.any_op {transform.consumed}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %module
      : (!transform.any_op) -> !transform.any_op

    // Step 1: Tile for size [4] x [4], which corresponds to SVLs x SVLs, where
    // SVLs is the number of 32-bit elements in a vector of SVL bits.
    %tiled_linalg_op, %loops:3 = transform.structured.tile_using_for %matmul[[4], [4], 1]
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)

    // Step 2: Vectorize.
    transform.structured.vectorize %tiled_linalg_op vector_sizes [[4], [4], 1]
      : !transform.any_op

    // Step 3: Bufferize ahead of TransferReadDropUnitDimsPattern, which
    // currently only supports memrefs.
    %bufferize = transform.bufferization.one_shot_bufferize %module
      {bufferize_function_boundaries=true} : (!transform.any_op) -> !transform.any_op

    %func = transform.structured.match ops{["func.func"]} in %bufferize
      : (!transform.any_op) -> !transform.any_op

    // THIS IS KEY!!!!
    transform.apply_registered_pass "canonicalize" to %func : (!transform.any_op) -> !transform.any_op

    transform.yield
  }
}

And this is the output:

$ mlir-opt --transform-interpreter file.mlir
(...)
          %7 = vector.mask %6 { vector.transfer_read %subview_5[%c0, %c0], %cst {in_bounds = [true, true]} : memref<?x?xf32, strided<[?, ?], offset: ?>>, vector<[4]x[4]xf32> } : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32>
          %8 = arith.mulf %3, %5 : vector<[4]x[4]x1xf32>
          %9 = vector.create_mask %0, %1 : vector<[4]x[4]xi1>
          %10 = vector.shape_cast %8 : vector<[4]x[4]x1xf32> to vector<[4]x[4]xf32>
          %11 = arith.addf %7, %10 : vector<[4]x[4]xf32>
          %12 = arith.select %9, %11, %10 : vector<[4]x[4]xi1>, vector<[4]x[4]xf32>
          vector.mask %6 { vector.transfer_write %12, %subview_5[%c0, %c0] {in_bounds = [true, true]} : vector<[4]x[4]xf32>, memref<?x?xf32, strided<[?, ?], offset: ?>> } : vector<[4]x[4]xi1>
(...)

Now, going back to your question:

IMHO, no. That would mean "lifting" abstractions. Instead, we should leverage the concept of "progressive lowering". In this case, it feels like "canonicalization" has done too much. @MacDue has kindly dug out the PR that introduced this:

@vmurali + @dcaballe - what are your thoughts on this? Should we re-consider whether those patterns qualify as canonicalizations?

@nujaa , are you OK to open more threads for the other items to discuss?

-Andrzej


As a general fly-by comment, I think we are abusing the canonicalization framework to implement many rewrite patterns that shouldn’t be considered as canonicalizations, whether we define those as always-beneficial work reduction optimizations or transformations bringing IR to a simpler form. I am not even convinced that declaring something canonical makes sense given the extensibility of MLIR, and would rather specify pre/post conditions of certain transformations.

The lack of controllability in vectorization is rather unfortunate. One “generic” solution could be to provide a list of ops to not vectorize, which could be collected by matching vector.mask in this case. This still requires introducing filters into the main logic of vectorization, but shouldn’t preclude other optimizations.

I’m happy to review design/code of adding arithmetic operations to transform. PDL(L) has lots of IMO unnecessary complexity that I am not currently competent to handle.

Beyond the scalable vectors, it may be interesting to set some values to constants and re-optimize the IR, for example for multi-versioning.

This may be due to missing canonicalizations on SCF; examples where propagation should happen but doesn't would help.


PDL seems to be getting a lot of new investment/attention. Collectively, I work with a number of people who have significant mileage on it, but I don't think we have organized code owners or responsibility for any of it. We should probably get better organized about who the go-to people for it are.

In general, I’m seeing PDL be the go-to tool for graph-level matching/transforms targeted at product. Engineers who use it seem to have a good experience. We’re mainly using the other options for early prototyping.

I’m also not competent to be the go-to reviewer on PDL, but just noting that we have a real lack of reviewers/owners here.


Are you saying that the canonicalization has lost information here? It’s not clear to me…
In general, the canonical form is meant to handle equivalent forms of the program: that is, at one given level of abstraction, the input program could come in multiple forms, and canonicalization allows processing at this level of abstraction to match a single form.
If we stay within the "vector + arith" dialects, it's not clear to me why keeping a "reduction operation over a dim-1 size" would be the canonical form of the program? (Note that arguing this is the canonical form would lead to implementing the opposite pattern!)

I’d be happy to review (given how this work originated), but agreed I think we have more downstream users than upstream reviewers/owners. And this is a core element.


Can you provide an example?

Yes, that’s what I meant. To me, we have two options here:

linalg.matmul -> vector.multi_reduction -> vector.contract -> vector.outer_product

or

linalg.matmul -> vector.multi_reduction -> arith.mulf + arith.addf

The first one makes lowering to outer-product engines (Arm SME is one example) very natural. The latter would require pattern matching that we don’t have. How do we decide which form is more canonical? If any.
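For reference, a rough sketch of the first path's intermediate form before the outer-product lowering (shapes, maps, and masking are illustrative and simplified):

%acc = vector.contract {
    indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                     affine_map<(m, n, k) -> (k, n)>,
                     affine_map<(m, n, k) -> (m, n)>],
    iterator_types = ["parallel", "parallel", "reduction"],
    kind = #vector.kind<add>}
  %lhs, %rhs, %init
  : vector<[4]x1xf32>, vector<1x[4]xf32> into vector<[4]x[4]xf32>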

I have zero experience with PDL and no opinion. For me, @ftynse’s tutorial on TD (Transform Dialect Tutorial - MLIR) made it super easy to learn (thanks!). Is there anything similar for PDL? Or in-tree ref examples?

Can you elaborate clearly on what information is lost?

Would you canonicalize the arith.mulf + arith.addf to a reduction with dimension 1? If not, then why would we keep the reduction here?
What if the input already has the arith.mulf + arith.addf code for other reasons? Shouldn't you be able to handle this gracefully anyway?

Happy to review too!!

(Nit: "more downstream users than upstream reviewers/owners" should apply to every single component, right? It is the normality of adoption.)

Once we lower to arith.mulf + arith.addf, we’ve lost the information that everything started with linalg.matmul. Now, computing a matrix-multiplication operation using outer-products is quite straightforward. Put differently, this is easy:

  • linalg.matmul → vector.outer_product

this is not (if possible at all, but perhaps I'm missing something obvious):

  • linalg.matmul → arith.addf + arith.mulf → vector.outer_product

The latter also feels very counter-intuitive - why would you go via Arith if you can go directly to Vector?

We can handle that, but we want to avoid such code. It’s likely going to lead to something very slow (loops with hw vector ops rather than one hw matrix Op) unless:

  • we can “guess” that it came from linalg.matmul, and hence
  • rewrite it as vector.outerproduct.

Any ideas on how to match suitable pairs of arith.addf + arith.mulf?

Now, I appreciate that this is a bit hand-wavy - I’ve not investigated the arith.addf + arith.mulf path in much detail. I also haven’t tried running canonicalisations at that point in the pipeline before. That’s why I’m hoping that folks who worked on that could chime in and share their experience.

Also, vector.multi_reduction and vector.contract are quite complex - please let me know if there’s a path here that I’m missing.

I’m not sure why this is relevant from the point of view of the vector dialect? (the canonicalization has nothing to do with linalg as far as I know: it is a transformation on the vector dialect right?)

I would need to see a concrete, actual example with this canonicalization that could illustrate what you mean; I can't follow right now.

It could use some documentation love. I’ve seen multiple people be initially reticent based on that, but then with a few hours of time put in, become strong supporters of it (all in the space of graph level pattern matching of ML graphs). I’ve not used it myself – just reporting that I’ve seen a lot of people have a good experience and go on to be long term advocates of it for that use case.

Sorry, I see where the confusion is coming from.

There are 2 categories of ops in the vector dialect - high-level and low-level ops. vector.contract / vector.multi_reduction are both high-level ops - we can quite easily deduce that we are dealing with a matmul/matvec (or something similar) at this level. That's akin to dealing with linalg.matmul (*). These high-level ops are normally progressively lowered to low-level ops like vector.outerproduct/vector.fma.

What's happening here is that this design ^^^ (high-level vs low-level vector ops) is not being followed (by using ops from other dialects). To me that's not "canonical". Admittedly, that's just based on my experience - hopefully other folks with more experience with Vector+Arith will also chime in.

Right, because LLVM does not support “scalable” arrays of “scalable” vectors (fixed-width arrays of scalable vectors are fine), the following:

 arith.addf %7, %10 : vector<[4]x[4]xf32>

would have to be lowered as (“late night simplified syntax”):

scf.for %i = %c0 to %c4_vscale step %c1 {
  %lhs = vector.load %7[%i] : vector<[4]xf32>
  %rhs = vector.load %10[%i] : vector<[4]xf32>
  %sum = arith.addf %lhs, %rhs : vector<[4]xf32>
  vector.store %sum, %result[%i] : vector<[4]xf32>
}

But even that is going to be quite problematic (LLVM can’t handle scalability in 2 dims). We’ve not needed that and just try to avoid things like vector<[4]x[4]xf32> (2 scalable dims) unless we can lower to vector.outer_product.

Thanks for all the discussion so far!

-Andrzej

(*) Sorry, I shouldn’t be using linalg.matmul - what’s key is the higher level semantics of the computation (i.e. matrix multiplication).

OK so that loops back into one specific aspect of my earlier comment I think, emphasis below:

As for the example:

I meant: the complete example showing the 1-dim reduction first and then the canonicalization, leading to the information loss that matters for the lowering.

But that point may very well be secondary to a clear separation of the layers of abstraction at hand here, I think (TBD), at least as I tend to approach canonicalization!

We have a matcher for those within linalg.generic: llvm-project/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp at fe47e8ff3ae7fc8975eaade6bfa6679737c28b93 · llvm/llvm-project · GitHub that is also exposed via transform. Things become more interesting if you also need to recover the generic from loop structure, though not impossible. @chelini had some work in this direction IIRC.

Your best take at documentation is PDL - Pattern Descriptor Language.pdf - Google Drive, Interpreted Pattern Match Execution- External - Google Slides and https://mlir.llvm.org/OpenMeetings/2021-11-04-PDLL-Pattern-Frontend-Language.pdf if you are also interested in PDLL. There are links to recordings of corresponding talks on Talks - MLIR. It would be nice if some of that downstream interest/investment ended up contributing upstream, at least the documentation.

My opinion of it is that it’s a good replacement for tablegen-defined graph rewrites (DRR) that was designed to work on a large scale, i.e., at least O(100) patterns applied to sufficiently large inputs. If that’s your use case, certainly the best tool for the job. For a small number of patterns, the overhead of dialect conversion + automaton construction + interpretation is not worth it.

AFAIK, it is currently incapable of matching regions or doing non-trivial attribute/type matches beyond equality comparisons (e.g., is the second affine map of linalg.generic a projected permutation), which made it poorly suited for matching Linalg. We had initially considered it as matcher engine for transform with some integration. There is an escape hatch to call back into C++ via a stringly interface, but using it extensively quickly offsets many of the benefits while adding more complexity.

@math-fehr and I have discussed some potential improvements, but nobody had time to follow through.

I generally agree with this vision, but the problem is that we don’t have a clear separation of levels of abstraction, or indeed a good definition of what it is. It sounds helpful to have canonicalization “aware” of the level of abstraction somehow, e.g., preserve vector operations in this case until we explicitly decided otherwise.


Thanks for the post! Great to see more people using SME and scalable vectors!

If I’m parsing the IR correctly, I can see that you are vectorizing K (the reduction dimension) with size 1, which means we are only reducing a single element, making the reduction redundant. I would expect canonicalization to remove such a reduction, in the same way it gets rid of other trivial or redundant computation. If you make the K vector size greater than 1, no canonicalization should happen. This is the intended behavior, IMO.

Perhaps the problem lies in the expectations of having a “low-level” canonical form for a matmul computation once we have lost the structural information that higher level ops provide (e.g., linalg.matmul, vector.contract…), which leads to a deeper issue:

This lowering path, which exists for historical reasons, has been serving our goals so far but it's not the first time that it has caused problems. As mentioned multiple times in the past, the vector.multi_reduction -> vector.contract step should be removed in favor of a direct linalg.matmul -> vector.contract lowering. The current step is not a lowering but a raise in abstraction and, as such, a certain level of complexity in the pattern matching is expected. This is the actual problem that should be fixed, IMO.

This looks like an easy-to-fix bug and you pointed to the solution: the vectorizer shouldn't try to vectorize an op that is already vectorized. Both the mask op and the masked op should have vector types, so they shouldn't pass the vectorizer's preconditions. I'm also wondering what vectorization path such a masked operation is taking, as we mostly vectorize linalg ops, and the masked operation shouldn't be a linalg op at that point since it has been vectorized already. In any case, it should be a matter of bailing out from vectorization if any of the operand/result types of the candidate op has a vector type. It would be good to open a GitHub issue with a concrete reproducer and intermediate IR dumps.

This is one of the steps we planned to implement as part of the dynamic vector proposal so I would support this if it’s generic/reusable enough :slight_smile:

Indeed. :heart:

Sure, will do by the end of the week.

With such a sequence:

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%module : !transform.any_op ) {
    %matmul = transform.structured.match ops{["linalg.matmul_transpose_a"]} in %module
      : (!transform.any_op) -> !transform.any_op
    // Tile to SVL (vscale = 2)
    %tiled_linalg_op_0, %loops_1:3 = transform.structured.tile_using_for %matmul[8, 8, 1] : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
 
    // Step 1: Tile for size [4] x [4], which corresponds to SVLs x SVLs, where
    // SVLs is the number of 32-bit elements in a vector of SVL bits.
    %tiled_linalg_op, %loop_1, %loop_2, %loop_3 = transform.structured.tile_using_for %tiled_linalg_op_0[[4], [4], 1]
      : (!transform.any_op) -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">, !transform.any_op)
    // Peel inner loop first then outer loop to remove masks in main loop
    %inner_main_loop, %inner_remainder_loop = transform.loop.peel %loop_2 : (!transform.op<"scf.for">) -> (!transform.op<"scf.for">, !transform.op<"scf.for">)
    %outer_main_loop, %outer_remainder_loop = transform.loop.peel %loop_1 : (!transform.op<"scf.for">) -> (!transform.op<"scf.for">, !transform.op<"scf.for">)

    // Step 2: Vectorize.
    %matmulpeeled = transform.structured.match ops{["linalg.matmul_transpose_a"]} in %module : (!transform.any_op) -> !transform.any_op
    transform.structured.vectorize %matmulpeeled vector_sizes [[4], [4], 1]
      :  !transform.any_op

    %func = transform.structured.match ops{["func.func"]} in %module
      : (!transform.any_op) -> !transform.any_op

    // Step 3: Lower vector.multi_reduction to vector.contract (+ some helpful patterns).
    transform.apply_patterns to %func {
      // Add canonicalization to std pipeline,  we lose contracts but generate peeled loops.
      transform.apply_patterns.canonicalization
      transform.apply_patterns.vector.lower_masked_transfers
      transform.apply_patterns.vector.transfer_permutation_patterns
      transform.apply_patterns.vector.reduction_to_contract
    } : !transform.any_op

    // Step 4: Lower vector.contract to vector.outerproduct.
    transform.apply_patterns to %func {
      transform.apply_patterns.vector.lower_contraction lowering_strategy = "outerproduct"
      transform.apply_patterns.vector.lower_masks
    } : !transform.any_op

    transform.yield
  }
}

This gives outer tiles of SVL size and peeled loops. We lose the contraction because of the canonicalization required by peeling. BTW, canonicalizing before the vectorize breaks because it goes inside the MaskOp again.

 #map = affine_map<(d0)[s0] -> (-d0 + s0, 8)>
#map1 = affine_map<()[s0, s1] -> (s0 - s0 mod s1)>
#map2 = affine_map<(d0)[s0] -> (-d0 + s0)>
module {
  func.func @matmul_transpose_a(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>) {
    %cst = arith.constant 0.000000e+00 : f32
    %c4 = arith.constant 4 : index
    %c8 = arith.constant 8 : index
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %dim_1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
    // 3 outer tiles [8, 8, 1]
    %0 = scf.for %arg3 = %c0 to %dim_0 step %c8 iter_args(%arg4 = %arg2) -> (tensor<?x?xf32>) {
      %1 = scf.for %arg5 = %c0 to %dim_1 step %c8 iter_args(%arg6 = %arg4) -> (tensor<?x?xf32>) {
        %2 = scf.for %arg7 = %c0 to %dim step %c1 iter_args(%arg8 = %arg6) -> (tensor<?x?xf32>) {
          %3 = affine.min #map(%arg3)[%dim_0]
          %4 = affine.min #map(%arg5)[%dim_1]
          %extracted_slice = tensor.extract_slice %arg0[%arg7, %arg3] [1, %3] [1, 1] : tensor<?x?xf32> to tensor<1x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg1[%arg7, %arg5] [1, %4] [1, 1] : tensor<?x?xf32> to tensor<1x?xf32>
          %extracted_slice_3 = tensor.extract_slice %arg8[%arg3, %arg5] [%3, %4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %vscale = vector.vscale
          %c4_vscale = arith.muli %vscale, %c4 : index
          %vscale_4 = vector.vscale
          %c4_vscale_5 = arith.muli %vscale_4, %c4 : index
          %5 = affine.apply #map1()[%3, %c4_vscale]
          // Outer scalable loop
          %6 = scf.for %arg9 = %c0 to %5 step %c4_vscale iter_args(%arg10 = %extracted_slice_3) -> (tensor<?x?xf32>) {
            %8 = affine.apply #map1()[%4, %c4_vscale_5]
            // Main loop
            %9 = scf.for %arg11 = %c0 to %8 step %c4_vscale_5 iter_args(%arg12 = %arg10) -> (tensor<?x?xf32>) {
              %extracted_slice_6 = tensor.extract_slice %extracted_slice[0, %arg9] [1, %c4_vscale] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_7 = tensor.extract_slice %extracted_slice_2[0, %arg11] [1, %c4_vscale_5] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_8 = tensor.extract_slice %arg12[%arg9, %arg11] [%c4_vscale, %c4_vscale_5] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
              %11 = vector.transfer_read %extracted_slice_6[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[4]xf32>
              %12 = vector.broadcast %11 : vector<1x[4]xf32> to vector<[4]x1x[4]xf32>
              %13 = vector.transpose %12, [2, 0, 1] : vector<[4]x1x[4]xf32> to vector<[4]x[4]x1xf32>
              %14 = vector.transfer_read %extracted_slice_7[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[4]xf32>
              %15 = vector.broadcast %14 : vector<1x[4]xf32> to vector<[4]x1x[4]xf32>
              %16 = vector.transpose %15, [0, 2, 1] : vector<[4]x1x[4]xf32> to vector<[4]x[4]x1xf32>
              %17 = vector.transfer_read %extracted_slice_8[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<[4]x[4]xf32>
              %18 = arith.mulf %13, %16 : vector<[4]x[4]x1xf32>
              %19 = vector.constant_mask [4, 4] : vector<[4]x[4]xi1>
              %20 = vector.shape_cast %18 : vector<[4]x[4]x1xf32> to vector<[4]x[4]xf32>
              %21 = arith.addf %17, %20 : vector<[4]x[4]xf32>
              %22 = arith.select %19, %21, %20 : vector<[4]x[4]xi1>, vector<[4]x[4]xf32>
              %23 = vector.transfer_write %22, %extracted_slice_8[%c0, %c0] {in_bounds = [true, true]} : vector<[4]x[4]xf32>, tensor<?x?xf32>
              %inserted_slice_9 = tensor.insert_slice %23 into %arg12[%arg9, %arg11] [%c4_vscale, %c4_vscale_5] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
              scf.yield %inserted_slice_9 : tensor<?x?xf32>
            }
            // peeled innerloop
            %10 = scf.for %arg11 = %8 to %4 step %c4_vscale_5 iter_args(%arg12 = %9) -> (tensor<?x?xf32>) {
              %11 = affine.apply #map2(%arg11)[%4]
              %extracted_slice_6 = tensor.extract_slice %extracted_slice[0, %arg9] [1, %c4_vscale] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_7 = tensor.extract_slice %extracted_slice_2[0, %arg11] [1, %11] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_8 = tensor.extract_slice %arg12[%arg9, %arg11] [%c4_vscale, %11] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
              %12 = vector.transfer_read %extracted_slice_6[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[4]xf32>
              %13 = vector.broadcast %12 : vector<1x[4]xf32> to vector<[4]x1x[4]xf32>
              %14 = vector.transpose %13, [2, 0, 1] : vector<[4]x1x[4]xf32> to vector<[4]x[4]x1xf32>
              %15 = vector.create_mask %11 : vector<[4]xi1>
              %16 = vector.broadcast %15 : vector<[4]xi1> to vector<1x[4]xi1>
              %17 = vector.transfer_read %extracted_slice_7[%c0, %c0], %cst, %16 {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[4]xf32>
              %18 = vector.broadcast %17 : vector<1x[4]xf32> to vector<[4]x1x[4]xf32>
              %19 = vector.transpose %18, [0, 2, 1] : vector<[4]x1x[4]xf32> to vector<[4]x[4]x1xf32>
              %20 = vector.create_mask %c4_vscale, %11 : vector<[4]x[4]xi1>
              %21 = vector.transfer_read %extracted_slice_8[%c0, %c0], %cst, %20 {in_bounds = [true, true]} : tensor<?x?xf32>, vector<[4]x[4]xf32>
              %22 = arith.mulf %14, %19 : vector<[4]x[4]x1xf32>
              %23 = vector.create_mask %c4_vscale, %11 : vector<[4]x[4]xi1>
              %24 = vector.shape_cast %22 : vector<[4]x[4]x1xf32> to vector<[4]x[4]xf32>
              %25 = arith.addf %21, %24 : vector<[4]x[4]xf32>
              %26 = arith.select %23, %25, %24 : vector<[4]x[4]xi1>, vector<[4]x[4]xf32>
              %27 = vector.transfer_write %26, %extracted_slice_8[%c0, %c0], %20 {in_bounds = [true, true]} : vector<[4]x[4]xf32>, tensor<?x?xf32>
              %inserted_slice_9 = tensor.insert_slice %27 into %arg12[%arg9, %arg11] [%c4_vscale, %11] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
              scf.yield %inserted_slice_9 : tensor<?x?xf32>
            }
            scf.yield %10 : tensor<?x?xf32>
          }
          // Peeled outer loop
          %7 = scf.for %arg9 = %5 to %3 step %c4_vscale iter_args(%arg10 = %6) -> (tensor<?x?xf32>) {
            %8 = affine.apply #map1()[%4, %c4_vscale_5]
            %9 = scf.for %arg11 = %c0 to %8 step %c4_vscale_5 iter_args(%arg12 = %arg10) -> (tensor<?x?xf32>) {
              %11 = affine.apply #map2(%arg9)[%3]
              %extracted_slice_6 = tensor.extract_slice %extracted_slice[0, %arg9] [1, %11] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_7 = tensor.extract_slice %extracted_slice_2[0, %arg11] [1, %c4_vscale_5] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_8 = tensor.extract_slice %arg12[%arg9, %arg11] [%11, %c4_vscale_5] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
              %12 = linalg.matmul_transpose_a ins(%extracted_slice_6, %extracted_slice_7 : tensor<1x?xf32>, tensor<1x?xf32>) outs(%extracted_slice_8 : tensor<?x?xf32>) -> tensor<?x?xf32>
              %inserted_slice_9 = tensor.insert_slice %12 into %arg12[%arg9, %arg11] [%11, %c4_vscale_5] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
              scf.yield %inserted_slice_9 : tensor<?x?xf32>
            }
            %10 = scf.for %arg11 = %8 to %4 step %c4_vscale_5 iter_args(%arg12 = %9) -> (tensor<?x?xf32>) {
              %11 = affine.apply #map2(%arg9)[%3]
              %12 = affine.apply #map2(%arg11)[%4]
              %extracted_slice_6 = tensor.extract_slice %extracted_slice[0, %arg9] [1, %11] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_7 = tensor.extract_slice %extracted_slice_2[0, %arg11] [1, %12] [1, 1] : tensor<1x?xf32> to tensor<1x?xf32>
              %extracted_slice_8 = tensor.extract_slice %arg12[%arg9, %arg11] [%11, %12] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
              %13 = linalg.matmul_transpose_a ins(%extracted_slice_6, %extracted_slice_7 : tensor<1x?xf32>, tensor<1x?xf32>) outs(%extracted_slice_8 : tensor<?x?xf32>) -> tensor<?x?xf32>
              %inserted_slice_9 = tensor.insert_slice %13 into %arg12[%arg9, %arg11] [%11, %12] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
              scf.yield %inserted_slice_9 : tensor<?x?xf32>
            }
            scf.yield %10 : tensor<?x?xf32>
          }
          %inserted_slice = tensor.insert_slice %7 into %arg8[%arg3, %arg5] [%3, %4] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %2 : tensor<?x?xf32>
      }
      scf.yield %1 : tensor<?x?xf32>
    }
    %cast = tensor.cast %0 : tensor<?x?xf32> to tensor<*xf32>
    call @printMemrefF32(%cast) : (tensor<*xf32>) -> ()
    return
  }
  module attributes {transform.with_named_sequence} {
  }
  func.func private @printMemrefF32(tensor<*xf32>)
}

If you replace vector.vscale in place with %vscale = arith.constant 2 : index:

The peeled loops should disappear as there is no remainder, and the main loop should run only one iteration. I suspect the constant propagation is lost because of the affine map depending on the outer tile position.

I think the goal is to generate an outer product out of those arith.addf + arith.mulf because we are in 1D and it will be easier than recreating matmuls, and we don't want to "lift" as you mentioned. The important part of the pattern lies in the broadcasts beforehand (otherwise it is just an elementwise fmuladd). There will be transpositions inserted along the way, which makes it a bit more cumbersome.
If we consider that the matmul has been transposed to fit the matmul_transpose_a use case, we have two 1D vectors A and B of type vector<[4]xf32>, multiplied and accumulated with arith.addf into a vector<[4]x[4]xf32> accumulator C. We want the result to be something like:

$\forall i, j:\ C_{i,j} \leftarrow C_{i,j} + A_i \cdot B_j$

That is, broadcasting A "horizontally" (aka broadcast(transpose(A)); it might be more efficient to broadcast the transposed value, but I am not sure how hardware can represent that) and broadcasting B "vertically".
In our case, A and B are 2D vectors with a unit dim. Can we ignore it and call them "1D-like"?

Basically, the inline matching would be:

addf(mulf(transpose(broadcast(a)), broadcast(b)), C) : vector<nxkxfXX>

taking care of commutativity.

If I'm not mistaken, it does not depend on the shapes of anything other than A and B being "1D-like", and that the broadcasted vectors and C have the same type, which should already be checked by the verifier.

I took a quick look at this, I don’t think it’s too bad (if you can get rid of some unit dims).

Here’s the ‘canonical’ form of a masked outerproduct (via multi-reduction):

Canonical form
%lhsCast = vector.shape_cast %inputLHS : vector<[4]xf32> to vector<[4]x1xf32>
%lhsBcast = vector.broadcast %lhsCast : vector<[4]x1xf32> to vector<[4]x[4]x1xf32>
%lhsT = vector.transpose %lhsBcast, [1, 0, 2] : vector<[4]x[4]x1xf32> to vector<[4]x[4]x1xf32>
%rhsCast = vector.shape_cast %inputRHS : vector<[4]xf32> to vector<1x[4]xf32>
%rhsBcast = vector.broadcast %rhsCast : vector<1x[4]xf32> to vector<[4]x1x[4]xf32>
%rhs = vector.transpose %rhsBcast, [0, 2, 1] : vector<[4]x1x[4]xf32> to vector<[4]x[4]x1xf32>
%mul = arith.mulf %lhsT, %rhs : vector<[4]x[4]x1xf32>
%tileMask = vector.create_mask %lhsDim, %rhsDim : vector<[4]x[4]xi1>
%dropDim = vector.shape_cast %mul : vector<[4]x[4]x1xf32> to vector<[4]x[4]xf32>
%addAcc = arith.addf %acc, %dropDim : vector<[4]x[4]xf32>
%applyMask = arith.select %tileMask, %acc, %addAcc : vector<[4]x[4]xi1>, vector<[4]x[4]xf32>

Not sure why it shape-casts to add the unit dims (does broadcast need that?); it also results in a semi-pointless transpose of the RHS (it only moves the unit dim).

If we got rid of all the unit dims, I don’t think this is too bad:

No unit dims
%lhsBcast = vector.broadcast %lhsCast : vector<[4]xf32> to vector<[4]x[4]xf32>
%lhsT = vector.transpose %lhsBcast, [1, 0] : vector<[4]x[4]xf32> to vector<[4]x[4]xf32>
%rhsBcast = vector.broadcast %rhs : vector<[4]xf32> to vector<[4]x[4]xf32>
%mul = arith.mulf %lhsT, %rhsBcast : vector<[4]x[4]xf32>
%tileMask = vector.create_mask %lhsDim, %rhsDim : vector<[4]x[4]xi1>
%addAcc = arith.addf %acc, %mul : vector<[4]x[4]xf32>
%applyMask = arith.select %tileMask, %acc, %addAcc : vector<[4]x[4]xi1>, vector<[4]x[4]xf32> 

Then this can be lowered with two fairly easy patterns:

Step 1
%lhsBcast = vector.broadcast %lhsCast : vector<[4]xf32> to vector<[4]x[4]xf32>
%lhsT = vector.transpose %lhsBcast, [1, 0] : vector<[4]x[4]xf32> to vector<[4]x[4]xf32>
%rhsBcast = vector.broadcast %rhs : vector<[4]xf32> to vector<[4]x[4]xf32>
%mul = arith.mulf %lhsT, %rhsBcast : vector<[4]x[4]xf32>

This can be rewritten as:

%mul = arm_sme.outerproduct $lhs, $rhs : vector<[4]xf32>, vector<[4]xf32>
Step 2
%mul = arm_sme.outerproduct $lhs, $rhs : vector<[4]xf32>, vector<[4]xf32>
%addAcc = arith.addf %acc, %mul : vector<[4]x[4]xf32>
%applyMask = arith.select %tileMask, %acc, %addAcc : vector<[4]x[4]xi1>, vector<[4]x[4]xf32>

This can be rewritten as:

%lhsMask = vector.create_mask %lhsDim : vector<[4]xi1>
%rhsMask = vector.create_mask %rhsDim : vector<[4]xi1>
%mul = arm_sme.outerproduct $lhs, $rhs acc($acc) masks($lhsMask, $rhsMask) : vector<[4]xf32>, vector<[4]xf32>

Isn't the way the transform dialect handles these the same as one would in PDL today, which is custom C++ code? And you build up a set of helpers, which then enables a bigger set of helpers. So it's not expressivity that is being discussed.

But we are heading off track here, and there are some different interests here … So I’ll (mostly:-)) restrain my response.

What is true is that the set of atoms needs expanding, in whichever way they are expressed. But rooted in real use. Which is why I've enjoyed seeing folks using PDL more actively. We need to be mindful of extending its core, as Stella remarked in the other thread.

Yeah, I dropped an engineer into a post showing just one pattern and they self-served easily. But that's not an excuse for not having better docs :slight_smile: (coincidentally, that engineer was going to write one, but unfortunately they had other constraints and I only have an internal slide deck from them). Oh, they also used PDLL for authoring, which, with LSP integration, does help speed up discovery quite a bit. Of course the PDLL generator can also be improved, etc., but their work has been humming along for some time.

We cannot prove that the peeled loop has no iterations. Let's take the inner loop. Its bounds are %8 to %4, so we need to prove that %4 - %8 <= 0. %8 itself comes from an affine.apply mixing %4 with constants and vscale: %8 = %4 - %4 mod (4*vscale) = %4 - %4 mod 8. Substituting, %4 - (%4 - %4 mod 8) = %4 mod 8, so we need %4 mod 8 <= 0, or simply %4 mod 8 == 0 since the remainder is non-negative. That is, we need to prove that %4 is divisible by 8. %4 itself is an affine.min involving %arg5 and dim(%arg1, 1). %arg5 is divisible by 8 because it's an induction variable of a loop starting at 0 and stepping by 8. However, we don't have any information about the divisibility of dim(%arg1, 1) since it comes from tensor<?x?xf32>. So we don't know whether %4 is always divisible by 8 and therefore cannot eliminate the peeled loop. The same holds for the outer peeled loop.
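Spelled out compactly (writing %4 as $n$ and %8 as $m$, with $4 \cdot \text{vscale} = 8$):

$m = n - (n \bmod 8)$, so the peeled trip count is $n - m = n \bmod 8$, which is zero iff $8 \mid n$.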

At this point, there just isn't enough information in the IR. We need information about the divisibility of tensor dimensions. This information can be asserted locally, e.g., by peeling the %arg5 loop, or by padding the tensor somehow.

Now, I tried adding that information by having %arg1 typed as tensor<?x640xf32> and that still wasn’t sufficient to erase the peeled loop although some constant propagation did happen. My speculation is that we don’t reason about divisibility in trivial loop removal. This can be added. Consider filing a bug and we can discuss there in more detail.

It does dispatch to C++ code. It does not do so through a stringmap of function pointers with type-punned arguments, populated via a separate registration mechanism. This specific point may be addressed, I believe, via interfaces similar to transform; though it may be adverse to PDLL. It also doesn’t hide the dispatch behind a dialect conversion and an automaton, which make it much harder to define “native” atoms. When most of the matching is dispatching to C++ that just needs to be glued together declaratively (linalg matching case), it’s unclear to me what is the benefit of paying for all that complexity and rigidity. When most of the matching is subgraphs incurred by use-def chains, there may be. These are just different use cases. We may want to converge the support for those, or not, depending on the maintenance complexity. I am happy to discuss that in a separate thread.

Continuing the discussion from On Improving Arm SME Lowering Resilience in MLIR:

Hi, I just went through the discussion again and noticed I missed some parts.

I am not all too familiar with multi-versioning; could you share some examples, please?

Typically with this example (I'll obviously try to reduce it for the PR):

#map = affine_map<(d0, d1) -> ()>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
func.func @matmul(%A : tensor<?x?xf32>, %B : tensor<?x?xf32>, %C : tensor<?x?xf32>) {
  %cst = arith.constant 0.200000e+00 : f32
  %0 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel"]} ins(%cst : f32) outs(%A : tensor<?x?xf32>) {
  ^bb0(%in: f32, %out: f32):
    %3 = arith.mulf %in, %out : f32
    linalg.yield %3 : f32
  } -> tensor<?x?xf32>
  %res = linalg.matmul ins(%0, %B: tensor<?x?xf32>, tensor<?x?xf32>)
                       outs(%C: tensor<?x?xf32>) -> tensor<?x?xf32>
  %xf = tensor.cast %res : tensor<?x?xf32> to tensor<*xf32>
  call @printMemrefF32(%xf) : (tensor<*xf32>) -> ()
  return
}
func.func private @printMemrefF32(%ptr : tensor<*xf32>)
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%module : !transform.any_op ) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %module
      : (!transform.any_op) -> !transform.any_op

    // Tile and vectorize matmul scalably
    %tiled_linalg_op, %loops:3 = transform.structured.tile_using_for %matmul[[4], [4], 1]
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
    transform.structured.vectorize %tiled_linalg_op vector_sizes [[4], [4], 1]
      : !transform.any_op

    // Step 2: Vectorize generic
    %generics = transform.structured.match ops{["linalg.generic"]} in %module
      : (!transform.any_op) -> !transform.any_op
    %tiled_generic, %loops1:2 = transform.structured.tile_using_for %generics[4, 4]
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
    %0 = transform.structured.vectorize_children_and_apply_patterns %module {vectorize_nd_extract, vectorize_padding} : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}

Output:

./build/bin/mlir-opt -transform-interpreter -test-transform-dialect-erase-schedule mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/test.mlir
mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/test.mlir:10:10: error: 'vector.mask' op expects only one operation to mask
  %res = linalg.matmul ins(%0, %B: tensor<?x?xf32>, tensor<?x?xf32>)
         ^
mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/test.mlir:10:10: note: see current operation:
%23 = "vector.mask"(%22) ({
  %32 = "vector.transfer_read"(%7, %arg3, %arg7, %0) <{in_bounds = [true, true], operandSegmentSizes = array<i32: 1, 2, 1, 0>, permutation_map = affine_map<(d0, d1) -> (d0, d1)>}> : (tensor<?x?xf32>, index, index, f32) -> vector<[4]x1xf32>
  %33 = "vector.broadcast"(%32) : (vector<[4]x1xf32>) -> vector<[4]x[4]x1xf32>
  %34 = "vector.transpose"(%33) <{permutation = array<i64: 1, 0, 2>}> : (vector<[4]x[4]x1xf32>) -> vector<[4]x[4]x1xf32>
  "vector.yield"(%34) : (vector<[4]x[4]x1xf32>) -> ()
}) : (vector<[4]x1xi1>) -> vector<[4]x[4]x1xf32>

At the previous step, "vector.mask"(%22) looks like:

%extracted_slice = tensor.extract_slice %0[%arg3, %arg7] [%4, 1] [1, 1] : tensor<?x?xf32> to tensor<?x1xf32>
%7 = vector.mask %6 { vector.transfer_read %extracted_slice[%c0, %c0], %cst {in_bounds = [true, true, true], permutation_map = #map4} : tensor<?x1xf32>, vector<[4]x[4]x1xf32> } : vector<[4]x1xi1> -> vector<[4]x[4]x1xf32>

So I think the problem is not that vectorization targets the MaskOp itself, but that it updates the extracted slice feeding it.