SME in MLIR status (20/10/2023)

Previous update:

Current state

Following the groundwork to represent SME in MLIR and to implement tile allocation (D154955), we have been making steady progress towards lowering from Linalg, via Vector, to LLVM.

Today you can already lower basic examples with linalg.fill to SME and run the resulting executables (through an emulator).
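As a minimal illustration, the kind of input this covers is a linalg.fill on a dynamically-shaped memref (this is a sketch, not one of our exact test cases; the function name is hypothetical):

// Fill a dynamically-shaped memref with zeroes. After tiling to
// SME-compatible (scalable) sizes, this can lower to ZA tile
// operations via the ArmSME dialect.
func.func @fill_with_zero(%out : memref<?x?xi32>) {
  %zero = arith.constant 0 : i32
  linalg.fill ins(%zero : i32) outs(%out : memref<?x?xi32>)
  return
}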

We are now focusing our efforts on lowering linalg.matmul to ArmSME. This is expected to land soon (hopefully within a month, definitely before the end of the year).

On a more general note, the lowerings for ArmSME are now more progressive: we first lower to SME ops, and only then to intrinsics. Below is an example for vector.broadcast:

1. Input

%tile = vector.broadcast %val : i32 to vector<[4]x[4]xi32>

2. convert-vector-to-arm-sme

Tile operations are rewritten into loops over tile slices, using arm_sme ops rather than intrinsics. Tiles are allocated via arm_sme.get_tile_id and conversion casts.

%tile_id = arm_sme.get_tile_id : i32
%tile = arm_sme.cast_tile_to_vector %tile_id : i32 to vector<[4]x[4]xi32>
%broadcast_slice = vector.broadcast %val : i32 to vector<[4]xi32>
%vscale = vector.vscale
%svl_s = arith.muli %vscale, %c4 : index
// Note: the tile is updated in-place (rather than via scf.yield) to keep rewrites simpler.
// See:
scf.for %i = %c0 to %svl_s step %c1 {
  arm_sme.move_vector_to_tile_slice %broadcast_slice, %tile, %i : vector<[4]xi32> into vector<[4]x[4]xi32>
}
// => %tile

3. convert-vector-to-llvm="enable-arm-sme"

The ArmSME ops can then be mapped (near 1-to-1) to intrinsics on conversion to LLVM:

%tile_id = arm_sme.get_tile_id : i32
scf.for %i = %c0 to %svl_s step %c1 {
  "arm_sme.intr.write.horiz"(%tile_id, %i, %ptrue, %broadcast_slice) : (i32, i32, vector<[4]xi1>, vector<[4]xi32>) -> ()
}

From here, tiles can be allocated with -allocate-arm-sme-tiles; the rest is the standard LLVM lowering (which can be handled by -test-lower-to-llvm).

That’s just one example, but many more ops on the linalg → vector → arm_sme critical path are being worked on. These lowerings focus on sizes that exactly match the size of an SME tile, the idea being that, at a higher level, operations can be tiled to SME-compatible sizes.
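To make “SME-compatible sizes” concrete: for 32-bit elements an SME tile is a vector<[4]x[4]xf32> (scalable in both dimensions). As a sketch (value names are illustrative), a vector.outerproduct whose result is exactly that shape is a natural fit for the ArmSME lowerings:

// An outer product accumulating into a 32-bit SME-tile-sized result;
// this shape can map onto a single SME outer-product instruction.
%result = vector.outerproduct %lhs, %rhs, %acc
    : vector<[4]xf32>, vector<[4]xf32>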

Here’s an (incomplete) summary:

  • arith.constant, vector.broadcast, vector.splat
  • vector.transpose
    • This was added for functional correctness in PR 66760
    • Though we hope to find ways to avoid the transpose (e.g. by ensuring the inputs are transposed as a pre-processing step)
  • vector.transfer_read, vector.transfer_write
    • We’ve had basic support for these ops since early on.
    • More recently, Cullen has been looking into everything these ops can do (masking, transposition, padding, etc.) and working on enabling SME lowerings
      • Not all of this is merged yet, but a draft is up in PR 69148
  • vector.insert, vector.extract
    • Thanks to recent work by Diego D155034, these ops now accept SSA values as indices
    • This made lowering SME MOVA instructions fairly simple in PR 67786
  • vector.outerproduct
    • Initial support added in PR 65621
    • With support for masking now up for review PR 69604
  • vector.print
    • An incredibly helpful op when it comes to e2e tests - in PR 66691 it was given the ability to print SME tiles

Combined with the previous tile allocation, and recent work on ensuring vector types and their in-memory representations are legal (-arm-sve-legalize-vector-storage, PR 68794), we’re getting close to a functionally correct SME linalg.matmul lowering.

Other activities

We are almost ready to upstream our initial e2e example of scalable vectorisation of linalg.matmul targeting SVE.
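For reference, the starting point for such an example is the plain linalg.matmul on dynamically-shaped memrefs (a sketch; the function and value names are illustrative):

func.func @matmul(%A : memref<?x?xf32>, %B : memref<?x?xf32>,
                  %C : memref<?x?xf32>) {
  linalg.matmul ins(%A, %B : memref<?x?xf32>, memref<?x?xf32>)
                outs(%C : memref<?x?xf32>)
  return
}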

In addition, we have been extending IREE (GitHub - openxla/iree: A retargetable MLIR-based machine learning compiler and runtime toolkit.) to support scalable vectorisation and tiling.

Please note that scalable vectorisation remains work-in-progress (both in MLIR and IREE).

Next steps

This year we are focusing on functional correctness for linalg.matmul → ArmSME. We will continue making contributions to upstream MLIR and IREE so that soon others can start experimenting with this stack (today this is still very experimental).

We are also auditing various relevant bits of MLIR (mostly around the Vector dialect) to make sure that scalable vectors (and matrices) are well supported and thoroughly tested.

As for other ops, we will also be contributing the required lowerings for linalg convolutions, though in the initial iterations we might focus on streaming SVE (part of the SME extension).

Please reply in this thread if you’d like more details. Also, please join us for AArch64 Sync-up - we will be available there to share more updates.

Thanks for reading!

Post by: @MacDue, @banach-space


CC @frank_gao @bryanpkc @dcaballe @chrisj-quic