SME in MLIR status (20/10/2023)

Previous update:

Current state

Following the groundwork to represent SME in MLIR and implement tile allocation (D154955), we have been making steady progress on the lowering from Linalg, via Vector, to LLVM.

Today you can already lower basic examples using linalg.fill to SME and run the resulting executables (through an emulator).
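For illustration, here is a minimal sketch of the kind of input that lowers end-to-end (the in-tree integration tests may differ in details such as names and shapes):

func.func @fill(%out: memref<?x?xi32>) {
  %val = arith.constant 7 : i32
  linalg.fill ins(%val : i32) outs(%out : memref<?x?xi32>)
  return
}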

We are now focusing our efforts on lowering linalg.matmul to ArmSME. This is expected to land soon (hopefully within a month, definitely before the end of the year).

On a more general note, the ArmSME lowerings are now more progressive: we first lower to SME ops, and only then to intrinsics. Below is an example for vector.broadcast:

1. Input

%tile = vector.broadcast %val : i32 to vector<[4]x[4]xi32>

2. convert-vector-to-arm-sme

Tile operations are rewritten into loops over tile slices, using arm_sme ops rather than intrinsics. Tiles are allocated via arm_sme.get_tile_id and conversion casts.

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%tile_id = arm_sme.get_tile_id : i32
%tile = arm_sme.cast_tile_to_vector %tile_id : i32 to vector<[4]x[4]xi32>
%broadcast_slice = vector.broadcast %val : i32 to vector<[4]xi32>
// SVL.S: the number of 32-bit elements in a streaming vector (vscale x 4).
%vscale = vector.vscale
%svl_s = arith.muli %vscale, %c4 : index
// Note: the tile is updated in-place (rather than via scf.yield) to keep rewrites simpler.
// See: https://discourse.llvm.org/t/loop-materialization-in-armsme/72354
scf.for %i = %c0 to %svl_s step %c1 {
  arm_sme.move_vector_to_tile_slice %broadcast_slice, %tile, %i : vector<[4]xi32> into vector<[4]x[4]xi32>
}
// => %tile

3. convert-vector-to-llvm="enable-arm-sme"

The ArmSME ops can then be mapped (near 1-to-1) to intrinsics on conversion to LLVM:

%tile_id = arm_sme.get_tile_id : i32
...
scf.for %i = %c0 to %svl_s step %c1 {
  "arm_sme.intr.write.horiz"(%tile_id, %i, %ptrue, %broadcast_slice) : (i32, i32, vector<[4]xi1>, vector<[4]xi32>) -> ()
}

=> From here, tiles can be allocated with -allocate-arm-sme-tiles; the rest is the standard LLVM lowerings (which can be handled by -test-lower-to-llvm).
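Putting the steps together, the pipeline looks roughly like this (a sketch based on the passes mentioned above; exact flags and ordering may differ from the in-tree tests):

mlir-opt broadcast.mlir \
  -convert-vector-to-arm-sme \
  -convert-vector-to-llvm="enable-arm-sme" \
  -allocate-arm-sme-tiles \
  -test-lower-to-llvm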


That’s just one example, but many more ops on the linalg → vector → arm_sme critical path are being worked on. These lowerings focus on sizes that exactly match the size of an SME tile, with the idea that, at a higher level, operations can be tiled with SME-compatible sizes.

Here’s an (incomplete) summary:

  • arith.constant, vector.broadcast, vector.splat
  • vector.transpose
    • This was added for functional correctness in PR 66760
    • Though we hope to find ways to avoid the transpose (e.g. by ensuring the inputs are transposed as a pre-processing step)
  • vector.transfer_read, vector.transfer_write
    • We’ve had basic support for these ops since early on…
    • Recently Cullen has been looking into everything these ops can do (masking, transposition, padding, etc.) and working on enabling SME lowerings
      • Not all of this is merged yet, but a draft PR is up: PR 69148
  • vector.insert, vector.extract
    • Thanks to recent work by Diego (D155034), these ops now accept SSA values as indices
    • This made lowering SME MOVA instructions fairly simple in PR 67786
  • vector.outerproduct
    • Initial support added in PR 65621
    • With support for masking now up for review in PR 69604; see the sketch after this list
  • vector.print
    • An incredibly helpful op when it comes to e2e tests; in PR 66691 it was given the ability to print SME tiles
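To illustrate where these ops fit, here is a rough sketch (not the exact upstream lowering) of a matmul inner loop at SME tile granularity, accumulating into a tile via vector.outerproduct. The names %lhsT, %rhs, and %K are made up for illustration, and the LHS is assumed to have been transposed as a pre-processing step:

%pad = arith.constant 0.0 : f32
%zero = arith.constant dense<0.0> : vector<[4]x[4]xf32>
%res = scf.for %k = %c0 to %K step %c1 iter_args(%acc = %zero) -> (vector<[4]x[4]xf32>) {
  // Column k of the LHS (a contiguous read, thanks to the pre-transposition).
  %a = vector.transfer_read %lhsT[%k, %c0], %pad {in_bounds = [true]} : memref<?x?xf32>, vector<[4]xf32>
  // Row k of the RHS.
  %b = vector.transfer_read %rhs[%k, %c0], %pad {in_bounds = [true]} : memref<?x?xf32>, vector<[4]xf32>
  // Rank-1 update of the accumulator tile.
  %acc_next = vector.outerproduct %a, %b, %acc : vector<[4]xf32>, vector<[4]xf32>
  scf.yield %acc_next : vector<[4]x[4]xf32>
}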

Combined with the tile allocation mentioned above, and recent work on ensuring that vector types and their in-memory representations are legal (-arm-sve-legalize-vector-storage, PR 68794), we’re getting close to a functionally correct SME lowering for linalg.matmul.

Other activities

We are almost ready to upstream our initial example of an e2e test for scalable vectorisation of linalg.matmul targeting SVE.
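For reference, the kernel being vectorised is plain linalg.matmul (the function name and shapes here are illustrative):

func.func @matmul(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.matmul ins(%A, %B : memref<?x?xf32>, memref<?x?xf32>)
                outs(%C : memref<?x?xf32>)
  return
}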

In addition, we have been extending IREE (GitHub - openxla/iree) to support scalable vectorisation and tiling.

Please note that scalable vectorisation remains work-in-progress (both in MLIR and IREE).

Next steps

This year we are focusing on functional correctness for linalg.matmul → ArmSME. We will continue making contributions to upstream MLIR and IREE so that soon others can start experimenting with this stack (today this is still very experimental).

We are also auditing various relevant bits of MLIR (mostly around the Vector dialect) to make sure that scalable vectors (and matrices) are well supported and thoroughly tested.

As for other ops, we will also be contributing the required lowerings for linalg convolutions, though we might focus on streaming SVE (part of the SME extension) in the initial iterations.

Please reply in this thread if you’d like more details. Also, please join us for the AArch64 Sync-up; we will be there to share more updates.


Thanks for reading!

Post by: @MacDue, @banach-space


CC @frank_gao @bryanpkc @dcaballe @chrisj-quic