[RFC] Creating an ArmSME Dialect

Makes sense. Thanks for pointing this out!

[nit] Could you add syntax highlighting to your snippets?

This example is super helpful, thanks!

Hm, I don’t follow :slight_smile:

Ok, but how do you differentiate between za3d and za7d? Also:

The number of tiles and tile numbers will differ depending on the underlying element size. So arm_sme.acquire needs to know the element size:

%tile1 = arm_sme.acquire <elem-size>
%tile2 = arm_sme.acquire <other-elem-size>

How do you make sure that there’s no overlap between %tile1 and %tile2? Or, if there’s no way to “acquire” 2 independent tiles, which tile acquisition should be prioritized? My point being, this may still require something akin to a register allocator.

This is my ignorance kicking in - do you mean through MemoryEffects? I’m mostly curious whether MLIR already implements the necessary mechanism for this.

To avoid register allocation issues, could you model ZA as <vscale * vscale * 8 * 8 * hidden>? sme.acquire would return ZA and sme.release would release it, which models the uniqueness of ZA.

I tried ```mlir, is there a special way to add syntax highlighting on here? :sweat_smile: Nevermind, removing mlir worked out

Right, this got lost as I was changing things around in my example… Previously I was having acquire directly return a 2d vector, which would include the type. I can’t recall the exact requirements for the parser/printer, but yes, we would need to include some type info here.

Sorry, za3d should be 0x08, and za7d would be 0x80. This corresponds to the intrinsic definition in LLVM.
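As a rough illustration of this numbering: assuming each bit of the 8-bit mask stands for one 64-bit tile (so za3d is bit 3 and za7d is bit 7), and that a 32-bit tile N covers the 64-bit tiles N and N+4 (which is where masks such as 0x11 and 0x22 come from), the masks could be computed as in the sketch below. `tileMask` is a hypothetical helper for illustration, not something that exists in LLVM or MLIR.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM/MLIR API): returns the 8-bit mask of
// 64-bit tiles covered by tile `index` with `elemBits`-wide elements.
// Only the 64- and 32-bit cases discussed in this thread are handled.
inline uint8_t tileMask(unsigned elemBits, unsigned index) {
  if (elemBits == 64 && index < 8)
    return static_cast<uint8_t>(0x01u << index); // za0d .. za7d
  if (elemBits == 32 && index < 4)
    return static_cast<uint8_t>(0x11u << index); // 32-bit tiles 0 .. 3
  return 0; // unsupported combination in this sketch
}
```

With these assumptions, `tileMask(64, 3)` gives 0x08 and `tileMask(64, 7)` gives 0x80, matching the values above.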

The current idea is to simply iterate through the available tiles and check whether they’re already acquired, since each bit here corresponds to a 64-bit tile, and current SME instructions only really interact with 64- and 32-bit tiles:

for (uint8_t tile : b32Tiles) {
  // Skip candidates that overlap a tile already in use.
  if (tilesInUse & tile) continue;
  tilesInUse |= tile;
  return tile;
}
// special case: no free tile
return 0;

Same goes for release:

tilesInUse &= ~tile;
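Putting the acquire and release fragments together, a minimal self-contained sketch of this bookkeeping might look like the following. `TileAllocator` and its members are hypothetical names used only for illustration, and returning 0 for "no free tile" follows the special case above.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical bookkeeping for ZA tiles (illustration only).
// Each bit of `tilesInUse` stands for one 64-bit tile.
struct TileAllocator {
  uint8_t tilesInUse = 0;

  // Returns the mask of the first candidate that does not overlap any tile
  // already in use, or 0 if every candidate overlaps.
  uint8_t acquire(std::initializer_list<uint8_t> candidates) {
    for (uint8_t tile : candidates) {
      if (tilesInUse & tile)
        continue; // at least one 64-bit sub-tile is already taken
      tilesInUse |= tile;
      return tile;
    }
    return 0; // special case: nothing free
  }

  // Clears the bits of a previously acquired tile.
  void release(uint8_t tile) {
    tilesInUse &= static_cast<uint8_t>(~tile);
  }
};
```

For example, with `tilesInUse` at 0x01, `acquire({0x11, 0x22, 0x44, 0x88})` skips 0x11 because it overlaps 0x01 and returns 0x22, leaving `tilesInUse` at 0x23.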

Yes: MemoryEffects can be used here to model the semantics, I believe. (MemoryEffects is a subset of the “effects” concept, which is a bit more general but under-developed at the moment.)

@frank_gao , apologies for not getting back to this. ATM, we are focusing on streaming SVE [*], but I don’t have much to share just yet (will keep you in the loop).

Just as an FYI, there might be an opportunity to discuss this in person soon: https://discourse.llvm.org/t/eurollvm-2023-roundtable-targeting-cpus-from-ml-frameworks. The notes are always available online - just in case you are unable to attend.


[*] Streaming SVE is a subset of the SME ISA.

I’ve landed an integration test for Streaming SVE (SSVE) that demonstrates targeting SSVE from MLIR using the previously mentioned aarch64_pstate_sm_enabled attribute.


I just came up with another alternate concept for obtaining a tile register for SME… But first let me make a few assumptions clear, please let me know if these seem reasonable:

  1. We will assert that all SME functions in MLIR will be self contained - i.e. we will not allow for a function with the aarch64_pstate_sm_enabled attribute to call another function with the same attribute.
    a. This is because (from what I can tell) SME functions are generally used for the innermost kernels of computations and are mostly self-contained.
    b. This helps with the utilization of all available tiles without having to worry about the same tiles being used in a different scope… Also makes the “pseudo-RA” a lot easier (only have to consider the scope of a single function)

  2. No spilling of tile registers will be allowed, i.e. we will emit an error when we try to allocate more tiles than are available; this will happen during the LLVM IR translation. The lowering/conversion pass is responsible for generating code that uses an appropriate number of tiles.
    a. In theory the point is to generate high performance code, and we want to avoid spilling in the first place

Given all of that, I want to show an example of what I have in mind:

// Since tiles should be initialized with either a sme.zero or a load,
// we can allocate tiles upon those operations

// Maps to 0x01 for tile enumeration. tilesInUse = 0x01
%tile0 = arm_sme.load_tile %C[%i, %j], %hmask, %vmask
           : memref<?x?xf64>, vector<[2]xi1>, vector<[2x2]xf64>

// Tries to map to 0x11, fails because inUse = (tilesInUse & 0x11) = true
// Tries next tiles 0x22, succeeds.   tilesInUse = 0x23
%tile1 = arm_sme.zero : vector<[4x4]xf32>

// Try 0x01, 0x10, 0x02, 0x20... in sequence
// Gets 0x10.                         tilesInUse = 0x33
%tile2 = arm_sme.zero : vector<[2x2]xf64>

// ...

// Overwrites %tile0 - use of %tile0 will be invalid past this op? Perhaps we 
// may need to introduce a sme.copy which will need to allocate another tile?
%tile0_new = arm_sme.mopa %tile0, %lhs, %rhs, %hmask, %vmask 
            : vector<[2x2]xf64>, vector<[2]xf64>, vector<[2]xi1>

// Emit error? - reference to %tile0 after it has already been overwritten
%tile0_new_new = arm_sme.mopa %tile0, %lhs, %rhs, %hmask, %vmask
            : vector<[2x2]xf64>, vector<[2]xf64>, vector<[2]xi1>

// ...

// Deallocates 0x01. tilesInUse = 0x32
arm_sme.store_tile %C[%i, %j], %tile0, %hmask, %vmask
            : memref<?x?xf64>, vector<[2x2]xf64>, 
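One way to approach the "emit error?" question in the example would be to track which SSA value currently owns each tile, and to treat an in-place update such as mopa as transferring ownership to the result. The sketch below assumes exactly that model; `TileValueTracker` and its methods are hypothetical and not part of any existing dialect or pass.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical tracker (illustration only): maps the name of each SSA value
// that currently holds a tile to that tile's mask.
struct TileValueTracker {
  std::unordered_map<std::string, uint8_t> live;

  // Records that `value` was produced by an op that allocated `tile`.
  void define(const std::string &value, uint8_t tile) { live[value] = tile; }

  // Models an in-place update such as mopa: `result` takes over the tile of
  // `operand`, and `operand` stops being valid. Returns false when `operand`
  // no longer owns a tile, i.e. the case where an error would be emitted.
  bool overwrite(const std::string &operand, const std::string &result) {
    auto it = live.find(operand);
    if (it == live.end())
      return false;
    uint8_t tile = it->second;
    live.erase(it);
    live[result] = tile;
    return true;
  }
};
```

Under this model the first mopa on %tile0 succeeds and hands tile 0x01 to %tile0_new, while the second mopa on %tile0 is rejected because %tile0 is no longer live.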

A few things to note:

  • I have decided to ditch the vector dialect equivalent of load_tile, store_tile, mopa, and arguably zero.

    • This is because the vector dialect MaskableOpInterface currently only accepts one mask vector as an operand, and because mopa would require pattern matching a vector::OuterProductOp together with an arith::AddOp. I feel (for the time being) these operations are unique enough to justify their own ops, but this is still open for debate.
    • This avoids the need for sme.allocate, sme.bind, and maybe sme.release which are only there for lowering/analysis to begin with.
  • This implementation will impose some restrictions on how SME ops are generated, but will hopefully still retain some flexibility with some vector ops, such as storing the entire tile without predication, moving a row/column vector out of a tile, etc.

Thoughts, comments, suggestions?

Other folks should comment as well, but isn’t this gonna be overly limiting even there: I would think that the innermost kernels may be composed of a few calls to micro-kernel routines which may use the tile registers as well?

While I think this is true for high level languages like C++, I think for us inside of MLIR, we would be the ones generating these micro-kernels ourselves, which would correspond to these SME functions. However this is a gross generalization only for the use cases I have seen myself so far. If this turns out to be more of an issue, then I think the true solution to the allocation problem may have to lie within LLVM itself, with a proper RA.

I had the impression that on the contrary there was very active work to target micro-kernels from MLIR compilers directly. Are you familiar with TPP for example? https://arxiv.org/abs/2104.05755
This is a hot topic of development, for example it was the topic of yesterday’s community meeting in OpenXLA/IREE (recording/slides will come soon).

Regardless, I would argue that in the specific case of SME ops, being a hardware-specific dialect, it would either be already contained within said micro-kernels, or we would be calling micro-kernels, and (hopefully?) not both, at least for now, until an RA solution gets implemented in LLVM.

If the SME dialect introduces sme.funcOp to model SME functions and their attributes, you could also add some sme.callOps to model transitions between the different SME and non-SME modes.

Now that I’ve learned a bit more about the calling convention for SME, I think decorating a func.func with attribute { passthrough = [ "aarch64_pstate_sm_enabled" ] } should be good enough?

But then your customers must know the exact name of the attribute. sme.funcOp might be easier.

That makes sense.

Hi @frank_gao , thanks for the updates!

I am travelling and won’t be able to post until I am back (next week, either during or after EuroLLVM), but …

… just wanted to let you know that we are drafting an RFC specifically about that aspect of SME support. It’s almost ready - hope to publish it next week :crossed_fingers: (sorry about the delay). Once that’s available, your feedback will be greatly appreciated!


I said before that I like sme.funcOp over writing the annotations myself.

Could you think for a second, for the RFC, about adding new SME call ops? I could have an SME call op for going from the normal state to streaming SVE, or for calling an sme.funcOp from within an sme.funcOp. I believe there was a discussion about these state transitions.

I think in theory, a sme.call would not be necessary as long as functions themselves are declared as sme.func, and the convention would probably follow very closely to this article:

The only complication I see is how to determine if another function can be seen as streaming mode compatible. In this instance a sme.call may help, where if we are within a sme.func, a sme.call would be calling a function that is deemed compatible, whereas a func.call would call a non-compatible function and would require a mode change.
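To make the compatibility question a bit more concrete, here is a minimal sketch of the decision a lowering might have to make at each call site. It assumes three kinds of functions (non-streaming, streaming, and streaming-compatible); `StreamingKind` and `needsModeChange` are hypothetical names, and the handling of a streaming-compatible caller is a deliberately conservative guess rather than a statement of what any compiler does.

```cpp
// Hypothetical classification of functions by streaming-mode requirement.
enum class StreamingKind { NonStreaming, Streaming, StreamingCompatible };

// Hypothetical check (illustration only): does a call from `caller` to
// `callee` need a streaming-mode switch around it?
inline bool needsModeChange(StreamingKind caller, StreamingKind callee) {
  if (callee == StreamingKind::StreamingCompatible)
    return false; // callee runs in whatever mode the caller is already in
  if (caller == StreamingKind::StreamingCompatible)
    return true; // conservative: the caller's mode is only known at run time
  return caller != callee;
}
```

In these terms, a sme.call inside a sme.func would correspond to the cases where this returns false, and a func.call to a non-compatible function to the cases where it returns true.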

More info on the related work on IREE side right now (from Intel / @rengolin): https://groups.google.com/g/iree-discuss/c/SNkFQvtr0Uw/m/PkG4Ga1aBAAJ?utm_medium=email&utm_source=footer