[RFC] Supporting Armv9 Scalable Matrix Extension (SME) Streaming SVE (SSVE) mode in MLIR

Hi,

I recently posted an RFC to the IREE project for supporting the Armv9 SME SSVE mode that’s relevant for SME support in MLIR.

To summarise, it proposes a pass in IREE, enabled when targeting SVE and SME, that adds the aarch64_pstate_sm_body attribute [1] to functions. The LLVM backend will emit smstart sm / smstop sm instructions that enable streaming mode in the prologue / epilogue of functions with this attribute.

The RFC also discusses another attribute, aarch64_pstate_sm_enabled, that was mentioned on the ArmSME dialect RFC [2]. The backend will emit smstart sm / smstop sm instructions around calls to functions with this attribute. The key difference between these attributes is that the former is internal to the function and doesn’t change the ABI, whereas the latter is part of the interface and does change the ABI.
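To make the difference concrete, here’s a rough sketch (not taken from the RFC patch; function names are illustrative) of how the two LLVM-level attributes could be attached in the MLIR LLVM dialect via the passthrough mechanism:

```mlir
// Internal attribute: streaming mode is enabled in the prologue and disabled
// in the epilogue of this function itself; callers are unaffected, so the
// ABI is unchanged.
llvm.func @dispatch_body() attributes {passthrough = ["aarch64_pstate_sm_body"]} {
  llvm.return
}

// Interface attribute: part of the ABI; the backend emits smstart sm /
// smstop sm around calls to this function from non-streaming callers.
llvm.func @streaming_callee() attributes {passthrough = ["aarch64_pstate_sm_enabled"]} {
  llvm.return
}
```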

The RFC proposed using the internal attribute in IREE, as the intention is to enable SSVE on dispatch functions that are created by the compiler and called by the runtime. If the compiler added the external attribute to dispatch functions, this would change the ABI, and the responsibility of managing PSTATE.SM on entry/exit to the dispatch function would fall on the runtime, which currently can’t support this.

Initially I planned to add the pass to MLIR and use it from IREE, but constraints in IREE led to using the internal attribute, and I wasn’t sure other users of MLIR would have the same constraints.

SSVE could be enabled in MLIR by adding a pass that initially adds the aarch64_pstate_sm_enabled attribute to all functions, and later selectively enables SSVE based on some heuristic.

I’d be happy to post a patch for this; it’s pretty lightweight. Thanks for reading, let us know if you have any questions or comments.

[1] Support for AArch64 Scalable Matrix Extension in LLVM — LLVM 17.0.0git documentation
[2] [RFC] Creating a ArmSME Dialect - #36 by c-rhodes

3 Likes

I am really surprised that you are introducing such a low-level dialect into MLIR. In the other RFC, there was talk about introducing an sme.funcOp. You could then add a lowering pass that lowers the SME funcOp to LLVM and adds the appropriate attributes.

You could also introduce sme.callOps to model the transitions between SME → SME and non-SME → SME calls.

This RFC does not propose any new dialects.

The approach outlined here is much simpler, yet powerful enough to cater for all cases that we’d like to support today. Once we identify cases that folks would like to support and for which this approach would be insufficient, we can discuss alternatives (including one involving a dedicated dialect for SME).

Also, to clarify, this has originally been posted as an RFC for IREE. However, we’d happily move it to MLIR if folks are supportive of the idea. Hence this RFC :slight_smile:

I had a chance to look at both posts in detail. Thanks a lot for the detailed explanation! I’ll try to answer the MLIR part here and the IREE-specific part in the IREE RFC. Sorry if these two discussion lines are confusing, but I think it’s important that we have a single abstraction for SSVE in MLIR, with IREE just being one of its multiple users.

Using attributes as a starting point looks good to me, especially if the goal is to have things working end-to-end, build expertise, and understand what kind of SSVE optimizations/transformations we want to implement in MLIR. Since the approach is similar to the one followed in LLVM and it’s low level, it’s probably a good idea to introduce these attributes at the end of the MLIR pipeline, assuming that a higher-level SSVE representation may exist in the future and lower to these attributes.

An sme.func also has its merits. I think it would allow us to model the different streaming semantics more precisely by introducing the proper constraints and verification. However, I understand the limitations of going there in the first step. I also see sme.func as part of a more end-to-end SME story that would also include the outlining part. For example, we could introduce sme.streaming { … } region ops to delimit operations within the same function that could be executed in streaming mode. We could have optimizations targeting sme.streaming ops (e.g., fuse them, minimize streaming start/stop points, etc.), eventually outline sme.streaming ops to individual sme.func ops, and finally lower those to llvm.func ops with the corresponding attributes. Not sure if this makes any sense, just thinking out loud, but I think it’s a scenario that would justify a more complex SSVE representation, in my opinion.
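As a purely hypothetical sketch of what such a region op could look like (none of these ops exist today, and all the names are made up for illustration), using MLIR’s generic op syntax:

```mlir
// Hypothetical sme.streaming region op delimiting code to be executed in
// streaming mode; a later pass could fuse adjacent regions or outline each
// one into a dedicated sme.func.
func.func @kernel(%a: vector<[4]xf32>, %b: vector<[4]xf32>) -> vector<[4]xf32> {
  %0 = "sme.streaming"() ({
    // Body runs with PSTATE.SM enabled; scalable vectors use the
    // streaming vector length.
    %sum = arith.addf %a, %b : vector<[4]xf32>
    "sme.yield"(%sum) : (vector<[4]xf32>) -> ()
  }) : () -> vector<[4]xf32>
  return %0 : vector<[4]xf32>
}
```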

Anyways, this looks great as a first step! Big +1 from me. Thanks for working on this!

1 Like

In the IREE RFC, you propose to use passthrough for the attributes. This is only a temporary solution. You have to define your own attributes and then you end up with an SME dialect.

See the discussion:
https://reviews.llvm.org/D149450

Yes, this captures the rationale behind this proposal very nicely, thank you!

Good catch, thanks! Yes, this is actually documented (my bad for not noticing earlier):

WARNING: this feature MUST NOT be used for any real workload. It is exclusively intended for quick prototyping. After that, attributes must be introduced as proper first-class concepts in the dialect.

This effort falls somewhere between “quick prototyping” and “real workload”. I wouldn’t want us to rush into creating “a dialect” for one attribute. Instead, I’d keep it as is for the time being and eventually move this to a more suitable location once one emerges.

Thank you,
Andrzej

This makes the RFC look even worse. Let’s abuse the passthrough attribute and we will never create an SME dialect which uses higher abstractions.

I had a chance to look at both posts in detail. Thanks a lot for the detailed explanation!

Thanks for the detailed comments!

Using attributes as a starting point looks good to me, especially if the goal is to have things working end-to-end, build expertise and understand what kind of SSME optimizations/transformations we want to implement in MLIR.

Yeah, this aligns with our thinking: this is a lightweight first step towards targeting streaming mode, and hopefully it will set us on the path to more powerful abstractions like the sme.func you mentioned.

The warning on the passthrough attribute clarifying it’s only for prototyping wasn’t added until last week (long after I started this work) and I wasn’t aware I was “abusing” it. Regardless, I would expect adding a proper attribute to be a minor implementation detail?

We never said an Arm SME dialect will never be created. As I clarified in the original RFC, full SME support is outside the scope of this RFC.

This sounds similar to GPU streaming. Could the gpu dialect actually be a device dialect and abstract away the GPU/accelerator offloading strategies? Having a more abstract offload dialect would definitely help with moving to heterogeneous targets (like a mix of CPU, GPU, and accelerator on the same machine).

1 Like

Indeed! (not my area of expertise, so take my comments with a pinch of salt)

We did think about this in the context of SME and it’s a bit tricky. SME comes with Streaming SVE (SSVE), and sometimes you may simply want to use SSVE so that you can e.g. leverage wider vectors (assuming that the SME element on your CPU has wider vectors than the host CPU). However, code-gen for SSVE will be very similar to that for regular SVE. Would this still count as a “heterogeneous” environment?

The point that I am trying to make is that indeed there are some similarities between SME/SSVE and offloading to GPUs, but there are also differences. So it feels a bit like a grey area. For this reason we are trying to focus on the bare minimum to have some environment for experimenting so that we can make more informed design decisions. But indeed, a dedicated device dialect could help with some potential challenges that we anticipate in the future.

I didn’t ask for a full-fledged SME dialect, but you have the option of a small SME dialect with MLIR attributes. Later on, you could extend it with funcOps and callOps.

That’s a good question.

Anything that needs “bundling” code into a kernel needs some form of offloading, and to make it more efficient, you combine the kernels into a stream, so that there’s less faffing in between calls (fewer register transfers, less memory movement). This is what, to me, constitutes a device.

The more complex CPU extensions get, the more device-like they are, and it’s no coincidence that the strategies once exclusive to external devices (PCI, co-proc) are now being used for them as well. Does this mean a CPU that has SME/SSVE/AMX is heterogeneous?

The way I see it, it depends on how you program it. If you use it as a SIMD extension and intermingle its code with scalar code, then no. If you bundle code into kernels and line them up as streams, then I’d say (a weak) “yes”.

Regardless, if there are similarities between offloading to SSVE and GPUs, then by all means, that’s a good reason to common up the infrastructure to describe them, and let the implementation handle the lowering.

Absolutely! I’m just throwing out an observation, not trying to change your current roadmap. This all falls well within the “exercise for the reader” category.

1 Like

I think this may be a good first step towards creating a foundation for the SME dialect. In fact, we could probably take this opportunity to refactor/combine the ArmSVE and ArmNeon dialects into something like ArmSIMD?

One question I have for this case is how would you distinguish between a function that is purely SSVE and a function that utilizes ZA? I was reading up on handling PSTATE.ZA, but the section seems a bit empty at the moment.

It’s not easy to know where the line is. llvm.noalias has been there for a while, and some production scenarios rely on it to enable LLVM optimizations and get more performance. I would classify the SSVE proposal as “quick prototyping” of the end-to-end SSVE story if the plan is to build just that. I don’t think the goal here is to use the attributes beyond just passing them from MLIR all the way to LLVM, which is what the passthrough mechanism seems to be helpful for, i.e., there won’t be passes reasoning about the attribute semantics or making IR changes based on them.

However,

this also looks like a good incremental approach. We can create the attributes within an SME dialect and attach them to a regular func.func op now. It shouldn’t be a lot of work. WDYT?
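For instance (the attribute name is hypothetical — no such attribute exists yet), a discrete attribute from an ArmSME dialect attached to a regular func.func might look like:

```mlir
// Hypothetical discrete attribute from an ArmSME dialect; a lowering pass
// would later convert it into the corresponding LLVM-level function
// attribute on the resulting llvm.func.
func.func @dispatch() attributes {arm_sme.streaming_body} {
  return
}
```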

That would be great as long as we can leave device-specific semantics out of it. I’m totally supportive of having a shared representation and then device-specific passes to transform it based on each device’s needs. Definitely a good long-term direction that will be worth exploring if the offloading mechanism is also needed for SVE at MLIR level.

Yes, I think the key part here is doing some sort of offloading to some sort of on-chip or off-chip co-processor, not so much about heterogeneity (whatever that means in these scenarios).

1 Like

Hey Frank!

So, firstly, from the RFC to the IREE project:

The focus of this RFC is SSVE only, not full SME support, but this is an important first step towards this.

But I guess we can take it one step further and discuss ZA as well :slight_smile: So, from the list of attributes available in LLVM:

we would look at:

  • aarch64_pstate_za_new
  • aarch64_pstate_za_shared
  • aarch64_pstate_za_preserved
  • aarch64_expanded_pstate_za

My suggestion would be to start with aarch64_pstate_za_new [1]. You can find the full details of this attribute in ACLE: Arm C Language Extensions.
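As a sketch of the shape this could take (using the same passthrough mechanism as for the streaming-mode attributes; the function name is illustrative):

```mlir
// Marks a function that creates a new ZA context on entry; the backend is
// then responsible for setting up and tearing down the ZA state around the
// function body.
llvm.func @za_kernel() attributes {passthrough = ["aarch64_pstate_za_new"]} {
  llvm.return
}
```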

HTH,
Andrzej

[1] In general, in the first iteration, I would avoid calls to “streaming” functions from within other “streaming” functions, just like you proposed in your RFC. This helps us simplify the problem space and focus on prototyping. But we should definitely be open to relaxing this later.

1 Like

I’m gravitating towards the attributes and pass being the basis for an ArmSME dialect. It’s now clear we shouldn’t be using the passthrough mechanism and should instead add proper attributes to the LLVM dialect. In IREE, the pass needs to operate on func.func ops rather than lower-level llvm.func ops, both to fit into codegen pipelines and to leverage those pipelines as a mechanism for heuristically enabling SSVE. I don’t know where this pass would live in MLIR, so what you’re suggesting makes sense to me.

Would the ArmSIMD dialect be for NEON/SVE/SME? I’ve read your RFC and the discussion on hw-specific dialects; this is intended as a lightweight first step towards SSVE support in MLIR and I’d like to avoid overcommitting.

1 Like

+1

I want to avoid folks thinking that the other long discussion on the SME dialect is basically being bypassed in this (still) relatively short thread. So, just to clarify:

  • we would be creating an ArmSME dialect specifically for these attributes,
  • for now, the dialect would contain only the attributes, as there doesn’t seem to be a more suitable place to keep them (unless folks have other suggestions),
  • this would be tangential to the proposal in [RFC] Creating a ArmSME Dialect (as in, we are yet to define and agree on other SME abstractions to include here).

Not against it, but I am a bit concerned about the scope creep :wink:

-Andrzej

1 Like

+1

I don’t think we should merge SVE and Neon. They are not only different ISAs but also use different vector technology. I wouldn’t mix fixed-length and scalable vector ops. It was also stated in the past that most of the ops in the ArmNeon dialect would go away in favor of using target independent ops in the Vector dialect.

1 Like