RFC: AArch64 Linux Scalable Matrix Extension (SME) support for LLDB

Following in the footsteps of the Scalable Vector Extension (SVE) and its sequel (SVE2) comes the Scalable Matrix Extension (SME).

https://developer.arm.com/documentation/ddi0616/latest/

If you want to know what you can do with said scalable matrix in code, I am the worst person to ask :). I will however go through the new debug state and how I plan to implement it in LLDB for AArch64 Linux.

Understanding this document will require a bit of knowledge about the Scalable Vector Extension (SVE). If it’s new to you (shameless plug) skim the first few slides of my Linaro Connect talk (https://resources.linaro.org/en/resource/cuzjd4AvMSr2MKJty5n8q3) or read Arm’s announcement (The Arm Scalable Vector Extension - IEEE Micro - Research Articles - Research Collaboration and Enablement - Arm Community).

Streaming SVE Mode

This is a new mode where the same SVE register set is used but certain behaviours change and some instructions are made invalid. My understanding is that in this mode the CPU makes assumptions about your code and if you adhere to those, you can get some performance improvement.

From here on I will refer to the original SVE mode as “non-streaming”.

In detail, the existing SVE registers, Z (data registers) and P (predicate registers), are shared between the two modes. However, each mode has its own vector length (VL), and both VLs can be read from either mode. Therefore, I will reuse the existing SVE state in LLDB for the streaming mode registers. All that changes is that when we stop, we check whether streaming mode is active and, if it is, read the registers from the streaming register set instead.

How Do We Know Streaming Mode is Active?

There is a new ptrace regset, NT_ARM_SSVE. When read on an SME-enabled system, it will always contain a header and, if streaming mode is active, the register data as well. As with NT_ARM_SVE, there is a header flag to tell you if the SVE registers are active.

If this is set in the NT_ARM_SSVE header, we’re in streaming mode. If it’s not set, then we go back to the NT_ARM_SVE set and see if we’re in SVE or SIMD mode.
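As a rough sketch of that decision (not LLDB's actual code), the check could look like the Python below. It assumes the header layout of struct user_sve_header and the SVE_PT_REGS_SVE flag value from the Linux uapi headers; the function names are made up for illustration, and the header bytes are assumed to have already been fetched with ptrace.

```python
import struct

# Assumed layout of struct user_sve_header (Linux uapi):
# u32 size, u32 max_size, u16 vl, u16 max_vl, u16 flags, u16 reserved.
SVE_HEADER = struct.Struct("<IIHHHH")

# Flag bit 0 set means the regset payload is in SVE format (registers active).
SVE_PT_REGS_SVE = 1


def parse_header(data):
    """Decode the fields of a regset header that this sketch cares about."""
    size, _max_size, vl, _max_vl, flags, _reserved = SVE_HEADER.unpack_from(data)
    return {"size": size, "vl": vl, "flags": flags}


def current_mode(ssve_header_bytes, sve_header_bytes):
    """Hypothetical helper: pick the mode from the two regset headers."""
    ssve = parse_header(ssve_header_bytes)
    sve = parse_header(sve_header_bytes)
    if ssve["flags"] & SVE_PT_REGS_SVE:
        return "streaming"      # NT_ARM_SSVE says its registers are active
    if sve["flags"] & SVE_PT_REGS_SVE:
        return "non-streaming"  # fall back to NT_ARM_SVE
    return "simd"               # neither is active, only FP/SIMD state
```

The streaming check comes first because, as described above, NT_ARM_SVE is only consulted once NT_ARM_SSVE reports that it is inactive.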

This has some consequences.

  1. LLDB will not let you write to the Z or P registers of the inactive mode.

The architecture does not provide new names for streaming mode P and Z registers (in real hardware I assume they may actually be the same). So there is no good format to name these “new” registers. For example we could end up with z0, streaming_z0 and non_streaming_z0. It’s confusing. GDB has also decided not to make up new names for this.

The other reason to not make up new names is that writing to the inactive mode via ptrace will put you into that mode, with all the other registers zeroed or undefined. Even if you could for example, be in non-streaming mode and write to streaming Z0, when you switched into streaming mode it would have some undefined value. Essentially, it’s very unlikely you’d want to switch modes this way.

  2. The vector granule register “vg” (from which you can derive the vector length) will always reflect the current mode.

I have a plan for this. Since the vector lengths are distinct and are found in the ptrace header not the register data, we can read the inactive mode’s vector length. So I will add a pseudo register “svg” to complement the existing “vg”. This register will always return the streaming vector length.

The table below shows the combinations of modes and registers with my proposed scheme:

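| Mode          | vg               | svg          | Z<N>               | P<N>               |
|---------------|------------------|--------------|--------------------|--------------------|
| Non-streaming | Non-streaming VL | Streaming VL | Non-streaming Z<N> | Non-streaming P<N> |
| Streaming     | Streaming VL     | Streaming VL | Streaming Z<N>     | Streaming P<N>     |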

The obvious gap is reading non-streaming VL while in streaming mode. However, fixing that is a bit awkward naming wise (non_streaming_vg?) so while it’s not that hard to do, I’ll wait to see if anyone asks for it.
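To make the vg and svg scheme concrete, here is a small sketch. It assumes the vector lengths are taken in bytes from the two ptrace headers and that vg counts 64-bit granules (so vg is the length in bytes divided by 8). The function names are hypothetical.

```python
def vg_from_vl(vl_bytes):
    """Convert a vector length in bytes to a count of 64-bit granules."""
    return vl_bytes // 8


def read_vg_registers(streaming, sve_vl_bytes, ssve_vl_bytes):
    """Hypothetical helper: values for "vg" and the proposed "svg".

    "vg" follows whichever mode is active, "svg" is always the
    streaming vector length.
    """
    active_vl = ssve_vl_bytes if streaming else sve_vl_bytes
    return {"vg": vg_from_vl(active_vl), "svg": vg_from_vl(ssve_vl_bytes)}
```

For example, with a 16 byte non-streaming VL and a 32 byte streaming VL, vg would read as 2 in non-streaming mode and 4 in streaming mode, while svg would read as 4 in both.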

How Does An LLDB User Know What Mode They’re In?

One way would be to look at the Z or P register size. However, on an unfamiliar system you wouldn’t know the expected size. Instead, there is a “streaming vector control register” (SVCR) added in SME. It contains the current streaming mode and whether the array storage is enabled (ignore the latter for now).

This register is readable from EL0 but that would require us to inject code into the debuggee (streaming mode is per process). Instead I will emulate the contents using the ptrace data. For the streaming mode state LLDB will simply check if reading NT_ARM_SSVE returns a header with the SVE active flag set.

The final pseudo register will exactly match the architectural register. We can even emulate writing to it, using the right ptrace operations.
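A minimal sketch of building that pseudo register value is below. It relies on the architectural layout of SVCR, where bit 0 is SM (streaming mode) and bit 1 is ZA (array storage enabled). The function name and the two boolean inputs are assumptions for illustration; how each boolean is derived from ptrace is covered elsewhere in this post.

```python
SVCR_SM = 1 << 0  # streaming mode active
SVCR_ZA = 1 << 1  # ZA array storage enabled


def emulate_svcr(streaming_active, za_active):
    """Hypothetical helper: assemble an SVCR value from ptrace-derived state."""
    value = 0
    if streaming_active:
        value |= SVCR_SM
    if za_active:
        value |= SVCR_ZA
    return value
```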

Array Storage

The Matrix part of the Scalable Matrix Extension is the array storage (ZA). In some sense this is one giant register. Where SVE provides registers that are one vector length in size, ZA is a square of vector length x vector length in size.

This can be sliced up in many different ways and I will likely represent this by naming each row of this matrix as its own register.
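As an illustration of the row-per-register idea, the sketch below splits a ZA buffer into rows. It assumes ZA is a square of streaming vector length by streaming vector length bytes, stored row by row; the function name is hypothetical and is not a proposed register naming scheme.

```python
def split_za_rows(za_bytes, svl_bytes):
    """Hypothetical helper: split the ZA storage into one chunk per row."""
    if len(za_bytes) != svl_bytes * svl_bytes:
        raise ValueError("ZA must be svl_bytes x svl_bytes in size")
    return [
        za_bytes[row * svl_bytes:(row + 1) * svl_bytes]
        for row in range(svl_bytes)
    ]
```

With a 16 byte streaming vector length this gives 16 rows of 16 bytes each, and each row could then be presented as its own register.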

There are many more ways to refer to different “tiles” and chunks of this matrix. So I may need to improve LLDB’s support for register aliases and sub-registers so users can copy and paste assembly register names into LLDB without modification.

(for example, as far as I know you can’t have a pseudo/alias that combines parts of other registers)

This ZA is scalable and can be disabled independently of the streaming mode. Therefore on every stop we will be checking whether it is enabled and what its size is. The same machinery as SVE will be used.

Thread Local Storage

SME adds a second TLS register tpidr2 that is reserved for use by the ABI for SME handling. Debuggers don’t have to do much more than show this (though when we know what the ABI format is, we could do some interesting things).

LLDB doesn’t even show the first TLS register tpidr, so I have a patch up already to fix that first (https://reviews.llvm.org/D152516).

As SME is an optional extension the NT_ARM_TLS set will change size when SME is enabled and therefore this will use a dynamic register set in LLDB.
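A small sketch of what handling the variable size could look like is below. It assumes the regset is a sequence of 64-bit little-endian values with tpidr first and, when SME is present, tpidr2 second; the function name is made up for illustration.

```python
import struct

TLS_REGISTER_NAMES = ["tpidr", "tpidr2"]  # assumed order within the regset


def parse_tls_regset(data):
    """Hypothetical helper: decode NT_ARM_TLS data of either size."""
    count = len(data) // 8
    if count < 1 or count > len(TLS_REGISTER_NAMES) or len(data) % 8:
        raise ValueError("unexpected NT_ARM_TLS size")
    values = struct.unpack_from("<%dQ" % count, data)
    return dict(zip(TLS_REGISTER_NAMES, values))
```

An 8 byte read would yield only tpidr, while a 16 byte read would yield both registers, which is why the register set has to be sized at runtime.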

Core Files

Nothing unique to report here. All the mentioned state has equivalent notes and scalability will be handled in the same way SVE is already handled.

Backtrace, Signals, Expression Evaluation

These are all areas where we will need to pay attention to the current streaming mode and vector length. The details are mostly the same as what was done for SVE.

The only difference is that there is currently no DWARF marker to tell us what mode we are in during a backtrace. However this area is still being worked on so we’ll see if one does appear.

Testing

As for other early architecture support the platform will be QEMU (Testing LLDB using QEMU — The LLDB Debugger) which already supports SME. Patches will come with tests but they will require a QEMU setup to run them.

There is no hardware available that supports SME at this time. For reference, SVE has only been readily available (for some definition of readily) in the last couple of years.

Linaro runs LLDB buildbots for AArch64 but these do not cover the following architecture extensions at this time:

  • Memory Tagging
  • Pointer Authentication
  • Scalable Vector
  • Scalable Vector 2

We have considered setting up a QEMU all features enabled buildbot but it will not happen in the short term.

I hope that we can move forward adding early architecture support in a way that is:

  • Tested in a generic manner as much as possible.
  • Written defensively so as to not disrupt support for the “standard” hardware of the time.

And of course my Linaro colleagues and I will take responsibility for fixing issues with this code if they do not show up under normal testing circumstances.

Tile Format Types

I mentioned earlier that the ZA array can be addressed in tiles of various sizes. We already have some “vector” types for printing values in LLDB.

(lldb) register read z0 -f uint64_t[]
z0 = {0x2f2f2f2f2f2f2f2f 0x2f2f2f2f2f2f2f2f 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000}

OK, more array than vector, but 1D types if you will. I am considering whether adding 2 dimensional types here would be useful, uint64_t[][] for example.

We’d have to assume that these were “square” types, and that assumption may not even apply to everything SME can do. I need to write some SME code to get a feel for this and talk to some developers, so I’ll come back with specifics if I think this is warranted after all.
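To give a feel for what a “square” 2D type might do, here is a rough sketch that reshapes a flat buffer into rows of 64-bit values. It assumes the tile data has already been gathered into one contiguous little-endian buffer in row order, which may not match how tiles are actually laid out within ZA. The function name is hypothetical and this is not an existing LLDB format.

```python
import math
import struct


def as_square_uint64(data):
    """Hypothetical helper: view a flat buffer as a square of uint64 values."""
    if len(data) % 8:
        raise ValueError("size is not a multiple of 8 bytes")
    count = len(data) // 8
    side = math.isqrt(count)
    if side * side != count:
        raise ValueError("element count is not a perfect square")
    values = struct.unpack("<%dQ" % count, data)
    return [list(values[r * side:(r + 1) * side]) for r in range(side)]
```

The perfect square check is exactly the assumption in question: anything that is not square would need a different kind of type.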

Scalable Matrix Extension 2

Just because you might have seen the announcement: Arm A-Profile Architecture Developments 2022

I can’t go into details of SME2’s debug state at this time but I am planning to work on it right after SME. I don’t expect it to be very disruptive when compared to SME.

Conclusion

The changes for SME are pretty straightforward given that SVE support already exists. By reading this you should have some background so you can review the patches if you wish to do so.

My Linaro colleague @omjavaid, the author of the existing SVE support, will handle reviewing the architectural details.

As always any other feedback and questions are welcome.
