[RFC] Array Register Files

Hi all,

There’s a rather major piece of work that’s been in the back of my mind
for a while now. Before I actually start any work on it, I’d like to
hear people’s opinions, if any.

tl;dr: I’d like to augment the CodeGen physical register / register
class infrastructure with a mechanism to represent large regular
register files more efficiently.

The motivation is that the existing infrastructure really isn’t a good
fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector
registers. In addition to the sheer number of registers, there are some
qualitative factors that set us apart from most (all?) other backends:

  1. The order of registers matters: if we use only a subset of registers,
    and that subset is on the low end of the range, we can run more work in
    parallel on the hardware. (This is modeled with regalloc priorities today.)

  2. We can indirectly index into both register files. If a function has
    an alloca’d array of 17 values, we may want to lower that as 17
    consecutive registers and access the array using indirect access
    instructions.

  3. Even this aside, the number of register classes we’d really need is
    quite staggering. We have machine instructions taking operands with
    anywhere from 1 to 12 consecutive registers.

Modeling this as register classes with sub-registers is clearly not a
good match semantically, let alone from a compile time performance point
of view. Today, we take effectively a runtime performance hit in some
cases due to not properly modeling the properties of our register files.

What I'd like to have
---------------------

I’d like to introduce the notion of array register files. Physical
registers in an ARF would be described by

  • the register array ID
  • the starting index of the “register”
  • the ending index of the “register” (inclusive)

This can be conveniently encoded in the 31 bits we effectively have
available for physical registers (e.g. 13 bits for start / end, 5 bits
for register array ID).
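
To make the proposed encoding concrete, here is a minimal C++ sketch
under the 5/13/13 bit split mentioned above. None of these names exist
in LLVM today; they are purely illustrative:

#include <cassert>
#include <cstdint>

// Hypothetical packing of an ARF physical register into 31 bits:
// 5 bits of register array ID, then 13 bits of end index and 13 bits
// of start index (both inclusive).
struct ArfReg {
  unsigned ArrayID; // 0 = traditional TableGen-defined registers
  unsigned Start;   // first array index covered, inclusive
  unsigned End;     // last array index covered, inclusive
};

inline uint32_t encodeArfReg(ArfReg R) {
  assert(R.ArrayID < (1u << 5) && R.End < (1u << 13) && R.Start <= R.End);
  return (uint32_t(R.ArrayID) << 26) | (uint32_t(R.End) << 13) | R.Start;
}

inline ArfReg decodeArfReg(uint32_t V) {
  return {(V >> 26) & 0x1f, V & 0x1fff, (V >> 13) & 0x1fff};
}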

Register array ID 0 would be reserved for the familiar, TableGen-defined
registers (we would still have some of those in AMDGPU, even with this
change).

It would necessarily have to be possible to generate register classes
for ARFs both in TableGen and on-the-fly while compiling. Base register
classes would be defined by:

  • the register array ID
  • the minimum start / maximum end index of registers in the class
  • the size of registers in the class
  • the alignment of registers in the class (i.e., registers must start at
    multiples of N, where N is a power of two)

… and then register classes might be the union of such register
classes and traditional register classes.

(For example, in AMDGPU we would have a register class that includes all
registers from the scalar array, with size 2, starting at an even offset,
union’d with a class containing some special registers such as VCC.)
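
As a rough sketch of how such a base class could be queried, reusing
the hypothetical ArfReg above (the array ID and bounds in the AMDGPU
example below are assumptions on my part):

// Hypothetical base register class over a single register array.
struct ArfRegClass {
  unsigned ArrayID;  // which register array the class draws from
  unsigned MinStart; // lowest admissible start index
  unsigned MaxEnd;   // highest admissible end index, inclusive
  unsigned Size;     // consecutive array elements per "register"
  unsigned Align;    // starts must be multiples of this power of two

  bool contains(ArfReg R) const {
    return R.ArrayID == ArrayID && R.Start >= MinStart &&
           R.End <= MaxEnd && R.End - R.Start + 1 == Size &&
           R.Start % Align == 0;
  }
};

// Even-aligned pairs from the scalar array, as in the example above;
// a full register class would be this base class union'd with an
// explicit list holding VCC etc.
const ArfRegClass SGPRPair = {/*ArrayID=*/1, /*MinStart=*/0,
                              /*MaxEnd=*/103, /*Size=*/2, /*Align=*/2};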

A similar scheme would have to be used for sub-register indices.

I haven’t dug too deeply into this yet, and clearly there are quite a
number of thorny issues that need to be addressed – it’s a rather big
project. But so far I’m not aware of anything absolutely fundamental
that would prevent doing this.

What I'm asking you at this point
---------------------------------

Like I said, I haven’t actually started any of this work (and probably
won’t for some time).

However, if you have any fundamental objections to such a change, please
speak up now before I or anybody else embarks on this project. I want to
be confident that people are okay with the general direction.

Also, if you have any advice or suggestions (maybe an alternative that
would also fit the requirements of the AMDGPU backend), I’d be happy to
hear about it!

This sounds like it would also be useful for allocating the consecutive registers needed to implement vectors in SimpleV, a parallelism extension to RISC-V.

Jacob Lifshay

[reordering a bit]

This sounds like it would also be useful for allocating the consecutive
registers needed to implement vectors in SimpleV, a parallelism
extension to RISC-V.

yes it does. comments inline, below

The motivation is that the existing infrastructure really isn't a good
fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector
registers.

256 vector registers is the number that RISC-V RVV future extensions
plan to have. i think the RVV working group would really like to know
of any plans in this area.

1. The order of registers matters: if we use only a subset of registers,
and that subset is on the low end of the range, we can run more work in
parallel on the hardware. (This is modeled with regalloc priorities today.)

2. We can indirectly index into both register files. If a function has
an alloca'd array of 17 values, we may want to lower that as 17
consecutive registers and access the array using indirect access
instructions.

SV (the custom extension jacob kindly mentioned) also has register
redirection. it's slightly different in that it's completely
transparent: all *SCALAR* operations, if a register is marked in the
redirection table, will (a) be redirected through a key-value table
from 5 bits (the standard RISC-V register file size) to 6 bits and (b)
checked to see if it's a vector.

3. Even this aside, the number of register classes we'd really need is
quite staggering. We have machine instructions taking operands with
anywhere from 1 to 12 consecutive registers.

is that length part of the instruction or is it set up by way of an
indirect status register of some kind?

What I'd like to have
---------------------
I'd like to introduce the notion of array register files. Physical
registers in an ARF would be described by

- the register array ID
- the starting index of the "register"
- the ending index of the "register" (inclusive)

let me think that through from the SV extension perspective. with
SV, i am following the example of RVV. there is one "global" status
register (example assembly code can be seen in "SIMD Instructions
Considered Harmful", SIGARCH). i did
give serious consideration to having a per-register length... am still
mulling that one over as it takes up a lot of extra CSR space.

so if such a notion existed, then the SV-LLVM back-end would need to
look at the table, go "hmm, this ARF entry requires a length of [e.g.]
6, the current global length is 4, let's push that on the stack and
change the global". which would be perfectly manageable.

so yes, as long as those entries in the table cover contiguous
registers, _great_.

also, one thing: do you envisage the ARF table covering *overlapping*
registers? the reason i ask is because SV doesn't care if they are
(i.e. it's the compiler's problem to sort out).

i.e.:
* ARF1 = { id: 0, start=r5, end=r9}
* ARF2 = { id: 1, start=r6, end=r10}

the reason i ask is because on the isa-dev list jacob proposed an
algorithm that could actually be implemented by issuing a parallel-add
involving two (equal length) overlapping ARF entries: add ARF1, ARF2.

This can be conveniently encoded in the 31 bits we effectively have
available for physical registers (e.g. 13 bits for start / end, 5 bits
for register array ID).

cool. SV is 6 bits for start / end, 6 bits for length.

Register array ID 0 would be reserved for the familiar, TableGen-defined
registers (we would still have some of those in AMDGPU, even with this
change).

i'm not familiar with what tablegen is, so can't comment.

It would necessarily have to be possible to generate register classes
for ARFs both in TableGen and on-the-fly while compiling. Base register
classes would be defined by:

- the register array ID
- the minimum start / maximum end index of registers in the class
- the size of registers in the class
- the alignment of registers in the class (i.e., registers must start at
multiples of N, where N is a power of two)

ok so you lost me a little, here. so are the 256 vector registers
sub-dividable into FP16/32/64 for example? and, if so, would it be a
good idea to have the size be part of the ARF table? (id, start, end,
bitwidth)?

or, are you talking about having *another* structure that has that
bitwidth? i'm slightly confused as to what the difference between the
ARF start/end and the above-mentioned "minimum start / maximum end
index of registers in the class" would be.

... and then register classes might be the union of such register
classes and traditional register classes.

that sounds scary.

(For example, in AMDGPU we would have a register class that includes all
registers from the scalar array, with size 2, starting at an even offset,
union'd with a class containing some special registers such as VCC.)

oh right! yes, i get it. no, i like it. in SV there is an
additional entry per register which specifies the bitwidth: default,
half, doubled, and 8-bit. requires only 2 bits to express.

with the union system that you propose, the different possible
bitwidths, even though the same underlying registers are used, could
be represented.

A similar scheme would have to be used for sub-register indices.

I haven't dug too deeply into this yet, and clearly there are quite a
number of thorny issues that need to be addressed -- it's a rather big
project.

yes :)

But so far I'm not aware of anything absolutely fundamental
that would prevent doing this.

What I'm asking you at this point
---------------------------------
Like I said, I haven't actually started any of this work (and probably
won't for some time).

However, if you have any fundamental objections to such a change, please
speak up now before I or anybody else embarks on this project. I want to
be confident that people are okay with the general direction.

yeah, it's quite timely.

Also, if you have any advice or suggestions (maybe an alternative that
would also fit the requirements of the AMDGPU backend), I'd be happy to
hear about it!

a couple of questions: i don't know enough about AMDGPU to be able to
say technically if the approach will work specifically for that
architecture; i can however raise some questions about the general
approach.

(1) does the AMDGPU have the concept of predication, and if so has a
data structure been considered to cover it?

(2) what's the difference between the start/end indices of the ARF
and the "class"?

(3) what's the reason for putting the bitwidth (and alignment) into
the "class", i.e. why have the "class" rather than just have one data
structure containing everything: the ARF?

l.

[reordering a bit]

This sounds like it would also be useful for allocating the consecutive
registers needed to implement vectors in SimpleV, a parallelism
extension to RISC-V.

  yes it does. comments inline, below

The motivation is that the existing infrastructure really isn't a good
fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector
registers.

  256 vector registers is the number that RISC-V RVV future extensions
plan to have. i think the RVV working group would really like to know
of any plans in this area.

1. The order of registers matters: if we use only a subset of registers,
and that subset is on the low end of the range, we can run more work in
parallel on the hardware. (This is modeled with regalloc priorities today.)

2. We can indirectly index into both register files. If a function has
an alloca'd array of 17 values, we may want to lower that as 17
consecutive registers and access the array using indirect access
instructions.

  SV (the custom extension jacob kindly mentioned) also has register
redirection. it's slightly different in that it's completely
transparent: all *SCALAR* operations, if a register is marked in the
redirection table, will (a) be redirected through a key-value table
from 5 bits (the standard RISC-V register file size) to 6 bits and (b)
checked to see if it's a vector.

I've just skimmed what's in SV, and it seems a bit different from what AMDGPU has, although perhaps it could be put to the same use. There are definitely similarities.

AMDGPU doesn't have any of the fancy unrolling of instructions; there's simply a mode in which an offset from a special register is added to certain register references. E.g., if an instruction references vector register v7 and the special register contains 13, it'll reference v20 instead. (The exact details of how to enable this mode differ between hardware generations.)

3. Even this aside, the number of register classes we'd really need is
quite staggering. We have machine instructions taking operands with
anywhere from 1 to 12 consecutive registers.

  is that length part of the instruction or is it set up by way of an
indirect status register of some kind?

It's complicated :)

As far as LLVM is concerned, it's always part of the MachineInstr to keep things sane.

In hardware, the things that use up to 12 consecutive registers are texture sampling instructions. The binary instruction encoding only contains the type of sampling (e.g., whether user-specified gradients are used), but not the dimensionality of the texture (1D vs. 2D vs. 3D), which also affects the number of source registers. The dimensionality of the texture is determined by its image descriptor, which is a MachineOperand (256-bit scalar register, which is in fact just a sequence of 8 consecutive scalar 32-bit registers, which we denote in assembly as e.g. s[8:15]).

What I'd like to have
---------------------
I'd like to introduce the notion of array register files. Physical
registers in an ARF would be described by

- the register array ID
- the starting index of the "register"
- the ending index of the "register" (inclusive)

  let me think that through from the SV extension perspective. with
SV, i am following the example of RVV. there is one "global" status
register (example assembly code can be seen in "SIMD Instructions
Considered Harmful", SIGARCH). i did
give serious consideration to having a per-register length... am still
mulling that one over as it takes up a lot of extra CSR space.

  so if such a notion existed, then the SV-LLVM back-end would need to
look at the table, go "hmm, this ARF entry requires a length of [e.g.]
6, the current global length is 4, let's push that on the stack and
change the global". which would be perfectly manageable.

  so yes, as long as those entries in the table cover contiguous
registers, _great_.

  also, one thing: do you envisage the ARF table covering *overlapping*
registers? the reason i ask is because SV doesn't care if they are
(i.e. it's the compiler's problem to sort out).

  i.e.:
* ARF1 = { id: 0, start=r5, end=r9}
* ARF2 = { id: 1, start=r6, end=r10}

the reason i ask is because on the isa-dev list jacob proposed an
algorithm that could actually be implemented by issuing a parallel-add
involving two (equal length) overlapping ARF entries: add ARF1, ARF2.

Maybe I misunderstand you, but I don't envision having an explicit "ARF table" like that in LLVM, precisely because I want to get away from opaque physreg indices.

For your examples above, the "registers" (in the LLVM CodeGen sense) would be represented as (and using 12 bits for start / end just to make the hex look nicer):

ARF1 = 0x01009005
ARF2 = 0x0100a006

(Assuming that 0x01 is used as the integer array ID; 0x02 could then be used for RISC-V float registers.)

So your `add ARF1, ARF2` could perhaps be represented in LLVM after register allocation as a MachineInstr whose MachineOperands are simply registers with the above numeric representation.
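
For concreteness, here is a small compile-time check of that 12-bit
layout (decode12 and its field names are mine, not anything in LLVM):

// Worked decode of the two example values: low 12 bits = start,
// next 12 bits = end (inclusive), then the register array ID.
struct ArfRegVal { unsigned ArrayID, Start, End; };

constexpr ArfRegVal decode12(unsigned V) {
  return {(V >> 24) & 0x7f, V & 0xfff, (V >> 12) & 0xfff};
}

static_assert(decode12(0x01009005).ArrayID == 1 &&
              decode12(0x01009005).Start == 5 &&
              decode12(0x01009005).End == 9, "ARF1 covers r5..r9");
static_assert(decode12(0x0100a006).Start == 6 &&
              decode12(0x0100a006).End == 10, "ARF2 covers r6..r10");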

I have no idea how you'd handle the indirection in LLVM. That's a whole other can of worms that AMDGPU luckily doesn't have to deal with. :)

In particular, I don't see how what I'm talking about would help you with managing slots in the indirection CAM. It would only help you get a more natural "encoding" of the underlying (already redirected) registers that instructions are using, for the purpose of register allocation etc.

Also, there are a whole bunch of issues with overlapping registers when it comes to live intervals etc., and I admit I haven't thought through all the implications. I'd like to make it all work, though, and I consider that to be one of the most important things to dig into if I actually start this. I think it'd be nice to avoid sub-registers altogether in this representation (simply always using the canonical representation), but we'll see how feasible that is.

This can be conveniently encoded in the 31 bits we effectively have
available for physical registers (e.g. 13 bits for start / end, 5 bits
for register array ID).

  cool. SV is 6 bits for start / end, 6 bits for length.

Register array ID 0 would be reserved for the familiar, TableGen-defined
registers (we would still have some of those in AMDGPU, even with this
change).

  i'm not familiar with what tablegen is, so can't comment.

It would necessarily have to be possible to generate register classes
for ARFs both in TableGen and on-the-fly while compiling. Base register
classes would be defined by:

- the register array ID
- the minimum start / maximum end index of registers in the class
- the size of registers in the class
- the alignment of registers in the class (i.e., registers must start at
multiples of N, where N is a power of two)

  ok so you lost me a little, here. so are the 256 vector registers
sub-dividable into FP16/32/64 for example? and, if so, would it be a
good idea to have the size be part of the ARF table? (id, start, end,
bitwidth)?

Since we have a SIMT/SPMD/whatever-you-want-to-call-it model in AMDGPU, we mostly tend to think of the vector registers as simply 32-bit registers. In reality, they have 64 lanes of 32 bits each.

Unlike CPU SIMD architectures, we don't represent 64-bit values as 32 lanes of 64 bits; instead, 64-bit values are represented as two consecutive "32-bit" registers. So we store 32-bit values in v0, v1, ..., and 64-bit values in v[0:1], v[1:2], ...

That way, every thread of the original program stays within one lane of the SIMD structure.

For 16-bit, we use half-registers, i.e. each 16-bit value uses either the low or high half of a "32-bit" vector register.

  or, are you talking about having *another* structure that has that
bitwidth? i'm slightly confused as to what the difference between the
ARF start/end and the above-mentioned "minimum start / maximum end
index of registers in the class" would be.

The "min start / max end" is kind of a detail thing which we might not actually need for AMDGPU. I was thinking that some architecture might have encodings that limit the reachable range of instruction operands (kind of like the RISC-V compressed instructions thing).

To describe such MachineInstr opcodes in an LLVM backend, you need register classes with a few more restrictions.

... and then register classes might be the union of such register
classes and traditional register classes.

  that sounds scary.

I know. But unfortunately necessary :)

(For example, in AMDGPU we would have a register class that includes all
registers from the scalar array, with size 2, starting at an even offset,
union'd with a class containing some special registers such as VCC.)

  oh right! yes, i get it. no, i like it. in SV there is an
additional entry per register which specifies the bitwidth: default,
half, doubled, and 8-bit. requires only 2 bits to express.

  with the union system that you propose, the different possible
bitwidths, even though the same underlying registers are used, could
be represented.

A similar scheme would have to be used for sub-register indices.

I haven't dug too deeply into this yet, and clearly there are quite a
number of thorny issues that need to be addressed -- it's a rather big
project.

  yes :)

But so far I'm not aware of anything absolutely fundamental
that would prevent doing this.

What I'm asking you at this point
---------------------------------
Like I said, I haven't actually started any of this work (and probably
won't for some time).

However, if you have any fundamental objections to such a change, please
speak up now before I or anybody else embarks on this project. I want to
be confident that people are okay with the general direction.

  yeah, it's quite timely.

Also, if you have any advice or suggestions (maybe an alternative that
would also fit the requirements of the AMDGPU backend), I'd be happy to
hear about it!

  a couple of questions: i don't know enough about AMDGPU to be able to
say technically if the approach will work specifically for that
architecture; i can however raise some questions about the general
approach.

  (1) does the AMDGPU have the concept of predication, and if so has a
data structure been considered to cover it?

Yes, although it's mostly implied because we do the whole SIMT/SPMD thing. There's a dedicated EXEC register which masks lanes for all vector instructions.

  (2) what's the difference between the start/end indices of the ARF
and the "class"?

  (3) what's the reason for putting the bitwidth (and alignment) into
the "class", i.e. why have the "class" rather than just have one data
structure containing everything: the ARF?

For both of those, I hope the above already clarified it a bit, but the general answer is that "register class" describes constraints, e.g. on the operands of specific machine instructions, which the register allocator would follow.

For example: If you had 256 registers, but some instruction encodings are limited to the low 64 registers, and you're looking at an instruction that takes consecutive 4-element registers, you'd use a register class with "min start = 0, max end = 66" (assuming the limitation is purely an encoding limitation).

The register allocator would then allocate some consecutive 4 registers in that range, e.g. it could come up with "register start = 17, end = 20".

(In the AMDGPU case, we also have alignment restrictions; e.g. scalar 4-register descriptors must be aligned to multiples of 4: s[0:3], s[4:7], ...)
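
As a sketch of the allocator-side query this enables (first-fit only
for illustration; the names and the fixed-size bitset of free
registers are mine):

#include <bitset>
#include <optional>

// Find Size consecutive free registers with the start aligned to
// Align, lying entirely within [MinStart, MaxEnd] (end inclusive).
// With MinStart=0, MaxEnd=66, Size=4, Align=1 this could return 17,
// i.e. the start=17/end=20 allocation from the example above.
std::optional<unsigned> findFreeRun(const std::bitset<256> &Free,
                                    unsigned MinStart, unsigned MaxEnd,
                                    unsigned Size, unsigned Align) {
  unsigned S = (MinStart + Align - 1) / Align * Align;
  for (; S + Size - 1 <= MaxEnd; S += Align) {
    bool AllFree = true;
    for (unsigned I = 0; I < Size && AllFree; ++I)
      AllFree = Free[S + I];
    if (AllFree)
      return S;
  }
  return std::nullopt;
}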

Cheers,
Nicolai

First: To help the discussion not get sidetracked too much: We do support vector registers today; if your architecture needs to model consecutive registers or tuples, you can do this today. Targets like AMDGPU, Hexagon, and Apple's GPU target do this and it works for them.

With that out of the way, there are some situations in which it doesn't work as well as it could, and this is what this discussion is about.
Today you need to specify tuples upfront in tablegen; you need a separate register class for every size of tuple and easily end up with thousands of tuple registers.

Hi all,

There's a rather major piece of work that's been in the back of my mind for a while now. Before I actually start any work on it, I'd like to hear people's opinions, if any.

tl;dr: I'd like to augment the CodeGen physical register / register class infrastructure with a mechanism to represent large regular register files more efficiently.

The motivation is that the existing infrastructure really isn't a good fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector registers. In addition to the sheer number of registers, there are some qualitative factors that set us apart from most (all?) other backends:

1. The order of registers matters: if we use only a subset of registers, and that subset is on the low end of the range, we can run more work in parallel on the hardware. (This is modeled with regalloc priorities today.)

2. We can indirectly index into both register files. If a function has an alloca'd array of 17 values, we may want to lower that as 17 consecutive registers and access the array using indirect access instructions.

Just to rephrase your point, because in principle you can start supporting such an indirect access feature today: The problem is that making register classes for N-tuple registers creates a huge number of registers and forces you to specify them in tablegen up-front.

3. Even this aside, the number of register classes we'd really need is quite staggering. We have machine instructions taking operands with anywhere from 1 to 12 consecutive registers.

Modeling this as register classes with sub-registers is clearly not a good match semantically, let alone from a compile time performance point of view. Today, we take effectively a runtime performance hit in some cases due to not properly modeling the properties of our register files.

We should make an effort to name concrete issues, so we can judge what the upside of all the work would be. Things I can think of right now:
- Algorithms based on physregs and/or using MCRegAliasIterator are typically so slow as to be unusable for GPUs today because of the huge number of registers. I remember a number of instances where we had to rewrite code to be based on register units to get reasonable compile-time performance.

Is there anything else?

What I'd like to have
---------------------
I'd like to introduce the notion of array register files. Physical registers in an ARF would be described by

- the register array ID
- the starting index of the "register"
- the ending index of the "register" (inclusive)

This can be conveniently encoded in the 31 bits we effectively have available for physical registers (e.g. 13 bits for start / end, 5 bits for register array ID).

Register array ID 0 would be reserved for the familiar, TableGen-defined registers (we would still have some of those in AMDGPU, even with this change).

It would necessarily have to be possible to generate register classes for ARFs both in TableGen and on-the-fly while compiling. Base register classes would be defined by:

- the register array ID
- the minimum start / maximum end index of registers in the class
- the size of registers in the class
- the alignment of registers in the class (i.e., registers must start at multiples of N, where N is a power of two)

... and then register classes might be the union of such register classes and traditional register classes.

(For example, in AMDGPU we would have a register class that includes all registers from the scalar array, with size 2, starting at an even offset, union'd with a class containing some special registers such as VCC.)

A similar scheme would have to be used for sub-register indices.

I think this would be mainly about creating register classes in an ad-hoc fashion:
- I think we need matching register classes for all classes.
- If you have a register class anyway, then you don't need the fancy encoding anymore and can just stay with register numbers in the class.
- The change compared to today would be that we do not have pre-computed list of physical registers but make them up on-demand.

I haven't dug too deeply into this yet, and clearly there are quite a number of thorny issues that need to be addressed -- it's a rather big project. But so far I'm not aware of anything absolutely fundamental that would prevent doing this.

This can become tricky fast...
- Would we keep today's notion of register hierarchies as well?
- Will this scheme support arbitrary register constraints?
- Would we keep supporting interfaces like listing the registers contained in a class? Doing MCRegAliasIterator queries (would that even make sense when you create ad-hoc classes on demand?)
- How about code using MCRegisterInfo::getNumRegs()? How about MachineRegisterInfo maintaining doubly-linked lists for all defs/uses of a register?

What I'm asking you at this point
---------------------------------
Like I said, I haven't actually started any of this work (and probably won't for some time).

However, if you have any fundamental objections to such a change, please speak up now before I or anybody else embarks on this project. I want to be confident that people are okay with the general direction.

I think this will be a gigantic project. I also fear it will not help the complexity of the backend infrastructure, since so far it seems we would keep all of the existing mechanisms for dealing with hierarchies around as well and just add one more possibility for tuple registers...
So while I do see the benefits here especially for GPU targets, I am also undecided/have a bad feeling about adding more complexity into generic LLVM infrastructure.

- Matthias

[hi nicolai, i am so sorry this is enormous! please do take your time.
also is there a resource somewhere, a wiki of some description, where
the team at AMD would be happy to document things collaboratively? i
would be happy to host a temporary home for such a document, if that
would help?].

> SV (the custom extension jacob kindly mentioned) also has register
> redirection. it's slightly different in that it's completely
> transparent: all *SCALAR* operations, if a register is marked in the

I've just skimmed what's in SV,

  oh! sorry, i forgot to give you a reference (and a caveat that it's
still at the phase of being implemented in a simulator, which is
rapidly hammering down on "that would be nice to have" and turning it
into "this is far more practical, let's go with that", if you get my
drift).

and it seems a bit different from what
AMDGPU has, although perhaps it could be put to the same use. There are
definitely similarities.

not knowing the full details of AMDGPU, that's interesting to hear.

AMDGPU doesn't have any of the fancy unrolling of instructions;

oh? intriguing: when you mentioned that AMDGPU is a vector processor,
i assumed that it would have vector lengths... hmm, oh i know what you
mean: the "implicit hard-macro-loop-unrolling" thing, where this:

SETVLENGTH 3
add r0, r4 # normally this is a scalar add

actually gets translated to:
for i in range(3):
    add r(0+i), r(4+i) # 3 scalar adds

it was the simplest way i could think of that would mean an
implementor could take a standard "scalar" RISC implementation and
turn it into a "vector" one. no *actual* modification of the ALUs, at
all. all the SV logic sits in between the decoder phase and the
instruction issue phase.

there's
simply a mode in which an offset from a special register is added to
certain register references. E.g., if an instruction references vector
register v7 and the special register contains 13, it'll reference v20
instead. (The exact details of how to enable this mode differ between
hardware generations.)

okay. so this is similar to SV's "redirection" table. 5-bit key,
6-bit value (and an element width and a boolean "scalar/vector" flag).
that way, the *standard* scalar ops (normally restricted to 32 regs)
redirect to 64 actual target regs.

in this way, there's no need to have a scalar-vector opcode, a
vector-scalar opcode, a vector-vector opcode *and* a scalar-scalar
opcode: it's just... all one opcode.

>>> 3. Even this aside, the number of register classes we'd really need is
>>> quite staggering. We have machine instructions taking operands with
>>> anywhere from 1 to 12 consecutive registers.
>
> is that length part of the instruction or is it set up by way of an
> indirect status register of some kind?

It's complicated :)

:)

As far as LLVM is concerned, it's always part of the MachineInstr to
keep things sane.

ok. so, some state gets set up, and the instruction's meaning gets
changed? (that's how SV does it). or is it more complex than that?

In hardware, the things that use up to 12 consecutive registers are
texture sampling instructions.

ok! so, here, if an instruction is issued, some state has to be set
(somewhere), and, if set, the instruction uses registers s[8] through
to s[15] instead of s[8] through to s[12]? the important thing being,
it's *not* the *instruction* that gets modified, it's the state
associated with the *register* that's modified, is that right? (for a
given approximation of "right")

(because, if so, that's pretty much exactly what SV does)

The binary instruction encoding only
contains the type of sampling (e.g., whether user-specified gradients
are used), but not the dimensionality of the texture (1D vs. 2D vs. 3D),
which also affects the number of source registers.

that's fascinating. extremely sophisticated.

>>> I'd like to introduce the notion of array register files. Physical
>>> registers in an ARF would be described by
>>>
>>> - the register array ID
>>> - the starting index of the "register"
>>> - the ending index of the "register" (inclusive)
> [....]
> * ARF1 = { id: 0, start=r5, end=r9}
> * ARF2 = { id: 1, start=r6, end=r10}

Maybe I misunderstand you, but I don't envision having an explicit "ARF
table" like that in LLVM, precisely because I want to get away from
opaque physreg indices.

For your examples above, the "registers" (in the LLVM CodeGen sense)
would be represented as (and using 12 bits for start / end just to make
the hex look nicer):

ARF1 = 0x01009005
ARF2 = 0x0100a006

ok so just bit-field representations. that works. SV (presently)
has that global vector-length rather than an explicit per-register
length, that's not a problem: it just means that the (global) VL state
would need to be pushed on the stack if ever it needed to be changed.

(Assuming that 0x01 is used as the integer array ID; 0x02 could then be
used for RISC-V float registers.)

oh. err... ah so the ID is a *type* indicator rather than an actual
indicator of a register *number*? type 0x01 = int, type 0x02 = float.
okaaay, got it.

and... if it's a scalar it would be obviously represented
"0x01009009", as the start would equal the end (inclusive).

and ID for AMDGPU would have 0x03 for "texture" register, for example?

*thinks*... hmmm... if that table were to be expanded out in full,
for SV, it would contain... 63+62.... 32 entries. x0 is zero in RISCV
(so no sense "vectorising" that), x1 could have VL set anywhere from 1
to 63 (to mean "when you use x1, please do a hardware-loop issuing
scalar instructions on x1, and on x2, and on x3 ..... x63), then x2
could have VL set anywhere from 1 to 62 (again, hitting a limit
referring to x63) and so on.

it's a very, very large table, which, i think from the consecutive
overlapping / re-use in AMDGPU, would not actually be a genuine "if
you use this ARF you have *all* possible entries remaining to be able
to use immediately" if you get my drift

i.e. if a specific texture ARF were to be used which had a length of
12, it's *NOT* just that *one* ARF that would have to be marked as "in
use", its use would actually knock *multiple* holes in the ARF range:
all those which referred *to* the underlying actual registers covered
by that length range would need to be marked as "unavailable".

would you concur?

I have no idea how you'd handle the indirection in LLVM. That's a whole
other can of worms that AMDGPU luckily doesn't have to deal with. :slight_smile:

:)

In particular, I don't see how what I'm talking about would help you
with managing slots in the indirection CAM. It would only help you get a
more natural "encoding" of the underlying (already redirected) registers
that instructions are using, for the purpose of register allocation etc.

okaaay, so you're considering setting up the indirection, for a
particular one-shot purpose (a texture rendering for example) and not
changing it, ever, until that execution is completed?

which seems like an absolutely perfectly reasonable and rational
thing to do, and has the advantage of making the fact that redirection
exists completely sane rather than absolutely mind-bendingly NUTS :)

i'd been thinking of designing some conventions for SV's registers:
two in particular stand out: one is LOAD-Multi/STORE-Multi (for entry
and exit to functions), and the other is for userspace-kernelspace
context-switches. set VL=63, do a stack-store operation on x1, BLAM,
the entire register file is saved to stack with a single instruction.

going beyond that: setting some caller-callee conventions as well so
that certain ranges of registers are to be used as function call
parameters, some as temporaries and so on, all makes perfect sense,
set up once, and leave it *well* alone.

Also, there are a whole bunch of issues with overlapping registers when
it comes to live intervals etc., and I admit I haven't thought through
all the implications.

i can honestly say that i feel the most sensible thing there would be
to leave overlaps to assembly writers for now. unless you happen to
have access to a couple of 190 IQ geniuses at AMD... who are planning
to stay there for life :)

i worked for Aspex Semiconductors back in 2003 (well before they were
bought by Ericsson), and the registers there... well... um... they
were constructed dynamically from 2-bit ALUs. so hypothetically you
could have the entire processor operate on single 8192-bit-long
"registers", or you could break it down into as small as 4096 2-bit
SIMD registers.

it was insane.

we actually had to hand-write assembly algorithms in spreadsheets of
all things, emulating all possible permutations of possible register
sizes, calculating the time taken for the algorithm if the register
size was 2, or 4, or 8 .... all because the memory load / store was a
fixed time (obviously), but any given algorithm would be a different
duration. to know the optimal algorithm speed you *literally* had to
go through all possible permutations.

productivity was measured in days per line of code. no, not lines of
code per day: DAYS per line of code.

bottom line: if there's *any way* that you can get the registers set
up (even if it's by hand or by convention) in advance, and not change
their configurations except when that particular algorithm is done and
out of the CPU, i'd really *really* recommend and concur with that
approach, initially.

even if the proposed backend were to automate the generation of
instructions based on pre-arranged (hand-allocated) register
allocations, that would be infinitely better than how things had to be
done with the Aspex ASP.

once that's in place, later you could look at running optimisation
loops (dumb brute-force ones, even) which test different register
allocations, and, through a rapid iteration process (or just
brute-force search) get a better handle on things.

> ok so you lost me a little, here. so are the 256 vector registers
> sub-dividable into FP16/32/64 for example? and, if so, would it be a
> good idea to have the size be part of the ARF table? (id, start, end,
> bitwidth)?

Since we have a SIMT/SPMD/whatever-you-want-to-call-it model in AMDGPU,
we mostly tend to think of the vector registers as simply 32-bit
registers. In reality, they have 64 lanes of 32 bits each.

Unlike CPU SIMD architectures, we don't represent 64-bit values as 32
lanes of 64 bits; instead, 64-bit values are represented as two
consecutive "32-bit" registers. So we store 32-bit values in v0, v1,
..., and 64-bit values in v[0:1], v[1:2], ...

ok so *exactly* how SV works :) except not with the overlaps, perhaps.

in SV it's permitted to break down as far as 8-bit, although i
would not recommend actually breaking the register file itself down
into 64 x 8 separate and distinct 8-bit "registers". [if done at that
depth, the reg-renaming needed for an OoO microarchitecture would be a
huge burden, as a single 64-bit operation would, in hardware, need to
carry EIGHT renamed registers with it. not nice]

so, i said "x1" earlier... actually that is as follows:

union regfile {
    unsigned char b_reg[64*8];
    unsigned short s_reg[64*4];
    unsigned int i_reg[64*2];
    unsigned long l_reg[64];
} *regs;

and so x1 would be accessed as either:

regs->b_reg[1*8] OR
regs->s_reg[1*4] OR
regs->i_reg[1*2] OR
regs->l_reg[1]

but it is important to note that there is no way to directly access
regs->b_reg[3], for example [except through vector predication].

in SV it is highly likely that implementors would go "wtf? i am *not*
subdividing the registerfile into 64*8 separate and distinct
registers!" and would instead settle for a predicated SIMD ALU at some
depth that they feel comfortable with (similar to AMDGPU's choice,
below, of 32).

so effectively very very similar to AMDGPU, i think? except where
AMDGPU would allow v[0:1], v[1:2], v[2:3], SV would *not* allow
v[1:2], it would *only* allow v[0:1] and v[2:3], in effect, and i
believe this may have been what you were referring to about the "class
restrictions" (below).

HOWEVER... it just occurred to me: on an RV32 system, setting
"elwidth=default x 2" on a given register would pretty much give the
*exact* same behaviour in SV as AMDGPU. in that particular case, an
"add x2, x6" would result in x2+x3 being treated as a 64-bit operand,
and x6+x7 likewise. it would be equivalent to AMDGPU terminology "ADD
v[2:3], v[6:7]".

also... please don't freak out, i'm having difficulty coping with the
microarchitectural implications myself... in both SV and RVV it's
possible to set *different* element widths on the source operands and
destination operands, and the microarchitecture has the fun job of
working out how to convert and sign-extend between the different
element widths.

That way, every thread of the original program stays within one lane of
the SIMD structure.

indeed. SV by contrast is designed to fit either on a [predicated!]
SIMD microarchitecture *or* a multi-issue superscalar one. if a
multi-issue superscalar architecture is to be used, that "parallel
loop" you saw pseudo-code for in the spec would just... push
individual element-wise "unpackaged" scalar ops into the *standard*
scalar instruction FIFO.

if a SIMD microarchitecture is chosen, all that happens is that if
the SIMD width is 4, then 4 bits of predicate are pushed down along
with 4 batches of operands at a time, into the SIMD ALU. at the end
of a loop (where the last elements of the loop are only 3, 2 or 1
long), implicit predication will mask off the unused lanes,
automatically.

goodbyyye SIMD corner-cases :)

point being: as far as the compiler / assembly is concerned, the
underlying microarchitecture is opaque to SV. apart from speed
differences, you genuinely don't care.

For 16-bit, we use half-registers, i.e. each 16-bit value uses either
the low or high half of a "32-bit" vector register.

okay, so, starting to get SIMD-like, there. and i see you have
predication, below, or "lane masking".

> or, are you talking about having *another* structure that has that
> bitwidth? i'm slightly confused as to what the difference between the
> ARF start/end and the above-mentioned "minimum start / maximum end
> index of registers in the class" would be.

The "min start / max end" is kind of a detail thing which we might not
actually need for AMDGPU. I was thinking that some architecture might
have encodings that limit the reachable range of instruction operands
(kind of like the RISC-V compressed instructions thing).

from what i've seen, traditional vector architectures typically
don't: their life is complicated enough as it is. for SV, on the other
hand, it would actually be necessary to explicitly add logic to
exclude the C (compressed) instructions, because they're actually a
"remapper" to standard 32-bit opcodes. it's just that some of the
remapping has, as you say, restricted range.

ok so yes, having a way to express that restricted range, excellent idea.

To describe such MachineInstr opcodes in an LLVM backend, you need
register classes with a few more restrictions.

>>> ... and then register classes might be the union of such register
>>> classes and traditional register classes.
>
> that sounds scary.

I know. But unfortunately necessary :)

i'm starting to get where you're coming from. hasn't totally sunk
in... i *believe* it may be the union thing i expressed above (and the
architectural similarities between AMDGPU and SV are closer than we
initially thought).

so far, i believe i've been able to determine that:

* where the AMDGPU register files are broken down into 32-bit
segments, SV can break down as far as 8-bit (although implementors are
free to not do so, and instead to implement a SIMD microarchitecture
*as long as it is a predicated one*)
* where the AMDGPU register file naming conventions are always at the
32-bit level, SV can be either 32, or 64, or 128, depending on whether
the implementation is RV32, RV64 or RV128.

> (1) does the AMDGPU have the concept of predication, and if so has a
> data structure been considered to cover it.

Yes, although it's mostly implied because we do the whole SIMT/SPMD
thing. There's a dedicated EXEC register which masks lanes for all vector
instructions.

ok, so this is where SV definitely significantly differs: there's a
secondary CSR (control and status register) "predication" table, where (in
english) the logic goes as follows:

"before the instruction is executed, check to see if the destination
register has an entry in the "predication" key-value store. if it
does, use the listed target register as a predicate".

in this way it is possible to have *all* and any registers be
predicated. in combination with redirection, it would even be
possible for one operation to be predicated (targeting the same
underlying registers) and the very next operation *not* be
predicated... or for the predicate to be inverted... but using the
*exact* same underlying registers.

so that would allow if-then-else constructs to be done, for example,
without needing a global mask instruction in between them.

> (2) what's the difference between the start/end indices of the ARF
> and the "class"?
>
> (3) what's the reason for putting the bitwidth (and alignment) into
> the "class", i.e. why have the "class" rather than just have one data
> structure containing everything: the ARF?

For both of those, I hope the above already clarified it a bit,

yes. getting there.

but the
general answer is that "register class" describes constraints e.g. on
the operands of specific machine instructions which the register
allocator would follow.

makes sense. i wouldn't have thought of that approach.

For example: If you had 256 registers, but some instruction encodings
are limited to the low 64 registers, and you're looking at an
instruction that takes consecutive 4-element registers, you'd use a
register class with "min start = 0, max end = 66" (assuming the
limitation is purely an encoding limitation).

The register allocator would then allocate some consecutive 4 registers
in that range, e.g. it could come up with "register start = 17, end = 20".

(In the AMDGPU case, we also have alignment restrictions; e.g. scalar
4-register descriptors must be aligned to multiples of 4: s[0:3],
s[4:7], ...)

yes. so there would be effectively a similar restriction for SV [1].
if a register's element width CAM entry is set to "16 bit" for example
then that would result in any 64-bit or 32-bit "add" actually being
interpreted as a 16-bit add...

... however, looking back up at that union table i wrote (for
reference), with an instruction "add x13, x13" it would *NOT* be
possible to do regs->s_reg[13] for example, it would *only* be possible
to do regs->s_reg[13*4]. likewise if the entry was set to 8-bit, it
would only be possible to do regs->b_reg[13*8].

this is really, really cool. it would i think be great to get ARM's
input, here, as well as the RISC-V RVV Working Group's input, to see
how and if those respective architectures would fit into the data
structure(s) envisaged. also, i am bcc'ing a couple of people from
other companies that i know are doing RISC-V vectorisation
architectures. they may also be interested.

also, can i suggest: just as with SV needing to do the trick of
putting the "global" vector length onto the stack (as an
implementation detail) where AMDGPU can just set up the lengths of
registers once and only once, that we consider creating a similar
scheme for predication that is as "general-purpose", even though
things would be reversed for AMDGPU compared to SV: it would be AMDGPU
having to set a "global register", where SV could set up a table once
and only once.

other architectures may have something in between (or just different).
Nyuzi (by Jeff Bush), i know, for example, actually has the
predication as a fixed part of the opcode: 4 bits of the (64-bit)
opcode are reserved as a mask. RVV has only the one predicate
register (v1) https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
- however its use is specifically coded (activated) in every single
vector opcode (2 bits are reserved in every opcode to indicate if it's
to be used at all, inverted or not, and so on). i don't know enough
about ARM SVE to say what they might have?

i feel it would be good to properly recognise the predication as an
integral part of the backend data structures (even if some
architectures had to save/restore global state to get the "full"
features), because predication is fundamentally how parallel
architectures appear to do branches.

many *many* apologies, this took a long time to write, i do appreciate
that this is a complex topic.

l.

[1] [i need to document this] although maybe it's not a good idea to
mention that, due to exceptions occurring in the middle of a parallel
operation due to one of the "real" scalar ops requiring e.g. a virtual
memory trap, the "loop" you spotted in the spec pseudo-code actually
has to be re-entrant, and that in turn means that the "element offset
index" has to be saveable and restorable state. i am reluctant to
mention it in case compiler writers decide to abuse it to overcome the
alignment limitations :)

First: To help the discussion not get sidetracked too much: We do support vector registers today; if your architecture needs to model consecutive registers or tuples, you can do this today. Targets like AMDGPU, Hexagon, and Apple's GPU target do this and it works for them.

Right. With the caveat that working on AMDGPU is what drove me to bring this up in the first place :)

With that out of the way, there are some situations in which it doesn't work as well as it could, and this is what this discussion is about.
Today you need to specify tuples upfront in tablegen; you need a separate register class for every size of tuple and easily end up with thousands of tuple registers.

Almost 40k of them, to be precise, if we were serious about actually using all features of the hardware in AMDGPU. We are not doing that today.

Hi all,

There's a rather major piece of work that's been in the back of my mind for a while now. Before I actually start any work on it, I'd like to hear people's opinions, if any.

tl;dr: I'd like to augment the CodeGen physical register / register class infrastructure with a mechanism to represent large regular register files more efficiently.

The motivation is that the existing infrastructure really isn't a good fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector registers. In addition to the sheer number of registers, there are some qualitative factors that set us apart from most (all?) other backends:

1. The order of registers matters: if we use only a subset of registers, and that subset is on the low end of the range, we can run more work in parallel on the hardware. (This is modeled with regalloc priorities today.)

2. We can indirectly index into both register files. If a function has an alloca'd array of 17 values, we may want to lower that as 17 consecutive registers and access the array using indirect access instructions.

Just to rephrase your point, because in principle you can start supporting such an indirect access feature today: The problem is that making register classes for N-tuple registers creates a huge number of registers and forces you to specify them in tablegen up-front.

Right.

3. Even this aside, the number of register classes we'd really need is quite staggering. We have machine instructions taking operands with anywhere from 1 to 12 consecutive registers.

Modeling this as register classes with sub-registers is clearly not a good match semantically, let alone from a compile time performance point of view. Today, we take effectively a runtime performance hit in some cases due to not properly modeling the properties of our register files.

We should make an effort to name concrete issues, so we can judge what the upside of all the work would be. Things I can think of right now:
- Algorithms based on physregs and/or using MCRegAliasIterator are typically so slow as to be unusable for GPUs today because of the huge number of registers. I remember a number of instances where we had to rewrite code to be based on register units to get reasonable compile-time performance.

Is there anything else?

I imagine there's a benefit to register allocation if we could just be honest about the linear nature of the register file. Doing a good job of allocating differently-sized registers from the same file is hard enough as it is; doing so in an unstructured sea of register units, as opposed to a linearly ordered register file, certainly doesn't help.

What I'd like to have
---------------------
I'd like to introduce the notion of array register files. Physical registers in an ARF would be described by

- the register array ID
- the starting index of the "register"
- the ending index of the "register" (inclusive)

This can be conveniently encoded in the 31 bits we effectively have available for physical registers (e.g. 13 bits for start / end, 5 bits for register array ID).

Register array ID 0 would be reserved for the familiar, TableGen-defined registers (we would still have some of those in AMDGPU, even with this change).

It would necessarily have to be possible to generate register classes for ARFs both in TableGen and on-the-fly while compiling. Base register classes would be defined by:

- the register array ID
- the minimum start / maximum end index of registers in the class
- the size of registers in the class
- the alignment of registers in the class (i.e., registers must start at multiples of N, where N is a power of two)

... and then register classes might be the union of such register classes and traditional register classes.

(For example, in AMDGPU we would have a register class that includes all registers from the scalar array, with size 2, starting at an even offset, union'd with a class containing some special registers such as VCC.)

A similar scheme would have to be used for sub-register indices.

I think this would be mainly about creating register classes in an ad-hoc fashion:
- I think we need matching register classes for all classes.
- If you have a register class anyway, then you don't need the fancy encoding anymore and can just stay with register numbers in the class.
- The change compared to today would be that we do not have pre-computed list of physical registers but make them up on-demand.

The advantage of the start / end encoding is that checking for physical register containment / overlap / etc. becomes trivial.
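
For instance (a sketch only, with the fields unpacked from the
encoding for readability):

// Aliasing becomes a pair of interval tests: two ARF registers
// overlap iff they live in the same array and their index ranges
// intersect. No MCRegAliasIterator-style walk is needed.
struct ArfReg { unsigned ArrayID, Start, End; };

bool overlaps(ArfReg A, ArfReg B) {
  return A.ArrayID == B.ArrayID && A.Start <= B.End && B.Start <= A.End;
}

bool contains(ArfReg Outer, ArfReg Inner) {
  return Outer.ArrayID == Inner.ArrayID &&
         Outer.Start <= Inner.Start && Inner.End <= Outer.End;
}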

I haven't dug too deeply into this yet, and clearly there are quite a number of thorny issues that need to be addressed -- it's a rather big project. But so far I'm not aware of anything absolutely fundamental that would prevent doing this.

This can become tricky fast...
- Would we keep today's notion of register hierarchies as well?
- Will this scheme support arbitrary register constraints?

What do you mean by this?

- Would we keep supporting interfaces like listing the registers contained in a class? Doing MCRegAliasIterator queries (would that even make sense when you create ad-hoc classes on demand?)
- How about code using MCRegisterInfo::getNumRegs()? How about MachineRegisterInfo maintaining double linked lists for all defs/uses of a register?

All good questions, and I certainly don't claim to have answers to them yet!

I can tell you, though, that at least in AMDGPU, the uses of MCRegisterInfo::getNumRegs() and MCRegAliasIterator seem to be more accidental and due to the way the current design happens to be. It looks like they could be replaced by something more elegant with the scheme I described.

I haven't looked at other targets yet, though it goes without saying that I don't want to ignore them.

What I'm asking you at this point
---------------------------------
Like I said, I haven't actually started any of this work (and probably won't for some time).

However, if you have any fundamental objections to such a change, please speak up now before I or anybody else embarks on this project. I want to be confident that people are okay with the general direction.

I think this will be a gigantic project.

Agreed.

I also fear it will not help the complexity of the backend infrastructure, since so far it seems we would keep all of the existing mechanisms for dealing with hierarchies around as well and just add one more possibility for tuple registers...
So while I do see the benefits here especially for GPU targets, I am also undecided/have a bad feeling about adding more complexity into generic LLVM infrastructure.

Point taken.

But what if we can actually transition *everybody* to something that expresses registers as (start, end) pairs over an underlying linear "array register file"? Maybe it sounds even crazier at first, but bear with me :wink:

Consider x86, for example. What if we said that we think of the base x86 register file (with rax, rbx, etc.) as an array of 16 * 8 = 128 bytes. In this scheme, we'd have:

AL = (0, 0)
AH = (1, 1)
AX = (0, 1)
EAX = (0, 3)
RAX = (0, 7)
BL = (8, 8)

and so on. Probably we'd want to define many register classes as explicit lists, although there's something to be said for defining 32-bit registers as: "size = 4, start = 0 (mod 8), max_end = 63"
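
Sticking with the earlier sketch, the table and the predicate-style class could be written down directly (purely illustrative; X86ArrayID is made up):

  constexpr uint32_t X86ArrayID = 1; // the 128-byte GPR array
  constexpr uint32_t AL  = encodeARFReg(X86ArrayID, 0, 0);
  constexpr uint32_t AH  = encodeARFReg(X86ArrayID, 1, 1);
  constexpr uint32_t AX  = encodeARFReg(X86ArrayID, 0, 1);
  constexpr uint32_t EAX = encodeARFReg(X86ArrayID, 0, 3);
  constexpr uint32_t RAX = encodeARFReg(X86ArrayID, 0, 7);
  constexpr uint32_t BL  = encodeARFReg(X86ArrayID, 8, 8);

  // "size = 4, start = 0 (mod 8), max_end = 63" as a base class:
  constexpr ARFClass GR32ish = {X86ArrayID, 0, 63, 4, 8};

  static_assert(overlaps(EAX, AH), "EAX aliases AH");
  static_assert(!overlaps(AX, BL), "AX and BL are disjoint");
  static_assert(GR32ish.contains(EAX), "EAX is a 32-bit register");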

Then everybody would benefit from fast physical register overlap check, at least.

We may even be able to get rid of the whole concept of subregisters, if we can find a good data structure for looking up defs and uses of intervals in the underlying linear register file.

An important question is: Is there a target that has registers which *cannot* be expressed as intervals over an underlying linear array of register units?

Admittedly all of this is largely me brainstorming and trying to bounce ideas off you and others.

Cheers,
Nicolai

First: To help the discussion not get sidetracked too much: We do support vector registers today; if your architecture needs to model consecutive registers or tuples you can do this today. Targets like AMDGPU, Hexagon, and Apple's GPU target do this and it works for them.

Right. With the caveat that working on AMDGPU is what drove me to bring this up in the first place :slight_smile:

With that out of the way, there are some situations in which it doesn’t work as well as it could, and this is what this discussion is about.
Today you need to specify tuples upfront in TableGen: you need a separate register class for every size of tuple, and you easily end up with thousands of tuple registers.

Almost 40k of them, to be precise, if we were serious about actually using all features of the hardware in AMDGPU. We are not doing that today.

  2. We can indirectly index into both register files. If a function has an alloca’d array of 17 values, we may want to lower that as 17 consecutive registers and access the array using indirect access instructions.

Just to rephrase your point, because in principle you can start supporting such an indirect access feature today: The problem is that making register classes for N-tuple registers creates a huge number of registers and forces you to specify them in TableGen up front.

Right.

The advantage of the start / end encoding is that checking for physical register containment / overlap / etc. becomes trivial.

We use register units for interference checks. I think they already are what you wish for (a linear “small” set of things to check). For your typical GPU in practice each single register (smallest addressable unit, or whatever you want to call it) will be mapped onto a register unit and the register allocators do all the interference checking and bookkeeping at the register unit level. So as far as I can tell we are already there and I don’t see any benefits in changing this.
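
In rough pseudocode (not the actual allocator API), the unit-based check is a walk over a register's few units, regardless of how many tuple registers exist:

  #include <bitset>
  #include <vector>

  constexpr unsigned NumUnits = 360; // hypothetical: 256 VGPRs + ~104 SGPRs

  // A candidate register is free iff none of its units is live.
  bool isFree(const std::bitset<NumUnits> &LiveUnits,
              const std::vector<unsigned> &UnitsOfReg) {
    for (unsigned U : UnitsOfReg)
      if (LiveUnits.test(U))
        return false;
    return true;
  }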

This can become tricky fast…

  • Would we keep today's notion of register hierarchies as well?
  • Will this scheme support arbitrary register constraints?

What do you mean by this?

I was thinking about the “classic” hierarchies like X86 AX being composed of AH/AL, or having different-sized versions of the same register (X86 RAX, EAX, AX, AL or Xn, Wn on AArch64). The question would be whether we keep modeling them the way we do today. Intuitively I would say yes, but that would mean the dynamic register class creation stuff is built on top of the existing hierarchies and we need to support both…

I can tell you, though, that at least in AMDGPU, the uses of MCRegisterInfo::getNumRegs() and MCRegAliasIterator seem to be more accidental and due to the way the current design happens to be. It looks like they could be replaced by something more elegant with the scheme I described.

Yes, the strange situation today is that you need to write algorithms in terms of register units to avoid them becoming a compile-time problem in GPU settings.

An important question is: Is there a target that has registers which cannot be expressed as intervals over an underlying linear array of register units?

See above.

In principle we also have the feature of arbitrary register aliases. Though last time I looked it was only used in one target (I think it was Mips), and in a way where you could have modeled things differently and gotten away without arbitrary aliases if you were willing to change the code. Then again, register units already handle aliases too. So to me the implementation/practical side here would be coming up with a concept of dynamic register classes and getting more consistent about using register units wherever possible. I don’t see any benefit to the proposed encoding yet…

in order to keep my own head straight (i've been doing collaborative
development for two decades, i *know* i will not be able to keep up
with just a mailing list), i created this page:
https://libre-riscv.org/llvm_vector_backend/

it's a stub at the moment, i'll fill in pieces from the thread as i
best understand it.

l.

nicolai, hi,

couple things occurred to me, after writing this out
https://libre-riscv.org/llvm_vector_backend/

(1) a way to express the link between "what's wanted" and "what's
available" is needed. i.e. there needs to be a key-value store. as
they stand, proposed ARF and Reg classes only express "what's
available", they don't express "what's wanted".

(2) really SV and RVV both absolutely critically require the "length"
CSR (VL) to be part of the data structures, in order for the registers
to actually be "arrays", at all. if there is no length specified (at
the "what's wanted" level), there's no way for the backend to
determine "what's available".

(3) if the length of an array is specified as part of the data
structures, microarchitectures that don't have that concept can simply
set that to "1" in all data structures. i *think* that means that for
AMDGPU standard vector regs, length would be 1, and for those special
shader registers, it would be 12. or 1-12. or whatever they had been
globally set to for the duration of the application lifetime.

(4) VL ties in with robin kruppe's intermediary representation RFC
(initially designed for RVV). i think it's important to get in touch
with him on that.

(5) the idea of unioning traditional register classes is a good one: i
would hesitate to special-case that. if the ARF and Reg classes are
not capable of expressing the traditional register classes, i would
say that there's something wrong with how the ARF and Reg classes are
designed.

l.

This thread is now mixing two very different notions of length and arrays, and your requirements sound very different from Nicolai's. What you're describing also sounds like it misunderstands what AMDGPU does.

So, for clarity, a brief summary for the AMDGPU situation:

(Up to) "~104" scalar regs at 32b each, (up to) 256 vector regs at 2048b each (subdivided into 64 lanes of 32b each). Fixed vector length, not configurable by app. Every vector operation is implicitly predicated by EXEC mask (stored as 64 bits split across 2 scalar registers, one bit per lane).

There are also some packed operations that operate on individual 32b lanes as packed 2x16b or 4x8b values, but predication etc. is always on the 32b lanes. There is type conversion when loading and storing for a variety of formats, but the in-register format is quite constrained.

The "array" part here is that many AMDGPU instructions will access many registers; not so much that they necessarily correspond to arrays in the sense an "end user" would use the term (although that's possible, as Nicolai mentions). To give an example of mundane "arrays", pointers are 64-bit and generally need to be specified as a pair of scalar 32-bit registers.

The most extreme examples are generally to do with texture sampling. For example, the instruction

   image_sample_d v[0:3], v[4:11], s[0:7], s[8:11], dmask:0xf

writes the 4 vector registers v0 through v3 (holding the "r", "g", "b" and "a" channels of the result); v4 through v11 specify the location to sample as well as derivatives of the location with respect to screen x and y coordinates (in graphics usage); s0 through s7 contain a 256-bit description of the texture resource (address, format, width, height, depth if 3D, array count if an array, number of mip levels, and so forth), and s8 through s11 contain a 128-bit description of the "sampler" (additional parameters describing how to perform the texture sampling process).

However, none of these registers are actually _special_ - all of these are "standard" registers. Nor are they groups of 8, or 12, or whatever registers for the entire runtime of the kernel. Values are just grouped into these consecutive registers prior to the sample instruction, and return values are likewise grouped. But this represents nothing "fundamental" about these values; it's just that some instructions access lots of registers and require them to be consecutive (and, in some cases, to have initial indices that are multiples of 2 or 4) primarily for instruction size reasons. (The instruction above references 12 vector registers and 12 scalar registers; specifying 24 individual register numbers in the opcode would be prohibitive.)
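
Back-of-the-envelope, with minimal field widths (256 vector registers need 8 bits per register number, ~104 scalar registers need 7):

  24 individual register numbers: 12 * 8 + 12 * 7 = 180 bits
  4 base register numbers, with each group's width implied by the
  opcode and operand position:     2 * 8 +  2 * 7 =  30 bits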

What you're describing for RVV so far sounds _very_ different, and sounds to me like it wants somewhat different abstractions. For example, the RVV version sounds like it needs to be very aware of the ARF you mention, and plan out how it's occupied around whole kernels, whereas the AMDGPU use case needs values to be packed in certain ways in the lead-up to things like sampling instructions (or when dealing with scalar 64-bit data), but otherwise mostly treats everything as individual registers.

-Fabian

A clear and simple benefit is that it may be possible to implement a bunch of sub/super register stuff without loops, in branch-free code. E.g., MCRegisterInfo::getSubReg would reduce to just a few arithmetic instructions.
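
For example, if sub-register indices were themselves encoded as (offset, size) pairs relative to the enclosing register, something like this should suffice (a sketch with made-up names, reusing the helpers from earlier in the thread):

  struct SubRegIdx {
    uint32_t Offset; // first index, relative to the register's start
    uint32_t Size;   // number of indices covered
  };

  // Branch-free getSubReg: shift the sub-interval by the start index.
  constexpr uint32_t getSubReg(uint32_t Reg, SubRegIdx Idx) {
    uint32_t Start = startIdx(Reg) + Idx.Offset;
    return encodeARFReg(arrayID(Reg), Start, Start + Idx.Size - 1);
  }

  // e.g. element 1 of a 4-register tuple starting at index 8:
  //   getSubReg(encodeARFReg(ID, 8, 11), {1, 1}) == encodeARFReg(ID, 9, 9)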

And sure, I'd be happy to rephrase the whole thing as: making register units the fundamental, first-class object that everything else is built on.

Cheers,
Nicolai

Luke, I've been hesitant to reply to you here because I wanted to
write a more helpful explanation than I managed below, but since you
name dropped me and this subthread is getting larger and larger, I'll
just go ahead and say: I don't know what you are talking about
whenever you mention RVV in this context. My opinion after having
constructed most of a prototype backend for RVV is that the existing
ways of representing registers in LLVM work just fine for it and the
pain points that do exist are completely different from the ones this
RFC revolves around.

Please read Fabian's great explanation carefully. In particular note
how AMDGPU instructions use groups of multiple architectural registers
that are numbered consecutively, which is unrelated to how a single
vector register contains multiple scalar values. The former concept
does not exist in RVV (instructions reference 1-3 independent source
registers and write to one destination register), nor do things like
an indirectly addressable register field that one could use for an
alloca'd array. So even an extended RVV encoding with the same number
of architectural vector registers as AMDGPU would not run into most of
the problems that motivate this RFC. Maybe the compile time issues
from the sheer number of registers, as 256 registers is quite a bit
more than the few dozens of registers of most CPU architectures.
However, since RVV registers are isolated from each other instead of
being the leaves in a massive hierarchy of artificial "combined"
registers, even that problem seems different (and likely less severe)
from what AMDGPU faces.

Cheers,
Robin

yes, apologies (not a problem, robin. btw, i really appreciate the
way you worded this, it's very respectful): i worked that out last
night... and, because i'd written 2 messages already i did not want to
say anything further until someone else responded. sorry to take up
your time.

so, apologies! the RFC that you wrote, robin, is not relevant for
combining with this one as far as RVV is concerned [it is however
extremely relevant for SV].

it was only this morning that i suddenly realised that the ARF RFC
actually has more in common with SIMD than with RVV (or with the needs
of any of the other vector engine developers that i privately bcc'd).

in particular, the ARF RFC would be *really* useful to deal with the
horrible-ness of that x86 MMX extension, the one that overloaded the
x87 floating-point register file with SIMD *integer* elements?

in that regard, nicolai (et al at AMD), it may be worthwhile alerting
some colleagues at AMD, if there are any who work on LLVM for amd64, to
see if they are interested in participating?

l.

The new encoding would only support sequences of registers. What would we do if we need differently constructed tuples?

So I'd rather we keep expressing things through register classes and have register classes define how tuples are constructed:
i.e. if you have a 4-tuple class, then you know that register number 4 in this class is the (r4,r5,r6,r7) tuple. In a class of pairs of even-numbered registers, register number 1 would be (r2, r4), etc.

This is more or less how things already work today. The main thing that needs to change IMO is the assumption that we can precompute everything. We need ways to create register classes that can computationally enumerate registers, come up with matching lanemasks, etc. At the same time we need to force ourselves not to make any assumption that we know all registers and classes upfront. That would mean we cannot use MCRegAliasIterator or preallocate a doubly linked list for every register in MachineRegisterInfo. And the latter two are the problem we need to solve IMO: find ways to force ourselves not to rely on completely precomputed register hierarchies, and to stop wasting resources by building lists over all possible tuple combinations.
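
To make that indexing scheme concrete (again a sketch, reusing the hypothetical ARFClass from earlier in the thread):

  // Compute member N of a consecutive-tuple class on demand instead of
  // reading it from a precomputed table.
  constexpr uint32_t tupleMember(const ARFClass &RC, uint32_t N) {
    uint32_t Start = RC.MinStart + N; // member N starts at rN
    return encodeARFReg(RC.ArrayID, Start, Start + RC.Size - 1);
  }
  // Member 4 of a 4-tuple class -> (start 4, end 7), i.e. (r4,r5,r6,r7).
  // A spaced pair like (r2, r4) needs a different rule -- and isn't an
  // interval at all -- which is exactly why the tuple construction has
  // to live in the class rather than in the register encoding.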

The new encoding would only support sequences of registers. What would we do if we need differently constructed tuples?

How realistic is that requirement?

Obviously it's good to be flexible, but there's a real cost in terms of compile time, mental overhead, and possibly also in terms of algorithmic flexibility.

So I'd rather we keep expressing things through register classes and have register classes define how tuples are constructed>
i.e. if you have a 4 tuple class, then you know that register number 4 in this class is the (r4,r5,r6,r7) tuple. In the class of even numbered register pairs, register number 1 would be (r2, r4), etc.

This is more or less how things already work today. The main thing that needs to change IMO is the assumption that we can precompute everything. We need ways to create register classes that can computationally enumerate registers, come up with matching lanemasks, etc. At the same time we need to force ourself to not make any assumption that we know all registers and classes upfront. That would mean we cannot use RegAliasIterator or preallocate a double linked list for every register in MachineRegisterInfo. And the latter two is the problem we need to solve IMO, find ways to force ourself to not rely on completely precomputed register hierarchies and wasting resources by building lists over all possible tuple combinations.

That does sound like a good start.

Cheers,
Nicolai

+1

I’m late to this thread, but I’m very interested in this development and I would be up for helping. In general I feel that whatever change ends up happening, we should keep API changes to a minimum, and the idea Matthias was proposing about avoiding pre-computation seems very promising, because it would basically mean we have the same system with only the generation of the definitions being different.
If anybody interested in this is at the dev meeting next week, we could maybe meet up and talk about the goals for this!

Marcello

The new encoding would only support sequences of registers. What would we do if we need differently constructed tuples?

How realistic is that requirement?

Looking around LLVM it seems to be very rare, but there are the *Spc register classes in ARMRegisterInfo.td

I would also expect that getting rid of the known-register-count and known-list-of-register-classes assumptions will be the biggest challenge/chunk of work in changing the current system.