Subword register allocation

Hi,

I have a question about implementing subword register allocation
problems (see the REFERENCES in the end of this message) on LLVM. I
have algorithms, but don't know the best way to implement them in
LLVM.

I asked similar question before:

  http://lists.cs.uiuc.edu/pipermail/llvmdev/2005-
May/004001.html

Because I still don't have a satisfying solution now, I try to
elaborate it again. Pardon.

All registers are 128-bit. Each register can be divided into four
32-bit subwords. Each subword can be independently read and written. A
symbolic name is given to each subword: x, y, z, w.

  MUL r0.xyz, r1.xyz, r2.xxx
  SUB r0.w, r3,y, r4.z
  ADD r5.xyzw, r0.xyzw, r2.xyzw

MUL defines the three subwords of r0, and SUB defines the rest one.
Note that ADD uses the four subwords defined by the previous two
instructions. The register allocate must be aware of this, otherwise
additional MOV instructions may have to be inserted:

  MUL r0.xyz, r1.xyz, r2.xxx
  SUB r9.x, r3,y, r4.z
  MOV r0.w, r9.x
  ADD r5.xyzw, r0.xyzw, r2.xyzw

In the previous code snippet, when allocating the destination register
for SUB, the register allocator doesn't choose r0.w, and later when
it's found the four subwords are referenced in ADD, an additional MOV
must be inserted to move the subword from r9.x to r9.w.

Even if the subwords in a 128-bit registers are never referenced
together in an instruction, minimizing the number of registers is
always preferred in many architectures to avoid spills.

For example, the code

  ADD r0.xy, r1.xy, r2.xy
  MUL r4.xy, r1.xy, r2.xy

cn be improved to save r4, since the rest of the two subwords, z and w
of r0, are not available:

  ADD r0.xy, r1.xy, r2.xy
  MUL r0.zw, r1.xy, r2.xy

I know some register algorithms [1][2]. My question is how to
implement them in LLVM. I try to avoid making too much changes to
existing LLVM live interval analysis and register allocators. I wish
them could be re-used without modification, and perhaps using some
tricks in the TableGen .td file.

I don't know how to do it, but it may be like what is done in
X86RegisterInfo.td. AL and AH are defined to be the alias of AX by
RegisterGroup. But this method doesn't seem work. I'm not sure, and I
need your comments before implementing this similar techniques because
I have a tight schedule.

REFERENCES

[1] S. Tallam and R. Gupta, "Bitwidth aware global register
allocation", Annual Symposium on Principles of Programming Languages,
pp.85 - 96, 2003.

[2] Bengu Li, Youtao Zhang, and Rajiv Gupta, "Speculative Subword
Register Allocation in Embedded Processors", The 17th International
Workshop on Languages and Compilers
for Parallel Computing, 2004.

I have a question about implementing subword register allocation
problems (see the REFERENCES in the end of this message) on LLVM. I
have algorithms, but don't know the best way to implement them in
LLVM.

I asked similar question before:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2005-
May/004001.html

Because I still don't have a satisfying solution now, I try to
elaborate it again. Pardon.

Ok.

All registers are 128-bit. Each register can be divided into four
32-bit subwords. Each subword can be independently read and written. A
symbolic name is given to each subword: x, y, z, w.

MUL r0.xyz, r1.xyz, r2.xxx
SUB r0.w, r3,y, r4.z
ADD r5.xyzw, r0.xyzw, r2.xyzw

MUL defines the three subwords of r0, and SUB defines the rest one.
Note that ADD uses the four subwords defined by the previous two
instructions.

Understood. This is exactly what the register aliases concept is for. Basically you have "R0" and "R0.x", which alias. Likewise, R0 aliases "R0.y", but "R0.y" does not alias "R0.x".

The register allocate must be aware of this, otherwise
additional MOV instructions may have to be inserted:

MUL r0.xyz, r1.xyz, r2.xxx
SUB r9.x, r3,y, r4.z
MOV r0.w, r9.x
ADD r5.xyzw, r0.xyzw, r2.xyzw

The funny thing about this is that you actually have different opcodes. Consider ADD for example. You actually have four different operations:

ADD r, r, r
ADD rr, rr, rr
ADD rrr, rrr, rrr
ADD rrrr, rrrr, rrrr

My guess is that you'll get the most mileage out of describing these to the target as different operations. An alternative would be to make each operand a reg #/permute mask pair.

Not having worked on an arch like this before, I can't say which way would be the best approach.

In the previous code snippet, when allocating the destination register
for SUB, the register allocator doesn't choose r0.w, and later when
it's found the four subwords are referenced in ADD, an additional MOV
must be inserted to move the subword from r9.x to r9.w.

For this, if you're modeling the registers as wholes, which have pieces modified out of them, you really want to model the instruction as a two address instruction, which reads the register, modifies it, then writes it out. See the 'isTwoAddress' opcodes in the X86 or PPC backend for examples.

Note that you'd probably want a seperate opcode that destroys the whole input (if wxyz are all set), so that the register allocator doesn't think it need to preserve the input in this case.

Even if the subwords in a 128-bit registers are never referenced
together in an instruction, minimizing the number of registers is
always preferred in many architectures to avoid spills.

For example, the code

ADD r0.xy, r1.xy, r2.xy
MUL r4.xy, r1.xy, r2.xy

cn be improved to save r4, since the rest of the two subwords, z and w
of r0, are not available:

ADD r0.xy, r1.xy, r2.xy
MUL r0.zw, r1.xy, r2.xy

Sure, ok. If you modeled your register file as a whole bunch of partial registers, this would come naturally.

I know some register algorithms [1][2]. My question is how to
implement them in LLVM. I try to avoid making too much changes to
existing LLVM live interval analysis and register allocators. I wish
them could be re-used without modification, and perhaps using some
tricks in the TableGen .td file.

The real problem here is that you in some cases you want whole registers, and in some cases you want partial registers. One bit of unimplemented functionality is support for sub-registers. This is something we *will* implement, but I don't know precisely when. The idea is that it will allow specifying constraints on registers.

For example, on X86, we'd say that we have a register class with the 32-bit X registers in it [EAX, ECX, EBX, EDX], and a register class with the low-8-bit counterparts [AL, CL, BL, DL]. We want to say that the second is a designated subreg (say #0) of the first class. This would allow us to use subregisters of virtual registers as operands to instructions, and have the register allocator change them to the appropriate physregs when appropriate.

This partially helps your problem, but not entirely: you still need to decide how to model the instructions in the first place. Also, this support isn't implemented yet. :frowning:

I don't know how to do it, but it may be like what is done in
X86RegisterInfo.td. AL and AH are defined to be the alias of AX by
RegisterGroup.

Yup, this is the simple alias part.

But this method doesn't seem work. I'm not sure, and I
need your comments before implementing this similar techniques because
I have a tight schedule.

Unfortunately, I don't know a magic answer. My advice would be to start with something simple (but suboptimal), and refine it over time.

-Chris

REFERENCES

[1] S. Tallam and R. Gupta, "Bitwidth aware global register
allocation", Annual Symposium on Principles of Programming Languages,
pp.85 - 96, 2003.

[2] Bengu Li, Youtao Zhang, and Rajiv Gupta, "Speculative Subword
Register Allocation in Embedded Processors", The 17th International
Workshop on Languages and Compilers
for Parallel Computing, 2004.

-Chris