How to allocate consecutive registers? (Array register files?)

Hi,

My target has similar requirements to the AMDGPU backend.

My target has several register types, and some of the machine instructions can take from 1 to 10+ consecutive registers of all register types.

I think llvm RegisterTuples does not fit my needs, because:

  1. With RegisterTuples, it looks like I have to crazily define all possible register sequences, for example:
    def Reg32: ... // r0,       r1,      r2, ..., r1023
    def Reg64: ... // r0_r1,    r1_r2,       ...
    def Reg96: ... // r0_r1_r2, r1_r2_r3,    ...
  1. Some of the register types are very large, over 1024 registers, I don’t want to generate all of the register sequences for these register types!

I found two other posts:
(a) how to allocate consecutive register?
(b) preferring consecutive registers in allocater

But I can’t figure out how ARM’s LDRD/STRD force to use consecutive registers. Could someone please give me some hints!

I guess getRegAllocationHints may not be suitable because I need to force regalloc to use consecutive registers for some instructions.

I noticed there are two instruction flags:

  ExtraSrcRegAllocReq,
  ExtraDefRegAllocReq,

I am not sure how I could use these flags, because it looks like these flags are used to prevent some post register allocation optimizations.

Thanks :slight_smile:
CY

Digging a bit deeper into getRegAllocationHints, AllocationOrder, and RAGreedy, it looks like I should be using RegAllocationHints mechanism.

If getRegAllocationHints returns true, RAGreedy will ‘force’ to allocate the designated physical register even spilling.

I may need a pass to break down a single instruction to multiple instructions if the required consecutive registers are not available to avoid expensive spilling.

(But it looks like the AMDGPU backend doesn’t use RegAllocationHints)

After some experiments, I can achieve my needs in my small unit test. I am also trying a strange design to get rid of RegisterTuples?!

I try to define scalar registers only. For instructions which generate one more scalars in consecutive registers, I will create implicit defined scalar registers, and set register allocation hints on original destination register and newly created implicit-defined registers. For example:

$dest:Reg128 = TOY_OP %a:Reg32
=> $dest:Reg32 = TOY_OP %a:Reg32
                        implicit-def %dest1:Reg32, implicit-def %dest2:Reg32, 
                        implicit-def %dest3:Reg32
=> setRegAllocationHint for $dest, $dest1, $dest2, $dest3

TOY_OP.ExtraDefRegAllocReq = true

In

setRegAllocationHint(Register VReg, unsigned Type, Register PrefReg)

I use Type and PrefReg to encode start base register and number/index of consecutive registers, and override getRegAllocationHints to parse my hint scheme.

It appears to be working fine now, but I don’t know if this approach makes sense, or if there are any potential issues I’m not aware of.

Any comments are welcome!
Thanks :slight_smile:
CY

https://reviews.llvm.org/D91775
https://reviews.llvm.org/D91774

Maybe these two diffs help you. Arm had similar constraints.

1 Like

Hey tschuett,

Thank you for pointing out Arm’s patch! I can see in the patch Arm was using RegisterTuples to define GPR64x8Class for LD64B/ST64B, this is something I want to avoid :stuck_out_tongue: , because my target has huge register files (1K to 4K), and some of the instructions can take 1 to 10+ 32-bit registers, so I think RegisterTuples might not be suitable to my needs! It looks like register allocation hint can satisfy my requirements for now, but I need to test with larger kernel sources to see if there is any hidden issue I am not aware of.

Thank you :slight_smile:
CY

Register allocation hints may not always work in your case.

Suppose you have:

%1 = def
%2 = def
%3 = def
%4 = use %2, %1
%5 = use %3, %1

If the register allocator first allocates %2 and %3 to different registers, whichever hint you suggest for %1 it can’t be fullfilled.
I think the best thing you can do with the current infrastructure is to implement a post-RA pass that looks for instructions working with consecutive registers and replaces them with a single instruction working with register tuple. This is how ARMLoadStoreOptimizer works, AFAIK (it still relies on hints somehow).

ExtraSrcRegAllocReq and ExtraDefRegAllocReq forbid post-RA passes to rename registers (source and destination, respectively). You will need them for instructions which works with consecutive registers, so that the constraint is not broken by, e.g., MachineCopyPropagation.

The idea with implicit defs sounds good. See also variadicOpsAreDefs in Target.td.

PS
Please take my comments with a grain of salt, I’m as new to the register allocator as you are.

1 Like

Hi Sergei,

Thanks for your comments and suggestions! I didn’t dig into ARMLoadStoreOptimizer’s implementation. Thank you for the explanation! Now I have a bit more understanding of ARMLoadStoreOptimizer.

ARM uses ARMPreAllocLoadStoreOpt (PreAlloc) to “move load / stores from consecutive locations close” and uses ARMLoadStoreOpt (PostAlloc) to “combine load / store instructions to form ldm / stm”.

I can see what you mean, I guess in this case, I may use a pseudo instruction to express that I want to create consecutive registers for %1 to %3. For example:

%1, %2, %3 = AllocateRegisterArrayDword 

To the best of my knowledge, llvm allocates register in Top-Down direction, so AllocateRegisterArrayDword is allocated before

%4 = use %2, %1
%5 = use %3, %1

Cheers :smiley:
CY

I meant that (%2, %1) and (%3, %1) should be allocated to the same consecutive registers, e.g. (r1, r2), because they share one register (%1). If %2 and %3 allocated to different registers, say r10 and r20, then %1 cannot be allocated to both r11 and r21 simultaneously :slight_smile: It theoretically could be done by splitting the live range of %1, but that’s a different story…

It is true (to some extent), if all live ranges are within one basic block. If some of them span multiple basic blocks, those take priority. There are a few other heuristics; please see RAGreedy::enqueue(PQueue &, LiveInterval *) for details.

I think investigating ARMLoadStoreOptimizer is the right direction, good luck with it :wink: