Digging a bit deeper into getRegAllocationHints, AllocationOrder, and RAGreedy, it looks like I should be using RegAllocationHints mechanism.
If getRegAllocationHints returns true, RAGreedy will ‘force’ to allocate the designated physical register even spilling.
I may need a pass to break down a single instruction to multiple instructions if the required consecutive registers are not available to avoid expensive spilling.
(But it looks like the AMDGPU backend doesn’t use RegAllocationHints)
After some experiments, I can achieve my needs in my small unit test. I am also trying a strange design to get rid of RegisterTuples?!
I try to define scalar registers only. For instructions which generate one more scalars in consecutive registers, I will create implicit defined scalar registers, and set register allocation hints on original destination register and newly created implicit-defined registers. For example:
I use Type and PrefReg to encode start base register and number/index of consecutive registers, and override getRegAllocationHints to parse my hint scheme.
It appears to be working fine now, but I don’t know if this approach makes sense, or if there are any potential issues I’m not aware of.
Thank you for pointing out Arm’s patch! I can see in the patch Arm was using RegisterTuples to define GPR64x8Class for LD64B/ST64B, this is something I want to avoid , because my target has huge register files (1K to 4K), and some of the instructions can take 1 to 10+ 32-bit registers, so I think RegisterTuples might not be suitable to my needs! It looks like register allocation hint can satisfy my requirements for now, but I need to test with larger kernel sources to see if there is any hidden issue I am not aware of.
Register allocation hints may not always work in your case.
Suppose you have:
%1 = def
%2 = def
%3 = def
%4 = use %2, %1
%5 = use %3, %1
If the register allocator first allocates %2 and %3 to different registers, whichever hint you suggest for %1 it can’t be fullfilled.
I think the best thing you can do with the current infrastructure is to implement a post-RA pass that looks for instructions working with consecutive registers and replaces them with a single instruction working with register tuple. This is how ARMLoadStoreOptimizer works, AFAIK (it still relies on hints somehow).
ExtraSrcRegAllocReq and ExtraDefRegAllocReq forbid post-RA passes to rename registers (source and destination, respectively). You will need them for instructions which works with consecutive registers, so that the constraint is not broken by, e.g., MachineCopyPropagation.
The idea with implicit defs sounds good. See also variadicOpsAreDefs in Target.td.
PS
Please take my comments with a grain of salt, I’m as new to the register allocator as you are.
Thanks for your comments and suggestions! I didn’t dig into ARMLoadStoreOptimizer’s implementation. Thank you for the explanation! Now I have a bit more understanding of ARMLoadStoreOptimizer.
ARM uses ARMPreAllocLoadStoreOpt (PreAlloc) to “move load / stores from consecutive locations close” and uses ARMLoadStoreOpt (PostAlloc) to “combine load / store instructions to form ldm / stm”.
I can see what you mean, I guess in this case, I may use a pseudo instruction to express that I want to create consecutive registers for %1 to %3. For example:
%1, %2, %3 = AllocateRegisterArrayDword
To the best of my knowledge, llvm allocates register in Top-Down direction, so AllocateRegisterArrayDword is allocated before
I meant that (%2, %1) and (%3, %1) should be allocated to the same consecutive registers, e.g. (r1, r2), because they share one register (%1). If %2 and %3 allocated to different registers, say r10 and r20, then %1 cannot be allocated to both r11 and r21 simultaneously It theoretically could be done by splitting the live range of %1, but that’s a different story…
It is true (to some extent), if all live ranges are within one basic block. If some of them span multiple basic blocks, those take priority. There are a few other heuristics; please see RAGreedy::enqueue(PQueue &, LiveInterval *) for details.
I think investigating ARMLoadStoreOptimizer is the right direction, good luck with it