After some experiments, I can get what I need in my small unit test. I am also trying an unusual design to get rid of RegisterTuples.
I define scalar registers only. For instructions that write one or more scalars into consecutive registers, I create implicitly defined scalar registers and set register allocation hints on the original destination register and on the newly created implicit-def registers. For example:
Thank you for pointing out Arm’s patch! I can see that in the patch Arm used RegisterTuples to define GPR64x8Class for LD64B/ST64B. This is something I want to avoid, because my target has huge register files (1K to 4K) and some of the instructions can take from 1 to more than 10 32-bit registers, so I think RegisterTuples might not suit my needs. It looks like register allocation hints can satisfy my requirements for now, but I need to test with larger kernel sources to see whether there are any hidden issues I am not aware of.
Register allocation hints may not always work in your case.
Suppose you have:
%1 = def
%2 = def
%3 = def
%4 = use %2, %1
%5 = use %3, %1
If the register allocator first allocates %2 and %3 to different registers, then whichever hint you suggest for %1, it can’t be fulfilled.
I think the best thing you can do with the current infrastructure is to implement a post-RA pass that looks for instructions working with consecutive registers and replaces them with a single instruction working with a register tuple. This is how ARMLoadStoreOptimizer works, AFAIK (it still relies on hints somehow).
ExtraSrcRegAllocReq and ExtraDefRegAllocReq forbid post-RA passes from renaming registers (source and destination, respectively). You will need them for instructions that work with consecutive registers, so that the constraint is not broken by, e.g., MachineCopyPropagation.
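To make the post-RA idea concrete, here is a toy sketch in Python. It is not how ARMLoadStoreOptimizer is implemented and it uses no LLVM API; the `(opcode, reg)` instruction encoding and the `merge_consecutive` helper are hypothetical, chosen only to show the shape of the scan. It assumes registers are already assigned and merges adjacent instructions with the same opcode whose register numbers are consecutive.

```python
def merge_consecutive(instrs):
    """Merge runs of adjacent same-opcode instructions on consecutive registers.

    instrs is a list of (opcode, physical_register_number) pairs, a
    hypothetical stand-in for machine instructions after allocation.
    Returns a list of (opcode, [registers]) pairs, where a list with more
    than one register stands for a single tuple instruction.
    """
    merged = []
    for opcode, reg in instrs:
        last = merged[-1] if merged else None
        # Extend the current run only if the opcode matches and the
        # register immediately follows the last register in the run.
        if last is not None and last[0] == opcode and last[1][-1] + 1 == reg:
            last[1].append(reg)
        else:
            merged.append((opcode, [reg]))
    return merged


example = [("load", 4), ("load", 5), ("load", 6), ("add", 1), ("load", 9)]
result = merge_consecutive(example)
```

In this example the three loads into registers 4, 5 and 6 collapse into one entry, while the load into register 9 stays separate. A real pass would presumably have to check much more (memory offsets, intervening instructions, liveness), and whether a run forms at all still depends on what the allocator happened to choose.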
The idea with implicit defs sounds good. See also variadicOpsAreDefs in Target.td.
Please take my comments with a grain of salt, I’m as new to the register allocator as you are.
I meant that (%2, %1) and (%3, %1) should be allocated to the same consecutive registers, e.g. (r1, r2), because they share one register (%1). If %2 and %3 are allocated to different registers, say r10 and r20, then %1 cannot be allocated to both r11 and r21 simultaneously. It theoretically could be done by splitting the live range of %1, but that’s a different story…
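The conflict can be checked with a few lines of Python. This is only a toy model of the constraint, not anything the allocator does: the `demanded_registers` helper and the rule "the second register of a pair must be the first plus one" are assumptions taken from the (r10, r11) / (r20, r21) example above, and interference between live ranges is ignored.

```python
def demanded_registers(pairs, assignment):
    """Collect the registers each second pair member would have to occupy.

    pairs: list of (first_vreg, second_vreg) that must land in consecutive
           physical registers (second = first + 1 in this toy model).
    assignment: physical register numbers already chosen for the first vregs.
    """
    demands = {}
    for first, second in pairs:
        demands.setdefault(second, set()).add(assignment[first] + 1)
    return demands


pairs = [("%2", "%1"), ("%3", "%1")]
demands = demanded_registers(pairs, {"%2": 10, "%3": 20})
# A vreg with more than one demanded register cannot satisfy every hint
# without splitting its live range.
unsatisfiable = sorted(v for v, regs in demands.items() if len(regs) > 1)
```

Here `demands` maps %1 to both 11 and 21, so %1 ends up in `unsatisfiable`: no single hint for %1 can serve both uses once %2 and %3 have been placed apart.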
It is true (to some extent), if all live ranges are within one basic block. If some of them span multiple basic blocks, those take priority. There are a few other heuristics; please see RAGreedy::enqueue(PQueue &, LiveInterval *) for details.
I think investigating ARMLoadStoreOptimizer is the right direction. Good luck with it!