Was it the subreg lane masks / mapping that was added to address the missed coalescing?
Yes, and the TRI::getCommonSuperRegClass() function.
This solution is nice, but I don't think it'll work for me. I have 8-element vector registers that can be grouped into virtual super regs for bulk save/restore. As soon as I have more than 4 in a tuple, the unsigned int used to hold the lane masks overflows and switches over to the "bit 31 set == lanes unresolvable" mode, and coalescing fails.
What about moving the lane masks to a BitVector, which wouldn't need to be constrained artificially? Would that be too much of a performance impact?
Yes, in particular that would impose a cost on all the targets that don’t need this feature.
I didn’t expect any targets to need more than 31 bits for lane masks. Usually, ARM and x86 together span the envelope of insanity.
We can bump it to 64 bits if you like. It should be done with an MCLaneMask typedef à la MCPhysReg.
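For illustration only, a widened lane mask might look roughly like the sketch below. The MCLaneMask typedef is the name proposed above and is hypothetical here, as are the helpers makeLaneMask and lanesOverlap; none of them are existing LLVM API. The sketch also assumes that the top bit keeps the "lanes unresolvable" meaning that bit 31 has in the 32-bit scheme.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical typedef in the spirit of MCPhysReg; not an existing LLVM name.
typedef uint64_t MCLaneMask;

// Assumption: the top bit keeps the "lanes unresolvable" meaning that
// bit 31 has in the 32-bit scheme.
const unsigned LaneMaskBits = 64;
const MCLaneMask UnresolvableLanes = MCLaneMask(1) << (LaneMaskBits - 1);

// Build a mask covering NumLanes consecutive lanes starting at FirstLane.
// Returns UnresolvableLanes when the range does not fit below the top bit.
inline MCLaneMask makeLaneMask(unsigned FirstLane, unsigned NumLanes) {
  if (NumLanes == 0)
    return 0;
  if (FirstLane + NumLanes > LaneMaskBits - 1)
    return UnresolvableLanes;
  // NumLanes is at most 63 here, so the shift is well defined.
  MCLaneMask Ones = (MCLaneMask(1) << NumLanes) - 1;
  return Ones << FirstLane;
}

// Conservative overlap test: an unresolvable mask overlaps everything.
inline bool lanesOverlap(MCLaneMask A, MCLaneMask B) {
  if ((A | B) & UnresolvableLanes)
    return true;
  return (A & B) != 0;
}
```

With 8 lanes per register, this layout leaves room for seven registers in a tuple before the mask falls back to the unresolvable mode. Targets that fit in 32 bits would pay only for the wider integer, not for a heap-allocated BitVector.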
I'd be open to any thoughts/suggestions. I studied the ARM s_sub/d_sub/q_sub structure, but that fits within the 32-bit lane mask. I also thought that LDM/STM would be similar, but there the registers are physically enumerated, which is different from the virtual super reg frames I'm trying to construct.
Yes, LDM/STM are too complex to model in the register allocator, so they are handled by a post pass. You may need to do something similar.
It’s also worth noting that RAGreedy is not a full-blown 2D register allocator. It doesn’t track liveness per lane, so if the individual lanes in a virtual register have significantly different liveness, registers can go to waste.
I am thinking about fixing this by having a bimodal liveness representation. Most virtual registers are represented by a single LiveInterval, but when needed a vector could be switched to a representation where each lane has its own LiveInterval.
It’s a nontrivial project, but it would make the lane masks unnecessary, and it would take care of the wasted registers.
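As a rough sketch of the bimodal idea, assuming nothing about how it would actually be implemented in LLVM: the common case stores one interval for the whole virtual register, and a register switches to one interval per lane only on request. Segment, SimpleInterval, and BimodalLiveness are hypothetical, heavily simplified stand-ins and not the real LiveInterval classes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Half-open range [Start, End) of instruction slots (simplified).
struct Segment {
  unsigned Start, End;
};

// Simplified, hypothetical stand-in for a live interval.
struct SimpleInterval {
  std::vector<Segment> Segments;

  bool liveAt(unsigned Slot) const {
    for (const Segment &S : Segments)
      if (S.Start <= Slot && Slot < S.End)
        return true;
    return false;
  }
};

// Hypothetical bimodal container: one interval in the common mode,
// one interval per lane after splitIntoLanes() has been called.
class BimodalLiveness {
  SimpleInterval Whole;
  std::vector<SimpleInterval> PerLane; // empty in the common mode

public:
  explicit BimodalLiveness(SimpleInterval LI) : Whole(std::move(LI)) {}

  bool isPerLane() const { return !PerLane.empty(); }

  // Switch to per-lane mode. Each lane starts as a copy of the whole
  // interval and can then be refined independently.
  void splitIntoLanes(unsigned NumLanes) {
    if (isPerLane() || NumLanes == 0)
      return;
    PerLane.assign(NumLanes, Whole);
  }

  // Precondition: isPerLane() and Lane is in range (at() throws otherwise).
  SimpleInterval &lane(unsigned Lane) { return PerLane.at(Lane); }

  // In the common mode every lane shares the liveness of the whole register.
  bool isLiveAt(unsigned Lane, unsigned Slot) const {
    return isPerLane() ? PerLane.at(Lane).liveAt(Slot) : Whole.liveAt(Slot);
  }
};
```

Once a register is in per-lane mode, a lane whose interval ends early no longer appears live, so an allocator built on such a structure could in principle reuse that part of the physical register.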