We should make it clear that the performance problem is with register aliases, not classes.
Yes, those are orthogonal issues.
Register aliasing is a serious problem with the current implementation. The postRA scheduler is *cubic* in the size of the alias list. You dramatically improved compile time with this simple fix:
r130801 | stoklund | 2011-05-03 15:31:24 -0700 (Tue, 03 May 2011) | 10 lines
Mark ultra-super-registers QQQQ as call-clobbered instead of the D sub-registers.
Completely unintended, I might add.
Our representation of call clobbered registers is bad enough that it deserves special attention. Every call instruction uses about 1500 bytes for the clobber list, and there are performance issues as well.
We also need to address the problem with register aliases. Currently, we don't model register sequence constraints properly for ARM NEON registers. I want to expand the current model of using QQ and QQQQ registers, but it would create more register aliases by adding pseudo super-registers.
The RegisterTuples I added to TableGen recently are meant for this, but I don't want to add them to the ARM target until we have compile time under control.
And the new regalloc is not yet designed to scale with the number of register aliases. (But that could be fixed under some limited definition of aliasing).
I think I can make it work while supporting the current aliasing model which consists of sub/super registers and arbitrary pair aliasing.
The idea is to divide registers into atoms such that two registers alias iff they share an atom. In a normal architecture with sub-register structure, the atoms are simply the smallest sub-registers. With arbitrary aliasing, atoms may not correspond to real registers, but they can still be computed.
On x86, for example, the %rax, %eax, and %ax registers all consist of the atoms (%ah, %al).
The register allocators will check interference using atoms instead of aliases. That will be faster since every register has fewer atoms than aliases. It also scales well when adding support for register sequence constraints since new super-registers don't add any atoms.
In the greedy and basic allocators, it means that we will have one LiveIntervalUnion per atom instead of one per physical register as we do it today.
Whereas, I don't see a reason that proliferating register classes would be a problem now or in the future.
Is it easy to distinguish a a couple classes that make up a disjoint coverage set (e.g. int regs, float regs, ...)? Some algorithms partition the problem this way.
Initially, I just want the set of register classes to be closed under intersections and sub-register operations. That's what the coalescer needs. It is about having getCommonSubClass and getMatchingSuperRegClass return maximal results. Today, they sometimes return NULL or a class that is smaller than necessary.
That doesn't give you a disjoint covering, you would need to close under set differences as well to get that. We could do that, but I think such a covering would be too fine-grained to be useful. The classes would be much smaller than 'int regs' and 'float regs'.
The x86 GPRs would be divided into nearly single-register classes. More regular architectures would do better.