Register class proliferation

In the past, I've seen some pushback on the list against adding more register classes. You can see it in the code as well, TargetLowering::getRegClassForInlineAsmConstraint() returns a vector of registers instead of a real register class.

What is the reason we don't like adding register classes? Is it still a valid reason?

The new register allocators, fast and greedy, don't care at all. Their performance is independent of the number of register classes. Linear scan is running some kind of union find on the register classes, but should be fine otherwise.

I noticed that TableGen's asm matcher generator does stuff with register classes. It is possible it should be fixed so it only looks at register classes that are explicitly mentioned by instructions.

I want register classes to be cheap, and I want TableGen to generate some of them automatically. The register allocators use them as constraints, and the coalescer wants to combine constraints. That is not always possible today because the register class representing the desired constraint is missing. If TableGen could fill in the gaps, we would get better coalescing.

/jakob

In the past, I've seen some pushback on the list against adding more register classes. You can see it in the code as well, TargetLowering::getRegClassForInlineAsmConstraint() returns a vector of registers instead of a real register class.

What is the reason we don't like adding register classes? Is it still a valid reason?

As I recall, it's a performance issue, as some of the algorithms involved were non-linear and expensive. I think you've refactored most of that away by now, though.

The new register allocators, fast and greedy, don't care at all. Their performance is independent of the number of register classes. Linear scan is running some kind of union find on the register classes, but should be fine otherwise.

I noticed that TableGen's asm matcher generator does stuff with register classes. It is possible it should be fixed so it only looks at register classes that are explicitly mentioned by instructions.

Having the other classes show up there shouldn't interfere with what's going on (InstAlias printing mechanism uses them). It might make for prettier output if we limit it to explicit user classes, but I don't expect it'll be strictly necessary.

I want register classes to be cheap, and I want TableGen to generate some of them automatically. The register allocators use them as constraints, and the coalescer wants to combine constraints. That is not always possible today because the register class representing the desired constraint is missing. If TableGen could fill in the gaps, we would get better coalescing.

This makes a lot of sense.

-Jim

I can't think of any super-linear algorithms remaining except for getCommonSubClass() which could easily be fixed. I've never seen it show up on a Shark trace, though.

If we give each register class a bit vector of sub-classes, the currently linear isSubClass() / isSuperClass() would become constant time, and getCommonSubClass() could be made linear with CountTrailingZeros(A and B).

Yes, those bit vectors would require N^2 space, but the constant factor is 1 bit, so I don't care. I am not adding that many register classes.

/jakob

We should make it clear that the performance problem is with register aliases, not classes. Register aliasing is a serious problem with the current implementation. The postRA scheduler is *cubic* in the size of the alias list. You dramatically improved compile time with this simple fix:

We should make it clear that the performance problem is with register aliases, not classes.

Yes, those are orthogonal issues.

Register aliasing is a serious problem with the current implementation. The postRA scheduler is *cubic* in the size of the alias list. You dramatically improved compile time with this simple fix:

------------------------------------------------------------------------
r130801 | stoklund | 2011-05-03 15:31:24 -0700 (Tue, 03 May 2011) | 10 lines
Mark ultra-super-registers QQQQ as call-clobbered instead of the D sub-registers.

Completely unintended, I might add.

Our representation of call clobbered registers is bad enough that it deserves special attention. Every call instruction uses about 1500 bytes for the clobber list, and there are performance issues as well.

We also need to address the problem with register aliases. Currently, we don't model register sequence constraints properly for ARM NEON registers. I want to expand the current model of using QQ and QQQQ registers, but it would create more register aliases by adding pseudo super-registers.

The RegisterTuples I added to TableGen recently are meant for this, but I don't want to add them to the ARM target until we have compile time under control.

And the new regalloc is not yet designed to scale with the number of register aliases. (But that could be fixed under some limited definition of aliasing).

I think I can make it work while supporting the current aliasing model which consists of sub/super registers and arbitrary pair aliasing.

The idea is to divide registers into atoms such that two registers alias iff they share an atom. In a normal architecture with sub-register structure, the atoms are simply the smallest sub-registers. With arbitrary aliasing, atoms may not correspond to real registers, but they can still be computed.

On x86, for example, the %rax, %eax, and %ax registers all consist of the atoms (%ah, %al).

The register allocators will check interference using atoms instead of aliases. That will be faster since every register has fewer atoms than aliases. It also scales well when adding support for register sequence constraints since new super-registers don't add any atoms.

In the greedy and basic allocators, it means that we will have one LiveIntervalUnion per atom instead of one per physical register as we do it today.

Whereas, I don't see a reason that proliferating register classes would be a problem now or in the future.

Is it easy to distinguish a a couple classes that make up a disjoint coverage set (e.g. int regs, float regs, ...)? Some algorithms partition the problem this way.

Initially, I just want the set of register classes to be closed under intersections and sub-register operations. That's what the coalescer needs. It is about having getCommonSubClass and getMatchingSuperRegClass return maximal results. Today, they sometimes return NULL or a class that is smaller than necessary.

That doesn't give you a disjoint covering, you would need to close under set differences as well to get that. We could do that, but I think such a covering would be too fine-grained to be useful. The classes would be much smaller than 'int regs' and 'float regs'.

The x86 GPRs would be divided into nearly single-register classes. More regular architectures would do better.

/jakob

That's one way to make an algorithm scale well--just make the best case expensive enough. Actually, I like this approach very much, I just wonder if we need to 8x liveintervals for pure 64-bit code.

-Andy

8x is not necessary, atoms are not bytes or bits.

The x86 integer registers would each map to two atoms, except for the 8-bit registers which map to a single atom.

Today, we have live interval unions for %rax, %eax, %ax, %ah, and %al. Using atoms, we would only have live interval unions for %ah and %al. This probably makes interference checking cheaper than it is today.

When a virtual register is assigned to a physical register, we only need to insert it into one live interval union. With atoms, the virtual register live range would be inserted into multiple live interval unions - one per atom. On x86, that means assignment will be twice as expensive as it is today.

We perform many more interference checks than we do register assignments, so I think it will be a good tradeoff.

The x86 reg --> atoms map would look like this:

%rax --> {1, 2}
%eax --> {1, 2}
%ax --> {1, 2}
%ah --> {1}
%al --> {2}

%rbx --> {3, 4}
...

Registers overlap iff they share an atom, that's all you need. X86 would get 60% less live interval unions.

/jakob

Awesome. It now occurs to me that situations where this would add much overhead in live intervals will probably never be a compile-time concern.

For sane register architectures, it means that the size of the AliasSet would now effectively be 1. And we have several algorithms in codegen where the number of operations performed is multiplied by the size of the AliasSet.

-Andy