Reg units for unaddressable register parts?

On X86, the registers AX, EAX and RAX all share the exact same register units. In terms of units, there is no difference between these registers. This makes register units insufficient to track liveness, since live AX does not imply live EAX.

Would it make sense to have register units (and lane masks) for the parts of registers that are not individually addressable?

-Krzysztof

With this arrangement, each register would be covered by its units.

-Krzysztof

On X86, the registers AX, EAX and RAX all share the exact same register units. In terms of units, there is no difference between these registers. This makes register units insufficient to track liveness, since live AX does not imply live EAX.

That is exactly the intent.
If AX is live, you don’t want another value to use EAX or RAX.

I'm not sure what value you are referring to.

I have this situation in mind:
   RAX = ... (1)
   EAX = ... (2)
   ... = RAX (3)
There doesn't seem to be a way to determine whether (1) is live based on lane masks, and to distinguish it from
   EAX = ... (1)
   RAX = ... (2)
   ... = RAX (3)
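
A toy model of the ambiguity, as a sketch only: this is not LLVM code, the unit names and the `reaching_defs` helper are made up for illustration, and the only thing taken from the discussion is that AX, EAX and RAX map to the same set of units. Judged by units alone, the two sequences give the same answer.

```python
# Toy model: register units as sets of names (hypothetical, not LLVM's tables).
UNITS = {
    "AL": {"AL"},
    "AH": {"AH"},
    "AX": {"AL", "AH"},
    "EAX": {"AL", "AH"},  # no unit for bits 16..31
    "RAX": {"AL", "AH"},  # no unit for bits 32..63
}


def reaching_defs(defs, use):
    """Return the labels of defs that reach `use`, judged by units alone.

    `defs` is a list of (label, register) pairs in program order.
    """
    needed = set(UNITS[use])
    reaching = []
    for label, reg in reversed(defs):
        if needed & UNITS[reg]:
            reaching.append(label)
            needed -= UNITS[reg]
        if not needed:
            break
    return sorted(reaching)


seq_a = [(1, "RAX"), (2, "EAX")]  # RAX = ...; EAX = ...; ... = RAX
seq_b = [(1, "EAX"), (2, "RAX")]  # EAX = ...; RAX = ...; ... = RAX

print(reaching_defs(seq_a, "RAX"))
print(reaching_defs(seq_b, "RAX"))
```

In both cases the later def covers every unit of RAX, so the walk stops there and never considers (1).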

-Krzysztof

You can track liveness, but you may have to make conservative assumptions, like the whole RAX register being live even though it may only be AL+AH in reality. Usually that is good enough (it is for the register allocator!). There are very few cases where the difference matters. The only ones I can think of right now are ABI corner cases (see below [1]) and X86FixupBWInsts. In general, however, we should be careful not to produce unnecessary/extra regunits; that will just increase memory consumption and slow down the compiler. Having said that, I wouldn't mind producing enough extra regunits to be able to express clobber masks in register units. Generally, producing regunits for all unaddressable parts is not a good idea IMO.

[1] ABIs:
x86 / win64: the calling convention preserves a bunch of XMM registers but not the corresponding YMM registers => the lower 128 bits are saved, the upper 128 bits of the YMM register are clobbered.
AArch64: preserves the low 64 bits of several floating-point registers (Dxx) but not the upper parts (Qxx).
This forces us to express the clobbered register masks in terms of physreg numbers and not register units, which is unfortunate as we are usually better off expressing liveness in register units.
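
A small sketch of why a unit-based clobber mask cannot express the win64 case today. Everything here is a hand-written toy: the unit name `U_XMM6` and both helper functions are hypothetical, and the model assumes XMM6 and YMM6 share a single unit because the upper half of YMM6 is not separately addressable.

```python
# Toy model (hypothetical names): XMM6 and YMM6 share one register unit.
UNITS = {"XMM6": {"U_XMM6"}, "YMM6": {"U_XMM6"}}

# What a physreg-based mask can say: the call preserves XMM6 but not YMM6.
CLOBBERED_REGS = {"YMM6"}


def clobbered_by_regs(reg):
    """Query the per-register mask."""
    return reg in CLOBBERED_REGS


def clobbered_by_units(reg, clobbered_units):
    """Query a unit-based mask: clobbered if any unit of `reg` is clobbered."""
    return bool(UNITS[reg] & clobbered_units)


# Translating the register mask into units loses the distinction.
clobbered_units = set().union(*(UNITS[r] for r in CLOBBERED_REGS))

print(clobbered_by_regs("XMM6"))                     # per-register answer
print(clobbered_by_units("XMM6", clobbered_units))   # per-unit answer
```

With an extra unit for the upper 128 bits, the unit-based query could give the same answer as the per-register one.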

- Matthias

On X86, the registers AX, EAX and RAX all share the exact same register units. In terms of units, there is no difference between these registers. This makes register units insufficient to track liveness, since live AX does not imply live EAX.

That is exactly the intent.
If AX is live, you don’t want another value to use EAX or RAX.

I'm not sure what value you are referring to.

I mean if AX holds some value, we do not want RAX to hold something else; otherwise you will clobber AX.

I have this situation in mind:
RAX = ... (1)
EAX = ... (2)
... = RAX (3)
There doesn't seem to be a way to determine whether (1) is live based on lane masks, and to distinguish it from
EAX = ... (1)
RAX = ... (2)
... = RAX (3)

RegUnits are more for availability checks (is this register free or occupied?) than for proper liveness.
There never was an intent to capture precise liveness per se. I guess you are looking at them in the context of what I said regarding live-in sets. In that context they are ideal because you have a conservative representation of what is available/not available in terms of registers with a compact representation.
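
To make the "availability" reading concrete, here is a minimal sketch, assuming a hand-written unit table; `is_available` and the unit names are illustrative and not an LLVM API. A register is free only if none of its units is occupied, which is conservative by construction.

```python
# Toy unit table (hypothetical): AX, EAX and RAX share the same units.
UNITS = {
    "AL": {"AL"},
    "AH": {"AH"},
    "AX": {"AL", "AH"},
    "EAX": {"AL", "AH"},
    "RAX": {"AL", "AH"},
    "BL": {"BL"},
}


def is_available(reg, occupied_units):
    """A register is available if none of its units is currently occupied."""
    return not (UNITS[reg] & occupied_units)


occupied = set(UNITS["AX"])  # AX holds a value

print(is_available("RAX", occupied))  # RAX overlaps AX
print(is_available("BL", occupied))   # BL does not
```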

Having some reg units for unaddressable register parts may make sense, but generally speaking we have preferred to avoid it because we would have to filter them out for most analyses that deal with RegUnits.

The cases where it could make sense to use unaddressable register units are:
1. If we want to switch RegMasks to RegUnit (what Matthias explained)
2. If we want to track precise liveness for physical registers
3. If we want to fix the register pressure sets for x86 (bug 23423, "[TableGen] Register Pressure information is wrong for x86")

#1 is a cleanup but not really in the way of anything useful.
#2 is not a problem IMO since most of our work with liveness happens on unallocated code.
#3 would be a nice fix, but the overall benefit does not seem worth the infrastructure needed to fix it.

Cheers,
Q.

This is what I'm working on (RDF). I generate a data-flow graph for physical registers, and I need to be able to accurately connect defs to uses.
Currently it has target-specific hooks to determine covering, and the only target hook for now is for Hexagon. The generic code is not very precise and using lane masks would
(1) simplify some parts of the code quite a bit,
(2) make it work better for other targets.

There are post-RA optimizations that this would enable, at least for Hexagon. We already have 1 specific consumer, aside from some simple copy propagation/dce, and there will likely be more.

So far it's been developed on Hexagon (and is under lib/Target/Hexagon). Vivek Pandya offered to do some work to make it available for all targets.

-Krzysztof

Thanks for the context Krzysztof.

I agree with Matthias that having a few more unaddressable register units may be useful, but we don’t want to be exhaustive as it will be bad for performance.

Also, the API is probably not great but have you tried to use MCRegisterInfo::getSubRegIdxSize, MCRegisterInfo::getSubRegIdxOffset & co. for your problem?
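
As a rough sketch of what offset/size information can and cannot answer: the table below is hand-written (it mirrors the offsets and sizes of the x86 subregister indices, but it is not read from LLVM), and `bit_mask` and `uncovered_bits` are hypothetical helpers. It yields bit positions only and says nothing about whether a given instruction preserves the uncovered bits.

```python
# (offset, size) in bits for each subregister index; hand-written toy table.
SUBREG_IDX = {
    "sub_8bit": (0, 8),
    "sub_8bit_hi": (8, 8),
    "sub_16bit": (0, 16),
    "sub_32bit": (0, 32),
}


def bit_mask(idx):
    """Bit mask covered by subregister index `idx` within its super-register."""
    offset, size = SUBREG_IDX[idx]
    return ((1 << size) - 1) << offset


def uncovered_bits(super_width, idx):
    """Bits of a `super_width`-bit register not covered by subregister `idx`."""
    return ((1 << super_width) - 1) & ~bit_mask(idx)


print(hex(uncovered_bits(32, "sub_16bit")))  # part of EAX outside AX
print(hex(uncovered_bits(64, "sub_32bit")))  # part of RAX outside EAX
```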

Out of curiosity, could you describe why it is useful to have such precision in the liveness tracking?

I am not sure I see any use case, especially because I would not rely on the semantics we have for the target instructions.
E.g.,
RAX = …
EAX = … <— Does this definition clobber the high part of RAX?

Indeed, we do not necessarily describe the exact semantics of an instruction. For instance, on x86 it is probably right to assume most instructions do not touch the high bits, but on AArch64 it is the opposite.

What I am saying is that even if we had the infrastructure for the unaddressable reg units, we would probably need a lot of work to be able to use it.

The bottom line is I would like to see target independent use cases that would make such investment worth it and so far I haven’t seen that.

Side question: have you checked how the scheduler checks dependencies in post-RA mode? I wonder if it is already possible to build the information you want from the existing APIs.

Cheers,
-Quentin

Short reply right now, regarding this:

Thanks, that didn't occur to me. According to the Intel documentation, moving into a 32-bit register zero-extends the value to 64 bits. So in the example above, EAX=... would indeed overwrite the high half of RAX.

It seems that targeting any part of the lowest 16 bits (or all of them) preserves the rest, while changing the second-lowest 16 bits (16..31) does affect bits 32..63.
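
The behaviour described above can be sketched as a tiny simulation of writes into RAX and its subregisters in 64-bit mode. `write_sub` is a made-up helper for illustration; it models only the general rule (8- and 16-bit writes preserve the remaining bits, 32-bit writes zero bits 32..63) and no particular instruction.

```python
MASK64 = (1 << 64) - 1


def write_sub(rax, sub, value):
    """Return the new 64-bit RAX value after writing `value` to `sub`."""
    if sub == "AL":
        return ((rax & ~0xFF) | (value & 0xFF)) & MASK64
    if sub == "AH":
        return ((rax & ~0xFF00) | ((value & 0xFF) << 8)) & MASK64
    if sub == "AX":
        return ((rax & ~0xFFFF) | (value & 0xFFFF)) & MASK64
    if sub == "EAX":
        return value & 0xFFFFFFFF  # bits 32..63 are zeroed
    if sub == "RAX":
        return value & MASK64
    raise ValueError(f"unknown subregister: {sub}")


old = 0x1122334455667788
print(hex(write_sub(old, "AX", 0xABCD)))       # upper 48 bits preserved
print(hex(write_sub(old, "EAX", 0xDEADBEEF)))  # upper 32 bits cleared
```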

-Krzysztof

Out of curiosity, could you describe why it is useful to have such precision in the liveness tracking?

RDF is meant to allow optimizations across the whole function. As a result, register liveness may change between basic blocks, and there is code to recalculate it. Accuracy is required to avoid unnecessary block live-ins.
For example, calculate live-ins to BB1:
   BB#1:
     R0 = ... // Does not affect R1
     ... = D0 // D0 is a pair R1:R0
Here we want R1 to be the live-in, but not the whole D0 or R0.
At the same time, on x86-64,
   BB#1:
     EAX = ...
     ... = RAX
RAX would not be a live-in (since EAX=... overwrites all bits in RAX).
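
Both examples can be sketched as a backward walk over a block, assuming each register is described by a set of units. This is a toy, not the RDF implementation: `live_in_units`, the block encoding, and the unit tables are all made up for illustration.

```python
def live_in_units(block, units):
    """Compute the units live on entry to `block`.

    `block` is a list of (defs, uses) pairs in program order, where each
    element is a list of register names. `units` maps a register to its units.
    """
    live = set()
    for defs, uses in reversed(block):
        for reg in defs:
            live -= units[reg]  # a def kills the units it writes
        for reg in uses:
            live |= units[reg]  # a use makes its units live
    return live


# Hexagon-like pair: D0 is R1:R0.
HEXAGON = {"R0": {"R0"}, "R1": {"R1"}, "D0": {"R0", "R1"}}
bb_hexagon = [(["R0"], []), ([], ["D0"])]  # R0 = ...; ... = D0

# x86-64: EAX = ... overwrites all of RAX, so the two share all units.
X86 = {"EAX": {"AL", "AH"}, "RAX": {"AL", "AH"}}
bb_x86 = [(["EAX"], []), ([], ["RAX"])]    # EAX = ...; ... = RAX

print(live_in_units(bb_hexagon, HEXAGON))
print(live_in_units(bb_x86, X86))
```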

One potential target optimization (for Hexagon) would have to do with register renaming. To rename registers we would have to isolate their live ranges very accurately.

Indeed, we do not necessarily describe the exact semantics of an instruction. For instance, on x86 it is probably right to assume most instructions do not touch the high bits, but on AArch64 it is the opposite.

That's not necessary. In the x86-64 case, if EAX had an extra reg unit that it would share with RAX (for the unaddressable part extending from bit 16 upwards), then none of AL=, AH=, or AX= would invalidate the rest of EAX and RAX, while EAX= would, since it would store into the "hidden" reg unit.

The fact that RAX ends up with 0s in the high part would not be exploited by any target-independent code.

The problem is that at the moment, the last instruction in
   EAX = ...
   AX = ...
   ... = EAX
would seem to only use the value from the second one, since AX= defines all lanes/units that EAX has. This kind of inaccuracy is not just suboptimal, it would lead to an incorrect conclusion. Currently, only x86-specific knowledge would tell us that the first instruction is still live, and I'd like to be able to tell by examining lane masks/reg units.

What I am saying is that even if we had the infrastructure for the unaddressable reg units, we would probably need a lot of work to be able to use it.

Maybe I have overstated the degree of complexity of what I'm looking for. The information I'm interested in is: "what part of the super-register survives a definition of a subregister". And the "what part" does not have to be precise in terms of exact bits, but just some identification like a bit in a lane mask.
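
A minimal sketch of that query, assuming the hidden unit proposed earlier in this message: EAX and RAX share one extra unit for everything from bit 16 upwards. The unit name `A_HI` and the helper `surviving_units` are hypothetical.

```python
# Toy unit table with one extra, unaddressable unit shared by EAX and RAX.
UNITS = {
    "AL": {"AL"},
    "AH": {"AH"},
    "AX": {"AL", "AH"},
    "EAX": {"AL", "AH", "A_HI"},
    "RAX": {"AL", "AH", "A_HI"},
}


def surviving_units(super_reg, sub_reg):
    """Units of `super_reg` that survive a definition of `sub_reg`."""
    return UNITS[super_reg] - UNITS[sub_reg]


print(surviving_units("EAX", "AX"))   # the hidden unit survives AX = ...
print(surviving_units("RAX", "EAX"))  # nothing survives EAX = ...
```

The answer identifies the surviving part only as a unit, not as exact bits, which is all that is being asked for here.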

Side question: have you checked how the scheduler checks dependencies in post-RA mode? I wonder if it is already possible to build the information you want from the existing APIs.

It checks register aliasing. If two registers are aliased, there will be a dependency between them.

-Krzysztof

Code like this does work OK to merge the top half of EAX with the new
value inserted in AX (or AL, AH), but on many CPUs it is very slow --
slower than using proper machine-independent masking operations.

This is because the CPUs *themselves* track EAX and AX separately in the
register renaming machinery, and have to wait until the write to AX has
actually retired before EAX can be read again.

On Pentium Pro, P2, and P3 this caused a stall of about half a dozen cycles.
On Core 2 it was reduced to 2 or 3 cycles. I'm not sure about P4; I think not
good :-) Sometime around Nehalem or Sandy Bridge it was finally eliminated.

Quentin,
If such units were something that targets could explicitly request via some construct in a .td file, would you find that acceptable?

-Krzysztof

In the x86-64 case, if EAX had an extra reg unit that it would share
with RAX (for the unaddressable part extending from bit 16 upwards),
then none of AL=, AH=, or AX= would invalidate the rest of EAX and RAX,
while EAX= would, since it would store into the “hidden” reg unit.

Quentin,
If such units were something that targets could explicitly request via some construct in a .td file, would you find that acceptable?

Quick thought (I haven’t had time to look closely at your other emails).

If we add some explicit construct in the .td files, how could we use them?

Basically, I am wondering whether, say in RDF, we would have some mode where we check for that property to be on and support two different modes, or whether we would have everything working on RegUnits and, if we don’t use the unaddressable mode in the td files, get conservative answers (due to the nature of RegUnits)?

The reason I am asking is because I believe it may already be possible to add unaddressable register units by hand.
One would need to create additional subregs in their td file to fill the holes, then mark all the registers mapping to those subregs as unallocatable.

E.g.,

let SubRegIndices = [sub_16bit] in {
def EAX : X86Reg<"eax", 0, [AX]>, DwarfRegNum<[-2, 0, 0]>;

-->

let SubRegIndices = [dummysubIdx_16bit, sub_16bit] in {
def EAX : X86Reg<"eax", 0, [ADummyXH, AX]>, DwarfRegNum<[-2, 0, 0]>;
[…]
def DummyRegClass : RegisterClass[…] /* list all dummy regs */ {
  let isAllocatable = 0;
}

In other words, you may be able to explore if that would solve your problem or if we have to come up with something smarter.

Cheers,
-Quentin

I think this would solve it!

-Krzysztof