Strong vs. default phi elimination and single-reg classes

Hello again,

I am trying to implement an optimization pass for PowerPC such that
simple loops use the special "counter register" (CTR) to track the
induction variable. This is helpful because, in addition to reducing
register pressure, there is a combined decrement-compare-and-branch
instruction BZND (there are also other related instructions).

I started this process by converting the Hexagon Hardware-Loops pass to
work with analogous PPC instructions. This worked fairly well, but
left a bunch of unneeded unconditional branches. To fix this, I
added support into AnalyzeBranch (and InsertBranch and RemoveBranch).
Unfortunately, this really broke things, the branch instructions (which
both used and defined the count register) were moved around in invalid
ways causing both compile-time (live-out assertions) and run-time
failures. Instead of trying to track down these problems, I thought it
would be better to use a register class instead of the physical
register. This register class has only one register (the count
register), and because of how the loops pass is setup, spilling this
register is never necessary.

Here's the problem: If I use strong phi elimination, then this works,
and the CTR register is allocated as it should be. When using the
default phi elimination, extra copies are introduced (which I don't
completely understand), and the register allocator tries to spill the
count register. For example, with strong-phi elimination, I get (as a
simple example):

BB#0: derived from LLVM BB %entry
            Live Ins: %X3
%vreg2<def> = COPY %X3<kill>; G8RC:%vreg2
%vreg4<def> = LI 2048; GPRC:%vreg4
%vreg3<def> = OR8To4 %vreg2<kill>, %vreg2; GPRC:%vreg3 G8RC:%vreg2
%vreg9<def> = COPY %vreg4<kill>; GPRC:%vreg9,%vreg4
%vreg10<def> = RLDICL %vreg9<kill>, 0, 32; GPRC:%vreg10,%vreg9
%vreg11<def> = MTCTR8r %vreg10<kill>; CTRRC8:%vreg11 GPRC:%vreg10
            Successors according to CFG: BB#1

112B BB#1: derived from LLVM BB %for.body, ADDRESS TAKEN
            Predecessors according to CFG: BB#0 BB#1
%vreg12<def> = PHI %vreg13, <BB#1>, %vreg11,
<BB#0>;CTRRC8:%vreg12,%vreg13,%vreg11
%vreg5<def> = LDtoc <ga:@a>, %X2; G8RC:%vreg5
%vreg6<def> = LWZ 0, %vreg5; mem:Volatile LD4[@a](tbaa=!"int")
GPRC:%vreg6 G8RC:%vreg5
%vreg7<def> = ADD4 %vreg6<kill>, %vreg3; GPRC:%vreg7,%vreg6,%vreg3
STW %vreg7<kill>, 0, %vreg5<kill>; mem:Volatile ST4[@a](tbaa=!"int")
GPRC:%vreg7 G8RC:%vreg5
%vreg13<def> = COPY %vreg12<kill>; CTRRC8:%vreg13,%vreg12
%vreg13<def> = BDNZ8 %vreg13, <BB#1>; CTRRC8:%vreg13
B <BB#2>
            Successors according to CFG: BB#2 BB#1

but with default phi elimination I get:

0B BB#0: derived from LLVM BB %entry
            Live Ins: %X3
%vreg2<def> = COPY %X3<kill>; G8RC:%vreg2
%vreg4<def> = LI 2048; GPRC:%vreg4
%vreg3<def> = OR8To4 %vreg2<kill>, %vreg2; GPRC:%vreg3 G8RC:%vreg2
%vreg9<def> = COPY %vreg4<kill>; GPRC:%vreg9,%vreg4
%vreg10<def> = RLDICL %vreg9<kill>, 0, 32; GPRC:%vreg10,%vreg9
%vreg11<def> = MTCTR8r %vreg10<kill>; CTRRC8:%vreg11 GPRC:%vreg10
%vreg14<def> = COPY %vreg11<kill>; CTRRC8:%vreg14,%vreg11
Successors according to CFG: BB#1

128B BB#1: derived from LLVM BB %for.body, ADDRESS TAKEN
            Predecessors according to CFG: BB#0 BB#1
%vreg12<def> = COPY %vreg14<kill>; CTRRC8:%vreg12,%vreg14
%vreg5<def> = LDtoc <ga:@a>, %X2; G8RC:%vreg5
%vreg6<def> = LWZ 0, %vreg5; mem:Volatile LD4[@a](tbaa=!"int")
GPRC:%vreg6 G8RC:%vreg5
%vreg7<def> = ADD4 %vreg6<kill>, %vreg3; GPRC:%vreg7,%vreg6,%vreg3
STW %vreg7<kill>, 0, %vreg5<kill>; mem:Volatile ST4[@a](tbaa=!"int")
GPRC:%vreg7 G8RC:%vreg5
%vreg14<def> = COPY %vreg13<kill>; CTRRC8:%vreg14,%vreg13
%vreg13<def> = COPY %vreg12<kill>; CTRRC8:%vreg13,%vreg12
%vreg13<def> = BDNZ8 %vreg13, <BB#1>; CTRRC8:%vreg13
B <BB#2>
Successors according to CFG: BB#2 BB#1

Is is those two extra copies at the end that seem to be the problem.
The register allocator believes that it needs two count registers, one
for vreg14 and one for vreg13 (thus the need to spill).

Maybe turning on strong phi elimination by default would be a good
thing, but is there another solution here? Does this problem indicate
that I'm setting up my phi nodes (or register constraints, etc.)
incorrectly?

Thanks in advance,
Hal

Phi-elim works by inserting copies in the PHI predecessors before the first terminator. That isn't possible here since the first terminator defines the value of %vreg13 we want to copy. I think that phi-elim is not only causing spilling. It is miscompiling the code.

We simply don't support terminator instructions that define virtual registers ATM, it seems.

I can see how strong phi-elim would work in this case, but I don't think it can properly handle this in general. For example:

112B BB#1: derived from LLVM BB %for.body, ADDRESS TAKEN
           Predecessors according to CFG: BB#0 BB#1
%vreg12<def> = PHI %vreg13, <BB#1>, %vreg11, <BB#0>;CTRRC8:%vreg12,%vreg13,%vreg11
%vreg99<def> = PHI %vreg13, <BB#1>, %vreg2, <BB#0>
%vreg13<def> = COPY %vreg12<kill>; CTRRC8:%vreg13,%vreg12
%vreg13<def> = BDNZ8 %vreg13, <BB#1>; CTRRC8:%vreg13
B <BB#2>
           Successors according to CFG: BB#2 BB#1

Since it is impossible to copy %vreg13 at the end of BB#1, it must be merged with %vreg12. The second PHI use means that it must also be merged with %vreg99, but that is not possible if %vreg11 and %vreg2 have different values leaving BB#0.

Instead of trying to solve this hairy problem, I think you would be better off without the CTRRC8 register class. Your problems with {Analyze,Insert,Remove}Branch should be fixable.

/jakob

> 112B BB#1: derived from LLVM BB %for.body, ADDRESS TAKEN
> Predecessors according to CFG: BB#0 BB#1
> %vreg12<def> = PHI %vreg13, <BB#1>, %vreg11,
> <BB#0>;CTRRC8:%vreg12,%vreg13,%vreg11 %vreg13<def> = COPY
> %vreg12<kill>; CTRRC8:%vreg13,%vreg12 %vreg13<def> = BDNZ8 %vreg13,
> <BB#1>; CTRRC8:%vreg13 B <BB#2>
> Successors according to CFG: BB#2 BB#1

Phi-elim works by inserting copies in the PHI predecessors before the
first terminator. That isn't possible here since the first terminator
defines the value of %vreg13 we want to copy. I think that phi-elim
is not only causing spilling. It is miscompiling the code.

We simply don't support terminator instructions that define virtual
registers ATM, it seems.

I can see how strong phi-elim would work in this case, but I don't
think it can properly handle this in general. For example:

112B BB#1: derived from LLVM BB %for.body, ADDRESS TAKEN
           Predecessors according to CFG: BB#0 BB#1
%vreg12<def> = PHI %vreg13, <BB#1>, %vreg11,
<BB#0>;CTRRC8:%vreg12,%vreg13,%vreg11 %vreg99<def> = PHI %vreg13,
<BB#1>, %vreg2, <BB#0> %vreg13<def> = COPY %vreg12<kill>;
CTRRC8:%vreg13,%vreg12 %vreg13<def> = BDNZ8 %vreg13, <BB#1>;
CTRRC8:%vreg13 B <BB#2>
           Successors according to CFG: BB#2 BB#1

Since it is impossible to copy %vreg13 at the end of BB#1, it must be
merged with %vreg12. The second PHI use means that it must also be
merged with %vreg99, but that is not possible if %vreg11 and %vreg2
have different values leaving BB#0.

Instead of trying to solve this hairy problem, I think you would be
better off without the CTRRC8 register class.

Thanks for explaining! I can roll-back.

Your problems with
{Analyze,Insert,Remove}Branch should be fixable.

I don't think that the problem was with those functions. Adding support
for BDNZ and friends in those functions just enabled other passes to
start moving the blocks around, and that seems to have exposed problems
of its own. For example, sometimes LiveIntervals asserts with:
                register:
%CTR8
clang: /llvm-trunk/lib/CodeGen/LiveIntervalAnalysis.cpp:446:
void llvm::LiveInterval
s::handlePhysicalRegisterDef(llvm::MachineBasicBlock*,
llvm::MachineBasicBlock::iterator, llvm::SlotIndex,
llvm::MachineOperand&, llvm::LiveInt erval&): Assertion
`!isAllocatable(interval.reg) && "Physregs shouldn't be live out!"'
failed.

in this case the loop is quite simple:
944B BB#8: derived from LLVM BB %for.inc6, ADDRESS TAKEN
            Live Ins: %CTR8
            Predecessors according to CFG: BB#8 BB#3
960B BDNZ8 <BB#8>, %CTR8<imp-def>, %CTR8<imp-use,kill>
            Successors according to CFG: BB#8 BB#10

the preheader is:

240B BB#3:
            Predecessors according to CFG: BB#2
256B %vreg28<def> = LI 0; GPRC:%vreg28
272B %vreg30<def> = COPY %vreg17<kill>; GPRC:%vreg30,%vreg17
288B %vreg31<def> = RLDICL %vreg30<kill>, 0, 32;GPRC:%vreg31,%vreg30
304B MTCTR8 %vreg31<kill>,%CTR8<imp-def,dead>; GPRC:%vreg31
320B B <BB#8>
            Successors according to CFG: BB#8

So maybe LiveInterval would need to be updated to support terminators
that define registers? There are also mis-compiles, but I'm hoping that
they all stem from the same underlying problem.

Thanks again,
Hal

When machine code is still in SSA form, there are restrictions on what can be done with physical registers, which by their nature can’t be in SSA form. Lang and I have been trying to come up with some rules, but we haven’t found the right set yet.

So far, we are enforcing:

  • Allocatable registers cannot be live across basic blocks, except on entry to ABI blocks (the entry block and landing pads).

  • Unallocatable registers may be live across basic blocks, but you can’t use aliases. (For example, you can’t define CTR8 in one block and read CTR in another).

  • Reserved registers can do whatever you want, but defs are treated as having side effects, blocking certain optimizations like dead code elimination. (You don’t want to DCE a stack pointer update).

In this case, the problem is that CTR and CTR8 are allocatable because CTRRC and CTRRC8 are allocatable. You can set ‘isAllocatable = 0’ in those register classes.

Unfortunately, that breaks your TCRETURNri instructions. Sorry!

Singleton register classes are dangerous because it is so easy to extend the live range of virtual registers without checking for interference. If you accidentally cross two live ranges, you get horrible spilling as you’ve seen. We try to use explicit unallocatable physreg liveness for singletons like X86::EFLAGS and ARM::CPSR, but isel and TableGen aren’t making it easy.

/jakob

> For example, sometimes LiveIntervals asserts with:
> register:
> %CTR8
> clang: /llvm-trunk/lib/CodeGen/LiveIntervalAnalysis.cpp:446:
> void llvm::LiveInterval
> s::handlePhysicalRegisterDef(llvm::MachineBasicBlock*,
> llvm::MachineBasicBlock::iterator, llvm::SlotIndex,
> llvm::MachineOperand&, llvm::LiveInt erval&): Assertion
> `!isAllocatable(interval.reg) && "Physregs shouldn't be live out!"'
> failed.

FYI: I just committed the relevant code (disabled by default); and I
submitted a bug to track this issue:
http://llvm.org/bugs/show_bug.cgi?id=13057
(It seems that I was doing this as you were responding to my e-mail, so
it may be out of date already).

When machine code is still in SSA form, there are restrictions on
what can be done with physical registers, which by their nature can't
be in SSA form. Lang and I have been trying to come up with some
rules, but we haven't found the right set yet.

So far, we are enforcing:

- Allocatable registers cannot be live across basic blocks, except on
entry to ABI blocks (the entry block and landing pads).

- Unallocatable registers may be live across basic blocks, but you
can't use aliases. (For example, you can't define CTR8 in one block
and read CTR in another).

- Reserved registers can do whatever you want, but defs are treated
as having side effects, blocking certain optimizations like dead code
elimination. (You don't want to DCE a stack pointer update).

In this case, the problem is that CTR and CTR8 are allocatable
because CTRRC and CTRRC8 are allocatable. You can set 'isAllocatable
= 0' in those register classes.

Thanks for explaining! [We should add this to the docs somewhere].

Unfortunately, that breaks your TCRETURNri instructions. Sorry!

I'm guessing that I can rewrite TCRETURN to reference CTR/CTR8
directly. I'll have to try that.

Singleton register classes are dangerous because it is so easy to
extend the live range of virtual registers without checking for
interference. If you accidentally cross two live ranges, you get
horrible spilling as you've seen.

Indeed, so it seems :wink:

We try to use explicit
unallocatable physreg liveness for singletons like X86::EFLAGS and
ARM::CPSR, but isel and TableGen aren't making it easy.

/jakob

Thanks again,
Hal

When machine code is still in SSA form, there are restrictions on
what can be done with physical registers, which by their nature can't
be in SSA form. Lang and I have been trying to come up with some
rules, but we haven't found the right set yet.

[We should add this to the docs somewhere].

http://llvm.org/bugs/show_bug.cgi?id=13058

Unfortunately, that breaks your TCRETURNri instructions. Sorry!

I'm guessing that I can rewrite TCRETURN to reference CTR/CTR8
directly. I'll have to try that.

That should be possible, yes.

/jakob

>> When machine code is still in SSA form, there are restrictions on
>> what can be done with physical registers, which by their nature
>> can't be in SSA form. Lang and I have been trying to come up with
>> some rules, but we haven't found the right set yet.

> [We should add this to the docs somewhere].

http://llvm.org/bugs/show_bug.cgi?id=13058

>> Unfortunately, that breaks your TCRETURNri instructions. Sorry!
>
> I'm guessing that I can rewrite TCRETURN to reference CTR/CTR8
> directly. I'll have to try that.

That should be possible, yes.

As it turns out, I don't need to (the patterns in question are never
actually used in the current backend).

Furthermore, your suggestion to mark the register classes as
non-allocatable seems to have worked. Doing that and fully enabling this
now yields no (new) test-suite failures! :slight_smile:

Thank you very much,
Hal

I'm guessing that I can rewrite TCRETURN to reference CTR/CTR8
directly. I'll have to try that.

That should be possible, yes.

As it turns out, I don't need to (the patterns in question are never
actually used in the current backend).

Neat.

Furthermore, your suggestion to mark the register classes as
non-allocatable seems to have worked. Doing that and fully enabling this
now yields no (new) test-suite failures! :slight_smile:

Great! You can probably go ahead and delete the CTRRC* register classes, if they are unused. That has the same effect as setting 'isAllocatable = 0'.

/jakob