Problems with subreg-liveness and Greedy RA

I am having a really difficult time with subregister related issues when I turn
on subregister liveness tracking.

Before RA:
79760B %2216:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %2215:g8rc :: (load 8 from %ir.scevgep1857.cast, !alias.scope !92, !noalias !93)
79872B %2225:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0
84328B %5540:vsrc = contract nofpexcept XVMADDADP %5540:vsrc(tied-def 0), %2225.sub_vsx0:vsrprc, %2216:vsrc, implicit $rm

After RA (greedy):
79744B %2214:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %6477:g8rc :: (load 8 from %ir.scevgep1860.cast, !alias.scope !92, !noalias !93)
79872B %7503:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0
80248B %7527:vsrprc = COPY %7503:vsrprc
80988B undef %7526.sub_64:vsrprc = COPY %7527.sub_64:vsrprc
84324B undef %7501.sub_64:vsrprc = COPY %7526.sub_64:vsrprc
84328B %5546:vsrc = contract nofpexcept XVMADDADP %5546:vsrc(tied-def 0), %7501.sub_vsx0:vsrprc, %2214:vsrc, implicit $rm

Subregister definitions for PPC:
def sub_64 : SubRegIndex<64>;
def sub_vsx0 : SubRegIndex<128>;
def sub_vsx1 : SubRegIndex<128, 128>;
def sub_pair0 : SubRegIndex<256>;
def sub_pair1 : SubRegIndex<256, 256>;

So the instruction at 84328B uses the full register %2216 and the high order
128 bits of (256-bit) register %2225. However, the register allocator splits
the live range and introduces a copy of the high order 64 bits of that 256-bit
register, then another copy of that copy and rewrites the use in instruction
84328B to that copy. The copy is marked undef so the register allocator
assigns just some random register to the use of that copy in 84328B.

Or maybe I am completely misinterpreting the meaning of the debug dumps
from the register allocator.

This appears to be related to lane masks and dead lane detection although
I don’t see dead lane detection marking anything unexpected as undef (seems

to just be INSERT_SUBREG and PHI).

If anyone has suggestions on what might be the issue and/or how to go about figuring this out and fixing it, I would really appreciate it.

Nemanja

I am having a really difficult time with subregister related issues when I turn
on subregister liveness tracking.

Before RA:
79760B %2216:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %2215:g8rc :: (load 8 from %ir.scevgep1857.cast, !alias.scope !92, !noalias !93)
79872B %2225:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0
84328B %5540:vsrc = contract nofpexcept XVMADDADP %5540:vsrc(tied-def 0), %2225.sub_vsx0:vsrprc, %2216:vsrc, implicit $rm

After RA (greedy):
79744B %2214:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %6477:g8rc :: (load 8 from %ir.scevgep1860.cast, !alias.scope !92, !noalias !93)
79872B %7503:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0
80248B %7527:vsrprc = COPY %7503:vsrprc
80988B undef %7526.sub_64:vsrprc = COPY %7527.sub_64:vsrprc
84324B undef %7501.sub_64:vsrprc = COPY %7526.sub_64:vsrprc
84328B %5546:vsrc = contract nofpexcept XVMADDADP %5546:vsrc(tied-def 0), %7501.sub_vsx0:vsrprc, %2214:vsrc, implicit $rm

Subregister definitions for PPC:
def sub_64 : SubRegIndex<64>;
def sub_vsx0 : SubRegIndex<128>;
def sub_vsx1 : SubRegIndex<128, 128>;
def sub_pair0 : SubRegIndex<256>;
def sub_pair1 : SubRegIndex<256, 256>;

So the instruction at 84328B uses the full register %2216 and the high order
128 bits of (256-bit) register %2225. However, the register allocator splits
the live range and introduces a copy of the high order 64 bits of that 256-bit
register, then another copy of that copy and rewrites the use in instruction
84328B to that copy. The copy is marked undef so the register allocator
assigns just some random register to the use of that copy in 84328B.

Or maybe I am completely misinterpreting the meaning of the debug dumps
from the register allocator.

This appears to be related to lane masks and dead lane detection although
I don’t see dead lane detection marking anything unexpected as undef (seems

to just be INSERT_SUBREG and PHI).

Are the copies added by dead lane detection or by live-range splitting?

The undef flag on the definition of %7501 is suspicious and depending on how you look at it, so is the one on %7526. Essentially, we are losing the full copy in this chain of copies and I wonder what is at fault here.

Could you share the debug output of regalloc?

Thank you so much for taking the time to answer Quentin.

The bad copies are definitely added by live range splitting. The issue seems to be the LaneBitmasks for the various subregisters. Honestly, I don’t really know what the bits of LaneBitmask produced by TblGen are meant to mean, but I can’t make any sense of them. And those seem to lead the register allocator astray.
Here are the LaneBitmasks from the register include file:

static const LaneBitmask SubRegIndexLaneMaskTable[] = {
LaneBitmask::getAll(),
LaneBitmask(0x0000000000000001), // sub_32
LaneBitmask(0x0000000000000002), // sub_64
LaneBitmask(0x0000000000000004), // sub_eq
LaneBitmask(0x0000000000000001), // sub_gp8_x0
LaneBitmask(0x0000000000000200), // sub_gp8_x1
LaneBitmask(0x0000000000000008), // sub_gt
LaneBitmask(0x0000000000000010), // sub_lt
LaneBitmask(0x0000000000000042), // sub_pair0
LaneBitmask(0x0000000000000180), // sub_pair1
LaneBitmask(0x0000000000000020), // sub_un
LaneBitmask(0x0000000000000002), // sub_vsx0
LaneBitmask(0x0000000000000040), // sub_vsx1
LaneBitmask(0x0000000000000040), // sub_vsx1_then_sub_64
LaneBitmask(0x0000000000000080), // sub_pair1_then_sub_64
LaneBitmask(0x0000000000000080), // sub_pair1_then_sub_vsx0
LaneBitmask(0x0000000000000100), // sub_pair1_then_sub_vsx1
LaneBitmask(0x0000000000000100), // sub_pair1_then_sub_vsx1_then_sub_64
LaneBitmask(0x0000000000000200), // sub_gp8_x1_then_sub_32
};

For example, what does it mean that the mask for sub_64 and sub_vsx0 are the same?

That just means they overlap. That’s fine (I think!)

From LaneBitmask.h

/// Iff the target has a register such that:
///
/// getSubReg(Reg, A) overlaps getSubReg(Reg, B)
///
/// then:
///
/// (getSubRegIndexLaneMask(A) & getSubRegIndexLaneMask(B)) != 0

The two subregisters certainly do not represent the same lanes in their respective registers. The sub_vsx0 subregister is the first VSX register in a VSX register pair. And each of the two subregisters of a VSX register pair (sub_vsx0, sub_vsx1) have their own scalar subregister (sub_64).

I have also attached the output of RA, but it is huge :frowning:
It is the result of specifying options -debug-only=regalloc -print-before=greedy -print-after=greedy on the command line.

Thanks, I’ll try to take a look this week.
Looking at these lines, I wonder if the issue is not simply that we didn’t pass the right subregindex. I.e., the following code would have been fine with sub_vsx0 instead of sub_64.

80988B undef %7526.sub_64:vsrprc = COPY %7527.sub_64:vsrprc
84324B undef %7501.sub_64:vsrprc = COPY %7526.sub_64:vsrprc
84328B %5546:vsrc = contract nofpexcept XVMADDADP %5546:vsrc(tied-def 0), %7501.sub_vsx0:vsrprc

Sorry, it would appear that the dev list stripped my attachment. I have reduced the file using bugpoint and produced the output from that. Attaching it here.

Nemanja

ra-before-after-debug.txt (647 KB)

Hi Nemanja,

Do you have something I could run on my side?

I’d like to see how/when we create %296 and the dump, doesn’t have that information:


selectOrSplit VSRpRC:%39 [536r,6168B:0) 0@536r L0000000000000002 [536r,6168B:0) 0@536r L0000000000000040 [536r,536d:0) 0@536r weight:7.122666e+05 w=7.122666e+05
RS_Split Cascade 11
Analyze counted 3 instrs in 2 blocks, through 0 blocks.
Cost of isolating all blocks = 2147483648.8
$vsrp0 no positive bundles
$vsrp1 no positive bundles
$vsrp2 no positive bundles
$vsrp3 no positive bundles
$vsrp4 no positive bundles
$vsrp5 no positive bundles
$vsrp6 no positive bundles
$vsrp17 no positive bundles
$vsrp18 no positive bundles
$vsrp16 no positive bundles
$vsrp19 no positive bundles
$vsrp20 no positive bundles
$vsrp21 no positive bundles
$vsrp22 no positive bundles
$vsrp23 no positive bundles
$vsrp24 no positive bundles
$vsrp25 no positive bundles
$vsrp15 no positive bundles
$vsrp14 no positive bundles
$vsrp13 no positive bundles
$vsrp12 no positive bundles
$vsrp11 no positive bundles
$vsrp10 no positive bundles
$vsrp9 no positive bundles
$vsrp8 no positive bundles
$vsrp7 no positive bundles
$vsrp31 no positive bundles
$vsrp30 no positive bundles
$vsrp29 no positive bundles
$vsrp28 no positive bundles
$vsrp27 no positive bundles
$vsrp26 no positive bundles
enterIntvBefore 5296r: valno 0
leaveIntvAfter 5320r: valno 0
useIntv [5292r;5320r): [5292r;5320r):1
Multi-mapped complement 0@5316r for parent 0@536r hoist to %bb.2 5316r
Direct complement def at 536r
Removing 1 back-copies.
Removing 5316r undef %296.sub_64:vsrprc = COPY %39.sub_64:vsrprc. <———— THIS COPY
blit [536r,6168B:0): [536r;5292r)=0(%296)(recalc) [5292r;5320r)=1(%297)(recalc) [5320r;6168B)=0(%296)(recalc)
rewr %bb.1 536r:0 %296:vsrprc = LXVP 0, %33:g8rc_and_g8rc_nox0 :: (load 32 from constant-pool)
rewr %bb.2 5320B:1 %290:vsrc = contract nofpexcept XVMADDADP %290:vsrc(tied-def 0), %297.sub_vsx0:vsrprc, %31:vsrc, implicit $rm
rewr %bb.2 5296B:1 %289:vsrc = contract nofpexcept XVMADDADP %289:vsrc(tied-def 0), %297.sub_vsx0:vsrprc, %62:vsrc, implicit $rm
rewr %bb.2 5292B:0 undef %297.sub_64:vsrprc = COPY %296.sub_64:vsrprc
queuing new interval: %296 [536r,6168B:0) 0@536r L0000000000000040 [536r,536d:0) 0@536r L0000000000000002 [536r,6168B:1) 0@x 1@536r weight:3.596946e+05
queuing new interval: %297 [5292r,5320r:0) 0@5292r L0000000000000002 [5292r,5320r:0) 0@5292r weight:1.520298e+07

Ideally, if you can file a bug, that would make the tracking simpler.

Cheers,
-Quentin

Hi Quentin,
I am really sorry for the delay. I am working on getting a reproducer for this and once I have something manageable, I’ll open a bug.

Thanks for all your help,
Nemanja

Hi Nemanja,
I’ve seen alike issue with our downstream arch as well. I’ve fixed with this patch, see if solves your problem:

diff --git a/llvm/lib/CodeGen/LiveInterval.cpp b/llvm/lib/CodeGen/LiveInterval.cpp
--- a/llvm/lib/CodeGen/LiveInterval.cpp
+++ b/llvm/lib/CodeGen/LiveInterval.cpp
@@ -984,16 +984,18 @@
   for (const MachineOperand &MO : MRI.def_operands(reg)) {
     if (!MO.isUndef())
       continue;
-    unsigned SubReg = MO.getSubReg();
-    assert(SubReg != 0 && "Undef should only be set on subreg defs");
-    LaneBitmask DefMask = TRI.getSubRegIndexLaneMask(SubReg);
-    LaneBitmask UndefMask = VRegMask & ~DefMask;
-    if ((UndefMask & LaneMask).any()) {
-      const MachineInstr &MI = *MO.getParent();
-      bool EarlyClobber = MO.isEarlyClobber();
-      SlotIndex Pos = Indexes.getInstructionIndex(MI).getRegSlot(EarlyClobber);
-      Undefs.push_back(Pos);
+    // If it is a SubReg, see if they overlap,
+    // or if it is the entire reg, just add it
+    if (unsigned SubReg = MO.getSubReg()) {
+      LaneBitmask DefMask = TRI.getSubRegIndexLaneMask(SubReg);
+      LaneBitmask UndefMask = VRegMask & ~DefMask;
+      if (!(UndefMask & LaneMask).any())
+        return;
     }
+    const MachineInstr &MI = *MO.getParent();
+    bool EarlyClobber = MO.isEarlyClobber();
+    SlotIndex Pos = Indexes.getInstructionIndex(MI).getRegSlot(EarlyClobber);
+    Undefs.push_back(Pos);
   }
 }

Sorry about that, seems it lost the correct diff format. Attached.
PS: We are working off-tree, over llvm 12.0, not trunk.

fix-subreg-liveness-tracking.diff (1.26 KB)