SUBREG_TO_REG semantics (or x86's zext implementation is broken)

I’m debugging a regression on x86 and my current belief is it’s due to undef handling when combined with SUBREG_TO_REG. I do not know how to resolve the two points in the current definition for it:

/// SUBREG_TO_REG - Assert the value of bits in a super register.
/// The result of this instruction is the value of the second operand inserted
/// into the subregister specified by the third operand. All other bits are
/// assumed to be equal to the bits in the immediate integer constant in the
/// first operand. This instruction just communicates information; No code
/// should be generated.
/// This is typically used after an instruction where the write to a subregister
/// implicitly cleared the bits in the super registers.
HANDLE_TARGET_OPCODE(SUBREG_TO_REG)

All other bits are assumed to be equal to the bits in the immediate integer constant in the first operand.

This instruction just communicates information; No code should be generated.

In this failure, the original IR has an undef phi input. Prior to the coalescer, there is MIR that looks like:

    
  // entry block
  %45:gr64_with_sub_8bit = IMPLICIT_DEF

  // block in loop
    undef %45.sub_32bit:gr64_with_sub_8bit = MOV32rr %45.sub_32bit
    %45:gr64_with_sub_8bit = SUBREG_TO_REG 0, %45.sub_32bit, %subreg.sub_32bit
    JMP_1 %bb.5

  // ...

The coalescer decides that SUBREG_TO_REG is merely a copy, and it’s deleted as these registers coalesce with another. Later during allocation, the resulting coalesced copy inside this block are deleted such that the use blocks see different undefs (and the entry IMPLICIT_DEF is disconnected and immediately killed).

If I look at x86, I see this:


def : Pat<(i64 (zext GR32:$src)),
          (SUBREG_TO_REG (i64 0), (MOV32rr GR32:$src), sub_32bit)>;

If this is merely an assert that the high bits are 0, this is just broken. An explicit zeroing is needed. Either SUBREG_TO_REG requires preserving the high bits and cannot be coalesced with copies, or this pattern is broken and requires explicit zeroing.

MOV32rr is an explicit instruction, which zeros the high bits. On x86, every instruction that produces a 32-bit result zeros the high 32 bits of the corresponding 64-bit register. That’s why it’s explicitly lowering to MOV32rr, and not a COPY. (There are other equivalent instructions, but “mov” is the shortest/fastest way to write this.)

If some optimization is deciding that the MOV32rr can be removed, that optimization is wrong (or the x86 backend is giving it wrong information about the semantics of MOV32rr).

It’s more the undef flag ends up set on its def.

Before coalescing:

    %23:gr32 = COPY killed %6.sub_32bit
    %24:gr32 = MOV32rr killed %23
    %7:gr64 = SUBREG_TO_REG 0, killed %24, %subreg.sub_32bit
    %46:gr64 = COPY killed %7
    JMP_1 %bb.5

After coalescing:

  bb.3:
    successors: %bb.5(0x80000000)
  
    undef %45.sub_32bit:gr64_with_sub_8bit = MOV32rr %45.sub_32bit
    JMP_1 %bb.5

The downstream user includes a 64-bit shift, so those high bits probably aren’t supposed to be undef.

The “before coalescing” IR looks fine; the MOV32rr zeros the high bits, and the SUBREG_TO_REG indicates that we’re actually using the whole 64-bit register even though MOV32rr doesn’t explicitly mention the 64-bit register. This is how SUBREG_TO_REG is meant to be used; it indicates the operand register was produced by an instruction that actually defines more bits than the definition indicates.

(I won’t argue this is the best design if we were designing it from scratch, but it’s what we’ve done for a long time.)

Fully reduced and I think it’s pretty clear the coalescer should not have introduced this undef flag:

# RUN: llc -mtriple=x86_64-- -run-pass=register-coalescer -o - %s | FileCheck %s

---
name: subreg_to_reg_folds_to_undef
tracksRegLiveness: true
body:             |
  bb.0:
    liveins: $rax

    ; CHECK-LABEL: name: subreg_to_reg_folds_to_undef
    ; CHECK: liveins: $rax
    ; CHECK-NEXT: {{  $}}
    ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64_with_sub_8bit = COPY $rax
    ; CHECK-NEXT: undef %4.sub_32bit:gr64_with_sub_8bit = MOV32rr [[COPY]].sub_32bit
    ; CHECK-NEXT: RET 0, implicit %4
    %0:gr64 = COPY killed $rax
    %1:gr32 = COPY killed %0.sub_32bit
    %2:gr32 = MOV32rr killed %1
    %3:gr64 = SUBREG_TO_REG 0, killed %2, %subreg.sub_32bit
    %4:gr64 = COPY killed %3
    RET 0, implicit %4

...

So there’s hidden liveness introduced by SUBREG_TO_REG, that’s not captured

IIRC, both rax and eax have the same register unit, so the loss of liveness seems really strange.

I think the real fix is to stop special casing this and use a tied operand with a subregister index, instead of this separate subregister index operand. As-is I think you need to special case teach that there’s a read of this unrelated register ~everywhere

The x86 target seems to be wrong about MOV32rr:

let hasSideEffects = 0, isMoveReg = 1 in {
[...]
def MOV32rr : I<0x89, MRMDestReg, (outs GR32:$dst), (ins GR32:$src),
                "mov{l}\t{$src, $dst|$dst, $src}", []>, OpSize32;
[...]
}

That instruction does have side-effects, which are not expressed by any operand or flag.

I’m approaching a solution which is to have SUBREG_TO_REG add an implicit-def of the full virtual whenever it coalesces with something else. The later simple is-copy recognition then needs to not consider copy/move with an implicit def as a simple copy

⚙ D156346 CodeGen: Disable isCopyInstrImpl if there are implicit operands ends the stack to fix this, ⚙ D156345 RegisterCoalescer: Add implicit-def of super register when coalescing SUBREG_TO_REG is the main bit