How to copy-propagate a physical register introduced before RA

I have come across the following MIR pattern after RISC-V instruction selection:

  %5:gpr = COPY $x10
  %6:gpr = LW %stack.2.c, 0 :: (dereferenceable load (s32) from %ir.c)
  ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
  $x10 = COPY %6:gpr
  $x11 = COPY %5:gpr

However, the machine scheduler reorders it to:

  %6:gpr = LW %stack.2.c, 0 :: (dereferenceable load (s32) from %ir.c)
  %5:gpr = COPY $x10
  ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
  $x10 = COPY %6:gpr
  $x11 = COPY %5:gpr

This introduces a dependency between the registers that machine-cp cannot act on, so sub-optimal code is generated:

  lw a1, 0(sp)
  mv a2, a0
  mv a0, a1
  mv a1, a2

The optimized code would look like:

  mv a1, a0
  lw a0, 0(sp)

I was wondering how to solve this problem. When I force the machine scheduler to top-down list scheduling (-misched-topdown) it works as expected, but I am not sure this is intentional. Is there a way to run copy propagation before the machine scheduler, or should the change be made in the scheduler, or should machine-cp be improved?

Improving machine-cp seems like the best option:

Step 1: a1 is only used once, in a copy to a0, so “rematerialize” a0 with a1’s definition instead:

# deleted: lw a1, 0(sp)
mv a2, a0
lw a0, 0(sp)
mv a1, a2

Step 2: propagate the copy a1 = a2 backwards:

mv a1, a0
lw a0, 0(sp)
# deleted mv a1, a2

Hi @niwinanto. Looking at this pattern, I think it is generated by LowerCall. The main issue here is that, by the ABI, we have an overlap between the return and argument registers (we have the same problem on ARM, and it disappears with the pre-RA scheduler off). I can reproduce this behavior with:

void init_var(int *v);
int chain(int c, int n);

void start() {
    int a, b, c;

    init_var(&a);
    init_var(&b);
    init_var(&c);

    int r = chain(b, a);
    r = chain(c, r);
}

In this case, we have the following code:

start:                                  # @start
# %bb.0:                                # %entry
        addi    sp, sp, -16
        sw      ra, 12(sp)                      # 4-byte Folded Spill
        addi    a0, sp, 8
        call    init_var
        addi    a0, sp, 4
        call    init_var
        mv      a0, sp
        call    init_var
        lw      a0, 4(sp)
        lw      a1, 8(sp)
        call    chain
        lw      a1, 0(sp)
        mv      a2, a0
        mv      a0, a1
        mv      a1, a2
        call    chain
        lw      ra, 12(sp)                      # 4-byte Folded Reload
        addi    sp, sp, 16
        ret

In this code, a0 is both used and defined, so if we first copy to a0 and then to a1, we need an extra temporary. But if we reverse the copy order, we can facilitate machine-cp's work.

If we apply this patch:

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 26475d94aef0..5713654d9ad6 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -17769,7 +17769,7 @@ SDValue RISCVTargetLowering::LowerCall(CallLoweringInfo &CLI,
   SDValue Glue;
 
   // Build a sequence of copy-to-reg nodes, chained and glued together.
-  for (auto &Reg : RegsToPass) {
+  for (auto &Reg : llvm::reverse(RegsToPass)) {
     Chain = DAG.getCopyToReg(Chain, DL, Reg.first, Reg.second, Glue);
     Glue = Chain.getValue(1);
   }

We can reduce the problem a bit:

start:                                  # @start
# %bb.0:                                # %entry
        addi    sp, sp, -16
        sw      ra, 12(sp)                      # 4-byte Folded Spill
        addi    a0, sp, 8
        call    init_var
        addi    a0, sp, 4
        call    init_var
        mv      a0, sp
        call    init_var
        lw      a0, 4(sp)
        lw      a1, 8(sp)
        call    chain
        lw      a2, 0(sp)
        mv      a1, a0
        mv      a0, a2
        call    chain
        lw      ra, 12(sp)                      # 4-byte Folded Reload
        addi    sp, sp, 16
        ret

As this problem also exists on other targets, it would be nice to fix it in a target-independent way.

I also took a look and saw that GCC generates the code you pointed out.

About the scheduler, I think this can trigger an interesting discussion, because solving it there would also produce slightly more compact code for this target.

Regards,

Andreu


I am wondering why the LD is so aggressively positioned (immediately after the return).
If the LD were delayed for just a moment, there would be time to move the return value
into the appropriate argument register, and this would get rid of the extra MOV.

I think that the pre-RA scheduler sees just one real instruction there (the others are just copies), and that instruction (LW) usually has high result latency on most targets, so, in general, it is a good decision to schedule it first to avoid stalls. By manipulating latencies we might be able to fix the problem, but the side effects could be negative.

I would look into your scheduler heuristics to begin with.

The one thing that is strange to me is that I would have expected the scheduler to keep the live-ranges of the physical registers involved in the copies as small as possible.

In your case the opposite happens: the live-range of the physical register that defines %5 is being extended, and this is bad for register allocation because it creates additional artificial coloring constraints.
I’m guessing the scheduler is too aggressive about hiding the latency of the load in this case.
