MachineCSE of copy instructions


I noticed that MachineCSE::isCSECandidate does not consider COPY instructions as CSE candidates, and I’m wondering why. I would expect COPY to be the best way to enable target-independent optimizations to work.

We have to process instructions after instruction selection to make sure their operands satisfy a few restrictions based on the register classes of the operands. If multiple instructions need the same operand legalized, the same copy to the required register class is inserted more than once, and these duplicates aren’t getting eliminated as expected.

In this example, we need to insert a copy for the src1/%b operand of each FMA.

define void @test_s0_s1_k(float addrspace(1)* %out, float %a, float %b) #0 {
  %fma0 = call float @llvm.fma.f32(float %a, float %b, float 1024.0) #1
  %fma1 = call float @llvm.fma.f32(float %a, float %b, float 4096.0) #1
  store volatile float %fma0, float addrspace(1)* %out
  store volatile float %fma1, float addrspace(1)* %out
  ret void
}

A COPY is inserted when processing each instruction’s operands:

%vreg12<def> = COPY %vreg4; VGPR_32:%vreg12 SGPR_32:%vreg4
%vreg11<def> = V_FMA_F32 0, %vreg3, 0, %vreg12, 0, %vreg10, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg11,%vreg12,%vreg10 SGPR_32:%vreg3

%vreg15<def> = COPY %vreg4; VGPR_32:%vreg15 SGPR_32:%vreg4
%vreg14<def> = V_FMA_F32 0, %vreg3, 0, %vreg15, 0, %vreg13, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg14,%vreg15,%vreg13 SGPR_32:%vreg3

Which ends up getting emitted as:

v_mov_b32_e32 v1, s0
v_mov_b32_e32 v2, s0 // redundant copy of s0
v_fma_f32 v0, s2, v2, v0
v_fma_f32 v1, s2, v1, v2

I would expect the redundant copy to be eliminated, but it is not. If I remove the MI->isCopyLike() restriction, it is CSEd as expected in this case and others (although a variety of tests break mostly with assertions).
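To make the expectation concrete, here is a toy, self-contained C++ sketch of the elimination I have in mind. It is not MachineCSE and uses no LLVM API: the Inst struct and the cseCopies function are hypothetical names invented for illustration, and it assumes a single basic block in SSA form where each virtual register is defined once.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy instruction (not an LLVM type): one def, a list of uses.
struct Inst {
    std::string opcode;     // e.g. "COPY" or "V_FMA_F32"
    int def;                // virtual register defined by this instruction
    std::string defClass;   // register class of the def, e.g. "VGPR_32"
    std::vector<int> uses;  // virtual registers read by this instruction
};

// Drops any COPY that repeats an earlier COPY with the same source register
// and the same destination register class, and rewrites later uses of the
// dropped def to the surviving def.
// Assumption: single block, SSA form (each vreg is defined exactly once).
std::vector<Inst> cseCopies(const std::vector<Inst>& block) {
    std::map<std::pair<int, std::string>, int> seen;  // (src, class) -> def
    std::map<int, int> replaced;                      // dropped def -> survivor
    std::vector<Inst> out;
    for (Inst inst : block) {
        // Rewrite uses of copies that were already eliminated.
        for (int& use : inst.uses) {
            auto it = replaced.find(use);
            if (it != replaced.end()) use = it->second;
        }
        if (inst.opcode == "COPY" && inst.uses.size() == 1) {
            auto key = std::make_pair(inst.uses[0], inst.defClass);
            auto it = seen.find(key);
            if (it != seen.end()) {
                replaced[inst.def] = it->second;
                continue;  // redundant copy: do not emit it
            }
            seen[key] = inst.def;
        }
        out.push_back(inst);
    }
    return out;
}
```

Fed the four instructions from the example above (the two COPYs of %vreg4 and the two V_FMA_F32s), this keeps only the first COPY and makes the second FMA read %vreg12 instead of %vreg15.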

Also, if I modify the operand legalization to insert the v_mov_b32_e32 instruction directly, it is correctly CSE’d. However, I would expect inserting a COPY to be preferable, since it allows the PeepholeOptimizer and other passes to optimize the copies. Why is this restriction there? Would it be possible to fix MachineCSE to support copies and add a target option for them? There might not be a reason to avoid emitting the v_mov_b32 right away, but a 64-bit copy requires emitting 2 instructions, so it’s more convenient to emit the COPY and have it split later.


Hi Matt,

This is expected.

Basically, for regular copies we want them to be handled by the coalescer, which has a nicer profitability metric.
Cross-bank copies are handled by the peephole optimizer. The case you expose is a limitation of the peephole optimizer. It is not fundamentally complicated to fix; we just did not have motivating examples. The bottom line is that we should fix the peephole optimizer for your test case.

You may find more details here: