Unnecessary moves after sign-extension in 2-address target

My two-address target machine has sign-extension instructions to extend
i8->i32 and i16->i32. When I compile this simple program:

    int
    sext (unsigned a, unsigned b, int c)
    {
      return (signed char) a + (signed short) b + c;
    }

I get this IR:

    define i32 @sext(i32 %a, i32 %b, i32 %c) nounwind readnone {
    entry:
        %conv = trunc i32 %a to i8 ; <i8> [#uses=1]
        %conv1 = sext i8 %conv to i32 ; <i32> [#uses=1]
        %conv3 = trunc i32 %b to i16 ; <i16> [#uses=1]
        %conv4 = sext i16 %conv3 to i32 ; <i32> [#uses=1]
        %add = add i32 %conv1, %c ; <i32> [#uses=1]
        %add6 = add i32 %add, %conv4 ; <i32> [#uses=1]
        ret i32 %add6
    }

And this not-so-great assembler code:

sext:
    sextb r1
    mov r4,r1 ### unnecessary move
    add r4,r3
    sextw r2
    mov r1,r2 ### unnecessary move
    add r1,r4
    jmp [r30]

Which should be this:

sext:
    sextb r1
    add r1,r3
    sextw r2
    add r1,r2
    jmp [r30]

The debug output from LLVM shows this:

********** REWRITING TWO-ADDR INSTRS **********
********** Function: sext
    %reg1028<def> = sextb_r %reg1025<kill>
        prepend: %reg1028<def> = mov_rr %reg1025<kill>
        rewrite to: %reg1028<def> = sextb_r %reg1028
...
    %reg1030<def> = sextw_r %reg1026<kill>
        prepend: %reg1030<def> = mov_rr %reg1026<kill>
        rewrite to: %reg1030<def> = sextw_r %reg1030

Because sextb_r and sextw_r have destination tied to source operands,
TwoAddressInstructionPass thinks it needs a copy. However, since the
sext kills its source, the copy is unnecessary. Why does this happen?
Is TwoAddressInstructionPass relying on a later pass to notice this and
transform it again?

G

Yes, the later pass is the coalescer. It would be worthwhile to
understand why it is not coalescing the copies.

Dan

Greg McGary wrote:

********** REWRITING TWO-ADDR INSTRS **********
********** Function: sext
    %reg1028<def> = sextb_r %reg1025<kill>
        prepend: %reg1028<def> = mov_rr %reg1025<kill>
        rewrite to: %reg1028<def> = sextb_r %reg1028
...
    %reg1030<def> = sextw_r %reg1026<kill>
        prepend: %reg1030<def> = mov_rr %reg1026<kill>
        rewrite to: %reg1030<def> = sextw_r %reg1030

Because sextb_r and sextw_r have destination tied to source operands,
TwoAddressInstructionPass thinks it needs a copy. However, since the
sext kills its source, the copy is unnecessary. Why does this happen?
Is TwoAddressInstructionPass relying on a later pass to notice this and
transform it again?
  
Since no one responded, I'll assume that this situation doesn't ring a
bell with anyone, so I'll need to dig deeper...

G

Greg McGary wrote:

Since no one responded, I'll assume that this situation doesn't ring a
bell with anyone, so I'll need to dig deeper...
  
Sorry, I wrote this too soon--I just saw Dan's response. Thanks! I'm examining the coalescer now.

G

Dan Gohman wrote:

  

Because sextb_r and sextw_r have destination tied to source operands,
TwoAddressInstructionPass thinks it needs a copy. However, since the
sext kills its source, the copy is unnecessary. Why does this happen?
Is TwoAddressInstructionPass relying on a later pass to notice this and
transform it again?
    
Yes, the later pass is the coalescer. It would be worthwhile to
understand why it is not coalescing the copies.
  
I discovered a curious phenomenon:

The copies are necessary because TwoAddressInstructionPass commutes
the second add. When I suppress the commute, the movs disappear and
the code became optimal. It seems the two-address commuter is either buggy
or inherently short-sighted/simple-minded and paints itself into a corner.

How do you recommend I approach this problem?

G

Dan Gohman wrote:

Because sextb_r and sextw_r have destination tied to source operands,
TwoAddressInstructionPass thinks it needs a copy. However, since the
sext kills its source, the copy is unnecessary. Why does this happen?
Is TwoAddressInstructionPass relying on a later pass to notice this
and
transform it again?

Yes, the later pass is the coalescer. It would be worthwhile to
understand why it is not coalescing the copies.

I discovered a curious phenomenon:

The copies are necessary because TwoAddressInstructionPass commutes
the second add. When I suppress the commute, the movs disappear and
the code became optimal. It seems the two-address commuter is either buggy
or inherently short-sighted/simple-minded and paints itself into a corner.

How do you recommend I approach this problem?

You can try to identify the problem in 2addr pass and try to provide us with a patch. Or file a bugzilla with a reproducible test case.

Evan

Evan Cheng wrote:

The copies are necessary because TwoAddressInstructionPass commutes
the second add. When I suppress the commute, the movs disappear and
the code became optimal. It seems the two-address commuter is either buggy
or inherently short-sighted/simple-minded and paints itself into a corner.

How do you recommend I approach this problem?
    
You can try to identify the problem in 2addr pass and try to provide us with a patch. Or file a bugzilla with a reproducible test case.
  
OK. I was really fishing for strategic direction: e.g., caveats with the heuristics/algorithms
we use now. Of course, if such existed, I expect it would be present as comments, so it was
probably too much to hope for... I'll see if I can get the x86 target to trip on this bug. My
new port won't be suitable for a bug report, since it's an odd-ball proprietary CPU that will
probably never be distributed outside the company.

G

Greg McGary wrote:

I discovered a curious phenomenon:

The copies are necessary because TwoAddressInstructionPass commutes
the second add. When I suppress the commute, the movs disappear and
the code became optimal. It seems the two-address commuter is either buggy
or inherently short-sighted/simple-minded and paints itself into a corner.
  
I have some understanding of what goes no here. One of the cases where
TwoAddressInstructionPass commutes is when it can shorten the live range
of a src/dest operand. Maybe this is good for x86, but is bad for my
CPU which has 32 GPRs, so doesn't need to sacrifice in order to reduce
register pressure. Actually, in my small tests I haven't seen x86 suffer
from removing this class of commutations. I'll see how things go for
x86 on some larger things in test-suite before I determine whether my
change works for x86. If it loses for x86, it still wins for my port,
so I can make this commutation case depend on a parameter of the target.

G