Possible missed optimization?

Hello, while testing trivial functions in my backend I noticed a suboptimal register assignment pattern. Consider the following function:

typedef unsigned short t;
t foo(t a, t b)
{
    t a4 = b ^ a ^ 18;
    return a4;
}

Argument “a” is passed in R15:R14 and argument “b” is passed in R13:R12, the return value is stored in R15:R14.

Producing the following asm code:
xor r15, r13 ; xor top part
mov r8, r14 ; copy bottom part of a to a scratch reg
xor r8, r12 ; xor bottom part
movi r14, 18
xor r14, r8 ; xor bottom part with imm value

However it could have produced:
xor r15, r13
xor r14, r12
movi r8, 18
xor r14, r8

This saves the first move instruction by loading the immediate into any scratch register instead of into r14, which is used both as an incoming register and for the return value. Is this a missed optimization in LLVM, or did I miss something?
Changing the register allocation order didn't work.

Thanks

Only by commuting the xor operands. Have you marked xor as commutable? See the other targets for how to do that.

Indeed, I've marked it as commutable:

let isCommutable = 1,
    isTwoAddress = 1 in
def XORRdRr : FRdRr<0b0010, 0b01,
                    (outs GPR8:$dst),
                    (ins GPR8:$src1, GPR8:$src2),
                    "xor\t$dst, $src2",
                    [(set GPR8:$dst, (xor GPR8:$src1, GPR8:$src2))]>;

I've noticed this pattern happening with other operators as well, but used xor in this example. As I said before, I tried different register allocation orders, but it always produces the same result. GCC emits longer code, but since LLVM is so close to the optimal code sequence I wanted to reach it.

Hello


What are the patterns for xor / mov ?

The pattern for copying an immediate integer to a register is:

let isReMaterializable = 1,
    isAsCheapAsAMove = 1 in
def MOVIRdK : FRdK<0b1110,
                   (outs GPR8:$dst),
                   (ins i8imm:$src),
                   "movi\t$dst, $src",
                   [(set GPR8:$dst, imm:$src)]>;

The xor pattern is:

let isCommutable = 1,
    isTwoAddress = 1 in
def XORRdRr : FRdRr<0b0010, 0b01,
                    (outs GPR8:$dst),
                    (ins GPR8:$src1, GPR8:$src2),
                    "xor\t$dst, $src2",
                    [(set GPR8:$dst, (xor GPR8:$src1, GPR8:$src2))]>;

Could it be that an optimization was actually missed?
It seems some sort of look-ahead optimization is happening, because the
constant value was moved into r14 when the optimal choice was to leave
r14 alone and keep the variable there. All these uses of R14 with
different values are confusing. My mental SSA form looks like:

t1 = b_high ^ a_high
t2 = b_low ^ a_low
t3 = 18
t4 = t2 ^ t3
return t1, t4

Now, when the registers are allocated, t4 attempts to reuse a register
already used by either t2 or t3. However, a constant value is in t3.
The allocator may reuse t3's register instead of t2's because it
doesn't know that t3 holds a constant. Thus it inserts a move to
preserve t2's operand and reuses t3's register. From the allocator's
standpoint, consider:

1: t1 = b_high ^ a_high ; b_high and a_high are dead, so t1 may reuse either register.
2: t2 = b_low ^ a_low ; b_low and a_low are also dead; t2 may reuse either.
3: t3 = 18 ; 18 may go in any register.
4: t4 = t2 ^ t3 ; t4 may reuse t2 or t3.
5: return t1, t4

The optimal choice at (1) is of course a_high's old register. This is
a simple choice. The choice for t4 is harder because it may reuse t2
or t3. At first glance, a full enumerative tree search over these
choices would be exponential. However, I don't know much of the theory
behind reusing registers like this.

Thanks,
Jeff Kunkel

In LLVM, copies are coalesced away as much as possible before registers are allocated, so the allocation order wouldn't affect it.

Try looking at the output of -debug-only=regcoalescing to see what is going wrong.

I've compiled the program with -debug-only=regcoalescing and looked at the output, which I've attached to this email. It's pretty advanced, so I understand part of it, but I get lost with all the register live ranges. The instructions are:

44 %R15 = XORRdRr %R15, %R13
52 %reg1029 = MOVRdRr %R14
60 %reg1029 = XORRdRr %reg1029, %R12
68 %R14 = MOVIRdK 18
76 %R14 = XORRdRr %R14, %reg1029

I guess the MOVIRdK at 68 should have been assigned a virtual reg instead of the MOVRdRr at 52, leaving reg1029 for later allocation.
If there is any more debug info I can attach, please let me know.

Thanks.

reg.txt (2.7 KB)

If you want to take a look at this yourself, the issue is easy to
reproduce with Thumb1:
$ cat > test.c
typedef unsigned long long t;
t foo(t a, t b)
{
    t a4 = b^a^18;
    return a4;
}
$ clang -cc1 -triple thumbv5-u-u -S -O2 test.c -o -
[...]
  eors r1, r3
  mov r3, r0
  eors r3, r2
  movs r0, #18
  eors r0, r3
  bx lr
[...]

-Eli

Thanks, Eli. Nice catch!

This IR:

target triple = "thumbv5-u-u"

define arm_aapcscc i64 @foo(i64 %a, i64 %b) nounwind readnone {
entry:
%xor = xor i64 %a, 18 ; [#uses=1]
%xor2 = xor i64 %xor, %b ; [#uses=1]
ret i64 %xor2
}

produces these instructions before coalescing:

4L %reg16387 = COPY %R3
12L %reg16386 = COPY %R2
28L %reg16384 = COPY %R0
36L %reg16388 = COPY %reg16385
44L %reg16388, %CPSR<def,dead> = tEOR %reg16388, %reg16387, pred:14, pred:%reg0
56L %reg16389 = COPY %reg16384
64L %reg16389, %CPSR<def,dead> = tEOR %reg16389, %reg16386, pred:14, pred:%reg0
76L %reg16390, %CPSR<def,dead> = tMOVi8 18, pred:14, pred:%reg0
88L %reg16391 = COPY %reg16390
96L %reg16391, %CPSR<def,dead> = tEOR %reg16391, %reg16389, pred:14, pred:%reg0
108L %R0 = COPY %reg16391
116L %R1 = COPY %reg16388
128L tBX_RET %R0<imp-use,kill>, %R1<imp-use,kill>

and after:

44L %R1, %CPSR<def,dead> = tEOR %R1, %R3, pred:14, pred:%reg0
56L %reg16389 = COPY %R0
64L %reg16389, %CPSR<def,dead> = tEOR %reg16389, %R2, pred:14, pred:%reg0
76L %R0, %CPSR<def,dead> = tMOVi8 18, pred:14, pred:%reg0
96L %R0, %CPSR<def,dead> = tEOR %R0, %reg16389, pred:14, pred:%reg0
128L tBX_RET %R0<imp-use,kill>, %R1<imp-use,kill>

We see, as Borja pointed out, that %R0 from the 108L COPY has been joined with %reg16391 and %reg16390 so it is too late to commute the xor.

Passing -disable-physical-join to prevent the %R0 sabotage, we get:

4L %reg16387 = COPY %R3; tGPR:%reg16387
12L %reg16386 = COPY %R2; tGPR:%reg16386
20L %reg16388 = COPY %R1; tGPR:%reg16388
28L %reg16389 = COPY %R0; tGPR:%reg16389
44L %reg16388, %CPSR<def,dead> = tEOR %reg16388, %reg16387, pred:14, pred:%reg0; tGPR:%reg16388,16387
64L %reg16389, %CPSR<def,dead> = tEOR %reg16389, %reg16386, pred:14, pred:%reg0; tGPR:%reg16389,16386
76L %reg16391, %CPSR<def,dead> = tMOVi8 18, pred:14, pred:%reg0; tGPR:%reg16391
96L %reg16391, %CPSR<def,dead> = tEOR %reg16391, %reg16389, pred:14, pred:%reg0; tGPR:%reg16391,16389
108L %R0 = COPY %reg16391; tGPR:%reg16391
116L %R1 = COPY %reg16388; tGPR:%reg16388
128L tBX_RET %R0<imp-use,kill>, %R1<imp-use,kill>

It is not easy to see here that the 96L tEOR should be commuted. You would have to notice that the hints for %reg16389 and %reg16391 are clashing.

After register allocation with hinting it becomes:

%R1, %CPSR<def,dead> = tEOR %R1, %R3, pred:14, pred:%reg0
%R0, %CPSR<def,dead> = tEOR %R0, %R2, pred:14, pred:%reg0
%R2, %CPSR<def,dead> = tMOVi8 18, pred:14, pred:%reg0
%R2, %CPSR<def,dead> = tEOR %R2, %R0, pred:14, pred:%reg0
%R0 = COPY %R2
tBX_RET %R0<imp-use,kill>, %R1<imp-use,kill>

There are two fundamental deficiencies here:

  1. The coalescer is not very good at handling conflicting joins. The examples show that different orders of joining can give different results. The coalescer uses heuristics to pick an order. It doesn’t try to find an optimal order.

  2. Commuting two-address instructions is not really integrated into the coalescer algorithm. It is more of an afterthought, calling RemoveCopyByCommutingDef when a copy could otherwise not be removed.

/jakob

Great analysis, Jakob. Should this be reported as a missed optimization, or will you handle it yourself, as you're the register allocator expert here?

Please file a PR with the Thumb1 code. I don't know an easy fix.

/jakob

Filed http://llvm.org/bugs/show_bug.cgi?id=8112 (in case you want to track it).
