Missing optimization - constant parameter

For the little C test program where a constant is stored in memory and also
used as a parameter:

#include <stdint.h>
uint64_t val, *p;
extern uint64_t xtr(uint64_t);

uint64_t caller() {
  uint64_t x;
  p = &val;
  x = 12345123400L;
  *p = x;
  return xtr(x);
}

clang (3.2, 3.3, and svn trunk) generates the following x86-64 code at -O3:

caller:
        movq    $val, p(%rip)
        movabsq $12345123400, %rax      # imm = 0x2DFD3A248
        movq    %rax, val(%rip)
        movabsq $12345123400, %rdi      # imm = 0x2DFD3A248
        jmp     xtr                     # TAILCALL

Notice that the constant value is loaded twice, once into %rax
and again into %rdi as the first parameter for the (tail) call.

There is no need for the second constant load: the value is already in
a register. If %rdi had been chosen as the target of the first load,
no second instruction would be needed at all.

gcc (4.6+) generates better code:

caller:
        movabsq $12345123400, %rdi
        movq    $val, p(%rip)
        movq    %rdi, val(%rip)
        jmp     xtr

I expected that this optimization would be picked
up by a CSE, GVN, MachineCSE, or even a peephole pass.

Comments?

At the LLVM IR level this is represented as

define i64 @caller() #0 {
entry:
  store i64* @val, i64** @p, align 8, !tbaa !0
  store i64 12345123400, i64* @val, align 8, !tbaa !3
  %call = tail call i64 @xtr(i64 12345123400) #2
  ret i64 %call
}

This is probably the best representation to have at this relatively high level; the constant appears only as an immediate operand of the store and of the call, so there is no separate instruction for IR-level CSE or GVN to eliminate.

At the machine level, it looks like the register coalescer is
duplicating the constant. It transforms

0B      BB#0: derived from LLVM BB %entry
16B     %vreg0<def> = MOV64rm %RIP, 1, %noreg, <ga:@val>[TF=5], %noreg; mem:LD8[GOT] GR64:%vreg0
32B     %vreg1<def> = MOV64rm %RIP, 1, %noreg, <ga:@p>[TF=5], %noreg; mem:LD8[GOT] GR64:%vreg1
48B     MOV64mr %vreg1, 1, %noreg, 0, %noreg, %vreg0; mem:ST8[@p](tbaa=!"any pointer") GR64:%vreg1,%vreg0
64B     %vreg2<def> = MOV64ri 12345123400; GR64:%vreg2
80B     MOV64mr %vreg0, 1, %noreg, 0, %noreg, %vreg2; mem:ST8[@val](tbaa=!"long long") GR64:%vreg0,%vreg2
96B     %RDI<def> = COPY %vreg2; GR64:%vreg2
112B    TCRETURNdi64 <ga:@xtr>, 0, <regmask>, %RSP<imp-use>, %RDI<imp-use,kill>

into

0B      BB#0: derived from LLVM BB %entry
16B     %vreg0<def> = MOV64rm %RIP, 1, %noreg, <ga:@val>[TF=5], %noreg; mem:LD8[GOT] GR64:%vreg0
32B     %vreg1<def> = MOV64rm %RIP, 1, %noreg, <ga:@p>[TF=5], %noreg; mem:LD8[GOT] GR64:%vreg1
48B     MOV64mr %vreg1, 1, %noreg, 0, %noreg, %vreg0; mem:ST8[@p](tbaa=!"any pointer") GR64:%vreg1,%vreg0
64B     %vreg2<def> = MOV64ri 12345123400; GR64:%vreg2
80B     MOV64mr %vreg0, 1, %noreg, 0, %noreg, %vreg2; mem:ST8[@val](tbaa=!"long long") GR64:%vreg0,%vreg2
96B     %RDI<def> = MOV64ri 12345123400
112B    TCRETURNdi64 <ga:@xtr>, 0, <regmask>, %RSP<imp-use>, %RDI<imp-use,kill>

I am not sure why. Maybe this should be delayed until the register
allocator, which can split the range if it cannot assign %rdi to %vreg2?

Jakob, should I open a bug?

Cheers,
Rafael

MachineCSE skips cheap instructions on purpose:

  // Heuristics #1: Don't CSE "cheap" computation if the def is not local or in
  // an immediate predecessor. We don't want to increase register pressure and
  // end up causing other computation to be spilled.
  if (MI->isAsCheapAsAMove()) {
    MachineBasicBlock *CSBB = CSMI->getParent();
    MachineBasicBlock *BB = MI->getParent();
    if (CSBB != BB && !CSBB->isSuccessor(BB))
      return false;
  }

This code is older than the greedy register allocator. We could delete it if somebody is willing to check for regressions in the test suite.
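As a rough illustration of the register-pressure concern behind that heuristic, here is a hypothetical C sketch (not code from this thread; clobber, a, b and example are invented names): CSE'ing a cheap immediate across basic blocks can force the value to stay live across a call, while re-materializing it in each block costs no more than a reload would.

#include <stdint.h>

extern void clobber(void);   /* stands in for any call that clobbers caller-saved registers */
extern uint64_t a, b;

void example(int cond1, int cond2) {
  a = 12345123400UL;         /* constant materialized in the entry block */
  if (cond1)
    clobber();               /* intervening call */
  if (cond2)
    b = 12345123400UL;       /* CSE'ing this with the entry-block copy would keep
                                the constant live across the call (and likely spill
                                something); re-materializing it here is just as cheap */
}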

I thought we had a PR about this already, but I can’t find it now.

Thanks,
/jakob

Are you sure that’s it? I commented that block out, rebuilt LLVM 3.3, and it still duplicates the constant.

My concern is that long constant loads increase code size, so avoiding them through better register targeting would be a win. My project’s application of LLVM tends to use a lot of long constants, so this could be a significant optimization.
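(On x86-64, for instance, a movabsq with a 64-bit immediate encodes in 10 bytes while a register-to-register movq is only 3, so each avoided re-materialization of the constant saves 7 bytes.)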

I’ll do some more debugging now that you have pointed me in the right direction.

thanks

/maurice

It is also possible that the coalescer is duplicating the instruction, as Rafael suggested. It will do that if it can't eliminate a copy; in the dumps above, that is how the %RDI<def> = COPY %vreg2 at 96B became %RDI<def> = MOV64ri 12345123400.

/jakob

I went back to this problem after looking at some other things. With debugging turned on, I noticed that the register coalescer is trying to do the “right thing” and merge %vreg0 (the earlier load of the constant) with the first parameter register of the call (%rdi), which is exactly what gcc does in this case, eliminating the second constant load. But it is refusing to do the merge:

Here’s the debug output for that part of the compilation:

********** SIMPLE REGISTER COALESCING **********
********** Function: caller
********** JOINING INTERVALS ***********
entry:
64B %RDI = COPY %vreg0; GR64:%vreg0
Considering merging %vreg0 with %RDI
Can only merge into reserved registers.
Remat: %RDI = MOV64ri 12345123400
Shrink: [32r,64r:0) 0@32r
Shrunk: [32r,48r:0) 0@32r
Trying to inflate 0 regs.
********** INTERVALS **********
%vreg0 = [32r,48r:0) 0@32r
RegMasks: 80r

Jakob, what does “can only merge into reserved registers” mean in this instance? I don’t see any reason for it not to do the merge.

The coalescer won’t merge a virtual register with an allocatable physical register, because that would extend the live range of the physical register and constrain register allocation.

Instead, the register allocator will attempt to choose %rdi for %vreg0 so the copy can be eliminated after register allocation.

/jakob
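To see why that matters, here is a hedged C sketch (helper, val2 and caller2 are invented for illustration; it assumes the SysV x86-64 convention, where the first integer argument is passed in %rdi): if the virtual register holding the constant were coalesced into %rdi at its definition, %rdi would be pinned across an intervening call that needs it for its own argument, whereas a hint lets the allocator pick %rdi only when it is actually free.

#include <stdint.h>

extern uint64_t helper(uint64_t);
extern uint64_t xtr(uint64_t);
uint64_t val2;

uint64_t caller2(uint64_t n) {
  uint64_t x = 12345123400UL;  /* coalescing x into %rdi here would pin %rdi... */
  val2 = helper(n);            /* ...across this call, which also passes its
                                  argument in %rdi and clobbers it */
  return xtr(x);               /* the %rdi preference only matters at this call */
}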

I reported http://llvm.org/pr17784 to make sure we track this.