Scheduling quirks

Hello all!

When I compile the following more or less stupid functions with
   clang++ -O3 -S test.cpp
===>
int test_register(int x) {
   x ^= (x >> 2);
   x ^= (x >> 3);
   x = x ^ (x >> 4);
   int y = x; x >>= 5; x ^= y; // almost the same but explicit
   return x;
   }

int test_scheduler(int x) {
   return ((x>>2) & 15) ^ ((x>>3) & 31);
   }
<===
...I get the following result:
===>
  .file "test.cpp"
  .text
  .globl _Z13test_registeri
  .align 16, 0x90
  .type _Z13test_registeri,@function
_Z13test_registeri: # @_Z13test_registeri
  .cfi_startproc
# BB#0: # %entry
  movl %edi, %eax
  sarl $2, %eax
  xorl %edi, %eax
  movl %eax, %ecx
  sarl $3, %ecx
  xorl %eax, %ecx
  movl %ecx, %edx
  sarl $4, %edx
  xorl %ecx, %edx
  movl %edx, %eax
  sarl $5, %eax
  xorl %edx, %eax
  retq
.Ltmp0:
  .size _Z13test_registeri, .Ltmp0-_Z13test_registeri
  .cfi_endproc

  .globl _Z14test_scheduleri
  .align 16, 0x90
  .type _Z14test_scheduleri,@function
_Z14test_scheduleri: # @_Z14test_scheduleri
  .cfi_startproc
# BB#0: # %entry
  movl %edi, %eax
  shrl $2, %eax
  andl $15, %eax
  shrl $3, %edi
  andl $31, %edi
  xorl %eax, %edi
  movl %edi, %eax
  retq
.Ltmp1:
  .size _Z14test_scheduleri, .Ltmp1-_Z14test_scheduleri
  .cfi_endproc

  .ident "clang version 3.5 (trunk 199507)"
  .section ".note.GNU-stack","",@progbits
<===

Now once more in detail.

The lines
   x ^= (x >> 2);
and
   x = x ^ (x >> 4);
and (!)
   int y = x; x >>= 5; x ^= y; // almost the same but explicit
are compiled into code like
  movl %edi, %eax
  sarl $2, %eax
  xorl %edi, %eax
As far as I know, on all x86 processors except the very latest 4th-generation Intel Core (which can eliminate the register move during renaming), the following variant is better (2 cycles instead of 3; I verified this on e.g. an Intel i7 920), because the first two instructions can be executed simultaneously:
  movl %edi, %eax
  sarl $2, %edi # modify source instead of copy
  xorl %edi, %eax
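To see the difference, one can put each sequence into a dependency chain and time it; the following is only a rough sketch (the helper names, the iteration count and the rdtsc-based timing are illustrative, not my actual test harness):
===>
#include <cstdio>
#include <x86intrin.h> // __rdtsc

// clang's sequence: sarl has to wait for the movl, so each step takes
// 3 cycles (unless the processor eliminates the move).
static inline int step_copy(int x) {
   int r;
   asm("movl %1, %0\n\t"
       "sarl $2, %0\n\t"
       "xorl %1, %0"
       : "=&r"(r) : "r"(x) : "cc");
   return r;
}

// suggested sequence: movl and sarl are independent, so each step takes 2 cycles
static inline int step_source(int x) {
   int r;
   asm("movl %1, %0\n\t"
       "sarl $2, %1\n\t"
       "xorl %1, %0"
       : "=&r"(r), "+r"(x) : : "cc");
   return r;
}

int main() {
   const long iters = 100000000; // arbitrary
   int x = 0x12345678;
   unsigned long long t0 = __rdtsc();
   for (long i = 0; i < iters; ++i)
      x = step_copy(x); // swap in step_source to compare
   unsigned long long t1 = __rdtsc();
   std::printf("%.2f cycles per step (x=%d)\n", (double)(t1 - t0) / iters, x);
   return 0;
}
<===
Both helpers compute x ^ (x >> 2), so the final value of x is identical either way; only the measured cycles per step differ.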
Is there a special reason for doing it this way?
Interestingly, most compilers, including ICC and GCC, show this strange behavior. I reported this on an Intel forum as well as for GCC a long time ago, but there has been no real reaction...
Also, why are 4 registers used when 2 would suffice?

In the second function the line
   return ((x>>2) & 15) ^ ((x>>3) & 31);
is compiled into
  movl %edi, %eax
  shrl $2, %eax
  andl $15, %eax
  shrl $3, %edi
  andl $31, %edi
  xorl %eax, %edi
  movl %edi, %eax
I would have expected the scheduler to interleave the two subexpressions and to get rid of the final move, like this:
  movl %edi, %eax
  shrl $3, %edi # modify source instead of copy, see above
  shrl $2, %eax
  andl $31, %edi
  andl $15, %eax
  xorl %edi, %eax # we need %eax here

I think this is independent of the high-level language used.
Is this known to the LLVM community?
May I help to correct this?

Best regards
Jasper

In your example, we're copying from/to argument/return registers, hence the extra copies. Normally, it is the register coalescer's job to remove copies, but:
- our register coalescer is not always very smart
- in this case, it intentionally leaves the physreg (arg/return) copies in place to give the register allocator the most freedom.

The register allocator attempts to remove physreg copies by biasing allocation, but the decisions are subject to the order of allocation which really amounts to luck in cases like this.
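
To make the physreg copies concrete, the machine code for test_scheduler right after instruction selection looks roughly like this (a heavily simplified sketch, not the exact output of llc -print-machineinstrs, and the vreg numbers are made up):

  %vreg0 = COPY %EDI # incoming argument copy
  %vreg1 = SHR32ri %vreg0, 2
  %vreg2 = AND32ri %vreg1, 15
  %vreg3 = SHR32ri %vreg0, 3
  %vreg4 = AND32ri %vreg3, 31
  %vreg5 = XOR32rr %vreg2, %vreg4
  %EAX = COPY %vreg5 # outgoing return-value copy
  RET

Those two COPYs are what the coalescer intentionally leaves alone; they only become no-ops if the allocator ends up putting %vreg0 in EDI and %vreg5 in EAX, which is the biasing I mentioned.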

We tend to not introduce complexity to optimize special cases involving multiple physreg copies because
- we care more about code within loops
- the performance impact of register copies is very small

In these cases, I think it comes down to luck given the allocation order that we choose within a block (I’m not sure how scheduling could affect it). Do you still see the extra copies on LLVM trunk?

-Andy

Hello Andrew!

Referring:
int test_scheduler(int x) {
  return ((x>>2) & 15) ^ ((x>>3) & 31);
  }
=>
  movl %edi, %eax
  shrl $2, %eax
  andl $15, %eax
  shrl $3, %edi
  andl $31, %edi
  xorl %eax, %edi
  movl %edi, %eax
instead of
  movl %edi, %eax
  shrl $3, %edi # modify source instead of copy
  shrl $2, %eax # modifications interlaced
  andl $31, %edi
  andl $15, %eax
  xorl %edi, %eax # we need %eax here

> In your example, we're copying from/to argument/return registers,
> hence the extra copies.

This might be the reason for the final move.
However, a simple peephole optimizer could catch this case.
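
Just to make explicit what kind of peephole I mean, here is a toy sketch in plain C++ (not LLVM code; it models instructions as strings and ignores implicit register uses such as the return value in retq, so it only illustrates the pattern):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model, not LLVM data structures: an instruction is "op src, dst" in
// AT&T order; two-address ops read and write dst, a plain mov only writes it.
struct Instr { std::string op, src, dst; };

static bool readsReg(const Instr& in, const std::string& reg) {
   return in.src == reg || (in.op != "movl" && in.dst == reg);
}

// Rewrite "xorl A, B ; movl B, A" into "xorl B, A" when B is dead afterwards;
// xor is commutative, so retargeting its result register is safe.
static void peephole(std::vector<Instr>& code) {
   for (std::size_t i = 0; i + 1 < code.size(); ++i) {
      Instr& op = code[i];
      const Instr& mv = code[i + 1];
      if (op.op != "xorl" || mv.op != "movl") continue;
      if (mv.src != op.dst || mv.dst != op.src) continue;
      bool dead = true; // is op.dst unused afterwards?
      for (std::size_t j = i + 2; j < code.size(); ++j)
         if (readsReg(code[j], op.dst)) { dead = false; break; }
      if (!dead) continue;
      std::swap(op.src, op.dst); // xorl %eax,%edi -> xorl %edi,%eax
      code.erase(code.begin() + i + 1); // the movl is now redundant
   }
}

Applied to the tail of test_scheduler, this turns "xorl %eax, %edi; movl %edi, %eax" into "xorl %edi, %eax" and drops the move.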

> [...] Do you still see the extra copies on LLVM trunk?

I retested this with a fresh checkout and build of trunk (svn r199769) and got the same result.

BTW: Providing e.g. -march=atom helps a bit by producing the expected interleaving for this example, but IMHO this ought to happen even without explicitly specifying a processor type.

This missed optimization (the compiler modifies the copy instead of the source register, see above) occurs quite often and costs a cycle on almost all x86 processors.

Best regards
Jasper

I have seen many cases where we could eliminate copies. However, I haven’t thought of a very simple fix that would handle the general problem, and I haven’t been motivated to add complexity to handle specific cases. Keep in mind that physreg copies, as in your case, are handled totally differently from vreg copies. Reducing the instruction count is always good; however, a register copy is a small instruction, and I *think* it is usually zero cycles on the last few generations of Intel processors.

Anyway, feel free to file a bug and offer suggested fixes.

If I wanted to solve this particular case, the first thing I would do is figure out why the register allocator isn’t biasing the xorl def to %eax. I think it’s because local allocation order is top-down. It also doesn’t realize that biasing the incoming arguments is useless because of the interference, so it may back itself into a corner.
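
On a build with assertions enabled, something like

  llc -O3 -debug-only=regalloc test.ll

dumps the greedy allocator's decisions per virtual register, which should show where that bias gets lost. (The exact invocation is from memory; treat it as a starting point.)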

It’s possible that we could add copy-eliminating peepholes after regalloc. I don’t know that it is a good idea and we would have to show pretty compelling performance results to add the complexity.

-Andy