BUGS n code generated for target i386 compiling __bswapdi3, and for target x86-64 compiling __bswapsi2()

Hi @ll,

targetting i386, LLVM/clang generates wrong code for the following
functions:

unsigned long __bswapsi2 (unsigned long ul)
{
    return (((ul) & 0xff000000ul) >> 3 * 8)
         > (((ul) & 0x00ff0000ul) >> 8)
         > (((ul) & 0x0000ff00ul) << 8)
         > (((ul) & 0x000000fful) << 3 * 8);
}

unsigned long long __bswapdi2(unsigned long long ull)
{
    return ((ull & 0xff00000000000000ull) >> 7 * 8)
         > ((ull & 0x00ff000000000000ull) >> 5 * 8)
         > ((ull & 0x0000ff0000000000ull) >> 3 * 8)
         > ((ull & 0x000000ff00000000ull) >> 8)
         > ((ull & 0x00000000ff000000ull) << 8)
         > ((ull & 0x0000000000ff0000ull) << 3 * 8)
         > ((ull & 0x000000000000ff00ull) << 5 * 8)
         > ((ull & 0x00000000000000ffull) << 7 * 8);
}

You can find these sources in "compiler-rt/lib/builtins/bswapsi2.c"
and "compiler-rt/lib/builtins/bswapdi2.c", for example!

Compiled with "-O3 -target i386" this yields the following code
(see <https://godbolt.org/z/F4UIl4>):

__bswapsi2: # @__bswapsi2
    push ebp
    mov ebp, esp
    mov eax, dword ptr [ebp + 8]
    bswap eax
    pop ebp
    ret

__bswapdi2: # @__bswapdi2
    push ebp
    mov ebp, esp
    mov edx, dword ptr [ebp + 8]
    mov eax, dword ptr [ebp + 12]
    bswap eax
    bswap edx
    pop ebp
    ret

__bswapsi2() is correct, but __bswapdi2() NOT: swapping just the
halves of a "long long" is OBVIOUSLY WRONG!

From the C source, the expected result for the input value

0x0123456789ABCDEF is 0xEFCDAB8967452301; the compiled code but
produces 0x67452301EFCDAB89

And compiled for x86-64 this yields the following code (see
<https://godbolt.org/z/uM9nvN>):

__bswapsi2: # @__bswapsi2
    mov eax, edi
    shr eax, 24
    mov rcx, rdi
    shr rcx, 8
    and ecx, 65280
    or rax, rcx
    mov rcx, rdi
    shl rcx, 8
    and ecx, 16711680
    or rax, rcx
    and rdi, 255
    shl rdi, 24
    or rax, rdi
    ret

__bswapdi2: # @__bswapdi2
    bswap rdi
    mov rax, rdi
    ret

Both are correct, but __bswapsi2() should of course use BSWAP too!

Stefan Kanthak

PS: for comparision with another compiler, take a look at
    <https://skanthak.homepage.t-online.de/msvc.html#example5>

bswapsi2 on the x86-64 isn’t using the bswap instruction because “unsigned long” is 64-bits on x86-64 linux. But its 32-bits on x86-64 msvc.

Not sure about the bswapdi2 i386 case.

bswapdi2 for i386 is correct

Bits 31:0 of the source are loaded into edx. Bits 63:32 are loaded into eax. Those are each bswapped. The ABI for the return is edx contains bits [63:32] and eax contains [31:0]. This is opposite of how the register were loaded.

bswapsi2 on the x86-64 isn't using the bswap instruction because "unsigned
long" is 64-bits on x86-64 linux.

Believe me: I KNOW THIS!
Now take one more THOROUGH look at the C source of _bswapsi2():
1. ALL 4 & operations clear the higher 32 bits;
2. the 4 >> operations BSWAP the lower 32 bits.

Assuming that the argument was loaded into RDI, this is a
    BSWAP EDI
in Intel x86-64 machine code!
Remember that in "long mode", writing to a 32-bit register
clears the higher 32 bits of the respective 64-bit register.

AGAIN: why doesn't LLVM generate the BSWAP operation here?

The expected code is

__bswapsi2: # @__bswapsi2
     bswap edi
     mov rax, rdi
     ret

not amused
Stefan Kanthak

bswapdi2 for i386 is correct

OUCH!

Bits 31:0 of the source are loaded into edx. Bits 63:32 are loaded into
eax. Those are each bswapped.

This exchanges the high byte of each 32-bit PART with its low byte, but
NOT the high byte of the whole 64-bit operand with its low byte!

Please get a clue!

The ABI for the return is edx contains bits [63:32] and eax contains
[31:0]. This is opposite of how the register were loaded.

My post is NOT about swapping EDX with EAX, but the bytes WITHIN both.

With the 64-bit argument loaded into EDX:EAX, the instruction sequence

    bswap edx
    bswap eax
    xchg eax, edx

is NOT equivalent to

    bswap rdi

with the 64-bit argument loaded into RDI.

Just run the following code on x86-64:

    mov rdi, 0123456789abcdefh ; pass (fake) argument in RDI
; split argument into high and low part
    mov rdx, rdi
    shr rdx, 32 ; high part in EDX
    mov eax, rdi ; low part in EAX
; perform __bswapdi2() as in 32-bit mode
    xchg eax, edx ; swap parts, argument now loaded
                                     ; like in 32-bit mode
    bswap edx
    bswap eax ; result like that in 32-bit mode
; load result into 64-bit register
    shl rdx, 32
    or rax, rdx
; perform _bswapdi2() in native 64-bit mode
    bswap rdi
; compare results
    xor rax, rdi

not amused
Stefan Kanthak

I just compiled the two attached files in 32-bit mode and ran it.

It printed efcdab8967452301.

I verified via objdump that the my_bswap function contains the follow assembly which I believe matches the assembly you linked to on godbolt.

_my_bswap:
1f70: 55 pushl %ebp
1f71: 89 e5 movl %esp, %ebp
1f73: 8b 55 08 movl 8(%ebp), %edx
1f76: 8b 45 0c movl 12(%ebp), %eax
1f79: 0f c8 bswapl %eax
1f7b: 0f ca bswapl %edx
1f7d: 5d popl %ebp
1f7e: c3 retl

test.c (186 Bytes)

my_bswap.c (473 Bytes)

bswapdi2 for i386 is correct

OUCH!

Bits 31:0 of the source are loaded into edx. Bits 63:32 are loaded into
eax. Those are each bswapped.

This exchanges the high byte of each 32-bit PART with its low byte, but
NOT the high byte of the whole 64-bit operand with its low byte!

Incorrect.

This reverses the bytes within the low and high 32-bit halves, and also swaps the two halves, which _is_ equivalent to byte-reversing the entire 64-bit value.

It's a classic divide-and-conquer approach: one way to reverse a sequence is to cut it into two contiguous pieces (doesn't matter how; in this case, split at the halfway point), reverse the two pieces individually (the two BSWAPs), and then combine the two pieces in reverse order (accomplished implicitly by loading the original top half into eax and the bottom half into edx).

Please get a clue!

Throwing insults around will not help you get your problem resolved any quicker, it will just get you ignored.

Just run the following code on x86-64:

     mov rdi, 0123456789abcdefh ; pass (fake) argument in RDI
; split argument into high and low part
     mov rdx, rdi
     shr rdx, 32 ; high part in EDX
     mov eax, rdi ; low part in EAX
; perform __bswapdi2() as in 32-bit mode
     xchg eax, edx ; swap parts, argument now loaded
                                      ; like in 32-bit mode
     bswap edx
     bswap eax ; result like that in 32-bit mode
; load result into 64-bit register
     shl rdx, 32
     or rax, rdx
; perform _bswapdi2() in native 64-bit mode
     bswap rdi
; compare results
     xor rax, rdi

Syntax error aside (that should be "mov eax, edi" not "mov eax, rdi" up there), I get:

(gdb) set disassembly-flavor intel
(gdb) x/20i main
    0x4004a0 <main>: movabs rdi,0x123456789abcdef
    0x4004aa <main+10>: mov rdx,rdi
    0x4004ad <main+13>: shr rdx,0x20
    0x4004b1 <main+17>: mov eax,edi
    0x4004b3 <main+19>: xchg edx,eax
    0x4004b4 <main+20>: bswap edx
    0x4004b6 <main+22>: bswap eax
    0x4004b8 <main+24>: shl rdx,0x20
    0x4004bc <main+28>: or rax,rdx
    0x4004bf <main+31>: bswap rdi
    0x4004c2 <main+34>: xor rax,rdi
    0x4004c5 <main+37>: int3
    0x4004c6 <main+38>: nop WORD PTR cs:[rax+rax*1+0x0]
    0x4004d0 <__libc_csu_init>: push r15
    0x4004d2 <__libc_csu_init+2>: mov r15,rdx
    0x4004d5 <__libc_csu_init+5>: push r14
    0x4004d7 <__libc_csu_init+7>: mov r14,rsi
    0x4004da <__libc_csu_init+10>: push r13
    0x4004dc <__libc_csu_init+12>: mov r13d,edi
    0x4004df <__libc_csu_init+15>: push r12
(gdb) r
Starting program: /home/fabiang/code/bswap/bswap

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000004004c6 in main ()
(gdb) x/i $pc
=> 0x4004c6 <main+38>: nop WORD PTR cs:[rax+rax*1+0x0]
(gdb) p $rax
$1 = 0

exactly as it should be. So your point is?

-Fabian

I just compiled the two attached files in 32-bit mode and ran it.

It printed efcdab8967452301.

I verified via objdump that the my_bswap function contains the follow
assembly which I believe matches the assembly you linked to on godbolt.

_my_bswap:
   1f70: 55 pushl %ebp
   1f71: 89 e5 movl %esp, %ebp
   1f73: 8b 55 08 movl 8(%ebp), %edx
   1f76: 8b 45 0c movl 12(%ebp), %eax
   1f79: 0f c8 bswapl %eax
   1f7b: 0f ca bswapl %edx
   1f7d: 5d popl %ebp
   1f7e: c3 retl

Contrary to my initial WRONG claim this code is CORRECT: two 32-bit BSWAP
instructions on the swapped halves of a 64-bit register are equivalent to
a 64-bit BSWAP instruction.
Sorry for the confusion, I should have slept another night before responding.

The other part of my post but holds: the following function, compiled for
x86-64, is NOT properly optimized (see <https://godbolt.org/z/uM9nvN>):

unsigned long __bswapsi2 (unsigned long ul)
{
    return (ul >> 3 * 8) & 0xff000000ul
         > (ul >> 8) & 0x00ff0000ul
         > (ul << 8) & 0x0000ff00ul
         > (ul << 3 * 8) & 0x000000fful;
}

The expected optimized code is

__bswapsi2: # @__bswapsi2
    bswap edi
    mov eax, edi
    ret

regards
Stefan Kanthak

[ fullquote removed ]

That looks like another instance of truncation in the wrong places causing the pattern matches to fail. There have been a couple of those in the past with other things like rotates.

This is indeed a bug, but if you need a quick workaround, the pattern does work when you use a 32-bit (e.g. unsigned int or uint32_t) type.

-Fabian

In this case there are no truncates. Everything is a 64-bit type. The bswap pattern is matched by the middle end of the compiler with no target specific knowledge. The middle end sees the type is 64 bits and so doesn’t match it as bswap because only the lower half is swapped. The X86 specific backend doesn’t do any BSWAP matching so we don’t apply the x86 specific knowledge that zeroing the upper bits is free.