__builtin_parity assembly without popcnt

Hello all!

I’ve been mucking around in an old codebase at work looking for easy performance wins. One avenue involves replacing a switch-based variable assignment with something derived from the parity of an input variable. I was pretty surprised when I saw the generated assembly, and I’m wondering about the reasoning behind it.

In short, it boils down to the assembly __builtin_parity() produces. Clang 6.0.1 (and trunk on Godbolt) produces:

parity(int):                            # @parity(int)
        mov     eax, edi
        shr     eax
        and     eax, 1431655765
        sub     edi, eax
        mov     eax, edi
        and     eax, 858993459
        shr     edi, 2
        and     edi, 858993459
        add     edi, eax
        mov     eax, edi
        shr     eax, 4
        add     eax, edi
        and     eax, 17764111
        imul    eax, eax, 16843009
        shr     eax, 24
        and     eax, 1
        ret

While GCC 8.1.0 (and trunk on Godbolt) produces:

parity(int):
        mov     eax, edi
        shr     edi, 16
        xor     eax, edi
        xor     al, ah
        setnp   al
        movzx   eax, al
        ret

I know a popcnt followed by an and would be better, but unfortunately some of my users don’t have computers that support the popcnt instruction, so I can’t use a newer -march flag.

Could someone explain why the difference between Clang and GCC here, and whether it should make a difference? The code in question is in a hot loop in my code, so I’d imagine the size difference could impact unrolling (and result in icache differences too), but I haven’t finished poking around with benchmarks.
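For reference, the function in question is essentially just a wrapper like the following (name simplified for the example); with -mpopcnt it compiles down to a popcnt plus an and, and without it you get the expansions above:

```cpp
#include <cassert>

// Parity of a 32-bit value: returns 1 if an odd number of bits
// are set, 0 otherwise. With -mpopcnt this becomes popcnt + and;
// without it the compiler falls back to the expansions shown above.
int parity(int x) {
    return __builtin_parity(x);
}
```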

Thanks,

Alex

LLVM doesn’t have any special support for computing parity, so it’s just getting lowered to “popcount(x) & 1”; if your target doesn’t have a popcount instruction, it uses the generic expansion (the usual bit-twiddling popcount). This is obviously not the fastest lowering, but computing parity is not a common operation, so nobody has spent any time optimizing it.

-Eli
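That generic expansion is the standard SWAR popcount followed by the final & 1. A sketch in plain C++ (the function name is just illustrative; the constants match the masks in the asm above, modulo demanded-bits simplification):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the target-independent lowering: SWAR popcount, then & 1.
// (parity_generic is an illustrative name, not an LLVM function.)
int parity_generic(uint32_t x) {
    x -= (x >> 1) & 0x55555555u;                       // 2-bit counts
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit counts
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit counts per byte
    x = (x * 0x01010101u) >> 24;                       // sum bytes into top byte
    return x & 1;                                      // popcount(x) & 1 == parity
}
```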

This looks (not closely checked) like it’s implementing the first part of pop1() combined with the last part of pop3(), which is suggested in the comment just after pop1().

http://www.hackersdelight.org/hdcodetxt/pop.c.txt

And then ANDing with 1, obviously.

I wonder if division is good enough on modern machines to make pop2() faster.

XORing down to a byte and then using the x86 built-in parity flag is obviously better if you are on x86, of course. Other machines don’t usually have that.
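For non-x86 targets (or just to see why folding to a byte suffices), the XOR-fold can be carried all the way down in portable code. The first two folds are exactly what GCC emits; setnp then reads the parity of the remaining byte straight out of the flags, where software has to keep folding:

```cpp
#include <cassert>
#include <cstdint>

// XOR-folding: x ^ (x >> k) preserves overall parity while halving
// the number of bits that still matter.
int parity_fold(uint32_t x) {
    x ^= x >> 16;  // parity now lives in the low 16 bits
    x ^= x >> 8;   // ...now in the low byte (GCC stops here and uses setnp)
    x ^= x >> 4;   // portable continuation: keep folding
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}
```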


Oh, using an x86-specific feature would explain a lot. It actually gives me some ideas for more possible microoptimizations…

Thanks to you (and Eli) for the explanation!

Why does GCC XOR down to a byte? Can’t you get the parity flag for the whole value by just ORing it with itself and grabbing the flag?

I thought the parity flag only reflects the low byte of the result, meaning GCC’s lowering is optimal.

So it does. Never realized that.

I have a patch to improve builtin_parity in the x86 backend. I’ll post it shortly.

Patch is up for review here: https://reviews.llvm.org/D50165