"Optimized implementations"?

<https://compiler-rt.llvm.org/index.html> boasts:

The builtins library provides optimized implementations of this
and other low-level routines, either in target-independent C form,
or as a heavily-optimized assembly.

Really?

Left: poorly performing code shipped  # Right: slightly improved code,
      in clang_rt.builtins-*          # which the optimiser REALLY
                                      # should have generated

(For reference: ___cmpdi2(a, b) returns 0 if a < b, 1 if a == b, and 2 if
a > b for signed 64-bit arguments; ___ucmpdi2 is the unsigned variant.)

___cmpdi2:
        mov ecx, [esp+16]             # mov ecx, [esp+16]
        xor eax, eax                  # xor eax, eax
        cmp [esp+8], ecx              # cmp ecx, [esp+8]
        jl @f                         # jg @f
        mov eax, 2                    # mov eax, 2
        jg @f                         # jl @f
        mov ecx, [esp+4]              #
        mov edx, [esp+12]             # mov ecx, [esp+12]
        mov eax, 0                    # xor eax, eax
        cmp ecx, edx                  # cmp ecx, [esp+4]
        jb @f                         # ja @f
        cmp edx, ecx                  #
        mov eax, 1                    #
        adc eax, 0                    # adc eax, 1
@@:                                   # @@:
        ret                           # ret

                                      # 3 instructions fewer, 10 bytes saved

___ucmpdi2:
        mov ecx, [esp+16]             # mov ecx, [esp+16]
        xor eax, eax                  # xor eax, eax
        cmp [esp+8], ecx              # cmp ecx, [esp+8]
        jb @f                         # ja @f
        mov eax, 2                    # mov eax, 2
        ja @f                         # jb @f
        mov ecx, [esp+4]              #
        mov edx, [esp+12]             # mov ecx, [esp+12]
        mov eax, 0                    # xor eax, eax
        cmp ecx, edx                  # cmp ecx, [esp+4]
        jb @f                         # ja @f
        cmp edx, ecx                  #
        mov eax, 1                    #
        adc eax, 0                    # adc eax, 1
@@:                                   # @@:
        ret                           # ret

                                      # 3 instructions fewer, 10 bytes saved
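(For context: the left-hand columns are what clang makes of the portable C
sources. Paraphrased from compiler-rt's cmpdi2.c with identifiers simplified,
so treat this as a sketch; the real source splits the operands into high/low
halves through a union. ucmpdi2.c is the same with unsigned operands.)

// Returns 0 if a < b, 1 if a == b, 2 if a > b.
int cmpdi2(long long a, long long b) {
    if (a < b) return 0;
    if (a > b) return 2;
    return 1;
}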

Now properly written code, of course branch-free, faster and shorter:
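(The posted replacement is not preserved in this copy; as a sketch of the
branch-free idea in C, my illustration assuming the 0/1/2 return contract
above, not Stefan's actual assembly:)

// Two comparisons, materialized by the compiler as setcc, so no branches
// are emitted at -O2:
//   a <  b:  0 + 0 == 0
//   a == b:  1 + 0 == 1
//   a >  b:  1 + 1 == 2
int cmpdi2_branchfree(long long a, long long b) {
    return (a >= b) + (a > b);
}
int ucmpdi2_branchfree(unsigned long long a, unsigned long long b) {
    return (a >= b) + (a > b);
}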

"Craig Topper" <craig.topper@gmail.com> wrote;

Clang never generates calls to ___paritysi2, ___paritydi2, ___cmpdi2, or
___ucmpdi2 on X86, so it's not clear the performance of this matters at all.

So you can safely remove them for X86 and all the other targets where such
unoptimized code is never called!
But fix these routines for targets where they are called.

The statement does NOT make any exceptions, and it does not say

ships unoptimized routines the compiler never calls

but

optimized target-independent implementations

Stefan

BTW: do builtins like __builtin_*parity* exist?
     If yes: do they generate the same bad code?

__builtin_parity uses setnp on older x86, and popcnt when SSE4.2 is available.
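(That claim is easy to check with a one-liner; my test case, and the exact
instruction sequence depends on the -march/-msse4.2 flags:)

// Compile with: clang -O2 -S test.c           -> setnp-based sequence
//               clang -O2 -msse4.2 -S test.c  -> popcnt + and
unsigned has_odd_parity(unsigned x) {
    return __builtin_parity(x);
}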

Turn on optimizations.

The -O0 code isn’t the same as the builtins library; it’s worse.

The code in the builtins library is written in C and can’t use __builtin_parity since it needs to compile with compilers that don’t have __builtin_parity like MSVC. So to get the optimized code, we would have to have a version written in assembly for X86. Probably two assembly versions since MASM and gcc use different assembly syntax. So we’d have 3 different versions of parity for 3 different bit widths. Maybe some macros could allow us to share some bit widths or something. But ultimately it was a question of where to spend effort.
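(For reference, the portable C version in compiler-rt folds the word down to
a nibble and indexes a 16-bit constant that serves as a parity table. Quoted
from memory from paritysi2.c with identifiers simplified, so treat the
details as approximate:)

int paritysi2(unsigned x) {
    x ^= x >> 16;                      // fold: xor preserves parity
    x ^= x >> 8;
    x ^= x >> 4;                       // parity of x == parity of low nibble
    return (0x6996 >> (x & 0xF)) & 1;  // 0x6996 = 16-entry parity table
}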

I guess you were probably asking about __builtin_parity, not the builtins library. For __builtin_parity, the frontend emits an llvm.ctpop followed by an and with 1. The backend has a peephole to pattern-match this into the parity sequence. But the peephole is part of a larger optimization pass that isn’t run at -O0, to save compile time. So the pattern doesn’t get matched and we emit the AND and an expanded popcount. To fix this correctly for -O0 we need to add an llvm.parity intrinsic that the frontend can emit directly instead of the ctpop+and. Then we wouldn’t need the peephole pattern match.
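(In C terms, what the frontend hands to the backend is equivalent to the
following; my paraphrase of the ctpop-plus-and pattern, not actual LLVM
source:)

// At -O2 a backend peephole rewrites this pattern into the setnp/popcnt
// parity sequence; at -O0 that pass doesn't run, so the popcount gets
// expanded and the AND stays.
unsigned parity(unsigned x) {
    return __builtin_popcount(x) & 1;
}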

Currently we skip the build of __udivmodti4(), __multi3(), __modti3() etc. when compiling compiler-rt with MSVC. The problem is more than just the ABI. A bunch of the C code also assumes for example that you can do >> on __uint128_t. But if the __uint128_t is a struct as would be needed for MSVC that code won’t compile. Similar for addition, subtraction, etc.
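(A concrete illustration; the struct type here is hypothetical, not actual
compiler-rt code. Once the 128-bit type has to be a struct, the generic
sources no longer compile:)

#include <stdint.h>

// Hypothetical MSVC fallback: a 128-bit integer as two 64-bit halves.
typedef struct { uint64_t low, high; } u128;

uint64_t high64(u128 a) {
    // return (uint64_t)(a >> 64);  // error: C has no >> for struct types
    return a.high;                  // shifts/adds/subs need open-coded helpers
}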

Relevant code from compiler-rt where we disable the 128-bit parts:

// MSVC doesn't have a working 128bit integer type. Users should really compile
// compiler-rt with clang, but if they happen to be doing a standalone build for
// asan or something else, disable the 128 bit parts so things sort of work.
#if defined(_MSC_VER) && !defined(__clang__)
#undef CRT_HAS_128BIT
#endif

foo.high = (foo.high << sr) | (foo.low >> (64 - sr));

instead of just

foo.all <<= sr;

Those two lines of code are slightly different. The first assumes sr to be
0-63. The second allows sr to be 0-127. So we might end up with a runtime
check on bit 6 and a cmov to handle 127-64 and 63-0 differently. Maybe the
compiler figures out from the surrounding code that bit 6 is 0, but I’d have
to check.

Sorry, the first allows sr to be 1-63. The 64 - sr makes 0 not allowed.
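(To make the difference concrete, here is a sketch of a full 0-127 shift over
a two-half representation; my illustration, assuming a struct like the one
above, not compiler-rt source. The bit-6 test is exactly the runtime check
mentioned:)

#include <stdint.h>

typedef struct { uint64_t low, high; } u128;

u128 shl128(u128 foo, unsigned sr) {   // sr in 0-127
    u128 r;
    if (sr & 64) {                     // bit 6 set: shifting by 64-127
        r.high = foo.low << (sr & 63);
        r.low  = 0;
    } else if (sr == 0) {              // the two-part form is invalid here:
        r = foo;                       // foo.low >> (64 - 0) would be UB
    } else {                           // 1-63: the quoted two-part form
        r.high = (foo.high << sr) | (foo.low >> (64 - sr));
        r.low  = foo.low << sr;
    }
    return r;
}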