And that, in turn, depends on the length of the loop. If the value of `count` is known at compile time, the compiler can tell whether performing SSE operations is "worth it".
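To illustrate the point (names and values here are illustrative, not from any code in the thread), compare a loop whose trip count is a compile-time constant with one where it is a runtime parameter:

```c
#include <stddef.h>

/* Trip count is a compile-time constant: the compiler can decide up
   front whether vectorizing pays off, and will often just unroll the
   loop entirely. */
enum { COUNT = 8 };

int sum_fixed(const int *a) {
    int s = 0;
    for (size_t i = 0; i < COUNT; i++)
        s += a[i];
    return s;
}

/* Trip count only known at runtime: the compiler may emit a wide
   vectorized path plus a scalar tail loop, even if the common case
   is tiny. */
int sum_dynamic(const int *a, size_t count) {
    int s = 0;
    for (size_t i = 0; i < count; i++)
        s += a[i];
    return s;
}
```

Compiling both at -O2 and inspecting the assembly shows the difference in how aggressively each gets unrolled or vectorized.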
I had a case where I was passing an array of int and a length to a
function, and clang generated a whole lot of instructions to unroll the
loop and use SSE - not realizing that the common value for `length` was
1, and that it was never bigger than some small number (16 or 32). I didn't make any
effort to improve the compiled code, as I realized it was relatively simple
to inline the whole piece of code in LLVM-IR (it was part of my Pascal
compiler project). But I believe if I had added an `assert(length < 16)` to
the code, it would have done a decent job with it. [Inlining it helps my
Pascal compiler beat the FreePascal implementation by about 2-3x on that
particular algorithm for solving sudoku - the call itself was quite an
overhead, and "not having a loop when you don't need to" helps even more.]
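A sketch of that hint, assuming the function looked roughly like a small summing loop (the names are hypothetical, not the original Pascal-compiler code): with assertions enabled, the compiler can see the length is bounded and skip generating a wide SSE-unrolled path it would never take.

```c
#include <assert.h>

/* Hypothetical reconstruction: the assert tells the optimizer that
   `length` is always small, so the big unrolled/vectorized version
   of the loop is dead weight. (With NDEBUG, __builtin_assume or an
   __builtin_unreachable() branch can convey the same promise.) */
int sum_small(const int *a, int length) {
    assert(length < 16);
    int s = 0;
    for (int i = 0; i < length; i++)
        s += a[i];
    return s;
}
```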
While Clang's code is significantly larger, that is probably on purpose: Clang has vectorized the goto-loop.
Even with `-mno-sse` (and I suspect there's a better way to inhibit vectorization, but this worked), it looks like Clang is doing something a little strange. Where gcc uses
.L7:
add eax, edi
sub edi, 1
jne .L7
Clang has
.LBB0_2:
add eax, edi
cmp edi, 1
lea ecx, [rdi - 1]
mov edi, ecx
jg .LBB0_2
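For anyone wanting to reproduce the comparison: a plausible C source for the loop in question (a reconstruction, not the original) is a goto-style countdown sum, which matches the add/decrement/branch shape in both listings.

```c
/* Plausible reconstruction of the "goto-loop" whose gcc and clang
   codegen is compared above: sum n, n-1, ..., 1. gcc emits sub/jne;
   clang emits cmp/lea/mov/jg for the same logic. */
int sum_to_n(int n) {
    int total = 0;
loop:
    total += n;
    n -= 1;
    if (n > 0) goto loop;
    return total;
}
```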
It also has a redundant `xor eax, eax` before the loop for some reason. That said...
To validate whether that was correct and a good idea, plug both results into a benchmark and look at the actual performance data.
I think that the "official" way would be to provide `-fno-vectorize`.