Clang Optimizer freaks out on "simple" goto code?

FYI

found this example while reading:
https://github.com/jameysharp/corrode/issues/30#issuecomment-231969365
and compared it with current gcc 6.2, clang 3.9

the gcc 6.2 result is quite small - clang 3.9 produces much more code
for this example

is that a missing optimization opportunity or just wrong behavior of the
optimizer?

Hi Dennis,

While Clang’s code is significantly larger, that is probably on purpose: Clang has vectorized the goto-loop.

To validate whether that was correct and a good idea, plug both results into a benchmark and look at the actual performance data.
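
For what it's worth, a minimal sketch of such a benchmark could look like the following. The function `sum_down` is only a hypothetical stand-in for the goto loop (the exact code from the linked issue is not reproduced here), and `time_sum_down` is an illustrative helper name, not something from either compiler.

```c
#include <time.h>

/* Hypothetical stand-in for the goto loop under discussion; the exact
   code from the linked issue is not reproduced here. Sums count..1. */
__attribute__((noinline)) /* GCC/Clang extension: keep a real call in the timing loop */
int sum_down(int count)
{
    int result = 0;
    if (count <= 0)
        return 0;
again:
    result += count;
    count -= 1;
    if (count != 0)
        goto again;
    return result;
}

/* Returns the CPU seconds used by `reps` calls of sum_down(count). */
double time_sum_down(int count, long reps)
{
    volatile int input = count; /* re-read on every call so the call cannot be hoisted */
    volatile int sink = 0;      /* results are stored so the calls are not dropped */
    clock_t start = clock();
    for (long i = 0; i < reps; ++i)
        sink = sum_down(input);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Building the same file once with gcc and once with clang at the same optimization level, and calling `time_sum_down` for both small and large values of `count`, should show where the larger code pays off and where it does not.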

Philip

And that, in turn, depends on the length of the loop. If the value of
`count` is known at compile time, the compiler can tell whether
using SSE operations "is worth it" or not.

I had a case where I was passing an array of int and a length to a
function, and clang generated a whole lot of instructions to unroll the
loop and use SSE, not realizing that the common value for `length` was
1 and never bigger than some small number (16 or 32). I didn't make any
effort to improve the compiled code, as I realized it was relatively simple
to inline the whole piece of code in LLVM-IR (it was part of my Pascal
compiler project). But I believe if I had added an `assert(length < 16)` to
the code, it would have done a decent job with it. [Inlining it helps my
Pascal compiler beat the FreePascal implementation by about 2-3x using that
particular algorithm for solving sudoku - the call itself was quite an
overhead, and "not having a loop when you don't need to" helps even more]
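
One way to state such a bound so that it survives in builds where `assert` is compiled out is sketched below. The function name `sum_small` and the range 1..16 are assumptions for illustration; `__builtin_unreachable` is a GCC/Clang builtin. Whether a particular compiler version actually uses the bound to skip vectorization is not guaranteed, so the generated code still needs to be checked.

```c
/* Hypothetical helper: sums `length` ints. The caller promises that
   length is in the range 1..16; any other value is undefined behavior
   because of the __builtin_unreachable() below. */
int sum_small(const int *values, int length)
{
    /* Tell the optimizer the out-of-range case cannot happen. */
    if (length < 1 || length > 16)
        __builtin_unreachable();

    int total = 0;
    for (int i = 0; i < length; ++i)
        total += values[i];
    return total;
}
```

Unlike an `assert`, this does no checking at run time, so it is only safe when the bound really holds for every caller.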

> While Clang's code is significantly larger, that is probably on purpose: Clang has vectorized the goto-loop.

Even with -mno-sse (and I suspect there's a better way to inhibit vectorization, but this worked), it looks like Clang is doing something a little strange. Where gcc uses

.L7:
  add eax, edi
  sub edi, 1
  jne .L7

Clang has

.LBB0_2:
  add eax, edi
  cmp edi, 1
  lea ecx, [rdi - 1]
  mov edi, ecx
  jg .LBB0_2

It also has a redundant xor eax, eax before the loop for some reason. That said...

> To validate whether that was correct and a good idea, plug both results into a benchmark and look at the actual performance data.

I didn't actually do this.

I’m actually not sure we handle this correctly now.

We will bail out of vectorization if we know the length precisely and it’s smaller than 16, but I don’t think we will for a bound.

> While Clang's code is significantly larger, that is probably on purpose: Clang has vectorized the goto-loop.
>
> Even with -mno-sse (and I suspect there's a better way to inhibit vectorization, but this worked), it looks like Clang is doing something a little strange.

I think that the 'official' way would be to provide -fno-vectorize
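
Besides the file-wide flag (e.g. `clang -O2 -fno-vectorize file.c`), Clang also accepts a per-loop pragma. A minimal sketch, assuming the loop is written as a `while` loop (the pragma applies to for/while/do loops, not to a hand-written goto loop) and using the hypothetical name `sum_down_scalar`:

```c
/* Hypothetical while-loop version of the summing loop, with the
   vectorizer switched off for this one loop only. */
int sum_down_scalar(int count)
{
    int result = 0;
#ifdef __clang__ /* the pragma is Clang-specific; other compilers skip it */
#pragma clang loop vectorize(disable) interleave(disable)
#endif
    while (count > 0) {
        result += count;
        --count;
    }
    return result;
}
```

This keeps vectorization enabled for the rest of the file, which may be preferable when only one loop is known to have a short trip count.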

Cheers,
  Roel