Puzzling vector optimisation inconsistency

I've been investigating performance inconsistencies of some vector code when compiled with clang.

Trying to boil it down to a minimal example, I'm puzzled by the following:

   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <time.h>

   typedef uint32_t uint32x4_t __attribute__((vector_size(16)));
   static uint8_t buffer[1 << 28];

   int main(void) {
     CLASS uint32x4_t state = { 1, 2, 3, 4 };

     for (size_t i = 0; i < sizeof(buffer); i++)
       buffer[i] = i;

     double start = (double) clock() / CLOCKS_PER_SEC;
     for (size_t j = 0; j < sizeof(buffer) - 15; j += 16) {
       /* XOR in a chunk of buffer ignoring endianness: */
       for (uint8_t i = 0; i < 16; i++)
         ((uint8_t *) &state)[i] ^= buffer[j + i];
       /* Do some random vector work on top of each chunk: */
       state = state * state | (uint32x4_t) { 5, 5, 5, 5 };
     }

     double finish = (double) clock() / CLOCKS_PER_SEC;
     printf("%0.1f MB/s: ", sizeof(buffer) / (finish - start) / (1 << 20));
     printf("%08x,%08x,%08x,%08x\n", state[0], state[1], state[2], state[3]);
     return EXIT_SUCCESS;
   }

This program runs at roughly a quarter of the speed when compiled with -DCLASS= (so the state vector is automatic) compared to when it is compiled with -DCLASS=static (so the state vector is static):

   $ clang -Wall -DCLASS= -O3 test.c -o test && ./test
   819.7 MB/s: 71453c4d,b15df5e5,64535a1d,68709dd5

   $ clang -Wall -DCLASS=static -O3 test.c -o test && ./test
   3519.1 MB/s: 71453c4d,b15df5e5,64535a1d,68709dd5

It is also fast if the state vector is moved out to global/file scope. The behaviour is the same between two different x86-64 clangs on two different OSes:

   $ clang --version
   Alpine clang version 10.0.1
   Target: x86_64-alpine-linux-musl
   Thread model: posix
   InstalledDir: /usr/bin

   $ clang --version
   Apple LLVM version 10.0.1 (clang-1001.0.46.4)
   Target: x86_64-apple-darwin18.7.0
   Thread model: posix
   InstalledDir: /Library/Developer/CommandLineTools/usr/bin

I know the inner-16 loop is silly and could be written as a single vector XOR against a cast chunk of buffer, but this is heavily boiled-down. (The real code isn't really amenable to being transformed like that. It does a lot more vector work too, so the performance effect is more subtle.)
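
For reference, the single-XOR form mentioned above might look something like the following sketch. The names xor_bytewise, xor_vector and variants_agree are made up for illustration and do not come from the real code. The memcpy is there on the assumption that the compiler turns it into one unaligned vector load, which I believe is the usual outcome at -O3 but have not verified for every target.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t uint32x4_t __attribute__((vector_size(16)));

/* Byte-wise XOR of one 16-byte chunk, as in the test program. */
uint32x4_t xor_bytewise(uint32x4_t state, const uint8_t *chunk) {
  for (uint8_t i = 0; i < 16; i++)
    ((uint8_t *) &state)[i] ^= chunk[i];
  return state;
}

/* Hypothetical single-XOR variant: copy the chunk into a vector
   temporary, then XOR the two vectors in one operation. memcpy
   sidesteps alignment and aliasing concerns. */
uint32x4_t xor_vector(uint32x4_t state, const uint8_t *chunk) {
  uint32x4_t v;
  memcpy(&v, chunk, sizeof v);
  return state ^ v;
}

/* Returns 1 if both variants produce identical bytes for a sample chunk. */
int variants_agree(void) {
  uint8_t chunk[16];
  for (int i = 0; i < 16; i++)
    chunk[i] = (uint8_t) (i * 17 + 3);
  uint32x4_t s = { 1, 2, 3, 4 };
  uint32x4_t a = xor_bytewise(s, chunk);
  uint32x4_t b = xor_vector(s, chunk);
  return memcmp(&a, &b, sizeof a) == 0;
}
```

Because XOR acts on each byte independently, the two variants give the same result regardless of endianness.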

Despite the byte-wise inner loop, the compiler does a superb job when the state vector is declared static or global, keeping it in a vector register without writing to memory at all, and optimising the XOR to a single vector operation.

Is there any way I can coax it into compiling the auto state as efficiently as the static one? Is there something I've underspecified here, so I 'get lucky' in the one case?

Many thanks in advance for any help or pointers anyone can offer.

Best wishes,


PS One thing I wonder is whether there's a cleaner way to access bytes of the vector as lvalues that will optimise more consistently, but I can't find an alternative that doesn't perform worse. For example, one experiment I tried was to replace uint32x4_t state with a union:

   typedef uint8_t uint8x16_t __attribute__((vector_size(16)));

   static union {
     uint32x4_t u32;
     uint8x16_t u8;
   } state;

and use state.u8[i] ^= ... to update the bytes instead of ((uint8_t *) &state)[i] ^= ...

But this makes it always slow (like the auto case) instead of always fast (like the static case).
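
A further variant that avoids both the pointer cast and the union is to copy the vector out to a plain byte array, update the bytes there, and copy it back. This is only a sketch with no timing claims attached: xor_via_copy and xor_twice_restores are hypothetical names, and whether clang optimises this form any more consistently than the others is an open question.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t uint32x4_t __attribute__((vector_size(16)));

/* Hypothetical helper: update the state's bytes through a local
   byte array instead of a pointer cast or a union member. */
uint32x4_t xor_via_copy(uint32x4_t state, const uint8_t *chunk) {
  uint8_t bytes[16];
  memcpy(bytes, &state, sizeof bytes);   /* vector -> bytes */
  for (int i = 0; i < 16; i++)
    bytes[i] ^= chunk[i];
  memcpy(&state, bytes, sizeof bytes);   /* bytes -> vector */
  return state;
}

/* Returns 1 if XORing the same chunk in twice restores the original state. */
int xor_twice_restores(void) {
  uint8_t chunk[16];
  for (int i = 0; i < 16; i++)
    chunk[i] = (uint8_t) (0xA0 + i);
  uint32x4_t s = { 1, 2, 3, 4 };
  uint32x4_t t = xor_via_copy(xor_via_copy(s, chunk), chunk);
  return memcmp(&s, &t, sizeof s) == 0;
}
```

Each byte stays an ordinary lvalue inside the helper, so per-byte index tricks can still be applied to bytes[] if the real code needs them.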