Clang vectorization

Hello cfe-users,

I’m trying to get clang (or GCC for that matter) to vectorize a very simple loop, and I’m wondering what I’m doing wrong. I’d rather write the loop as a loop instead of using intrinsics or the clang vector extensions, because I want the code to be portable. Pragmas and magic attributes are also undesirable, but they’re better than intrinsics.

This file is representative of what I’m trying to do. I’m compiling with -O3 -std=c99 -mavx2, but the same issues should apply for other vector settings.
"""
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
   uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

void add(a_vector *a, const a_vector *b) {
   for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}

void mac(a_vector *a, const a_vector *b) {
   const a_vector c = {{0,1,2,3,4,5,6,7}};
   for (int i=0; i<8; i++) a->limb[i] += b->limb[i] + 3*c.limb[i];
}
"""

Can someone suggest flags, pragmas, attributes etc. that would get the compiler to produce good code for these functions? I’m seeing lots of problems. For now I’m testing on the clang-3.6 release.

For starters, the compiler is unable to determine that there is no loop-carried dependency, and therefore unrolls the loop instead of vectorizing it. When given #pragma clang loop unroll(disable) vectorize(enable), it still cannot determine that there is no dependency, so it emits a runtime overlap check and branches to a scalar version if a is close to b. In fact, there cannot be a loop dependency, both because of the alignment and because the arrays are in structs: as far as I can tell, a and b either point to the same object or to non-overlapping ones, and in both cases iteration i only touches limb[i]. Furthermore, the compiler ignores the alignment hint and uses vmovdqu for everything, though maybe that doesn’t actually cost any performance.
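To be concrete about where the hint goes, here is a minimal sketch of add with the pragma attached directly to the loop. The name add_hinted is made up for this example, and compilers other than clang will ignore the pragma (possibly with a warning).

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
   uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Same loop as add(); add_hinted is a hypothetical name for this sketch.
   The clang loop pragma must sit immediately before the for statement. */
void add_hinted(a_vector *a, const a_vector *b) {
#pragma clang loop unroll(disable) vectorize(enable)
   for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}
```

Calling add_hinted(&x, &x) is well defined and simply doubles every limb, which is why I would like the a == b case to take the vector path too.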

Clang produces the vectorized code I want if a is declared __restrict__, but in the real code it is possible that a == b, so I’d rather not say __restrict__ if I don’t have to (especially since the code may be inlined, possibly causing alias analysis to break). GCC has #pragma GCC ivdep, which causes it to vectorize properly, but does Clang have any equivalent? Also, even with __restrict__ I still don’t get vmovdqa.
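For reference, the two variants mentioned above look roughly like this. The function names add_restrict and add_ivdep are invented for the sketch, and the preprocessor guard around the GCC pragma is my own addition so that clang does not see it.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
   uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Variant 1: promise that nothing else aliases *a.
   Calling this with a == b breaks that promise (undefined behaviour). */
void add_restrict(a_vector *__restrict__ a, const a_vector *b) {
   for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}

/* Variant 2: GCC-only hint that the loop has no loop-carried dependency.
   a == b is still fine here, since iteration i only touches limb[i]. */
void add_ivdep(a_vector *a, const a_vector *b) {
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
   for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}
```

The ivdep form is closer to what I actually want to say: nothing about aliasing in general, only that the iterations are independent.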

For mac, with __restrict__ (again undesirable) I get decent 2-way vectorized sse3 code, which isn’t bad I guess, but I’d rather the compiler automatically produced 4-way avx2 code. If I add #pragma clang loop unroll(disable) vectorize(enable), I get
"""
  vmovdqa mac.c(%rip), %ymm0
  vpbroadcastq .LCPI2_0(%rip), %ymm1
  vpmuludq %ymm1, %ymm0, %ymm2
  vpxor %ymm3, %ymm3, %ymm3
  vpmuludq %ymm3, %ymm0, %ymm4
  vpsllq $32, %ymm4, %ymm4
  vpaddq %ymm4, %ymm2, %ymm2
  vpsrlq $32, %ymm0, %ymm0
  vpmuludq %ymm1, %ymm0, %ymm0
  vpsllq $32, %ymm0, %ymm0
  vpaddq %ymm0, %ymm2, %ymm0
  vpaddq (%rsi), %ymm0, %ymm0
  vpaddq (%rdi), %ymm0, %ymm0
  vmovdqu %ymm0, (%rdi)
  vmovdqa mac.c+32(%rip), %ymm0
  vpmuludq %ymm1, %ymm0, %ymm2
  vpmuludq %ymm3, %ymm0, %ymm3
  vpsllq $32, %ymm3, %ymm3
  vpaddq %ymm3, %ymm2, %ymm2
  vpsrlq $32, %ymm0, %ymm0
  vpmuludq %ymm1, %ymm0, %ymm0
  vpsllq $32, %ymm0, %ymm0
  vpaddq %ymm0, %ymm2, %ymm0
  vpaddq 32(%rsi), %ymm0, %ymm0
  vpaddq 32(%rdi), %ymm0, %ymm0
  vmovdqu %ymm0, 32(%rdi)
  vzeroupper
  retq
"""
In other words, clang has failed to propagate the constants: it loads c from memory and computes 3*c.limb[i] at runtime, with each 64-bit multiply lowered to a sequence of vpmuludq, vpsllq/vpsrlq and vpaddq.
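What I would expect after constant propagation is equivalent to the hand-folded version below, where 3*{0,1,...,7} has been replaced by the table {0,3,...,21}. The names mac_folded and c3 are made up for this sketch, and I have not checked what code either compiler generates for it.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
   uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* mac() with 3*c.limb[i] evaluated by hand; mac_folded and c3 are
   hypothetical names. No multiply is left in the loop body. */
void mac_folded(a_vector *a, const a_vector *b) {
   static const a_vector c3 = {{0,3,6,9,12,15,18,21}};
   for (int i=0; i<8; i++) a->limb[i] += b->limb[i] + c3.limb[i];
}
```

With the table folded, each 4-limb half should need only a load of the constant and two vpaddq, instead of the multiply sequence above.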

Can anyone help me get decent, portable code out of this? GCC performs well on add with #pragma GCC ivdep, but it also does silly things with mac.

Is there a way to do this which doesn’t depend on intrinsics or extensions? If I absolutely have to write this with intrinsics or extensions, is there a nice way to do it which doesn’t change the struct definition and doesn’t break strict aliasing?
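For concreteness, one candidate shape for the extension route is sketched below. It assumes the GCC/clang vector_size attribute and uses memcpy to move data between the struct and a vector temporary, so the struct definition is untouched and no pointer is cast to a vector type. The names v4u64 and add_vec are invented here, and whether the compilers turn the memcpy calls into plain vector loads and stores is untested.

```c
#include <stdint.h>
#include <string.h>

typedef struct this_should_totally_be_a_vector {
   uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* GCC/clang vector extension: four uint64_t lanes in 32 bytes.
   v4u64 is a hypothetical name for this sketch. */
typedef uint64_t v4u64 __attribute__((vector_size(32)));

/* Process the 8 limbs as two 4-lane vectors. memcpy copies bytes into
   and out of local temporaries, so strict aliasing is not violated,
   and a == b works because both inputs are read before the store. */
void add_vec(a_vector *a, const a_vector *b) {
   for (int i=0; i<8; i+=4) {
      v4u64 va, vb;
      memcpy(&va, &a->limb[i], sizeof va);
      memcpy(&vb, &b->limb[i], sizeof vb);
      va += vb;
      memcpy(&a->limb[i], &va, sizeof va);
   }
}
```

This is still an extension rather than portable C, so it only addresses the second half of the question.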

Thanks a lot,
— Mike