What is LLVM's strategy for selecting SIMD registers, and how can it be adjusted?

Consider the following simple example:

#include <immintrin.h>  // AVX intrinsics
#include <cstddef>      // size_t

static inline void Sum16FloatValuesN(float *dst, const float **src, size_t src_num) {
  __m256 m_dst_1 = _mm256_loadu_ps(dst);      // NOLINT
  __m256 m_dst_2 = _mm256_loadu_ps(dst + 8);  // NOLINT
  __m256 m_src;                               // NOLINT
  for (int i = 0; i < src_num; i++) {
    m_src = _mm256_loadu_ps(src[i]);          // NOLINT
    m_dst_1 = _mm256_add_ps(m_dst_1, m_src);  // NOLINT
    m_src = _mm256_loadu_ps(src[i] + 8);      // NOLINT
    m_dst_2 = _mm256_add_ps(m_dst_2, m_src);  // NOLINT
  }
  _mm256_storeu_ps(dst, m_dst_1);      // NOLINT
  _mm256_storeu_ps(dst + 8, m_dst_2);  // NOLINT
}

static inline void Sum16FloatValuesN2(float *dst, const float **src, size_t src_num) {
  for (int i = 0; i < src_num; i++)
    for (int j = 0; j < 16; ++j)
      dst[j] += src[i][j];
}

I compile with -O3 -march=haswell. From the assembly, I found that clang's auto-vectorization of Sum16FloatValuesN2 only uses xmm registers. When I increase the size of dst from 16 to 64 floats, clang does choose ymm registers.
I have two questions:

  1. Why does the compiler not prefer ymm registers here?
  2. Are there any options that change this strategy? From a Stack Overflow answer I learned about -mprefer-vector-width=512, but it made no difference in my test.

The assembly listing for the 16-element case isn't vectorized at all. It uses vmovss and vaddss, which are scalar instructions operating on the lowest 32 bits of an xmm register. The compiler is probably unable to prove that dst and the src arrays don't point to overlapping memory.
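For example, a hypothetical caller like this (not from the original post) is perfectly legal, so without more information the compiler has to assume the loads from src and the stores to dst may overlap:

// Hypothetical caller: srcs[0] overlaps dst shifted by one element, so
// dst[j] += src[0][j] becomes buf[j+1] += buf[j], a loop-carried dependency
// that naive vectorization would break.
void AliasingCaller() {
  float buf[17] = {};
  const float *srcs[1] = {buf};
  Sum16FloatValuesN2(buf + 1, srcs, 1);
}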

Adding __restrict to the dst pointer gets it to vectorize, but the resulting code is terrible. It appears the inner loop gets fully unrolled before the vectorizer runs, so the vectorizer ends up working on the outer loop instead. Adding #pragma nounroll to the inner loop prevents the unrolling, and with that I was able to get reasonable vector code (Compiler Explorer); a sketch of the modified source follows below.
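Roughly, the modified source looks like this (a sketch, assuming clang honors #pragma nounroll on the inner loop; the exact codegen depends on the compiler version):

// Sketch: __restrict promises dst doesn't alias the src arrays, and
// #pragma nounroll keeps the 16-iteration inner loop intact so the
// vectorizer can turn it into vector adds instead of unrolled scalars.
static inline void Sum16FloatValuesN2(float *__restrict dst, const float **src,
                                      size_t src_num) {
  for (size_t i = 0; i < src_num; i++) {
#pragma nounroll
    for (int j = 0; j < 16; ++j)
      dst[j] += src[i][j];
  }
}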

Thank you for your answer, I hadn't paid attention to the individual instructions before.
Based on your answer I did some more testing, and it seems clang fails to vectorize because of the loop unrolling; __restrict seems unnecessary in my test. Do I have to manually add #pragma nounroll wherever this pattern appears to get this behavior, or is there another way?
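For example, would spelling it with clang's loop pragma work just as well? A sketch, assuming #pragma clang loop unroll(disable) and vectorize_width(8) are supported by the clang version in use:

// Sketch: ask clang not to unroll the inner loop and to prefer 8-wide
// (ymm) vectors for it, without the Intel-style #pragma nounroll.
static inline void Sum16FloatValuesN2(float *dst, const float **src, size_t src_num) {
  for (int i = 0; i < src_num; i++) {
#pragma clang loop unroll(disable) vectorize_width(8)
    for (int j = 0; j < 16; ++j)
      dst[j] += src[i][j];
  }
}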

Another discovery: when I add #pragma nounroll and __restrict, the code icx 2023 generates for Sum16FloatValuesN2 performs the same as Sum16FloatValuesN. Let me first take a look at the differences in their assembly.