What is LLVM's strategy for selecting SIMD registers, and how can it be adjusted?

Consider the following simple example:

#include <immintrin.h>  // AVX intrinsics
#include <cstddef>      // size_t

static inline void Sum16FloatValuesN(float *dst, const float **src, size_t src_num) {
  __m256 m_dst_1 = _mm256_loadu_ps(dst);      // NOLINT
  __m256 m_dst_2 = _mm256_loadu_ps(dst + 8);  // NOLINT
  __m256 m_src;                               // NOLINT
  for (int i = 0; i < src_num; i++) {
    m_src = _mm256_loadu_ps(src[i]);          // NOLINT
    m_dst_1 = _mm256_add_ps(m_dst_1, m_src);  // NOLINT
    m_src = _mm256_loadu_ps(src[i] + 8);      // NOLINT
    m_dst_2 = _mm256_add_ps(m_dst_2, m_src);  // NOLINT
  }
  _mm256_storeu_ps(dst, m_dst_1);      // NOLINT
  _mm256_storeu_ps(dst + 8, m_dst_2);  // NOLINT
}

static inline void Sum16FloatValuesN2(float *dst, const float **src, size_t src_num) {
  for (int i = 0; i < src_num; i++)
    for (int j = 0; j < 16; ++j)
      dst[j] += src[i][j];
}

I compile with -O3 -march=haswell. From the assembly, I found that clang's auto-vectorization of Sum16FloatValuesN2 only uses xmm registers. When I increase the size of dst from 16 to 64 floats, clang does choose ymm registers.
I have two questions:

  1. Why does the compiler not prefer ymm registers here?
  2. Are there any options that change this strategy? From a Stack Overflow answer I learned about -mprefer-vector-width=512, but it made no difference in my test.

The assembly listing for the 16-element case isn't vectorized at all. It uses vmovss and vaddss, which are scalar instructions operating on the lowest 32 bits of an xmm register. The compiler is probably unable to prove that dst and the src arrays don't point to overlapping memory.
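For example, a hypothetical caller like this (not from the original post) is perfectly legal, so without more information the compiler has to assume the loads from src and the stores to dst may overlap:

// Hypothetical caller: srcs[0] overlaps dst shifted by one element, so
// dst[j] += src[0][j] becomes buf[j+1] += buf[j], a loop-carried dependency
// that naive vectorization would break.
void AliasingCaller() {
  float buf[17] = {};
  const float *srcs[1] = {buf};
  Sum16FloatValuesN2(buf + 1, srcs, 1);
}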

Adding __restrict to the dst pointer gets it to vectorize, but the resulting code is terrible. It appears the inner loop gets fully unrolled before the vectorizer runs, so the vectorizer ends up working on the outer loop instead. Adding #pragma nounroll to the inner loop prevents the unrolling, and with that I was able to get reasonable vector code (Compiler Explorer); a sketch of the modified source follows below.
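Roughly, the modified source looks like this (a sketch, assuming clang honors #pragma nounroll on the inner loop; the exact codegen depends on the compiler version):

// Sketch: __restrict promises dst doesn't alias the src arrays, and
// #pragma nounroll keeps the 16-iteration inner loop intact so the
// vectorizer can turn it into vector adds instead of unrolled scalars.
static inline void Sum16FloatValuesN2(float *__restrict dst, const float **src,
                                      size_t src_num) {
  for (size_t i = 0; i < src_num; i++) {
#pragma nounroll
    for (int j = 0; j < 16; ++j)
      dst[j] += src[i][j];
  }
}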

Thank you for your answer, I hadn't paid attention to the individual instructions before.
Based on your answer I did some more testing, and it seems clang fails to vectorize because of the loop unrolling; __restrict seems unnecessary in my test. Do I have to manually add #pragma nounroll wherever this pattern appears to get this behavior, or is there another way?
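For example, would spelling it with clang's loop pragma work just as well? A sketch, assuming #pragma clang loop unroll(disable) and vectorize_width(8) are supported by the clang version in use:

// Sketch: ask clang not to unroll the inner loop and to prefer 8-wide
// (ymm) vectors for it, without the Intel-style #pragma nounroll.
static inline void Sum16FloatValuesN2(float *dst, const float **src, size_t src_num) {
  for (int i = 0; i < src_num; i++) {
#pragma clang loop unroll(disable) vectorize_width(8)
    for (int j = 0; j < 16; ++j)
      dst[j] += src[i][j];
  }
}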

Another discovery: when I add #pragma nounroll and __restrict, the code icx 2023 generates for Sum16FloatValuesN2 performs the same as Sum16FloatValuesN. Let me first take a look at the differences in their assembly.