Suggestions on code generation for SIMD

Hi everyone,

I’m quite new to LLVM, but I am working on a project that may need to generate some SIMD code using LLVM. The SIMD code will use Intel MIC intrinsics, and I’m not sure about the steps and tool set I need to use to generate it.

I am also unsure about the following questions:

  1. Do people usually generate SIMD code at the source level, using types such as __m512?
  2. If not, does LLVM have corresponding IR instructions for SIMD registers and instructions?

Since I’m new, I would appreciate any help pointing me in the right direction, at any level. Some references would also help. Thanks in advance!

Hi Linchuan,

I believe clang supports the Intel AVX512 intrinsics, so it should be possible to generate vector code using those.

For 2), LLVM has first-class vector types such as <4 x i32> and can do the usual operations on those types, including masking. The vectoriser is where most of the vector code that LLVM generates originates. These types aren’t target-specific, however, and there is no notion of vector “registers” at the IR level.
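
For example, clang’s vector extension (the vector_size attribute, used here purely as an illustration) maps directly onto those IR vector types:

    /* Illustration only: a 16-byte integer vector becomes LLVM's <4 x i32>. */
    typedef int v4si __attribute__((vector_size(16)));

    v4si add4(v4si a, v4si b) {
      /* Lowered to a single IR-level `add <4 x i32>` instruction. */
      return a + b;
    }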

Cheers,
Amara

Thanks Amara so much for the info!

One more question: what do people usually do if they want to generate vectorized code for some existing C/C++ code?
Do they usually do a C/C++ source-level transformation, or do they work at LLVM’s IR level?

I know clang supports auto-vectorization, such as loop vectorization and SLP, but these are not flexible enough if we
want to do more custom vectorization or handle more complex cases. For example, SLP might not be able to handle
branches in the code (or maybe the latest version can already handle branches using masks).

The vast majority of the time people will rely on source-level pragmas [1]. LLVM IR is designed to be machine friendly; it is not intended for users to edit manually. You can do it, but it’s tedious and error prone. If you need more control over the vectorisation than the pragmas allow, then the C intrinsics are the best choice.
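
For instance, a loop hint in the source (a minimal sketch, using the pragmas documented in [1]) looks like this:

    void scale(float *a, float k, int n) {
      /* Ask clang's loop vectoriser to vectorise this loop with a width
         of 8; the hint is a request, not a guarantee. */
      #pragma clang loop vectorize(enable) vectorize_width(8)
      for (int i = 0; i < n; ++i)
        a[i] *= k;
    }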

Amara

[1] http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

Thanks very much, Amara! I will take a look!

    The vast majority of the time people will rely on source-level pragmas [1].
    LLVM IR is designed to be machine friendly; it is not intended for users to
    edit manually. You can do it, but it’s tedious and error prone. If you need
    more control over the vectorisation than the pragmas allow, then the C
    intrinsics are the best choice.

    Amara

    [1] http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

A large portion of users still use intrinsics too, as provided in
avxintrin.h and the like. They are then lowered to a single LLVM
instruction, or a few, with vector operands.
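
For example, a user-side snippet (hypothetical, and it needs -mavx when compiling) could look like this; dumping the IR with `clang -O2 -mavx -S -emit-llvm` shows the intrinsic reduced to a plain vector add:

    #include <immintrin.h>

    /* Hypothetical example: the AVX intrinsic below is lowered by clang to
       an IR-level fadd on <4 x double>, not to an opaque call. */
    __m256d add_pd(__m256d a, __m256d b) {
      return _mm256_add_pd(a, b);
    }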

Thanks Serge! This means that for every new intrinsic set, a systematic change has to be made to LLVM to support it, right? That change would include frontend changes, IR instruction set changes, as well as low-level code generation changes?

It really depends. In most cases, the intrinsic is implemented in terms
of generic vector instructions, directly represented at the LLVM IR level:

    static __inline __m256d __DEFAULT_FN_ATTRS
    _mm256_sub_pd(__m256d __a, __m256d __b)
    {
      return (__m256d)((__v4df)__a-(__v4df)__b);
    }

But some intrinsics cannot be modeled that way:

    static __inline __m256d __DEFAULT_FN_ATTRS
    _mm256_hadd_pd(__m256d __a, __m256d __b)
    {
      return (__m256d)__builtin_ia32_haddpd256((__v4df)__a, (__v4df)__b);
    }

In that case, the builtin is relatively opaque to the middle end and is
lowered in the backend (see include/llvm/IR/IntrinsicsX86.td).
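
To make the contrast concrete, here is a small sketch (assumptions noted in the comments) of what the two intrinsics above roughly become at the IR level:

    #include <immintrin.h>

    /* Sketch, assuming `clang -O2 -mavx -S -emit-llvm`: the subtraction
       becomes a plain `fsub <4 x double>` in the IR, while the horizontal
       add becomes a call to the target intrinsic
       `@llvm.x86.avx.hadd.pd.256`, which the X86 backend later lowers to
       a vhaddpd instruction. */
    __m256d sub_then_hadd(__m256d a, __m256d b) {
      __m256d d = _mm256_sub_pd(a, b);  /* generic IR: fsub */
      return _mm256_hadd_pd(d, b);      /* opaque target intrinsic */
    }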