mmintrin.h files and clang-cl


cl.exe supports using SSE intrinsics even when targeting a CPU that doesn’t support them. For example, when using intrinsics from tmmintrin.h, the intrinsics expand to SSSE3 instructions, but the compiler won’t generate SSSE3 instructions for regular C code. This makes it possible to keep a few functions that use SSSE3 (those that use the intrinsics) in a translation unit and call them from other functions in the same translation unit after checking processor flags for support.
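To make the cl.exe pattern concrete, here is a rough sketch of "check processor flags, then call the intrinsic-using function". The helper names (cpu_has_ssse3, abs16, abs16_scalar, abs16_ssse3) and the HAVE_UNGATED_SSSE3_INTRINSICS macro are made up for illustration. The SSSE3 variant is only compiled under cl.exe itself, since that is the compiler assumed here to accept tmmintrin.h without a target flag; everywhere else the sketch falls back to scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>    /* __cpuid */
#define SKETCH_MSVC_X86 1
#elif defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>     /* __get_cpuid (gcc/clang) */
#define SKETCH_GNU_X86 1
#endif

/* Assumption: only real cl.exe lets us use SSSE3 intrinsics without
   enabling SSSE3 for the whole translation unit. */
#if defined(SKETCH_MSVC_X86) && !defined(__clang__)
#include <tmmintrin.h>
#define HAVE_UNGATED_SSSE3_INTRINSICS 1
#endif

/* Returns 1 if CPUID leaf 1 reports SSSE3 (ECX bit 9), else 0. */
static int cpu_has_ssse3(void) {
#if defined(SKETCH_MSVC_X86)
  int regs[4];                      /* EAX, EBX, ECX, EDX */
  __cpuid(regs, 1);
  return (regs[2] >> 9) & 1;
#elif defined(SKETCH_GNU_X86)
  unsigned eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
  return (int)((ecx >> 9) & 1);
#else
  return 0;                         /* not x86: no SSSE3 */
#endif
}

/* Portable fallback: absolute value of 16 signed bytes.
   -128 maps to 128 (0x80), the same bit pattern pabsb produces. */
static void abs16_scalar(const int8_t *in, uint8_t *out) {
  for (size_t i = 0; i < 16; ++i) {
    int v = in[i];
    out[i] = (uint8_t)(v < 0 ? -v : v);
  }
}

#ifdef HAVE_UNGATED_SSSE3_INTRINSICS
/* Only this function contains SSSE3 instructions. */
static void abs16_ssse3(const int8_t *in, uint8_t *out) {
  __m128i v = _mm_loadu_si128((const __m128i *)in);
  _mm_storeu_si128((__m128i *)out, _mm_abs_epi8(v));
}
#endif

/* Runtime dispatch within the same translation unit. */
static void abs16(const int8_t *in, uint8_t *out) {
  assert(in != NULL && out != NULL);
#ifdef HAVE_UNGATED_SSSE3_INTRINSICS
  if (cpu_has_ssse3()) {
    abs16_ssse3(in, out);
    return;
  }
#endif
  abs16_scalar(in, out);
}
```

Callers just use abs16(in, out) on 16-byte buffers and get the same bytes back whichever path was taken.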

clang (and gcc) don’t support this. With clang, if you want to use tmmintrin.h, you have to build with -mssse3, and then the compiler can generate SSSE3 instructions for all C code. This is because the intrinsics get compiled into builtins, which are then turned into LLVM vector instructions that no longer remember that they were intrinsics at some point: they are no different from the LLVM IR generated for regular code.
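For contrast, a small sketch of what the clang/gcc model described above looks like from user code: the choice is made per translation unit at compile time, keyed on the __SSSE3__ macro that these compilers define when building with -mssse3. The function names here (built_with_ssse3, abs16_compile_time) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __SSSE3__
#include <tmmintrin.h>   /* only usable because the whole TU targets SSSE3 */
#endif

/* Reports whether this translation unit was compiled with SSSE3 enabled. */
static int built_with_ssse3(void) {
#ifdef __SSSE3__
  return 1;
#else
  return 0;
#endif
}

/* Absolute value of 16 signed bytes; the path is fixed at compile time. */
static void abs16_compile_time(const int8_t *in, uint8_t *out) {
#ifdef __SSSE3__
  /* With -mssse3 the compiler may also emit SSSE3 anywhere else in this
     file, so no runtime check here can make the file safe on older CPUs. */
  __m128i v = _mm_loadu_si128((const __m128i *)in);
  _mm_storeu_si128((__m128i *)out, _mm_abs_epi8(v));
#else
  for (size_t i = 0; i < 16; ++i) {
    int v = in[i];
    out[i] = (uint8_t)(v < 0 ? -v : v);
  }
#endif
}
```

Built without -mssse3, only the scalar loop exists; built with it, the intrinsic path exists but so does the compiler's licence to use SSSE3 in the rest of the file.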

This causes some compatibility issues between clang-cl and cl – see for example

It just occurred to me that in MS mode, we could implement the SSE intrinsics as built-in asm blocks when the targeted CPU doesn’t support them, like so:

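#ifndef __SSSE3__

/* Target doesn't have SSSE3 enabled: wrap the instruction in an MS-style asm block. */
static __forceinline __m128i _mm_abs_epi8(__m128i __a) {
  __asm pabsb xmm0, xmmword ptr __a
}

#else

/* Target has SSSE3 enabled: use the builtin as before. */
static inline __m128i __attribute__((always_inline, nodebug))
_mm_abs_epi8(__m128i __a) {
  return (__m128i)__builtin_ia32_pabsb128((__v16qi)__a);
}

#endif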

The downside would be that clang couldn’t reason about these asm blocks and would insert lots of unnecessary stores and loads, which matters because code using intrinsics is probably performance-sensitive. On the other hand, it’d be more compatible with cl.

Are there other downsides, or reasons why this can’t work at all? (My expectation is that we won’t want to do this, and that I can then point at this thread for why.)