An issue with re-implementing an AVX2 intrinsic using inline ASM

Greetings everyone. Please allow me to illustrate my problem here:

First, please consider the following sample code:

#include <stdio.h>
#include
#include
#include <immintrin.h>

using namespace std;

int main(int argc, char const *argv[])
{
    __m256i x, y;
    __m256i res = _mm256_and_si256(x, y);
    return 0;
}

It compiles without problems using clang -mavx2 source.cc

Now we want to rewrite the _mm256_and_si256 function using inline ASM, as follows:

#include <stdio.h>
#include
#include

using namespace std;

typedef float __m256 __attribute__((vector_size(32)));
typedef double __m256d __attribute__((vector_size(32)));
typedef long long __m256i __attribute__((vector_size(32)));
typedef long long __v4di __attribute__((vector_size(32)));
typedef int __v8si __attribute__((vector_size(32)));
typedef short __v16hi __attribute__((vector_size(32)));
typedef char __v32qi __attribute__((vector_size(32)));

__attribute__((always_inline)) inline
__m256i _my_mm256_and_si256(__m256i s1, __m256i s2)
{
    __m256i result;
    /* AT&T operand order: vpand src2, src1, dest */
    asm ("vpand %2, %1, %0" : "=x"(result) : "x"(s1), "xm"(s2));
    return result;
}

int main(int argc, char const *argv[])
{
    __m256i x, y;
    __m256i res = _my_mm256_and_si256(x, y);
    return 0;
}

This new code also compiles without problems using clang -mavx2 source.cc

However, if we remove the -mavx2 flag, clang emits the following error:

fatal error: error in backend: Do not know how to split the result of this operator!

clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
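
For comparison, the same AND can be written with the compiler's built-in vector operator instead of inline asm. Without AVX enabled, a 256-bit value does not fit in a single register, and the word "split" in the message above suggests the backend tried to break the operation into narrower pieces. It can presumably do that for an operation it understands, but not for an opaque asm statement. A minimal sketch, assuming the GCC/Clang vector_size extension; the names v4di and and_generic are made up for illustration:

```cpp
#include <cassert>

// 256-bit vector of four 64-bit lanes, declared with the GCC/Clang
// vector_size extension (same shape as the __m256i typedef in this post).
typedef long long v4di __attribute__((vector_size(32)));

// Hypothetical helper, not part of any library: bitwise AND written with
// the built-in vector operator instead of inline asm. Operands go through
// pointers so no 256-bit value crosses a call boundary by value.
void and_generic(const v4di *a, const v4di *b, v4di *out)
{
    assert(a && b && out);
    *out = *a & *b;  // the compiler chooses the instructions for the target
}
```

This version is expected to build with or without -mavx2, since the compiler is free to fall back to narrower instructions when AVX is unavailable.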

Someone has given me an explanation here saying

If the -mavx2 flag is missing, clang/LLVM is unable to determine the right machine target for binding the input memory parameter to the input register required by the vpand instruction, since vpand requires ymm[0…7] as its input/output.

This makes some sense, and I guess gcc's error output (20 : error: impossible constraint in 'asm') is complaining about a similar thing. However, it doesn't explain why SSE4.2 asm code compiles without the -msse4.2 flag. So please allow me to show you more here:

#include <stdio.h>
#include
#include
#include <stdint.h>
#include <emmintrin.h>

using namespace std;

static inline __attribute__((always_inline))
int new_cmpestri(
    __m128i str1, int len1, __m128i str2, int len2, const int mode) {
    int result;
    /* pcmpestri takes its lengths in eax/edx and returns the index in ecx */
    asm("pcmpestri %5, %2, %1"
        : "=c"(result) : "x"(str1), "xm"(str2), "a"(len1), "d"(len2), "i"(mode) : "cc");
    return result;
}

int main(int argc, char const *argv[])
{
    __m128i str1;
    int len1 = 0;
    __m128i str2;
    int len2 = 0;
    const int mode = 0;
    uint32_t result = new_cmpestri(str1, len1, str2, len2, mode);
    return 0;
}

The CPUID flag for pcmpestri is SSE4.2, yet this code compiles fine without the -msse4.2 flag.
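
The same pattern can be seen with a simpler 128-bit instruction. A minimal sketch, assuming an x86-64 target, where SSE2 is part of the baseline and the 128-bit XMM registers are therefore available to the "x" constraint with default flags. The instruction pand is chosen here only for illustration, and the names v2di and and128 are made up:

```cpp
#include <cassert>

// 128-bit vector of two 64-bit lanes (same shape as __m128i).
typedef long long v2di __attribute__((vector_size(16)));

// Hypothetical helper: 128-bit AND via inline asm, guarded so the file
// still builds on targets other than x86-64.
void and128(const v2di *a, const v2di *b, v2di *out)
{
    assert(a && b && out);
#if defined(__x86_64__)
    v2di r = *a;
    // AT&T operand order: pand src, dst  =>  dst &= src
    __asm__("pand %1, %0" : "+x"(r) : "xm"(*b));
    *out = r;
#else
    *out = *a & *b;  // portable fallback using the vector operator
#endif
}
```

Here the 16-byte operands fit in XMM registers, so the constraint can be satisfied without any extra -m flag. The 32-byte operands in the vpand example have no such register when AVX is disabled.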

I have conducted experiments with both gcc 4.9.2 and clang 3.3.

So in brief, I have two questions:

  1. Is it possible to compile the AVX2 ASM without the -mavx flag using clang?

  2. If the answer to question 1 is no, why can we do that for SSE4.2 ASM without the -msse4.2 flag?
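
For reference on question 1, one avenue that might be worth trying is the per-function target attribute, which enables an instruction set for a single function rather than for the whole file. A minimal sketch, assuming a compiler that supports __attribute__((target("avx2"))) and __builtin_cpu_supports (recent GCC and clang releases do; the clang 3.3 used here may not). The names v4di, and256_avx2 and and256 are made up for illustration:

```cpp
#include <cassert>

typedef long long v4di __attribute__((vector_size(32)));

#if defined(__x86_64__)
// AVX2 is enabled for this one function only, so the "x" constraint can
// bind a 256-bit operand to a YMM register. It is deliberately not
// always_inline: inlining into a caller built without AVX2 is rejected.
__attribute__((target("avx2")))
static void and256_avx2(const v4di *a, const v4di *b, v4di *out)
{
    v4di r;
    // AT&T operand order: vpand src2, src1, dest
    __asm__("vpand %2, %1, %0" : "=x"(r) : "x"(*a), "xm"(*b));
    *out = r;
}
#endif

// Hypothetical dispatcher: take the asm path only when the running CPU
// reports AVX2, otherwise use the portable vector operator.
void and256(const v4di *a, const v4di *b, v4di *out)
{
    assert(a && b && out);
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) {
        and256_avx2(a, b, out);
        return;
    }
#endif
    *out = *a & *b;
}
```

The runtime check matters because building without -mavx2 says nothing about the machine the binary will eventually run on.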

Thank you very much for taking the time to read this!