Inefficient code generation for _mm_test{z, c, nzc} (SSE4.1)

Hi

I've stumbled over a deficiency in clang's codegen for the SSE4.1 _mm_test* intrinsics. These intrinsics are supposed to map to the PTEST instruction, which sets the ZF (zero flag) and CF (carry flag) depending on whether the bitwise AND (or ANDNOT for CF) of two SSE registers is all zero or not. The construct

  if (_mm_test{z,c,nzc}_si128(v, m))
    …

should thus produce a PTEST instruction followed by a single conditional branch (JZ for _mm_testz_si128, JC for _mm_testc_si128, JNBE for _mm_testnzc_si128). Clang, however, instead produces something like

  PTEST …
  SETE %al
  MOVZBL %al, %eax
  TEST %eax, %eax
  JNE ...
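
For reference, here is a minimal scalar sketch in plain C of the flag semantics described above, and of what each intrinsic is expected to return. The vec128 type and the ptest_*/model_* names are hypothetical and exist only for illustration; they stand in for __m128i and the real intrinsics and are not part of any header.

```c
#include <stdint.h>

/* Hypothetical stand-in for __m128i: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } vec128;

/* ZF as set by PTEST: 1 iff (a AND b) is all zero. */
int ptest_zf(vec128 a, vec128 b) {
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF as set by PTEST: 1 iff ((NOT a) AND b) is all zero. */
int ptest_cf(vec128 a, vec128 b) {
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* Models of the three intrinsics in terms of the two flags. */
int model_testz(vec128 a, vec128 b)   { return ptest_zf(a, b); }                     /* branch: JZ   */
int model_testc(vec128 a, vec128 b)   { return ptest_cf(a, b); }                     /* branch: JC   */
int model_testnzc(vec128 a, vec128 b) { return !ptest_zf(a, b) && !ptest_cf(a, b); } /* branch: JNBE */
```

Each result is a direct function of ZF and CF, which is why a single conditional jump after PTEST should suffice, with no need to materialize the result in a register first.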

Also, the LLVM bitcode looks a tad strange. For

  if (_mm_testz_si128(v,v))
    body();

Clang generates

  %2 = tail call i32 @llvm.x86.sse41.ptestz(<4 x float> %1, <4 x float> %1) nounwind
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %5, label %4
; <label>:4 ; preds = %0
  tail call void (...)* @body() nounwind
  br label %5
; <label>:5 ; preds = %4, %0
  ret void

Since _mm_testz_si128 uses __m128i (the integer SSE type), *not* __m128 (the single-precision float SSE type), it seems strange that the corresponding LLVM intrinsic takes parameters of type float.

I'm not sure whether fixing this involves changing Clang or LLVM (or both?), which is why I haven't filed a bug report so far, but instead posted this here.

Funnily enough, GCC 4.2 (at least the OSX version) has the same problem. Later GCC versions get it right, though.

best regards,
Florian Pflug

IMO, this is strictly an LLVM codegen issue. I also recall that someone has been working on this. Are you using trunk clang / llvm?

Evan

Interestingly, the AVX ptest intrinsics are correctly taking 4 x i64 arguments. I’ll fix the 128-bit versions to take 2 x i64.

Wouldn't it also make more sense for the PTEST intrinsics to return an i1, not an i32? One could then use the results of llvm.x86.sse41.ptest{z,c,nzc} directly as a condition for br, without the intermediate icmp step. Or so I imagine, at least.

best regards,
Florian Pflug

I intend to look into fixing that. There’s likely something missing during DAG combine that should be able to fix this up.