inefficient code generation for 128-bit->256-bit typecast intrinsics

Hello,

LLVM generates two additional instructions for 128-bit->256-bit typecasts
(e.g. _mm256_castsi128_si256()) to clear out the upper 128 bits of the YMM register corresponding to the source XMM register:

    vxorps      xmm2,xmm2,xmm2
    vinsertf128 ymm0,ymm2,xmm0,0x0
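
For context, a minimal reproducer of the kind of code that triggers this
looks like the following (the function name and the compile command are my
own illustration, not part of the original report):

    #include <immintrin.h>

    /* Compile with AVX enabled, e.g.: clang -O2 -mavx -S cast.c
     * Ideally the cast is free: the __m128i argument already lives in the
     * low half of a YMM register, and its upper 128 bits may be left
     * undefined. */
    __m256i widen(__m128i x)
    {
        return _mm256_castsi128_si256(x);  /* 128-bit -> 256-bit typecast */
    }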

Most of the industry-standard C/C++ compilers (GCC, Intel's compiler, the Visual Studio compiler) don't generate any extra moves for 128-bit->256-bit typecast intrinsics. None of these compilers zero-extend the upper 128 bits of the 256-bit YMM register. Intel's documentation for the _mm256_castsi128_si256 intrinsic explicitly states that "the upper bits of the resulting vector are undefined" and that "this intrinsic does not introduce extra moves to the generated code".
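
For what it's worth, Clang's generic vector builtins can already express
"upper bits undefined": __builtin_shufflevector accepts -1 as a shuffle
index, meaning the corresponding result element is undefined. Below is a
sketch of how the cast could be written with exactly the semantics Intel
documents; this is only an illustration of the idea, not a quote from
Clang's avxintrin.h:

    #include <immintrin.h>

    typedef long long v2di_local __attribute__((__vector_size__(16)));

    /* Hypothetical variant of the cast: keep the two 64-bit lanes of the
     * source and leave the upper two lanes undefined (index -1), so the
     * backend is free to emit no instructions at all for the cast. */
    static inline __m256i cast128to256_undef_upper(__m128i a)
    {
        return (__m256i)__builtin_shufflevector((v2di_local)a, (v2di_local)a,
                                                0, 1, -1, -1);
    }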

Clang implements these typecast intrinsics differently. Is this intentional? I suspect this was done to avoid the hardware penalty caused by partial register writes. But isn't the overall cost of two additional instructions (vxorps + vinsertf128) for *every* 128-bit->256-bit typecast intrinsic higher than the partial-register-write penalty in the *rare* cases when the upper part of the YMM register corresponding to the source XMM register is not already cleared?

Thanks!

Katya.

Hi Katya,

Can you please open a Bugzilla bug report (llvm.org/bugs)?

Thanks,
Nadav