LLVM generates two additional instructions (vxor + vinsertf128) for 128-bit->256-bit typecasts
(e.g. _mm256_castsi128_si256()) to clear the upper 128 bits of the YMM register corresponding to the source XMM register.
Most of the industry-standard C/C++ compilers (GCC, Intel's compiler, Visual Studio compiler) don't
generate any extra moves for 128-bit->256-bit typecast intrinsics.
None of these compilers zero-extend the upper 128 bits of the 256-bit YMM register. Intel's
documentation for the _mm256_castsi128_si256 intrinsic explicitly states that "the upper bits of the
resulting vector are undefined" and that "this intrinsic does not introduce extra moves to the generated code."
Clang implements these typecast intrinsics differently. Is this intentional? I suspect this was done to avoid a hardware penalty caused by partial register writes.

But isn't the overall cost of two additional instructions (vxor + vinsertf128) for *every* 128-bit->256-bit typecast intrinsic higher than the hardware penalty caused by partial register writes in the *rare* cases where the upper part of the YMM register corresponding to the source XMM register is not already cleared?