Question about generated code for x86 vpgather* intrinsics

Hi,

I’ve been looking into some basic vectorized code using_mm256_i32gather_epi32 for vpgatherdd. In a basic function I’ve been testing, I’m a little confused that it zeroes out the result before the gather (example: https://godbolt.org/g/zQzn56). Shouldn’t this be unnecessary when the mask is not specified by the intrinsic (and therefor set for every element), in which case the resulting ymm register will be fully loaded? The IR seems to specify a pre-zeroed result if I’m understanding things correctly:

%8 = tail call <8 x i32> @llvm.x86.avx2.gather.d.d.256(<8 x i32> zeroinitializer, i8* %1, <8 x i32> %7, <8 x i32> <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>, i8 1)

Am I correct in thinking this initialization is unnecessary here given than the mask is all -1? I’m not really concerned with the cost of the initialization as much as I am concerned that I don’t fully understand the semantics of the instruction/intrinsic :slight_smile:

Thanks,
-Jackson

You are correct that the initialization is unnecessary for the calculation of the result of the instruction.

It’s zero in the IR because the intrinsic header file uses _mm256_undefined_si256() which is defined as zero for other reasons. See llvm.org/PR32176. We don’t currently have a convenient way to put undef in the IR from C code.

#define _mm256_i32gather_epi32(m, i, s) extension ({
(__m256i)__builtin_ia32_gatherd_d256((__v8si)_mm256_undefined_si256(),
(int const *)(m), (__v8si)(__m256i)(i),
(__v8si)_mm256_set1_epi32(-1), (s)); })

Now the backend could detect that mask is all 1s and remove the zero. But the backend knows one additional thing about the gather instructions. Even though the mask is all ones, the scheduler in the CPU doesn’t know that until the gather instruction executes. That means the scheduler has to conservatively assume that the passthru input may be used by the instruction so the gather can’t execute until the last writer of whatever register is chosen has executed and produced its result. Even though that result isn’t going to be used by the gather. This is a false scheduling dependency. To break the dependency we emit an explicit zeroing with an xor which has special treatment in the CPU. The xor result will be considered ready without ever executing and the gather won’t wait for it.

The backend will replace any non-zero or undef value with zero when it can prove the mask is all ones.

We could be smarter and try to find a register that hasn’t been written in a while and use the zeroing as a last resort, but that’s harder. We should maybe not emit the zero with -Os or -Oz either, but no one has complained about that yet.