When I compile a loop with 72 iterations, the compiler generates AVX-512 zmm operations four times (16 x 4 = 64), and the remaining 8 iterations are handled by ordinary scalar mov operations through the EAX register. Wouldn't it be better to use ymm for the remaining 8 iterations, as it does when the iteration count is between 8 and 15? The same applies to xmm for a remainder of 4, and so on.
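To make the arithmetic concrete, here is a rough sketch of the split I have in mind. It assumes 32-bit elements (so 16 lanes per zmm, 8 per ymm, 4 per xmm); the struct and the function name plan_chunks are made up for illustration and are not anything the compiler exposes.

```c
#include <stddef.h>

/* Hypothetical helper: how many chunks of each width cover a trip count,
 * assuming 32-bit elements (16 lanes per zmm, 8 per ymm, 4 per xmm). */
struct chunk_plan {
    size_t zmm16;  /* 512-bit chunks, 16 iterations each */
    size_t ymm8;   /* 256-bit chunks, 8 iterations each */
    size_t xmm4;   /* 128-bit chunks, 4 iterations each */
    size_t scalar; /* leftover iterations done one at a time */
};

static struct chunk_plan plan_chunks(size_t n)
{
    struct chunk_plan p;
    p.zmm16 = n / 16; n %= 16;  /* widest vectors first */
    p.ymm8  = n / 8;  n %= 8;   /* then the 8-wide remainder */
    p.xmm4  = n / 4;  n %= 4;   /* then the 4-wide remainder */
    p.scalar = n;               /* at most 3 iterations left */
    return p;
}
```

For a trip count of 72 this gives four zmm chunks and one ymm chunk, with nothing left for xmm or scalar code.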
Please correct me if I am wrong.
Thank you.
Thank you for the reference. Very interesting read!
I have a couple of questions, though:
- What is the implementation status of this effort?
- I didn't find anything on masked low-trip-count/remainder vectorization in what I've read. I believe that for AVX-512 and other masking-enabled targets (e.g. VPU), masking may be the preferred technique for low-trip-count vectorization.
By masked low-trip-count vectorization I mean something along the lines of the following transformation:
for (i = 0; i < smallN; ++i) {
op;
}
Transformed to:
for (i = 0; i < round_up_to_multiple_of_VL(smallN, VL); ++i) {
if (i < smallN)
op;
}
And then vectorized by VL with a mask.
Where (assuming VL is a small power of 2):
round_up_to_multiple_of_VL(x, constexpr VL) {
return (x + VL - 1) & ~(VL - 1);
}
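To spell out the transformation, here is a minimal scalar sketch in C. It assumes VL = 16 and uses "add 1 to each element" as a stand-in for op; the names round_up_to_multiple_of_vl and add_one_masked are only for illustration. The inner loop models one VL-wide vector operation, and the `i < n` test plays the role of the per-lane mask. Rounding up uses the usual `(x + VL - 1) & ~(VL - 1)` form for a power-of-two VL.

```c
#include <stddef.h>

enum { VL = 16 }; /* assumed vector length; must be a power of 2 */

/* Round x up to the next multiple of VL. */
static size_t round_up_to_multiple_of_vl(size_t x)
{
    return (x + VL - 1) & ~(size_t)(VL - 1);
}

/* Scalar model of a masked vector loop: every block is a full VL wide,
 * and lanes at or beyond n are switched off by the predicate. */
static void add_one_masked(int *a, size_t n)
{
    size_t padded = round_up_to_multiple_of_vl(n);
    for (size_t base = 0; base < padded; base += VL) {
        /* One iteration of this outer loop stands for one vector op. */
        for (size_t lane = 0; lane < VL; ++lane) {
            size_t i = base + lane;
            if (i < n)      /* lane mask: inactive lanes touch nothing */
                a[i] += 1;  /* stand-in for "op" */
        }
    }
}
```

With n = 8 this runs a single 16-wide block with only the low 8 lanes active, so no scalar remainder loop is needed and no element past index 7 is read or written.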
Thank you,
Serge.