When I generate code for a loop with 72 iterations, the compiler emits AVX-512 zmm operations four times (16 x 4 = 64 elements) and handles the remaining 8 iterations with ordinary mov operations through the EAX register. Wouldn't it be better to use ymm for the remaining 8 iterations, as the compiler already does when the iteration count is between 8 and 15? The same applies to xmm for smaller remainders, and so on.
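To make the suggestion concrete, here is a rough sketch of the breakdown I have in mind. It assumes 32-bit elements (16 per zmm, 8 per ymm, 4 per xmm); the struct and the function name `plan_chunks` are made up for illustration and do not describe what the compiler actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plan: how many operations of each width cover a trip count.
   Assumes 32-bit elements: 16 per zmm, 8 per ymm, 4 per xmm. */
struct chunk_plan {
    size_t zmm_iters;    /* 16 elements each */
    size_t ymm_iters;    /* 8 elements each */
    size_t xmm_iters;    /* 4 elements each */
    size_t scalar_iters; /* 1 element each */
};

struct chunk_plan plan_chunks(size_t trip_count)
{
    struct chunk_plan p;
    size_t rest = trip_count;

    p.zmm_iters = rest / 16;  /* widest vectors first */
    rest %= 16;
    p.ymm_iters = rest / 8;   /* then the 8-element remainder */
    rest %= 8;
    p.xmm_iters = rest / 4;   /* then the 4-element remainder */
    rest %= 4;
    p.scalar_iters = rest;    /* whatever is left runs as scalar code */

    /* The chunks must cover the trip count exactly. */
    assert(p.zmm_iters * 16 + p.ymm_iters * 8 +
           p.xmm_iters * 4 + p.scalar_iters == trip_count);
    return p;
}
```

For 72 iterations this gives four zmm operations plus one ymm operation and no scalar tail, which is the code I would have expected.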

Please correct me if I am wrong.

Thank you.

Thank you for the reference. Very interesting read!

I have a couple of questions, though:

- What is the implementation status of this effort?
- I didn’t find anything on masked low-trip/remainder vectorization in what I’ve read. I believe that for AVX-512 and other masking-enabled targets (e.g. VPU), masking may be the preferred technique for low-trip-count vectorization.

By masked low-trip vectorization I mean something along the lines of the following transformation:

for (i = 0; i < smallN; ++i) {
  op;
}

Transformed to:

for (i = 0; i < round_up_to_multiple_of_VL(smallN, VL); ++i) {
  if (i < smallN)
    op;
}

And then vectorize by VL with a mask,

where (assuming VL is a small power of 2):

round_up_to_multiple_of_VL(x, constexpr VL) {
  return (x + VL - 1) & ~(VL - 1);
}
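For illustration, here is a scalar emulation of that scheme in C. It is only a sketch: VL = 8, the loop body (`dst[i] = src[i] + 1`), and the names `round_up_to_multiple` and `masked_add_one` are my own assumptions, and the round-up uses the usual power-of-two idiom `(x + vl - 1) & ~(vl - 1)`. The inner loop stands in for one masked vector operation, with `i < n` playing the role of the lane mask.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 8 }; /* assumed vector length in elements; must be a power of two */

/* Round x up to the next multiple of vl, where vl is a power of two. */
size_t round_up_to_multiple(size_t x, size_t vl)
{
    assert(vl != 0 && (vl & (vl - 1)) == 0);
    return (x + vl - 1) & ~(vl - 1);
}

/* Scalar emulation of a masked vector loop computing dst[i] = src[i] + 1
   for i < n. Each outer iteration models one vector operation of VL lanes. */
void masked_add_one(int *dst, const int *src, size_t n)
{
    size_t padded = round_up_to_multiple(n, VL);

    for (size_t base = 0; base < padded; base += VL) {
        for (size_t lane = 0; lane < VL; ++lane) {
            size_t i = base + lane;
            if (i < n)                /* lane mask: inactive lanes do nothing */
                dst[i] = src[i] + 1;
        }
    }
}
```

With n = 11 the padded trip count is 16, so there are two "vector" iterations, and the second one has only three active lanes; nothing past index 10 is read or written.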

Thank you,

Serge.