Question about VectorLegalizer::ExpandStore() with v4i1

Rob, Ahmed, and Jingu,

[I'm sorry if my point of view is too x86-centric.]

the tricky part about fixing it is the need to settle on a memory layout for these vectors
(packed vs byte per i1; packed would be compatible with AVX512, I think).

I agree with Ahmed here, in principle. It's actually more than that, since a vector compare
in AVX2 and below produces a mask with the same bit width per element as the compared data.
For example, in code that mixes data types, it isn't rare for the result of an integer vector
compare (0/FFFFFFFF, not even 0/1) to be consumed by a double-precision blend (or compute),
and vice versa, so a mask conversion between 32 bits per element and 64 bits per element
has to happen.
We need to minimize conversions between 0/1 logic and 0/-1 logic, and also conversions
between different element sizes. Doing so for AVX2 and below is challenging enough.
The introduction of AVX512F in Xeon Phi added another challenge for vectorizer developers.
The addition of AVX512BW and VL should make it easier.
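
To make the conversion concrete, here is a minimal scalar sketch in plain C++ (no intrinsics)
of a 32-bit compare mask being widened for a 64-bit consumer. The names Mask32x4, Mask64x4,
compareLess, widenMask and blendDouble are made up for illustration and are not LLVM or
intrinsic APIs; the blend keys off the top bit of each lane, which is how I understand the
variable blend instructions to behave.

```cpp
#include <array>
#include <cstdint>

// Hypothetical types: a 4-lane compare result in the 0/-1 form that
// AVX2-and-below compares produce, at two different element widths.
using Mask32x4 = std::array<std::uint32_t, 4>;
using Mask64x4 = std::array<std::uint64_t, 4>;

// Lane-wise a < b on 32-bit integers; true lanes are 0xFFFFFFFF, not 1.
Mask32x4 compareLess(const std::array<std::int32_t, 4> &a,
                     const std::array<std::int32_t, 4> &b) {
  Mask32x4 m{};
  for (int i = 0; i < 4; ++i)
    m[i] = a[i] < b[i] ? 0xFFFFFFFFu : 0u;
  return m;
}

// Widen each lane from 32 to 64 bits by replicating its top bit,
// so all-ones stays all-ones (the effect of a sign extension).
Mask64x4 widenMask(const Mask32x4 &m) {
  Mask64x4 w{};
  for (int i = 0; i < 4; ++i)
    w[i] = std::uint64_t{0} - (m[i] >> 31);
  return w;
}

// Double-precision blend driven by the 64-bit mask: take a[i] where the
// top bit of the mask lane is set, b[i] otherwise.
std::array<double, 4> blendDouble(const Mask64x4 &m,
                                  const std::array<double, 4> &a,
                                  const std::array<double, 4> &b) {
  std::array<double, 4> r{};
  for (int i = 0; i < 4; ++i)
    r[i] = (m[i] >> 63) ? a[i] : b[i];
  return r;
}
```

The widenMask step is the extra work that shows up whenever the compare and its consumer
disagree on element size; going the other way (64 to 32) needs a narrowing step instead.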

Without AVX512BW and VL (i.e., all of today's x86 targets), the optimal representation of
the result of a compare is determined by how it is consumed, and it is not a good idea
to have such optimization in multiple different places. If the legalizer has to blindly
legalize v4i1 without knowing how it is consumed, it is best to look at what happens
to v8i1 and do the same. We can then let the same optimizer work to get the optimal ASM
code out in the end, whether the vectorization factor is 4 or 8.

In the end, I may be agreeing with Rob, but not for the reasons Rob mentioned.
One of the headaches is that movmskps/pmovmskb do not have a quick reverse instruction
(MIC-AVX512 and below). I do not know LLVM's X86 CodeGen well enough to say whether it
internally has mask-to/from-vector nodes. If it does, I'd hope X86 CodeGen can cancel out such
pairs in a peephole manner very efficiently, so that blindly going for i1-per-elem (at type
legalization time) is good enough for most (if not all) cases. I also hope that is
good (or good enough) for other (i.e., non-x86) backends.
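
For those less familiar with the two directions, here is a scalar sketch in plain C++ of what
I mean. packMask models what movmskps computes (one sign bit per lane gathered into the low
bits of an integer); expandMask is the reverse, which has no single cheap instruction on the
targets above. toBytePerElement shows the byte-per-i1 memory layout next to the packed one.
All names here (LaneMask, packMask, expandMask, toBytePerElement) are hypothetical helpers for
illustration, not LLVM nodes or intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical type: a v4i1 held as four 32-bit lanes in 0/-1 form.
using LaneMask = std::array<std::uint32_t, 4>;

// Vector-to-mask: bit i of the result is the top bit of lane i,
// which is the packing movmskps performs on four 32-bit lanes.
std::uint8_t packMask(const LaneMask &lanes) {
  std::uint8_t bits = 0;
  for (int i = 0; i < 4; ++i)
    bits |= static_cast<std::uint8_t>((lanes[i] >> 31) << i);
  return bits;
}

// Mask-to-vector: spread bit i back out to an all-ones or all-zeros lane.
LaneMask expandMask(std::uint8_t bits) {
  assert(bits < 16 && "only the low four bits are meaningful for v4i1");
  LaneMask lanes{};
  for (int i = 0; i < 4; ++i)
    lanes[i] = 0u - ((bits >> i) & 1u);
  return lanes;
}

// Byte-per-i1 layout: one byte holding 0 or 1 for each element,
// as opposed to the packed layout where all four share one byte.
std::array<std::uint8_t, 4> toBytePerElement(std::uint8_t bits) {
  std::array<std::uint8_t, 4> bytes{};
  for (int i = 0; i < 4; ++i)
    bytes[i] = static_cast<std::uint8_t>((bits >> i) & 1u);
  return bytes;
}
```

A pack immediately followed by an expand (or the other way around) is an identity on
well-formed masks, which is the kind of pair I'd hope a peephole could cancel.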

Hideki Saito
Vectorizer Technical Lead
Intel Compiler and Languages