Question about VectorLegalizer::ExpandStore() with v4i1

Hi All,

I have run into a problem with VectorLegalizer::ExpandStore() when
storing a v4i1.

Here is an example.

* LLVM IR
store <4 x i1> %edgeMask_for.body1314, <4 x i1>* %27

* SelectionDAG before vector legalization
ch = store<ST1[%16](align=4), trunc to v4i1> t0, t128, t32, undef:i64

* SelectionDAG after vector legalization
ch = store<ST1[%16](align=4), trunc to i1> t0, t133, t32, undef:i64
  t133: i32 = extract_vector_elt t128, Constant:i64<0>
ch = store<ST1[%16](align=4), trunc to i1> t0, t136, t32, undef:i64
  t136: i32 = extract_vector_elt t128, Constant:i64<1>
ch = store<ST1[%16](align=4), trunc to i1> t0, t139, t32, undef:i64
  t139: i32 = extract_vector_elt t128, Constant:i64<2>
ch = store<ST1[%16](align=4), trunc to i1> t0, t142, t32, undef:i64
  t142: i32 = extract_vector_elt t128, Constant:i64<3>

As the SelectionDAG above shows, when the backend decides to expand a
vector store of v4i1, the vector legalizer generates four stores to the
same destination address. I think ExpandStore() needs to handle
non-byte-addressable element types the way ExpandLoad() does; when I
look at ExpandLoad(), it handles this case. If I were implementing a new
backend, I could have used custom lowering to avoid the problem, but I
am using the x86_64 target and it generates the code above. What do you
think? If I have missed something, please let me know.
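
To make the effect concrete, here is a small stand-alone C++ sketch (not
LLVM code). It models a truncating i1 store as a write of 0 or 1 to a
byte of toy memory. The names Lanes, Memory, storeI1 and the expand*
functions are hypothetical, and the two alternative layouts are only
assumptions about what a fixed expansion could look like.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<bool, 4>;           // stands in for <4 x i1>
using Memory = std::array<std::uint8_t, 4>;  // four bytes of toy memory

// Toy truncating i1 store: overwrites the whole byte at `addr` with 0 or 1.
inline void storeI1(Memory &mem, std::size_t addr, bool value) {
  assert(addr < mem.size() && "store out of bounds");
  mem[addr] = value ? 1 : 0;
}

// What the dump shows: every lane is stored to the same address, so each
// store overwrites the previous one and only the last lane survives.
inline Memory expandToSameAddress(const Lanes &lanes) {
  Memory mem{};
  for (bool lane : lanes)
    storeI1(mem, 0, lane);
  return mem;
}

// Alternative 1 (assumed layout): one byte per lane, address advances.
inline Memory expandBytePerLane(const Lanes &lanes) {
  Memory mem{};
  for (std::size_t i = 0; i < lanes.size(); ++i)
    storeI1(mem, i, lanes[i]);
  return mem;
}

// Alternative 2 (assumed layout): pack lane i into bit i, then store once.
inline Memory expandPacked(const Lanes &lanes) {
  std::uint8_t packed = 0;
  for (std::size_t i = 0; i < lanes.size(); ++i)
    if (lanes[i])
      packed |= static_cast<std::uint8_t>(1u << i);
  Memory mem{};
  mem[0] = packed;
  return mem;
}
```

For lanes {1, 0, 1, 0}, the same-address version keeps only the last
lane, while the packed version writes the single byte 0b0101.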

Thanks,
JinGu Kang

Hi All,

Could someone please comment on whether the behavior in my question is
wrong or not?

JinGu,

Your analysis is correct: vectors of i1 are incorrectly legalized.
This is a known issue (http://llvm.org/PR22603); the tricky part about
fixing it is the need to settle on a memory layout for these vectors
(packed vs byte per i1; packed would be compatible with AVX512, I
think).

-Ahmed

Thanks Ahmed.

Hi, Ahmed.

A packed representation, one bit per i1, is natural and best for our
work, for sure. In the Parabix project, we produced very fast text
and byte stream processing applications using packed bit streams,
stored 128 bits at a time for SSE/Neon/Altivec registers, 256 bits at
a time for AVX, and 512 bits at a time for AVX512.

I also think that the one bit per i1 approach is best and most consistent
overall. Vectors are not arrays. Vectors are intended to be treated
as single values. Whereas an array of i1 could reasonably be viewed as
an array of bytes, a vector of i1 should be packed.

The use of vector types in general should signify that efficient loading,
storing and manipulating of vectors is more important than manipulation of
individual elements. The entire point is to provide a natural model for
SIMD instruction sets, it seems to me.

As you say, the packed representation makes a lot of sense for AVX512.
But even the existing SSE and AVX instruction sets use a packed representation
in many cases. For example, the SSE operation movmskps produces a 4xi1
and pmovmskb produces 16xi1, both in packed form. In addition, any
icmp or fcmp operation can be easily implemented using two instructions
to produce packed i1 values. Our software relies on this packed
representation extensively.
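
For readers unfamiliar with these instructions, here is a portable
scalar model in C++ of what they compute. It is only a sketch and does
not use the real intrinsics: packTopBits and the *Model functions are
hypothetical names, and the code assumes a 32-bit IEEE float. cmpLtMask
models the "compare, then movmsk" pair, where the compare leaves all
ones in each lane that passes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Collect the top bit of each unsigned lane into a mask (lane i -> bit i).
template <typename T>
std::uint32_t packTopBits(const T *lanes, std::size_t n) {
  assert(n <= 32 && "mask must fit in 32 bits");
  std::uint32_t mask = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::uint32_t top =
        static_cast<std::uint32_t>(lanes[i] >> (sizeof(T) * 8 - 1)) & 1u;
    mask |= top << i;
  }
  return mask;
}

// Scalar model of movmskps: sign bit of each of 4 floats -> 4-bit mask.
inline std::uint32_t movmskpsModel(const std::array<float, 4> &v) {
  static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");
  std::array<std::uint32_t, 4> bits{};
  std::memcpy(bits.data(), v.data(), sizeof(bits));
  return packTopBits(bits.data(), bits.size());
}

// Scalar model of pmovmskb: top bit of each of 16 bytes -> 16-bit mask.
inline std::uint32_t pmovmskbModel(const std::array<std::uint8_t, 16> &v) {
  return packTopBits(v.data(), v.size());
}

// Model of a two-instruction compare: a lane-wise "less than" that yields
// all-ones or all-zeros lanes, followed by the movmskps model.
inline std::uint32_t cmpLtMask(const std::array<float, 4> &a,
                               const std::array<float, 4> &b) {
  std::array<std::uint32_t, 4> lanes{};
  for (std::size_t i = 0; i < lanes.size(); ++i)
    lanes[i] = a[i] < b[i] ? 0xFFFFFFFFu : 0u;
  return packTopBits(lanes.data(), lanes.size());
}
```

In each case the result is already a packed vector of i1: one bit per
lane, with lane 0 in the least significant bit.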