Altivec vs the type legalizer

PPC Altivec supports vector type v16i8 (and others) where the element type is not legal (in llvm's implementation). When we have a BUILD_VECTOR of these types with constant elements, LegalizeTypes first promotes the element types to i32, then builds a constant pool entry of type v16i32. This is wrong. I can fix it by truncating the elements back to i8 in ExpandBUILD_VECTOR. Does this seem like the right approach? I ask because we'll be relying on ConstantVector::get and getConstantPool to work even with elements of a type that's illegal for the target; currently, they do. (And I see no other way to fix it except to break the vector into scalars, which produces horrendous code.) For example, this test case triggers the problem:

typedef vector unsigned char vuint8_t;
vuint8_t baz;
void foo(vuint8_t x) {
   vuint8_t temp = (vuint8_t)(22,21,20, 3, 25,24,23, 3, 28,27,26, 3, 31,30,29, 3);
   baz = vec_add(x, temp);
}
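
As a rough sketch of the truncation idea (hypothetical code, not the actual patch; header paths and signatures here follow newer LLVM trees and differ in older revisions), the point is simply that the constant-pool entry gets rebuilt from i8 constants, so ConstantVector::get sees the vector's real element type rather than the promoted i32 operands:

#include "llvm/IR/Constants.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Type.h"
#include <cstdint>
#include <vector>
using namespace llvm;

// Hypothetical helper: ElementVals holds the zero-extended values of the
// promoted i32 BUILD_VECTOR operands. Each value is truncated back to i8 and
// a genuine <16 x i8> constant is built for the constant pool.
static Constant *buildTruncatedVector(LLVMContext &Ctx,
                                      const std::vector<uint64_t> &ElementVals) {
  Type *EltTy = Type::getInt8Ty(Ctx);
  std::vector<Constant *> Elts;
  for (uint64_t V : ElementVals)
    Elts.push_back(ConstantInt::get(EltTy, V & 0xFF)); // explicit i32 -> i8 truncation
  return ConstantVector::get(Elts);
}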

Hi Dale,

PPC Altivec supports vector type v16i8 (and others) where the element type is not legal (in llvm's implementation). When we have a BUILD_VECTOR of these types with constant elements, LegalizeTypes first promotes the element types to i32, then builds a constant pool entry of type v16i32.

are you sure? I would expect it to build v4i32.

Ciao,

Duncan.

Yes, I'm sure. Try it.

(Unless somebody's fixed it in the last week or so; I'm a bit out of date.)

Earlier this year we ran into a similar problem for NEON and ended up modifying the type rules for BUILD_VECTOR so that the operand types do not need to match the element types. It is possible that what you are seeing is fallout from that change.

With the "new" BUILD_VECTOR rules, you can have a v16i8 BUILD_VECTOR with 16 operands of type i32, and the operands are implicitly truncated to i8. Before we made that change, the type legalizer would translate to a legal BUILD_VECTOR (for the v16i8 case, for example) by using a bunch of shift-and-mask operations to pack 4 elements into each i32 value. When the target (like NEON) can directly support the smaller element types, it is a big mess to try to undo the shifting and masking to figure out what the original i8 operands were before type legalization.

ExpandBUILD_VECTOR seems like the right place to handle this, but I would not do it by operating on illegal types. We should add code there to combine the individual elements into values of the smallest legal integer type (e.g., for v16i8, pack 4 operands into each i32 value) and then bitcast the result. I think this used to happen upstream in type legalization, and now some more work is needed later on.
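
For reference, the packing described above comes down to arithmetic like the following (a standalone C++ sketch of the values involved, not SelectionDAG code; the lane order assumes a big-endian layout, as in the datalayout strings later in this thread, and would be reversed on a little-endian target):

#include <cstdint>
#include <cstdio>

int main() {
  // The sixteen i8 constants from the original test case.
  const uint8_t Elts[16] = {22, 21, 20, 3, 25, 24, 23, 3,
                            28, 27, 26, 3, 31, 30, 29, 3};
  // Pack four i8 lanes into each i32 word with shifts and ORs; the v4i32
  // result would then be bitcast back to v16i8.
  for (int W = 0; W < 4; ++W) {
    uint32_t Word = 0;
    for (int L = 0; L < 4; ++L)
      Word = (Word << 8) | Elts[4 * W + L];
    printf("word %d = 0x%08x\n", W, Word); // e.g. word 0 = 0x16151403
  }
  return 0;
}

The awkward problem mentioned above is the reverse direction: once only these packed i32 words are left after legalization, recovering the original i8 operands requires undoing all the shifting and masking.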

Hi Dale, I think Bob is right: the type legalizer shouldn't be turning v16i8
into v16i32, what should happen is that the return type of the BUILD_VECTOR
continues to be v16i8, but the type of the operands changes to i32, so you
end up with a BUILD_VECTOR that takes 16 lots of i32, and produces a v16i8.
The target then has all the info it needs to produce the best code, but needs
to be careful not to use the operand type (i32) when it really wants the vector
element type (i8).
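
As a small illustration of that caveat (a hypothetical snippet using the usual SelectionDAG accessors; exact type names and signatures vary across LLVM revisions), target lowering code should take the element type from the result type of the BUILD_VECTOR, not from its operands:

#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/ValueTypes.h"
using namespace llvm;

// Hypothetical inspection of a v16i8 BUILD_VECTOR after type legalization.
static void inspectBuildVector(SDValue Op) {
  EVT VT = Op.getValueType();                 // v16i8
  EVT EltVT = VT.getVectorElementType();      // i8  -- the type lowering wants
  EVT OpVT = Op.getOperand(0).getValueType(); // i32 -- the promoted operand type
  (void)EltVT; (void)OpVT;
  // Constants destined for the constant pool (or for immediates) should be
  // built with EltVT, not OpVT.
}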

While Bob describes this as being new rules, IIRC the old type legalizer used
to handle BUILD_VECTOR with an illegal element type needing promotion the same
way: it just promoted the operands, resulting in a mismatch between the operand
type and the vector element type (unlike in the bad old days, I believe this is
now documented as being allowed, in the description of the BUILD_VECTOR node).
However, in the case of PPC, I think the PPC code grabbed the BUILD_VECTOR and
transformed it before it hit this logic in the old type legalizer. Now that
type legalization is being done first, this probably got reversed: first the
type legalizer logic transforms the BUILD_VECTOR, and only later does the PPC
code get to do its stuff.

Ciao,

Duncan.

Hi Dale, I think Bob is right: the type legalizer shouldn't be turning v16i8
into v16i32, what should happen is that the return type of the BUILD_VECTOR
continues to be v16i8, but the type of the operands changes to i32, so you
end up with a BUILD_VECTOR that takes 16 lots of i32, and produces a v16i8.

It does that.

The target then has all the info it needs to produce the best code, but needs
to be careful not to use the operand type (i32) when it really wants the vector
element type (i8).

I don't think it's target-dependent. This is also broken on NEON; the breakage is introduced when lowering a BUILD_VECTOR to a load from a ConstantPool, and the call that builds the ConstantPool entry does not currently pass enough information to DTRT; it just passes a vector of i32's. Try the following with -march=arm -mattr=+neon. (It is possible that there's no way to get the FE to generate this on NEON, however.)

; ModuleID = 'small.c'
target datalayout = "E-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f128:64:128-n32"

@baz = common global <16 x i8> zeroinitializer ; <<16 x i8>*> [#uses=1]

define void @foo(<16 x i8> %x) nounwind ssp {
entry:
   %x_addr = alloca <16 x i8> ; <<16 x i8>*> [#uses=2]
   %temp = alloca <16 x i8> ; <<16 x i8>*> [#uses=2]
   %"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
   store <16 x i8> %x, <16 x i8>* %x_addr
   store <16 x i8> <i8 22, i8 21, i8 20, i8 3, i8 25, i8 24, i8 23, i8 3, i8 28, i8 27, i8 26, i8 3, i8 31, i8 30, i8 29, i8 3>, <16 x i8>* %temp, align 16
   %0 = load <16 x i8>* %x_addr, align 16 ; <<16 x i8>> [#uses=1]
   %1 = load <16 x i8>* %temp, align 16 ; <<16 x i8>> [#uses=1]
   %tmp = add <16 x i8> %0, %1 ; <<16 x i8>> [#uses=1]
   store <16 x i8> %tmp, <16 x i8>* @baz, align 16
   br label %return

return: ; preds = %entry
   ret void
}
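
(That is, save the module as, say, small.ll and run something like "llc -march=arm -mattr=+neon small.ll -o -" to see the bad lowering.)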

To make things more concrete, here is the patch I was trying out:

Index: lib/CodeGen/SelectionDAG/LegalizeDAG.cpp