Splat mask in shuffle vector

Hi all,
I am confused about the definition of a splat mask in shuffle vector. I searched llvm-dev for topics on splat, and the most useful one was: What is “splat” in BUILD_VECTOR?. From that topic I understand that splat means all elements are the same. But when I read the code of isSplatMask in SelectionDAG.cpp, I noticed that the implementation treats <i32 undef, i32 2, i32 2, i32 undef> as a splat mask. With this implementation, the vector_shuffle is combined into a build_vector or lowered to a target-specific instruction (for example, vdup on ARM).
Here is a simple example:

test.cl

int8 test(int r) {
  int8 b;
  b.s762 = r;
  return b;
}

IR with O1:

define <8 x i32> @test(i32 %r) {
entry:
  %splat.splatinsert = insertelement <3 x i32> undef, i32 %r, i32 0
  %0 = shufflevector <3 x i32> %splat.splatinsert, <3 x i32> undef, <8 x i32> <i32 undef, i32 undef, i32 0, i32 undef, i32 undef, i32 undef, i32 0, i32 0>
  ret <8 x i32> %0
}

log

Combining: t12: v9i32 = vector_shuffle<u,u,0,u,u,u,0,0,u> t20, undef:v9i32
Creating new node: t21: v9i32 = BUILD_VECTOR t4, t4, t4, t4, t4, t4, t4, t4, t4
… into: t21: v9i32 = BUILD_VECTOR t4, t4, t4, t4, t4, t4, t4, t4, t4

ASM

dup.32 q8, r1

According to the code and logs, every element of b will be r. This is not what I expected.
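
The behaviour described above can be modelled in a few lines. This is a standalone sketch, not the LLVM source: undef is assumed to be encoded as -1, and the names kUndef and isLenientSplatMask are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumption for this sketch: an undef mask element is encoded as -1.
constexpr int kUndef = -1;

// Returns true if every *defined* lane selects the same source element.
// Undef lanes are skipped, so they never disqualify a mask.
bool isLenientSplatMask(const std::vector<int> &mask) {
  assert(!mask.empty() && "mask must have at least one lane");
  int splatIdx = kUndef;
  for (int idx : mask) {
    if (idx < 0)
      continue;             // undef lane: compatible with any splat index
    if (splatIdx == kUndef)
      splatIdx = idx;       // first defined lane fixes the splat index
    else if (idx != splatIdx)
      return false;         // two defined lanes disagree
  }
  return true;              // also covers the all-undef case
}
```

Under this model, both <u,2,2,u> and the <u,u,0,u,u,u,0,0> mask from the IR above count as splat masks, while <0,1,0,0> does not.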

I tried changing the implementation of isSplatMask to test only Mask[i] == Mask[0], and more than 30 check-llvm tests failed:
LLVM :: CodeGen/AArch64/arm64-neon-copy.ll
LLVM :: CodeGen/AArch64/arm64-vmul.ll
LLVM :: CodeGen/AArch64/dag-combine-trunc-build-vec.ll
LLVM :: CodeGen/AArch64/expand-select.ll
LLVM :: CodeGen/AArch64/mul_by_elt.ll
LLVM :: CodeGen/AArch64/neon-scalar-copy.ll
LLVM :: CodeGen/AArch64/trunc-v1i64.ll
LLVM :: CodeGen/AArch64/vecreduce-fmax-legalization-nan.ll
LLVM :: CodeGen/ARM/2009-11-02-NegativeLane.ll
LLVM :: CodeGen/ARM/vdup.ll
LLVM :: CodeGen/ARM/vzip.ll
LLVM :: CodeGen/PowerPC/qpx-bv-sint.ll
LLVM :: CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
LLVM :: CodeGen/Thumb2/mve-shufflemov.ll
LLVM :: CodeGen/Thumb2/mve-vecreduce-fminmax.ll
LLVM :: CodeGen/Thumb2/mve-vecreduce-loops.ll
LLVM :: CodeGen/Thumb2/mve-vld3.ll
LLVM :: CodeGen/Thumb2/mve-vld4.ll
LLVM :: CodeGen/Thumb2/mve-vst3.ll
LLVM :: CodeGen/X86/haddsub-shuf.ll
LLVM :: CodeGen/X86/insertelement-duplicates.ll
LLVM :: CodeGen/X86/pr42905.ll
LLVM :: CodeGen/X86/pr46189.ll
LLVM :: CodeGen/X86/shuffle-of-splat-multiuses.ll
LLVM :: CodeGen/X86/split-extend-vector-inreg.ll
LLVM :: CodeGen/X86/sse3.ll
LLVM :: CodeGen/X86/trunc-subvector.ll
LLVM :: CodeGen/X86/var-permute-512.ll
LLVM :: CodeGen/X86/vector-narrow-binop.ll
LLVM :: CodeGen/X86/vector-shift-ashr-sub128.ll
LLVM :: CodeGen/X86/vector-shift-lshr-sub128.ll
LLVM :: CodeGen/X86/vector-shift-shl-sub128.ll
LLVM :: CodeGen/X86/vector-shuffle-128-v16.ll
LLVM :: CodeGen/X86/vector-shuffle-128-v4.ll
LLVM :: CodeGen/X86/vector-shuffle-combining-avx2.ll
LLVM :: CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
LLVM :: CodeGen/X86/vector-zext.ll
LLVM :: CodeGen/X86/vshift-4.ll
LLVM :: CodeGen/X86/widen_shuffle-1.ll
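
For reference, the stricter check I tried looks roughly like the sketch below. It is a standalone model rather than the actual patch: undef is assumed to be encoded as -1, and isStrictSplatMask is a name chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Strict variant: every lane must equal lane 0, so an undef lane (-1 in this
// sketch) only matches other undef lanes.
bool isStrictSplatMask(const std::vector<int> &mask) {
  assert(!mask.empty() && "mask must have at least one lane");
  for (int idx : mask)
    if (idx != mask[0])
      return false;         // any lane differing from lane 0 rejects the mask
  return true;
}
```

With this check, <2,2,2,2> is still a splat mask, but <u,2,2,u> is rejected because lane 0 is undef and lane 1 is not.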

Clearly, this implementation has become the de facto definition, but I don’t think it is accurate. It is relatively easy to fix the back-end code and the test cases, but I am worried that applications that depend on this definition would be affected by my change.

So, what’s your opinion?

What do you expect though? Your code has basically undefined behavior (at least under C/C++ semantics, I’m not sure about OpenCL): you’re reading undefined memory when you return the entire vector value without setting every element.

From the point of view of LLVM: undef is “any value”, so the codegen chose to splat the same value everywhere because it is convenient and because undef allows it.
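
One way to picture this is as a refinement check: a concrete result is acceptable if it agrees with the shuffle on every defined lane, and undef lanes accept anything. The sketch below is only an illustration of that idea; the Lane alias and the refines helper are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// A lane is either a known value or undef (std::nullopt).
using Lane = std::optional<int>;

// Returns true if `concrete` is an allowed outcome for `lanes`:
// defined lanes must match exactly, undef lanes accept any value.
bool refines(const std::vector<Lane> &lanes, const std::vector<int> &concrete) {
  assert(lanes.size() == concrete.size() && "lane counts must match");
  for (std::size_t i = 0; i < concrete.size(); ++i)
    if (lanes[i] && *lanes[i] != concrete[i])
      return false;         // a defined lane was changed
  return true;
}
```

For the example in this thread with r = 7, the shuffle result is <u,u,7,u,u,u,7,7>, and a vector of eight 7s passes this check, so splatting r into every lane is one of the permitted outcomes.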

Hi mehdi_amini,

Thanks for your reply.

I know my code has undefined behavior. I have noticed that most vectors with a splat mask mixed with undef are temporary variables. It is hard to produce realistic IR that shows what I mean, so I used this code instead.

I accept your point: it is convenient and undef allows it. But I have a further concern:
Assume an architecture A has a broadcast instruction with a predication mask (such architectures do exist, more than one among AI chips). A shuffle with a splat mask mixed with undef can then be lowered to this broadcast instruction. But if the vector is just a temporary variable, this custom lowering may break the current generic codegen. If we run the generic codegen first, we lose the opportunity, in some situations, to custom lower to the broadcast with a predication mask.
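
To make the concern concrete: such a target would want to know which lanes of the mask are actually defined, so that only those lanes are written. A minimal sketch, assuming undef is encoded as -1 and at most 64 lanes; definedLanePredicate is a hypothetical helper, not part of LLVM or of any real target's lowering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a per-lane predicate for a hypothetical predicated broadcast:
// bit i is set when lane i of the shuffle mask is defined (index >= 0).
std::uint64_t definedLanePredicate(const std::vector<int> &mask) {
  assert(mask.size() <= 64 && "sketch supports at most 64 lanes");
  std::uint64_t pred = 0;
  for (std::size_t i = 0; i < mask.size(); ++i)
    if (mask[i] >= 0)
      pred |= std::uint64_t{1} << i;  // lane i must receive the splat value
  return pred;
}
```

For the mask <u,u,0,u,u,u,0,0> this sets bits 2, 6 and 7 (0xC4). Once the generic combine has rewritten the shuffle into a full BUILD_VECTOR splat, this per-lane information is no longer visible.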

There may be some complex way to resolve this problem, but I am not sure, so I am raising the question: should we, or can we, restrict the definition of a splat mask?

I don’t quite get what you mean. Are you saying that you have HW where it would be more efficient to not splat when you have the insertelement + shufflevector you showed as an example?
There are a few ways to handle it. One would be to match the insertelement + shufflevector pattern in the codegen prepare phase (before SelectionDAG) and turn it into an intrinsic. Another would be to match it in SelectionDAG itself with a target-specific pattern: there are multiple phases in SelectionDAG, and targets can catch such patterns before the generic ones do (if I remember correctly).

Hi mehdi_amini,

Yes, that’s what I mean.
I will follow your suggestion and do some investigation. Thanks!