Can LLVM vectorize <2 x i32> type

Hi,

Is LLVM be able to generate code for the following code?

%mul = mul <2 x i32> %1, %2, where %1 and %2 are <2 x i32> type.

I am running it on a Haswell processor with LLVM-3.4.2. It seems that it will generates really complicated code with vpaddq, vpmuludq, vpsllq, vpsrlq.

Thanks,
Zhi

Is LLVM be able to generate code for the following code?

%mul = mul <2 x i32> %1, %2, where %1 > and %2 are <2 x i32> type.

I am running it on a Haswell processor with LLVM-3.4.2. It seems that it will generates really complicated code with vpaddq, vpmuludq, vpsllq, vpsrlq.

Can you please elaborate more on what is your test case and what do you want to see the final output? It will be good if you can give test case you are running LLVM on.

Regards,
Suyog Sarda

For example, I have the following IR code,

for.cond.preheader: ; preds = %if.end18
%mul = mul i32 %12, %3
%cmp21128 = icmp sgt i32 %mul, 0
br i1 %cmp21128, label %for.body.preheader, label %return

for.body.preheader: ; preds = %for.cond.preheader
%19 = mul i32 %12, %3
%20 = add i32 %19, -1
%21 = zext i32 %20 to i64
%22 = add i64 %21, 1
%end.idx = add i64 %21, 1
%n.vec = and i64 %22, 8589934584
%cmp.zero = icmp eq i64 %n.vec, 0
br i1 %cmp.zero, label %middle.block, label %vector.ph

The corresponding assembly code is:

BB#3: # %for.cond.preheader

imull %r9d, %ebx
testl %ebx, %ebx
jle .LBB10_63

BB#4: # %for.body.preheader

leal -1(%rbx), %eax
incq %rax
xorl %edx, %edx
movabsq $8589934584, %rcx # imm = 0x1FFFFFFF8
andq %rax, %rcx
je .LBB10_8

I changed all the scalar operands to <2 x ValueType> ones. The IR becomes the following

for.cond.preheader: ; preds = %if.end18
%mulS44_D = mul <2 x i32> %splatLDS24_D.splat, %splatLDS7_D.splat
%cmp21128S45_D = icmp sgt <2 x i32> %mulS44_D, zeroinitializer
%sextS46_D = sext <2 x i1> %cmp21128S45_D to <2 x i64>
%BCS46_D = bitcast <2 x i64> %sextS46_D to i128
%mskS46_D = icmp ne i128 %BCS46_D, 0
br i1 %mskS46_D, label %for.body.preheader, label %return

for.body.preheader: ; preds = %for.cond.preheader
%S47_D = mul <2 x i32> %splatLDS24_D.splat, %splatLDS7_D.splat
%S48_D = add <2 x i32> %S47_D, <i32 -1, i32 -1>
%S49_D = zext <2 x i32> %S48_D to <2 x i64>
%S50_D = add <2 x i64> %S49_D, <i64 1, i64 1>
%end.idxS51_D = add <2 x i64> %S49_D, <i64 1, i64 1>

%n.vecS52_D = and <2 x i64> %S50_D, <i64 8589934584, i64 8589934584>

%cmp.zeroS53_D = icmp eq <2 x i64> %n.vecS52_D, zeroinitializer
%sextS54_D = sext <2 x i1> %cmp.zeroS53_D to <2 x i64>
%BCS54_D = bitcast <2 x i64> %sextS54_D to i128
%mskS54_D = icmp ne i128 %BCS54_D, 0
br i1 %mskS54_D, label %middle.block, label %vector.ph

Now the assembly for the above IR code is:

BB#4: # %for.cond.preheader

vmovdqa 144(%rsp), %xmm0 # 16-byte Reload
vpmuludq %xmm7, %xmm0, %xmm2
vpsrlq $32, %xmm7, %xmm4
vpmuludq %xmm4, %xmm0, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpsrlq $32, %xmm0, %xmm4
vpmuludq %xmm7, %xmm4, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpextrq $1, %xmm2, %rax
cltq
vmovq %rax, %xmm4
vmovq %xmm2, %rax
cltq
vmovq %rax, %xmm5
vpunpcklqdq %xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0]
vpcmpgtq %xmm3, %xmm4, %xmm3
vptest %xmm3, %xmm3
je .LBB10_66

BB#5: # %for.body.preheader

vpaddq %xmm15, %xmm2, %xmm3
vpand %xmm15, %xmm3, %xmm3
vpaddq .LCPI10_1(%rip), %xmm3, %xmm8
vpand .LCPI10_5(%rip), %xmm8, %xmm5
vpxor %xmm4, %xmm4, %xmm4
vpcmpeqq %xmm4, %xmm5, %xmm6
vptest %xmm6, %xmm6
jne .LBB10_9

It turned out that the vector one is way more complicated than the scalar one. I was expecting that it would be not so tedious.

For example, I have the following IR code,

for.cond.preheader: ; preds = %if.end18
%mul = mul i32 %12, %3
%cmp21128 = icmp sgt i32 %mul, 0
br i1 %cmp21128, label %for.body.preheader, label %return

for.body.preheader: ; preds = %for.cond.preheader
%19 = mul i32 %12, %3
%20 = add i32 %19, -1
%21 = zext i32 %20 to i64
%22 = add i64 %21, 1
%end.idx = add i64 %21, 1
%n.vec = and i64 %22, 8589934584
%cmp.zero = icmp eq i64 %n.vec, 0
br i1 %cmp.zero, label %middle.block, label %vector.ph

The corresponding assembly code is:

BB#3: # %for.cond.preheader

imull %r9d, %ebx
testl %ebx, %ebx
jle .LBB10_63

BB#4: # %for.body.preheader

leal -1(%rbx), %eax
incq %rax
xorl %edx, %edx
movabsq $8589934584, %rcx # imm = 0x1FFFFFFF8
andq %rax, %rcx
je .LBB10_8

I changed all the scalar operands to <2 x ValueType> ones. The IR becomes the following
for.cond.preheader: ; preds = %if.end18
%mulS44_D = mul <2 x i32> %splatLDS24_D.splat, %splatLDS7_D.splat
%cmp21128S45_D = icmp sgt <2 x i32> %mulS44_D, zeroinitializer
%sextS46_D = sext <2 x i1> %cmp21128S45_D to <2 x i64>
%BCS46_D = bitcast <2 x i64> %sextS46_D to i128
%mskS46_D = icmp ne i128 %BCS46_D, 0
br i1 %mskS46_D, label %for.body.preheader, label %return

for.body.preheader: ; preds = %for.cond.preheader
%S47_D = mul <2 x i32> %splatLDS24_D.splat, %splatLDS7_D.splat
%S48_D = add <2 x i32> %S47_D, <i32 -1, i32 -1>
%S49_D = zext <2 x i32> %S48_D to <2 x i64>
%S50_D = add <2 x i64> %S49_D, <i64 1, i64 1>
%end.idxS51_D = add <2 x i64> %S49_D, <i64 1, i64 1>
%n.vecS52_D = and <2 x i64> %S50_D, <i64 8589934584, i64 8589934584>
%cmp.zeroS53_D = icmp eq <2 x i64> %n.vecS52_D, zeroinitializer
%sextS54_D = sext <2 x i1> %cmp.zeroS53_D to <2 x i64>
%BCS54_D = bitcast <2 x i64> %sextS54_D to i128
%mskS54_D = icmp ne i128 %BCS54_D, 0
br i1 %mskS54_D, label %middle.block, label %vector.ph

Now the assembly for the above IR code is:

BB#4: # %for.cond.preheader

vmovdqa 144(%rsp), %xmm0 # 16-byte Reload
vpmuludq %xmm7, %xmm0, %xmm2
vpsrlq $32, %xmm7, %xmm4
vpmuludq %xmm4, %xmm0, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpsrlq $32, %xmm0, %xmm4
vpmuludq %xmm7, %xmm4, %xmm4
vpsllq $32, %xmm4, %xmm4
vpaddq %xmm4, %xmm2, %xmm2
vpextrq $1, %xmm2, %rax
cltq
vmovq %rax, %xmm4
vmovq %xmm2, %rax
cltq
vmovq %rax, %xmm5
vpunpcklqdq %xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0]
vpcmpgtq %xmm3, %xmm4, %xmm3
vptest %xmm3, %xmm3
je .LBB10_66

BB#5: # %for.body.preheader

vpaddq %xmm15, %xmm2, %xmm3
vpand %xmm15, %xmm3, %xmm3
vpaddq .LCPI10_1(%rip), %xmm3, %xmm8
vpand .LCPI10_5(%rip), %xmm8, %xmm5
vpxor %xmm4, %xmm4, %xmm4
vpcmpeqq %xmm4, %xmm5, %xmm6
vptest %xmm6, %xmm6
jne .LBB10_9

As Mats pointed out, this may be the same problem as:
https://llvm.org/bugs/show_bug.cgi?id=22703

Basically, the code is generated for AVX2 where register XMM are 128 bits. Some of the above ops are <2 x i32> and involve sext to <2 x i64>, bitcast, etc. Hence the code has extra vector instructions.

Regards,
Suyog Sarda

Thanks for pointing out. In this case, it seems that vectorization is not actually profitable. Doesn’t that mean we always need to sext i32 to i64 so that we can use the whole lanes in a XMM register? Thanks.