IR: extra vpcmpgtq in assembly when calculating min and max of two i64 vectors on Intel Haswell

As part of the implementation of a vectorized bitonic sorting network, I need to calculate the element-wise min and max of pairs of vectors. For i64 vectors on Intel Haswell, this can be computed as a comparison (vpcmpgtq) followed by two blends (vblendvpd).

However, the following LLVM IR

define <8 x i64> @julia_f_10618(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = icmp slt <4 x i64> %0, %1
  %3 = select <4 x i1> %2, <4 x i64> %0, <4 x i64> %1
  %4 = select <4 x i1> %2, <4 x i64> %1, <4 x i64> %0
  %5 = shufflevector <4 x i64> %3, <4 x i64> %4, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %5
}

compiles to

	vpcmpgtq	ymm2, ymm1, ymm0
	vblendvpd	ymm2, ymm1, ymm0, ymm2
	vpcmpgtq	ymm3, ymm0, ymm1
	vblendvpd	ymm1, ymm1, ymm0, ymm3
	vmovapd	ymm0, ymm2
	ret

which has an extra vpcmpgtq. Here is the full code in context on godbolt. (Using the smin and smax LLVM intrinsics yields the same result.)

I can use inline assembly to avoid the extra comparison, like so:

define <8 x i64> @julia_f_10634(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = call <4 x i64> asm "vpcmpgtq $2, $1, $0", "=x,x,x"(<4 x i64> %0, <4 x i64> %1) #3
  %3 = call <4 x i64> asm "vblendvpd $3, $2, $1, $0", "=x,x,x,x"(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) #3
  %4 = call <4 x i64> asm "vblendvpd $3, $2, $1, $0", "=x,x,x,x"(<4 x i64> %1, <4 x i64> %0, <4 x i64> %2) #3
  %5 = shufflevector <4 x i64> %3, <4 x i64> %4, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %5
}

Benchmarking shows this to be faster as suspected. But inline assembly is not acceptable as I need portable IR code.

Is there another way to write the IR to avoid the unneeded vpcmpgtq?

Looking at the debug ouput, we’ve managed to match the smin/smax ops:

Type-legalized selection DAG: %bb.0 'julia_f_10618:top'
SelectionDAG has 13 nodes:
  t0: ch = EntryToken
  t2: v4i64,ch = CopyFromReg t0, Register:v4i64 %0
  t4: v4i64,ch = CopyFromReg t0, Register:v4i64 %1
    t7: v4i64 = smin t2, t4
  t14: ch,glue = CopyToReg t0, Register:v4i64 $ymm0, t7
    t8: v4i64 = smax t2, t4
  t16: ch,glue = CopyToReg t14, Register:v4i64 $ymm1, t8, t14:1
  t17: ch = X86ISD::RET_FLAG t16, TargetConstant:i32<0>, Register:v4i64 $ymm0, Register:v4i64 $ymm1, t16:1

which regrettably expand later to:

Optimized legalized selection DAG: %bb.0 'julia_f_10618:top'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
  t2: v4i64,ch = CopyFromReg t0, Register:v4i64 %0
  t4: v4i64,ch = CopyFromReg t0, Register:v4i64 %1
      t25: v4i64 = X86ISD::PCMPGT t4, t2
    t24: v4i64 = vselect t25, t2, t4
  t14: ch,glue = CopyToReg t0, Register:v4i64 $ymm0, t24
      t21: v4i64 = X86ISD::PCMPGT t2, t4
    t20: v4i64 = vselect t21, t2, t4
  t16: ch,glue = CopyToReg t14, Register:v4i64 $ymm1, t20, t14:1
  t17: ch = X86ISD::RET_FLAG t16, TargetConstant:i32<0>, Register:v4i64 $ymm0, Register:v4i64 $ymm1, t16:1

I’ll try to make a fix as this is a pretty common pattern. In the meantime either of these appear to work:

define <8 x i64> @julia_f_10618_opt(<4 x i64> %0, <4 x i64> %1) #0 {
  %c01 = icmp slt <4 x i64> %0, %1
  %c10 = icmp slt <4 x i64> %1, %0
  %s0 = select <4 x i1> %c01, <4 x i64> %0, <4 x i64> %1
  %s1 = select <4 x i1> %c10, <4 x i64> %0, <4 x i64> %1
  %res = shufflevector <4 x i64> %s0, <4 x i64> %s1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %res
}

define <8 x i64> @julia_f_10618_minmax(<4 x i64> %0, <4 x i64> %1) {
  %min = tail call <4 x i64> @llvm.smin.v4i64(<4 x i64> %0, <4 x i64> %1)
  %max = tail call <4 x i64> @llvm.smax.v4i64(<4 x i64> %1, <4 x i64> %0)
  %res = shufflevector <4 x i64> %min, <4 x i64> %max, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %res
}
declare <4 x i64> @llvm.smin.v4i64(<4 x i64>, <4 x i64>)
declare <4 x i64> @llvm.smax.v4i64(<4 x i64>, <4 x i64>)
1 Like

Thank you very much for the explanation and the workaround! I especially appreciate that you point out what to look for in the debug output.