As part of the implementation of a vectorized bitonic sorting network, I need to calculate the element-wise min and max of pairs of vectors. For i64 vectors on Intel Haswell, this can be computed as a single comparison (`vpcmpgtq`) followed by two blends (`vblendvpd`).
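Concretely, the sequence I am aiming for looks something like this (Intel syntax; register assignments are illustrative, with the inputs in `ymm0` and `ymm1`). The point is that a single compare produces a mask that both blends can reuse:

```asm
vpcmpgtq  ymm2, ymm1, ymm0        ; mask = (ymm1 > ymm0), signed, per 64-bit lane
vblendvpd ymm3, ymm1, ymm0, ymm2  ; min: take ymm0 where mask set, else ymm1
vblendvpd ymm4, ymm0, ymm1, ymm2  ; max: take ymm1 where mask set, else ymm0
```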
However, the following LLVM IR
```llvm
define <8 x i64> @julia_f_10618(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = icmp slt <4 x i64> %0, %1                      ; single mask: %0 < %1
  %3 = select <4 x i1> %2, <4 x i64> %0, <4 x i64> %1 ; element-wise min
  %4 = select <4 x i1> %2, <4 x i64> %1, <4 x i64> %0 ; element-wise max
  %5 = shufflevector <4 x i64> %3, <4 x i64> %4, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %5
}
```
compiles to
```asm
vpcmpgtq  ymm2, ymm1, ymm0        ; mask: %1 > %0
vblendvpd ymm2, ymm1, ymm0, ymm2  ; min
vpcmpgtq  ymm3, ymm0, ymm1        ; second compare: %0 > %1, instead of reusing ymm2
vblendvpd ymm1, ymm1, ymm0, ymm3  ; max
vmovapd   ymm0, ymm2
ret
```
which has an extra `vpcmpgtq`: rather than reusing the first mask for the second blend, the backend materializes the swapped predicate with a second compare. Here is the full code in context on godbolt. (Using the `smin` and `smax` LLVM intrinsics yields the same result.)
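For reference, a minimal sketch of that intrinsic variant (the function name is illustrative); on this target it is lowered to the same compare-plus-blend pattern, so the extra compare remains:

```llvm
declare <4 x i64> @llvm.smin.v4i64(<4 x i64>, <4 x i64>)
declare <4 x i64> @llvm.smax.v4i64(<4 x i64>, <4 x i64>)

define <8 x i64> @minmax_intrinsic(<4 x i64> %0, <4 x i64> %1) {
top:
  %lo = call <4 x i64> @llvm.smin.v4i64(<4 x i64> %0, <4 x i64> %1) ; element-wise min
  %hi = call <4 x i64> @llvm.smax.v4i64(<4 x i64> %0, <4 x i64> %1) ; element-wise max
  %r = shufflevector <4 x i64> %lo, <4 x i64> %hi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %r
}
```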
I can use inline assembly to avoid the extra comparison, like so:
```llvm
define <8 x i64> @julia_f_10634(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  ; one compare (AT&T operand order: sources first, destination last), reused by both blends
  %2 = call <4 x i64> asm "vpcmpgtq $2, $1, $0", "=x,x,x"(<4 x i64> %0, <4 x i64> %1) #3    ; mask = %0 > %1
  %3 = call <4 x i64> asm "vblendvpd $3, $2, $1, $0", "=x,x,x,x"(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) #3 ; min
  %4 = call <4 x i64> asm "vblendvpd $3, $2, $1, $0", "=x,x,x,x"(<4 x i64> %1, <4 x i64> %0, <4 x i64> %2) #3 ; max
  %5 = shufflevector <4 x i64> %3, <4 x i64> %4, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i64> %5
}
```
Benchmarking shows this version to be faster, as expected. But inline assembly is not acceptable, since I need portable IR.
Is there another way to write the IR that avoids the unneeded `vpcmpgtq`?