I find that some sequences of shufflevector
s in LLVM IR compile into suboptimal instructions in part due to the shufflevector
s being combined. For example, on Intel Haswell, the following can be computed as a blend followed by a cross-lane shuffle:
define <4 x i64> @julia_f_19355(<4 x i64> %0, <4 x i64> %1) #0 {
top:
%2 = shufflevector <4 x i64> %0, <4 x i64> %1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
%3 = shufflevector <4 x i64> %2, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
ret <4 x i64> %3
}
But it actually compiles to
vperm2f128 ymm2, ymm0, ymm1, 33 # ymm2 = ymm0[2,3],ymm1[0,1]
vperm2f128 ymm0, ymm1, ymm0, 33 # ymm0 = ymm1[2,3],ymm0[0,1]
vblendps ymm0, ymm0, ymm2, 204 # ymm0 = ymm0[0,1],ymm2[2,3],ymm0[4,5],ymm2[6,7]
ret
which has an additional cross-lane shuffle. In the debug output, I see that
t5: v4i64 = vector_shuffle<0,5,6,3> t2, t4
t7: v4i64 = vector_shuffle<2,3,0,1> t5, undef:v4i64
is combined to
t13: v4i64 = vector_shuffle<2,7,4,1> t4, t2
which generates
t22: v4i64 = X86ISD::VPERM2X128 t4, t2, TargetConstant:i8<33>
t21: v4i64 = X86ISD::VPERM2X128 t2, t4, TargetConstant:i8<33>
t19: v4i64 = X86ISD::BLENDI t22, t21, TargetConstant:i8<10>
I can prevent the compiler from combining the shufflevector
s by inserting a freeze
instruction between them, like so:
define <4 x i64> @julia_f_19360(<4 x i64> %0, <4 x i64> %1) #0 {
top:
%2 = shufflevector <4 x i64> %0, <4 x i64> %1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
%3 = freeze <4 x i64> %2
%4 = shufflevector <4 x i64> %3, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
ret <4 x i64> %4
}
Which compiles to the intended sequence:
vblendps ymm0, ymm0, ymm1, 60 # ymm0 = ymm0[0,1],ymm1[2,3,4,5],ymm0[6,7]
vpermpd ymm0, ymm0, 78 # ymm0 = ymm0[2,3,0,1]
ret
But as I understand it this is not freeze
’s intended purpose.
Are there other ways in IR to selectively stop the compiler from combining shufflevector
s? Or ways to help it generate efficient code for shuffles?