IR: alternatives to `freeze` to selectively prevent compiler from combining shufflevectors

I find that some sequences of shufflevectors in LLVM IR compile into suboptimal instructions in part due to the shufflevectors being combined. For example, on Intel Haswell, the following can be computed as a blend followed by a cross-lane shuffle:

define <4 x i64> @julia_f_19355(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = shufflevector <4 x i64> %0, <4 x i64> %1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  %3 = shufflevector <4 x i64> %2, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x i64> %3
}

But it actually compiles to

	vperm2f128	ymm2, ymm0, ymm1, 33    # ymm2 = ymm0[2,3],ymm1[0,1]
	vperm2f128	ymm0, ymm1, ymm0, 33    # ymm0 = ymm1[2,3],ymm0[0,1]
	vblendps	ymm0, ymm0, ymm2, 204           # ymm0 = ymm0[0,1],ymm2[2,3],ymm0[4,5],ymm2[6,7]
	ret

which has an additional cross-lane shuffle. In the debug output, I see that

      t5: v4i64 = vector_shuffle<0,5,6,3> t2, t4
    t7: v4i64 = vector_shuffle<2,3,0,1> t5, undef:v4i64

is combined to

    t13: v4i64 = vector_shuffle<2,7,4,1> t4, t2

which generates

      t22: v4i64 = X86ISD::VPERM2X128 t4, t2, TargetConstant:i8<33>
      t21: v4i64 = X86ISD::VPERM2X128 t2, t4, TargetConstant:i8<33>
    t19: v4i64 = X86ISD::BLENDI t22, t21, TargetConstant:i8<10>

I can prevent the compiler from combining the shufflevectors by inserting a freeze instruction between them, like so:

define <4 x i64> @julia_f_19360(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = shufflevector <4 x i64> %0, <4 x i64> %1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  %3 = freeze <4 x i64> %2
  %4 = shufflevector <4 x i64> %3, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x i64> %4
}

Which compiles to the intended sequence:

	vblendps	ymm0, ymm0, ymm1, 60            # ymm0 = ymm0[0,1],ymm1[2,3,4,5],ymm0[6,7]
	vpermpd	ymm0, ymm0, 78                  # ymm0 = ymm0[2,3,0,1]
	ret

But as I understand it this is not freeze’s intended purpose.

Are there other ways in IR to selectively stop the compiler from combining shufflevectors? Or ways to help it generate efficient code for shuffles?

Looks like a missing pattern. X86 has lowerShuffleAsBlendAndPermute to perform this type of thing, but its usually only called when one of the inputs is already “in place”.

Should be fixed by:
https://github.com/llvm/llvm-project/commit/32162cf291d40b8ead01061ea68bcdbc79ba9573

1 Like

Thanks again!

But besides this case there are more complicated examples where I’d still like to stop the compiler from combining shufflevectors.

For instance, the bit-reversal permutation of 16 i64s can be computed with 4 cross-lane shuffles and 4 unpacks, as shown here (with freeze inserted), but without the freeze instruction it compiles to 8 cross-lane shuffles, 4 unpacks, and 4 blends. (This example applies only to LLVM 15-dev. On LLVM 14 these shuffles don’t seem to get combined so freeze is unnecessary.)

I’d really appreciate if you could raise each shuffle issue you encounter: https://github.com/llvm/llvm-project/issues/ so we can more easily keep track - thanks!

In the meantime, if you aren’t passing the IR through InstCombine, you can try (on AVX2+ targets) using the permd intrinsic for single op shuffles, so it only gets expanded after legalization:

call <8 x i32> @llvm.x86.avx2.permd(<8 x i32> %vec, <8 x i32> %mask)