IR: alternatives to `freeze` to selectively prevent compiler from combining shufflevectors

vyu · May 15, 2022, 12:24pm

I find that some sequences of shufflevectors in LLVM IR compile into suboptimal instructions in part due to the shufflevectors being combined. For example, on Intel Haswell, the following can be computed as a blend followed by a cross-lane shuffle:

define <4 x i64> @julia_f_19355(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = shufflevector <4 x i64> %0, <4 x i64> %1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  %3 = shufflevector <4 x i64> %2, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x i64> %3
}

But it actually compiles to

	vperm2f128	ymm2, ymm0, ymm1, 33    # ymm2 = ymm0[2,3],ymm1[0,1]
	vperm2f128	ymm0, ymm1, ymm0, 33    # ymm0 = ymm1[2,3],ymm0[0,1]
	vblendps	ymm0, ymm0, ymm2, 204           # ymm0 = ymm0[0,1],ymm2[2,3],ymm0[4,5],ymm2[6,7]
	ret

which has an additional cross-lane shuffle. In the debug output, I see that

      t5: v4i64 = vector_shuffle<0,5,6,3> t2, t4
    t7: v4i64 = vector_shuffle<2,3,0,1> t5, undef:v4i64

is combined to

    t13: v4i64 = vector_shuffle<2,7,4,1> t4, t2

which generates

      t22: v4i64 = X86ISD::VPERM2X128 t4, t2, TargetConstant:i8<33>
      t21: v4i64 = X86ISD::VPERM2X128 t2, t4, TargetConstant:i8<33>
    t19: v4i64 = X86ISD::BLENDI t22, t21, TargetConstant:i8<10>

I can prevent the compiler from combining the shufflevectors by inserting a freeze instruction between them, like so:

define <4 x i64> @julia_f_19360(<4 x i64> %0, <4 x i64> %1) #0 {
top:
  %2 = shufflevector <4 x i64> %0, <4 x i64> %1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  %3 = freeze <4 x i64> %2
  %4 = shufflevector <4 x i64> %3, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x i64> %4
}

Which compiles to the intended sequence:

	vblendps	ymm0, ymm0, ymm1, 60            # ymm0 = ymm0[0,1],ymm1[2,3,4,5],ymm0[6,7]
	vpermpd	ymm0, ymm0, 78                  # ymm0 = ymm0[2,3,0,1]
	ret

But as I understand it this is not freeze’s intended purpose.

Are there other ways in IR to selectively stop the compiler from combining shufflevectors? Or ways to help it generate efficient code for shuffles?

RKSimon · May 15, 2022, 1:04pm

Looks like a missing pattern. X86 has lowerShuffleAsBlendAndPermute to perform this type of thing, but its usually only called when one of the inputs is already “in place”.

RKSimon · May 15, 2022, 4:14pm

Should be fixed by:
https://github.com/llvm/llvm-project/commit/32162cf291d40b8ead01061ea68bcdbc79ba9573

vyu · May 16, 2022, 5:58am

Thanks again!

But besides this case there are more complicated examples where I’d still like to stop the compiler from combining shufflevectors.

For instance, the bit-reversal permutation of 16 i64s can be computed with 4 cross-lane shuffles and 4 unpacks, as shown here (with freeze inserted), but without the freeze instruction it compiles to 8 cross-lane shuffles, 4 unpacks, and 4 blends. (This example applies only to LLVM 15-dev. On LLVM 14 these shuffles don’t seem to get combined so freeze is unnecessary.)

RKSimon · May 16, 2022, 10:09am

I’d really appreciate if you could raise each shuffle issue you encounter: https://github.com/llvm/llvm-project/issues/ so we can more easily keep track - thanks!

In the meantime, if you aren’t passing the IR through InstCombine, you can try (on AVX2+ targets) using the permd intrinsic for single op shuffles, so it only gets expanded after legalization:

call <8 x i32> @llvm.x86.avx2.permd(<8 x i32> %vec, <8 x i32> %mask)

Topic		Replies	Views
LoopVectorizer: shufflevectors LLVM Dev List Archives	3	94	September 4, 2018
vector optimization LLVM Dev List Archives	2	82	May 14, 2010
IR canonicalization: shufflevector or vector trunc? LLVM Dev List Archives	11	110	January 26, 2017
Scalable vectors support for shuffles RISCV	0	488	March 8, 2022
IR canonicalization: vector select or shufflevector? LLVM Dev List Archives	6	79	September 17, 2016

IR: alternatives to `freeze` to selectively prevent compiler from combining shufflevectors

Related topics