IR canonicalization: shufflevector or vector trunc?

rotateright · January 12, 2017, 5:04pm

It’s time for another round of “What is the canonical IR?”

Credit for this episode to Zvi and PR31551.
https://llvm.org/bugs/show_bug.cgi?id=31551


define <4 x i16> @shuffle(<16 x i16> %x) {
  %shuf = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
  ret <4 x i16> %shuf
}

define <4 x i16> @trunc(<16 x i16> %x) {
  %bc = bitcast <16 x i16> %x to <4 x i64>
  %tr = trunc <4 x i64> %bc to <4 x i16>
  ret <4 x i16> %tr
}

Potential reasons to prefer one or the other:

Shuffle is the most compact.
Trunc is easier to read.
One of these is easier for value tracking.
Compatibility with existing IR transforms (eg, InterleavedAccess recognizes the shuffle form).
We don’t create arbitrary shuffle masks in IR because that’s bad for a lot of targets (but maybe this mask pattern should always be recognized as special?).

Eli_Friedman · January 12, 2017, 6:06pm

Hmm… not sure what the right answer is, but a couple more observations: 1. If we’re going to canonicalize, we should probably canonicalize the same way independent of the original argument type (so we would introduce bitcasts either way). 2. Those two functions are only equivalent on little-endian platforms. -Eli

rotateright · January 12, 2017, 6:19pm

It's time for another round of "What is the canonical IR?"

Credit for this episode to Zvi and PR31551.
31551 – Lower shuffles with equivalent truncates as good as

define <4 x i16> @shuffle(<16 x i16> %x) {
  %shuf = shufflevector <16 x i16> %x, <16 x i16> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
  ret <4 x i16> %shuf
}

define <4 x i16> @trunc(<16 x i16> %x) {
  %bc = bitcast <16 x i16> %x to <4 x i64>
  %tr = trunc <4 x i64> %bc to <4 x i16>
  ret <4 x i16> %tr
}

Potential reasons to prefer one or the other:
1. Shuffle is the most compact.
2. Trunc is easier to read.
3. One of these is easier for value tracking.
4. Compatibility with existing IR transforms (eg, InterleavedAccess
recognizes the shuffle form).
5. We don't create arbitrary shuffle masks in IR because that's bad for a
lot of targets (but maybe this mask pattern should always be recognized as
special?).

Hmm... not sure what the right answer is, but a couple more observations:
1. If we're going to canonicalize, we should probably canonicalize the
same way independent of the original argument type (so we would introduce
bitcasts either way).

Ah, right - kill #1 in my list.

2. Those two functions are only equivalent on little-endian platforms.

I was wondering about that. So yes, if we do want to canonicalize (until
the recent compile-time complaints, I always thought this was the objective
of InstCombine...maybe it still is), then the masks we're matching or
generating will differ based on endianness.

Rackover_Zvi · January 12, 2017, 8:14pm

Just to add, there is also the ‘zext’ – ‘shuffle with zero’ duality which can broaden the discussion.

–Zvi

rotateright · January 13, 2017, 4:19pm

Right - I think that case looks like this for little endian:

define <2 x i32> @zextshuffle(<2 x i16> %x) {
%zext_shuffle = shufflevector <2 x i16> %x, <2 x i16> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 1, i32 2>
%bc = bitcast <4 x i16> %zext_shuffle to <2 x i32>
ret <2 x i32> %bc
}

define <2 x i32> @zextvec(<2 x i16> %x) {
%zext = zext <2 x i16> %x to <2 x i32>
ret <2 x i32> %zext
}

IMO, the fact that we have to take endianness into account with the shuffles makes the trunc/zext forms the better choice. That way, we limit the endian dependency to one place in InstCombine, and other transforms don’t have to worry about it. We also have lots of existing folds for trunc/zext and hardly any for shuffles.

Rackover_Zvi · January 16, 2017, 5:16pm

Suppose we prefer the ‘trunc’ form, then what about cases such as:

define <2 x i16> @shuffle(<16 x i16> %x) {

%shuf = shufflevector <16 x i16> %x, <16 x i16> undef, <2 x i32> <i32 0, i32 8>

ret <2 x i16> %shuf

}

Will the ‘shufflevector’ be canonicalized to a ‘trunc’ of a vector of i128?

define <2 x i16> @trunc(<16 x i16> %x) {

%bc = bitcast <16 x i16> %x to <2 x i128>

%tr = trunc <2 x i128> %bc to <2 x i16>

ret <2 x i16> %tr

}

This may challenge the Legalizer downstream.

–Zvi

rotateright · January 17, 2017, 4:38pm

We use InstCombiner::ShouldChangeType() to prevent transforms to illegal integer types, but I’m not sure how that would apply to vector types.

Ie, let’s say v256 is a legal type in your example. DataLayout doesn’t appear to specify what configurations of a 256-bit vector are legal, so I don’t think we can currently use that to say v2i128 should be treated differently than v16i16.

Is this a valid argument to not canonicalize the IR?

Rackover_Zvi · January 19, 2017, 4:17pm

Hi Sanjay,

I agree we should also discuss if this canonicalization is beneficial.

For starters, do we have a concrete case where we would benefit from canonicalizing shuffles <-> truncates in LLVM IR?

IMO, we should not count benefits for codegen because that alone does not justify transforming the IR ; we could always do this on the SelectionDAG.

–Zvi

rotateright · January 21, 2017, 7:30pm

Hi Sanjay,

I agree we should also discuss **if** this canonicalization is beneficial.

For starters, do we have a concrete case where we would benefit from
canonicalizing shuffles <-> truncates in LLVM IR?

IMO, we should not count benefits for codegen because that alone does not
justify transforming the IR ; we could always do this on the SelectionDAG.

Agreed. If we're just talking about IR benefits, then it's easy to
demonstrate a win for trunc/zext based on value tracking:

target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128" ; little-endian

define <4 x i32> @shuffle(<4 x i64> %x) {
  %y = shl <4 x i64> %x, <i64 32, i64 32, i64 32, i64 32> ; low half of
each elt is zero
  %bc = bitcast <4 x i64> %y to <8 x i32> ; even index elements are all zero
  %trunc = shufflevector <8 x i32> %bc, <8 x i32> undef, <4 x i32> <i32 0,
i32 2, i32 4, i32 6>
  ret <4 x i32> %trunc
}

define <4 x i32> @trunc(<4 x i64> %x) {
  %y = shl <4 x i64> %x <i64 32, i64 32, i64 32, i64 32> ; low half of each
elt is zero
  %trunc = trunc <4 x i64> %y to <4 x i32> ; so this must be zero...
  ret <4 x i32> %trunc
}

$ ./opt -instsimplify 31551.ll -S
...
define <4 x i32> @shuffle(<4 x i64> %x) {
  %y = shl <4 x i64> %x, <i64 32, i64 32, i64 32, i64 32>
  %bc = bitcast <4 x i64> %y to <8 x i32>
  %trunc = shufflevector <8 x i32> %bc, <8 x i32> undef, <4 x i32> <i32 0,
i32 2, i32 4, i32 6>
  ret <4 x i32> %trunc
}

define <4 x i32> @trunc(<4 x i64> %x) {
ret <4 x i32> zeroinitializer
}

Of course, this is something I invented as an example, but AFAIK we have
better value tracking for trunc/zext than shuffle, so we'll have an easier
time folding the IR if that is possible.

Rackover_Zvi · January 26, 2017, 1:11pm

Thanks for providing the example, Sanjay.

Here’s a case that may be worth mentioning:

For the following equivalent (little-endian) functions:

define i32 @A(<8 x i32> %V) {

%S = shufflevector <8 x i32> %V, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

%E = extractelement <4 x i32> %S, i32 1

ret i32 %E

}

define i32 @B(<8 x i32> %V) {

%B = bitcast <8 x i32> %V to <4 x i64>

%T = trunc <4 x i64> %B to <4 x i32>

%E = extractelement <4 x i32> %T, i32 1

ret i32 %E

}

‘opt –instcombine’ gives:

define i32 @A(<8 x i32> %V) {

%E = extractelement <8 x i32> %V, i32 2

ret i32 %E

}

define i32 @B(<8 x i32> %V) {

%B = bitcast <8 x i32> %V to <4 x i64>

%1 = extractelement <4 x i64> %B, i32 1

%E = trunc i64 %1 to i32

ret i32 %E

}

Was the ‘truncate’ form in function @B combined to lesser preferable form? Not sure if this is true because running both combined function through ‘llc’ on X86-64 results with identical generated code.

–Zvi

rotateright · January 26, 2017, 5:22pm

Thanks for providing the example, Sanjay.

Here’s a case that may be worth mentioning:

For the following equivalent (little-endian) functions:

define i32 @A(<8 x i32> %V) {

   %S = shufflevector <8 x i32> %V, <8 x i32> undef, <4 x i32> <i32 0, i32
2, i32 4, i32 6>

   %E = extractelement <4 x i32> %S, i32 1

   ret i32 %E

}

define i32 @B(<8 x i32> %V) {

   %B = bitcast <8 x i32> %V to <4 x i64>

   %T = trunc <4 x i64> %B to <4 x i32>

   %E = extractelement <4 x i32> %T, i32 1

   ret i32 %E

}

‘opt –instcombine’ gives:

define i32 @A(<8 x i32> %V) {

  %E = extractelement <8 x i32> %V, i32 2

  ret i32 %E

}

define i32 @B(<8 x i32> %V) {

  %B = bitcast <8 x i32> %V to <4 x i64>

  %1 = extractelement <4 x i64> %B, i32 1

  %E = trunc i64 %1 to i32

  ret i32 %E

}

Was the ‘truncate’ form in function @B combined to lesser preferable form?

Yes, I don't think there's any question that @A is the canonical form of
those 2 options. Excellent counter-example.
Filed as:
https://llvm.org/bugs/show_bug.cgi?id=31770

Not sure if this is true because running both combined function through
‘llc’ on X86-64 results with identical generated code.

I think this is irrelevant. If there was a DAG canonicalization providing a
backstop, then we might just say, "Who cares?", but there isn't:

$ ./opt -instcombine shtr.ll -S | ./llc -o - -mtriple=powerpc64
    mr 3, 5
    blr
B: # @B
    rotldi 3, 5, 32
    rlwimi 3, 6, 0, 0, 31
    blr

rotateright · January 26, 2017, 5:28pm

Disregard that example - I forgot to put the layout with big 'E' in the IR
file. But that doesn't change that @A is canonical.

Topic		Replies	Views
IR canonicalization: vector select or shufflevector? LLVM Dev List Archives	6	79	September 17, 2016
Canonicalizing vector masking. LLVM Dev List Archives	4	68	October 10, 2014
IR: alternatives to `freeze` to selectively prevent compiler from combining shufflevectors Beginners	4	288	May 16, 2022
__builtin_shufflevector vs. llvm shufflevector Clang Frontend	5	148	February 9, 2012
Generalizing shuffle vector LLVM Dev List Archives	6	117	September 30, 2008

IR canonicalization: shufflevector or vector trunc?

Related topics