[RFC] Extending shufflevector for vscale vectors (SVE etc.)

Currently, for scalable vectors, only splat shuffles are allowed; we're considering allowing additional kinds of shuffles. The issue is, essentially, that a shuffle mask is a simple list of integers, and that isn't enough to express a scalable operation. For example, concatenating two fixed-length vectors currently looks like this:

shufflevector <2 x i32> %v1, <2 x i32> %v2, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

(Note that despite the syntax, the mask is really just a list of constant integers, not a general constant expression.)

There isn't any obvious way to extend this to scalable shuffles. The mask would need to have "vscale * N" elements, and it's impossible to write out all the elements of a list of unknown size; it would need to be generated with some sort of formula.

The motivation here is to express various common patterns that are useful for vectorization and lowering SVE intrinsics, in a target-independent manner:

1. Unzipping/unpacking interleaved data
2. Zipping/packing interleaved data
3. Reversing a vector
4. Concatenating two vectors, or splitting a vector into halves

This isn't a comprehensive list of all shuffles we might want, but it's a reasonable starting point. The way the proposal is written, it should be easy to extend if we add more shuffles.

Proposed IR syntax:

%result = shufflevector <vscale x 4 x i32> %v1, <vscale x 4 x i32> %v2, SHUFFLE_NAME

SHUFFLE_NAME can be one of the following (with examples of the equivalent <4 x i32> shuffles):
splat - Splat element 0 of the first operand (<0, 0, 0, 0>)
reverse - Reverse the elements of the first operand (<3, 2, 1, 0>)
concat - Concatenate the two operands (<0, 1, 2, 3, 4, 5, 6, 7>)
split_low - Return the low half of the first operand (<0, 1>)
split_high - Return the high half of the first operand (<2, 3>)
zip_low - Zip together the low halves of the two operands (<0, 4, 1, 5>)
zip_high - Zip together the high halves of the two operands (<2, 6, 3, 7>)
unzip_even - Unzip the even elements of the two operands (<0, 2, 4, 6>)
unzip_odd - Unzip the odd elements of the two operands (<1, 3, 5, 7>)
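
To make the element mappings concrete for arbitrary lengths, here is a rough sketch of how a few of these masks generalize; this is purely illustrative and the helper names are made up. For the scalable forms, the same formulas apply with the element count being vscale * N, which is exactly what can't be written out as a literal mask:

#include "llvm/ADT/SmallVector.h"
using namespace llvm;

// Fixed-length equivalents for a source vector width of NumElts elements.
// zip_low: interleave the low halves of the two operands.
static void getZipLowMask(unsigned NumElts, SmallVectorImpl<int> &Mask) {
  for (unsigned I = 0; I != NumElts / 2; ++I) {
    Mask.push_back(I);           // element I of the first operand
    Mask.push_back(NumElts + I); // element I of the second operand
  }
}
// unzip_even: the even elements of the concatenation of the two operands.
static void getUnzipEvenMask(unsigned NumElts, SmallVectorImpl<int> &Mask) {
  for (unsigned I = 0; I != NumElts; ++I)
    Mask.push_back(2 * I);
}
// reverse: reverse the elements of the first operand.
static void getReverseMask(unsigned NumElts, SmallVectorImpl<int> &Mask) {
  for (unsigned I = 0; I != NumElts; ++I)
    Mask.push_back(NumElts - 1 - I);
}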

On SVE targets, all of these shuffles can be lowered to a single instruction for vector types which fit in a single register. I expect that other scalable vector instruction sets will also support these operations, since they're important for many use cases. (If we end up in some weird situation where it's necessary, we could expand a shuffle into an extractelement/insertelement loop, similar to what we do for unsupported fixed-length shuffles. But the vectorizer would probably avoid generating unsupported shuffles, so it's unlikely to come up in practice.)

In C++, I expect to represent this list as an enum, and then wrap up the entire description of a fixed or scalable shuffle into a class "ShuffleMask". This would allow checking whether an operation matches one of the above patterns, and can be converted to the existing ArrayRef<int> for fixed shuffles. ShuffleVectorInst::getShuffleMask would then return a ShuffleMask, I think. Then we could add an alternate API getFixedShuffleMask() that only works for fixed shuffles, and just returns the fixed mask as an ArrayRef<int>.
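
For what it's worth, here is one possible shape for that wrapper. This is only a sketch of the idea; the names and exact interface are made up and not something I've implemented:

#include "llvm/ADT/ArrayRef.h"
#include <cassert>
using namespace llvm;

// Sketch: a value that is either an explicit fixed-length mask or one of
// the named scalable shuffles from the list above.
class ShuffleMask {
public:
  enum class Kind { Fixed, Splat, Reverse, Concat, SplitLow, SplitHigh,
                    ZipLow, ZipHigh, UnzipEven, UnzipOdd };

  static ShuffleMask getFixed(ArrayRef<int> Elts) {
    return ShuffleMask(Kind::Fixed, Elts);
  }
  static ShuffleMask getScalable(Kind K) {
    assert(K != Kind::Fixed && "scalable shuffles are always named");
    return ShuffleMask(K, ArrayRef<int>());
  }

  bool isFixed() const { return K == Kind::Fixed; }
  Kind getKind() const { return K; }

  // Only meaningful for fixed shuffles; there is no explicit element list
  // for the scalable kinds.
  ArrayRef<int> getFixedMask() const {
    assert(isFixed() && "no explicit mask for a named scalable shuffle");
    return Elts;
  }

private:
  ShuffleMask(Kind K, ArrayRef<int> Elts) : K(K), Elts(Elts) {}
  Kind K;
  ArrayRef<int> Elts; // not owned; empty unless K == Kind::Fixed
};

Code that only handles fixed shuffles would call getFixedMask() (or the getFixedShuffleMask() wrapper on ShuffleVectorInst) and simply bail out on the named scalable kinds.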

I'm working on refactoring the existing shufflevector handling in LLVM to allow making these changes. See https://reviews.llvm.org/D72467 . I haven't tried implementing this proposal yet, though.

Alternatives:

Instead of extending shufflevector, we could introduce a dedicated intrinsic for each common shuffle. This is less readable, and makes it harder to leverage existing code that reasons about shuffles. But it would mean fewer changes to existing code.

We could make shufflevector take a variable shuffle mask operand (any Value*). This has been proposed before. This introduces a lot of complexity: the general case is hard to lower on most targets, and it becomes harder to pattern-match the special cases we actually care about. (It's possible we want to expose this as a target-independent intrinsic at some point, but that wouldn't really serve as a substitute for what's described here.)

We could come up with some way to represent a formula for generating a shuffle mask, instead of only allowing specific, known shuffles. I'm not sure how to do that in a way that's both straightforward to reason about and covers all the cases we care about. Along similar lines, we could specify a shuffle mask as a tree of operations: for example, "zip(first_vector, reverse(second_vector))".

We can add more shuffles to the list. There are a few SVE shuffle instructions which are not equivalent to any of the basic operations I've listed: ext and trn. And it's possible to construct other shuffles out of sequences of multiple instructions. (Some of the additional shuffles we could specify would require an integer parameter, or an integer parameter list; it should be possible to support that without any major changes to the proposal.)

-Eli

Hi Eli,

> Proposed IR syntax:
>
> %result = shufflevector <vscale x 4 x i32> %v1, <vscale x 4 x i32> %v2, SHUFFLE_NAME
>
> SHUFFLE_NAME can be one of the following (with examples of the equivalent <4 x i32> shuffles):
> splat - Splat element 0 of the first operand (<0, 0, 0, 0>)
> reverse - Reverse the elements of the first operand (<3, 2, 1, 0>)
> concat - Concatenate the two operands (<0, 1, 2, 3, 4, 5, 6, 7>)
> split_low - Return the low half of the first operand (<0, 1>)
> split_high - Return the high half of the first operand (<2, 3>)
> zip_low - Zip together the low halves of the two operands (<0, 4, 1, 5>)
> zip_high - Zip together the high halves of the two operands (<2, 6, 3, 7>)
> unzip_even - Unzip the even elements of the two operands (<0, 2, 4, 6>)
> unzip_odd - Unzip the odd elements of the two operands (<1, 3, 5, 7>)

This fixed list of shuffles makes me uncomfortable, and I wonder if
there isn't a much simpler solution to the problem. Specifically,
allow the IR form:

%result = shufflevector <vscale x n x TY> %v1, <vscale x n x TY> %v2, <m x i32> <mask>

yielding a result of type <vscale x m x TY>. (The <mask> could still
just be a constant list of integers, i.e. this doesn't require a
relaxation to arbitrary Value* masks.)

I believe that all the shuffle names you listed above could fit into that scheme, simply by plugging in the mask that you've already written in that list. In other words, the mask isn't scalable; it is still represented as just a straightforward list of constants, and it is applied equally to each "vscale part" of the operand vectors. This is conceptually analogous to how `select` on vectors can have either an <n x i1> selection operand to operate as a full vector select, or an i1 selection operand that is applied equally to each part of the operand vectors.
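
Spelled out in scalar code, the per-part application I have in mind is roughly the following; this is purely illustrative (plain C++, made-up name), just to pin down what "applied equally to each vscale part" would mean. Indices in [0, n) pick from the corresponding part of %v1 and indices in [n, 2n) from the corresponding part of %v2:

#include <cstdint>
#include <vector>

// Apply an m-entry mask to each n-element "part" of the two operands.
// Both operands have vscale * n elements; the result has vscale * m.
std::vector<int32_t> perPartShuffle(const std::vector<int32_t> &V1,
                                    const std::vector<int32_t> &V2,
                                    const std::vector<int> &Mask,
                                    unsigned N) {
  unsigned VScale = V1.size() / N;
  std::vector<int32_t> Result;
  for (unsigned Part = 0; Part != VScale; ++Part)
    for (int Idx : Mask)
      Result.push_back(Idx < (int)N ? V1[Part * N + Idx]
                                    : V2[Part * N + (Idx - N)]);
  return Result;
}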

Maybe I'm misunderstanding the intended semantics of your list of
shuffles, in which case, please enlighten me :)

Cheers,
Nicolai

Back when Arm was proposing the SVE extensions, another proposal was
to use SCEV like patterns.

I think they're all valid proposals, but implementation wise, they can
get really complicated.

Masks and expressions will have to be pattern-matched against
target-specific instructions when lowering, and if we're not going to
be generating the most unusual patterns to begin with
(front-end/middle-end creating them), then I think a fixed list to
begin with is quite sensible.

For example, I wouldn't want to let the vectoriser generate any random
pattern in the shuffle if I know that there is no valid instruction in
the back-end that can cope with that, and I'll end up with
under-performing code.

A fixed list will also allow us to build the infrastructure first,
with one less problem to handle. After we're sure it works well, we
can extend to different patterns.

For example, we can add to SHUFFLE_NAME things like "mask" and "scev" and "expr", and then add whatever to the end of it. In those edge cases, it will be up to the back-end to legalise that in a meaningful, and hopefully performant, way.

cheers,
--renato

Eli Friedman via llvm-dev <llvm-dev@lists.llvm.org> writes:

[...]
> We can add more shuffles to the list. There are a few SVE shuffle instructions which are not equivalent to any of the basic operations I've listed: ext and trn.

I realise this wasn't supposed to be an exhaustive list, but FWIW:
revb, revh and revw are also useful for reversing each sequence
of N elements (as opposed to reversing the whole vector).

Thanks,
Richard

revb is technically already exposed as a target-independent intrinsic (llvm.bswap), but yes, I missed those.

-Eli

Hi Renato,

> This fixed list of shuffles makes me uncomfortable, and I wonder if
> there isn't a much simpler solution to the problem. Specifically,
> allow the IR form:
>
> %result = shufflevector <vscale x n x TY> %v1, <vscale x n x TY> %v2,
> <m x i32> <mask>
>
> yielding a result of type <vscale x m x TY>. (The <mask> could still
> just be a constant list of integers, i.e. this doesn't require a
> relaxation to arbitrary Value* masks.)

> Back when Arm was proposing the SVE extensions, another proposal was to use SCEV like patterns.
>
> I think they're all valid proposals, but implementation wise, they can get really complicated.
>
> Masks and expressions will have to be pattern-matched against target-specific instructions when lowering, and if we're not going to be generating the most unusual patterns to begin with (front-end/middle-end creating them), then I think a fixed list to begin with is quite sensible.
>
> For example, I wouldn't want to let the vectoriser generate any random pattern in the shuffle if I know that there is no valid instruction in the back-end that can cope with that, and I'll end up with under-performing code.

How is any of this different from non-vscale shufflevector?

It feels to me that if you're not willing to do a natural extension of
what's in shufflevector already, then going with an intrinsic for the
time being is the wiser choice.

Cheers,
Nicolai

> For example, I wouldn't want to let the vectoriser generate any random
> pattern in the shuffle if I know that there is no valid instruction in
> the back-end that can cope with that, and I'll end up with
> under-performing code.

> How is any of this different from non-vscale shufflevector?

This specific point is not. It's a consequence of getting both
scalable and fixed shuffles wrong.

My argument is that getting scalable shuffles wrong is harder to recover from than getting fixed-size ones wrong.

Fixed vectors will have a number of insert/extract element operations that is known at compile time, while scalable vectors will have to add a runtime stub or equivalent.

> It feels to me that if you're not willing to do a natural extension of what's in shufflevector already, then going with an intrinsic for the time being is the wiser choice.

Using a simple mask is not trivial in scalable vectors because you don't know the number of elements. What to do if the mask is smaller or larger than the actual register, and not a multiple, etc.?

Expressions are easier to check at compile time, because if they are valid for all n in (0..N), then they are valid for a subset, whatever the chunk size, as long as it's a multiple. But what is a valid expression?

For example, can we add calls to that expression if the function can be known at compile time? Do we really need to? If not *any* expression, how do we restrict the set of valid operations, and where does that code go?

None of those questions are too hard to answer, but I fear we can
spend more time discussing the semantics of the expression and what's
allowed in there than the actual implementation.

If we really *have* to, then we have to. But if the set of shuffles
proposed are sufficient for all scalable extensions in existence for
the foreseeable future, then it should be fine like that.

Having said that, I don't see anything wrong with implementing this
with intrinsics for now, if people feel there are some cases that we
cannot cover using a small list of cases.

Hi Eli,

Did you consider a design point between these two extremes? You could introduce one new instruction, something like “fixed shuffle vector” that takes two vectors and an enum. That would keep the structurally (and representationally) different cases as separate instructions, without creating a new instruction per fixed shuffle kind.

Relatedly, how do you foresee canonicalization (in instcombine and inst selection) working for these? If not for compatibility, it would make sense to canonicalize from shuffle vector to the ‘fixed’ formats, but doing that would probably introduce a bunch of regressions for various targets.

-Chris

This is a fair point. It seems to me that the motivation is to make the shuffling of vectors generic enough to support both fixed and scalable vectors, which is a sensible and elegant approach. However, it may wreak havoc throughout parts of the source base and/or regress several targets until the offending code is identified and properly massaged, which would take considerable work. Or not. Though the risk is likely to materialize, it would be interesting to get at least a rough assessment of the damage, which could be achieved with a rough prototype of this approach.

Using intrinsics has its merits too, especially in the simpler form suggested by Chris. Though it would be less generic and would duplicate the existing shuffles for fixed-length vectors, it would probably be far less risky, particularly for being less invasive. If anything, it would be easier to prototype with this approach.

If my assessment is near the mark, then, strategically, the latter approach seems to allow making some progress sooner. Once everyone gets a better understanding of shuffling scalable vectors, then both approaches could be reconsidered again, but informed by this preliminary implementation.

Back to y'all.

> Hi Eli,
>
> Did you consider a design point between these two extremes? You could introduce one new instruction, something like “fixed shuffle vector” that takes two vectors and an enum. That would keep the structurally (and representationally) different cases as separate instructions, without creating a new instruction per fixed shuffle kind.

Well, there are sort of two forms of this. I could add a new instruction, or I could add a new intrinsic (maybe using a metadata string to specify the shuffle). An instruction is a ton of boilerplate. And an intrinsic means we don't get shufflevector constant expressions, which are useful for optimization.

Either way, it's a bunch of extra work if we intend to eventually unify the two. I don't see any scenario under which we don't want to eventually unify them. The operations I'm adding are semantically the same as the equivalent fixed-width shuffles; we just can't represent the shuffle masks the same way. And I think if we do end up changing the representation of scalable shufflevectors later, we'll be able to autoupgrade the existing ones.

I think I can keep the initial patch relatively small if we wrap the abstract notion of a "ShuffleMask", which is either a fixed shuffle or a named scalable shuffle, in a C++ class. And then we can let optimizations that expect fixed shuffles just convert that to the expected ArrayRef<int>.

> Relatedly, how do you foresee canonicalization (in instcombine and inst selection) working for these? If not for compatibility, it would make sense to canonicalize from shuffle vector to the ‘fixed’ formats, but doing that would probably introduce a bunch of regressions for various targets.

I'm thinking that we don't use the new named shuffles for fixed-width shuffles at the IR level. Instead, we add helpers to ShuffleVectorInst that match either a scalable shuffle, or an equivalent fixed shuffle. So code that wants to handle both can pretend they're canonicalized, and code that handles fixed shuffles won't be disrupted.

In SelectionDAG, shuffles aren't really unified the same way they are in IR. I think we map onto existing operations where they make sense (CONCAT_VECTORS and EXTRACT_SUBVECTOR). For scalable zip/unzip, I haven't really thought deeply about it; it probably makes sense to change SHUFFLE_VECTOR's shuffle mask to use ShuffleMask like IR, but that probably doesn't have a big impact either way.

-Eli

>> Hi Eli,
>>
>> Did you consider a design point between these two extremes? You could introduce one new instruction, something like “fixed shuffle vector” that takes two vectors and an enum. That would keep the structurally (and representationally) different cases as separate instructions, without creating a new instruction per fixed shuffle kind.
>
> Well, there are sort of two forms of this. I could add a new instruction, or I could add a new intrinsic (maybe using a metadata string to specify the shuffle). An instruction is a ton of boilerplate. And an intrinsic means we don't get shufflevector constant expressions, which are useful for optimization.

Oh yes, I didn’t necessarily mean instruction, an intrinsic would make sense as well. What value do you see shufflevector constant expressions providing?

In my opinion, LLVM constant expressions seem like a bad idea in general, and I’d rather see them go away over time - as one example, “constants that can trap” often are mishandled, and pattern matching is slower and more complex due to them. In my mind, a better representation would be something that directly models what we need: global variable initializers can have relocations applied to them, so they have a very specific grammar of constant expression that represents this (symbol + constant, and symbol-symbol+constant on macho targets). The rest of the constant expr could be dropped, simplifying the system.

> Either way, it's a bunch of extra work if we intend to eventually unify the two. I don't see any scenario under which we don't want to eventually unify them. The operations I'm adding are semantically the same as the equivalent fixed-width shuffles; we just can't represent the shuffle masks the same way. And I think if we do end up changing the representation of scalable shufflevectors later, we'll be able to autoupgrade the existing ones.

It isn’t obvious to me that these need to be unified: we don’t have to have a single operation named “shuffle vector” that does all of the possible element permutations. For example, vperm on PPC Altivec supports data dependent shuffle masks, and merging that into shuffle vector seems like a bad idea.

An alternate design could look like four things (again, ignoring intrinsic vs instruction):

“Data dependent shuffle” would allow runtime shuffle masks.
  -> “fixed mask shuffle” would require a statically determined shuffle mask.
     -> “well known target-independent shuffle” would support specific shuffles like zip, even/odd splits, splat, etc.
         -> “target specific shuffles” would be any number of special things that are shuffle like.

By defining a tower of these, instcombine would canonicalize to the most specific thing possible, and then targets and other transformations (e.g. at isel time) would handle the one representation that maps onto the thing their instruction can do.

The advantage of this approach is that the helper functions (e.g. the methods on the instruction) can be specific to the use case, e.g. “fixed mask shuffle” returns an ArrayRef<int> or whatever, and the “well known target independent shuffle” operation could have a way to turn the enum into an ArrayRef<int> for the case when you want to expand it out.

Again, to be clear, this is true regardless of whether they are intrinsics or instructions. If they are instructions, these would be methods, intrinsics would use global functions that assert on the opcode.

From a historical perspective, the ShuffleVector design was intended to allow significant mid-level optimizations that merged and transformed shuffle masks. In practice though, instcombine (for example) has had to stay very "hands off" with shuffles in general because it is too easy to produce something that cannot be efficiently code generated. If you canonicalize towards “well known” shuffles, we could improve this, because we can expect targets to efficiently handle the well known shuffles.

>> Relatedly, how do you foresee canonicalization (in instcombine and inst selection) working for these? If not for compatibility, it would make sense to canonicalize from shuffle vector to the ‘fixed’ formats, but doing that would probably introduce a bunch of regressions for various targets.
>
> I'm thinking that we don't use the new named shuffles for fixed-width shuffles at the IR level.

Is this out of conservatism because of the amount of existing code that works on ShuffleVector? While variable length vectors will always be special in some ways, it seems useful to make their representation as consistent with static length vectors as possible.

Representing even fixed-length shuffles with an explicit representation seems like it would make pattern matching the specific cases easier, make the IR easier to read, and provide a single canonical representation for each mask. The generic pattern matching stuff (e.g. PatternMatch.h, DAG ISel) could paper over the representational differences as well.

-Chris

> Oh yes, I didn’t necessarily mean instruction, an intrinsic would make sense as well. What value do you see shufflevector constant expressions providing?
>
> In my opinion, LLVM constant expressions seem like a bad idea in general, and I’d rather see them go away over time - as one example, “constants that can trap” often are mishandled, and pattern matching is slower and more complex due to them. In my mind, a better representation would be something that directly models what we need: global variable initializers can have relocations applied to them, so they have a very specific grammar of constant expression that represents this (symbol + constant, and symbol-symbol+constant on macho targets). The rest of the constant expr could be dropped, simplifying the system.

I think there are some practical advantages to having constant expressions; in particular, we get automatic "CSE", which I think ends up being a practical advantage in some cases. And there are certain issues involving SelectionDAG (in particular, cloning a constant into every basic block makes pattern-matching more effective, and reduces register pressure in some cases). And some of our cost modeling on IR sort of depends on the fact that constants are not instructions. I mean, we don't need constants; at an extreme we could have a constantint instruction, like GlobalISel's G_CONSTANT. But it's not clear that would be helpful for midlevel optimizations.

For scalable vectors in particular, currently, the only way to express a vector where every element is a simple integer is using a splat shuffle expression, and those often fold into some instruction. Continuing to use constant expressions for that will make that work more smoothly, I think. That could be solved in other ways, though (more fun code for CodeGenPrepare!), so there isn't a hard requirement.

Some of this is what eventually led to the way we lower llvm.vscale: we have the intrinsic, but we have a hack in codegenprepare to convert it to a GEP constant expression.

Granted, there are disadvantages to constant expressions in function bodies: we end up with more than one way to represent the same operation, it's hard to remember to take trapping constants into account, and cost modeling can get confused if it doesn't correctly account for an "expensive" constant. And the use of constant expressions in global variables is a complete mess, like you mentioned.

>> Either way, it's a bunch of extra work if we intend to eventually unify the two. I don't see any scenario under which we don't want to eventually unify them. The operations I'm adding are semantically the same as the equivalent fixed-width shuffles; we just can't represent the shuffle masks the same way. And I think if we do end up changing the representation of scalable shufflevectors later, we'll be able to autoupgrade the existing ones.
>
> It isn’t obvious to me that these need to be unified: we don’t have to have a single operation named “shuffle vector” that does all of the possible element permutations. For example, vperm on PPC Altivec supports data dependent shuffle masks, and merging that into shuffle vector seems like a bad idea.
>
> An alternate design could look like four things (again, ignoring intrinsic vs instruction):
>
> “Data dependent shuffle” would allow runtime shuffle masks.
>   -> “fixed mask shuffle” would require a statically determined shuffle mask.
>      -> “well known target-independent shuffle” would support specific shuffles like zip, even/odd splits, splat, etc.
>          -> “target specific shuffles” would be any number of special things that are shuffle like.

I'm not sure trying to form a strict hierarchy really works out. The key aspect that breaks this is the use of "undef" in shuffle masks. So there are actually multiple fixed shuffles of the same width which might count as a "zip_lower", depending on the context.

I think the distinguishing factor here that means we want to integrate these shuffles into the shufflevector instruction, vs. other sorts of shuffle-ish operations, is that the shuffle pattern is known at compile-time. Given that, optimizations can reason about the value of various lanes, not just a black box. Knowing a "vperm" is in fact a shuffle isn't very helpful. All you can say is that every output lane is equal to some input lane. I guess you could figure out *something* based on that, but it doesn't really make sense to represent vperm, select, and shufflevector as the same instruction; the operands are different.

We can force any sort of representation to "work" with enough refactoring and helper methods, but I think my proposal ends up being the most straightforward.

> From a historical perspective, the ShuffleVector design was intended to allow significant mid-level optimizations that merged and transformed shuffle masks. In practice though, instcombine (for example) has had to stay very "hands off" with shuffles in general because it is too easy to produce something that cannot be efficiently code generated. If you canonicalize towards “well known” shuffles, we could improve this, because we can expect targets to efficiently handle the well known shuffles.

We actually do a surprising amount here these days... and we're probably going to expand it more using target-specific information (see discussion on https://reviews.llvm.org/D73480 ).

>> I'm thinking that we don't use the new named shuffles for fixed-width shuffles at the IR level.
>
> Is this out of conservatism because of the amount of existing code that works on ShuffleVector? While variable length vectors will always be special in some ways, it seems useful to make their representation as consistent with static length vectors as possible.
>
> Representing even fixed-length shuffles with an explicit representation seems like it would make pattern matching the specific cases easier, make the IR easier to read, and provide a single canonical representation for each mask. The generic pattern matching stuff (e.g. PatternMatch.h, DAG ISel) could paper over the representational differences as well.

The other aspect here is the use of "undef" in fixed shuffles. Frontends usually don't do this, but some of our optimizations for shuffles are built around replacing unused results of shuffles with "undef". We don't want to forbid transforming "<0, 4, 1, 5>" -> "<0, undef, 1, 5>" because we decided the former was the named shuffle "zip", and we want to allow transforms/analysis for zip-like shuffles on "<0, undef, 1, 5>". So I think it ends up being more straightforward to have a helper for "is this a zip", as opposed to a helper for transforming a "zip" to a fixed shuffle mask.
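
As a rough illustration of the kind of helper I mean (made-up name, not an actual patch), something along these lines, where undef lanes are encoded as -1 the way ShuffleVectorInst::getShuffleMask already does:

#include "llvm/ADT/ArrayRef.h"
using namespace llvm;

// Does this fixed mask behave like zip_low of two NumElts-element vectors,
// treating undef (-1) lanes as "don't care"?
static bool isZipLowMask(ArrayRef<int> Mask, unsigned NumElts) {
  if (Mask.size() != NumElts)
    return false;
  for (unsigned I = 0; I != NumElts; ++I) {
    if (Mask[I] < 0)
      continue; // undef lane: compatible with any named shuffle
    int Expected = (I % 2 == 0) ? int(I / 2)            // from the first operand
                                : int(NumElts + I / 2); // from the second operand
    if (Mask[I] != Expected)
      return false;
  }
  return true;
}

The same sort of predicate on ShuffleVectorInst could also return true for the named scalable zip_low, so callers wouldn't need to care which representation they're looking at.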

Independent of anything we do with scalable vectors, it might make sense in the IR printer to add some sort of note for known shuffle patterns, so developers don't have to figure out the meaning of the numerical sequences in IR dumps.

-Eli

> I think there are some practical advantages to having constant expressions; in particular, we get automatic “CSE”, which I think ends up being a practical advantage in some cases. And there are certain issues involving SelectionDAG (in particular, cloning a constant into every basic block makes pattern-matching more effective, and reduces register pressure in some cases). And some of our cost modeling on IR sort of depends on the fact that constants are not instructions. I mean, we don’t need constants; at an extreme we could have a constantint instruction, like GlobalISel’s G_CONSTANT. But it’s not clear that would be helpful for midlevel optimizations.

Sure, but that cuts both ways. The fact that they end up as unfolded constantexprs implies that they need to be emitted, and the compiler doesn’t generally have a cost model to reflect that.

CSE doesn’t happen in general because (as you mention below) undef’s exist in shuffle masks, and they are not auto-CSE’d.

In the comments above, I was referring to LLVM IR - I agree that SelectionDAG has its own issues, and that the LLVM IR design doesn’t necessarily translate over.

> For scalable vectors in particular, currently, the only way to express a vector where every element is a simple integer is using a splat shuffle expression, and those often fold into some instruction. Continuing to use constant expressions for that will make that work more smoothly, I think. That could be solved in other ways, though (more fun code for CodeGenPrepare!), so there isn’t a hard requirement.

Oh, I see what you’re saying. The issue here is that the vector length is modeled as a Constant. Have I ever mentioned that that was always weird and non-obvious? :)

Joking aside, there is a serious problem for getting the cost model correct. You are (reasonably) wanting to have “something cheap” be modeled as a “Constant*”; however, a shuffle as a ConstantExpr also admits the cases where it needs to be dynamically computed by an instruction.

This conflation is exactly the problem I have with the existing constant expressions design. You cannot reasonably write a cost model for this in target independent code, and in some cases it helps but others it hurts. It is hugely problematic to me that we represent things as a Constant* that actually emit code (this is why ‘constants that can trap’ exist for example).

In my opinion, you’d be much better off by making these be a separate intrinsic that has no constantexpr, and use the standard CGP (or GlobalIsel) tricks to duplicate it where necessary on your target. The reason for this is that these will need to be executed on some targets, but can be folded on others. Having them explicit still allows CSE and other transformations to work on them, which is important for canonicalization and also for targets where they are not folded.

> Some of this is what eventually led to the way we lower llvm.vscale: we have the intrinsic, but we have a hack in codegenprepare to convert it to a GEP constant expression.

Since you have the GEP trick that works with variable length vectors, why do you need llvm.vscale anymore at all?

> I’m not sure trying to form a strict hierarchy really works out. The key aspect that breaks this is the use of “undef” in shuffle masks. So there are actually multiple fixed shuffles of the same width which might count as a “zip_lower”, depending on the context.

How does it break it? It seems reasonable to have an undef mask even for the ‘well known’ shuffles.

> I think the distinguishing factor here that means we want to integrate these shuffles into the shufflevector instruction, vs. other sorts of shuffle-ish operations, is that the shuffle pattern is known at compile-time.

I can understand where you’re coming from, but I’m making an argument independent of variable length vectors. I’m arguing that the design above is better even for static vectors because the original ShuffleVector bet didn’t work out for them either.

> Given that, optimizations can reason about the value of various lanes, not just a black box.

I don’t see how the above argues for black boxes. Target specific intrinsics are often black boxes for sure, but I don’t see how "well known target-independent shuffles” would be black boxes that prevent understanding lane mappings.

> We can force any sort of representation to “work” with enough refactoring and helper methods,

Yes, agreed on that!

> The other aspect here is the use of “undef” in fixed shuffles. Frontends usually don’t do this, but some of our optimizations for shuffles are built around replacing unused results of shuffles with “undef”. We don’t want to forbid transforming “<0, 4, 1, 5>” → “<0, undef, 1, 5>” because we decided the former was the named shuffle “zip”, and we want to allow transforms/analysis for zip-like shuffles on “<0, undef, 1, 5>”. So I think it ends up being more straightforward to have a helper for “is this a zip”, as opposed to a helper for transforming a “zip” to a fixed shuffle mask.

If you take a step back and look at the structure of the problem we’re working with, the existing ShuffleVector design is very problematic IMO. Among other things, knowing that shuffle output elements are undef but not knowing that about calls and loads of vectors is pretty weird. Why don’t we have something like the !range metadata that applies to load/call/shuffle to indicate that the result vector elements have some unknown elements? This would be far more general and correct.

In a world where we have two different “shuffle with a constant mask” and “shuffle with well known algorithms” operations, we would not implement undef output elements with “-1 in the shuffle mask”. Instead, we would have a design like:

  1. Shuffle masks would always have elements that range from 0 to #elements-1, and the verifier would enforce this. The accessors would return ArrayRef<unsigned> instead of ArrayRef<int>.

  2. We would have a unified way (like the !range attribute) to model this, and it would apply to both shuffles (plus loads and calls).

  3. Because this applies to calls, instcombine could infer this property for target specific intrinsics, the result could be propagated across function results using IPA etc.

> Independent of anything we do with scalable vectors, it might make sense in the IR printer to add some sort of note for known shuffle patterns, so developers don’t have to figure out the meaning of the numerical sequences in IR dumps.

Sure. +1

-Chris

>> Some of this is what eventually led to the way we lower llvm.vscale: we have the intrinsic, but we have a hack in codegenprepare to convert it to a GEP constant expression.
>
> Since you have the GEP trick that works with variable length vectors, why do you need llvm.vscale anymore at all?

As far as I understood it, it would be dependent on the target's DataLayout whether the GEP trick would be a valid transformation because of vector alignment (https://reviews.llvm.org/D71636#1792473).

> If you take a step back and look at the structure of the problem we’re working with, the existing ShuffleVector design is very problematic IMO. Among other things, knowing that shuffle output elements are undef but not knowing that about calls and loads of vectors is pretty weird.

The masked load intrinsics have a predicate and a passthru operand. Is that not what you mean?

> Why don’t we have something like the !range metadata that applies to load/call/shuffle to indicate that the result vector elements have some unknown elements? This would be far more general and correct.

Instead of adding a new attribute, can we use a `select` (or perhaps a new intrinsic) to express which lanes are undef? That way the same mechanism would also work for scalable vectors without having to figure out some convoluted way to describe that in metadata.

At the moment the `select` would be thrown away by InstCombine which folds `select %p, %v, undef -> %v`. I'm not sure to what extent this is desirable; it is useful to be folded away for certain purposes but not when you want to explicitly specify that something can be undef. Perhaps a new intrinsic would be more appropriate?

>>> The other aspect here is the use of "undef" in fixed shuffles. Frontends usually don't do this, but some of our optimizations for shuffles are built around replacing unused results of shuffles with "undef". We don't want to forbid transforming "<0, 4, 1, 5>" -> "<0, undef, 1, 5>" because we decided the former was the named shuffle "zip", and we want to allow transforms/analysis for zip-like shuffles on "<0, undef, 1, 5>". So I think it ends up being more straightforward to have a helper for "is this a zip", as opposed to a helper for transforming a "zip" to a fixed shuffle mask.

For some shufflemask like `<0, undef, undef, undef>`, I guess this means that 'isSplat' and 'isZipLo' would both return true?

Just recapping the discussion here for my understanding (please correct me where I'm wrong):

- The original proposal is to extend `shufflevector` to take either a 'shuffle pattern' or a 'shuffle mask'.
  - For fixed-width shuffles it will retain the mask, in case any of the elements is 'undef' (so not to lose that information)
  - ShuffleVectorInst is extended with methods to query if something e.g. is a zip/splat/etc.

- An alternative would be to not unify these into a single `shufflevector`, but rather have multiple operations.
  - InstCombine can segment shuffles into two exclusive sets of operations: known shuffle patterns vs unknown pattern with explicit mask
  - To avoid multiple representations for the same shuffle, we want to map exclusively to either 'explicit mask' or 'known mask' operations.
    But in order not to lose information about lanes being undef, we'd then need to model the 'undef' lanes separately from the instruction.

For either route, does this mean we'll also want to consider supporting data-dependent shuffle masks? (I don't see why we wouldn't support those if we go through the lengths of adding support for more types of shuffles).

Eli's proposal seems quite neat and sounds like it would take less effort, as this is mostly about extending shufflevector rather than changing it. Going down the path of revisiting shufflevector entirely by removing 'undef' from the shuffle mask seems like a much bigger change. That's not a reason not to do it, but if we end up going down that path, I think we should consider separating concerns and prioritising what's needed for scalable auto-vec first (either using this proposal or starting off with shuffle intrinsics that work exclusively for scalable vectors, so that we don't need to worry about canonicalisation). As long as we eventually end up with a shuffle representation that puts fixed-width vectors and scalable vectors on somewhat equal footing (or at least as much as possible).

Thanks,

Sander

It sounds like extending shufflevector to support an arbitrary scalable mask is off the table (with the future of arbitrary fixed length masks also in doubt). We (Arm) have this support in our downstream compiler but its effect on the code base is significant because we had to fix up all code paths that expect to be able to iterate across shuffle masks at compile time (not to mention the code generation support required).

Even with this support it was underused, because we found no places where non-named shuffles are introduced, and those created through various folds end up being decomposed into their original parts to get good code generation. With no "generic shufflevector instruction" pot of gold at the end of the rainbow and no reason to expect the gold to be worth anything, adding such support was a waste in hindsight.

However, shufflevector does support generic fixed length masks, with much code depending on it, so unless we're going to remove all getShuffleMask like interfaces I don't see how allowing scalable shuffle masks (named or otherwise) is going to be clean. For this reason, my vote would be for scalable shuffle masks to be illegal, essentially the situation we are in today[1], and to introduce a separate method for named shuffles, be it intrinsic or instruction based. This would allow a simpler path to scalable auto-vectorisation whilst providing a route to migrate away from today's shufflevector.

Paul!!!

[1] We do allow zeroinitializer to represent splats, but I'm not convinced this is safe and think it should be removed.

>>> Some of this is what eventually led to the way we lower llvm.vscale: we have the intrinsic, but we have a hack in codegenprepare to convert it to a GEP constant expression.
>>
>> Since you have the GEP trick that works with variable length vectors, why do you need llvm.vscale anymore at all?
>
> As far as I understood it, it would be dependent on the target's DataLayout whether the GEP trick would be a valid transformation because of vector alignment (https://reviews.llvm.org/D71636#1792473).

Sure, but GEP is DataLayout specific in general. I am not an expert on this, but isn't this just a question of generating the right IR?

>> If you take a step back and look at the structure of the problem we’re working with, the existing ShuffleVector design is very problematic IMO. Among other things, knowing that shuffle output elements are undef but not knowing that about calls and loads of vectors is pretty weird.
>
> The masked load intrinsics have a predicate and a passthru operand. Is that not what you mean?

Sure, that’s one example of this. The general issue is that “reasoning about undef vector element results” is not specific to shuffles. It applies equally to other operations as well.

>> Why don’t we have something like the !range metadata that applies to load/call/shuffle to indicate that the result vector elements have some unknown elements? This would be far more general and correct.
>
> Instead of adding a new attribute, can we use a `select` (or perhaps a new intrinsic) to express which lanes are undef? That way the same mechanism would also work for scalable vectors without having to figure out some convoluted way to describe that in metadata.
>
> At the moment the `select` would be thrown away by InstCombine which folds `select %p, %v, undef -> %v`. I'm not sure to what extent this is desirable; it is useful to be folded away for certain purposes but not when you want to explicitly specify that something can be undef. Perhaps a new intrinsic would be more appropriate?

The goal here is to keep scalable and fixed vectors as similar as possible, while not allowing the mechanisms to regress fixed vectors. I don’t think that select will do that. We’re talking vector-element level undef analysis, not undef of the whole vector.

>>> The other aspect here is the use of "undef" in fixed shuffles. Frontends usually don't do this, but some of our optimizations for shuffles are built around replacing unused results of shuffles with "undef". We don't want to forbid transforming "<0, 4, 1, 5>" -> "<0, undef, 1, 5>" because we decided the former was the named shuffle "zip", and we want to allow transforms/analysis for zip-like shuffles on "<0, undef, 1, 5>". So I think it ends up being more straightforward to have a helper for "is this a zip", as opposed to a helper for transforming a "zip" to a fixed shuffle mask.
>
> For some shufflemask like `<0, undef, undef, undef>`, I guess this means that 'isSplat' and 'isZipLo' would both return true?

Yes. That’s already generally true of the target isel stuff.

> Just recapping the discussion here for my understanding (please correct me where I'm wrong):
>
> - The original proposal is to extend `shufflevector` to take either a 'shuffle pattern' or a 'shuffle mask'.
>   - For fixed-width shuffles it will retain the mask, in case any of the elements is 'undef' (so not to lose that information)
>   - ShuffleVectorInst is extended with methods to query if something e.g. is a zip/splat/etc.
>
> - An alternative would be to not unify these into a single `shufflevector`, but rather have multiple operations.
>   - InstCombine can segment shuffles into two exclusive sets of operations: known shuffle patterns vs unknown pattern with explicit mask
>   - To avoid multiple representations for the same shuffle, we want to map exclusively to either 'explicit mask' or 'known mask' operations.
>     But in order not to lose information about lanes being undef, we'd then need to model the 'undef' lanes separately from the instruction.

Right.

> For either route, does this mean we'll also want to consider supporting data-dependent shuffle masks? (I don't see why we wouldn't support those if we go through the lengths of adding support for more types of shuffles).

I’m saying that this is a likely future, but I would really prefer NOT to lump it into the same discussion. It is just useful to keep the bigger context in mind while talking about specific points.

-Chris

As discussed on the April 2nd SVE Sync-up call, here are a few more shuffles that I would like to see for complex number support:

1) Duplicate even elements of one vector: (<0,0,2,2>) // could be TRN1(x, x) intrinsic
2) Duplicate odd elements of one vector: (<1,1,3,3>) // could be TRN2(x, x) intrinsic
3) Swap even/odd elements of one vector: (<1,0,3,2>) // could be select(1010b, TRN1(x,x), TRN2(x,x))
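
For reference, the fixed-length masks above generalize along these lines for any (even) number of elements; this is just a sketch to show the element mappings, and the helper names are made up:

#include "llvm/ADT/SmallVector.h"
using namespace llvm;

// <0,0,2,2,...>: duplicate the even elements of one vector.
static void getDupEvenMask(unsigned NumElts, SmallVectorImpl<int> &Mask) {
  for (unsigned I = 0; I != NumElts; ++I)
    Mask.push_back(int(I & ~1u));
}
// <1,1,3,3,...>: duplicate the odd elements of one vector.
static void getDupOddMask(unsigned NumElts, SmallVectorImpl<int> &Mask) {
  for (unsigned I = 0; I != NumElts; ++I)
    Mask.push_back(int(I | 1u));
}
// <1,0,3,2,...>: swap each even/odd pair of one vector.
static void getSwapPairsMask(unsigned NumElts, SmallVectorImpl<int> &Mask) {
  for (unsigned I = 0; I != NumElts; ++I)
    Mask.push_back(int(I ^ 1u));
}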

The other shuffles that I need locally have been covered already.