[RFC] Introduce a new stepvector operation

Hi,

As part of adding support for scalable vectorization we need to update llvm::InnerLoopVectorizer::getStepVector for scalable vectors. Currently this just returns a constant vector with the sequence <0, 1, 2, 3, …>, however this assumes we are using fixed length vectors. For scalable vectors we need an operation that does the same thing, but without having to explicitly initalise all the elements. Any new stepvector operation we provide could also be used for fixed length vectors too if desired.

I believe the desirable properties of the operation should be:

  1. The operation requires no arguments and simply returns a vector with the numeric sequence <0, 1, 2, …>

  2. For types with a large number of elements, e.g. <vscale x 32 x i8> (vscale = 16), there is the possibility of the sequence value exceeding the limit of the type midway through the vector. In such cases we define the operation such that those elements are undefined or poison values.

A simple ‘stepvector’ operation (however we choose to implement it) with the properties described above can then be used together with additional ‘mul’ and ‘add’ instructions to create any arbitrary linear sequence, i.e. <0, 2, 4, 6, …> or <1, 3, 5, 7, …>

The first possible implementation with the properties as described above involves using a new llvm.stepvector() intrinsic with no arguments that simply returns a vector sequence <0, 1, 2, …> of the requested type, i.e.

declare <vscale x 4 x i32> @llvm.stepvector.nxv4i32()

Introducing a new intrinsic is simple to implement and we can easily come up with an appropriate cost model – cheap for fixed width vectors or scalable vectors using SVE where we have the ‘index’ instruction.

However, since such an intrinsic is effectively returning a constant vector sequence we could instead implement it using a new ‘stepvector’ constant in a similar way to how ‘zeroinitializer’ works. This would be done using a new ConstantStepVector class similar to ConstantAggregateZero and would return a vector with the numeric sequence <0, 1, 2, …>. The main advantages of using a constant over an intrinsic are:

  1. It is easy to write tests in LLVM IR since ‘stepvector’ would work in the same way as ‘zeroinitializer’, i.e. “%1 = add <4 x i32> %0, stepvector”

  2. Creation of the node is easy with the simple interface:

static Constant *ConstantStepVector::get(Type Ty)

  1. It is easy to do optimisations, e.g. CSE, and pattern matching in IR.

The main disadvantages are:

  1. A scalable constant cannot be represented as well in the .data section, although we can still create a constant based on the architectural maximum for vscale. It’s worth pointing out that this problem also exists for zeroinitializer too – we’re just more likely to have cheap instructions to do the job.

  2. Harder to fit into the cost model due to it being a constant.

  3. There are some concerns that we might then have to support stepvector as a constant in the shufflevector operation too and that it should be restricted to zeroinitializer only.

Any thoughts or feedback you have would be much appreciated!

Kind Regards,

David Sherwood.

David,

Thanks for writing this up. I’d just like to speak to some concerns I have regarding shufflevector. As many of us know, shufflevector takes two vectors and a constant vector of i32, and does stuff. The constant shuffle mask can be scalable or fixed width. The shuffle mask is supposed to be an arbitrary constant vector, however for scalable vectors only zeroinitializer or undef are accepted. There are reasonable technical reasons for this state of affairs, but it reveals an issue: we don’t really handle constant scalable vectors very well. Surely there are other similar issues throughout the codebase, but this is one I struggle with regularly so it sticks out in my mind.

However, we probably want to be able to use the stepvector in shufflevector. For instance, if we had a stepvector literal, then we could implement vector concatenation in terms of shufflevector:

%a_concat_b = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> stepvector

In fact, a lot of useful shuffles can be implemented in terms of stepvector multiplied or added to some constants. Pulling from Eli’s list in https://lists.llvm.org/pipermail/llvm-dev/2020-January/138762.html, I can see:

%result = shufflevector <vscale x 4 x i32> %v1, <vscale x 4 x i32> %v2, SHUFFLE_NAME

SHUFFLE_NAME can be one of the following (with examples of the equivalent <4 x i32> shuffles):

concat - Concatenate the two operands (<0, 1, 2, 3, 4, 5, 6, 7>) → see above

split_low - Return the low half of the first operand (<0, 1>) → stepvector of type <vscale x n/2 x i32>

split_high - Return the high half of the first operand (<2, 3>) → (stepvector + splat(n/2)) of type <vscale x n/2 x i32>

zip_low - Zip together the low halves of the two operands (<0, 4, 1, 5>)

zip_high - Zip together the high halves of the two operands (<2, 6, 3, 7>)

unzip_even - Unzip the even elements of the two operands (<0, 2, 4, 6>) (stepvector + stepvector) of type

unzip_odd - unzip the odd elements of the two operands (<1, 3, 5, 7>) (stepvector + stepvector + splat(1)) of type

Unfortunately, all of these cannot be done because shufflevector only supports scalable undef or zeroinitializer. In order to support these cases, we would need to extend shufflevector to support stepvector (for concat), and arbitrary constant expressions for the rest. Supporting stepvector might not be so hard with the current scheme: if the shuffle is scalable, and the mask is <0, 1, …, n - 1>, then the input mask was a scalable stepvector. However, I think this illustrates my proposal: vector pattern literals. They could look like this in IR:

<0, 1, …> ; stepvector

This is also more flexible, because it enables lots of other scalable vector literals:

<7, 7, …> ; splat(7) without writing that horrid insertelement/shufflevector thing

<0, 2, …> ; unzip_even mask

<1, 3, …> ; unzip_odd mask

The implementation for shufflevector would be straightforward because the mask for the two currently supported cases of zeroinitializer and undef (<0, 0, …> <=> zeroinitializer and <undef, undef, …> <=> undef) already follow the proposed scheme. This could also have the side benefits of making some IR easier to read (for very wide vectors, the fixed width stepvector could be more than 80 columns wide), and might result in efficiency gains in the compiler (don’t need to walk a very wide vector to see if it is a stepvector; can just canonicalize <0, 1, 2, 3, 4, 5, 6, 7, …, 2048> once to ConstantPatternVector(0, 1)).

Sorry for rambling. I think I’ve personally come around to the idea that a constant would be good. However, a more flexible constant would be best if we’re going to use it to add a bunch of special cases to the codebase. Most special cases for scalable undef and zeroinitializer can be replaced with equivalent code that also handles vector pattern literals.

Thanks,

Christopher Tetreault

Glad to hear you've come around, Chris. Named shuffle constants are a
good compromise, and I would very much like to see that happen as
well.

Just to be clear: I think adding another magic named literal would be a mistake. I think being able to write <n, m, ...> would be great, and would obviate the need for many of the named shuffles in Eli's RFC as well as serve in David's use case.

Thanks,
   Christopher Tetreault

Oh, I misunderstood. If you want to go that route, why not use
"vscale" to imply that the constants are also scaled?

So something like:

<vscale x 2 x i32> <i32 0, i32 2>

would imply:

<i32 0, i32 2, i32 4, i32 6, ...>

I'm not sure how you'd formalize that yet though. And reverse would be
a little weird, but I suppose it would be a little weird in your
proposal too.

Actually, after writing this out, I think named constants would be
much more clear. There's no room for misinterpretation if they're
named.

Just a thought on another approach.

Unless I’m missing something, a step vector is simply a prefix sum over a constant vector. Do we have clean ways with scalable vectors of representing those concepts? If we do, we don’t need a separate intrinsic. If we don’t, we probably need them anyways?

Slightly more specifically:
allones = broadcast 0
sum = prefix_sum allones
stepvec = sum - allones

Out of those, the only tricky one is the prefix sum. Have we explored that?

Philip

I think the idea I've proposed is cleaner and more general than just inferring it by the type having vscale. It would be a shame if it were limited to just scalable vectors. I listed a few use cases for fixed width vectors as well. I'm not proposing this as a cure-all for shufflevector, it's just a more general version of the proposed stepvector that is useful in more cases. My personal opinion is that all of these "unary shuffles" should be a different instruction, but that's outside the scope of this RFC.

As for it being unclear: complicated code is hard to read in general, so complicated pattern vectors will be as well. But to me, "stepvector" only has meaning if I am familiar with the use case and/or read the docs. <0, 1, ...> is obvious to anybody who's taken any sort of higher math class.

Thanks,
   Christopher Tetreault

I think the idea I've proposed is cleaner and more general than just inferring it by the type having vscale. It would be a shame if it were limited to just scalable vectors. I listed a few use cases for fixed width vectors as well. I'm not proposing this as a cure-all for shufflevector, it's just a more general version of the proposed stepvector that is useful in more cases. My personal opinion is that all of these "unary shuffles" should be a different instruction, but that's outside the scope of this RFC.

Agreed, we need to discuss which of these should be unary or binary
shuffles. There are vector performance implications from what solution
we choose.

E.g. having a unary extract_even|odd(...) will leave us with a 1/2
width vector result, that we'll need to concat with another 1/2 width
vector. So I think an even|odd shuffle should be a binary shuffle.

But something like reverse(...), which provides a full vector result,
should probably be unary.

That all said, IMO we should keep the shuffles binary and use an undef
operator where desired. That gives the user the most flexibility, and
doesn't require adding new operations to LLVM.

As for it being unclear: complicated code is hard to read in general, so complicated pattern vectors will be as well. But to me, "stepvector" only has meaning if I am familiar with the use case and/or read the docs. <0, 1, ...> is obvious to anybody who's taken any sort of higher math class.

Playing Devil's Advocate here...

Shuffle vector accepts any mask, right? Would we be in the business of
scaling *any* mask the user gives with '...'?

E.g. a zip_lo would be <0, 4, 1, 5, ...>, right? If that's true, I'm
assuming that we could pluck any number of elements from each operand
in any order. Is that what you were thinking? Or are there
restrictions on what this generic form can accept?

And also consider something like reverse:

%r = shufflevector <vscale x 4 x float> %1, <vscale x 4 x float>
undef, <4 x i32> <i32 3, i32 0, ...>

That's ambiguous. Does it mean:

<7, 6, 5, 4, 3, 2, 1, 0>

Or:

<3, 2, 1, 0, 7, 6, 5, 4>

I'd much rather see it as:

%r = shufflevector <vscale x 4 x float> %1, <vscale x 4 x float>
undef, <vscale x 4 x i32> reverseinitializer

It's not, of course, the most expressive solution, but it's really
just syntactic sugar for things we do today.

Playing Devil's Advocate here...

Shuffle vector accepts any mask, right? Would we be in the business of scaling *any* mask the user gives with '...'?

E.g. a zip_lo would be <0, 4, 1, 5, ...>, right? If that's true, I'm assuming that we could pluck any number of elements from each operand in any order. Is that what you were thinking? Or are there restrictions on what this generic form can accept?

A lot of these more complicated shuffles would require us to be able to use `vscale` in expressions. For example, zip_lo is actually <0, vscale * 4, 1, (vscale * 4) + 1, ...>

And also consider something like reverse:

%r = shufflevector <vscale x 4 x float> %1, <vscale x 4 x float> undef, <4 x i32> <i32 3, i32 0, ...>

That's ambiguous. Does it mean:

<7, 6, 5, 4, 3, 2, 1, 0>

Or:

<3, 2, 1, 0, 7, 6, 5, 4>

For your reverse example, you actually will get <3, 0, -3, -6, ...>. Reverse would be <(vscale * 4) - 1, (vscale * 4) - 2, ...>.

As a starting point, I'm only suggesting we add linear patterns. <n, m, ...> is never ambiguous; in geometric terms, n is the origin of a ray, and m - n is the vector that forms the direction of the ray. I think this would give us a lot of value: it would enable the loop vectorizer to use getStepVector for scalable vectors, it would enable many new shufflevector patterns, it would make the IR easier to read, and it might even make the compiler faster by not having to iterate over masks to see if they are a stepvector mask. (I don't know how often we do this, if at all)

If we allowed vscale and arithmetic operators to be used in vector patterns, it would enable more shuffles. This may be worth doing, but it might overcomplicate the code, and it would require `vscale` to be a constant. If we allowed non-linear patterns, it would enable even more. No programming language that I'm familiar with that has this sort of thing tries to support non-linear patterns.

Personally, I think shufflevector tries to do too much. For the shuffles that are really needed, and can't be represented with a linear pattern mask, we should probably just add intrinsics or instructions for them.

Hi,

As part of adding support for scalable vectorization we need to update llvm::InnerLoopVectorizer::getStepVector for scalable vectors. Currently this just returns a constant vector with the sequence <0, 1, 2, 3, ..>, however this assumes we are using fixed length vectors. For scalable vectors we need an operation that does the same thing, but without having to explicitly initalise all the elements. Any new stepvector operation we provide could also be used for fixed length vectors too if desired.

I believe the desirable properties of the operation should be:

1. The operation requires no arguments and simply returns a vector with the numeric sequence <0, 1, 2, …>
2. For types with a large number of elements, e.g. <vscale x 32 x i8> (vscale = 16), there is the possibility of the sequence value exceeding the limit of the type midway through the vector. In such cases we define the operation such that those elements are undefined or poison values.

A simple ‘stepvector’ operation (however we choose to implement it) with the properties described above can then be used together with additional ‘mul’ and ‘add’ instructions to create any arbitrary linear sequence, i.e. <0, 2, 4, 6, …> or <1, 3, 5, 7, …>

The first possible implementation with the properties as described above involves using a new llvm.stepvector() intrinsic with no arguments that simply returns a vector sequence <0, 1, 2, …> of the requested type, i.e.

  declare <vscale x 4 x i32> @llvm.stepvector.nxv4i32()

Introducing a new intrinsic is simple to implement and we can easily come up with an appropriate cost model – cheap for fixed width vectors or scalable vectors using SVE where we have the ‘index’ instruction.

Are you planning on updating the existing code that generates vector constants to use @llvm.stepvector for fixed width vectors?

However, since such an intrinsic is effectively returning a constant vector sequence we could instead implement it using a new ‘stepvector’ constant in a similar way to how ‘zeroinitializer’ works. This would be done using a new ConstantStepVector class similar to ConstantAggregateZero and would return a vector with the numeric sequence <0, 1, 2, …>. The main advantages of using a constant over an intrinsic are:

1. It is easy to write tests in LLVM IR since ‘stepvector’ would work in the same way as ‘zeroinitializer’, i.e. “%1 = add <4 x i32> %0, stepvector”
2. Creation of the node is easy with the simple interface:
  static Constant *ConstantStepVector::get(Type Ty)

If there’s IRBuilder::createStepVector(Ty), the creation of the intrinsic should be also quite simple as well. A potential inconvenience of the intrinsic compared to a constant is that we may generate redundant intrinsic calls . Not sure how big of an issue that is going to be, but it shouldn’t be too big in practice, as long as only a few places create step vectors.

3. It is easy to do optimisations, e.g. CSE, and pattern matching in IR.

I think if the intrinsic has the right attributes (e.g. readnone), things like DCE, GVN & CSE should work automatically for the intrinsic. As for pattern-matching, you could just add a matcher for stepvector, so patterns would look similar regardless whether it is a constant or an intrinsic?

Given that we already have an intrinsic for vscale and there are going to be named shuffle intrinsics for scalable vectors, having another intrinsic for stepvector seems consistent with the previous decisions and the path of least resistance. Personally I am not convinced that mentioned benefits of the named constant are worth the trouble if we still have a set of intrinsics for vscale & the named shuffle.

Cheer,
Florian

Hi Florian,

RE: “Are you planning on updating the existing code that generates vector constants to use @llvm.stepvector for fixed width vectors?”

Are you referring to InnerLoopVectorizer::getStepVector()? I guess this might be

worthwhile for consistency and will ensure greater test coverage, provided it doesn’t

cause regressions in code quality?

With regards to the use of constants there is currently an ongoing discussion in

another set of replies around the use of vector literals such as <0, 2, …> or

<1, 3, …>, which you might be interested in? I agree that using an intrinsic should

be quite straightforward, but there suggestions for changing how we deal with

shufflevectors that would make use of such vector literals.

Kind Regards,

David.

I'm not opposed to it. But it seems like a lot of effort to solve only
some of our problems, with no clear path to solve them all. We're
really only getting stepvector and the strided shuffles out of it. The
more complicated shuffles, the interesting ones, will still be
intrinsics.

Hi Chris,

I’ve been following the discussions between you and Cameron and just so I understand

things correctly it looks like you’re proposing a new way of writing vector literals in IR

where the start and stride are determined from the first two values, i.e. <1, 3, …>

has a start of 1 and stride of 2. I quite like the idea of a vector literal, but are you

proposing that internally in C++ we have some sort of ConstantStepVector that contains

a start/stride pair? Or when parsing this vector literal do you expect us to internally

represent this as something like:

splat(1) + 2 * stepvector

where stepvector would be fixed as <0, 1, 2, 3, …>?

When emitting LLVM IR with something like “clang -S -emit-llvm” would you also

expect to see these vector literals being emitted too? I’m just thinking about how

this would look in vectorised code, for example normally when vectorising an

induction variable we’d see something like this:

vector.ph:

vector.body:

%vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], …

Are you suggesting the vector literal would be created in it’s own statement

in the preheader, or directly as a constant like before:

vector.body:

%vec.ind = phi <4 x i32> [ <i32 0, i32 1, …>, %vector.ph ], …

Kind Regards,

David.

David,

I am proposing a new C++ class (ConstantStepVector is fine; I don’t really care what color the bike shed is). In my scheme, there would be no stepvector literal, but you could write <0, 1, …> in IR and get the stepvector returned by llvm::InnerLoopVectorizer::getStepVector().

I would expect that you could write these ConstantStepVector literals directly in statements, so instead of:

vector.body:

%vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], …

you would have:

vector.body:

%vec.ind = phi <4 x i32> [ <i32 0, i32 1, …>, %vector.ph ], …

… and it would also allow you to have:

vector.body:

%vec.ind = phi <vscale x 4 x i32> [ <i32 0, i32 1, …>, %vector.ph ], …

At some point, I think it would probably make sense to start canonicalizing to it. As you know, for scalable splats we do this horrible insertelement/shufflevector thing, that should be replaced by <n, n, …> (for all splats except zero and undef). Scalable stepvector would have to use it, but it may make sense to have fixed width stepvector use it. I mentioned downthread several potential uses for it. For shufflevector, we currently allow zeroinitializer and undef for scalable shuffle masks. We could allow ConstantStepVector, and internally in C++ code produce the fixed subvector mask, and special case it for scalable shuffles (that code is already a bit gross).

Overall, I think having a constant literal is useful. However, if we’re going to add complexity to the language, let’s do it in the most flexible way possible. It would be silly to have the stepvector literal, and then decide to also have ConstantStepVector at some point in the future.

Thanks,

Christopher Tetreault

Hi Florian,

RE: “Are you planning on updating the existing code that generates vector constants to use @llvm.stepvector for fixed width vectors?”
Are you referring to InnerLoopVectorizer::getStepVector()? I guess this might be
worthwhile for consistency and will ensure greater test coverage, provided it doesn’t
cause regressions in code quality?

If we can guarantee that stepvector and the corresponding constant generate the same code in all cases, that seems reasonable. But I guess that would mean we would have to update all places that look through shuffles to also support stepvector? That seems like this is going to potentially be a lot of work and it is easy to miss things (and we have to ensure all new optimizations also consider both cases).

Will there be a lot of additional test coverage by using it? Are you anticipating most transforms to apply to both fixed width and scalable vectors?

What I am worried about is that we are claiming @llvm.stepvector behaves like the corresponding vector constant for fixed vectors, but then in practice run into situations where different code is generated, e.g. because we missed to update a transform to support both forms equally. I think it would be easier to avoid this problem altogether, by defining it away and allowing only scalable vector types for @llvm.stepvector, at least initially.

In terms of consistency/uniformity in LLVM code, we could have `Builder::CreateStepVector()` and `PatternMatch::m_StepVector` to unify creating and matching the different versions depending on the type.

Personally I also think using stepvector instead of fixed width vector constants like <0, 1, 2, 3> unnecessarily regresses the readability of the generated IR.

With regards to the use of constants there is currently an ongoing discussion in
another set of replies around the use of vector literals such as <0, 2, …> or
<1, 3, …>, which you might be interested in? I agree that using an intrinsic should
be quite straightforward, but there suggestions for changing how we deal with
shufflevectors that would make use of such vector literals.

I think extending vector literals would probably benefit from a wider context beyond stepvector and a wider audience, because it extends to scope of the proposal quite a bit.

Cheers,
Florian

Hi Chris,

I just wanted to say that I like this idea (i.e. <i32 0, i32 1, …>) very much but wonder how much of it is reliant on its potential uses by shufflevector. I ask because the shufflevector problem has extra pitfalls and I’m not convinced opening up a handful of extra shuffles (beyond the current splat support, which I consider a bit of a hack) is that great long term.

Essentially, I’m not a fan of having a nobbled shufflevector for scalable vectors as compared to fixed length vectors. Taking your splat example, I would much prefer a dedicated splat instruction that works for all vector types and be just as pretty for constant and variable splats, whereas ConstantStepVector only helps with constant splats.

As I say, I like the idea and so am really just asking if you think there’ll be enough pull to make it happen without tying it to shufflevector?

Paul!!!

Paul,

Really, the two big use cases for scalable vectors are stepvector and enabling more uses of shufflevector. shufflevector is the one constant expression that “doesn’t work” for scalable vectors yet. Realistically, it can never be made to work with arbitrary constant scalable vectors without a fundamental redesign. Another use case is that we could simplify more constant expressions if they are written in terms of ConstantStepVector. Using the original proposed stepvector literal, (here, splat(1) is the insertelement/shufflevector thing; I’m not going to write it out) splat(1) + stepvector is as simplified as it can get, but splat(1) + <0, 1, …> can be simplified to <1, 2, …>.

That said, there are other uses for fixed width vectors. It would enable writing more compact LLVM IR code, which is useful in test cases (stepvector of type <64 x i8> would span multiple lines). Additionally, pattern matching on a stepvector is linear in the width of the vector, but for ConstantStepVector it will have constant time. These are, admittedly, minor advantages. However, I think the level of effort in implementing ConstantStepVector with a start/stride pair vs ConstantStepVector that starts at zero and increments by one is not that great.

Thanks,

Christopher Tetreault

Thanks Chris, this all sounds great. I just felt the question needed to be asked as I didn’t want to have this new feature automatically imply new shufflevector features (where I still believe the best solution for scalable vectors is to just never use the shufflevector instruction, but that’s a story for a different day).

It matches very closely to our earlier design in that we implemented an indexvector instruction that took start and step values. This had a ConstantExpr variant, which other that the way it was printed, pretty much mapped to this <n, n+1,…> concept. We switched to stepvector to simplify the design along with our technical debt, plus it meant we had nicer handling for wrapping flags[1]. As you say having support for <imm, imm+1,…> instead of just <0, 1,…> sounds like a perfect upgrade.

[1] What’s the plan for wrapping? Are these constants implicit NSW/NUW?

Paul!!!

Paul,

I agree, the question was important. The fact is that these pattern vectors won’t just automatically work for scalable shuffles; somebody still needs to do that work. However, the current design of shufflevector could be easily extended to support them, and I think this extension would be non-controversial.

Regarding wrapping, I hadn’t given it much thought. I think David’s original proposal of having out of bounds values be undef or poison sounds good to me. Since the abstraction is “two points on a line”, the values must necessarily grow too large for the specified type (except in the <n, n, …> case), and the only thing they can reasonably do is become invalid.

Thanks,

Christopher Tetreault

Hi all,

Thanks for all the great input and suggestions! It has been really valuable

feedback. Having thought about this some more here is my proposal for

a way forward:

  1. In the short term we implement InnerLoopVectorizer::getStepVector()

for scalable vectors using a new llvm.experimental.stepvector intrinsic that

is only designed to work for scalable vectors. This is uncontroversial, easy to

implement and unblocks a lot of the necessary scalable vector work in the

vectoriser.

  1. At the same time I can send a new RFC proposal to the mailing list

outlining the introduction of the new vector literal construct that Chris

proposed. I think that this needs a new RFC and more input from the wider

community before we can start using this for getStepVector(). Once we’ve

got consensus that this is the desired way forward we can easily switch over to

using these new vector literals in the vectoriser, not just for scalable vectors

but for fixed width too.

Kind Regards,

David Sherwood.