[RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

Hi Graham,

Just a few initial comments.

Graham Hunter <Graham.Hunter@arm.com> writes:

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
  bytes.

"scalable" instead of "scalable x."

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Can you explain a bit about what the two integers represent? What's the
"unscaled" part for?

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back. Do
clients actually need constant integer values or would a ConstantExpr
suffice? We could add a ConstantVScale or something to make it work.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but scaled sizes cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison. We have code here that adds ConstantExpr support to
ScalarEvolution. We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

I think this may be a case where adding a full-fledged Instruction might
be worthwhile. Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important. I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

This is another case where an Instruction might be better, for the same
reasons as with vscale.

Also, "iota" is the name Cray has traditionally used for this operation
as it is the mathematical name for the concept. It's also used by C++
and Go and so should be familiar to many people.

Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

As above, we could add ConstantVScale and also ConstantStepVector (or
ConstantIota). They won't fold to compile-time values but the
expressions could be simplified. I haven't really thought through the
implications of this, just brainstorming ideas. What does your
downstream compiler require in terms of constant support? What kinds of
queries does it need to do?

                         -David

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back.

Same here.

If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison.

This sounds like a cleaner solution.

I think this may be a case where adding a full-fledged Instruction might
be worthwhile. Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important. I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.

...

This is another case where an Instruction might be better, for the same
reasons as with vscale.

This is a side-effect of the original RFC a few years ago. The general
feeling was that we can start with intrinsics and, if they work, we
change the IR.

We can have a work-in-progress implementation before fully committing
SCEV and other core changes, and then slowly, and with more certainty,
move to where it makes more sense.

Also, "iota" is the name Cray has traditionally used for this operation
as it is the mathematical name for the concept. It's also used by C++
and Go and so should be familiar to many people.

That sounds better, but stepvector is more than C++'s iota: it's a plain
scalar evolution sequence like {start, op, step}. In C++'s iota (AFAICS),
the step is always 1.

Anyway, I don't mind any name, really. Whatever is more mnemonic.

;; Create sequence for scalable vector
%stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
%mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
%addoffset = add <scalable 4 x i32> %mulbystride, %str_off

Once stepvector (or iota) becomes a proper IR instruction, I'd like to
see this restricted to inlined syntax. The sequence { step*vec +
offset } only makes sense in the scalable context and the intermediate
results should not be used elsewhere.

Renato Golin <renato.golin@linaro.org> writes:

This is another case where an Instruction might be better, for the same
reasons as with vscale.

This is a side-effect of the original RFC a few years ago. The general
feeling was that we can start with intrinsics and, if they work, we
change the IR.

We can have a work-in-progress implementation before fully committing
SCEV and other core changes, and then slowly, and with more certainty,
move to where it makes more sense.

Ok, that makes sense. I do think the goal should be making these proper
Instructions.

Also, "iota" is the name Cray has traditionally used for this operation
as it is the mathematical name for the concept. It's also used by C++
and Go and so should be familiar to many people.

That sounds better, but stepvector is more than C++'s iota: it's a plain
scalar evolution sequence like {start, op, step}. In C++'s iota (AFAICS),
the step is always 1.

I thought stepvector was also always step one, as Graham states a
multiply by a constant splat must be used for other step values.

Anyway, I don't mind any name, really. Whatever is more mnemonic.

Me neither. I was simply noting some of the history surrounding the
operation and naming in other familiar places.

;; Create sequence for scalable vector
%stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
%mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
%addoffset = add <scalable 4 x i32> %mulbystride, %str_off

Once stepvector (or iota) becomes a proper IR instruction, I'd like to
see this restricted to inlined syntax. The sequence { step*vec +
offset } only makes sense in the scalable context and the intermediate
results should not be used elsewhere.

I'm not so sure. iota is a generally useful operation and scaling it to
various step values is also useful. It's used often for strided memory
access, which would be done via gather/scatter in LLVM but generating a
vector GEP via stepvector would be convenient and convey more semantic
information than, say, loading a constant vector of indices to feed the
GEP.

                               -David

Hi David,

Thanks for taking a look.

Hi Graham,

Just a few initial comments.

Graham Hunter <Graham.Hunter@arm.com> writes:

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
bytes.

"scalable" instead of "scalable x."

Yep, missed that in the conversion from the old <n x m x ty> format.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Can you explain a bit about what the two integers represent? What's the
"unscaled" part for?

'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back. Do
clients actually need constant integer values or would a ConstantExpr
suffice? We could add a ConstantVScale or something to make it work.

I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but scaled sizes cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison. We have code here that adds ConstantExpr support to
ScalarEvolution. We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.

We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

I think this may be a case where adding a full-fledged Instruction might
be worthwhile. Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important. I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.

We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

This is another case where an Instruction might be better, for the same
reasons as with vscale.

Also, "iota" is the name Cray has traditionally used for this operation
as it is the mathematical name for the concept. It's also used by C++
and Go and so should be familiar to many people.

Iota would be fine with me; I forget the reason we didn't go with that initially. We
also had 'series_vector' in the past, but that was a more generic form with start
and step parameters instead of requiring additional IR instructions to multiply and
add for the result as we do for stepvector.

Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

As above, we could add ConstantVScale and also ConstantStepVector (or
ConstantIota). They won't fold to compile-time values but the
expressions could be simplified. I haven't really thought through the
implications of this, just brainstorming ideas. What does your
downstream compiler require in terms of constant support? What kinds of
queries does it need to do?

It makes things a little easier to pattern match (just looking for a constant to start
instead of having to match multiple different forms of vscale or stepvector multiplied
and/or added in each place you're looking for them).

The bigger reason we currently depend on them being constant is that code generation
generally looks at a single block at a time, and there are several expressions using
vscale that we don't want to be generated in one block and passed around in a register,
since many of the load/store addressing forms for instructions will already scale properly.

We've done this downstream by having them be Constants, but if there's a good way
of doing them with intrinsics we'd be fine with that too.

-Graham

I thought stepvector was also always step one, as Graham states a
multiply by a constant splat must be used for other step values.

You're right! Sorry, it's been a while. Stepvector is a simple iota;
the multiplier and offset come from the operations.

I'm not so sure. iota is a generally useful operation and scaling it to
various step values is also useful. It's used often for strided memory
access, which would be done via gather/scatter in LLVM but generating a
vector GEP via stepvector would be convenient and convey more semantic
information than, say, loading a constant vector of indices to feed the
GEP.

My point is that those patterns will be generated by C-level
intrinsics or IR optimisation passes (like vectorisers), so they have
a specific meaning in that context.

What I fear is that some other pass like CSE finds those patterns and
commons them up at the top of the function/BB, and then the back-end
loses sight of what they were and can't generate the step increment
instruction in the end.

Hi David,

Thanks for taking a look.

Hi Graham,

Just a few initial comments.

Graham Hunter <Graham.Hunter@arm.com> writes:

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
bytes.

"scalable" instead of "scalable x."

Yep, missed that in the conversion from the old <n x m x ty> format.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Can you explain a bit about what the two integers represent? What's the
"unscaled" part for?

'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back. Do
clients actually need constant integer values or would a ConstantExpr
suffice? We could add a ConstantVScale or something to make it work.

I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes are also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but scaled sizes cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison. We have code here that adds ConstantExpr support to
ScalarEvolution. We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.

We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

I think this may be a case where adding a full-fledged Instruction might
be worthwhile. Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important. I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.

We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.

+CC Sanjoy to confirm: I think it should be fine to add SCEV support for intrinsics.

Graham Hunter <Graham.Hunter@arm.com> writes:

Can you explain a bit about what the two integers represent? What's the
"unscaled" part for?

'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.

Right, but what do they represent? If I have <scalable 4 x i32> is "32"
"unscaled" and "4" "scaled?" Or is "128" "scaled?" Or something else?

I see you answered this below.

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back. Do
clients actually need constant integer values or would a ConstantExpr
suffice? We could add a ConstantVScale or something to make it work.

I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.

Ok, now I understand what you're getting at. A ConstantExpr would
encapsulate this computation. We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof. I would see
VScaleConstant in that same tradition. In your struct example,
getSizeExpressionInBits would return:

add(mul(256, vscale), 64)

Does that satisfy your needs?

Is there anything about vscale or a scalable vector that requires a
minimum bit width? For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type. I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl. Not that we make any such things anymore. But maybe
someone else does?

If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison. We have code here that adds ConstantExpr support to
ScalarEvolution. We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.

We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.

Yes, that's probably true. A vscale SCEV is less invasive.

We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.

vscale as a Constant is interesting. It's a target-dependent Constant
like sizeof and offsetof. It doesn't have a statically known value and
maybe isn't "constant" across functions. So it's a strange kind of
constant.

Ultimately whatever is easier for LLVM to analyze in the long run is
best. Intrinsics often block optimization. I don't know whether vscale
would be "easier" as a Constant or an Instruction.

As above, we could add ConstantVScale and also ConstantStepVector (or
ConstantIota). They won't fold to compile-time values but the
expressions could be simplified. I haven't really thought through the
implications of this, just brainstorming ideas. What does your
downstream compiler require in terms of constant support? What kinds of
queries does it need to do?

It makes things a little easier to pattern match (just looking for a constant to start
instead of having to match multiple different forms of vscale or stepvector multiplied
and/or added in each place you're looking for them).

Ok. Normalization could help with this but I certainly understand the
issue.

The bigger reason we currently depend on them being constant is that code generation
generally looks at a single block at a time, and there are several expressions using
vscale that we don't want to be generated in one block and passed around in a register,
since many of the load/store addressing forms for instructions will already scale properly.

This is kind of like X86 memop folding. If a load has multiple uses, it
won't be folded, on the theory that one load is better than many folded
loads. If a load has exactly one use, it will fold. There's explicit
predicate code in the X86 backend to enforce this requirement. I
suspect if the X86 backend tried to fold a single load into multiple
places, Bad Things would happen (needed SDNodes might disappear, etc.).

Codegen probably doesn't understand non-statically-constant
ConstantExprs, since sizeof or offsetof can be resolved by the target
before instruction selection.

We've done this downstream by having them be Constants, but if there's a good way
of doing them with intrinsics we'd be fine with that too.

If vscale/stepvector as Constants works, it seems fine to me.

                               -David

I'm not so sure. iota is a generally useful operation and scaling it to
various step values is also useful. It's used often for strided memory
access, which would be done via gather/scatter in LLVM but generating a
vector GEP via stepvector would be convenient and convey more semantic
information than, say, loading a constant vector of indices to feed the
GEP.

My point is that those patterns will be generated by C-level
intrinsics or IR optimisation passes (like vectorisers), so they have
a specific meaning in that context.

What I fear is that some other pass like CSE finds those patterns and
commons them up at the top of the function/BB, and then the back-end
loses sight of what they were and can't generate the step increment
instruction in the end.

Got it. Graham hit this point as well. I took your suggestion as
"fusing" the iota/scale/offset together. I would still want to be able
to generate things like stepvector without scaling and adding offsets (I
suppose a scale of 1 and offset of 0 would be ok, but ugly). I don't
really care if we prevent CSE of such things.

                            -David

Hi David,

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back. Do
clients actually need constant integer values or would a ConstantExpr
suffice? We could add a ConstantVScale or something to make it work.

I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.

Ok, now I understand what you're getting at. A ConstantExpr would
encapsulate this computation. We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof. I would see
VScaleConstant in that same tradition. In your struct example,
getSizeExpressionInBits would return:

add(mul(256, vscale), 64)

Does that satisfy your needs?

Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
isn't for expressing the size in IR, where you would indeed just multiply by vscale and
add any fixed-length size.

This is for the analysis code around the IR -- lots of code asks for the size of a Type in
bits to determine what it can do to a Value with that type. Some of them are specific to
scalar Types, like determining whether a sign/zero extend is needed. Others would
apply to vector types (including scalable vectors), such as checking whether two
Types have the exact same size so that a bitcast can be used instead of a more
expensive operation like copying to memory and back to convert.

See 'getTypeSizeInBits' and 'getTypeStoreSizeInBits' in DataLayout -- they're used
a few hundred times throughout the codebase, and to properly support scalable
types we'd need to return something that isn't just a single integer. Since most
backends won't support scalable vectors I suggested having a 'FixedSize' method
that just returns the single integer, but it may be better to just leave the existing method
as is and create a new method with 'Scalable' or 'VariableLength' or similar in the
name to make it more obvious in common code.

There are a few places where changes in IR may be needed; 'lifetime.start' markers in
IR embed size data, and we would need to either add a scalable term to that or
find some other way of indicating the size. That can be dealt with when we try to
add support for the SVE ACLE though.

Is there anything about vscale or a scalable vector that requires a
minimum bit width? For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type. I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl. Not that we make any such things anymore. But maybe
someone else does?

That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register's worth of data remaining. SVE uses
predication (masking) to achieve the same effect.

For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.

-Graham

Graham Hunter via llvm-dev <llvm-dev@lists.llvm.org> writes:

Ok, now I understand what you're getting at. A ConstantExpr would
encapsulate this computation. We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof. I would see
VScaleConstant in that same tradition. In your struct example,
getSizeExpressionInBits would return:

add(mul(256, vscale), 64)

Does that satisfy your needs?

Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
isn't for expressing the size in IR, where you would indeed just multiply by vscale and
add any fixed-length size.

Ok, thanks for clarifying. The use of "expression" is confusing.

This is for the analysis code around the IR -- lots of code asks for the size of a Type in
bits to determine what it can do to a Value with that type. Some of them are specific to
scalar Types, like determining whether a sign/zero extend is needed. Others would
apply to vector types (including scalable vectors), such as checking whether two
Types have the exact same size so that a bitcast can be used instead of a more
expensive operation like copying to memory and back to convert.

If this method returns two integers, how does LLVM interpret the
comparison? If the return value is { <unscaled>, <scaled> } then how
do, say { 1024, 0 } and { 0, 128 } compare? Doesn't it depend on the
vscale? They could be the same size or not, depending on the target
characteristics.

Are bitcasts between scaled types and non-scaled types disallowed? I
could certainly see an argument for disallowing it. I could argue that
for bitcasting purposes that the unscaled and scaled parts would have to
exactly match in order to do a legal bitcast. Is that the intent?

Is there anything about vscale or a scalable vector that requires a
minimum bit width? For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type. I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl. Not that we make any such things anymore. But maybe
someone else does?

That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register worth of data remaining. SVE uses
predication (masking) to achieve the same effect.

For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.
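The implied-predicate idea can be sketched in scalar terms. In this illustrative model (not real backend code), `MaxVL` stands in for vscale times the minimum element count, and the inner loop bound plays the role of the predicate a hypothetical `setvl` intrinsic would return:

```cpp
#include <algorithm>

// Scalar model of a strip-mined vector loop: each iteration covers up to
// MaxVL elements, and the 'setvl' result caps the number of lanes that
// actually execute -- equivalent to an implied predicate of VL leading ones.
void vaddStripMined(const int *A, const int *B, int *C, long N, long MaxVL) {
  for (long I = 0; I < N; I += MaxVL) {
    long VL = std::min(N - I, MaxVL); // what 'setvl' would return
    for (long J = 0; J < VL; ++J)     // the implied predicate on each op
      C[I + J] = A[I + J] + B[I + J];
  }
}
```

The last trip through the outer loop naturally handles the sub-register-sized remainder, which is the case SVE instead covers with predication.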

Right. In that way the RISC-V method is very much like what the old
Cray machines did with the Vector Length register.

So in LLVM IR you would have "setvl" return a predicate and then apply
that predicate to operations using the current select method? How does
instruction selection map that back onto a simple setvl + unpredicated
vector instructions?

For conditional code both vector length and masking must be taken into
account. If "setvl" returns a predicate then that predicate would have
to be combined in some way with the conditional predicate (typically via
an AND operation in an IR that directly supports predicates). Since
LLVM IR doesn't have predicates _per se_, would it turn into nested
selects or something? Untangling that in instruction selection seems
difficult but perhaps I'm missing something.

                                 -David

Hi,

Graham Hunter via llvm-dev <llvm-dev@lists.llvm.org> writes:

Ok, now I understand what you're getting at. A ConstantExpr would
encapsulate this computation. We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof. I would see
VScaleConstant in that same tradition. In your struct example,
getSizeExpressionInBits would return:

add(mul(256, vscale), 64)

Does that satisfy your needs?

Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
isn't for expressing the size in IR, where you would indeed just multiply by vscale and
add any fixed-length size.

Ok, thanks for clarifying. The use of "expression" is confusing.

This is for the analysis code around the IR -- lots of code asks for the size of a Type in
bits to determine what it can do to a Value with that type. Some of them are specific to
scalar Types, like determining whether a sign/zero extend is needed. Others would
apply to vector types (including scalable vectors), such as checking whether two
Types have the exact same size so that a bitcast can be used instead of a more
expensive operation like copying to memory and back to convert.

If this method returns two integers, how does LLVM interpret the
comparison? If the return value is { <unscaled>, <scaled> } then how
do, say { 1024, 0 } and { 0, 128 } compare? Doesn't it depend on the
vscale? They could be the same size or not, depending on the target
characteristics.

I did have a paragraph on that in the RFC, but perhaps a list would be
a better format (assuming X,Y,etc are non-zero):

{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.

{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
                         functions that inherit vector length. Cannot be
                         compared across non-inheriting functions.

{ X, 0 } > { 0, Y }: Cannot return true.

{ X, 0 } = { 0, Y }: Cannot return true.

{ X, 0 } < { 0, Y }: Can return true.

{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
                             terms and try the above comparisons; it
                             may not be possible to get a good answer.

I don't know if we need a 'maybe' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.

I think in code, we'd have an inline function to deal with the first case
and a likely-not-taken call to a separate function to handle all the
scalable cases.
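As a sketch of how those rules might look in code (hypothetical names, not existing LLVM API; this assumes both sizes come from the same function, or functions inheriting the same vscale):

```cpp
#include <algorithm>
#include <optional>

// A size expressed as Unscaled + Scaled * vscale bits, where vscale is an
// unknown integer >= 1. Illustrative only.
struct SizePair {
  unsigned Unscaled;
  unsigned Scaled;
};

// "LHS is smaller than RHS": true/false when known, nullopt when the answer
// depends on the runtime value of vscale.
std::optional<bool> isKnownSmaller(SizePair LHS, SizePair RHS) {
  // Subtract common terms first, as described above.
  unsigned U = std::min(LHS.Unscaled, RHS.Unscaled);
  unsigned S = std::min(LHS.Scaled, RHS.Scaled);
  LHS.Unscaled -= U; RHS.Unscaled -= U;
  LHS.Scaled -= S;   RHS.Scaled -= S;

  if (LHS.Scaled == 0 && RHS.Scaled == 0)     // { X, 0 } vs { Y, 0 }
    return LHS.Unscaled < RHS.Unscaled;
  if (LHS.Unscaled == 0 && RHS.Unscaled == 0) // { 0, X } vs { 0, Y }
    return LHS.Scaled < RHS.Scaled;
  // { X, 0 } < { 0, Y } is known to hold when X < Y, since vscale >= 1.
  if (LHS.Scaled == 0 && RHS.Unscaled == 0 && LHS.Unscaled < RHS.Scaled)
    return true;
  return std::nullopt; // cannot be determined without knowing vscale
}
```

Note that an unscaled size is never known to be *greater* than a scaled one here, matching the rules in the list above.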

Are bitcasts between scaled types and non-scaled types disallowed? I
could certainly see an argument for disallowing it. I could argue that
for bitcasting purposes that the unscaled and scaled parts would have to
exactly match in order to do a legal bitcast. Is that the intent?

I would propose disallowing bitcasts, but allowing extracting a subvector
if the minimum number of scaled bits matches the number of unscaled bits.

Is there anything about vscale or a scalable vector that requires a
minimum bit width? For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type. I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl. Not that we make any such things anymore. But maybe
someone else does?

That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register worth of data remaining. SVE uses
predication (masking) to achieve the same effect.

For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.

Right. In that way the RISC-V method is very much like what the old
Cray machines did with the Vector Length register.

So in LLVM IR you would have "setvl" return a predicate and then apply
that predicate to operations using the current select method? How does
instruction selection map that back onto a simple setvl + unpredicated
vector instructions?

For conditional code both vector length and masking must be taken into
account. If "setvl" returns a predicate then that predicate would have
to be combined in some way with the conditional predicate (typically via
an AND operation in an IR that directly supports predicates). Since
LLVM IR doesn't have predicates _per se_, would it turn into nested
selects or something? Untangling that in instruction selection seems
difficult but perhaps I'm missing something.

My idea is for the RISC-V backend to recognize when a setvl intrinsic has
been used, and replace the use of its value in AND operations with an
all-true value (with constant folding to remove unnecessary ANDs) then
replace any masked instructions (generally loads, stores, anything else
that might generate an exception or modify state that it shouldn't) with
target-specific nodes that understand the dynamic vlen.

This could be part of lowering, or maybe a separate IR pass, rather than ISel.
I *think* this will work, but if someone can come up with some IR where it
wouldn't work then please let me know (e.g. global-state-changing instructions
that could move out of blocks where one setvl predicate is used and into one
where another is used).

Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.

-Graham

RVV is a little bit behind SVE in the process :-) On the whole it's
following the style of vector processor that has had several
implementations at Berkeley, dating back a decade or more. The set of
operations is pretty much nailed down now, but things such as the exact
instruction encodings are still in flux. There is an intention to get some
experience with compilers and FPGA (at least) implementations of the
proposal before ratifying it as part of the RISC-V standard. So details
could well change during that period.

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it’s a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called “Stack Regions” but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I’ll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

Hi Robin,

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

Good that you're pointing this out. For now, we can be overly conservative and
assume the in-memory size of a scalable vector to be the architecturally defined
maximum. Naturally this is not very efficient as we allocate much more space
on the stack than needed.

I'll be posting a separate RFC that addresses a more optimal frame-layout
with scalable vectors in the coming weeks.

Cheers,

Sander

Hi Graham,
Hi David,

glad to hear other people are thinking about RVV codegen!

Hi,

>
> Graham Hunter via llvm-dev <llvm-dev@lists.llvm.org> writes:
>>> Is there anything about vscale or a scalable vector that requires a
>>> minimum bit width? For example, is this legal?
>>>
>>> <scalable 1 x double>
>>>
>>> I know it won't map to an SVE type. I'm simply curious because
>>> traditionally Cray machines defined vectors in terms of
>>> machine-dependent "maxvl" with an element type, so with the above vscale
>>> would == maxvl. Not that we make any such things anymore. But maybe
>>> someone else does?
>>
>> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
>> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
>> length within the loop so that they can perform the last iterations of a loop within vector
>> registers when there's less than a full register worth of data remaining. SVE uses
>> predication (masking) to achieve the same effect.

Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
x T> is currently allowed, and unlike the scalable variant that's not
even useful), and scalable vectors would be the sole legal vector types
in the RISC-V backend.

>> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
>> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
>> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
>> predicate on all operations. This avoids needing to add a token operand to all existing
>> instructions that work on vector types.

Yes, vscale would be the *maximum* vector length (how many elements
fit into each register), not the *active* vector length (how many
elements are operated on in the current loop iteration).

This has nothing to do with tokens, though. The tokens I proposed were
to encode the fact that even 'maxvl' varies on a function by function
basis. This RFC approaches the same issue differently, but it's still
there -- in terms of this RFC, operations on scalable vectors depend
on `vscale`, which is "not necessarily [constant] across functions".
That implies, for example, that an unmasked <scalable 4 x i32> load or
store (which accesses vscale * 16 bytes of memory) can't generally be
moved from one function to another unless it's somehow ensured that
both functions will have the same vscale. For that matter, the same
restriction applies to calls to `vscale` itself.

The evolution of the active vector length is a separate problem and
one that doesn't really impact the IR type system (nor one that can
easily be solved by tokens).

>
> Right. In that way the RISC-V method is very much like what the old
> Cray machines did with the Vector Length register.
>
> So in LLVM IR you would have "setvl" return a predicate and then apply
> that predicate to operations using the current select method? How does
> instruction selection map that back onto a simple setvl + unpredicated
> vector instructions?
>
> For conditional code both vector length and masking must be taken into
> account. If "setvl" returns a predicate then that predicate would have
> to be combined in some way with the conditional predicate (typically via
> an AND operation in an IR that directly supports predicates). Since
> LLVM IR doesn't have predicates _per se_, would it turn into nested
> selects or something? Untangling that in instruction selection seems
> difficult but perhaps I'm missing something.

My idea is for the RISC-V backend to recognize when a setvl intrinsic has
been used, and replace the use of its value in AND operations with an
all-true value (with constant folding to remove unnecessary ANDs) then
replace any masked instructions (generally loads, stores, anything else
that might generate an exception or modify state that it shouldn't) with
target-specific nodes that understand the dynamic vlen.

I am not quite so sure about turning the active vector length into
just another mask. It's true that the effects on arithmetic, load,
stores, etc. are the same as if everything executed under a mask like
<1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
active vector length. However, actually materializing the masks in the
IR means the RISCV backend has to reverse-engineer what it must do
with the vl register for any given (masked or unmasked) vector
operation. The stakes for that are rather high, because (1) it applies
to pretty much every single vector operation ever, and (2) when it
fails, the codegen impact is incredibly bad.

(1) The vl register affects not only loads, stores and other
operations with side effects, but all vector instructions, even pure
computation (and reg-reg moves, but that's not relevant for IR). An
integer vector add, for example, only computes src1[i] + src2[i] for 0
<= i < vl and the remaining elements of the destination register (from
vl upwards) are zeroed. This is very natural for strip-mined loops
(you'll never need those elements), but it means an unmasked IR level
vector add is a really bad fit for the RISC-V 'vadd' instruction.
Unless the backend can prove that only the first vl elements of the
result will ever be observed, it will have to temporarily set vl to
MAXVL so that the RVV instruction will actually compute the "full"
result. Establishing that seems like it will require at least some
custom data flow analysis, and it's unclear how robust it can be made.
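A scalar model of the 'vadd' behaviour described above (illustrative only, based on the stated semantics, not on any real backend code):

```cpp
// RVV-style vector add as described: lanes below vl are summed, lanes from
// vl up to maxvl are zeroed. The tail zeroing is exactly why an unmasked
// IR-level add is a poor fit unless only the first vl lanes are observed.
void rvvVAdd(const int *Src1, const int *Src2, int *Dst, long VL, long MaxVL) {
  for (long I = 0; I < VL; ++I)
    Dst[I] = Src1[I] + Src2[I];
  for (long I = VL; I < MaxVL; ++I)
    Dst[I] = 0; // elements from vl upwards are zeroed
}
```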

(2) Failing to properly use vl for some vector operation is worse than
e.g. materializing a mask you wouldn't otherwise need. It requires
that too (if the operation is masked), but more importantly it needs
to save vl, change it to MAXVL, and finally restore the old value.
That's quite expensive: besides the ca. 3 extra instructions and the
scratch GPR required, this save/restore dance can have other nasty
effects depending on uarch style. I'd have to consult the hardware
people to be sure, but from my understanding risks include pipeline
stalls and expensive roundtrips between decoupled vector and scalar
units.

To be clear: I have not yet experimented with any of this, so I'm not
saying this is a deal breaker. A well-engineered "demanded elements"
analysis may very well be good enough in practice. But since we
broached the subject, I wanted to mention this challenge. (I'm
currently side stepping it by not using built-in vector instructions
but instead intrinsics that treat vl as magic extra state.)

This could be part of lowering, or maybe a separate IR pass, rather than ISel.
I *think* this will work, but if someone can come up with some IR where it
wouldn't work then please let me know (e.g. global-state-changing instructions
that could move out of blocks where one setvl predicate is used and into one
where another is used).

There are some operations that use vl for things other than simple
masking. To give one example, "speculative" loads (which silence
some exceptions to safely permit vectorization of some loops with
data-dependent exits, such as strlen) can shrink vl as a side effect.
I believe this can be handled by modelling all relevant operations
(including setvl itself) as intrinsics that have side effects or
read/write inaccessible memory. However, if you want to have the
"current" vl (or equivalent mask) around as SSA value, you need to
"reload" it after any operation that updates vl. That seems like it
could get a bit complex if you want to do it efficiently (in the
limit, it seems equivalent to SSA construction).

Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.

I'm very sorry for that, I know how frustrating it can be. I hope the
above gives a clearer picture of the constraints involved. Exact
instructions, let alone encodings, are still in flux as Bruce said.

Cheers,
Robin

Hi Robin,

Thanks for the comments; replies inline (except for the stack regions question;
Sander is handling that side of things).

-Graham

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it’s a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called “Stack Regions” but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I’ll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

I agree it would be nice to catch invalid comparisons; I’ve considered a
function that would take two 'Value*'s instead of types so that the parent
could be determined, but I don’t think that works in all cases.

Using an optional ‘Function*’ argument in a size comparison function will
work; I think many of the existing size queries are on scalar values in code
that has already explicitly checked that it’s not working on aggregate types.

It may be a better idea to catch misuses when trying to clone instructions
into a different function.

Agreed, it’s orthogonal to the scalable vector part.

So it does; there are a few more cases around the codebase that don’t use that function
though. I may create a cleanup patch for them.

Other possibilities might be ‘requires[Sign|Zero]Extension’, ‘requiresTrunc’, ‘getLargestType’,
etc.

It’s a leftover from me converting from the constant, which could be of any
integer type.

The latter, though in this case I updated the example IR by hand so didn’t catch
that case ;-)

The IR in the example patch was generated by the compiler, and does split out the
multiply into a separate instruction (which is then strength-reduced to a shift).

Graham Hunter <Graham.Hunter@arm.com> writes:

========
1. Types

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.
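A minimal sketch of the proposed class (an illustration of the RFC text above, not the exact LLVM implementation):

```cpp
// ElementCount per the proposal: the count is exactly Min when Scalable is
// false, or an unknown runtime multiple of Min (vscale * Min) when true.
// The operators let generic code halve or double a count without caring
// whether the vector is scalable.
struct ElementCount {
  unsigned Min;  // minimum number of elements
  bool Scalable; // true => count is an unknown multiple of Min

  ElementCount operator/(unsigned RHS) const { return {Min / RHS, Scalable}; }
  ElementCount operator*(unsigned RHS) const { return {Min * RHS, Scalable}; }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, halving `<scalable 4 x i32>`'s count yields the count of `<scalable 2 x i32>`, just as halving `<4 x i32>`'s yields that of `<2 x i32>`.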

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

How does this work in practice? Let's say I populate a vector with a
splat. Presumably, that gives me a "full length" vector. Let's say the
type is <scalable 2 x double>. How do I split the vector and get
something half the width? What is its type? How do I split it again
and get something a quarter of the width? What is its type? How do I
use half- and quarter-width vectors? Must I resort to predication?

This split question comes into play for backward compatibility. How
would one take a scalable vector and pass it into a NEON library? It is
likely that some math functions, for example, will not have SVE versions
available.

Is there a way to represent "double width" vectors? In mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on. I guess the more
fundamental question is about how such loops should be handled.

What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero. How would I, say,
insertelement into the upper half of the vector? Or any arbitrary
place? Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough? It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.

Thanks for your patience. :-)

                          -David

Hi Robin,

responses inline.

-Graham

Hi Graham,
Hi David,

glad to hear other people are thinking about RVV codegen!

Hi,

Graham Hunter via llvm-dev <llvm-dev@lists.llvm.org> writes:

Is there anything about vscale or a scalable vector that requires a
minimum bit width? For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type. I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl. Not that we make any such things anymore. But maybe
someone else does?

That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register worth of data remaining. SVE uses
predication (masking) to achieve the same effect.

Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
x T> is currently allowed, and unlike the scalable variant that's not
even useful), and scalable vectors would be the sole legal vector types
in the RISC-V backend.

For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.

Yes, vscale would be the *maximum* vector length (how many elements
fit into each register), not the *active* vector length (how many
elements are operated on in the current loop iteration).

This has nothing to do with tokens, though. The tokens I proposed were
to encode the fact that even 'maxvl' varies on a function by function
basis. This RFC approaches the same issue differently, but it's still
there -- in terms of this RFC, operations on scalable vectors depend
on `vscale`, which is "not necessarily [constant] across functions".
That implies, for example, that an unmasked <scalable 4 x i32> load or
store (which accesses vscale * 16 bytes of memory) can't generally be
moved from one function to another unless it's somehow ensured that
both functions will have the same vscale. For that matter, the same
restriction applies to calls to `vscale` itself.

The evolution of the active vector length is a separate problem and
one that doesn't really impact the IR type system (nor one that can
easily be solved by tokens).

Agreed.

Right. In that way the RISC-V method is very much like what the old
Cray machines did with the Vector Length register.

So in LLVM IR you would have "setvl" return a predicate and then apply
that predicate to operations using the current select method? How does
instruction selection map that back onto a simple setvl + unpredicated
vector instructions?

For conditional code both vector length and masking must be taken into
account. If "setvl" returns a predicate then that predicate would have
to be combined in some way with the conditional predicate (typically via
an AND operation in an IR that directly supports predicates). Since
LLVM IR doesn't have predicates _per se_, would it turn into nested
selects or something? Untangling that in instruction selection seems
difficult but perhaps I'm missing something.

My idea is for the RISC-V backend to recognize when a setvl intrinsic has
been used, and replace the use of its value in AND operations with an
all-true value (with constant folding to remove unnecessary ANDs) then
replace any masked instructions (generally loads, stores, anything else
that might generate an exception or modify state that it shouldn't) with
target-specific nodes that understand the dynamic vlen.

I am not quite so sure about turning the active vector length into
just another mask. It's true that the effects on arithmetic, load,
stores, etc. are the same as if everything executed under a mask like
<1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
active vector length. However, actually materializing the masks in the
IR means the RISCV backend has to reverse-engineer what it must do
with the vl register for any given (masked or unmasked) vector
operation. The stakes for that are rather high, because (1) it applies
to pretty much every single vector operation ever, and (2) when it
fails, the codegen impact is incredibly bad.

I can see where the concern comes from; we had problems reconstructing
semantics when experimenting with search loop vectorization and often
had to fall back on default (slow) generic cases.

My main reason for proposing this was to try and ensure that the size was
consistent from the point of view of the query functions we were discussing
in the main thread. If you're fine with all size queries assuming maxvl (so
things like stack slots would always use the current configured maximum
length), then I don't think there's a problem with dropping this part of the
proposal and letting you find a better representation of active length.

(1) The vl register affects not only loads, stores and other
operations with side effects, but all vector instructions, even pure
computation (and reg-reg moves, but that's not relevant for IR). An
integer vector add, for example, only computes src1[i] + src2[i] for 0
<= i < vl and the remaining elements of the destination register (from
vl upwards) are zeroed. This is very natural for strip-mined loops
(you'll never need those elements), but it means an unmasked IR level
vector add is a really bad fit for the RISC-V 'vadd' instruction.
Unless the backend can prove that only the first vl elements of the
result will ever be observed, it will have to temporarily set vl to
MAXVL so that the RVV instruction will actually compute the "full"
result. Establishing that seems like it will require at least some
custom data flow analysis, and it's unclear how robust it can be made.

(2) Failing to properly use vl for some vector operation is worse than
e.g. materializing a mask you wouldn't otherwise need. It requires
that too (if the operation is masked), but more importantly it needs
to save vl, change it to MAXVL, and finally restore the old value.
That's quite expensive: besides the ca. 3 extra instructions and the
scratch GPR required, this save/restore dance can have other nasty
effects depending on uarch style. I'd have to consult the hardware
people to be sure, but from my understanding risks include pipeline
stalls and expensive roundtrips between decoupled vector and scalar
units.

Ah, I hadn't appreciated you might need to save/restore the VL like that.
I'd worked through a couple of small example loops and it seemed fine,
but hadn't looked at more complicated cases.

To be clear: I have not yet experimented with any of this, so I'm not
saying this is a deal breaker. A well-engineered "demanded elements"
analysis may very well be good enough in practice. But since we
broached the subject, I wanted to mention this challenge. (I'm
currently side stepping it by not using built-in vector instructions
but instead intrinsics that treat vl as magic extra state.)

This could be part of lowering, or maybe a separate IR pass, rather than ISel.
I *think* this will work, but if someone can come up with some IR where it
wouldn't work then please let me know (e.g. global-state-changing instructions
that could move out of blocks where one setvl predicate is used and into one
where another is used).

There are some operations that use vl for things other than simple
masking. To give one example, "speculative" loads (which silence
some exceptions to safely permit vectorization of some loops with
data-dependent exits, such as strlen) can shrink vl as a side effect.
I believe this can be handled by modelling all relevant operations
(including setvl itself) as intrinsics that have side effects or
read/write inaccessible memory. However, if you want to have the
"current" vl (or equivalent mask) around as SSA value, you need to
"reload" it after any operation that updates vl. That seems like it
could get a bit complex if you want to do it efficiently (in the
limit, it seems equivalent to SSA construction).

Ok; the fact that there's more instructions that can change vl and that you might
need to reload it is useful to know.

SVE uses predication to achieve the same via the first-faulting/no-faulting
load instructions and the ffr register.

I think SVE having 16 predicate registers (vs. 8 for RVV and AVX-512) has led
to us using the feature quite widely with our own experiments; I'll try looking for
non-predicated solutions as well when we try to expand scalable vectorization
capabilities.

Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.

I'm very sorry for that, I know how frustrating it can be. I hope the
above gives a clearer picture of the constraints involved. Exact
instructions, let alone encodings, are still in flux as Bruce said.

Yes, the above information is definitely useful, even if I don't have a complete
picture yet. Thanks.

Hi,

Just to clarify the part below about changing the proposal, this means that a vector:

  • Has a size of vscale * sizeof(ty) when reasoning about sizes within llvm
  • Potentially has a smaller size of active_vscale * sizeof(ty) depending on the current
    processor state at runtime.

For SVE, active_vscale == vscale. For RVV (and potentially others), it can be smaller based
on the VL state generated by setvl or similar.

-Graham