Adding support for vscale

I've posted two patches on Phabricator to add support for VScale in LLVM.

A brief recap on `vscale`:
The scalable vector type in LLVM IR is defined as `<vscale x n x ty>`, to create types such as `<vscale x 16 x i8>` for a scalable vector of at least 16 bytes. In the definition of the scalable type, `vscale` is specified as a positive integer constant whose value is only known at runtime but is guaranteed to be constant throughout the program.

The first patch [1] adds support for `vscale` as a symbolic constant to the LLVM IR so that it can be used in address calculations, induction variable updates and other places where the `vscale` property needs to be used to generate code for scalable vectors.
The second patch [2] adds the ISD::VSCALE node which, if supported by the target, can be materialised into an instruction that returns the runtime value of `vscale`. It can also be folded into addressing modes, as needed for SVE/SVE2 reg+imm load/store instructions.
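
For illustration, here's a rough sketch of the kind of IR the first patch enables (the loop context and exact spelling are assumptions for illustration; the authoritative syntax is whatever D68202 lands with):

  ; step an induction variable by the number of elements in one
  ; <vscale x 16 x i8> register, i.e. vscale * 16:
  %index.next = add i64 %index, mul (i64 vscale, i64 16)

With the second patch, a target can then materialise that `mul (i64 vscale, i64 16)` via ISD::VSCALE, or fold it into a reg+imm addressing mode.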

I'm aware that Graham has discussed this before at previous dev meetings and that some had their reservations about exposing this as a Constant explicitly. But the reason for doing so is that the value is inherently constant: if it were not, the definition of the scalable type would be violated, and this change enforces that. Also, vscale is expected to be used in addressing modes, so moving/hoisting or any kind of GVN/CSE would obfuscate the use of vscale for these purposes and would need to be untangled in passes like CodeGenPrepare.

Hopefully the patches help clear up any questions/reservations people may have had previously (and if not, I hope this thread can be the platform to discuss these).

[1] https://reviews.llvm.org/D68202
[2] https://reviews.llvm.org/D68203

I've posted two patches on Phabricator to add support for VScale in LLVM.

A brief recap on `vscale`:
The scalable vector type in LLVM IR is defined as `<vscale x n x ty>`, to create types such as `<vscale x 16 x i8>` for a scalable vector of at least 16 bytes. In the definition of the scalable type, `vscale` is specified as a positive integer constant whose value is only known at runtime but is guaranteed to be constant throughout the program.

RISC-V RVV explicitly allows changing VL (which I am assuming is the
same as vscale) at runtime, so VL wouldn't be a constant.
Additionally, we (libre-riscv) are working on a similar scalar vectors
ISA called SimpleV that also allows changing VL at runtime and we are
planning on basing it on LLVM's scalable vector support.

[1] https://reviews.llvm.org/D68202
[2] https://reviews.llvm.org/D68203

Jacob Lifshay

I’ve posted two patches on Phabricator to add support for VScale in LLVM.

Excellent!

A brief recap on vscale:
The scalable vector type in LLVM IR is defined as <vscale x n x ty>, to create types such as <vscale x 16 x i8> for a scalable vector of at least 16 bytes. In the definition of the scalable type, vscale is specified as a positive integer constant whose value is only known at runtime but is guaranteed to be constant throughout the program.

Ah. Right. There is something known as data-dependent fail-on-first, which does not match with the assertion that vscale will be constant.

Yes, any given vector would be vscale long, and it is good to be able to declare such vectors at runtime: loops in assembler may be generated which set VL (a Control Status Register declaring the number of elements to be processed in any given loop iteration)

However, for e.g. memcpy or strcpy or anything else which is not fixed-length, where not even the program knows how long the vector will be, there is data-dependent fail-on-first.

A related thread goes through this; pay attention to Stephen's questions and it becomes clear:
https://groups.google.com/forum/?nomobile=true#!topic/comp.arch/3z3PlCwdq8U

A link to ARM SVE's ffirst capability is also provided in that thread. Yes, SVE has ffirst, although it is a SIMD variant rather than one that affects VL.

RISC-V RVV explicitly allows changing VL (which I am assuming is the
same as vscale) at runtime, so VL wouldn’t be a constant.

This would be good to clarify, Sander. On first reading it seems to me that vscale is intended to be the actual full vector size, not related to VL.

Regardless, setting it even as a runtime constant seems to be a red flag.

What is vscale intended for, and how does it relate to Cray-like Vector Length?

Additionally, we (libre-riscv) are working on a similar scalar vectors
ISA called SimpleV that also allows changing VL at runtime and we are
planning on basing it on LLVM’s scalable vector support.

Both SV and RVV are based on Cray VL, which is a runtime global CSR setting the number of elements to be processed in any given vector loop.

The difference is that RVV requests a VL and is arbitrarily allocated an actual VL (less than or equal to the requested VL), whereas in SV you get exactly what is requested, and if over-allocated an illegal instruction is raised.

Hello Jacob and Luke,

First off, even if a dynamically changing vscale was truly necessary for RVV or SV, this thread would be far too late to raise the question. That vscale is constant – that the number of elements in a scalable vector does not change during program execution – is baked into the accepted scalable vector type proposal from top to bottom and in fact was one of the conditions for its acceptance (runtime-variable type sizes create many more headaches which nobody has worked out how to solve to a satisfactory degree in the context of LLVM). This thread is just about whether vscale should be exposed to programs in the form of a Constant or as an intrinsic which always returns the same value during one program execution.

Luckily, this is not a problem for RVV. I do not know anything about this “SV” extension you are working on so I cannot comment on that, but I’ll sketch the reasons for why it’s not an issue with RVV and maybe that helps you with SV too. As mentioned above, this is tangential to the focus of this thread, so if you want to discuss further I’d prefer you do that in a new thread.

The dynamically-changing VL is a kind of predication in that it limits processing to a subset of lanes, and like masks it can just be another SSA value that is an input to the computations it affects. You may be aware of Simon Moll’s vector predication (previously: explicit vector length) proposal which does just that. In contrast, the vscale concept is more about how many elements a vector register contains, regardless of whether some operations process only a subset of them. In RVV terms that means it’s related not to VL but more to VBITS, which is indeed a constant (and has been for many months).

The only dynamic thing about “how many elements are there in a vector register” is that varying the width of the elements (8b, 16b, etc.) and the length multiplier (grouping together 1/2/4/8 registers) causes a predictable, relative increase or decrease (x2, x8, x0.5, etc.) of the number of elements, regardless of the specific value of VBITS. But this is perfectly compatible with a constant vscale because vscale only is the unknown-at-compile-time factor in the size of a scalable vector type. Varying the other components, the compile-time-constant factor and the element type, results in scalable vectors with different relative sizes in exactly the same way we need to handle RVV’s element width and LMUL concepts. For example <vscale x 4 x i16> has four times as many elements and twice as many bits as <vscale x 1 x i32>, so it captures the distinction between a SEW=16,LMUL=2 vtype setting and a SEW=32,LMUL=1 vtype setting.
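
Spelling that relationship out with a sketch (illustrative only; VBITS is whatever the hardware provides):

  <vscale x 1 x i32>  ; SEW=32, LMUL=1 : vscale * 32 bits
  <vscale x 2 x i32>  ; SEW=32, LMUL=2 : vscale * 64 bits
  <vscale x 4 x i16>  ; SEW=16, LMUL=2 : vscale * 64 bits, 4x the elements of <vscale x 1 x i32>

The unknown factor vscale cancels out of every relative-size comparison, which is why a constant vscale suffices.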

Regards,
Robin

Ah, ok. So vscale is basically calculated based on the type and vlmax rather than being VL.

SV works mostly like that except it supports more than one vlmax, since vlmax is derived from the number of contiguous int/fp registers that the register allocator assigns to that particular vector (which can be part of the vector type rather than leaving it entirely up to the register allocator).

So, SV may not be able to use scalable vectors directly but may work better with fixed-length vectors where all the vector ops have a VL parameter. Perhaps it could use scalable vectors, then translate to fixed-length vectors + VL.

Jacob Lifshay

Hello Jacob and Luke,

First off, even if a dynamically changing vscale was truly necessary
for RVV or SV, this thread would be far too late to raise the question.
That vscale is constant -- that the number of elements in a scalable
vector does not change during program execution -- is baked into the
accepted scalable vector type proposal from top to bottom and in fact
was one of the conditions for its acceptance...

that should be explicitly made clear in the patches. it sounds very
much like it's only suitable for statically-allocated
arrays-of-vectorisable-types:

typedef float vec4[4]; // SEW=32,LMUL=4 probably
static vec4 globalvec[1024]; // vscale == 1024 here

or, would it be intended for use inside functions - again statically-allocated?

int somefn(void) {
  static vec4 localvec[1024]; // vscale == 1024 here
}

*or*, would it be intended to be used like this?
int somefn(int num_of_vec4s) {
  vec4 localvec[num_of_vec4s]; // VLA, so cannot be static: vscale == dynamic, here
}

clarifying this in the documentation strings on vscale, perhaps even
providing c-style examples, would be extremely useful, and avoid
misunderstandings.

... (runtime-variable type
sizes create many more headaches which nobody has worked out
how to solve to a satisfactory degree in the context of LLVM).

hmmmm. so it looks like data-dependent fail-on-first is something
that's going to come up later, rather than right now.

*This* thread is just about whether vscale should be exposed to programs
in the form of a Constant or as an intrinsic which always returns the same
value during one program execution.

Luckily, this is not a problem for RVV. I do not know anything about this
"SV" extension you are working on

SV has been designed specifically to help with the creation of
*Hybrid* CPU / VPU / GPUs. it's very similar to RVV except that there
are no new instructions added.

a typical GPU would be happy to have 128-bit-wide SIMD or VLIW-style
instructions, on the basis that (A) the shader programs are usually no
greater than 1K in size and (B) those 128-bit-wide instructions have
an extremely high bang-per-buck ratio, of 32x FP32 operations issued
at once.

in a *hybrid* CPU - VPU - GPU context even a 1k shader program hits a
significant portion of the 1st level cache which is *not* separate
from a *GPU*'s 1st level cache because the CPU *is* the GPU.

consequently, SV has been specifically designed to "compactify"
instruction effectiveness by "prefixing" even RVC 16-bit opcodes with
vectorisation "tags".

this has the side-effect of reducing executable size by over 10% in
many cases when compared to RVV.

so I cannot comment on that, but I'll sketch the reasons for why it's not
an issue with RVV and maybe that helps you with SV too.

looks like it does: Jacob explains (in another reply) that MVL is
exactly the same concept, except that in RVV it is hard-coded (baked)
into the hardware, where in SV it is explicitly set as a CSR, and i
explained in the previous reply that in RVV the VL CSR is requested
(and the hardware chooses a value), whereas in SV, the VL CSR *must*
be set to exactly what is requested [within the bounds of MVL, sorry,
left that out earlier].

As mentioned above, this is tangential to the focus of this thread, so if
you want to discuss further I'd prefer you do that in a new thread.

it's not yet clear whether vscale is intended for use in
static-allocation involving fixed constants or whether it's intended
for use with runtime-dependent variables inside functions.

with that not being clear, my questions are not tangential to the
focus of the thread.

however yes i would agree that data-dependent fail-on-first is
definitely not the focus of this thread, and would need to be
discussed later.

we are a very small team at the moment, we may end up missing valuable
discussions: how can it be ensured that we are included in future
discussions?

[...]
You may be aware of Simon Moll's vector predication (previously:
explicit vector length) proposal which does just that.

ah yehyehyeh. i remember.

In contrast, the vscale concept is more about how many elements a
vector register contains, regardless of whether some operations process
only a subset of them.

ok so this *might* be answering my question about vscale being
relate-able to a function parameter (the latter of the c examples), it
would be good to clarify.

In RVV terms that means it's related not to VL but more to VBITS,
which is indeed a constant (and has been for many months).

ok so VL is definitely "assembly-level" rather than something that
actually is exposed to the intrinsics. that may turn out to be a
mistake when it comes to data-dependent fail-on-first capability
(which is present in a *DIFFERENT* form in ARM SVE, btw), but would,
yes, need discussion separately.

For example <vscale x 4 x i16> has four times as many elements and
twice as many bits as <vscale x 1 x i32>, so it captures the distinction
between a SEW=16,LMUL=2 vtype setting and a SEW=32,LMUL=1
vtype setting.

hang on - so this may seem like a silly question: is it intended that
the *word* vscale would actually appear in LLVM-IR i.e. it is a new
compiler "keyword"? or did you use it here in the context of just "an
example", where actually the idea is that the actual value would be
<5 x 4 x i16> or <5 x 1 x i32>?

let me re-read the summary:

"This patch adds vscale as a symbolic constant to the IR, similar to
undef and zeroinitializer, so that it can be used in constant
expressions."

it's a keyword, isn't it?

so, that "vscale" keyword would be substituted at runtime by either a
constant (1024) *or* a runtime-calculated variable or function
parameter (num_of_vec4s), is that correct?

apologies for asking: these are precisely the kinds of
from-zero-prior-knowledge questions that help with any review process
to clarify things for other users/devs.

l.

Hi Luke,

First off, even if a dynamically changing vscale was truly necessary
for RVV or SV, this thread would be far too late to raise the question.
That vscale is constant -- that the number of elements in a scalable
vector does not change during program execution -- is baked into the
accepted scalable vector type proposal from top to bottom and in fact
was one of the conditions for its acceptance...

that should be explicitly made clear in the patches. it sounds very
much like it's only suitable for statically-allocated
arrays-of-vectorisable-types:

typedef float vec4[4]; // SEW=32,LMUL=4 probably
static vec4 globalvec[1024]; // vscale == 1024 here

'vscale' just refers to the scaling factor that gives the maximum size of
the vector at runtime, not the number of currently active elements.

SVE will be using predication alone to deal with data that doesn't fill an
entire vector, whereas RVV and SX-Aurora want to use a separate mechanism
that fits with their hardware having a changeable active length.

The scalable type tells you the maximum number of elements that could be
operated on, and individual operations can constrain that to a smaller
number of elements. The latter is what Simon Moll's proposal addresses.

... (runtime-variable type
sizes create many more headaches which nobody has worked out
how to solve to a satisfactory degree in the context of LLVM).

hmmmm. so it looks like data-dependent fail-on-first is something
that's going to come up later, rather than right now.

Arm's downstream compiler has been able to use the scalable type and a
constant vscale with first-faulting loads for around 4 years, so there's
no conflict here.

We will need to figure out exactly what form the first faulting intrinsics
take of course, as I think SVE's predication-only approach doesn't quite
fit with others -- maybe we'll end up with two intrinsics? Or maybe we'll
be able to synthesize a predicate from an active vlen and pattern match?
Something to discuss later I guess. (I'm not even sure AVX512 has a
first-faulting form, possibly just no-faulting and check the first predicate
element?)

As mentioned above, this is tangential to the focus of this thread, so if
you want to discuss further I'd prefer you do that in a new thread.

it's not yet clear whether vscale is intended for use in
static-allocation involving fixed constants or whether it's intended
for use with runtime-dependent variables inside functions.

Runtime-dependent, though you could use C-level types and intrinsics to
try a static approach.

ok so this *might* be answering my question about vscale being
relate-able to a function parameter (the latter of the c examples), it
would be good to clarify.

In RVV terms that means it's related not to VL but more to VBITS,
which is indeed a constant (and has been for many months).

ok so VL is definitely "assembly-level" rather than something that
actually is exposed to the intrinsics. that may turn out to be a
mistake when it comes to data-dependent fail-on-first capability
(which is present in a *DIFFERENT* form in ARM SVE, btw), but would,
yes, need discussion separately.

For example <vscale x 4 x i16> has four times as many elements and
twice as many bits as <vscale x 1 x i32>, so it captures the distinction
between a SEW=16,LMUL=2 vtype setting and a SEW=32,LMUL=1
vtype setting.

hang on - so this may seem like a silly question: is it intended that
the *word* vscale would actually appear in LLVM-IR i.e. it is a new
compiler "keyword"? or did you use it here in the context of just "an
example", where actually the idea is that actual value would be <5 x 4
x i16> or <5 x 1 x i32>?

If you're referring to the '<vscale x 4 x i32>' syntax, that's already part
of LLVM IR now (though effectively still in 'beta'). You can see a few
examples in .ll tests now, e.g. llvm/test/Bitcode/compatibility.ll

It's also documented in the LangRef.

Sander's patch takes the existing 'vscale' keyword and allows it to be
used outside the type, to serve as an integer constant that represents the
same runtime value as it does in the type.
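
A small sketch of the difference (illustrative only, using the syntax from the patch):

  ; 'vscale' inside the type -- already part of LLVM IR:
  %v = load <vscale x 4 x i32>, <vscale x 4 x i32>* %p
  ; 'vscale' as an integer constant -- what D68202 adds; here computing
  ; the number of i32 elements in one such register:
  %nelts = mul i64 vscale, 4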

Some previous discussions proposed using an intrinsic to start with for this,
and that may still happen depending on community reaction, but the Arm
hpc compiler team felt it was important to at least start a wider discussion
on this topic before proceeding. From our experience, using an intrinsic makes
it harder to work with shufflevector or get good code generation. If someone
can spot a problem with our reasoning on that please let us know.

-Graham

Hi Luke,

hi graham, thanks for responding in such an informative fashion.

> typedef float vec4[4]; // SEW=32,LMUL=4 probably
> static vec4 globalvec[1024]; // vscale == 1024 here

'vscale' just refers to the scaling factor that gives the maximum size of
the vector at runtime, not the number of currently active elements.

ok, this starts to narrow down the definition. i'm attempting to get
clarity on what it means. so, in the example above involving
globalvec, "maximum size of the vector at runtime" would be "1024"
(not involving RVV VL).

and... would vscale be dynamically (but permanently) substituted
with the constant "1024", there?

and in that example i gave which was a local function, vscale would be
substituted with "local_vlen_param_len" permanently and irrevocably at
runtime?

or, is it intended to be dynamically (but permanently) substituted
with something related to RVV's *MVL* at runtime?

if it's intended to be substituted by MVL, *that* starts to make more
sense, because MVL may actually vary depending on the hardware on
which the program is being executed. smaller systems may have an MVL
of only 1 (only allowing one element of a vector to be executed at any
one time) whereas Mainframe or massively-parallel systems may have...
MVL in the hundreds.

SVE will be using predication alone to deal with data that doesn't fill an
entire vector, whereas RVV and SX-Aurora

[and SV! :) ]

want to use a separate mechanism
that fits with their hardware having a changeable active length.

okaaay, now, yes, i Get It. this is MVL (Max Vector Length) in RVV.
btw minor nitpick: it's not that "their" hardware changes, it's that
the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
multiple vendors each choosing an arbitrary MVL suited to their
customer's needs). "RVV-compliant hardware" would fit things better.

hmmm that's going to be interesting for SV, because SV specifically
permits variable MVL *at runtime*. however, just checking the spec
(don't laugh, yes i know i wrote it...) MVL is set through an
immediate. there's a way to bypass that and set it dynamically, but
it's intended for context-switching, *not* for general-purpose use.

ah wait.... argh. ok, is vscale expected to be a global constant *for
the entire application*? note above: SV allows MVL to be set
*arbitrarily*, and this is extremely important.

the reason it is important is because unlike RVV, SV uses the actual
*scalar* register files. it does *NOT* have a separate "Vector
Register File".

so if vscale was set to say 8 on a per-runtime basis, that then sets
the total number of registers *in the scalar register file* which will
be utilised for vectorisation.

it becomes impossible to set vscale to 4, which another function might
have been specifically designed to use.

so what would then need to be done is: predicate out the top 4
elements, which now comes with a performance-penalty and a whole
boat-load of mess.

so, apologies: we reaaaally need vscale to be selectable on at the
very least a per-function basis.

otherwise, applications would have to set it (at runtime) to the
"least inconvenient" value, wasting "the least-inconvenient number of
registers".

The scalable type tells you the maximum number of elements that could be
operated on,

... which is related (in RVV) to MVL...

and individual operations can constrain that to a smaller
number of elements.

... by setting VL.

> hmmmm. so it looks like data-dependent fail-on-first is something
> that's going to come up later, rather than right now.

Arm's downstream compiler has been able to use the scalable type and a
constant vscale with first-faulting loads for around 4 years, so there's
no conflict here.

ARM's SVE uses predication. the LDs that would [normally] cause
page-faults create a mask, instead, giving *only* those LDs which
"succeeded".

that's then passed into standard [SIMD-based] predicated operations,
masking out operations, the most important one (for the canonical
strcpy / memcpy) being the ST.

We will need to figure out exactly what form the first faulting intrinsics
take of course, as I think SVE's predication-only approach doesn't quite
fit with others -- maybe we'll end up with two intrinsics?

perhaps - as robin suggests, this for another discussion (not related
to vscale).

or... maybe not.

if vscale was permitted to be dynamically set, not only would it suit
SV's ability to set different vscales on a per-function (or other)
basis, it could be utilised by RVV, SV, and anything else that changes
VL based on data-dependent conditions, to change the following
instructions.

what i'm saying is: vscale needs to be permitted to be a variable, not
a constant.

now, ARM SVE wouldn't *use* that capability: it would hard-code it to
512/SEW/etc.etc. (or whatever), by setting it to a global constant.
follow-up LLVM-IR-morphing passes would end up generating
globally-fixed-width SVE instructions.

RVV would be able to set that vscale variable as a way to indicate
data-dependent lengths [various IR-morphing-passes would carry out the
required substitutions prior to creating actual assembler]

SV would be able to likewise do that *and* start from a value for
vscale that suited each function's requirements to utilise a subset of
the register file which suited the workload.

SV could then trade off "register spill" with "vector size", which i
can tell you right now will be absolutely critical for 3D GPU
workloads. we can *NOT* allow register spill using LD/STs for a GPU
workload covering gigabytes of data, the power consumption penalty
would just be mental [commercially totally unacceptable]. it would be
far better to allow a function which required that many registers to
dynamically set vscale=2 or possibly even vscale=1

(we have 128 *scalar* registers, where, reminder: MVL is used to say
how many of the *SCALAR* register file get utilised to "make up" a
vector).

oh. ah. bruce (et al), isn't there an option in RVV to allow Vectors
to sit on top of the *scalar* register file(s)? (Zfinx)
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-registers

Or maybe we'll
be able to synthesize a predicate from an active vlen and pattern match?

the purpose of having a dynamic VL, which comes originally from the
Cray Supercomputer Vector Architecture, is to not have to use the
kinds of instructions that perform bit-manipulation
(mask-manipulation) which are not only wasted CPU cycles, but end up
in many [simpler] hardware implementations with masked-out "Lanes"
running empty, particularly ones that have Vector Front-ends but
predicated-SIMD-style ALU backends.

i would be quite concerned, therefore, if by "synthesise a predicate"
the idea was, instead of using actual dynamic truncation of vlen
(changing vscale), instructions were used to create a predicate which
had its last bits set to zero.

basically using RVV/SV fail-on-first to emulate the way that ARM SVE
fail-on-first creates masks.

that would be... yuk :)

Sander's patch takes the existing 'vscale' keyword and allows it to be
used outside the type, to serve as an integer constant that represents the
same runtime value as it does in the type.

if i am understanding things correctly, it reaaally needs to be
allowed to be a variable, definitely not a constant.

Some previous discussions proposed using an intrinsic to start with for this,
and that may still happen depending on community reaction, but the Arm
hpc compiler team felt it was important to at least start a wider discussion
on this topic before proceeding. From our experience, using an intrinsic makes
it harder to work with shufflevector or get good code generation. If someone
can spot a problem with our reasoning on that please let us know.

honestly can't say, can i leave it to you to decide if it's related to
this vscale thread, and, if so, could you elaborate further? if it's
not, feel free to leave it for another time? will see if there is any
follow-up discussion here.

thanks graham.

l.

Thanks @Robin and @Graham for giving some background on scalable vectors and clarifying some of the details!

Apologies if I'm repeating things here, but it is probably good to emphasize the conceptually different, but complementary models for scalable vectors:
1. Vectors of unknown, but constant size throughout the program.
2. Vectors of changing size throughout the program.

Where (2) basically builds on (1).

LLVM's scalable vectors support (1) directly. The scalable type is defined using the concept `vscale`, which is constant throughout the program and expresses the unknown, but maximum, size of a scalable vector. My patch builds on that definition by adding `vscale` as a keyword that can be used in expressions. For this model, predication can be used to disable the lanes that are not needed. Given that `vscale` is defined as inherently constant and a cornerstone of the scalable type, it makes no sense to describe the `vscale` keyword as an intrinsic.

The other model for scalable vectors (2) requires additional intrinsics to get/set the `active VL` at runtime. This model would be complementary to `vscale`, as it still requires the same scalable vector type to describe a vector of unknown size. `vscale` can be used to express the maximum vector length, but the `active vector length` would need to be handled through explicit intrinsics. As Robin explained, it would also need Simon Moll's vector predication proposal to express operations on `active VL` elements.
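
As a sketch of how the two models compose (the intrinsic follows the shape of Simon Moll's proposal; treat the exact name and signature as illustrative, not final):

  ; scalable type (model 1) + explicit active VL (model 2):
  %sum = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(
             <vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
             <vscale x 4 x i1> %mask, i32 %active_vl)

Only the first %active_vl lanes (further filtered by %mask) are computed; `vscale` still bounds the overall size of the type.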

apologies for asking: these are precisely the kinds of
from-zero-prior-knowledge questions that help with any review process
to clarify things for other users/devs.

No apologies required, the discussions on scalable types have been going on for quite a while, so there are many email threads to read through. It is important these concepts are clear and well understood!

clarifying this in the documentation strings on vscale, perhaps even
providing c-style examples, would be extremely useful, and avoid
misunderstandings.

I wonder if we should add a separate document about scalable vectors that describes these concepts in more detail with some examples.

Given that (2) is a very different use-case, I hope we can keep discussions on that model separate from this thread, if possible.

Thanks,

Sander

Hi Luke,

want to use a separate mechanism
that fits with their hardware having a changeable active length.

okaaay, now, yes, i Get It. this is MVL (Max Vector Length) in RVV.
btw minor nitpick: it's not that "their" hardware changes, it's that
the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
multiple vendors each choosing an arbitrary MVL suited to their
customer's needs). "RVV-compliant hardware" would fit things better.

Yes, the hardware doesn't change, dynamic/active VL just stops processing
elements past the number of active elements.

SVE similarly allows vendors to choose a maximum hardware vector length
but skips an active VL in favour of predication only.

I'll try and clear things up with a concrete example for SVE.

Allowable SVE hardware vector lengths are all multiples of 128 bits. So
our main legal types for codegen will have a minimum size of 128 bits,
e.g. <vscale x 4 x i32>.

If a low-end device implements SVE at 128 bits, then at runtime vscale is
1 and you get exactly <4 x i32>.

For mid-level devices I'd guess 256 bits is reasonable, so vscale would be
2 and <vscale x 4 x i32> would be equivalent to <8 x i32>, but we still
only guarantee that the first 4 lanes exist.

For Fujitsu's A64FX at 512 bits, vscale is 4 and the legal type would now be
equivalent to <16 x i32>.
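
Summarising the mapping (same IR type, different hardware):

  ; <vscale x 4 x i32> on 128-bit SVE: vscale = 1, behaves as <4 x i32>
  ; <vscale x 4 x i32> on 256-bit SVE: vscale = 2, behaves as <8 x i32>
  ; <vscale x 4 x i32> on 512-bit SVE: vscale = 4, behaves as <16 x i32> (A64FX)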

In all cases, vscale is constant at runtime for those machines. While it
is possible to change the maximum vector length from privileged code (so
you could set the A64FX to run with 256b or 128b vectors if you chose...
even 384b if you wanted to), we don't allow for changes at runtime since
that may corrupt data. Expecting the compiler to be able to recover from
a change in vector length when you have spilled registers to the stack
isn't reasonable.

Robin found a way to make this work for RVV; there, he had the additional
concern of registers being joined together in x2,x4,(x8?) combinations.
This was resolved by just making the legal types bigger when that feature
is in use iirc.

Would that approach help SV, or is it just a backend thing deciding how
many scalar registers it can spare?

The scalable type tells you the maximum number of elements that could be
operated on,

... which is related (in RVV) to MVL...

and individual operations can constrain that to a smaller
number of elements.

... by setting VL.

Yes, at least for architectures that support changing VL. Simon's
proposal was to provide intrinsics for common IR operations which
took an additional parameter corresponding to VL; vscale doesn't
represent VL, so doesn't need to change.

hmmmm. so it looks like data-dependent fail-on-first is something
that's going to come up later, rather than right now.

Arm's downstream compiler has been able to use the scalable type and a
constant vscale with first-faulting loads for around 4 years, so there's
no conflict here.

ARM's SVE uses predication. the LDs that would [normally] cause
page-faults create a mask, instead, giving *only* those LDs which
"succeeded".

Those that succeeded until the first that didn't -- every bit in the
mask after a fault is unset, even if it would have succeeded with a
first-faulting gather operation.

that's then passed into standard [SIMD-based] predicated operations,
masking out operations, the most important one (for the canonical
strcpy / memcpy) being the ST.

Nod; I wrote an experimental early exit loop vectorizer which made use of that.

We will need to figure out exactly what form the first faulting intrinsics
take of course, as I think SVE's predication-only approach doesn't quite
fit with others -- maybe we'll end up with two intrinsics?

perhaps - as robin suggests, this for another discussion (not related
to vscale).

Or maybe we'll
be able to synthesize a predicate from an active vlen and pattern match?

the purpose of having a dynamic VL, which comes originally from the
Cray Supercomputer Vector Architecture, is to not have to use the
kinds of instructions that perform bit-manipulation
(mask-manipulation) which are not only wasted CPU cycles, but end up
in many [simpler] hardware implementations with masked-out "Lanes"
running empty, particularly ones that have Vector Front-ends but
predicated-SIMD-style ALU backends.

Yeah, I get that, which is why I support Simon's proposals.

i would be quite concerned, therefore, if by "synthesise a predicate"
the idea was, instead of using actual dynamic truncation of vlen
(changing vscale), instructions were used to create a predicate which
had its last bits set to zero.

basically using RVV/SV fail-on-first to emulate the way that ARM SVE
fail-on-first creates masks.

that would be... yuk :)

Ah, I could have made it a bit clearer. I meant have a first-faulting
load intrinsic which returns a vector and an integer representing the
number of valid lanes. For architectures using a dynamic VL, you could
then pass that integer to subsequent operations so they are tied to
that number of active elements.

For SVE/AVX512, we'd have to splat that integer and compare against
a stepvector to generate a mask. Ugly, but it can be pattern matched
into the direct first/no-faulting loads and masks for codegen.
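
In IR terms, something like this (a sketch; the stepvector intrinsic name here is hypothetical/experimental, and %vl is the lane count returned by the first-faulting load):

  %step  = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
  %ins   = insertelement <vscale x 4 x i32> undef, i32 %vl, i32 0
  %splat = shufflevector <vscale x 4 x i32> %ins, <vscale x 4 x i32> undef,
                         <vscale x 4 x i32> zeroinitializer
  %mask  = icmp ult <vscale x 4 x i32> %step, %splat  ; lanes 0..%vl-1 set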

Or we just use separate intrinsics.

To discuss later, I think; possibly on the early-exit loopvec thread.

-Graham

(readers note this, copied from the end before writing!
"Given that (2) is a very different use-case, I hope we can keep discussions on
that model separate from this thread, if possible.")

Thanks @Robin and @Graham for giving some background on scalable vectors and clarifying some of the details!

hi sander, thanks for chipping in. um, just a point of order: was it
intentional to leave out both jacob and myself? my understanding is
that inclusive and welcoming language is supposed to be used within this
community, and it *might* be mistaken as being exclusionary and
unwelcoming.

if that was a misunderstanding or an oversight i apologise for raising it.

Apologies if I'm repeating things here, but it is probably good to emphasize
the conceptually different, but complementary models for scalable vectors:
1. Vectors of unknown, but constant size throughout the program.

... which matches with both hardware-fixed per-implementation
variations in potential [max] SIMD-width for any given architecture as
well as Vector-based "Maximum Vector Length", typically representing
the "Lanes" of a [traditional] Vector Architecture.

2. Vectors of changing size throughout the program.

...representing VL in "Cray-style" Vector Engines (NEC SX-Aurora, RVV,
SV) and representing the (rather unfortunate) corner-case cleanup -
and predication - deployed in SIMD
(https://www.sigarch.org/simd-instructions-considered-harmful/)

Where (2) basically builds on (1).

LLVM's scalable vectors support (1) directly. The scalable type is defined
using the concept `vscale` that is constant throughout the program and
expresses the unknown, but maximum size of a scalable vector.
My patch builds on that definition by adding `vscale` as a keyword that
can be used in expressions.

ah HA! excccellent. *that* was the sentence giving the key piece of
information needed to understand what is going on, here. i appreciate
it does actually say that, "This patch adds vscale as a symbolic
constant to the IR, similar to
undef and zeroinitializer, so that it can be used in constant
expressions" however without the context about what vscale is based
*on*, it's just not possible to understand.

can i therefore recommend a change, here:

"Scalable vector types are defined as <vscale x #elts x #eltty>,
where vscale itself is defined as a positive symbolic constant
of type integer, representing a platform-dependent (fixed but
implementor-specific) limit of any given hardware's maximum
simultaneous "element processing" capacity"

you could add, in brackets, "(typically the SIMD element width)" at
the end there. then, this starts to make sense, but could be further
made explicit:

"This patch adds vscale as a symbolic constant to the IR, similar to
undef and zeroinitializer, so that vscale - representing the
runtime-detected "element processing" capacity - can be used in
constant expressions"

For this model, predication can be used to disable the lanes
that are not needed. Given that `vscale` is defined as inherently
constant and a corner-stone of the scalable type, it makes no
sense to describe the `vscale` keyword as an intrinsic.

indeed: if it's intended near-exclusively for SIMD-style hardware,
then yes, absolutely.

my only concern would be: some circumstances (some algorithms) may
perform better with MMX, some with SSE, some with different levels of
performance on e.g. AMD or Intel, which would, with benchmarking, show
that some algorithms perform better if vscale=8 (resulting in some
other MMX/SSE subset being utilised) than if vscale=16.

in particular, on hardware which doesn't *have* predication, they're
definitely in trouble if vscale is fixed (SIMD considered harmful).
it may even be the case, for whatever reason, that performance sucks
for AVX512 instructions with a low predicate bitcount, if compared to
using smaller-range SIMD operations, perhaps due to the vastly-greater
size of the AVX instructions themselves.

honestly i don't know: i'm just throwing ideas out, here.

would it be reasonable to assume that predication *always* is to be
used in combination with vscale? or is it the intention to
[eventually] be able to auto-generate the kinds of [painful in
retrospect] SIMD assembly shown in the above article?

The other model for scalable vectors (2) requires additional intrinsics
to get/set the `active VL` at runtime.

ok. with you here.

This model would be complementary to `vscale`, as it still requires the
same scalable vector type to describe a vector of unknown size.

ah. that's where the assumption breaks down, because of SV allowing
its vectors to "sit" on top of the *actual* scalar regfile(s), we do
in fact permit an [immediate-specified] vscale to be set, arbitrarily,
at any time.

now, we mmmiiiight be able to get away with assuming that vscale is
equal to the absolute maximum possible setting (64 for RV64, 32 for
RV32), then use / play-with the "runtime active VL get/set"
intrinsics.

i'm kiinda wary of saying "absolutely yes that's the way forward" for
us, particularly without some input from Jacob here.

`vscale` can be used to express the maximum vector length,

wait... hang on: with RVV i am pretty certain there is not supposed to be
any kind of assumption of knowledge about MVL. in SV that's fine, but
in RVV i don't believe it is.

bruce, andrew, robin, can you comment here?

but the `active vector length` would need to be handled through
explicit intrinsics. As Robin explained, it would also need Simon Moll's
vector predication proposal to express operations on `active VL` elements.

ok, a link to that would be handy... let me see if i can find it...
what comes up is this: https://reviews.llvm.org/D57504 is that right?

> apologies for asking: these are precisely the kinds of
> from-zero-prior-knowledge questions that help with any review process
> to clarify things for other users/devs.
No apologies required, the discussions on scalable types have been going on for quite a while, so there are many email threads to read through. It is important these concepts are clear and well understood!

:)

> clarifying this in the documentation strings on vscale, perhaps even
> providing c-style examples, would be extremely useful, and avoid
> misunderstandings.
I wonder if we should add a separate document about scalable vectors
that describes these concepts in more detail with some examples.

it's exceptionally complex, with so many variants, i feel this is
almost essential.

Given that (2) is a very different use-case, I hope we can keep discussions on
that model separate from this thread, if possible.

good idea, if there's a new thread started please do cc me.
cross-relationship between (2) and vscale may make it slightly
unavoidable though to involve this one.

l.

hi graham,

> the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
> multiple vendors each choosing an arbitrary MVL suited to their
> customer's needs). "RVV-compliant hardware" would fit things better.

Yes, the hardware doesn't change, dynamic/active VL just stops processing
elements past the number of active elements.

SVE similarly allows vendors to choose a maximum hardware vector length
but skips an active VL in favour of predication only.

yes.

I'll try and clear things up with a concrete example for SVE.

Allowable SVE hardware vector lengths are all multiples of 128 bits. So
our main legal types for codegen will have a minimum size of 128 bits,
e.g. <vscale x 4 x i32>.

For Fujitsu's A64FX at 512 bits, vscale is 4 and legal type would now be
equivalent to <16 x i32>.

okaaaay, so, right, it is kinda similar to MVL for RVV, except
dynamically settable in powers of 2. okaay. makes sense: just as
with Cray-style Vectors, high-end machines can go extremely wide.

In all cases, vscale is constant at runtime for those machines. While it
is possible to change the maximum vector length from privileged code (so
you could set the A64FX to run with 256b or 128b vectors if you chose...
even 384b if you wanted to), we don't allow for changes at runtime since
that may corrupt data. Expecting the compiler to be able to recover from
a change in vector length when you have spilled registers to the stack
isn't reasonable.

deep breath: i worked for Aspex Semiconductors, they have (had) what
was called an "Associative String Processor". 2-bit ALUs could have a gate
opened up which constructed 4-bit ALUs, open up another gate now you
have 8-bit ALUs, open another you have 16-bit, 32-bit, 64-bit and so
on.

thus, as a massively-deep SIMD architecture you could, at runtime,
turn computations round from either using 32 cycles with a batch of
32x 2-bit ALUs to perform 32x separate and distinct parallel 64-bit
operations

OR

open up all the gates, and use ONE cycle to compute a single 64-bit operation.

with LOAD/STORE taking fixed time but algorithms (obviously) taking
variable lengths of time, our job, as FAEs, was to write f*****g
SPREADSHEETS (yes, really) giving estimates of which was the best
possible balance to keep LD/STs an equal time-consumer as the frickin
algorithm.

as you can probably imagine, this being in assembler, and literally a
dozen algorithms having to be written where one would normally do,
code productivity was measured in DAAAAYYYS per line of code.

we do have a genuine need to do something similar, here (except
automated or at an absolute minimum, under the guidance of #pragma).

the reason is because this is for a [hybrid] 3D GPU, to run
texturisation and other workloads. these are pretty much unlike a CPU
workload: data comes in, gets processed, data goes out. there's *one*
LD, one algorithm, one ST, in a massive loop covering tens to hundreds
(to gigabytes, in large GPUs) of megabytes per second.

if there's *any* register spill at all, the L1/L2 performance and
power penalty is so harsh that it's absolutely unthinkable to let it
happen. this was outlined in Jeff Bush's nyuzipass2016 paper.

the solution is therefore to have fine-grained dynamic control over
vscale, on a per-loop basis. letting the registers spill *cannot* be
permitted, so is not actually a problem per se.

with a fine-grained dynamic control over vscale, we can perform a
(much better, automated) equivalent of the awfulness-we-did-at-Aspex,
analysing the best vscale to use for that loop, that will cover as
many registers as possible, *without* spill. even if vscale gets set
to 1, that's far, _far_ better than allowing LD/ST register-spilling.

and with most 3D workloads being very specifically designed to fit
into 128 FP32 registers (even for MALI400 and Vivante GC800), and our
design having 128 FP64 registers that can be MMX-style subdivided into
2x FP32, 4x FP16, we should be fine.

Robin found a way to make this work for RVV; there, he had the additional
concern of registers being joined together in x2,x4,(x8?) combinations.
This was resolved by just making the legal types bigger when that feature
is in use iirc.

unfortunately - although i do not know the full details (Jacob knows
this better than I) there are some 3D workloads involving 4x3 or 3x4
matrices, and Texture datasets with arrays of X,Y,Z coordinates which
means that power-of-two boundaries will result in serious performance
penalties (25% reduction due to a Lane always running empty).

Would that approach help SV, or is it just a backend thing deciding how
many scalar registers it can spare?

it would be best to see what Jacob has to say: we're basically likely
to be reserving the top x32-x127 scalar registers for "use" as
vectors. however being able to dynamically alter the actual
allocation of registers on a per-loop basis [and never "spilling"] is
going to be critical to ensuring the commercial success and acceptance
of the entire processor.

in the absolute worst case we would be forced to set vscale = 1, which
then "punishes" performance by only utilising say x32-x47. this would
(hypothetically) result in a meagre 25% of peak performance (all 16
registers being effectively utilised as scalar-only).

if however vscale could be dynamically set to 4, that loop could
(hypothetically) deploy registers x32-x95, the parallelism would
properly kick in, and we'd get 4x the level of performance.

ok that was quite a lot, cutting much of what follows...

> ARM's SVE uses predication. the LDs that would [normally] cause
> page-faults create a mask, instead, giving *only* those LDs which
> "succeeded".

Those that succeeded until the first that didn't -- every bit in the
mask after a fault is unset, even if it would have succeeded with a
first-faulting gather operation.

yehyeh. i do like ffirst, a lot.

> that's then passed into standard [SIMD-based] predicated operations,
> masking out operations, the most important one (for the canonical
> strcpy / memcpy) being the ST.

Nod; I wrote an experimental early exit loop vectorizer which made use of that.

it's pretty awesome, isn't it? :) the one thing that nobody really
expected to be able to parallelise / auto-vectorise, and it's now
possible!

Ah, I could have made it a bit clearer. I meant have a first-faulting
load intrinsic which returns a vector and an integer representing the
number of valid lanes.

[ah, when it comes up, (early-exit loopvec thread?) i should mention
that in SV we've augmented fail-first to *true* data-dependent
semantics, based on whether the result of [literally any] operation is
zero or not. work-in-progress here (related to FP "what constitutes
fail" because NaN can be considered "fail").]

For architectures using a dynamic VL, you could
then pass that integer to subsequent operations so they are tied to
that number of active elements.

For SVE/AVX512, we'd have to splat that integer and compare against
a stepvector to generate a mask. Ugly, but it can be pattern matched
into the direct first/no-faulting loads and masks for codegen.

this sounds very similar to the RVV use of a special "vmfirst"
predicate-mask instruction which is used to detect the zero point in
the canonical strcpy example. it... works :)

Or we just use separate intrinsics.

To discuss later, I think; possibly on the early-exit loopvec thread.

ok, yes, agreed.

thanks graham.

l.

Hi Luke,

was it intentional to leave out both jacob and myself?
[...]
if that was a misunderstanding or an oversight i apologise for raising it.

It was definitely not my intention to be non-inclusive, my apologies if that seemed the case!

can i therefore recommend a change, here:
[...]
"This patch adds vscale as a symbolic constant to the IR, similar to
undef and zeroinitializer, so that vscale - representing the
runtime-detected "element processing" capacity - can be used in
constant expressions"

Thanks for the suggestion! I like the use of the word `capacity` especially now that the term 'vector length' has overloaded meanings.
I'll add some extra words to the vscale patch to clarify its meaning.

my only concern would be: some circumstances (some algorithms) may
perform better with MMX, some with SSE, some with different levels of
performance on e.g. AMD or Intel, which would, with benchmarking, show
that some algorithms perform better if vscale=8 (resulting in some
other MMX/SSE subset being utilised) than if vscale=16.

If fixed-width/short vectors are more beneficial for some algorithm, I'd recommend using fixed-width vectors directly. It would be up to the target to lower that to the vector instruction set. For AArch64, this can be done using Neon (max 128 bits) or with SVE/SVE2 using a 'fixed-width' predicate mask, e.g. vl4 for a predicate of 4 elements, even when the vector capacity is larger than 4.

would it be reasonable to assume that predication *always* is to be
used in combination with vscale? or is it the intention to
[eventually] be able to auto-generate the kinds of [painful in
retrospect] SIMD assembly shown in the above article?

When the size of a vector is constant throughout the program, but unknown at compile-time, then some form of masking would be required for loads and stores (or other instructions that may cause an exception). So it is reasonable to assume that predication is used for such vectors.
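
For example, with the existing masked-load intrinsic applied to a scalable type (a sketch; the exact name mangling is illustrative):

  %v = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32(
           <vscale x 4 x i32>* %p, i32 4,
           <vscale x 4 x i1> %mask, <vscale x 4 x i32> undef)

Lanes where %mask is clear are never read from memory, so they cannot fault; the passthru operand (undef here) supplies their result values.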

This model would be complementary to `vscale`, as it still requires the
same scalable vector type to describe a vector of unknown size.

ah. that's where the assumption breaks down, because of SV allowing
its vectors to "sit" on top of the *actual* scalar regfile(s), we do
in fact permit an [immediate-specified] vscale to be set, arbitrarily,
at any time.

Maybe I'm missing something here, but if SV uses an immediate to define vscale, that implies the value of vscale is known at compile-time and thus regular (fixed-width) vector types can be used?

now, we mmmiiiight be able to get away with assuming that vscale is
equal to the absolute maximum possible setting (64 for RV64, 32 for
RV32), then use / play-with the "runtime active VL get/set"
intrinsics.

i'm kiinda wary of saying "absolutely yes that's the way forward" for
us, particularly without some input from Jacob here.

Note that there isn't a requirement to use `vscale` as proposed in my first patch. If RV only cares about the runtime active-VL then some explicit, separate mechanism to get/set the active VL would be needed anyway. I imagine the resulting runtime value (instead of `vscale`) to then be used in loop indvar updates, address computations, etc.
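
For example, a strip-mined loop might look roughly like this (a sketch; @llvm.rvv.setvl is a hypothetical name for such a get/set-VL intrinsic, not an existing one):

  loop:
    %idx  = phi i64 [ 0, %entry ], [ %idx.next, %loop ]
    %rem  = sub i64 %n, %idx
    %vl   = call i64 @llvm.rvv.setvl(i64 %rem)   ; active VL for this iteration
    ; ... VL-governed loads/computes/stores on %vl elements ...
    %idx.next = add i64 %idx, %vl                ; indvar update uses %vl, not vscale
    %cont = icmp ult i64 %idx.next, %n
    br i1 %cont, label %loop, label %exit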

ok, a link to that would be handy... let me see if i can find it...
what comes up is this: https://reviews.llvm.org/D57504 is that right?

Yes, that's the one!

Thanks,

Sander

It was definitely not my intention to be non-inclusive, my apologies if that seemed the case!

No problem Sander.

can i therefore recommend a change, here:
[…]
“This patch adds vscale as a symbolic constant to the IR, similar to
undef and zeroinitializer, so that vscale - representing the
runtime-detected “element processing” capacity - can be used in
constant expressions”
Thanks for the suggestion! I like the use of the word capacity especially now that the term ‘vector length’ has overloaded meanings.
I’ll add some extra words to the vscale patch to clarify its meaning.

super. will keep an eye out for it.

my only concern would be: some circumstances (some algorithms) may
perform better with MMX, some with SSE, some with different levels of
performance on e.g. AMD or Intel, which would, with benchmarking, show
that some algorithms perform better if vscale=8 (resulting in some
other MMX/SSE subset being utilised) than if vscale=16.
If fixed-width/short vectors are more beneficial for some algorithm, I’d recommend using fixed-width vectors directly. It would be up to the target to lower that to the vector instruction set. For AArch64, this can be done using Neon (max 128bits) or with SVE/SVE2 using a ‘fixed-width’ predicate mask, e.g. vl4 for a predicate of 4 elements, even when the vector capacity is larger than 4.

I have a feeling that this was - is - the “workaround” that Graham was referring to.

would it be reasonable to assume that predication always is to be
used in combination with vscale? or is it the intention to
[eventually] be able to auto-generate the kinds of [painful in
retrospect] SIMD assembly shown in the above article?

When the size of a vector is constant throughout the program, but unknown at compile-time, then some form of masking would be required for loads and stores (or other instructions that may cause an exception). So it is reasonable to assume that predication is used for such vectors.

This model would be complementary to vscale, as it still requires the
same scalable vector type to describe a vector of unknown size.

ah. that’s where the assumption breaks down, because of SV allowing
its vectors to “sit” on top of the actual scalar regfile(s), we do
in fact permit an [immediate-specified] vscale to be set, arbitrarily,
at any time.
Maybe I’m missing something here, but if SV uses an immediate to define vscale, that implies the value of vscale is known at compile-time and thus regular (fixed-width) vector types can be used?

It’s not really intended to be exposed to frontends except by #pragma or inline assembly.

We can set an immediate; however, by doing so we hard-code the allocated maximum number of scalar regs to be utilised.

If that is too many then register spill might occur (with disastrous penalties for 3D) and if too small then performance is poor as ALUs sit idle.

In addition SV works on RV32 and RV64 where the regfiles are half the number of total bits and consequently we really will need dynamic scaling, there, in order to halve the size of vectors rather than risk register spill.

Plus, if people reeeeeaaally don't want 128 registers (there may be a genuine market need, particularly in 3D Embedded, where the cost of 128 regs is considered too great), they could use the "normal" 32 of RISC-V instead.

Here they would definitely want vscale=1 and to do everything as close to scalar operation as possible. If they have vec4 datatypes (using SUBVL) they might end up with regspill but that is a price they pay for the decision to reduce the regfile size.

(btw SUBVL is a multiplier of length 2, 3 or 4, representing vec2-4, identical to RVV's subvector.

This is explicitly used in the (c/c++) source code, whereas MVL immediates and VL lengths definitely are not.)

now, we mmmiiiight be able to get away with assuming that vscale is
equal to the absolute maximum possible setting (64 for RV64, 32 for
RV32), then use / play-with the “runtime active VL get/set”
intrinsics.

i’m kiinda wary of saying “absolutely yes that’s the way forward” for
us, particularly without some input from Jacob here.
Note that there isn’t a requirement to use vscale as proposed in my first patch.

Oh? Ah! That is an important detail :)

One that is tough to express in a short introduction in the docstring without going into too much detail.

If RV only cares about the runtime active-VL then some explicit, separate mechanism to get/set the active VL would be needed anyway. I imagine the resulting runtime value (instead of vscale) to then be used in loop indvar updates, address computations, etc.

Ok this might be the GetOutOfJailFree card I was looking for :)

My general feeling on this then is that both RVV and SV should avoid using vscale.

In the case of RVV, MVL is a hardware-defined constant that is never intended to be known by applications. There's no published detection mechanism. Loops are supposed to be designed to run a few more times on lower-spec'd hardware.

Robin, what’s your thoughts there?

For SV it looks like we will need to do something like <%reg x 4 x f32>, with an analysis pass to process it, calculating the total number of available regs for a given block (isolated by LD and ST boundaries) and maximising %reg so as not to spill.

ok, a link to that would be handy… let me see if i can find it…
what comes up is this: https://reviews.llvm.org/D57504 is that right?

Yes, that’s the one!

Super, encountered it a few months back will read again.

L.

Software should be portable across different RVV implementations, in particular across different values of the impl-defined constants VLEN, ELEN, SLEN. But being portable does not mean software must never mention these (and derived quantities such as vscale or, in the RVV spec, VLMAX) at all, just that it has to work correctly no matter which value they have. And in fact, there is a published (written out in the spec) mechanism for obtaining VLMAX, which is directly related to VLEN (so you can obtain VLEN with a little more arithmetic, though for most purposes VLMAX is more useful): requesting the vector length of -1 (unsigned: 2^XLEN - 1) is guaranteed to result in vl=VLMAX.

For regular strip-mined loops, the vsetvl instruction takes care of everything so there’s simply no need for the program to do this. But for other tasks, it’s required (i.e., you can’t sensibly write the program otherwise) and perfectly fine w.r.t. portability. One example is the stack frame layout when there’s any vectors on the stack (e.g. for spills), since the vector stack slots must in general be large enough to hold a full vector (= VLEN*LMUL bits). Granted, I don’t think this or other examples will normally occur in LLVM IR generated by a loop vectorizer, so vscale will probably not occur very frequently in RVV. Nevertheless, there is nothing inherently non-portable about it.

Regards
Robin

PS: I don’t want to read too much into your repeated use of “MVL”, but FWIW the design of RVV has changed quite radically since “MVL” was last used in any spec draft. If you haven’t read any version since v0.6 (~ December 2018) with a “clean slate”, may I suggest you do that when you find the time? You can find the latest draft at https://github.com/riscv/riscv-v-spec/

My general feeling on this then is that both RVV and SV should avoid using vscale.

In the case of RVV, MVL is a hardware-defined constant that is never intended to be known by applications. There's no published detection mechanism. Loops are supposed to be designed to run a few more times on lower-spec'd hardware.

Robin, what’s your thoughts there?

Software should be portable across different RVV implementations, in particular across different values of the impl-defined constants VLEN, ELEN, SLEN. But being portable does not mean software must never mention these (and derived quantities such as vscale or, in the RVV spec, VLMAX) at all, just that it has to work correctly no matter which value they have. And in fact, there is a published (written out in the spec) mechanism for obtaining VLMAX,

Ah excellent. It’s a little obtuse (the wording is very indirect. As the RVV WG is a closed list, can I leave it with you to raise that as an issue?)

which is directly related to VLEN (so you can obtain VLEN with a little more arithmetic, though for most purposes VLMAX is more useful): requesting the vector length of -1 (unsigned: 2^XLEN - 1) is guaranteed to result in vl=VLMAX.

For regular strip-mined loops, the vsetvl instruction takes care of everything so there’s simply no need for the program to do this. But for other tasks, it’s required (i.e., you can’t sensibly write the program otherwise) and perfectly fine w.r.t. portability. One example is the stack frame layout when there’s any vectors on the stack (e.g. for spills), since the vector stack slots must in general be large enough to hold a full vector (= VLEN*LMUL bits).

Kernel context switch as well. Both would likely be written in assembler.

Granted, I don’t think this or other examples will normally occur in LLVM IR generated by a loop vectorizer, so vscale will probably not occur very frequently in RVV.

Interesting. It is sort-of what I had a hunch would be the case.

Nevertheless, there is nothing inherently non-portable about it.

Indeed. Thank you for the insights, Robin.

Regards
Robin

PS: I don’t want to read too much into your repeated use of “MVL”, but FWIW the design of RVV has changed quite radically since “MVL” was last used in any spec draft. If you haven’t read any version since v0.6 (~ December 2018) with a “clean slate”, may I suggest you do that when you find the time? You can find the latest draft at https://github.com/riscv/riscv-v-spec/

Ah yes thank you, I reference it at least three times a week: such a large document it is easy to miss things. I will replace MVL in SV with VLMAX.

Appreciated the headsup.

L.

Ok so taking the RISC-V developers off cc, because it looks like neither SV nor RVV would use vscale, as we basically identified, eventually, that it is a way to express the "architectural SIMD width".

The rest of this is therefore nothing to do with vector engines, and is purely some constructive input for future consideration.

Let us take a scenario where the data is short vectors, well below vscale, and where there is also some inter-element dependence (cross product or other) which makes packing multiple short vectors into a single vscale-long SIMD register awkward.

Under such circumstances having a fixed vscale is extremely wasteful, particularly if there is an out of order engine which could use mixed scalar or MMX/SSE with AVX512 for example.

Thus for the longer operations the idea is to throw those at AVX512 and the shorter ones at 64-bit MMX/SSE.

The point is: both could benefit from vscale except, unfortunately, there is only one vscale and it can therefore only be applied to one of the SIMD ALUs.

This tends to suggest that either vscale should be a variable (and applicable on a per-group basis, separated by LD/STs)

OR

That there should be more than one vscale.

i.e. that vscale, instead of being a fixed global constant, should be morphed into %vscaleN, similar to %regN, conveying the context of its intended scope and use.

Thus, certain groups of operations intended to be farmed out to a SPECIFIC SIMD suite (AVX512) may be explicitly separated from those intended to be targeted at another suite (MMX/SSE).

Of course, on architectures which have no such distinction, a simple pass would merge them all into one global vscale.

A thought for consideration.

L.