RFC: C and C++ extension to support variable-length register-sized vector types

Richard Sandiford wrote:

TL;DR: This is an RFC about adding variable-length register-sized vector
types to C and C++. Near the end of the message there are some Phabricator
links to the clang implementation (which we're only posting to back up
the RFC; it's not intended for commit).

Summary

This is an RFC about some C and C++ language changes related to Arm's
Scalable Vector Extension (SVE). A detailed description of SVE is
available here:

    https://static.docs.arm.com/ddi0584/a/DDI0584A_a_SVE_supp_armv8A.pdf

It's been almost two weeks and no one else has replied to this, so I
thought I'd at least make sure that everyone is aware that this work is
relevant not only to ARM's SVE but also to the proposed RISC-V Vector
Extension.

I don't have detailed comments on the proposed changes to C/C++ or the
implementation, but I do want to comment on some of the differences between
SVE and RVV.

Hopefully it is possible to come up with a single proposal which will be
suitable for both ARM and RISC-V.

The following reflects my personal opinion and understanding of RVV (which
is not yet finalised) and is not an official position of SiFive or the
RISC-V Foundation or its Vector working group.

but the only feature that really matters for this RFC is that SVE has
no fixed or preferred vector length. Implementations can instead choose
from a range of possible vector lengths, with 128 bits being the minimum
and 2048 bits being the maximum. The actual length is variable and only
known at runtime.

My understanding is that in SVE the vector length in bits is fixed for any
given CPU core. I don't know what happens in a big.LITTLE system.

In RVV the vector length is potentially different in every loop nest, and
even from iteration to iteration of the same loop body. The minimum vector
length is one element. I don't think there is a defined maximum.

For example, the vector length will be shorter on the last iteration of a
loop than on the others if the length of the high-level vector is not an
exact multiple of the length of the vector registers. (RISC-V doesn't need
any "last elements clean up" code after the main vector loop)
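As a rough illustration of the loop shape described above, here is a minimal sketch in plain C. The helper set_vl() is a hypothetical stand-in for the RVV instruction that sets the vector length (real hardware may grant fewer elements than this helper does), and saxpy_stripmined() and max_vl are names invented for the example. The inner loop stands in for a single vector operation on vl elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an RVV "set vector length" instruction:
   grants at most max_vl elements, fewer when fewer remain. */
static size_t set_vl(size_t remaining, size_t max_vl) {
    return remaining < max_vl ? remaining : max_vl;
}

/* Computes y[i] += a * x[i] for i in [0, n), vl elements at a time.
   Returns the number of strip-mining iterations that were needed. */
size_t saxpy_stripmined(size_t n, float a, const float *x, float *y,
                        size_t max_vl) {
    assert(max_vl > 0);
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = set_vl(n, max_vl);   /* shorter on the last iteration */
        for (size_t i = 0; i < vl; i++)  /* stands in for one vector op */
            y[i] += a * x[i];
        x += vl;
        y += vl;
        n -= vl;
        iterations++;
    }
    return iterations;
}
```

With n = 10 and max_vl = 4 the loop runs with lengths 4, 4 and 2, and there is no separate scalar clean-up loop after it.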

It's also possible (depending on the micro-architecture) that if a vector
load or store instruction crosses a page boundary and there is a page fault
(or even TLB miss) or protection fault, that loop iteration will complete
using a shorter vector length (possibly without taking the fault at all),
and the next iteration of the loop will start from the beginning of the
next memory page.

Note that while in SVE all vector registers in a given CPU core have the
same number of bits (and thus different numbers of elements, depending on
the element size), in RVV all vector registers in a given loop body
iteration have the same number of elements (and thus different sizes in
bits, depending on the element size).

The prologue of a RISC-V vector processing loop contains an instruction
that declares how many registers of each element size are needed. It is
expected there will be implementations that have a single pool of vector
register storage (e.g. SRAM) that is dynamically divided into registers for
each loop. If, for example, you have 1024 bytes of vector register storage
and a particular loop asks for 1 vector with byte elements and 3 registers
with single-precision float elements then you might get a vector length for
*that* loop of 78 elements, with the byte vector starting at offset 0 in
the SRAM and the float vectors starting at offsets 80, 392, and 704.
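The arithmetic in that example can be reproduced with a short sketch. Everything here is an assumption made for illustration: the function name partition_vector_storage(), the layout policy (registers packed in request order), and the 8-byte alignment, which was chosen only because it yields the offsets quoted above. An actual implementation is free to divide its storage differently.

```c
#include <assert.h>
#include <stddef.h>

#define REG_ALIGN 8  /* assumed alignment of each register's start offset */

static size_t round_up(size_t x, size_t align) {
    return (x + align - 1) / align * align;
}

/* elem_size[i] is the element size in bytes of requested register i.
   Stores each register's start offset in offset[i] and returns the
   vector length (in elements) shared by all registers, or 0 if none fits. */
size_t partition_vector_storage(size_t storage_bytes, const size_t *elem_size,
                                size_t n_regs, size_t *offset) {
    assert(n_regs > 0);
    size_t bytes_per_element = 0;
    for (size_t i = 0; i < n_regs; i++)
        bytes_per_element += elem_size[i];
    assert(bytes_per_element > 0);

    /* Start from the unaligned upper bound and shrink until the layout fits. */
    for (size_t vl = storage_bytes / bytes_per_element; vl > 0; vl--) {
        size_t next = 0, end = 0;
        for (size_t i = 0; i < n_regs; i++) {
            offset[i] = next;
            end = next + vl * elem_size[i];
            next = round_up(end, REG_ALIGN);
        }
        if (end <= storage_bytes)
            return vl;
    }
    return 0;
}
```

For 1024 bytes of storage and element sizes {1, 4, 4, 4}, each element position costs 13 bytes, so the length is floor(1024 / 13) = 78. The byte vector occupies 78 bytes and is padded to 80, and each float vector occupies 312 bytes, giving offsets 0, 80, 392 and 704.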

However, even though the length is variable, the concept of a
"register-sized" C and C++ vector type makes just as much sense for SVE
as it does for other vector architectures. Vector library functions
take such register-sized vectors as input and return them as results.
Intrinsic functions are also just as useful as they are for other vector
architectures, and they too take register-sized vectors as input and
return them as results.

Intrinsic functions are absolutely required, and are I think the main
reason for such a low-level register-sized vector type to exist.

I'm not sure whether user-written functions operating on register-sized
vectors are useful enough to support. User-written functions would normally
take and return a higher-level vector type, and would implement the desired
functionality in terms of calls to other user-written functions (operating
on the high level vector as a whole) and/or explicit loops iterating
through the high level vector type using intrinsic functions on the
register-sized vector type proposed here.

All these types are opaque builtin types and are only intended to be
used with the associated ACLE intrinsics. There are intrinsics for
creating vectors from scalars, loading from scalars, storing to scalars,
reinterpreting one type as another, etc.

The idea is that the vector types would only be used for short-term
register-sized working data. Longer-term data would typically be stored
out to arrays.

I agree with this.

For example, the vector function underlying:

  #pragma omp declare simd
  double sin(double);

would be:

  svfloat64_t mangled_sin(svfloat64_t, svbool_t);

(The svbool_t is because SVE functions should be predicated by default,
to avoid the need for a scalar tail.)

Passing a predicate vector would work in RVV, but it's not necessary as any
function that takes a low-level vector-register argument will automatically
operate on the correct amount of it because it will share the same current
value in the Vector Length register.

Such a function might also *decrease* the value in the Vector Length
register if, for example, it encounters a fault in a vector load or store
within the function.

The approach we took was to treat all the SVE types as permanently
incomplete.

This seems reasonable.

Specific things we wanted to remain invalid -- by inheriting the rules from
incomplete types -- were:

  * creating or accessing arrays that have sizeless types
  * doing pointer arithmetic on pointers to sizeless types
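For readers less familiar with the rules being inherited, the sketch below shows how standard C already treats an ordinary incomplete type. It is only an analogy: struct sizeless and same_object() are hypothetical names, and the SVE types differ from an ordinary incomplete struct in that the RFC still allows them to be declared as variables and passed or returned by value.

```c
#include <stddef.h>

struct sizeless;  /* declared but never defined, so the type is incomplete */

/* Valid: pointers to an incomplete type can be stored, passed and compared. */
int same_object(const struct sizeless *a, const struct sizeless *b) {
    return a == b;
}

/* Each of the following is rejected by a conforming C compiler, because
   the size of the type is unknown:

     struct sizeless arr[4];           array of an incomplete type
     size_t n = sizeof(struct sizeless);
     const struct sizeless *q = a + 1; pointer arithmetic
*/
```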

But when writing a strip-mining loop you need to be able to increment the
pointer to the last vector-register worth of data to point to the address
of the next vector-register worth of data to be processed.

I think in this regard a sv<base>_t* should act like a base*. The compiler
doesn't know the vector length, but it knows the element size. The code at
runtime *has* to know the vector length.

Hi Bruce,

Thanks for the reply.

Bruce Hoult via cfe-dev <cfe-dev@lists.llvm.org> writes:

but the only feature that really matters for this RFC is that SVE has
no fixed or preferred vector length. Implementations can instead choose
from a range of possible vector lengths, with 128 bits being the minimum
and 2048 bits being the maximum. The actual length is variable and only
known at runtime.

My understanding is that in SVE the vector length in bits is fixed for any
given CPU core. I don't know what happens in a big.LITTLE system.

SVE hardware has a maximum vector length, but it's possible for software
to choose a vector length that is less than that if necessary.

In RVV the vector length is potentially different in every loop nest, and even
from iteration to iteration of the same loop body. The minimum vector length is
one element. I don't think there is a defined maximum.

For example, the vector length will be shorter on the last iteration of a loop
than on the others if the length of the high-level vector is not an exact
multiple of the length of the vector registers. (RISC-V doesn't need any "last
elements clean up" code after the main vector loop)

My understanding from the LLVM RFC about RVV was that there were two
vector lengths of interest, the "maximum vector length" (MVL) and the
active vector length. Is that right? Is it likely that the MVL would
change from one iteration to another, or would only the active vector
length change?

If only the active vector length changes during the loop then I would
imagine the sizeless type proposal might map well to the MVL-dependent
register types. The active vector length would then be an on-the-side
global property that says how many bits of those register types are
currently significant. This would be similar to the role played by the
predicate registers in SVE.

It's also possible (depending on the micro-architecture) that if a vector load
or store instruction crosses a page boundary and there is a page fault (or even
TLB miss) or protection fault, that loop iteration will complete using a
shorter vector length (possibly without taking the fault at all), and the next
iteration of the loop will start from the beginning of the next memory page.

Same question here I suppose: is the active length the one that would
change? (Assuming I've understood the distinction.)

Note that while in SVE all vector registers in a given CPU core have the same
number of bits (and thus different numbers of elements, depending on the
element size), in RVV all vector registers in a given loop body iteration have
the same number of elements (and thus different sizes in bits, depending on the
element size).

OK.

The prologue of a RISC-V vector processing loop contains an instruction that
declares how many registers of each element size are needed. It is expected
there will be implementations that have a single pool of vector register
storage (e.g. SRAM) that is dynamically divided into registers for each loop.
If, for example, you have 1024 bytes of vector register storage and a
particular loop asks for 1 vector with byte elements and 3 registers with
single-precision float elements then you might get a vector length for *that*
loop of 78 elements, with the byte vector starting at offset 0 in the SRAM and
the float vectors starting at offsets 80, 392, and 704.

However, even though the length is variable, the concept of a
"register-sized" C and C++ vector type makes just as much sense for SVE
as it does for other vector architectures. Vector library functions
take such register-sized vectors as input and return them as results.
Intrinsic functions are also just as useful as they are for other vector
architectures, and they too take register-sized vectors as input and
return them as results.

Intrinsic functions are absolutely required, and are I think the main reason
for such a low-level register-sized vector type to exist.

I'm not sure whether user-written functions operating on
register-sized vectors are useful enough to support. User-written
functions would normally take and return a higher-level vector type,
and would implement the desired functionality in terms of calls to
other user-written functions (operating on the high level vector as a
whole) and/or explicit loops iterating through the high level vector
type using intrinsic functions on the register-sized vector type
proposed here.

The idea here was more to support people who wanted to write custom
implementations of "#pragma omp declare simd" routines in C rather
than asm. These routines would normally take register-sized inputs
and outputs by value, so C would need to provide a way of doing the
same. This is how SLEEF is written, for example. I agree it isn't
likely to be important for higher-level functions.

All these types are opaque builtin types and are only intended to be
used with the associated ACLE intrinsics. There are intrinsics for
creating vectors from scalars, loading from scalars, storing to scalars,
reinterpreting one type as another, etc.

The idea is that the vector types would only be used for short-term
register-sized working data. Longer-term data would typically be stored
out to arrays.

I agree with this.

For example, the vector function underlying:

#pragma omp declare simd
double sin(double);

would be:

svfloat64_t mangled_sin(svfloat64_t, svbool_t);

(The svbool_t is because SVE functions should be predicated by default,
to avoid the need for a scalar tail.)

Passing a predicate vector would work in RVV, but it's not necessary as any
function that takes a low-level vector-register argument will automatically
operate on the correct amount of it because it will share the same current
value in the Vector Length register.

Such a function might also *decrease* the value in the Vector Length register
if, for example, it encounters a fault in a vector load or store within the
function.

The approach we took was to treat all the SVE types as permanently
incomplete.

This seems reasonable.

Specific things we wanted to remain invalid -- by inheriting the rules from
incomplete types -- were:

  * creating or accessing arrays that have sizeless types
  * doing pointer arithmetic on pointers to sizeless types

But when writing a strip-mining loop you need to be able to increment the
pointer to the last vector-register worth of data to point to the address of
the next vector-register worth of data to be processed.

I think in this regard a sv<base>_t* should act like a base*. The compiler
doesn't know the vector length, but it knows the element size. The code at
runtime *has* to know the vector length.

The idea with the SVE intrinsics was that normal loads and stores would
operate on <base>_t* rather than sv<base>_t*, and any pointer increment
would be to bump the <base>_t* by the number of elements. E.g. the SVE
intrinsic code to do a normal contiguous load followed by a pointer
increment would be:

  svbool_t pg;
  int32_t *ptr;

  svint32_t vec = svld1(pg, ptr);
  ptr += svcntw();

Using and dereferencing svint32_t* wouldn't be correct when pg is only
partial, such as the last iteration of the loop, since there aren't
guaranteed to be a full vector's worth of elements at *ptr. However,
it would be valid to do:

  svint32_t v1, v2;
  ...
  std::swap (v1, v2);

where std::swap operates on svint32_t&s. The same could be done in C
using:

  void swap(svint32_t *a, svint32_t *b) {
    svint32_t tmp = *a;
    *a = *b;
    *b = tmp;
  }
  ...
    svint32_t v1, v2;
    ...
    swap (&v1, &v2);

We included pointers and references to sizeless types for cases like these.
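To put the load-and-increment fragment from earlier in this message into a complete loop, here is a sketch of a predicated strip-mined routine that advances an int32_t pointer by svcntw() elements per iteration. The function add_one() is a hypothetical example, and the scalar fallback for targets without SVE is an addition made only so that the sketch compiles everywhere. The SVE path uses the overloaded ACLE forms svwhilelt_b32, svld1, svadd_x and svst1.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical example: add 1 to each of the n elements starting at ptr. */
void add_one(int32_t *ptr, int64_t n) {
    assert(n >= 0);
#if defined(__ARM_FEATURE_SVE)
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        /* Lanes with i + lane < n are active, so pg is only partial
           on the last iteration and no scalar tail is needed. */
        svbool_t pg = svwhilelt_b32(i, n);
        svint32_t vec = svld1(pg, ptr + i);  /* contiguous predicated load */
        vec = svadd_x(pg, vec, 1);
        svst1(pg, ptr + i, vec);             /* contiguous predicated store */
    }
#else
    /* Fallback for non-SVE targets (an assumption of this sketch). */
    for (int64_t i = 0; i < n; i++)
        ptr[i] += 1;
#endif
}
```

The loads and stores take int32_t* throughout, so nothing dereferences an svint32_t* that might extend past the end of the array.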

It sounds like the same arrangement could also work for RVV, if the
types did represent MVL-dependent register types and if svcntw() were
replaced by the RVV intrinsic to read the active vector length.

Thanks,
Richard