I think this is the right place for me to ask questions that are target-specific to RISC-V even though the post is about vectorization.
In this post I want to discuss on how the vectorizer should act on the RISC-V
Vector (RVV) extension [0]. To make this post self-contained, I will start from
how a scalar loop can be vectorized with existing scalable vector extension
(e.g. SVE). Then to what we have in RVV, a new concept of “LMUL” [1] that is
introduced to brings more flexibility in vectorization. Then to the most
specific interface we have now to manually control the vectorizer,
vectorize_width(VF, scalable)
[2], and the problem it has with respect to
RVV.
Lets assume a perfect world that there are no remaining elements not processed
after the vectorized loop and please don’t be to picky on the psuedo code
Existing vectorization for other scalable vector extensions
Consider a scalar loop of below.
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
for (int i=0; i<N; ++i) {
a[i] = b[i] + c[i];
d[i] = e[i] + f[i];
}
SVE has two ways of vectorizing the loop, (1) vectorize after loop distribution.
(2) vectorize without loop distribution.
(1) Vectorie after loop distribution.
Both loops is able to fully utilize a whole vector register.
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride0 = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride0) {
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[ i +1];
...
a[i + stride0 - 1] = b[i + stride0 - 1] + c[i + stride0 - 1];
}
size_t stride1 = VLEN / sizeof(DTYPE1);
for (int i = 0; i < N; i += stride1) {
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[ i +1];
...
d[i + stride1 - 1] = e[i + stride1 - 1] + f[i + stride1 - 1];
}
(2) Vectorize without loop distribution.
The single vectorized loop has to operate on the same stride, elements
processed per loop would be limited by the larger data type (DTYPE1
here)
and a[i] = b[i] + c[i]
cannot fully utilize the whole vector register.
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride = min(VLEN / sizeof(DTYPE0), VLEN / sizeof(DTYPE1));
for (int i = 0; i < N; i += stride) {
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[ i +1];
...
a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[i + 1];
...
d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}
How scalable vectorization types is expressed in LLVM vectorizer
LLVM has ScalarVectorType
to support vectorization of scalable extensions,
in the form of <vscale x VF x datatype>
, which VF is the vectorization
factor. The compilation would provide a “minimal vector register length”
(MinVLen
) that the vector extension would operate on. The VF
would be
realized from MinVLen
divided by width of datatype
that is to be
vectorized. The remaining variable vscale
will be the factor that scales
along with the vector length.
Vectorization for RISC-V
The RISC-V Vector (RVV) extension extends the concept of scalable vector
further with an extra register grouping parameter - LMUL (abbreviation for
Length Multiplier). It allows vector register grouping so the vector
instruction can operate a longer vector register length. This brings the
possibility of a better vectorized loop.
Consider the scalar loop previously mentioned, we can use four vector
registers for the int32_t
addition and allow both additions to operate on a
longer stride, and the int8_t
addition utilizes the whole vector register.
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
/* a single vector register is engaged here */
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[i + 1];
...
a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];
/* four vector registers is engaged here to make it possible */
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[i + 1];
...
d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}
To take this further, if we have more spare vector registers, it is possible
that we can double the LMUL and engage even more vector registers, resulting
in a longer stride of the vectorized loop.
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride = (VLEN * 2) / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
/* two vector registers is engaged here */
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[i + 1];
...
a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];
/* eight vector registers is engaged here */
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[i + 1];
...
d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}
How RVV is implemented to LLVM
RISC-V assumes a minimum vector length of 64, along with the LMUL parameter
that lengthens the vector register length, we can derive a chart like the
following that realizes the vectorization factor (VF) into integers.
Element width / LMUL | mf8 | mf4 | mf2 | m1 | m2 | m4 | m8 |
---|---|---|---|---|---|---|---|
8 | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > | <vscale x 16 > | <vscale x 32 > | <vscale x 64 > |
16 | x | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > | <vscale x 16 > | <vscale x 32 > |
32 | x | x | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > | <vscale x 16 > |
32 | x | x | x | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > |
Note that mf8
, mf4
, mf2
in the chart are fractional LMUL, corresponding
to multiplier of 1/8, 1/4, and 1/2, respectively.
With such type system, the vectorizer is able to express the vectorized loop
that both addition of int8_t
and int32_t
be processing the same amount of
elements in the same loop. As long as the we are operating on a vector length
that is greater or equal than what we assume for RVV (which is 64), there is no
functional problem in expressing the scalable vectorize types in RVV.
Problem of the existing pragma vectorize_width(VF, scalable)
Therefore we can say that there is no functional inability in the current
interface vectorize_width(VF, scalable)
, However I think this is not the
right interface exposed to the users to manipulate vectorization of RVV.
Users needs to be aware of the chart mentioned above to toggle the correct
vectorization factor that will use the the number of registers as user expects.
Take example of the vectorized loop above, where the int8_t
addition is using
two vector register and the int32_t
addition using eight. The user should use
#pragma clang loop vectorize_width(16, scalable)
.
Under RVV, I think the key cognitive assumption is that users need to
understand that multiple vector registers will engage in the vectorized loop.
Under this cognitive assumption, let me write a vectorized loop of the example
this post has repeatingly visited.
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
int i = 0;
while (i < N) {
int remaining_elem_to_process = N - i;
int vl = vsetvl_e8m2(remaining_elem_to_process);
a = vadd_i8m2(b, c, vl);
d = vadd_i32m8(e, f, vl);
}
This function vsetvl
takes the number of element to be processed and returns
the number of elements to process in the iteration given the element width and
the number of vector registers engaged. You will realize that vsetvl_e8m2
will return the same number as vsetvl_e32m8
.
Proposal
Under such problem, I think we should have a target-specific interface for RVV
that provides a pair of (element_width, lmul)
hint to the vectorizer. It is
essentially a syntax sugar to vectorize_width(VF, scalable)
, but such
interface will provide the correct semantic that users are informed that
multiple registers will engage in the vectorization. The pair provided will be
equivalent of toggling the element width and LMUL of the call to vsetvl
.
We need to have some regulation to enforce correct usage of the interface. User
should be aware of the element width in the loop. User can only specify the
largest element width that is used in the loop. Therefore providing (16, m2)
or (64, m2)
is not feasible in the example of this post. User should use
(32, LMUL)
, for example (32, m8)
. The compiler should be responsible of
checking mis-usage of the pragma, emit warning, and ignore the pragma if mis-used.
[0] v-spec
[1] v-spec: 3.4.2. Vector Register Grouping
[2] Patch of SVE extending the interface vectorize_width(VF, scalable)