On vectorization under RISC-V and its existing interface to control scalable vectorization width - vectorize_width(VF, scalable)

I think this is the right place for me to ask questions that are target-specific to RISC-V even though the post is about vectorization.


In this post I want to discuss how the vectorizer should act on the RISC-V
Vector (RVV) extension [0]. To make this post self-contained, I will start from
how a scalar loop can be vectorized with an existing scalable vector extension
(e.g. SVE), then move on to what we have in RVV, where a new concept of "LMUL" [1]
is introduced to bring more flexibility to vectorization, and finally to the most
specific interface we currently have to manually control the vectorizer,
vectorize_width(VF, scalable) [2], and the problem it has with respect to
RVV.

Let's assume a perfect world where there are no remaining elements left unprocessed
after the vectorized loop, and please don't be too picky about the pseudo code :slight_smile:

Existing vectorization for other scalable vector extensions

Consider a scalar loop of below.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
for (int i=0; i<N; ++i) {
  a[i] = b[i] + c[i];
  d[i] = e[i] + f[i];
}

SVE has two ways of vectorizing the loop: (1) vectorize after loop distribution,
(2) vectorize without loop distribution.

(1) Vectorize after loop distribution.

Both loops are able to fully utilize a whole vector register.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride0 = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride0) {
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride0 - 1] = b[i + stride0 - 1] + c[i + stride0 - 1];
}

size_t stride1 = VLEN / sizeof(DTYPE1);
for (int i = 0; i < N; i += stride1) {  
  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride1 - 1] = e[i + stride1 - 1] + f[i + stride1 - 1];
}

(2) Vectorize without loop distribution.

The single vectorized loop has to operate with a single stride, so the number of
elements processed per iteration is limited by the larger data type (DTYPE1 here),
and a[i] = b[i] + c[i] cannot fully utilize the whole vector register.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride = min(VLEN / sizeof(DTYPE0), VLEN / sizeof(DTYPE1));
for (int i = 0; i < N; i += stride) {
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];

  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}

How scalable vectorization types are expressed in the LLVM vectorizer

LLVM has ScalableVectorType to support vectorization for scalable extensions,
in the form <vscale x VF x datatype>, where VF is the vectorization
factor. The compilation provides a "minimal vector register length"
(MinVLen) that the vector extension is guaranteed to operate on. The VF is
realized as MinVLen divided by the width of the datatype that is to be
vectorized. The remaining variable vscale is the factor that scales
along with the actual vector length.

Vectorization for RISC-V

The RISC-V Vector (RVV) extension extends the concept of scalable vectors
further with an extra register grouping parameter, LMUL (short for
Length Multiplier). It allows vector register grouping so a vector
instruction can operate on a longer effective vector length. This brings the
possibility of a better vectorized loop.

Consider the scalar loop previously mentioned: we can use four vector
registers for the int32_t addition, allowing both additions to operate on a
longer stride, while the int8_t addition utilizes a whole vector register.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
  /* a single vector register is engaged here */
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];

  /* four vector registers are engaged here to make this possible */
  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}

To take this further, if we have more spare vector registers, we can double
the LMUL and engage even more of them, resulting in a longer stride for the
vectorized loop.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride = (VLEN * 2) / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
  /* two vector registers are engaged here */
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];

  /* eight vector registers are engaged here */
  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}

How RVV is implemented in LLVM

RISC-V assumes a minimum vector length of 64 bits; combined with the LMUL
parameter that lengthens the effective vector register length, we can derive a
chart like the following that realizes the vectorization factor (VF) as
concrete integers.

Element width \ LMUL | mf8          | mf4          | mf2          | m1           | m2            | m4            | m8
8                    | <vscale x 1> | <vscale x 2> | <vscale x 4> | <vscale x 8> | <vscale x 16> | <vscale x 32> | <vscale x 64>
16                   | x            | <vscale x 1> | <vscale x 2> | <vscale x 4> | <vscale x 8>  | <vscale x 16> | <vscale x 32>
32                   | x            | x            | <vscale x 1> | <vscale x 2> | <vscale x 4>  | <vscale x 8>  | <vscale x 16>
64                   | x            | x            | x            | <vscale x 1> | <vscale x 2>  | <vscale x 4>  | <vscale x 8>

Note that mf8, mf4, and mf2 in the chart are fractional LMUL, corresponding
to multipliers of 1/8, 1/4, and 1/2, respectively.

With such a type system, the vectorizer is able to express a vectorized loop
in which the int8_t and the int32_t additions process the same number of
elements in the same iteration. As long as we are operating on a vector length
greater than or equal to what we assume for RVV (which is 64 bits), there is no
functional problem in expressing the scalable vectorization types for RVV.

Problem of the existing pragma vectorize_width(VF, scalable)

Therefore we can say that there is no functional inability in the current
interface vectorize_width(VF, scalable). However, I think this is not the
right interface to expose to users for manipulating vectorization under RVV.

Users need to be aware of the chart mentioned above to pick the
vectorization factor that engages the number of registers they expect.
Take the vectorized loop above, where the int8_t addition uses
two vector registers and the int32_t addition uses eight: the user has to write
#pragma clang loop vectorize_width(16, scalable).

Under RVV, I think the key cognitive assumption is that users need to
understand that multiple vector registers will be engaged in the vectorized
loop. Under this assumption, let me write a vectorized loop for the example
this post has repeatedly visited.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

int i = 0;
while (i < N) {
  int remaining_elem_to_process = N - i;
  int vl = vsetvl_e8m2(remaining_elem_to_process);
  /* pseudo intrinsics: process vl elements starting at index i */
  vadd_i8m2(&a[i], &b[i], &c[i], vl);
  vadd_i32m8(&d[i], &e[i], &f[i], vl);
  i += vl;
}

The vsetvl function takes the number of elements still to be processed and
returns the number of elements to process in this iteration, given the element
width and the number of vector registers engaged. You will notice that
vsetvl_e8m2 returns the same number as vsetvl_e32m8.

Proposal

Given this problem, I think we should have a target-specific interface for RVV
that provides a pair of (element_width, lmul) hints to the vectorizer. It is
essentially syntactic sugar for vectorize_width(VF, scalable), but such an
interface carries the correct semantics: users are informed that
multiple registers will be engaged in the vectorization. The pair provided is
equivalent to toggling the element width and LMUL of the call to vsetvl.

We need some regulation to enforce correct usage of the interface. Users
should be aware of the element widths in the loop and may only specify the
largest element width used in the loop. Therefore providing (16, m2)
or (64, m2) is not feasible in the example of this post; the user should use
(32, LMUL), for example (32, m8). The compiler should be responsible for
detecting misuse of the pragma, emitting a warning, and ignoring the pragma when misused.

[0] v-spec

[1] v-spec: 3.4.2. Vector Register Grouping

[2] Patch of SVE extending the interface vectorize_width(VF, scalable)

Thanks for raising this. How to handle LMUL from a vectorization perspective is a topic which has already gotten a fair amount of discussion, but the aspect you raise here - the pragma interface - hasn’t been discussed yet to my knowledge.

As a practical matter, I want to call out that our LMUL > 1 codegen is currently quite poor. Our cost model basically ignores LMUL - we know this is wrong. We have no good answer on register allocation at LMUL8. More generally, no one has really pushed on this part of codegen to my knowledge. In my view, a lot of work needs to happen here before a user can reasonably expect non-LMUL1 code to perform well.

Long term, I expect the vectorizer to pick an appropriate LMUL. We currently don’t have this part implemented at all, and thus default to LMUL1. Even once we have a robust cost model and selection heuristic, having an override mechanism for the user is likely worthwhile, but I think it’s important to note that long term very few loops should need such an override.

As for the pragma itself, I have to admit I find your proposal slightly confusing. (As in, I think I didn’t understand something.) Are you meaning to propose something which simply adds a level of error checking on top of the existing pragma? Your example seems to still assume that particular vector width values map to particular LMULs. It seems like the explicit LMUL is there as a consistency check? Or am I missing something?

If you want, I’m happy to brainstorm this a bit offline. We can either chat directly, or this would be a good topic for the next RISC-V sync-up call.

Thank you for the swift reply.

I am proposing syntactic sugar like

#pragma rvv lmul_sew(m1, e64)

that still respects the chart I mentioned, which maps to a corresponding VF. This essentially achieves no more than the existing clang loop vectorize_width(VF, scalable), but the justification is that the proposed interface gives target-specific users the correct semantics in the RVV domain to control the vectorization factor.

With the proposed pragma, the vectorizer would check the pair specified: the element width must actually be used inside the loop, or else the vectorizer ignores the pragma (and emits a debug log during compilation that the pragma was ignored; this is not a Clang semantic check). The Clang semantic check should reject invalid pairs, i.e. those marked x in the chart that maps an (lmul, sew) pair to a VF, for example (mf8, e64).