I think this is the right place for me to ask questions that are target-specific to RISC-V even though the post is about vectorization.

In this post I want to discuss on how the vectorizer should act on the RISC-V

Vector (RVV) extension [0]. To make this post self-contained, I will start from

how a scalar loop can be vectorized with existing scalable vector extension

(e.g. SVE). Then to what we have in RVV, a new concept of “LMUL” [1] that is

introduced to brings more flexibility in vectorization. Then to the most

specific interface we have now to manually control the vectorizer,

`vectorize_width(VF, scalable)`

[2], and the problem it has with respect to

RVV.

Lets assume a perfect world that there are no remaining elements not processed

after the vectorized loop and please don’t be to picky on the psuedo code

### Existing vectorization for other scalable vector extensions

Consider a scalar loop of below.

```
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
for (int i=0; i<N; ++i) {
a[i] = b[i] + c[i];
d[i] = e[i] + f[i];
}
```

SVE has two ways of vectorizing the loop, (1) vectorize after loop distribution.

(2) vectorize without loop distribution.

#### (1) Vectorie after loop distribution.

Both loops is able to fully utilize a whole vector register.

```
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride0 = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride0) {
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[ i +1];
...
a[i + stride0 - 1] = b[i + stride0 - 1] + c[i + stride0 - 1];
}
size_t stride1 = VLEN / sizeof(DTYPE1);
for (int i = 0; i < N; i += stride1) {
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[ i +1];
...
d[i + stride1 - 1] = e[i + stride1 - 1] + f[i + stride1 - 1];
}
```

#### (2) Vectorize without loop distribution.

The single vectorized loop has to operate on the same stride, elements

processed per loop would be limited by the larger data type (`DTYPE1`

here)

and `a[i] = b[i] + c[i]`

cannot fully utilize the whole vector register.

```
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride = min(VLEN / sizeof(DTYPE0), VLEN / sizeof(DTYPE1));
for (int i = 0; i < N; i += stride) {
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[ i +1];
...
a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[i + 1];
...
d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}
```

### How scalable vectorization types is expressed in LLVM vectorizer

LLVM has `ScalarVectorType`

to support vectorization of scalable extensions,

in the form of `<vscale x VF x datatype>`

, which VF is the vectorization

factor. The compilation would provide a “minimal vector register length”

(`MinVLen`

) that the vector extension would operate on. The `VF`

would be

realized from `MinVLen`

divided by width of `datatype`

that is to be

vectorized. The remaining variable `vscale`

will be the factor that scales

along with the vector length.

### Vectorization for RISC-V

The RISC-V Vector (RVV) extension extends the concept of scalable vector

further with an extra register grouping parameter - LMUL (abbreviation for

Length Multiplier). It allows vector register grouping so the vector

instruction can operate a longer vector register length. This brings the

possibility of a better vectorized loop.

Consider the scalar loop previously mentioned, we can use **four** vector

registers for the `int32_t`

addition and allow both additions to operate on a

longer stride, and the `int8_t`

addition utilizes the whole vector register.

```
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
/* a single vector register is engaged here */
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[i + 1];
...
a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];
/* four vector registers is engaged here to make it possible */
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[i + 1];
...
d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}
```

To take this further, if we have more spare vector registers, it is possible

that we can double the LMUL and engage even more vector registers, resulting

in a longer stride of the vectorized loop.

```
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
size_t stride = (VLEN * 2) / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
/* two vector registers is engaged here */
a[i] = b[i] + c[i];
a[i + 1] = b[i + 1] + c[i + 1];
...
a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];
/* eight vector registers is engaged here */
d[i] = e[i] + f[i];
d[i + 1] = e[i + 1] + f[i + 1];
...
d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}
```

### How RVV is implemented to LLVM

RISC-V assumes a minimum vector length of 64, along with the LMUL parameter

that lengthens the vector register length, we can derive a chart like the

following that realizes the vectorization factor (VF) into integers.

Element width / LMUL | mf8 | mf4 | mf2 | m1 | m2 | m4 | m8 |
---|---|---|---|---|---|---|---|

8 | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > | <vscale x 16 > | <vscale x 32 > | <vscale x 64 > |

16 | x | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > | <vscale x 16 > | <vscale x 32 > |

32 | x | x | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > | <vscale x 16 > |

32 | x | x | x | <vscale x 1 > | <vscale x 2 > | <vscale x 4 > | <vscale x 8 > |

Note that `mf8`

, `mf4`

, `mf2`

in the chart are fractional LMUL, corresponding

to multiplier of 1/8, 1/4, and 1/2, respectively.

With such type system, the vectorizer is able to express the vectorized loop

that both addition of `int8_t`

and `int32_t`

be processing the same amount of

elements in the same loop. As long as the we are operating on a vector length

that is greater or equal than what we assume for RVV (which is 64), there is no

functional problem in expressing the scalable vectorize types in RVV.

###
Problem of the existing pragma `vectorize_width(VF, scalable)`

Therefore we can say that there is no functional inability in the current

interface `vectorize_width(VF, scalable)`

, However I think this is not the

right interface exposed to the users to manipulate vectorization of RVV.

Users needs to be aware of the chart mentioned above to toggle the correct

vectorization factor that will use the the number of registers as user expects.

Take example of the vectorized loop above, where the `int8_t`

addition is using

two vector register and the `int32_t`

addition using eight. The user should use

`#pragma clang loop vectorize_width(16, scalable)`

.

Under RVV, I think the key cognitive assumption is that users need to

understand that multiple vector registers will engage in the vectorized loop.

Under this cognitive assumption, let me write a vectorized loop of the example

this post has repeatingly visited.

```
using DTYPE0 = int8_t;
using DTYPE1 = int32_t;
int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
int i = 0;
while (i < N) {
int remaining_elem_to_process = N - i;
int vl = vsetvl_e8m2(remaining_elem_to_process);
a = vadd_i8m2(b, c, vl);
d = vadd_i32m8(e, f, vl);
}
```

This function `vsetvl`

takes the number of element to be processed and returns

the number of elements to process in the iteration given the element width and

the number of vector registers engaged. You will realize that `vsetvl_e8m2`

will return the same number as `vsetvl_e32m8`

.

### Proposal

Under such problem, I think we should have a target-specific interface for RVV

that provides a pair of `(element_width, lmul)`

hint to the vectorizer. It is

essentially a syntax sugar to `vectorize_width(VF, scalable)`

, but such

interface will provide the correct semantic that users are informed that

multiple registers will engage in the vectorization. The pair provided will be

equivalent of toggling the element width and LMUL of the call to `vsetvl`

.

We need to have some regulation to enforce correct usage of the interface. User

should be aware of the element width in the loop. User can only specify the

largest element width that is used in the loop. Therefore providing `(16, m2)`

or `(64, m2)`

is not feasible in the example of this post. User should use

`(32, LMUL)`

, for example `(32, m8)`

. The compiler should be responsible of

checking mis-usage of the pragma, emit warning, and ignore the pragma if mis-used.

[0] v-spec

[1] v-spec: 3.4.2. Vector Register Grouping

[2] Patch of SVE extending the interface vectorize_width(VF, scalable)