On vectorization under RISC-V and its existing interface to control scalable vectorization width - vectorize_width(VF, scalable)

I think this is the right place to ask questions that are target-specific to RISC-V, even though the post is about vectorization.


In this post I want to discuss how the vectorizer should act on the RISC-V
Vector (RVV) extension [0]. To make this post self-contained, I will start from
how a scalar loop can be vectorized with an existing scalable vector extension
(e.g. SVE). I will then move on to what we have in RVV, a new concept called
“LMUL” [1] that is introduced to bring more flexibility to vectorization.
Finally, I will cover the most specific interface we have now to manually
control the vectorizer, vectorize_width(VF, scalable) [2], and the problem it
has with respect to RVV.

Let’s assume a perfect world where no elements remain unprocessed after the
vectorized loop, and please don’t be too picky about the pseudo code :slight_smile:

Existing vectorization for other scalable vector extensions

Consider the scalar loop below.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];
for (int i=0; i<N; ++i) {
  a[i] = b[i] + c[i];
  d[i] = e[i] + f[i];
}

SVE has two ways of vectorizing the loop: (1) vectorize after loop
distribution, or (2) vectorize without loop distribution.

(1) Vectorize after loop distribution.

Both loops are able to fully utilize a whole vector register.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride0 = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride0) {
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride0 - 1] = b[i + stride0 - 1] + c[i + stride0 - 1];
}

size_t stride1 = VLEN / sizeof(DTYPE1);
for (int i = 0; i < N; i += stride1) {  
  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride1 - 1] = e[i + stride1 - 1] + f[i + stride1 - 1];
}

(2) Vectorize without loop distribution.

A single vectorized loop has to operate on one common stride, so the number of
elements processed per iteration is limited by the larger data type (DTYPE1
here), and a[i] = b[i] + c[i] cannot fully utilize the whole vector register.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride = min(VLEN / sizeof(DTYPE0), VLEN / sizeof(DTYPE1));
for (int i = 0; i < N; i += stride) {
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];

  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}

How scalable vectorization types are expressed in the LLVM vectorizer

LLVM has ScalableVectorType to support vectorization for scalable extensions,
in the form of <vscale x VF x datatype>, where VF is the vectorization
factor. The compilation target provides a “minimum vector register length”
(MinVLen) that the vector extension is guaranteed to operate on. The VF is
realized as MinVLen divided by the width of the datatype that is to be
vectorized. The remaining variable vscale is the factor that scales along
with the actual vector length.
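
For instance, here is a minimal worked sketch of the derivation; the MinVLen
value of 128 is SVE’s granule, used purely for illustration:

#include <stdio.h>

/* Sketch: derive <vscale x VF x iW> from MinVLen and the element width. */
int main(void) {
  const int min_vlen = 128;              /* assumed granule, e.g. SVE */
  const int elem_width = 32;             /* vectorizing int32_t */
  const int vf = min_vlen / elem_width;  /* 128 / 32 = 4 */
  /* At run time, vscale = VLEN / MinVLen, so each operation processes
     vscale * VF elements. */
  printf("<vscale x %d x i%d>\n", vf, elem_width);
  return 0;
}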

Vectorization for RISC-V

The RISC-V Vector (RVV) extension extends the concept of scalable vectors
further with an extra register grouping parameter - LMUL (short for Length
Multiplier). It allows vector registers to be grouped so that a vector
instruction can operate on a longer effective vector register. This brings the
possibility of a better vectorized loop.

Consider the scalar loop mentioned previously: we can use four vector
registers for the int32_t addition, allowing both additions to operate on a
longer stride, while the int8_t addition utilizes a whole vector register.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride = VLEN / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
  /* a single vector register is engaged here */
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];

  /* four vector registers are engaged here to make this possible */
  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}

To take this further, if we have more spare vector registers, we can double
the LMUL and engage even more vector registers, resulting in a longer stride
for the vectorized loop.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

size_t stride = (VLEN * 2) / sizeof(DTYPE0);
for (int i = 0; i < N; i += stride) {
  /* two vector registers are engaged here */
  a[i] = b[i] + c[i];
  a[i + 1] = b[i + 1] + c[i + 1];
  ...
  a[i + stride - 1] = b[i + stride - 1] + c[i + stride - 1];

  /* eight vector registers are engaged here */
  d[i] = e[i] + f[i];
  d[i + 1] = e[i + 1] + f[i + 1];
  ...
  d[i + stride - 1] = e[i + stride - 1] + f[i + stride - 1];
}

How RVV is implemented in LLVM

RISC-V assumes a minimum vector length of 64. Along with the LMUL parameter
that lengthens the effective vector register, we can derive a chart like the
following that realizes the vectorization factor (VF) for each (element
width, LMUL) pair.

Element width \ LMUL | mf8            mf4            mf2            m1             m2              m4              m8
8                    | <vscale x 1>   <vscale x 2>   <vscale x 4>   <vscale x 8>   <vscale x 16>   <vscale x 32>   <vscale x 64>
16                   | x              <vscale x 1>   <vscale x 2>   <vscale x 4>   <vscale x 8>    <vscale x 16>   <vscale x 32>
32                   | x              x              <vscale x 1>   <vscale x 2>   <vscale x 4>    <vscale x 8>    <vscale x 16>
64                   | x              x              x              <vscale x 1>   <vscale x 2>    <vscale x 4>    <vscale x 8>

Note that mf8, mf4, and mf2 in the chart are fractional LMULs, corresponding
to multipliers of 1/8, 1/4, and 1/2, respectively.
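
Each cell in the chart follows from VF = MinVLen * LMUL / SEW, and cells where
the quotient falls below one element are the x entries. A minimal sketch,
assuming MinVLen = 64 as above, reproduces the chart:

#include <stdio.h>

/* Sketch: VF = MinVLen * LMUL / SEW, with LMUL counted in eighths
   (mf8 = 1/8, ..., m1 = 8/8, ..., m8 = 64/8). VF < 1 is an invalid (x) cell. */
int main(void) {
  const int min_vlen = 64;
  const int sews[] = {8, 16, 32, 64};
  const int lmul_eighths[] = {1, 2, 4, 8, 16, 32, 64}; /* mf8 .. m8 */
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 7; ++j) {
      int vf_eighths = min_vlen * lmul_eighths[j] / sews[i];
      if (vf_eighths >= 8)
        printf("SEW=%-2d LMUL=%2d/8 -> <vscale x %d>\n",
               sews[i], lmul_eighths[j], vf_eighths / 8);
      else
        printf("SEW=%-2d LMUL=%2d/8 -> x\n", sews[i], lmul_eighths[j]);
    }
  }
  return 0;
}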

With such a type system, the vectorizer is able to express a vectorized loop
in which the int8_t addition and the int32_t addition both process the same
number of elements per iteration. As long as we are operating on a vector
length greater than or equal to what we assume for RVV (which is 64), there is
no functional problem in expressing the scalable vectorized types for RVV.

The problem with the existing pragma vectorize_width(VF, scalable)

Therefore we can say that there is no functional limitation in the current
interface vectorize_width(VF, scalable). However, I think this is not the
right interface to expose to users for manipulating vectorization on RVV.

Users need to be aware of the chart mentioned above to pick the vectorization
factor that engages the number of registers they expect. Take the vectorized
loop above, where the int8_t addition uses two vector registers and the
int32_t addition uses eight: the user would have to write
#pragma clang loop vectorize_width(16, scalable).
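
Applied to the running example, that looks like the following; VF = 16 comes
straight from the chart, where SEW=8 at m2 and SEW=32 at m8 both map to
vscale x 16:

#pragma clang loop vectorize_width(16, scalable)
for (int i = 0; i < N; ++i) {
  a[i] = b[i] + c[i]; /* i8  -> <vscale x 16 x i8>,  LMUL = 2 */
  d[i] = e[i] + f[i]; /* i32 -> <vscale x 16 x i32>, LMUL = 8 */
}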

Under RVV, I think the key cognitive assumption is that users need to
understand that multiple vector registers will be engaged in the vectorized
loop. Under this assumption, let me write a vectorized loop for the example
this post has repeatedly visited.

using DTYPE0 = int8_t;
using DTYPE1 = int32_t;

int N;
DTYPE0 a[N], b[N], c[N];
DTYPE1 d[N], e[N], f[N];

int i = 0;
while (i < N) {
  int remaining_elem_to_process = N - i;
  int vl = vsetvl_e8m2(remaining_elem_to_process);
  /* loads and stores elided: the 8-bit adds run at LMUL=2, the 32-bit adds at LMUL=8 */
  a = vadd_i8m2(b, c, vl);
  d = vadd_i32m8(e, f, vl);
  i += vl; /* advance by the number of elements processed this iteration */
}

The vsetvl function takes the number of elements remaining to be processed and
returns the number of elements to process in this iteration, given the element
width and the number of vector registers engaged. You will notice that
vsetvl_e8m2 returns the same number as vsetvl_e32m8.
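
The equality is plain arithmetic: the upper bound on vl is
VLMAX = LMUL * VLEN / SEW, and both configurations keep the same ratio of LMUL
to SEW. A minimal sketch, assuming VLEN = 128 purely for illustration:

#include <stdio.h>

/* Sketch: VLMAX = LMUL * VLEN / SEW is identical for e8m2 and e32m8. */
int main(void) {
  const int vlen = 128;                  /* assumed hardware VLEN */
  const int vlmax_e8m2 = 2 * vlen / 8;   /* LMUL=2, SEW=8  -> 32 */
  const int vlmax_e32m8 = 8 * vlen / 32; /* LMUL=8, SEW=32 -> 32 */
  printf("e8m2: %d, e32m8: %d\n", vlmax_e8m2, vlmax_e32m8);
  return 0;
}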

Proposal

Given this problem, I think we should have a target-specific interface for RVV
that provides a pair of (element_width, lmul) hints to the vectorizer. It is
essentially syntactic sugar for vectorize_width(VF, scalable), but such an
interface provides the correct semantics: users are informed that multiple
registers will be engaged in the vectorization. The pair provided is
equivalent to setting the element width and LMUL of the call to vsetvl.

We need some rules to enforce correct usage of the interface. The user should
be aware of the element widths used in the loop and may only specify the
largest element width that appears in it. Therefore providing (16, m2) or
(64, m2) is not valid for the example in this post; the user should use
(32, LMUL), for example (32, m8). The compiler should be responsible for
checking misuse of the pragma, emitting a warning, and ignoring the pragma if
it is misused.
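
A hypothetical sketch of how such a hint could read on the running example;
the spelling rvv_vectorize_width is illustrative only, not a concrete syntax
proposal:

/* Hypothetical spelling of the proposed (element_width, lmul) hint. */
#pragma clang loop rvv_vectorize_width(32, m8) /* anchor: SEW=32 at LMUL=8 */
for (int i = 0; i < N; ++i) {
  a[i] = b[i] + c[i]; /* i8 scales down to LMUL = 2 */
  d[i] = e[i] + f[i]; /* i32 runs at LMUL = 8, as requested */
}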

[0] v-spec

[1] v-spec: 3.4.2. Vector Register Grouping

[2] Patch of SVE extending the interface vectorize_width(VF, scalable)

Thanks for raising this. How to handle LMUL from a vectorization perspective is a topic which has already gotten a fair amount of discussion, but the aspect you raise here - the pragma interface - hasn’t been discussed yet to my knowledge.

As a practical matter, I want to call out that our LMUL > 1 codegen is currently quite poor. Our cost model basically ignores LMUL - we know this is wrong. We have no good answer for register allocation at LMUL=8. More generally, no one has really pushed on this part of codegen to my knowledge. In my view, a lot of work needs to happen here before a user can reasonably expect non-LMUL1 code to perform well.

Long term, I expect the vectorizer to pick an appropriate LMUL. We currently don’t have this part implemented at all, and thus default to LMUL1. Even once we have a robust cost model and selection heuristic, having an override mechanism for the user is likely worthwhile, but I think it’s important to note that long term very few loops should need such an override.

As for the pragma itself, I have to admit I find your proposal slightly confusing. (As in, I think I didn’t understand something.) Are you meaning to propose something which simply adds a level of error checking on top of the existing pragma? Your example seems to still assume that particular vector width values map to particular LMULs. Is the explicit LMUL there as a consistency check? Or am I missing something?

If you want, I’m happy to brainstorm this some offline. We can either chat directly, or this would be a good topic for the next RISC-V sync up call.

Thank you for the swift reply.

I am proposing syntactic sugar like

#pragma rvv lmul_sew(m1, e64)

that still respects the chart I mentioned above, which maps to a corresponding VF. This essentially achieves no more than the existing clang loop vectorize_width(VF, scalable), but the justification is that the proposed interface gives target-specific users the correct semantics in the RVV domain to control the vectorization factor.

With the proposed pragma, the vectorizer would check the pair specified by the pragma: the specified element width must actually be used inside the loop, or else the vectorizer would ignore the pragma (and emit a debug log during compilation that the pragma was ignored, rather than a Clang semantic check). The Clang semantic check should reject invalid pairs, i.e. those that map to an x in the chart above mapping (lmul, sew) pairs to VF, for example (mf8, e64).

Attempting to restart this discussion.

This sentence from the original post is misstated.

RISC-V assumes a minimum vector length of 64

RISC-V isn’t the one assuming something here. It’s LLVM. For Zve32* LLVM should be assuming a minimum vector length of 32 and using a different type mapping, but we’re not doing that today.

The #pragma vectorize_width interface requires the user to know the magic 64 constant, and if we fix the Zve32 type mapping, they would need a different VF to get the same result. I think we need a way to express the VF that hides these details.

The key concept for the RISC-V vectorization factor is the ratio between SEW and LMUL.

The pragma here is intended to give the user a way to say that a particular element type should use a register group with a particular LMUL. This sets the SEW/LMUL ratio. All other element sizes will scale their LMUL using this ratio.

Behind the scenes the compiler can map this to the VF and vscale type mapping we’ve defined, including the case where Zve32 has a different type mapping than Zve64/V.
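
A minimal sketch of that scaling rule, assuming a SEW/LMUL ratio of 32 (e.g.
from saying “i32 uses LMUL=1”):

#include <stdio.h>

/* Sketch: once the SEW/LMUL ratio is fixed, every element width gets its
   LMUL from LMUL = SEW / ratio (printed in eighths to show fractions). */
int main(void) {
  const int ratio = 32;                /* e.g. "i32 uses LMUL=1" */
  const int sews[] = {8, 16, 32, 64};
  for (int i = 0; i < 4; ++i)
    printf("SEW=%-2d -> LMUL=%d/8\n", sews[i], sews[i] * 8 / ratio);
  return 0;
}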

@rofirrim @frasercrmck @preames @lukel Do you have any thoughts on this?

If I’m understanding correctly, the issue here is that when you specify “scalable”, it’s not clear what it’s scaling relative to; for SVE, obviously the registers scale in multiples of 128. But for RISCV, the scaling is less obvious, particularly with the existence of Zve32.

The key insight here is that instead of specifying the total number of lanes, you can specify how many registers values of a given element size use. This isn’t necessarily specific to RISCV; it’s just especially confusing for RISCV.

I guess you want to allow the user to specify the element size they’re thinking about so they don’t have to consider fractions. So you get syntax like, say, #pragma clang loop vectorize_width(32, num_regs, 1), to say “pick a vector factor so that vectors with 32-bit elements use one register”. This should have an unambiguous meaning for all the vector instruction sets I can think of.
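
On the mixed i8/i32 loop from the original post, that hypothetical spelling
would read as follows (a sketch; this pragma form does not exist today):

/* Hypothetical: “pick a VF so that 32-bit vectors use one register”. */
#pragma clang loop vectorize_width(32, num_regs, 1)
for (int i = 0; i < N; ++i) {
  a[i] = b[i] + c[i]; /* i8  -> a quarter of a register (LMUL = 1/4 on RVV) */
  d[i] = e[i] + f[i]; /* i32 -> one whole register (LMUL = 1) */
}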

I don’t think you need to bring in RISCV-specific concepts at this level; unless you’re very familiar with the instruction set, “LMUL” isn’t an intuitive concept.


Alternatively, if we really don’t trust the backend to do reasonable things with over-wide vectors, I guess we can specify a way to micro-manage the vectorizer for RISCV. For example, something like #pragma rvv_vectorize(num_regs(8, mf2), num_regs(16, m1), num_regs(32, m1), num_regs(64, m1)) to explicitly specify that 8-bit operations use lmul mf2, 16-bit operations use lmul m1, 32-bit operations use lmul m1, and 64-bit operations use lmul m1. Then the vectorizer would pick a vector factor and unroll operations as necessary to honor the user’s specification. I’m not sure how important it is to micromanage the use of lmul along these lines.

I think efriedma’s idea is good: it is more general and suits more than just RISC-V.

I also think @efriedma-quic’s suggestion is reasonable.

In fact, it goes in the same spirit as the discussion that is also happening internally in the OpenMP community regarding the simdlen clause of the simd directives.

The suggestion there is along the lines of vectorize_width(scalable(size-of-element), factor), which I think can capture Eli’s suggestion. The size-of-element could be the size in bits, and factor is a multiple of that. In general those values can be mapped to sensible values of LMUL.

For instance, vectorize_width(scalable(64), 1) could map vectors of 64-bit elements to (ungrouped) vector registers of RISC-V. vectorize_width(scalable(64), 2) would map them to groups of two (LMUL=2 in RVV parlance). One potential (notational) downside of this approach is that vectorize_width(scalable(32), 1) is in practice the same as vectorize_width(scalable(64), 2), in that vectors of 64-bit elements will be mapped to groups of two registers while vectors of 32-bit elements use a single register (which is made obvious by vectorize_width(scalable(32), 1)). But it also allows us to express “use part of a register for smaller data types”, if indirectly. Back to vectorize_width(scalable(64), 1): vectors of 32-bit elements would be mapped to half a register (LMUL=½ in RVV parlance).

I am a little confused about your additional proposal. What are the semantics behind scalable(size-of-element)? Is it suggesting the number of registers used in vectorization, just like efriedma’s proposal? Could you explain more? Thank you.

Sorry, I didn’t express myself clearly.

The problem, as you stated above, is that vectorize_width(VF, scalable) does not correctly capture the vectorization options that we have in RISC-V.

Following your example of a loop using mixed data types (i32 and i8), the goal is that the vector instructions operate on the same number of elements. As you observe, there is a ratio of 4 between the two element widths. In RISC-V we have a lot of flexibility: LMUL=1 for i32 and LMUL=¼ for i8, LMUL=2 for i32 and LMUL=½ for i8, LMUL=4 for i32 and LMUL=1 for i8, and even LMUL=8 for i32 and LMUL=2 for i8.

(As mentioned above by @preames, LMUL > 1 is convenient for the code generation but may be difficult to capture in the cost model (i.e. we are not currently modelling that LMUL > 1 will likely increase the latency of instructions by a factor O(LMUL)).)

How I see it, vectorize_width(VF, scalable) does not carry enough information, so either the compiler makes the decision (i.e. it unconditionally chooses one of the possible options) or we let the user express it. We can always have the former option with the current syntax. Let’s discuss the latter option: the user makes the choice.

As an example, implementors of the VPU at BSC have already told me that LMUL > 1 performance won’t be great. For those machines, you want to vectorize in a way that “anchors” on the largest element size of the vector loop (in the example, i32), so you get LMUL=1 for i32 and LMUL=¼ for i8. We could express this with vectorize_width(VF, scalable(32)) (or some other equivalent notation that captures this idea of the size of the element).

For another example, a machine may have good performance (as in, it scales nicely) with LMUL > 1, and the loop may use few registers. In that case it may make sense to use, say, LMUL=4 for i32 and LMUL=1 for i8; you could do that with vectorize_width(VF, scalable(8)).
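
Side by side, the two anchoring choices would read something like this
(hypothetical notation from this thread, not an implemented syntax):

/* Anchor on the largest element size - for machines where LMUL > 1 scales poorly: */
#pragma clang loop vectorize_width(VF, scalable(32)) /* i32: LMUL=1, i8: LMUL=1/4 */
for (int i = 0; i < N; ++i) { /* ... mixed i8/i32 body ... */ }

/* Anchor on the smallest element size - for machines where LMUL > 1 scales well: */
#pragma clang loop vectorize_width(VF, scalable(8))  /* i32: LMUL=4, i8: LMUL=1 */
for (int i = 0; i < N; ++i) { /* ... mixed i8/i32 body ... */ }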

If I’m understanding correctly, the issue here is that when you specify “scalable”, it’s not clear what it’s scaling relative to; for SVE, obviously the registers scale in multiples of 128. But for RISCV, the scaling is less obvious, particularly with the existence of Zve32.

Yes. To be more specific, someone who is familiar with RVV may think about vectorization in terms of LMUL and SEW, so

#pragma clang loop vectorize_width(4, scalable)
for (int i = 0; i < n; ++i) {
  r += a[i];
}

from the user’s perspective won’t have that RVV meaning, unless the user is familiar with LLVM’s encoding.
vec-report could also be unclear for the user:

<source>:5:5: remark: vectorized loop (vectorization width: vscale x 4, interleaved count: 1)

If we start to consider Zve32, or if the encoding somehow changed to a point where <vscale x 4 x i32> encodes LMUL=2, SEW=32, but <vscale x 4 x i64> encodes LMUL=2, SEW=64, then that pragma on a loop with both i32 and i64 computation inside makes it impossible for the user to understand what the vectorization is going to look like until the generated code is checked.

That said, the purpose of this discussion started by @eopXD is to create a user-convenient pragma for RVV and hide implementation details (i.e. the encoding) from non-compiler developers.

I guess you want to allow the user to specify the element size they’re thinking about so they don’t have to consider fractions. So you get syntax like, say, #pragma clang loop vectorize_width(32, num_regs, 1), to say “pick a vector factor so that vectors with 32-bit elements use one register”. This should have an unambiguous meaning for all the vector instruction sets I can think of.

There the suggestion there is in the line of vectorize_width(scalable(size-of-element), factor) which I think can capture Eli’s suggestion. The size-of-element could be the size in bits and factor is a multiple of that. In general those values can be mapped to sensible values of LMUL.

These two alternatives are really good. Obviously, being aligned with OpenMP is a huge advantage. Also, having something that another target, like ARM, can use is great.

However, for a bit I’d like to play “devil’s advocate”. The original proposal has advantages over vectorize_width(scalable(size-of-element), num-regs):

  1. It proposes named constants that RVV defines.
  2. It’s not possible to use the (LMUL, SEW) pair in unsupported or ambiguous ways, unlike the numeric form:
  • vectorize_width(128, 1) - is it unsupported, or equivalent to vectorize_width(64, 2)?
  • vectorize_width(32, 16) - is it unsupported, or equivalent to vectorize_width(32, 8) AND interleave=2?
  • vectorize_width(32, 3) - is it unsupported, or is the vectorizer supposed to choose LMUL=4 and mask off the last register, or clamp AVL so that the chosen VL = 3 * VLEN/SEW?
  3. It looks like right now #pragma clang loop vectorize_width(x, scalable) is ARM-specific; on X86 or any other target the pragma is ignored in trunk. That, plus the fact that it seems redundant to introduce another pragma for ARM, makes an RVV-specific pragma seem OK.

@rofirrim

For another example, a machine may have good performance (as in, it scales nicely) with LMUL > 1, and the loop may use few registers. In that case it may make sense to use, say, LMUL=4 for i32 and LMUL=1 for i8; you could do that with vectorize_width(VF, scalable(8)).

In this sense, is the VF here redundant? We are pivoting on one of the data widths (in your example, choosing between 8 and 32) and letting the other data width(s) scale to more or fewer registers (in RVV terms, to a larger or smaller LMUL) based on the pivot.

Scalable vector architectures can have different register lengths, so I think continuing to use VF is not accurate here, and I would lean more towards efriedma’s proposal, which specifies the registers explicitly.


I think @nikolaypanchenko has a point that right now #pragma clang loop vectorize_width(x, scalable) is ARM-specific. However, I think it is still beneficial to try to introduce something general rather than adding another target-specific pragma.

The vector factor/vectorization width is specifically “how many elements per iteration”; it doesn’t directly determine what instructions should be used. Making the vector factor narrower saves code size and register pressure; making the vector factor wider reduces overhead from loop control structures.

A lot of the discussion here seems to be mixing the vector factor with LMUL. They are not the same thing. The vector factor is a property of the whole loop; it’s how many elements are processed per iteration. LMUL is a property of individual generated instructions; it’s how many elements a given instruction processes. The backend should automatically choose an appropriate LMUL for each instruction. For example, on a target where LMUL>1 instructions aren’t efficient, we might want to lower an operation to multiple LMUL=1 instructions.

If we want to allow users to explicitly specify LMUL, it has to be a separate thing. There isn’t any way to tell what the user wants given just the vector factor, however it’s expressed. Given, for example, #pragma clang loop vectorize_width(8, num_regs, 1), we can tell each iteration processes 1 register worth of 8-bit elements, i.e. 4 registers worth of 32-bit elements. But we can’t tell whether the user wants a 32-bit add to be one LMUL=4 operation, two LMUL=2 operations, four LMUL=1 operations, or eight LMUL=1/2 operations.
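
To illustrate the distinction, here is a hand-written sketch using the
ratified RVV C intrinsics (not vectorizer output): the same stripmined 32-bit
add written once with LMUL=4 instructions and once with LMUL=1 instructions,
which is exactly the kind of choice the backend should be free to make.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* One LMUL=4 add per stripmined iteration. */
void add_m4(int32_t *d, const int32_t *e, const int32_t *f, size_t n) {
  for (size_t i = 0; i < n;) {
    size_t vl = __riscv_vsetvl_e32m4(n - i);
    vint32m4_t ve = __riscv_vle32_v_i32m4(e + i, vl);
    vint32m4_t vf = __riscv_vle32_v_i32m4(f + i, vl);
    __riscv_vse32_v_i32m4(d + i, __riscv_vadd_vv_i32m4(ve, vf, vl), vl);
    i += vl;
  }
}

/* The same work as LMUL=1 adds; a backend that distrusts LMUL > 1 could
   emit this form instead, at the cost of more instructions. */
void add_m1(int32_t *d, const int32_t *e, const int32_t *f, size_t n) {
  for (size_t i = 0; i < n;) {
    size_t vl = __riscv_vsetvl_e32m1(n - i);
    vint32m1_t ve = __riscv_vle32_v_i32m1(e + i, vl);
    vint32m1_t vf = __riscv_vle32_v_i32m1(f + i, vl);
    __riscv_vse32_v_i32m1(d + i, __riscv_vadd_vv_i32m1(ve, vf, vl), vl);
    i += vl;
  }
}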

Hi @eopXD,

You’re right. In this case the VF is not useful and @nikolaypanchenko already pointed to some issues with that syntax.

This would leave us with vectorize_width(VF, scalable), where the compiler chooses a mapping. That may be a bit obscure for the user, but no less so than it is now in the fixed-vector world, and you argued against that above :smiley:.

We can pivot around the element type with something like vectorize_width(scalable(element-size)), which I think is close to what @efriedma-quic means with vectorize_width(element-size, num_regs, 1).

Now, I think that maybe we do not need to specify the number of registers: if we pivot on a smaller element size, the larger element sizes will simply require more registers, even if the loop does not use the pivot element size. For instance, vectorize_width(scalable(32)) in a loop that operates on 64-bit data will map the 64-bit vectors to LMUL=2.