On vectorization under RISC-V and its existing interface to control scalable vectorization width - vectorize_width(VF, scalable)

If I’m understanding correctly, the issue here is that when you specify “scalable”, it’s not clear what it’s scaling relative to; for SVE, obviously the registers scale in multiples of 128. But for RISCV, the scaling is less obvious, particularly with the existence of Zve32.

Yes. To be more specific, someone who is familiar with RVV may think about vectorization in terms of LMUL and SEW, so

#pragma clang loop vectorize_width(4, scalable)
for (int i = 0; i < n; ++i) {
  r += a[i];
}

won’t carry that RVV meaning from the user’s perspective, unless the user is familiar with LLVM’s encoding.
The vectorization report could also be unclear to the user:

<source>:5:5: remark: vectorized loop (vectorization width: vscale x 4, interleaved count: 1)

If we start to consider Zve32, or if the encoding somehow changed to a point where <vscale x 4 x i32> encodes LMUL=2, SEW=32 but <vscale x 4 x i64> encodes LMUL=2, SEW=64, then that pragma on a loop containing both i32 and i64 computation makes it impossible for the user to understand what the vectorization will look like until the generated code is checked.
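To make the encoding concern concrete, here is a minimal sketch of how a scalable type maps to LMUL. It assumes LLVM's RISC-V backend works the way I understand it today, with 64 bits per vscale unit (vscale = VLEN / 64). The helper name and the struct are hypothetical and exist only for illustration.

```c
/* Assumption: one vscale unit is 64 bits on RISC-V in LLVM (vscale = VLEN / 64). */
enum { RVV_BITS_PER_BLOCK = 64 };

/* LMUL as a fraction num/den (hypothetical type for this sketch). */
struct lmul { unsigned num, den; };

/* LMUL implied by <vscale x min_elts x i(sew_bits)>.
   Assumes min_elts and sew_bits are powers of two, as in legal RVV types. */
struct lmul lmul_of_scalable_type(unsigned min_elts, unsigned sew_bits) {
    struct lmul r = {0, 1};
    unsigned bits = min_elts * sew_bits;   /* known-minimum size in bits */
    if (bits == 0)
        return r;                          /* not a valid vector type */
    if (bits >= RVV_BITS_PER_BLOCK) {
        r.num = bits / RVV_BITS_PER_BLOCK; /* whole registers: LMUL = 1, 2, 4, 8 */
    } else {
        r.num = 1;                         /* fractional LMUL: 1/2, 1/4, 1/8 */
        r.den = RVV_BITS_PER_BLOCK / bits;
    }
    return r;
}
```

Under that assumption, vectorize_width(4, scalable) already means different things per element type: <vscale x 4 x i32> comes out as LMUL=2, while <vscale x 4 x i64> comes out as LMUL=4.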

That said, the purpose of this discussion, started by @eopXD, is to make a user-friendly pragma for RVV and to hide implementation details (i.e. the encoding) from non-compiler developers.

I guess you want to allow the user to specify the element size they’re thinking about so they don’t have to consider fractions. So you get syntax like, say, #pragma clang loop vectorize_width(32, num_regs, 1), to say “pick a vector factor so that vectors with 32-bit elements use one register”. This should have an unambiguous meaning for all the vector instruction sets I can think of.

The suggestion there is along the lines of vectorize_width(scalable(size-of-element), factor), which I think can capture Eli’s suggestion. The size-of-element could be the size in bits, and factor is a multiple of that. In general those values can be mapped to sensible values of LMUL.
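As a rough sketch of how such an (element size, factor) pair could be lowered for RVV, the helper below computes the N in <vscale x N x iSEW>. It assumes factor counts whole vector registers and that a vscale unit is 64 bits, as I believe LLVM's RISC-V backend defines it. The function name is hypothetical, and other targets would need a different constant.

```c
/* Assumption: 64 bits per vscale unit on RISC-V in LLVM (vscale = VLEN / 64). */
enum { BITS_PER_VSCALE_UNIT = 64 };

/* Hypothetical helper: returns N such that <vscale x N x i(sew_bits)>
   occupies num_regs registers, or 0 if no whole N exists. */
unsigned scalable_min_elts(unsigned sew_bits, unsigned num_regs) {
    unsigned bits;
    if (sew_bits == 0)
        return 0;
    bits = BITS_PER_VSCALE_UNIT * num_regs; /* minimum bits across num_regs registers */
    return (bits % sew_bits == 0) ? bits / sew_bits : 0;
}
```

With these assumptions, (32, 1) corresponds to vscale x 2, (32, 2) to vscale x 4, and (64, 2) to vscale x 2, so the user never has to think in terms of the vscale multiplier directly.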

These 2 alternatives are really good. Obviously, being aligned with OpenMP is a huge advantage. Also having something that some other target, like ARM, can use is great.

However, I’d like to play devil’s advocate for a bit. The original proposal has some advantages over vectorize_width(scalable(size-of-element), num-regs):

  1. It proposes named constants that RVV defines.
  2. It’s not possible to use that pair of LMUL and SEW in unsupported or ambiguous ways:
  • vectorize_width(128, 1): is it unsupported, or equivalent to vectorize_width(64, 2)?
  • vectorize_width(32, 16): is it unsupported, or equivalent to vectorize_width(32, 8) AND interleave=2?
  • vectorize_width(32, 3): is it unsupported, or is the vectorizer supposed to choose LMUL=4 and either mask off the last register or clamp AVL so that the chosen VL = 3 * VLEN/SEW?
  3. It looks like #pragma clang loop vectorize_width(x, scalable) is currently ARM-specific: on X86 or any other target the pragma is ignored in trunk. Given that, and since it seems redundant to introduce another pragma for ARM, an RVV-specific pragma seems to be OK.
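The questionable cases in point 2 can be sketched as a small classifier. It does not decide what the compiler should do with them; it only flags which requests fall outside the legal RVV space, assuming SEW in {8, 16, 32, 64} and whole-register LMUL in {1, 2, 4, 8}. All names here are hypothetical.

```c
#include <stdbool.h>

/* Assumed RVV limits: ELEN = 64 bits, largest LMUL = 8. */
enum { RVV_MAX_SEW = 64, RVV_MAX_LMUL = 8 };

/* Hypothetical classification of a (sew_bits, num_regs) request. */
enum width_request {
    REQ_OK,               /* maps directly onto a legal SEW/LMUL pair */
    REQ_BAD_SEW,          /* SEW is not 8, 16, 32 or 64 */
    REQ_TOO_MANY_REGS,    /* num_regs exceeds the largest LMUL */
    REQ_NOT_POWER_OF_TWO  /* num_regs is not a power of two */
};

bool is_power_of_two(unsigned x) {
    return x != 0 && (x & (x - 1)) == 0;
}

enum width_request classify_width_request(unsigned sew_bits, unsigned num_regs) {
    if (sew_bits < 8 || sew_bits > RVV_MAX_SEW || !is_power_of_two(sew_bits))
        return REQ_BAD_SEW;           /* e.g. (128, 1) */
    if (!is_power_of_two(num_regs))
        return REQ_NOT_POWER_OF_TWO;  /* e.g. (32, 3) */
    if (num_regs > RVV_MAX_LMUL)
        return REQ_TOO_MANY_REGS;     /* e.g. (32, 16) */
    return REQ_OK;
}
```

Each non-OK result is a place where the generic pragma would need a documented rule (reject, split into interleaving, or mask), whereas named LMUL/SEW constants simply cannot express these requests.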