[RFC] Remove codegen support for trivial VP intrinsics in the RISC-V backend

A large amount of effort has been spent supporting every VP intrinsic in the RISC-V backend. However for a large portion of these, nothing in upstream LLVM uses them today and it seems unlikely they will be used in future.

This RFC proposes to remove support for these unused VP intrinsics in the RISC-V backend, both to simplify the backend and also to ensure that development time isn’t misspent working on them.

This does not propose to remove the intrinsics themselves, but it does open up the possibility of doing so which is deferred to another RFC.

Background

VP intrinsic support was added early on during the development of RVV support, and my understanding is it was largely to used to control vl from the loop vectorizer via the EVL argument.

There’s two reasons main why we want to control vl:

  1. To prevent trapping on unused lanes for loads/stores/divides and correctly perform permutations like reverses on dynamic vector lengths, reductions etc.
  2. To improve performance on some microarchitectures by not executing on lanes that aren’t needed, and avoid vsetvli toggles in general.

So the loop vectorizer used to convert every vectorized instruction to a VP intrinsic on RISC-V[1] for both the correctness and performance reasons above.

However last year the RISCVVLOptimizer pass was upstreamed, which takes care of the performance aspect by optimizing vl to only what’s demanded at the MIR level. Because it operates after instruction selection, it means that regular non-VP instructions also end up having optimized VLs.

So this meant that we could just emit regular instructions instead of VP intrinsics in a lot of cases, which avoided the issue of how most of InstCombine/InstSimplifty/DAGCombine isn’t yet lifted to work on VP intrinsics, and improved codegen significantly.

Proposal

The subset of intrinsics that we were able to swap out for instructions are the ones that are only needed for performance, e.g. llvm.vp.add. These only replace disabled lanes with poison, so it’s correct to just replace it with e.g. a regular add instruction. These intrinsics are what this RFC refers to as “trivial” VP intrinsics, since they fall into the same category as those in llvm::isTriviallyVectorizable.

Now that the loop vectorizer no longer emits these, there are no other users of these intrinsics now within LLVM upstream to the best of my knowledge. So it’s possible to remove codegen support for these in the RISC-V backend without affecting e.g. Clang or Flang.

This means it’s possible to remove a significant amount of VP related code in RISCVISelLowering.cpp, and reduce the number of different ways we have of expressing vector semantics from 3 to 2 (the others being regular LLVM IR and RVV intrinsics).

It also allows us to massively reduce the number of tests in llvm/test/CodeGen/RISCV/rvv. For every scalable and fixed vector test case, we have a corresponding -vp.ll test, 168 in total. A quick wc -l shows that these tests alone are 215k lines, and most of these are for trivial intrinsics.

Asides from the code cleanups, the other main major benefit is that it would clarify the general direction of the RISC-V backend to avoid trivial VP intrinsics and prevent any redundant work on trying to improve support for them. There’s been a few PRs in the last year in this area, and the effort is probably better directed elsewhere[2]

Intrinsics considered trivial

This would be every VP intrinsic bar the following:

  • llvm.vp.{load,store,gather,scatter,strided.load,strided.store,load.ff}
  • llvm.vp.merge: this has special semantics where the lanes past EVL aren’t poison
  • llvm.vp.{udiv,sdiv,urem,srem}: these mask off UB in disabled lanes
  • llvm.vp.reduce.* and llvm.vp.cttz.elts: disabled lanes affect the computed result
  • llvm.vp.{splice,splat,reverse}: these are permutations which aren’t easily expressible with shuffle vectors (splat is probably removable, but it’s not trivial)

Potential VP intrinsic users

The loop vectorizer and SLP vectorizer continue to use some of the memory and permutation VP intrinsics needed for correctness, but these are not considered trivial intrinsics.

LoopIdiomVectorize.cpp uses a vp.icmp, but that can be replaced with a regular icmp.

The only public out-of-tree user that I’m aware of that uses VP intrinsics is the region vectorizer. But in theory it should be able to follow the loop vectorizer and just emit regular instructions without any detriment to code quality.

Any language frontend that specifically wants to control vl and masking should just use the RVV intrinsics instead. I searched around and couldn’t find anything that was already using VP intrinsics though. And given that the only targets that implement VP intrinsics are RISC-V and VE, it seems unlikely that a frontend would be relying on it in a target agnostic way.

From the MLIR side I’m only aware of this one thread mentioning VP intrinsics, but I’m not sure if anything ever made it upstream.

I’m aware that some downstream LLVM forks use VP intrinsics more heavily. I would like to hear their opinion on this. Hopefully if the RISCVVLOptimizer is also enabled downstream then this change won’t be disruptive.

Alternatives considered

We could just leave the existing codegen support in and not develop it any further. It’s not a major maintenance burden, but the main concern would be that it gives more time for other users to start using and relying on these intrinsics.

Future work

The current future of VP in LLVM is somewhat uncertain. As of 2025 we are still in the “lift InstCombine” stage of the roadmap, and there hasn’t been any recent progress on this.

Currently RISC-V and VE are the only targets that support VP intrinsics. If this RFC goes ahead, then VE will be the only target to implement the trivial VP intrinsics.

It’s likely that VE could also add something similar to RISC-V’s RISCVVLOptimizer, potentially sharing some of the analysis infrastructure and even extending it to propagate masks[3].

By splitting the concept of predication into what is needed for correctness and what is needed for performance, and delegating the latter to a late MIR pass, we could remove trivial VP intrinsics altogether. This would significantly reduce the scope of VP intrinsics to those that are only needed for correctness.


  1. With EVL tail folding, which at the time wasn’t enabled by default. As of today it is now enabled by default. ↩︎

  2. [LegalizeTypes][VP] Teach isVPBinaryOp to recognize vp.sadd/saddu/ssub/ssubu.sat by tclin914 · Pull Request #154047 · llvm/llvm-project · GitHub https://github.com/llvm/llvm-project/pull/125991 [RISCV][TTI] Implement cost for vp min/max intrinsics by arcbbb · Pull Request #107567 · llvm/llvm-project · GitHub [RISCV] Lower VP_SELECT constant false to use vmerge.vxm/vmerge.vim by ChunyuLiao · Pull Request #144461 · llvm/llvm-project · GitHub https://github.com/llvm/llvm-project/pull/133245 https://github.com/llvm/llvm-project/pull/132345 ↩︎

  3. It’s worth noting that this idea of computing demanded elements and propagating the information backwards has been explored before https://llvm.org/devmtg/2023-05/slides/Posters/01-Albano-VectorPredictionPoster.pdf ↩︎

6 Likes

I believe most of if not all of the mainstream MLIR frontends still generate fixed vector when targeting RISC-V, so it’s unlikely VP intrinsics are prevailing at the MLIR side when it comes to RISC-V.

At the MLIR level, our initial plan was to introduce an RVV dialect to explicitly control vl, but after discussions we concluded that using VP intrinsics would be a more general solution. Based on that, we tested VP intrinsic coverage and contributed some fixes for RVV backend. Since then, we have been using VP intrinsics in our research and downstream projects, where explicit control of vl has provided room for performance tuning. As far as I know, VP intrinsics are currently the only upstream mechanism that allows explicit vl control in MLIR. If VP intrinsics were no longer maintained in the RISC-V backend, we might need to reconsider enabling an RVV dialect.

I will take a closer look at the RISCVVLOptimizer pass to see whether it provides the same performance benefits as explicit vl control.

I’d like to know what kind of work is required to maintain VP intrinsics for the RVV backend, and how much effort and resources it typically takes.

Only trivial VP intrinsics would be removed from the RISC-V backend. I don’t think an RVV dialect would be necessary, since the only change required would be to emit regular LLVM IR instructions asides from the intrinsics listed here. It’s also very possible that you may even see codegen improvements, similar to here.

It’s not much of a burden to keep the existing support in-tree. However I would argue that it’s very hard to justify any further upstream work on it given that nothing upstream uses these intrinsics. Given that I think it’s better to explicitly remove support for it earlier rather than later and leaving it to bitrot.

I have no stake in the RISCV backend, but I’m happy to see any work that paves the way towards removing the trivial VP intrinsics from IR at some point. Because…

…I believe that this the end of the road for them. I don’t see us ever accepting optimizations for trivial VP intrinsics into InstCombine. There is no good reason for these intrinsics to exist, and trying to support them everywhere would be a big waste of everyone’s time.

For correctness this looks fine, but for performance, in downstream we still use trivial VP intrinsics and are working to remove them without regressions (e.g. redundant vsetvli that can be optimized away).

That’s good to hear that you’re also working to move in the direction of removing trivial intrinsics downstream.

As for the regressions, do you have any specific examples? From my understanding most of these should be fixable by improving RISCVVLOptimizer. Support for recurrences and handling tuple types are two improvements that come to mind.

These shouldn’t be blockers for removing trivial intrinsics upstream given that we already don’t use them, but it would be good to track these on GitHub issues.

Overall removing the trivial VP intrinsics sounds great and would be a nice simplification to see!

Hi @lukel, thanks for putting together the RFC.

We have been heavy users of VP intrinsics in our experiments with the loop vectorizer when targeting a long vector implementation of RVV.

We will likely continue using them in our downstream branch but I can’t oppose to making the compiler implementation easier to maintain, so I’m totally OK with this.

2 Likes

Developer could use Polygeist to convert C/C++ code into MLIR. But it currently doesn’t supporting conversion of RVV intrinsic calls.

For example, here is my C++ code:

void axpy_vector(double a, double *dx, double *dy, int n) {
  int i;
  long gvl = __riscv_vsetvl_e64m1(n); // intrinsic
  for (i = 0; i < n;) {
    // intrinsic
    gvl = __riscv_vsetvl_e64m1(n - i);
    vfloat64m1_t v_dx = __riscv_vle64_v_f64m1(&dx[i], gvl);
    vfloat64m1_t v_dy = __riscv_vle64_v_f64m1(&dy[i], gvl);
    vfloat64m1_t v_res = __riscv_vfmacc_vf_f64m1(v_dy, a, v_dx, gvl);
    __riscv_vse64_v_f64m1(&dy[i], v_res, gvl);
    i += gvl;
  }
}

If I want to extend Polygeist, how to convert intrinsic(__riscv_vsetvl_e64m1, ..) into MLIR?

What dialect of operation are they supposed to be mapped to in MLIR?

Are there any problems or difficulties to implement this?

I don’t think VP intrinsics are relevant to your situation as they can’t represent RVV intrinsics like __riscv_vsetvl_e64m1 etc. I think there was an RFC for a RVV MLIR dialect but I’m not sure if it ever landed: [RFC] Add RISC-V Vector Extension (RVV) Dialect - #7 by aartbik

@lukel Thanks for reply. I am new to MLIR field.

I noticed the author of RVV Dialect said they were using VP intrinsics for research and projects.

Can call_intrinsic of LLVM dialect be used to make a call to __riscv_vsetvl_e64m1, etc. in MLIR code? Is there any problem in this way?

You can use call_intrinsic to call an LLVM IR intrinsic, but your code is calling C intrinsics. Those map, often (but not always) 1 to 1, to LLVM IR intrinsics.

For instance, __riscv_vsetvl_e64m1(avl) is implemented by __builtin_rvv_vsetvli(avl, 3, 0), which compiles down to llvm.riscv.vsetvli\. So, your call __riscv_vsetvl_e64m1 has to be lowered to something along the lines of:

%c0 = arith.constant 0 : i64
%c3 = arith.constant 3 : i64
llvm.call_intrinsic “llvm.riscv.vsetvli” (%arg0, %c3, %c0) : (i64, i64, i64) - > i64

You’ll have to chase these down the code, starting from the C functions in clang/include/clang/Basic/riscv_vector.td (and maybe others), and down to the definitions in llvm/IR/IntrinsicsRISCV.td.

There may also be a way for you to call (through a regular func.call) the C intrinsic, if you declare it as an external function and then link against whatever library those definitions end up in, but I haven’t use that route myself.

I hope this helps! :slight_smile:

1 Like

Really appreciate it ! Thanks for your advice.

Developer usually writes assembly code for optimization as well, like:

void riscv_vmadot(int8_t *a, int8_t *b, int i8s, bool print_result = false)
{
    const size_t vlmax_e32m1 = __riscv_vsetvlmax_e32m1();
    int32_t c[vlmax_e32m1];
    int32_t d[vlmax_e32m1];
    asm volatile(
        "vsetvli x7, x0, e8, m1, tu, mu\n\t"
        "vle8.v v0, (%[a])\n\t"
        "vle8.v v2, (%[b])\n\t"
        "vmadot v4, v0, v2 \n\t"
        "vsetvli x7, x0, e32, m1, tu, mu\n\t"
        "vse32.v v4, %[c]\n\t"
        "vse32.v v5, %[d]\n\t"
        : [c] "=g"(c), [d] "=g"(d)
        : [a] "r"(a), [b] "r"(b)
        : "memory");
}

Then how to deal with these assembly codes during conversion from C/C++ to MLIR?

Does it cause any problems to use assembly directly in MLIR?

Last time I checked, there was no way to represent inline assembly in MLIR (worth checking if this has changed). You’d have to do some sort of trick, like spitting the code between C and inline snippets, replacing the inline assembly with calls to the snippets. Then lower the C part to MLIR, compile the inline snippets to IR with clang, and then sort of re-inline the snippets using LTO.

But we’re entering “never had to do this before” territory, so I can only give you an idea that I’d explore as a first approach. Good luck!

1 Like

I checked for last two days and I found llvm.inline_asm in llvm dialect 'llvm' Dialect - MLIR.

Is it the way to represent inline assembly in MLIR? If I’m wrong, please correct me.

Yes.

After [RISCV] Remove codegen for vp_and, vp_or, vp_xor, vp_sra, vp_srl, vp_shl. NFC by lukel97 · Pull Request #194904 · llvm/llvm-project · GitHub, the codegen for all trivial VP intrinsics has been removed from the RISC-V backend, and the intrinsics are expanded instead to their equivalent plain LLVM IR.

2 Likes