[RFC] RISCV vector register spill optimization pass

songwu0813 · August 22, 2024, 11:56am

Overview

Currently, RISC-V vector register spill/reload always operates on full vector registers, regardless of how much of the register the defining instruction actually writes. This can degrade performance if the spill/reload occurs in a hot loop, so it is worth spilling/reloading only part of the vector register when the remaining data is not needed.
Here is an example that illustrates the scenario described above:

vsetvli x0, a0, e8, mf2
vadd.vv v8, v8, v9
vs1r.v v8, (a2)         <-------- spill
.
.
.
vl1re8.v v8, (a2)       <-------- reload
.
.
.

In the example, v8 is defined by vadd.vv, which writes only half of the register (LMUL is mf2), so we don’t need to store the whole register when spilling. The example above can therefore be optimized to use unit-stride load/store:

vsetvli x0, a0, e8, mf2
vadd.vv v8, v8, v9
vse8.v v8, (a2)         <-------- spill
.
.
.
vsetvli a1, x0, e8, mf2
vle8.v v8, (a2)         <-------- reload
.
.
.

The optimization can be performed when it is guaranteed that v8 uses only part of the full register. One thing to note is that additional vset* instructions might be inserted: the whole-register load/store doesn’t need any, but once it is changed to a unit-stride load/store, a vset* instruction is needed whenever the current vtype and vl setup is not compatible with what the load/store requires.
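As a rough illustration of the rewrite and of when an extra vset* shows up, here is a toy Python model over a straight-line list of assembly strings. It is only a sketch and not the code from the patch: the names `parse_vtype` and `rewrite`, the `...` marker for elided code, and the default scratch register are assumptions made for this example. It tracks only SEW/LMUL (not VL), and it does not model restoring VL/VTYPE for the instructions that follow an inserted vsetvli.

```python
def parse_vtype(inst):
    """Return (SEW, LMUL) such as ("e8", "mf2") for a vsetvli line, else None."""
    parts = inst.replace(",", " ").split()
    if parts and parts[0] == "vsetvli":
        return tuple(parts[3:5])
    return None


def rewrite(insts, need=("e8", "mf2"), scratch="a1"):
    """Turn whole-register spill/reload into unit-stride accesses.

    A vsetvli is inserted only when the tracked vtype is unknown or
    differs from what the partial access needs.
    """
    out = []
    state = None  # tracked (SEW, LMUL); None means "unknown"
    for inst in insts:
        fields = inst.split()
        op = fields[0] if fields else ""
        vtype = parse_vtype(inst)
        if vtype is not None:
            state = vtype          # an explicit vsetvli defines the state
            out.append(inst)
        elif op == "...":
            state = None           # elided code may have changed vtype
            out.append(inst)
        elif op in ("vs1r.v", "vl1re8.v"):
            if state != need:
                out.append(f"vsetvli {scratch}, x0, {need[0]}, {need[1]}")
                state = need
            new_op = "vse8.v" if op == "vs1r.v" else "vle8.v"
            out.append(inst.replace(op, new_op, 1))
        else:
            out.append(inst)
    return out


before = [
    "vsetvli x0, a0, e8, mf2",
    "vadd.vv v8, v8, v9",
    "vs1r.v v8, (a2)",
    "...",
    "vl1re8.v v8, (a2)",
]
for line in rewrite(before):
    print(line)
```

In this model the spill needs no new vsetvli because it directly follows code that already runs at e8/mf2, while the reload does, because the elided code in between leaves the state unknown.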

Pull request

github.com/llvm/llvm-project

[RISCV][MI] Support partial spill/reload for vector registers

main ← 4vtomat:vreg_partial_spill

opened 02:01PM - 22 Aug 24 UTC

4vtomat

+1001 -1

RFC: https://discourse.llvm.org/t/rfc-riscv-vector-register-spill-optimization-p…ass/80850

Current RISC-V vector register spill/reload works on full vector registers regardless of how much of the register the defining instruction uses. In some cases it's not necessary to spill the full vector register, for example:

```
vsetvli a1, a0, e8, mf2, ta, ma
vadd.vv v8, v8, v9
vs1r v8, (a2)   <- spill
.
.
.
vl1r v8, (a2)   <- reload
vmul.vv v8, v8, v9
```

Both spill and reload can be replaced with `vse8.v` and `vle8.v` respectively, as below:

```
vsetvli a1, a0, e8, mf2, ta, ma
vadd.vv v8, v8, v9
vse8.v v8, (a2)   <- spill
.
.
.
vsetvli a1, x0, e8, mf2, ta, ma
vle8.v v8, (a2)   <- reload
vmul.vv v8, v8, v9
```

Note that this patch doesn't support a BB that contains any inline assembly, for example:

```
%0 = vadd.vv v8, v9 (e8, mf2)
vs1r %0, %stack.0
...
inline_asm("vsetvli 888, e8, m1")
%1 = vl1r %stack.0
inline_asm("vadd.vv %a, %b, %c", %a=v8, %b=%1, %c=%1)
```

If we rewrote the case above, %1 would become vle8 with mf2, and RISCVInsertVSETVLI would emit a vsetvli with mf2 for %1, which is incompatible with the original semantics, which use m1.

preames · August 22, 2024, 2:54pm

Interesting work. A couple of discussion items:

Depending on hardware, reducing the width of the load/store may not have any particular benefit. Of particular note is that we’re giving up a whole register LD/ST and using the unit strided variants instead.
You talk about reducing VL, but from the example, your main focus is on reducing LMUL. You may want to be careful to distinguish which your focus actually is in code and comment.
You need to be careful to reason about the tail elements in the register being spilled. If the vadd has anything other than a tail undefined (an LLVM concept, not an ISA one), this transform would be illegal.
How do you handle identifying the value of VL and VTYPE to restore after the reload has to toggle them? (I haven’t checked your patch.) This is one of the hardest pieces of this idea.
What is your overall profitability goal with this transform? Do you have hardware which has dynamically lower cost for the spill/fill? Are you aiming to reduce stack sizes? Something else? Some combination of the above? Basically, why is this complexity worthwhile?

songwu0813 · August 23, 2024, 12:46pm

Thanks so much @preames for reviewing the proposal!!

Yes, the load/store may not benefit from this pass, so there’s a switch to turn it on or off (off by default for now). Users can decide whether to enable it depending on the hardware or other optimization concerns.

Right, the description is a bit misleading; I’m actually talking about reducing LMUL. Thanks for pointing that out!

Correct. In the algorithm, I simply skip rewriting the spill/reload if the source register is produced by a “TU” (tail undisturbed) instruction, which also needs its tail elements preserved.
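To make the legality conditions discussed so far concrete, here is a minimal Python sketch of an eligibility check. It is not the logic from the patch: `DefiningInst`, `can_use_partial_spill`, and the "fractional LMUL" test are hypothetical stand-ins. It combines the tail-undisturbed restriction with the inline-assembly restriction from the pull request description.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class DefiningInst:
    """Hypothetical summary of the instruction that defines the spilled value."""
    opcode: str
    lmul: Fraction            # e.g. Fraction(1, 2) for mf2
    tail_undisturbed: bool    # True for a "TU" instruction


def can_use_partial_spill(defining, block_has_inline_asm=False):
    """Return True if a single-register spill/reload may be narrowed."""
    if block_has_inline_asm:
        # Inline asm may change VL/VTYPE in ways the pass cannot see.
        return False
    if defining.tail_undisturbed:
        # Tail elements carry live data, so the whole register is needed.
        return False
    # Assumption: only a fractional LMUL leaves part of the register unused.
    return defining.lmul < 1


vadd_mf2 = DefiningInst("vadd.vv", Fraction(1, 2), tail_undisturbed=False)
print(can_use_partial_spill(vadd_mf2))
```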

The main idea is that we only need to preserve the data that is actually in effect. Once we decide which LMUL to use, we only need to reload LMUL-wide data, which means we can always set VL=VLMAX with SEW=8 and tail agnostic for VTYPE.
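A quick arithmetic check of why SEW=8 with VL=VLMAX is sufficient: at a fixed LMUL, the number of bytes covered by VLMAX elements does not depend on SEW. The sketch below uses the RVV definition VLMAX = LMUL * VLEN / SEW. The helper names and the VLEN value of 128 are assumptions chosen for illustration.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX = LMUL * VLEN / SEW, as defined by the RVV specification."""
    return int(Fraction(lmul) * vlen_bits // sew_bits)


def bytes_covered(vlen_bits, sew_bits, lmul):
    """Bytes touched by a unit-stride access of VLMAX elements."""
    return vlmax(vlen_bits, sew_bits, lmul) * sew_bits // 8


def reload_vtype(def_lmul):
    """Reload setup described in the post: SEW=8, chosen LMUL, VL=VLMAX, TA."""
    return {"sew": 8, "lmul": Fraction(def_lmul), "vl": "VLMAX", "tail": "agnostic"}


VLEN = 128  # assumed hardware vector length in bits
mf2 = Fraction(1, 2)
cfg = reload_vtype(mf2)

# Whatever SEW the defining instruction used, an e8 reload at the same LMUL
# covers the same bytes.
for def_sew in (8, 16, 32):
    assert bytes_covered(VLEN, def_sew, mf2) == bytes_covered(VLEN, cfg["sew"], cfg["lmul"])

print(bytes_covered(VLEN, cfg["sew"], cfg["lmul"]), "bytes, versus", VLEN // 8, "for a whole register")
```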

Currently I don’t have any data or benchmarks showing a benefit from this, nor any specific hardware in mind. Overall, I think the pass can only make the code better, since a whole-register load/store normally has higher latency than a partial load/store. The other potential benefit is reducing the stack size, since we don’t need a full register’s worth of stack space.

qcolombet · September 4, 2024, 7:57am

If you want to approach this in a more general way (i.e., that applies to other targets), the inline spiller should be able to provide the lane mask of what is live at the spill/reload point and the TargetInstrInfo’s implementation of the spilling callbacks (e.g., storeRegToStackSlot) could get this information and act on it.
If the target wants to leverage this information, the heavy lifting would still be on the target side as they have to ensure that the lane mask is properly translated into an offset for the store/load to be generated. The generic code (at least with the current infrastructure) cannot do that automatically.
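To illustrate the kind of translation the target would have to do, here is a deliberately simplified Python sketch that maps a lane mask to a byte offset and size. It assumes each mask bit stands for one equally sized, contiguous chunk of the register. That is an assumption made for this example, not how LLVM lane masks are defined in general, and `live_byte_range` is a hypothetical helper.

```python
def live_byte_range(lane_mask, bytes_per_lane):
    """Return (offset, size) in bytes of the smallest contiguous range
    covering every live lane, or None if no lane is live."""
    if lane_mask == 0:
        return None
    lowest = (lane_mask & -lane_mask).bit_length() - 1   # index of lowest set bit
    highest = lane_mask.bit_length() - 1                 # index of highest set bit
    offset = lowest * bytes_per_lane
    size = (highest - lowest + 1) * bytes_per_lane
    return offset, size


# Four 4-byte lanes in a 16-byte register: only the low two lanes are live.
print(live_byte_range(0b0011, 4))
# Only the high two lanes are live, so the access starts at a non-zero offset.
print(live_byte_range(0b1100, 4))
```

A mask with holes, such as `0b0101`, is conservatively widened to one contiguous range here; a real target could instead choose to emit several accesses.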

This doesn’t change the profitability aspect @preames was mentioning, but it removes the need for a target specific pass.
