The current RISC-V vector register spill/reload operates on full vector registers regardless of how much of the register the defining instruction actually writes. This can degrade performance if the spill/reload occurs in a hot loop, so it is worth spilling/reloading only part of the vector register when the remaining data is not needed.
Here is an example that illustrates the scenario described above:
In the example, v8 is defined by a vadd.vv that only writes half of the register, which means we don’t need to store the whole register when spilling. The example above can therefore be optimized to use a unit-stride load/store:
The optimization can be performed when v8 is guaranteed to use only part of the full register. One thing to mention is that additional vset* instructions might be inserted: the whole-register load/store doesn’t need any, but once it is changed to a unit-stride load/store, a vset* instruction is needed whenever the current vtype and vl setup is not compatible with it.
Depending on hardware, reducing the width of the load/store may not have any particular benefit. Of particular note is that we’re giving up a whole register LD/ST and using the unit strided variants instead.
You talk about reducing VL, but from the example, your main focus is on reducing LMUL. You may want to be careful to distinguish which your focus actually is in code and comment.
You need to be careful to reason about the tail elements in the register being spilled. If the vadd has anything other than a tail undefined (an LLVM concept, not an ISA one), this transform would be illegal.
How do you handle identifying the value of VL and VTYPE to restore after the reload has to toggle them? (I haven’t checked your patch.) This is one of the hardest pieces of this idea.
What is your overall profitability goal with this transform? Do you have hardware which has dynamically lower cost for the spill/fill? Are you aiming to reduce stack sizes? Something else? Some combination of the above? Basically, why is this complexity worthwhile?
Thanks so much @preames for reviewing the proposal!!
Yes, the load/store may not benefit from this pass, so there’s a switch to turn it on or off (default off for now). Users can decide whether to enable it depending on the hardware or other optimization concerns.
Right, the description is a bit misleading. I’m actually talking about reducing LMUL, thanks for pointing that out!
Correct. In the algorithm, I simply skip rewriting the spill/reload if the source register is produced by a “TU” (tail undisturbed) instruction, which also needs to preserve the tail elements.
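The check described here can be sketched roughly as follows. This is only an illustration, not the code from the patch: `TailPolicy`, `DefInfo` and `canNarrowSpill` are hypothetical names, and LMUL is stored in eighths purely so that fractional values stay integral.

```cpp
#include <cassert>

// Hypothetical tail policy of the instruction that defines the spilled value.
enum class TailPolicy { Undefined, Undisturbed };

// Hypothetical summary of the defining instruction. LMUL is stored in eighths
// so that fractional values stay integral (mf8 = 1, mf2 = 4, m1 = 8, m8 = 64).
struct DefInfo {
  unsigned LmulEighths;
  TailPolicy Tail;
};

// Returns true if a spill of a register class spanning RegClassLmulEighths
// may be narrowed to the LMUL that the defining instruction actually writes.
bool canNarrowSpill(const DefInfo &Def, unsigned RegClassLmulEighths) {
  assert(Def.LmulEighths != 0 && RegClassLmulEighths != 0);
  // A tail-undisturbed definition keeps live data past VL, so the whole
  // register must still be saved.
  if (Def.Tail == TailPolicy::Undisturbed)
    return false;
  // Narrowing only helps when the definition writes less than the full class.
  return Def.LmulEighths < RegClassLmulEighths;
}
```

For instance, an mf2 definition with an undefined tail feeding an m1 spill would qualify, while the same definition with a tail-undisturbed policy would not.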
The main idea is that we only need to preserve the data that is effective. Once we decide which LMUL to use, we only need to reload that LMUL’s worth of data, which means we can always set VL=VLMAX with SEW=8 and tail agnostic in VTYPE.
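To make the arithmetic concrete: VLMAX = LMUL * VLEN / SEW, so with SEW=8 and VL=VLMAX the unit-stride access moves exactly LMUL * VLEN / 8 bytes. A minimal sketch, assuming VLEN is known (in practice it usually is not at compile time); the function names and the LMUL-in-eighths encoding are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved by a unit-stride e8 load/store with VL = VLMAX.
// LMUL is passed in eighths (mf8 = 1, mf4 = 2, mf2 = 4, m1 = 8, m2 = 16, ...).
uint64_t unitStrideSpillBytes(unsigned VlenBits, unsigned LmulEighths) {
  assert(VlenBits % 8 == 0 && LmulEighths != 0);
  const uint64_t BytesPerReg = VlenBits / 8;
  return BytesPerReg * LmulEighths / 8;
}

// Bytes moved by a whole-register store, which always covers at least one
// full register even when the value was defined with a fractional LMUL.
uint64_t wholeRegisterSpillBytes(unsigned VlenBits, unsigned LmulEighths) {
  assert(VlenBits % 8 == 0 && LmulEighths != 0);
  const uint64_t BytesPerReg = VlenBits / 8;
  const uint64_t NumRegs = LmulEighths < 8 ? 1 : LmulEighths / 8;
  return BytesPerReg * NumRegs;
}
```

With VLEN=128 and an mf2 definition, the whole-register store writes 16 bytes while the narrowed store writes 8.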
Currently I don’t have any data or benchmark that benefits from this, nor any specific hardware. Overall, I think the pass can only make the code better, since the whole-register load/store normally has higher latency than a partial load/store. The other potential benefit is a smaller stack, as we don’t need a full register’s worth of stack space.
If you want to approach this in a more general way (i.e., that applies to other targets), the inline spiller should be able to provide the lane mask of what is live at the spill/reload point and the TargetInstrInfo’s implementation of the spilling callbacks (e.g., storeRegToStackSlot) could get this information and act on it.
If the target wants to leverage this information, the heavy lifting would still be on the target side as they have to ensure that the lane mask is properly translated into an offset for the store/load to be generated. The generic code (at least with the current infrastructure) cannot do that automatically.
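As a rough illustration of that target-side translation, the sketch below turns a live-lane mask into the byte range a narrowed store/load would need to cover. It assumes every mask bit stands for an equally sized, contiguous slice of the register, which is a simplification: in LLVM the mask is a `LaneBitmask` whose bit meanings are target-defined. `ByteRange` and `liveByteRange` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Offset and size, in bytes, of the part of the register that must be saved.
struct ByteRange {
  uint64_t Offset;
  uint64_t Size;
};

// Smallest contiguous byte range covering every live lane in LaneMask,
// assuming bit i corresponds to bytes [i * BytesPerLane, (i + 1) * BytesPerLane).
ByteRange liveByteRange(uint64_t LaneMask, uint64_t BytesPerLane) {
  assert(LaneMask != 0 && "nothing is live, so nothing needs spilling");
  unsigned First = 0;
  while (((LaneMask >> First) & 1) == 0)
    ++First;
  unsigned Last = 63;
  while (((LaneMask >> Last) & 1) == 0)
    --Last;
  return {First * BytesPerLane, (Last - First + 1) * BytesPerLane};
}
```

A mask of `0x3` with 8-byte lanes yields offset 0 and size 16; dead lanes between two live ones are still covered, since a single unit-stride access cannot skip them.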
This doesn’t change the profitability aspect @preames was mentioning, but it removes the need for a target specific pass.