Hello all,
We would like to propose the support of “compress store”/“expand load” patterns in LoopVectorizer, e.g. the following loops:
int compress(int *dst, int *src, int n) {
int idx = 0;
for (int i = 0; i < n; ++i)
if (condition(i))
dst[idx++] = src[i];
return idx;
}
int expand(int *dst, int *src, int n) {
int idx = 0;
for (int i = 0; i < n; ++i)
if (condition(i))
dst[i] = src[idx++];
return idx;
}
These patterns are commonly used for operations like filtering unused records from data stream (“stream compaction”), and LLVM IR already has intrinsics to represent such operations (llvm.masked.expandload, llvm.masked.compressstore). There are number of vector extensions that can produce effective code for them:
- X86: AVX512 has VCOMPRESS*/VEXPAND* instructions
- AArch64: SVE has COMPACT instruction
- RISC-V: RVV has VCOMPRESS/VIOTA instructions
Previous work
We’ve discovered that there have been attempts to do similar things in the past, e.g. this PR: [LV] Vectorization of compress idiom by nikolaypanchenko · Pull Request #83467 · llvm/llvm-project · GitHub and there are some presentations:
- https://llvm.org/devmtg/2024-04/slides/LightningTalks/Joshi-EnablingLoopVectorizer-for-CompressingStorePattern.pdf
- https://llvm.org/devmtg/2024-04/slides/QuickTalks/Wei-TargetAwareVectorization.pdf
Our proposed implementation is quite similar to the previous work, but we’ve also implemented “expand load” pattern and generation of run-time aliasing checks for such “compressed” memory accesses in LoopAccessAnalysis (without the later, the majority of such patterns in C code can’t be vectorized if there is no restrict qualifier on pointers).
Implementation details
Implementation consists of 3 main parts:
- IVDescriptors are extended with a new descriptor type - “monotonic”. In the examples given at the beginning, idx is a “monotonic” variable - it behaves very similarly to induction variable, but is updated only on some loop iterations under condition. Descriptor of this variable contains SCEVAddRec as if it will be updated on each iteration like an induction variable, and the condition of variable update is encoded as CFG edge (condition is true on the current loop iteration if this edge will be executed). Both pointer and integer types of monotonic variables are supported.
- SCEVAddRec of monotonic variable gives conservative estimation of the widest possible access bounds, so it can be supported in LoopAccessAnalysis: we substitute the SCEV of such variable with SCEVAddRec from its descriptor, and this memory access can be treated as already supported consecutive accesses in the loop, so we get the generation of run-time checks essentially free.
- VPlan changes: we’ve added one new recipe (VPMonotonicPHIRecipe) to represent the first lane of monotonic phi instruction (currently, all uses of monotonic phi should be scalar; producing the value for each lane of monotonic phi complicates things a bit and will be discussed later). Backedge value of this monotonic phi is produced by new VPInstruction type: ComputeMonotonicResult, that calculates
PhiValue + ctpop(ConditionMask) * StepValue
(small notice: “ctpop” operation calculates number of set bits via zext + reduce.add, not with llvm.ctpop intrinsic. This is done to support scalable VFs, and this looks like the only way to represent ctpop for scalable vector masks). ConditionMask is obtained from CFG edge that encodes the predicate of monotonic phi update. VPWidenMemoryRecipe is extended with a Compressed flag, that distinguishes generation between ordinary masked loads/stores and masked expand loads/compress stores (such accesses are represented with CM_Compressed InstWidening type).
In practice, there are common cases of using monotonic variables outside of the loop (as in example, idx is returned from function to indicate the number of written elements in dst buffer). This can be easily supported,too: ComputeMonotonicResult calculates the “resume” value of monotonic variable that can be used outside of the loop for this purpose.
Restrictions
In our implementation, interleaving of loops with monotonic variables is not supported. The problem here is that calculation of monotonic phi value for each part will require that mask value should be available in the preheader (and looks like this problem is generally unsolvable if the condition is not loop-invariant).
Another interesting area of improvement is support of “non-uniform” monotonic variables, for example:
void prefix_count(int *dst, int n) {
int sum = 0;
for (int i = 0; i < n; ++i)
if (condition(i))
dst[i] = sum++;
}
In this example, use of “sum” monotonic variable will be non-uniform after vectorization - we need to produce value for each vector lane. Currently, such patterns are not processed. We have the patches that address this case, but this is non-trivial task:
- Firstly, we need new intrinsic to calculate prefix sum from the given mask (e.g. “llvm.masked.stepvector”). For RISC-V RVV target, it can be effectively lowered in VIOTA instruction.
- Condition should dominate all “vector” uses of monotonic phi in such case (because we need to insert “masked.stepvector” before them); otherwise, vectorization will not be possible.
After all, we’ve found that such cases are not very common in practice (only 8 more loops vectorized in llvm-test-suite with SPEC2006 and SPEC2017), so probably the added complexity is not worth it (the majority of “monotonic” patterns are covered by expandload/compressstore intrinsics).
Patches
- [IVDescriptors] Implement MonotonicDescriptor by skachkov-sc · Pull Request #140720 · llvm/llvm-project · GitHub
- [LAA] Support monotonic pointers in LoopAccessAnalysis by skachkov-sc · Pull Request #140721 · llvm/llvm-project · GitHub
- [LoopVectorize][NFC] Refactor widening decision logic by skachkov-sc · Pull Request #140722 · llvm/llvm-project · GitHub
- [LoopVectorize] Support vectorization of compressing patterns in VPlan by skachkov-sc · Pull Request #140723 · llvm/llvm-project · GitHub
Feedback would be very much appreciated!