[RFC] loop vectorization of "compress store"/"expand load" patterns

Hello all,

We would like to propose support for “compress store”/“expand load” patterns in the LoopVectorizer, e.g. for the following loops:

int compress(int *dst, int *src, int n) {
  int idx = 0;
  for (int i = 0; i < n; ++i)
    if (condition(i))
      dst[idx++] = src[i];
  return idx;
}

int expand(int *dst, int *src, int n) {
  int idx = 0;
  for (int i = 0; i < n; ++i)
    if (condition(i))
      dst[i] = src[idx++];
  return idx;
}

These patterns are commonly used for operations like filtering unused records from a data stream (“stream compaction”), and LLVM IR already has intrinsics to represent such operations (llvm.masked.expandload, llvm.masked.compressstore). There are a number of vector extensions that can produce efficient code for them:

  1. X86: AVX-512 has the VCOMPRESS*/VEXPAND* instructions
  2. AArch64: SVE has the COMPACT instruction
  3. RISC-V: RVV has the VCOMPRESS/VIOTA instructions
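
For readers less familiar with the intrinsics, here is a minimal scalar model of their semantics for a single vector of VF lanes (a sketch only; the names, signatures, and return values are illustrative and do not match the actual intrinsic interfaces):

#include <stdbool.h>

// Scalar model of llvm.masked.compressstore: the enabled lanes of the
// source vector are packed together and stored contiguously starting at
// dst. Returns the number of elements written (the advance of idx).
static int compress_store_model(int *dst, const int *src_lanes,
                                const bool *mask, int vf) {
  int out = 0;
  for (int lane = 0; lane < vf; ++lane)
    if (mask[lane])
      dst[out++] = src_lanes[lane];
  return out;
}

// Scalar model of llvm.masked.expandload: consecutive elements are read
// from src and placed into the enabled lanes of the result vector.
static int expand_load_model(int *dst_lanes, const int *src,
                             const bool *mask, int vf) {
  int in = 0;
  for (int lane = 0; lane < vf; ++lane)
    if (mask[lane])
      dst_lanes[lane] = src[in++];
  return in;
}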

Previous work

We’ve discovered that there have been attempts to do similar things in the past, e.g. this PR: [LV] Vectorization of compress idiom by nikolaypanchenko · Pull Request #83467 · llvm/llvm-project · GitHub, and there are some related presentations:

  1. https://llvm.org/devmtg/2024-04/slides/LightningTalks/Joshi-EnablingLoopVectorizer-for-CompressingStorePattern.pdf
  2. https://llvm.org/devmtg/2024-04/slides/QuickTalks/Wei-TargetAwareVectorization.pdf

Our proposed implementation is quite similar to the previous work, but we’ve also implemented the “expand load” pattern and the generation of run-time aliasing checks for such “compressed” memory accesses in LoopAccessAnalysis (without the latter, the majority of such patterns in C code can’t be vectorized unless the pointers carry the restrict qualifier).
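
As a rough illustration of why the LAA part matters (a sketch under the assumption that idx advances by at most one element per iteration, so idx <= i always holds; the real check is built from SCEV expressions rather than written in C): the stores of compress() are conservatively bounded by [dst, dst + n), so a standard interval-overlap test against the load range [src, src + n) is enough to allow vectorization without restrict.

#include <stdbool.h>

// Hedged sketch of the run-time aliasing check for compress():
// because idx <= i on every iteration, the store range is conservatively
// [dst, dst + n), and a no-overlap test against [src, src + n) suffices.
static bool compress_ranges_do_not_overlap(const int *dst, const int *src,
                                           int n) {
  return dst + n <= src || src + n <= dst;
}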

Implementation details

The implementation consists of three main parts:

  1. IVDescriptors are extended with a new descriptor type, “monotonic”. In the examples above, idx is a “monotonic” variable: it behaves very much like an induction variable, but is updated only on some loop iterations, under a condition. The descriptor of such a variable contains a SCEVAddRec built as if it were updated on every iteration like an induction variable, and the update condition is encoded as a CFG edge (the condition is true on the current loop iteration if that edge is taken). Both pointer and integer monotonic variables are supported.
  2. The SCEVAddRec of a monotonic variable gives a conservative estimate of the widest possible access bounds, so it can be supported in LoopAccessAnalysis: we substitute the SCEV of such a variable with the SCEVAddRec from its descriptor, and the memory access can then be treated like the already supported consecutive accesses in the loop, so generation of run-time checks comes essentially for free.
  3. VPlan changes: we’ve added one new recipe (VPMonotonicPHIRecipe) to represent the first lane of a monotonic phi (currently, all uses of a monotonic phi must be scalar; producing a value for each lane complicates things a bit and is discussed later). The backedge value of the monotonic phi is produced by a new VPInstruction type, ComputeMonotonicResult, which calculates PhiValue + ctpop(ConditionMask) * StepValue (a small note: the “ctpop” operation counts the set bits via zext + reduce.add rather than the llvm.ctpop intrinsic; this is done to support scalable VFs and appears to be the only way to represent ctpop for scalable vector masks; see the sketch after this list). ConditionMask is obtained from the CFG edge that encodes the predicate of the monotonic phi update. VPWidenMemoryRecipe is extended with a Compressed flag, which distinguishes generation of ordinary masked loads/stores from masked expand loads/compress stores (such accesses are represented with the CM_Compressed InstWidening type).
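
Below is a minimal scalar sketch of what ComputeMonotonicResult produces per vector iteration (names are illustrative; the popcount is written as zext + reduce.add, as described above, so that the same form works for scalable VFs):

#include <stdbool.h>

// Scalar sketch of ComputeMonotonicResult for one vector iteration:
// the backedge (and "resume") value of the monotonic phi is
// PhiValue + ctpop(ConditionMask) * StepValue, with ctpop modeled as
// zext(mask) followed by reduce.add.
static int compute_monotonic_result(int phi_value, const bool *cond_mask,
                                    int vf, int step_value) {
  int popcount = 0;
  for (int lane = 0; lane < vf; ++lane)
    popcount += cond_mask[lane] ? 1 : 0; // zext + reduce.add
  return phi_value + popcount * step_value;
}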

In practice, monotonic variables are commonly used outside of the loop (as in the example above, where idx is returned from the function to indicate the number of elements written to the dst buffer). This can easily be supported, too: ComputeMonotonicResult calculates the “resume” value of the monotonic variable, which can be used outside of the loop for this purpose.

Restrictions

In our implementation, interleaving of loops with monotonic variables is not supported. The problem is that computing the monotonic phi value for each part would require the mask value to be available in the preheader (and this seems generally unsolvable if the condition is not loop-invariant).

Another interesting area of improvement is support for “non-uniform” monotonic variables, for example:

void prefix_count(int *dst, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    if (condition(i))
      dst[i] = sum++;
}

In this example, the use of the “sum” monotonic variable is non-uniform after vectorization: we need to produce a value for each vector lane. Currently, such patterns are not handled. We have patches that address this case, but it is a non-trivial task:

  1. First, we need a new intrinsic to calculate a prefix sum from the given mask (e.g. “llvm.masked.stepvector”; a sketch of its semantics follows this list). For the RISC-V RVV target, it can be efficiently lowered to the VIOTA instruction.
  2. The condition should dominate all “vector” uses of the monotonic phi in such a case (because we need to insert “masked.stepvector” before them); otherwise, vectorization is not possible.
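
For illustration, here is a scalar model of the prefix-sum semantics such an intrinsic would need (the name “llvm.masked.stepvector” is only proposed above, not an existing intrinsic; the sketch is plain C):

#include <stdbool.h>

// Scalar model of the proposed prefix-sum-over-mask operation (similar
// in spirit to RVV's viota.m): lane i receives the number of enabled
// lanes strictly before it, which is exactly the per-lane value of a
// non-uniform monotonic variable such as "sum" above.
static void masked_stepvector_model(const bool *mask, int *out, int vf) {
  int running = 0;
  for (int lane = 0; lane < vf; ++lane) {
    out[lane] = running;
    if (mask[lane])
      ++running;
  }
}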

Overall, we’ve found that such cases are not very common in practice (only 8 more loops are vectorized in llvm-test-suite with SPEC2006 and SPEC2017), so the added complexity is probably not worth it (the majority of “monotonic” patterns are covered by the expandload/compressstore intrinsics).

Patches

  1. [IVDescriptors] Implement MonotonicDescriptor by skachkov-sc · Pull Request #140720 · llvm/llvm-project · GitHub
  2. [LAA] Support monotonic pointers in LoopAccessAnalysis by skachkov-sc · Pull Request #140721 · llvm/llvm-project · GitHub
  3. [LoopVectorize][NFC] Refactor widening decision logic by skachkov-sc · Pull Request #140722 · llvm/llvm-project · GitHub
  4. [LoopVectorize] Support vectorization of compressing patterns in VPlan by skachkov-sc · Pull Request #140723 · llvm/llvm-project · GitHub

Feedback would be very much appreciated!


The infrastructure for vectorizing:

if (condition(iv))
  rdx = iv

already exists upstream, and was authored by @Mel-Chen as iv-select-cmp or FindLastIV. Perhaps the first step would be to generalize FindLastIV to the following:

  int t = init_val;
  int last = -1;
  for (int iv = 0; iv < N; iv++) {
    if (condition(iv))
      last = iv;
  }
  int s = (last == -1) ? init_val : a[last];

to enable vectorization of conditional-scalar-assignments, which @graham.hunter is working on here, and then generalize that to compress and expand?

I think we wouldn’t want specific patterns to be developed and merged independently, when it’s possible to share the code, and I think this review was along the same line of thought.

It’s nice that you’ve developed LAA support for runtime checks as well, but I think LAA should be decoupled from specific IVDescriptors.

It would be great to see this work implemented in LV!

to enable vectorization of conditional-scalar-assignments, which @graham.hunter is working on here, and then generalize that to compress and expand?

I think we wouldn’t want specific patterns to be developed and merged independently, when it’s possible to share the code, and I think this review was along the same line of thought.

If we consider general CSA and general monotonic idioms, the main differences I see between them are:

  1. Presence of a single self-update in monotonic.
    That is, informally, general CSA allows uses and reassignments in nested conditions:
last = -1;
for (int i = 0; i < n; ++i) {
  if (cond0(i)) {
      last = i;
      a[i] = last;
      if (cond1(i)) {
         last = f(i);
         b[i] = last;
      }
  }
}

and the vectorizer should be able to vectorize it easily. While for monotonic, having

for (int i = 0; i < n; ++i) {
  if (cond0(i)) {
     idx++;
     a[idx] = ...
     if (cond1(i)) {
       ++idx;
       b[idx] = ...
     }
  }
}

would be too hard to vectorize.

  2. Different handling of the last value. For CSA, special vector code needs to be generated in the post-exit block; the monotonic’s scalar value will be computed within the vector loop.

That said, I do see this idiom as different from CSA. In my abandoned PR, which took a similar approach to Sergey’s, I was tempted to extend InductionDescriptor, since a monotonic is more like a conditional linear (induction) variable in most cases. “Most”, because theoretically it’s not hard to support multiplication:

for (int i = 0; i < n; ++i) {
  if (cond0(i)) {
     idx *= Constant;
     a[idx] = ...
     b[i] = idx + c[idx];
  }
}

but it is hard to imagine anyone would write code like this.

Thank you for the responses! After looking at the FindLastIV pattern, I find it difficult to reuse its code for the monotonic pattern. As @npanchen said, a monotonic variable is self-updated in the loop and in this sense looks more suitable for InductionDescriptor (with some additional information about the update predicate). However, we can also treat it as a reduction over the condition value, e.g.:

int idx = 0;
for (int i = 0; i < n; ++i)
  if (cond(i)) {
    store(dst[idx]);
    idx += Step;
  }

can be rewritten to

int idx = 0;
for (int i = 0; i < n; ++i) {
  if (cond(i))
    store(dst[idx]);
  idx += extend(cond(i)) * Step;
}

(idx is a reduction variable in this example, but unlike an “ordinary” reduction, it has uses inside the loop.)
It looks like the reduction descriptor already supports some conditional patterns (RecurrenceDescriptor::isConditionalRdxPattern). There are some differences from our implementation (a PHINode vs. a SelectInst is used for matching), but some logic can probably be shared here. @artagnon, how do you feel about moving support for monotonic variables into RecurrenceDescriptor instead of adding a new descriptor type?

Sounds good: kindly update your patches, and possibly delay the LAA work until we can find a good way to generalize replaceSymbolicStrideSCEV without depending on specific induction/recurrence descriptors. Yes, after some more thought, I don’t think FindLastIV is the one to generalize.