Adding support for multi-store producers to Affine loop fusion pass

Hello!

In case this is relevant to anybody, I’ve been working on extending the producer-consumer Affine loop fusion algorithm to support producer loops with multiple stores. An initial patch with some refactoring/cleanup needed for the new feature is currently under review [1]. The actual support for multi-store producers is WIP [2]. There are still some utilities to be extended to support multiple stores, such as isFusionProfitable.

As part of this work, I had to simplify and improve some aspects of the Affine fusion pass, which should also impact single-store producer scenarios. For example, computation slices (recomputed multiple times in the algorithm) are now computed only once and reused, legality and profitability analysis are more decoupled now, maximal fusion doesn’t require profitability analysis, fusion candidates are visited in a smarter order, etc. You can find all the details in the review messages. All of this should reduce compile time of single-store producer scenarios while preserving existing functionality.

We can bring aspects of the WIP patch to this thread, if needed. Also, let me know if you have any questions!

Thanks, @bondhugula and Andy Davis (can’t find Andy’s user), for all the guidance!

Diego

[1] Refactoring: https://reviews.llvm.org/D90798
[2] Multi-store support: [WIP] Add support for multi-store producer-consumer fusion by dcaballe · Pull Request #2 · dcaballe/llvm-project · GitHub

1 Like

These contributions are very welcome - thanks! Besides compile time improvements and separation of concerns, the refactoring makes it quite useful for downstream users to build their custom fusion models and to add further support upstream. I’m happy to review and participate in discussions.

With regard to compile time, we don’t have a way to measure or keep track of it. The amount of time it takes to run the test cases for fusion locally (which are numerous in number and take about 1/2 a second on a fast machine) could serve as a proxy for the time being.

1 Like

The amount of time it takes to run the test cases for fusion locally (which are numerous in number and take about 1/2 a second on a fast machine) could serve as a proxy for the time being.

Ok! I’ll check once the implementation is done. We need to add more analysis as part of the multi-store producer support (see below) so maybe we end up worsening the compile time after all. We’ll see.

One of the final functionality problems I’m investigating is about when to remove the producer loop in multi-store producer scenarios. canRemoveNode [1] served this purpose for single-store producer scenarios by checking that the producer loop has no other dependencies after fusion. Unfortunately, this doesn’t work for multi-store producer loops since they will always have more than one dependence so they will never be removed.

To address this problem, I introduced canRemoveSrcNodeAfterFusion [2], aimed at doing something smarter in this regard. It currently has a few checks but there is a big one missing: if the producer has other dependencies besides the producer-consumer one, we have to make sure that the computation slice being fused includes all the iterations consumed by these other dependences. Otherwise, the producer loop couldn’t be removed. For instance, in the example below, we can fuse the slice [0, 82) of %i0 into %i2 but we cannot remove %i0 because there is another use of %m that might read elements [82, 100).

  affine.for %i0 = 0 to 100 {
    affine.store %cf7, %m[%i0] : memref<100xf32>
  }
  affine.for %i2 = 0 to 82 {
    %v1 = affine.load %m[%i2] : memref<100xf32>
  }
  // Another use of %m

There is a lot of fine-grained analysis that we could do to maximize the removal of producer loops in a wide variety of scenarios but maybe we could start with something simple. For example, we could remove the producer loop only if the slice fused is “maximal” (i.e., it includes all the iterations of the producer loop). WDYT? What would be an easy way to check that property? Would comparing the bounds of the computation slice and the producer loop nest work?

[1] https://github.com/llvm/llvm-project/blob/master/mlir/lib/Transforms/LoopFusion.cpp#L318
[2] [WIP] Add support for multi-store producer-consumer fusion by dcaballe · Pull Request #2 · dcaballe/llvm-project · GitHub

This is exactly what I was going to suggest as well. The common case won’t benefit from such fine-grained analysis, and when needed, we could have a more precise version.

I assume both the slice and the producer loop would have the dimensions appearing in the same order. Then, you’d just have to do lb/ub equality check as the naive thing. With proper polyhedral methods, one would do a difference of the two sets (producer loop nest minus computation slice) and check if it’s zero. In fact, there are methods for these too in MLIR (PresburgerSet::subtract and getSetDifference). It’s possible that using these methods on such trivial sets may not be much overhead as well. While on this, we’d also like to not add too many methods that only work for the constant bounds case.

The patch that introduces support for fusing producer-consumer loops with multi-store producers is ready for review: https://reviews.llvm.org/D92876. We can discuss any design decisions that need a bit more elaboration here.

Thanks!

1 Like