RFC: [mlir][Vector][Affine] SuperVectortize: Optimization for misaligned data

Hello,

I’m reaching out to discuss a potential optimization in the SuperVectorizer, specifically regarding the handling of misaligned data during vectorization. In scenarios where the data size is not evenly divisible by the vector length, misalignment occurs, leading to the presence of garbage elements in the last iteration of a loop and this becomes important in cases where reduction occurs after the loop.

Currently, SuperVectorizer handles this by generating a mask in every iteration to avoid operations on these garbage values. While this ensures correctness, it introduces overhead by necessitating vector predication in all iterations (even though the mask is all 1s until the last iteration), which can significantly impact performance.

I propose considering an alternative approach, such as loop peeling or introducing a cleanup loop. This would involve processing the bulk of the data in the main loop without masking and then handling the remaining elements in a separate loop. Specifically, the cleanup loop would operate only on the last few elements, where masking is necessary to avoid garbage data processing.

This method could potentially reduce the overhead by limiting the use of masks to only the necessary part of the loop, avoiding the performance penalty associated with vector predication throughout the entire loop. The main benefits of this approach would be:

  1. Reduced computational overhead by minimizing the use of costly predication logic.
  2. Enhanced performance for loops dealing with misaligned data, particularly in reduction scenarios.
  3. Cleaner and more efficient code generation, especially in tight loops where the overhead of masking in every iteration can be particularly pronounced.

I believe this optimization could lead to significant performance improvements in scenarios where misaligned vectorization is common. I would appreciate your thoughts on this suggestion and its feasibility within the current architecture of SuperVectorizer.

Thank you for considering this optimization proposal. If it makes sense I will be happy to implement it.

Thanks,
Zeeshan Siddiqui

CC: @dcaballe @sergei-grechanik

3 Likes

Sorry for the late response. That makes sense to me! We are actually following a similar approach in the Linalg Vectorizer. We apply peeling as an opt-in pre-vectorization transformation. In that way, the vectorizer will automatically apply masking or not to the main/remainder vector loops if the number of iterations of each one is multiple of the vector length used. I think that’s the key design aspect: we should keep the peeling step separate from the vectorization step so that clients can decide if they want to peel or not to peel (e.g., to reduce code size).

It’s been a long time since I worked on the SuperVectorizer and affine loops but you may want to check if we already have some tools to generate what you are aiming for. It could be a matter of just putting some of the existing passes/patterns together.

Thank you for sharing your approach to loop peeling as a pre-vectorization transformation for the Linalg Vectorizer. I’m curious about the implementation details of this strategy. Are you utilizing any existing MLIR passes or utilities for peeling linalg loops, or have you developed a custom solution for this purpose?

I have implemented a similar transformation for affine for loops and am already observing significant performance improvements. I’m eager to explore if there’s an existing utility in MLIR that I can leverage for this. However, if such a utility does not exist, I would be happy to contribute my transformation for affine for loops, akin to what we currently have for scf.for loops.

For Linalg, you can see how we do it in IREE here. In particular, look at createLLVMCPUPeelPass and createGenericVectorizationPass. These are IREE passes which are mostly wrappers around peeling and vectorization utilities in MLIR upstream. You should be able to do something similar.

Regarding affine loops, perhaps @bondhugula might be able to help.

1 Like

There isn’t a loop peeling utility for affine.for and a contribution here is welcome. Similar functionality exists for unrolling and unroll-and-jam where a cleanup loop is generated when needed. lib/Dialect/Affine/Utils/LoopUtils.cpp is where the loop peeling utility can go.

A peeling utility could be useful regardless of misalignment (for e.g. when the trip count doesn’t divide the vector width, peel and vectorize is one of the several ways to vectorize).

1 Like