[RFC] Make LoopVectorize Aware of SLP Operations

Hello,

We would like to propose making LoopVectorize aware of SLP operations, to improve the generated code for loops operating on struct fields or doing complex math.

Currently, LoopVectorize uses interleaving to vectorize loops that operate on values loaded from or stored to consecutive addresses: wide vector loads/stores combine the consecutive accesses, and shufflevector instructions are then used to de-interleave/interleave the loaded/stored values. However, we fail to detect cases where the same operations are applied to all consecutive values, so there is no need to interleave at all. To illustrate this, consider the following example loop:

struct Test {
     int x;
     int y;
};

void add(struct Test *A, struct Test *B, struct Test *C) {
    for (unsigned i = 0; i < 1024; i++) {
        C[i].x = A[i].x + B[i].x;
        C[i].y = A[i].y + B[i].y;
    }
}

On X86, we do not vectorize this loop and on AArch64, we generate the following code for the vector body:

vector.body:
   %index = phi i64 [ %index.next, %vector.body ], [ 0,
                      %vector.body.preheader ]
   %8 = or i64 %index, 4
   %9 = getelementptr inbounds %struct.Test, %struct.Test* %A,
                      i64 %index, i32 0
   %10 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %8,
                       i32 0
   %11 = bitcast i32* %9 to <8 x i32>*
   %12 = bitcast i32* %10 to <8 x i32>*
   %wide.vec = load <8 x i32>, <8 x i32>* %11, align 4, !tbaa !2
   %wide.vec60 = load <8 x i32>, <8 x i32>* %12, align 4, !tbaa !2
   %13 = getelementptr inbounds %struct.Test, %struct.Test* %B,
                       i64 %index, i32 0
   %14 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %8,
                       i32 0
   %15 = bitcast i32* %13 to <8 x i32>*
   %16 = bitcast i32* %14 to <8 x i32>*
   %wide.vec64 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !2
   %wide.vec65 = load <8 x i32>, <8 x i32>* %16, align 4, !tbaa !2
   %17 = add nsw <8 x i32> %wide.vec64, %wide.vec
   %18 = shufflevector <8 x i32> %17, <8 x i32> undef, <4 x i32>
                        <i32 0, i32 2, i32 4, i32 6>
   %19 = add nsw <8 x i32> %wide.vec65, %wide.vec60
   %20 = shufflevector <8 x i32> %19, <8 x i32> undef, <4 x i32>
                       <i32 0, i32 2, i32 4, i32 6>
   %21 = add nsw <8 x i32> %wide.vec64, %wide.vec
   %22 = shufflevector <8 x i32> %21, <8 x i32> undef, <4 x i32>
                       <i32 1, i32 3, i32 5, i32 7>
   %23 = add nsw <8 x i32> %wide.vec65, %wide.vec60
   %24 = shufflevector <8 x i32> %23, <8 x i32> undef, <4 x i32>
                       <i32 1, i32 3, i32 5, i32 7>
   %25 = getelementptr inbounds %struct.Test, %struct.Test* %C,
                       i64 %index, i32 1
   %26 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %8,
                       i32 1
   %27 = getelementptr i32, i32* %25, i64 -1
   %28 = bitcast i32* %27 to <8 x i32>*
   %29 = getelementptr i32, i32* %26, i64 -1
   %30 = bitcast i32* %29 to <8 x i32>*
   %interleaved.vec = shufflevector <4 x i32> %18, <4 x i32> %22,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
   store <8 x i32> %interleaved.vec, <8 x i32>* %28, align 4, !tbaa !2
   %interleaved.vec70 = shufflevector <4 x i32> %20, <4 x i32> %24,
       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
   store <8 x i32> %interleaved.vec70, <8 x i32>* %30, align 4, !tbaa !2
   %index.next = add i64 %index, 8
   %31 = icmp eq i64 %index.next, 1024
   br i1 %31, label %for.cond.cleanup, label %vector.body, !llvm.loop !6

Note the use of shufflevector to interleave and de-interleave the vector elements. On AArch64, we emit additional uzp1, uzp2 and st2 instructions for those.

In the case above, however, there is no need to de-interleave the loaded/stored data: we could instead keep the consecutive operands together in a single vector register and apply the operations to these compound vectors directly. As for real-world use cases, complex number arithmetic, for example, could also benefit from not interleaving.
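For the example above, an SLP-aware vector body could avoid the shuffles entirely and operate on the compound <8 x i32> vectors directly. The following is a hand-written sketch of the IR we would like to generate (register names and the factor of 4 structs per iteration are illustrative, not compiler output):

```llvm
; Sketch only: 4 structs (8 i32 lanes) per iteration, no interleaving shuffles.
vector.body:
   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
   %a.gep = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %index, i32 0
   %b.gep = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %index, i32 0
   %c.gep = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %index, i32 0
   %a.vptr = bitcast i32* %a.gep to <8 x i32>*
   %b.vptr = bitcast i32* %b.gep to <8 x i32>*
   %c.vptr = bitcast i32* %c.gep to <8 x i32>*
   %wide.a = load <8 x i32>, <8 x i32>* %a.vptr, align 4
   %wide.b = load <8 x i32>, <8 x i32>* %b.vptr, align 4
   %sum = add nsw <8 x i32> %wide.a, %wide.b       ; x and y lanes added in one go
   store <8 x i32> %sum, <8 x i32>* %c.vptr, align 4
   %index.next = add i64 %index, 4
   %done = icmp eq i64 %index.next, 1024
   br i1 %done, label %for.cond.cleanup, label %vector.body
```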

In what follows, I propose an extension to LoopVectorize that makes it aware of SLP opportunities and detects operations on "compound values" (for lack of a better term), as an extension of the interleaved access handling. This idea is described in [1].

Structure
------------------------------------------------------------------------

  1. Detect loops containing SLP opportunities (operations on compound
     values)
  2. Extend the cost model to choose between interleaving or using
     compound values
  3. Add support for vectorizing compound operations to VPlan

Hi Florian!

This proposal sounds pretty exciting! Integrating SLP-aware loop vectorization (or the other way around) and SLP into the VPlan framework is definitely aligned with the long-term vision, and we would prefer this approach to the LoopReroll and InstCombine alternatives that you mentioned. We prefer a generic implementation that can handle complicated cases over something ad hoc for some simple ones. Because of this, we have some comments regarding the design that you propose:

>   1. Detect loops containing SLP opportunities (operations on compound
>      values)
>   2. Extend the cost model to choose between interleaving or using
>      compound values
>   3. Add support for vectorizing compound operations to VPlan

Currently, VPlan is not fully integrated into all the stages of the inner loop vectorizer pipeline. For that reason, part of your implementation (#1 and #2) would happen outside of VPlan and another part (#3) would be VPlan-based. As you know, we are currently working on a new vectorization path where 1) VPlan is built upfront in the pipeline (http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html) and 2) all the vectorization stages are implemented on top of the VPlan representation. We think that your proposal is a really good candidate to be implemented in this new “VPlan-native” vectorization path. This way, we would avoid the effort of porting #1 and #2 to the final VPlan-based infrastructure, and it would give you the opportunity to get involved in the design of VPlan. It would also avoid introducing the complexity of SLP into the existing cost model and code generation, which is also a concern to consider. We should definitely talk in depth about the requirements to implement this in the new vectorization path, but we truly believe it's the best approach.

> * loops where some vectors need to be transformed, for example where
>    different operations are performed on different (groups of) lanes,
>    like A[i].x + B[i].x, A[i].y - B[i].y, which could be transformed to
>    A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
>    needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x

This kind of transformation/reordering is an example of what could be a preparatory VPlan-to-VPlan transformation that could precede and simplify the core SLP-aware analysis.

> The SLP vectorizer in LLVM already implements a similar analysis. In the long term, it would probably make sense to share the analysis between LoopVectorize and the SLP vectorizer. But initially we think it would be safer to start with a separate, more limited analysis in LoopVectorize.

As a first step, this sounds reasonable to me. If it's implemented on top of VPlan, it could be reused in a future VPlan-based SLP. We should keep this in mind for the design from the very beginning so that we maximize the reuse of the common ground of "standalone" SLP and SLP-aware loop vectorization.

> One limitation here is that we commit to either interleaving or compound vectorization when calculating the cost of the loads. Depending on the other instructions in the loop, interleaving could be beneficial overall, even though we could use compound vectorization for some operations. Initially, we could only consider SLP-style vectorization if it can be used for all instructions.

VPlan will allow the independent evaluation of multiple vectorization scenarios in the future. This seems to fit into that category.

> Add support for SLP style vectorization to VPlan
> ------------------------------------------------------------------------
> Introduce two new recipes, VPSLPMemoryRecipe and VPSLPInstructionRecipe.

We introduced VPInstructions to model masking in patch D38676. They are necessary to properly model the def-use/use-def chains in the VPlan representation, and we believe you will need to represent def-use/use-def chains for newly inserted operations, shuffles, and inserts/extracts. As such, VPInstructions would be a more appropriate representation than recipes under the long-term vision. In fact, we plan to replace some of the currently existing recipes with VPInstructions in the near future. For this reason, we think that the output of the SLP vectorization should be modelled using VPInstructions and not with new recipes.

Thanks,
Diego Caballero & Vectorizer Team.
Intel Compiler and Languages.

Hi,

> Hi Florian!
>
> This proposal sounds pretty exciting! Integrating SLP-aware loop vectorization (or the other way around) and SLP into the VPlan framework is definitely aligned with the long-term vision, and we would prefer this approach to the LoopReroll and InstCombine alternatives that you mentioned. We prefer a generic implementation that can handle complicated cases over something ad hoc for some simple ones. Because of this, we have some comments regarding the design that you propose:
>
>>    1. Detect loops containing SLP opportunities (operations on compound
>>       values)
>>    2. Extend the cost model to choose between interleaving or using
>>       compound values
>>    3. Add support for vectorizing compound operations to VPlan
>
> Currently, VPlan is not fully integrated into all the stages of the inner loop vectorizer pipeline. For that reason, part of your implementation (#1 and #2) would happen outside of VPlan and another part (#3) would be VPlan-based. As you know, we are currently working on a new vectorization path where 1) VPlan is built upfront in the pipeline (http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html) and 2) all the vectorization stages are implemented on top of the VPlan representation. We think that your proposal is a really good candidate to be implemented in this new “VPlan-native” vectorization path. This way, we would avoid the effort of porting #1 and #2 to the final VPlan-based infrastructure, and it would give you the opportunity to get involved in the design of VPlan. It would also avoid introducing the complexity of SLP into the existing cost model and code generation, which is also a concern to consider. We should definitely talk in depth about the requirements to implement this in the new vectorization path, but we truly believe it's the best approach.

Thank you very much for your detailed response!

I also think that this proposal would fit very well into the "VPlan-native" vectorization path; in particular, building and evaluating multiple plans for different strategies (interleaved and SLP-style, for example) should make the cost modelling easier in more complex scenarios. I suppose this proposal could be a good candidate to be an initial user of the new VPlan model, and I would also be happy to help out with related VPlan infrastructure work!

>>  * loops where some vectors need to be transformed, for example where
>>    different operations are performed on different (groups of) lanes,
>>    like A[i].x + B[i].x, A[i].y - B[i].y, which could be transformed to
>>    A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
>>    needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x
>
> This kind of transformation/reordering is an example of what could be a preparatory VPlan-to-VPlan transformation that could precede and simplify the core SLP-aware analysis.

Yes, in those cases I think it would be very beneficial not to commit to a single strategy too early.
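For concreteness, the add/sub transformation discussed above could look like this at the source level. This is a hypothetical sketch (the function name and the negate-then-add rewrite are illustrative, not taken from any actual patch):

```c
#include <assert.h>

struct Test {
    int x;
    int y;
};

/* Canonicalize a mixed add/sub lane pattern so both struct fields use
 * the same opcode; the pair then becomes a single compound vector add
 * after negating B's y field. */
void addsub(struct Test *A, struct Test *B, struct Test *C, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        /* original lane pattern: x uses +, y uses - */
        C[i].x = A[i].x + B[i].x;
        /* canonicalized: y also uses +, with the operand negated */
        C[i].y = A[i].y + (-B[i].y);
    }
}
```

The reorder case (A[i].x + B[i].y, A[i].y + B[i].x) could analogously be handled by a preparatory lane swap on B's compound group before the uniform add.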

>> The SLP vectorizer in LLVM already implements a similar analysis. In the long term, it would probably make sense to share the analysis between LoopVectorize and the SLP vectorizer. But initially we think it would be safer to start with a separate, more limited analysis in LoopVectorize.
>
> As a first step, this sounds reasonable to me. If it's implemented on top of VPlan, it could be reused in a future VPlan-based SLP. We should keep this in mind for the design from the very beginning so that we maximize the reuse of the common ground of "standalone" SLP and SLP-aware loop vectorization.

Agreed. The infrastructure we add here should at least make it easier to reuse parts for standalone SLP.

>> One limitation here is that we commit to either interleaving or compound vectorization when calculating the cost of the loads. Depending on the other instructions in the loop, interleaving could be beneficial overall, even though we could use compound vectorization for some operations. Initially, we could only consider SLP-style vectorization if it can be used for all instructions.
>
> VPlan will allow the independent evaluation of multiple vectorization scenarios in the future. This seems to fit into that category.

Yes, I think this would fit very well into the VPlan-native model and also allow for a modular implementation of the cost modelling.

>> Add support for SLP style vectorization to VPlan
>> ------------------------------------------------------------------------
>> Introduce two new recipes, VPSLPMemoryRecipe and VPSLPInstructionRecipe.
>
> We introduced VPInstructions to model masking in patch D38676. They are necessary to properly model the def-use/use-def chains in the VPlan representation, and we believe you will need to represent def-use/use-def chains for newly inserted operations, shuffles, and inserts/extracts. As such, VPInstructions would be a more appropriate representation than recipes under the long-term vision. In fact, we plan to replace some of the currently existing recipes with VPInstructions in the near future. For this reason, we think that the output of the SLP vectorization should be modelled using VPInstructions and not with new recipes.

Interesting, thanks for the pointer. I will have a closer look, but I think VPInstructions should be a good fit here.

Thanks,
Florian

Hi,

I am away for a week now, but after that I plan to start working on proof-of-concept patches for this RFC.

Please let me know if you have any additional comments in the meantime. I will also be at EuroLLVM, in case anyone is interested in discussing it face to face.

Cheers,
Florian

Hi,