Hello,
We would like to propose making LoopVectorize aware of SLP operations, to improve the generated code for loops operating on struct fields or doing complex math.
Currently, LoopVectorize uses interleaving to vectorize loops that operate on values loaded from/stored to consecutive addresses: it generates wide vector loads/stores covering the consecutive locations, and then uses shufflevector instructions to de-interleave/interleave the loaded/stored values. However, we fail to detect cases where the same operations are applied to all consecutive values, so that no interleaving is actually needed. To illustrate this, consider the following example loop:
struct Test {
  int x;
  int y;
};

void add(struct Test *A, struct Test *B, struct Test *C) {
  for (unsigned i = 0; i < 1024; i++) {
    C[i].x = A[i].x + B[i].x;
    C[i].y = A[i].y + B[i].y;
  }
}
On X86, we do not vectorize this loop and on AArch64, we generate the following code for the vector body:
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %8 = or i64 %index, 4
  %9 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %index, i32 0
  %10 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %8, i32 0
  %11 = bitcast i32* %9 to <8 x i32>*
  %12 = bitcast i32* %10 to <8 x i32>*
  %wide.vec = load <8 x i32>, <8 x i32>* %11, align 4, !tbaa !2
  %wide.vec60 = load <8 x i32>, <8 x i32>* %12, align 4, !tbaa !2
  %13 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %index, i32 0
  %14 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %8, i32 0
  %15 = bitcast i32* %13 to <8 x i32>*
  %16 = bitcast i32* %14 to <8 x i32>*
  %wide.vec64 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !2
  %wide.vec65 = load <8 x i32>, <8 x i32>* %16, align 4, !tbaa !2
  %17 = add nsw <8 x i32> %wide.vec64, %wide.vec
  %18 = shufflevector <8 x i32> %17, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %19 = add nsw <8 x i32> %wide.vec65, %wide.vec60
  %20 = shufflevector <8 x i32> %19, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %21 = add nsw <8 x i32> %wide.vec64, %wide.vec
  %22 = shufflevector <8 x i32> %21, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %23 = add nsw <8 x i32> %wide.vec65, %wide.vec60
  %24 = shufflevector <8 x i32> %23, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %25 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %index, i32 1
  %26 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %8, i32 1
  %27 = getelementptr i32, i32* %25, i64 -1
  %28 = bitcast i32* %27 to <8 x i32>*
  %29 = getelementptr i32, i32* %26, i64 -1
  %30 = bitcast i32* %29 to <8 x i32>*
  %interleaved.vec = shufflevector <4 x i32> %18, <4 x i32> %22, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  store <8 x i32> %interleaved.vec, <8 x i32>* %28, align 4, !tbaa !2
  %interleaved.vec70 = shufflevector <4 x i32> %20, <4 x i32> %24, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  store <8 x i32> %interleaved.vec70, <8 x i32>* %30, align 4, !tbaa !2
  %index.next = add i64 %index, 8
  %31 = icmp eq i64 %index.next, 1024
  br i1 %31, label %for.cond.cleanup, label %vector.body, !llvm.loop !6
Note the shufflevector instructions used to de-interleave and re-interleave the vector elements; on AArch64, these result in additional uzp1, uzp2 and st2 instructions.
In the case above, however, there is no need to de-interleave the loaded/stored data: we could instead keep the consecutive operands mixed together in a single vector register and apply the operations directly to those vectors of compound operands. As a real-world use case, complex number arithmetic, for example, could also benefit from not interleaving.
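To make the goal concrete: because the same add is applied to each field, the whole vector body could collapse to one wide load per input and a single wide add, with no shuffles. A hand-written sketch of such a body follows (not actual compiler output; the value names are mine):

```
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %pa = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %index, i32 0
  %pb = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %index, i32 0
  %pc = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %index, i32 0
  %va = bitcast i32* %pa to <8 x i32>*
  %vb = bitcast i32* %pb to <8 x i32>*
  %vc = bitcast i32* %pc to <8 x i32>*
  %a = load <8 x i32>, <8 x i32>* %va, align 4
  %b = load <8 x i32>, <8 x i32>* %vb, align 4
  %sum = add nsw <8 x i32> %a, %b            ; x and y lanes handled together
  store <8 x i32> %sum, <8 x i32>* %vc, align 4
  %index.next = add i64 %index, 4            ; 4 structs = 8 ints per iteration
  %done = icmp eq i64 %index.next, 1024
  br i1 %done, label %for.cond.cleanup, label %vector.body
```

Each <8 x i32> here holds 4 {x, y} pairs in memory order, so the single add performs both field additions at once.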
In what follows, I propose an extension to LoopVectorize to make it aware of SLP opportunities and detect operations on "compound values" (for lack of a better term), as an extension of the existing interleaved access handling. This idea is described in [1].
Structure