loop vectorizer: this loop is not worth vectorizing

I am trying a setup where the one loop is rewritten as two loops. This avoids the 'rem' and 'div' instructions in the index calculation (which give the loop vectorizer a hard time).

However, with this setup the loop vectorizer complains about a too small loop.

LV: Checking a loop in "main"
LV: Found a loop: L3
LV: Found a loop with a very small trip count. This loop is not worth vectorizing.
LV: Not vectorizing.

Here the IR:

define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* noalias %arg4, float* noalias %arg5, float* noalias %arg6) {
entrypoint:
   br i1 %arg2, label %L0, label %L2

L0: ; preds = %entrypoint
   %0 = add nsw i64 %arg0, %arg3
   %1 = add nsw i64 %arg1, %arg3
   br label %L2

L2: ; preds = %entrypoint, %L0
   %2 = phi i64 [ %0, %L0 ], [ %arg0, %entrypoint ]
   %3 = phi i64 [ %1, %L0 ], [ %arg1, %entrypoint ]
   %4 = sdiv i64 %2, 4
   %5 = sdiv i64 %3, 4
   br label %L5

L3: ; preds = %L3, %L5
   %6 = phi i64 [ %21, %L3 ], [ 0, %L5 ]
   %7 = add nsw i64 %26, %6
   %8 = add nsw i64 %27, %6
   %9 = getelementptr float* %arg5, i64 %7
   %10 = load float* %9, align 4
   %11 = getelementptr float* %arg5, i64 %8
   %12 = load float* %11, align 4
   %13 = getelementptr float* %arg6, i64 %7
   %14 = load float* %13, align 4
   %15 = getelementptr float* %arg6, i64 %8
   %16 = load float* %15, align 4
   %17 = fadd float %16, %12
   %18 = fadd float %14, %10
   %19 = getelementptr float* %arg4, i64 %7
   store float %18, float* %19, align 4
   %20 = getelementptr float* %arg4, i64 %8
   store float %17, float* %20, align 4
   %21 = add nsw i64 %6, 1
   %22 = icmp sgt i64 %6, 2
   br i1 %22, label %L4, label %L3

L4: ; preds = %L3
   %23 = add nsw i64 %25, 1
   %24 = icmp slt i64 %23, %5
   br i1 %24, label %L5, label %L6

L5: ; preds = %L4, %L2
   %25 = phi i64 [ %23, %L4 ], [ %4, %L2 ]
   %26 = shl i64 %25, 3
   %27 = or i64 %26, 4
   br label %L3

L6: ; preds = %L4
   ret void
}

The L3 loop has a trip count of 4. The L5 outer loop has a variable trip count depending on the functions arguments.

I cannot make the L3 loop larger so that the vectorizer might be happy, because this will again introduce 'rem' and 'div' in the index calculation.

I am using these passes:

functionPassManager->add(llvm::createBasicAliasAnalysisPass());
       functionPassManager->add(llvm::createLICMPass());
       functionPassManager->add(llvm::createGVNPass());
       functionPassManager->add(llvm::createLoopVectorizePass());
functionPassManager->add(llvm::createInstructionCombiningPass());
       functionPassManager->add(llvm::createEarlyCSEPass());
functionPassManager->add(llvm::createCFGSimplificationPass());

I am wondering, whether there might be pass I could issue before the loop vectorizer that transforms the code so that the vectorizer is happy. I am wondering because coming from a C function which tries to mimic the above IR

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start/inner ; i < end/inner ; i++ ) {
     for (std::uint64_t q = 0 ; q < inner ; q++ ) {
       const std::uint64_t ir0 = ( i * 2 + 0 ) * inner + q;
       const std::uint64_t ir1 = ( i * 2 + 1 ) * inner + q;

       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
     }
   }
}

the loop vectorizer complains as well, but the produced code is vectorized:

LV: Checking a loop in "_Z3barmmPfS_S_"
LV: Found a loop: for.body4
LV: Found an induction variable.
LV: Found unvectorizable type.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing.

; Function Attrs: nounwind uwtable
define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias %c, float* noalias %a, float* noalias %b) #3 {
entry:
   %div = lshr i64 %start, 2
   %div1 = lshr i64 %end, 2
   %cmp9 = icmp ult i64 %div, %div1
   br i1 %cmp9, label %for.body4.preheader, label %for.end20

for.body4.preheader: ; preds = %entry
   br label %for.body4

for.body4: ; preds = %for.body4.preheader, %for.body4
   %storemerge10 = phi i64 [ %inc19, %for.body4 ], [ %div, %for.body4.preheader ]
   %mul5 = shl i64 %storemerge10, 3
   %add82 = or i64 %mul5, 4
   %arrayidx = getelementptr inbounds float* %a, i64 %mul5
   %arrayidx11 = getelementptr inbounds float* %b, i64 %mul5
   %arrayidx13 = getelementptr inbounds float* %c, i64 %mul5
   %arrayidx14 = getelementptr inbounds float* %a, i64 %add82
   %arrayidx15 = getelementptr inbounds float* %b, i64 %add82
   %arrayidx17 = getelementptr inbounds float* %c, i64 %add82
   %0 = bitcast float* %arrayidx to <4 x float>*
   %1 = load <4 x float>* %0, align 4
   %2 = bitcast float* %arrayidx11 to <4 x float>*
   %3 = load <4 x float>* %2, align 4
   %4 = fadd <4 x float> %1, %3
   %5 = bitcast float* %arrayidx13 to <4 x float>*
   store <4 x float> %4, <4 x float>* %5, align 4
   %6 = bitcast float* %arrayidx14 to <4 x float>*
   %7 = load <4 x float>* %6, align 4
   %8 = bitcast float* %arrayidx15 to <4 x float>*
   %9 = load <4 x float>* %8, align 4
   %10 = fadd <4 x float> %7, %9
   %11 = bitcast float* %arrayidx17 to <4 x float>*
   store <4 x float> %10, <4 x float>* %11, align 4
   %inc19 = add i64 %storemerge10, 1
   %cmp = icmp ult i64 %inc19, %div1
   br i1 %cmp, label %for.body4, label %for.end20.loopexit

for.end20.loopexit: ; preds = %for.body4
   br label %for.end20

for.end20: ; preds = %for.end20.loopexit, %entry
   ret void
}

But here the vectorization must have happened before. It's starting to get frustrating.

Frank

In the case when coming from C it was probably the loop unroller and SLP vectorizer which vectorized the code. Potentially I could do the same in the IR. However, the loop body that is generated in the IR can get very large. Thus, the loop unroller will refuse to unroll the loop in a large number of (important) cases.

Isn't there a way to convince the loop vectorizer that it should vectorize a loop even when its trip count equals the SIMD vector length?

Frank

You can control the small trip count threshold using the command line option -mllvm -vectorizer-min-trip-count=XXX. At the moment we don’t detect loops that can be completely vectorized, but it would be a nice feature to add.

Thanks,
Nadav