loop vectorizer: this loop is not worth vectorizing

fwinter · November 1, 2013, 3:27am

I am trying a setup where a single loop is rewritten as two nested loops. This avoids the 'rem' and 'div' instructions in the index calculation, which give the loop vectorizer a hard time.

However, with this setup the loop vectorizer complains that the loop is too small:

LV: Checking a loop in "main"
LV: Found a loop: L3
LV: Found a loop with a very small trip count. This loop is not worth vectorizing.
LV: Not vectorizing.

Here is the IR:

define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* noalias %arg4, float* noalias %arg5, float* noalias %arg6) {
entrypoint:
   br i1 %arg2, label %L0, label %L2

L0: ; preds = %entrypoint
   %0 = add nsw i64 %arg0, %arg3
   %1 = add nsw i64 %arg1, %arg3
   br label %L2

L2: ; preds = %entrypoint, %L0
   %2 = phi i64 [ %0, %L0 ], [ %arg0, %entrypoint ]
   %3 = phi i64 [ %1, %L0 ], [ %arg1, %entrypoint ]
   %4 = sdiv i64 %2, 4
   %5 = sdiv i64 %3, 4
   br label %L5

L3: ; preds = %L3, %L5
   %6 = phi i64 [ %21, %L3 ], [ 0, %L5 ]
   %7 = add nsw i64 %26, %6
   %8 = add nsw i64 %27, %6
   %9 = getelementptr float* %arg5, i64 %7
   %10 = load float* %9, align 4
   %11 = getelementptr float* %arg5, i64 %8
   %12 = load float* %11, align 4
   %13 = getelementptr float* %arg6, i64 %7
   %14 = load float* %13, align 4
   %15 = getelementptr float* %arg6, i64 %8
   %16 = load float* %15, align 4
   %17 = fadd float %16, %12
   %18 = fadd float %14, %10
   %19 = getelementptr float* %arg4, i64 %7
   store float %18, float* %19, align 4
   %20 = getelementptr float* %arg4, i64 %8
   store float %17, float* %20, align 4
   %21 = add nsw i64 %6, 1
   %22 = icmp sgt i64 %6, 2
   br i1 %22, label %L4, label %L3

L4: ; preds = %L3
   %23 = add nsw i64 %25, 1
   %24 = icmp slt i64 %23, %5
   br i1 %24, label %L5, label %L6

L5: ; preds = %L4, %L2
   %25 = phi i64 [ %23, %L4 ], [ %4, %L2 ]
   %26 = shl i64 %25, 3
   %27 = or i64 %26, 4
   br label %L3

L6: ; preds = %L4
   ret void
}

The L3 loop has a trip count of 4. The outer loop (L5) has a variable trip count that depends on the function's arguments.

I cannot make the L3 loop larger to satisfy the vectorizer, because that would reintroduce 'rem' and 'div' in the index calculation.
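To make the index arithmetic concrete, here is a minimal sketch. The single-loop form is an assumption, since the original single loop is not shown in this post, and `flat_index`, `nested_index` and `check_indices` are hypothetical helper names. It only illustrates that a flat counter needs a division and a remainder to recover the pair (i, q), while the nested form uses the loop counters directly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-loop form (assumed, not taken from the post):
// a flat counter j must be split with a division and a remainder.
constexpr std::uint64_t flat_index(std::uint64_t j, std::uint64_t inner,
                                   std::uint64_t half)
{
    return ((j / inner) * 2 + half) * inner + (j % inner);
}

// Two-loop form: i and q are the loop counters themselves,
// so no division or remainder is needed.
constexpr std::uint64_t nested_index(std::uint64_t i, std::uint64_t q,
                                     std::uint64_t inner, std::uint64_t half)
{
    return (i * 2 + half) * inner + q;
}

// Asserts that both forms produce the same indices for every (i, q).
void check_indices(std::uint64_t outer_trips, std::uint64_t inner)
{
    for (std::uint64_t i = 0; i < outer_trips; ++i) {
        for (std::uint64_t q = 0; q < inner; ++q) {
            const std::uint64_t j = i * inner + q;  // flat counter
            assert(flat_index(j, inner, 0) == nested_index(i, q, inner, 0));
            assert(flat_index(j, inner, 1) == nested_index(i, q, inner, 1));
        }
    }
}
```

Calling `check_indices(8, 4)` walks eight outer iterations with the inner trip count of 4 used here. With `half` set to 0 or 1 the two helpers correspond to the two index expressions that appear in the inner loop.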

I am using these passes:

functionPassManager->add(llvm::createBasicAliasAnalysisPass());
functionPassManager->add(llvm::createLICMPass());
functionPassManager->add(llvm::createGVNPass());
functionPassManager->add(llvm::createLoopVectorizePass());
functionPassManager->add(llvm::createInstructionCombiningPass());
functionPassManager->add(llvm::createEarlyCSEPass());
functionPassManager->add(llvm::createCFGSimplificationPass());

I am wondering whether there is a pass I could run before the loop vectorizer that transforms the code so that the vectorizer accepts it. I ask because, when coming from a C function that tries to mimic the above IR

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start/inner ; i < end/inner ; i++ ) {
     for (std::uint64_t q = 0 ; q < inner ; q++ ) {
       const std::uint64_t ir0 = ( i * 2 + 0 ) * inner + q;
       const std::uint64_t ir1 = ( i * 2 + 1 ) * inner + q;

       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
     }
   }
}

the loop vectorizer complains as well, but the produced code is vectorized:

LV: Checking a loop in "_Z3barmmPfS_S_"
LV: Found a loop: for.body4
LV: Found an induction variable.
LV: Found unvectorizable type.
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing.

; Function Attrs: nounwind uwtable
define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias %c, float* noalias %a, float* noalias %b) #3 {
entry:
   %div = lshr i64 %start, 2
   %div1 = lshr i64 %end, 2
   %cmp9 = icmp ult i64 %div, %div1
   br i1 %cmp9, label %for.body4.preheader, label %for.end20

for.body4.preheader: ; preds = %entry
   br label %for.body4

for.body4: ; preds = %for.body4.preheader, %for.body4
   %storemerge10 = phi i64 [ %inc19, %for.body4 ], [ %div, %for.body4.preheader ]
   %mul5 = shl i64 %storemerge10, 3
   %add82 = or i64 %mul5, 4
   %arrayidx = getelementptr inbounds float* %a, i64 %mul5
   %arrayidx11 = getelementptr inbounds float* %b, i64 %mul5
   %arrayidx13 = getelementptr inbounds float* %c, i64 %mul5
   %arrayidx14 = getelementptr inbounds float* %a, i64 %add82
   %arrayidx15 = getelementptr inbounds float* %b, i64 %add82
   %arrayidx17 = getelementptr inbounds float* %c, i64 %add82
   %0 = bitcast float* %arrayidx to <4 x float>*
   %1 = load <4 x float>* %0, align 4
   %2 = bitcast float* %arrayidx11 to <4 x float>*
   %3 = load <4 x float>* %2, align 4
   %4 = fadd <4 x float> %1, %3
   %5 = bitcast float* %arrayidx13 to <4 x float>*
   store <4 x float> %4, <4 x float>* %5, align 4
   %6 = bitcast float* %arrayidx14 to <4 x float>*
   %7 = load <4 x float>* %6, align 4
   %8 = bitcast float* %arrayidx15 to <4 x float>*
   %9 = load <4 x float>* %8, align 4
   %10 = fadd <4 x float> %7, %9
   %11 = bitcast float* %arrayidx17 to <4 x float>*
   store <4 x float> %10, <4 x float>* %11, align 4
   %inc19 = add i64 %storemerge10, 1
   %cmp = icmp ult i64 %inc19, %div1
   br i1 %cmp, label %for.body4, label %for.end20.loopexit

for.end20.loopexit: ; preds = %for.body4
   br label %for.end20

for.end20: ; preds = %for.end20.loopexit, %entry
   ret void
}

But here the vectorization must have happened before the loop vectorizer ran. It's starting to get frustrating.

Frank

fwinter · November 1, 2013, 3:41am

When coming from C, it was probably the loop unroller and the SLP vectorizer that vectorized the code. I could potentially do the same in the IR. However, the loop body generated in the IR can get very large, so the loop unroller will refuse to unroll the loop in a large number of (important) cases.
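As an illustration of that route, here is a rough sketch of the C function from the first post with the inner loop unrolled by hand, next to the original nest for comparison. The names `bar_nested` and `bar_unrolled` are hypothetical. Whether the SLP vectorizer actually packs the straight-line body depends on the LLVM version and the pass pipeline, so treat this as a sketch rather than a guaranteed recipe.

```cpp
#include <cassert>
#include <cstdint>

// Reference: the same loop nest as the C function in the first post.
void bar_nested(std::uint64_t start, std::uint64_t end,
                float* __restrict__ c, float* __restrict__ a,
                float* __restrict__ b)
{
    const std::uint64_t inner = 4;
    for (std::uint64_t i = start / inner; i < end / inner; i++) {
        for (std::uint64_t q = 0; q < inner; q++) {
            const std::uint64_t ir0 = (i * 2 + 0) * inner + q;
            const std::uint64_t ir1 = (i * 2 + 1) * inner + q;
            c[ir0] = a[ir0] + b[ir0];
            c[ir1] = a[ir1] + b[ir1];
        }
    }
}

// Inner loop (trip count 4) unrolled by hand. The body becomes
// straight-line code over consecutive elements, which is the shape
// an SLP-style vectorizer looks for.
void bar_unrolled(std::uint64_t start, std::uint64_t end,
                  float* __restrict__ c, float* __restrict__ a,
                  float* __restrict__ b)
{
    const std::uint64_t inner = 4;
    for (std::uint64_t i = start / inner; i < end / inner; i++) {
        const std::uint64_t base0 = (i * 2 + 0) * inner;
        const std::uint64_t base1 = (i * 2 + 1) * inner;
        c[base0 + 0] = a[base0 + 0] + b[base0 + 0];
        c[base0 + 1] = a[base0 + 1] + b[base0 + 1];
        c[base0 + 2] = a[base0 + 2] + b[base0 + 2];
        c[base0 + 3] = a[base0 + 3] + b[base0 + 3];
        c[base1 + 0] = a[base1 + 0] + b[base1 + 0];
        c[base1 + 1] = a[base1 + 1] + b[base1 + 1];
        c[base1 + 2] = a[base1 + 2] + b[base1 + 2];
        c[base1 + 3] = a[base1 + 3] + b[base1 + 3];
    }
}
```

When generating IR directly, the equivalent would be to emit the unrolled body from the code generator instead of relying on the loop unroller, which sidesteps the unroller's size limits but does not shrink the body itself.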

Isn't there a way to convince the loop vectorizer that it should vectorize a loop even when its trip count equals the SIMD vector length?

Frank

Nadav_Rotem1 · November 1, 2013, 6:03am

You can control the small trip count threshold using the command line option -mllvm -vectorizer-min-trip-count=XXX. At the moment we don’t detect loops that can be completely vectorized, but it would be a nice feature to add.

Thanks,
Nadav
