Trip count and Loop Vectorizer

Hi,

I am trying to get a small loop to not vectorize for cases where it doesn’t make sense. For instance, this loop:

void foo(int a[4][8], int n)

{

int b[4][8];

for(int i = 0; i < 4; i++) {

for(int j = 0; j < n; j++) {

a[i][j] = b[i][j];

}

}

}

  • Has maximum of 8ints copy. LLVM tries to use Memcpy for the inner loop. It is not helpful to perform memcpy for such small moves, especially when the outer loop is unrolled since the trip count is constant (4). The 4 calls to memcpy is not efficient.

  • Therefore, I disabled the memcpy optimization for such cases, and found that LLVM LoopVectorizer successfully vectorizes and unrolls the inner loop. However, in order to take the fast path (vmovups) it must copy at least 32 ints, where as in this case we only do an 8int copy.

** Upon closer look, LoopVectorizer obtains the TripCount for the innerloop using getSmallConstantTripCount(Loop,…). This value is 0 for the loop with unknown trip count. Loop unrolling is disabled when TC > 0. Should this be changed to TC >= 0 (which does the job for this testcase)? Or is there a better way to disable loop unrolling for such trivial loops, at least the ones with known array size?

Thanks for your feedback

Sriram

Hi Sriram,

Thanks for performing this analysis. The problem here, both for memcpy and the vectorizer, is that we can’t predict the size of “n”, even though the only use of ’n’ is for the loop bound for the alloca [4 x [8 x i32]]. If you change the unroll condition to TC >= 0 then you will disable loop unrolling for all loops because getSmallConstantTripCount returns an unsigned number. You can control the unroll factor using metadata or using the command line tools.

Thanks,
Nadav

Hi Nadav,

Thanks for the response. I forgot to mention that there is an upper limit of 16 for the Trip Count check,

TinyTripCountVectorThreshold = 16;

if (TC > 0u && TC < TinyTripCountVectorThreshold). So right now, any loop with Trip Count as 0, or with value >=16, LV with unroll. With the change to the lower bound, it will also include the loop with 0 trip count.

SCEV returns 0 trip count for this case, because it identifies that there is no backedge taken.

ScalarEvolution::ComputeExitLimitFromCond () {

if (L->contains(FBB) == !CI->getZExtValue())

{ }

else

// The backedge is never taken.

return getConstant(CI->getType(), 0);

}

Sriram,

The problem is that you want to unroll/vectorize many loops with non-constant loop count - it is a trade-off of which case you estimate as more likely.

int foo(int *ptr, int n) {
  for ( .. i <n)
    ptr[i] = ...
}

The question is: is it more likely to have “n” such that unrolling is beneficial or not.

Now, you could probably write an analysis that bounds the loop count (for the purpose of this heuristic) based on the only possible legal access in the loop. In your example you have an access to an alloca of which the size is known, so you could infer that n must be smaller than 8 (because you know the range of the other dimension). The question is how often does such an example occur, where this is possible, to make such an effort justifiable?

Best,
Arnold

smaller equal, of course :wink:

Hey Arnold,
I have run into this situation many times while benchmarking.
I think it is best if this is addressed using a simple heuristic. For that, we need to identify the loop cost and decide if it makes sense to completely unroll the loop, or partially unroll. I am unsure of the optimal way to implement this though.

I want to run it by the list to get any ideas floating around :slight_smile:
Thanks
Sriram

We have frame work for simple heuristics in place:

1.) We vectorize loops with unknown trip count.
2.) You can find our partial unroll heuristic in selectUnrollFactor. It probably needs tuning.

  // We unroll the loop in order to expose ILP and reduce the loop overhead.
  // There are many micro-architectural considerations that we can't predict
  // at this level. For example frontend pressure (on decode or fetch) due to
  // code size, or the number and capabilities of the execution ports.
  //
  // We use the following heuristics to select the unroll factor:
  // 1. If the code has reductions the we unroll in order to break the cross
  // iteration dependency.
  // 2. If the loop is really small then we unroll in order to reduce the loop
  // overhead.

   <<< This is the heuristic that works against your example.
    
  // 3. We don't unroll if we think that we will spill registers to memory due
  // to the increased register pressure.