Loop vectorizer preamble

Hi Arnold, Nadav,

I’ve been taking a look at the preamble and bailout tests created by the loop vectorizer, and I can’t help but feel it could be rather conservative. I’m not a vectorization expert, so I apologise in advance if say something obviously wrong…

I’m looking in particular at the overflow check and the trip count computation. From my reading, it goes something like:

take the backedge taken count and add one → Count
emit code to check Count didn’t overflow
// pointer aliasing checks, if any
calculate vector trip count = Count - (Count % Step)

It seems to me that there should be cases when we don’t need to check for overflow. In a well-formed loop, which this should be at this point, there is an increment of the indvar before the backedge. If this increment is marked ‘nuw’, we should be guaranteed that we don’t get an overflow when calculating numBackedges + 1.

Also, many many loops don’t have a single point-test for exit (x != 0). Instead, they have a greater-than or less-than condition. If this is the case, we should be able to elide all of our logic with Count and just count down until the test is broken. For example:

for (i = 0; i < n; ++i)

count = 0
loop:

count += VF * UF
if count >= n goto scalar_loop else goto loop

This could remove a lot of overflow checks and "urem"s from real code.

Also, we don’t currently coalesce overflow checks, vector memchecks and trip count computation for adjacent and similar loops. For example:

for (i = 0; i < n/2; ++i)
p[i] = q[i+1];
for (i = n/2; i < n; ++i)
p[i] = q[i-1];

Really, these two loops should share a common preamble which checks between the range [0, n). Now, they have two preambles, one checking [0, n/2) and the other [n/2, n).

Sorry for the braindump, and for probably missing many issues!

Cheers,

James

Hi James,

thank you for looking into this!

Hi Arnold, Nadav,

I've been taking a look at the preamble and bailout tests created by the loop vectorizer, and I can't help but feel it could be rather conservative. I'm not a vectorization expert, so I apologise in advance if say something obviously wrong...

I'm looking in particular at the overflow check and the trip count computation. From my reading, it goes something like:

  take the backedge taken count and add one -> Count
  emit code to check Count didn't overflow
  // pointer aliasing checks, if any
  calculate vector trip count = Count - (Count % Step)

It seems to me that there should be cases when we don't need to check for overflow. In a well-formed loop, which this should be at this point, there is an increment of the indvar before the backedge. If this increment is marked 'nuw', we should be guaranteed that we don't get an overflow when calculating numBackedges + 1.

Lets say we have a loop that iterates 256 times (if n == 0):

unsigned char i = n;
do {

  —i;
} while (i);

We need to guard against such cases (see also PR17288).

However, I agree that by looking at the IR (in addition to the SCEV expression for the back-edge) we should be able to do better in some cases.

Also, many many loops don't have a single point-test for exit (x != 0). Instead, they have a greater-than or less-than condition. If this is the case, we should be able to elide all of our logic with Count and just count down until the test is broken. For example:

for (i = 0; i < n; ++i)
  ...

->

count = 0
loop:
  ...
  count += VF * UF
  if count >= n goto scalar_loop else goto loop

This could remove a lot of overflow checks and "urem"s from real code.

This would overflow for say n = 255, VF = 2, char ?

start = 0

loop:
  i = phi (start, next)
  next = + (i, VF)
  stop = >= (n, next)
  br stop, exit, loop

exit

Also, we don't currently coalesce overflow checks, vector memchecks and trip count computation for adjacent and similar loops. For example:

  for (i = 0; i < n/2; ++i)
    p[i] = q[i+1];
  for (i = n/2; i < n; ++i)
    p[i] = q[i-1];

Really, these two loops should share a common preamble which checks between the range [0, n). Now, they have two preambles, one checking [0, n/2) and the other [n/2, n).

You are right currently we only look at one loop at a time.

Hi Arnold,

Thanks for the reply. Taking your example:

This would overflow for say n = 255, VF = 2, char ?
start = 0
loop:
i = phi (start, next)
next = + (i, VF)
stop = >= (n, next)
br stop, exit, loop
exit

Why do we have to use i8 for the induction variable type? If the original add is no-unsigned-wrap, then we could safely extend the induction variable to i32/i64 and have an accurate >= comparison.

We couldn’t do this with an i64 type because there’d be nothing to extend it to, but in practice people still write loops with 32-bit variables a lot of the time.

Cheers,

James

I agree, and I did something similar to this for types < i32 (r195008) - secretly hoping nobody would actually write loops that run 2^32 times but I got caught - I got this bug report :wink: :http://llvm.org/bugs/show_bug.cgi?id=19846
One could do something similar for i32 types on i64 archs.

You can promote from i32 to i64 and specialize the loop skeleton the way you describe if you are on an i64 arch (or from i16 to i32 if you are on an i32 arch). Keep in mind that the induction variable / vector-scalar loop skeleton is already quite complex and any extension to it should simplify/refactor the current logic.