[LoopVectorizer] Missed vectorization opportunities caused by sext/zext operations

Hi,

This is somewhat similar to the previous thread regarding missed vectorization

opportunities (http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084765.html),

but maybe different enough to require a new thread.

I’m seeing some missed vectorization opportunities in the loop vectorizer because SCEV

is not able to fold sext/zext expressions into recurrence expressions (AddRecExpr).

This can manifest in multiple ways:

  • We cannot get the back-edges taken count since SCEV because we may have something like (sext (1,+1))

which we can’t evaluate as it can overflow

  • We cannot get SCEV AddRec expressions for pointers which need runtime checks, and the

loop vectorizer fails with a “Can’t vectorize due to memory conflicts” error.

I think there are two cases:

  1. It would be possible for SCEV to prove that it is safe to fold the sext/zext nodes into an AddRec

expression, but this doesn’t happen because either nsw/nuw flags have been lost or the code

to make the inference of nsw/nuw flags in some particular case is missing

  1. It is actually possible for some operations to overflow, so folding sext/zext nodes into AddRec

expressions would be incorrect.

Here is an example where we fail to get the number of back-edge branches taken because of sext/zext

operations:

void test0(unsigned short a, unsigned short * in, unsigned short * out) {

for (unsigned short w = 1; w < a - 1; w++) //this will never overflow

out[w] = in[w+7] * 2;

}

In there anyone working on improving the 1) aspect of SCEV? If so, maybe some coordination of effort

might be a good idea.

Since the issue seems to be that certain operations can overflow and SCEV cannot properly reason about

overflows and extend operations, would it make more sense to try and:

  • Promote values that go into the trip count calculation and memory access indices to the smallest type

which would remove sext/zext/trunc operations from the loop body. This should remove the sext/zext

issue, as SCEV wouldn’t have to deal with these operations.

  • Add nsw/nuw flags where necessary

  • Add runtime checks (outside the loop) to detect overflows in the original loop

Would there be any fundamental issue with this approach? I think it would it be preferable to point fixes

for case 1), so if anyone is working on something similar it would be good to know.

Thanks,

Silviu

1) It would be possible for SCEV to prove that it is safe to fold the
sext/zext nodes into an AddRec expression, but this doesn’t happen because either nsw/nuw flags have been
lost or the code

Hi Silviu,

It's possible that this is something I spotted two years ago while
working on the stride vectorizer draft. I believe the flag loss is
still in there.

2) It is actually possible for some operations to overflow, so folding
sext/zext nodes into AddRec expressions would be incorrect.

Exactly. We need to keep the flags as much as possible.

In there anyone working on improving the 1) aspect of SCEV? If so, maybe
some coordination of effort might be a good idea.

I haven't heard of anyone so far.

- Promote values that go into the trip count calculation and memory
access indices to the smallest type
which would remove sext/zext/trunc operations from the loop body. This
should remove the sext/zext
issue, as SCEV wouldn’t have to deal with these operations.

You mean demote the range specifier (a in your example above), to a
more constrained type than the induction variable, right?

If possible, this would help static analysis of loop bounds and not
require run time checks.

- Add nsw/nuw flags where necessary

Or at least make sure they're not removed. The cases I've seen had
then in O0 but not in O3, but I haven't gone through to see (or I
can't remember) which pass removed them.

- Add runtime checks (outside the loop) to detect overflows in the
original loop

I think we do some of that. Maybe the two steps above will make the
little we have today work on most cases already.

cheers,
--renato

For

void test0(unsigned short a, unsigned short * in, unsigned short * out) {
  for (unsigned short w = 1; w < a - 1; w++) //this will never overflow
      out[w] = in[w+7] * 2;
}

I think it will be sufficient to add a couple of new cases to
ScalarEvolution::HowManyLessThans --

  zext(A) ult zext(B) == A ult B
  sext(A) slt sext(B) == A slt B

Currently it bails out if it sees a non-add recurrence on the LHS of
the ult/slt.

You could also teach ScalarEvolution::isImpliedCondOperands the
following:

  zext(A) ult zext(B) <=> A ult B
  sext(A) slt sext(B) <=> A slt B

to get more aggressive promotion from ext{A,+,B} to {ext A,+,ext B}.

I'll be happy to review these patches.

-- Sanjoy

Hi Renato, Sanjoy,

void test0(unsigned short a, unsigned short * in, unsigned short * out) {
  for (unsigned short w = 1; w < a - 1; w++) //this will never overflow
      out[w] = in[w+7] * 2;
}

Turns out this can actually overflow for a = 0, and in this case would
be an infinite loop (end condition would be w < MAX_USHORT). Run-time
checks could still be added here:

if (a > 0)
  do vectorized version
else
  default to scalar version

I think it will be sufficient to add a couple of new cases to
ScalarEvolution::HowManyLessThans --

  zext(A) ult zext(B) == A ult B
  sext(A) slt sext(B) == A slt B

I had a go at this but realized that the problem above would affect
at least the ult case (we need to have B < max(type(B)) for it to hold),
and so I gave up (at least temporarily). I think this should still be
possible, but some range checking might be required.

> - Promote values that go into the trip count calculation and memory
> access indices to the smallest type
> which would remove sext/zext/trunc operations from the loop body. This
> should remove the sext/zext issue, as SCEV wouldn’t have to deal with
> these operations.

You mean demote the range specifier (a in your example above), to a more
constrained type than the induction variable, right?

If possible, this would help static analysis of loop bounds and not require run
time checks.

I was thinking about doing something like promoting the
induction variable (in my case w) to a wider type (int), but there might be
some problems with that. I think we need SCEV working to be able to reason
about when that would be safe, and even if we could do that, there is no
guarantee that this would give us any benefit - we might end up
increasing the code size if vectorization wouldn't be possible
even after that transformation.

> - Add runtime checks (outside the loop) to detect overflows in the
> original loop

I think we do some of that. Maybe the two steps above will make the little
we have today work on most cases already.

I think we're only checking for overflows in the trip count variable that's
being generated by the loop vectorizer. So any potential overflows in the
actual loop would make us bail out in the legality check part.

I'm currently trying to work around this issue by teaching SCEV
to be able to optimistically fold sext/zext expressions while
accumulating the assumptions that it has made doing so (which
need to be checked at run-time). This allows us to gather all the
assumptions in the vectorization legality check of the loop vectorizer
and even allows us to go through the cost model without modifying
the IR.

This would of course only happen if we would otherwise abort
because we didn't get a SCEV expression that we want to be
an AddRecExpr.

This seems to work at least on a toy example (it can work around
issues with the number of back-edges taken and memory checks),
so I hope to have a patch sent out for RFC soon.

Thanks,
Silviu

Hi Silviu,

void test0(unsigned short a, unsigned short * in, unsigned short * out) {
  for (unsigned short w = 1; w < a - 1; w++) //this will never overflow
      out[w] = in[w+7] * 2;
}

Turns out this can actually overflow for a = 0, and in this case would
be an infinite loop (end condition would be w < MAX_USHORT). Run-time
checks could still be added here:

Maybe I'm missing something here, but for a = 0 this will run exactly
(MAX_USHORT - 1) iterations, it will stop once w is MAX_USHORT.

In general, if your loop condition is "I u< N" for any N, I++ cannot
unsigned overflow since the only value of I for which I++ overflows is
MAX_USHORT (= -1) and MAX_USHORT will not be u< N for any N. A
similar fact follows for "I s< N" and sign overflow in I++.

I think it will be sufficient to add a couple of new cases to
ScalarEvolution::HowManyLessThans --

  zext(A) ult zext(B) == A ult B
  sext(A) slt sext(B) == A slt B

I had a go at this but realized that the problem above would affect
at least the ult case (we need to have B < max(type(B)) for it to hold),

Why? AFAICT, the above two facts hold for all values of A and B,
including INT_MAX, INT_MIN, INT_UMAX etc.

-- Sanjoy

Hi Sanjoy,

Yes, I was completely wrong about that, sorry.

I was probably thinking about the ule/sle cases and
the case where the increment can be something different
then 1. These can also be handled, but we would need
to test the range of whatever we're comparing against.

Thanks,
Silviu