Missed vectorization opportunities?

Hi,

I am trying to understand the limitations of the current vectorizer, and came upon these test cases that fail to get vectorized.

  1. loop1 below (note the increment by 2) fails to get vectorized because the access a[j+1] is considered to wrap around (the corresponding SCEV doesn’t have nsw/nuw set), and hence isStridedPtr() in LoopAccessAnalysis.cpp returns false. (A rewrite that makes the two stride-2 access streams explicit is sketched below, after this message.)

#define SIZE 100000
void loop1 (float a[SIZE])
{
  long t, j;
  float cc = 23, dd = 34, ee = 233;
#pragma clang loop vectorize(enable)
  for (j = 1; j < SIZE-1; j += 2) {
    float x;
    x = a[j+1] + ee;
    x = x / dd;
    x = (cc + x) / dd;
    a[j] = x + 2;
  }
}

  2. Of the two loops below, the loop “works” gets vectorized, while the loop “fails” does not get vectorized.

    #define SIZE 10000
    int a[SIZE+4] = {};
    void works(unsigned long m, unsigned long n)
    {
      long j;
      // SCEV can compute the exit value for this loop
      for (j = m; j < n+1; j += 1) {
        a[j] = a[j+1] + 2;
      }
    }
    void fails(unsigned long m, unsigned long n)
    {
      long j;
      // SCEV cannot compute the exit value for this loop
      for (j = m; j <= n; j += 1) {
        a[j] = a[j+1] + 2;
      }
    }

The only difference between the two loops is the loop exit condition; the two are semantically the same. Scalar Evolution is unable to determine the exit value of the second loop, which leads to the vectorizer failing to vectorize it. It seemed to me that this is due to ScalarEvolution::SimplifyICmpOperands failing to canonicalize the corresponding ICMP_ULE instruction.

Thanks,

- Vaivaswatha
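
To spell out the access pattern in the first case, here is a functionally equivalent rewrite of loop1 (only a sketch; it has not been run through the vectorizer) that separates the stride-2 load and store streams:

#define SIZE 100000

/* Equivalent rewrite of loop1: each iteration reads from the stream
   a[2], a[4], ... and writes to the stream a[1], a[3], ..., i.e. two
   element streams of stride 2 with different start offsets. */
void loop1_explicit(float a[SIZE])
{
  const float cc = 23, dd = 34, ee = 233;
  float *ld = a + 2;                /* a[j+1] for j = 1, 3, 5, ... */
  float *st = a + 1;                /* a[j]   for j = 1, 3, 5, ... */
  long i, iters = (SIZE - 2) / 2;   /* 49999 iterations, same as loop1 */
  for (i = 0; i < iters; i++) {
    float x = ld[2 * i] + ee;
    x = x / dd;
    x = (cc + x) / dd;
    st[2 * i] = x + 2;
  }
}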

In your first case, I fail to see what the obvious vectorisation would
be - you are reading one and writing one of every other element -
that is not very easy to make into vectorized operations - have you
tried writing code for that, and confirmed that it's actually better
than just a straight loop?

have you tried writing code for that, and confirmed that it’s actually better than just a straight loop?
The same loop gets vectorized by icc and hence runs faster.

Also, as an experiment, I tried forcing isStridedPtr() to return true. The vectorizer then proceeds to vectorize the loop, but does a poor job: instead of using vector load/store instructions with packing/unpacking, it uses scalar load/store instructions and packs the loaded values for the arithmetic operations. Note that without the pragma, the cost model does advise against the vectorization; my reason for forcing vectorization was that it did well on icc.

On icc, I tried the above loop without any pragmas and saw that the loop was vectorized. Also using the pragma “novector” resulted in the loop not being vectorized. With vectorization, performance was better.

Thanks,

- Vaivaswatha

Hi,

Your first example is similar to the strided loops that Hao is working on vectorizing with his indexed load intrinsics. We have dedicated instructions for these in NEON, but they can be synthesized with a load+unzip / zip+store, which I assume is what icc is doing.

If I’m right, this is in progress.

Cheers,

James
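
For reference, here is a rough sketch of what the load+unzip / zip+store pattern could look like for loop1, hand-written with NEON intrinsics (an assumption about what icc emits, not a verified reproduction of its output; vdivq_f32 assumes AArch64, and the scalar remainder loop is omitted):

#include <arm_neon.h>

#define SIZE 100000

/* Hand-vectorized sketch of loop1 using de-interleaving loads and
   re-interleaving stores (vld2/vst2).  Four iterations of the original
   loop are handled per vector iteration. */
void loop1_neon(float a[SIZE])
{
  const float32x4_t cc = vdupq_n_f32(23.0f);
  const float32x4_t dd = vdupq_n_f32(34.0f);
  const float32x4_t ee = vdupq_n_f32(233.0f);
  const float32x4_t two = vdupq_n_f32(2.0f);
  long j;

  /* vld2 loads 8 consecutive floats starting at a[j] and splits them:
       v.val[0] = { a[j], a[j+2], a[j+4], a[j+6] }   -- the store targets
       v.val[1] = { a[j+1], a[j+3], a[j+5], a[j+7] } -- the loaded values */
  for (j = 1; j + 7 < SIZE - 1; j += 8) {
    float32x4x2_t v = vld2q_f32(&a[j]);
    float32x4_t x = vaddq_f32(v.val[1], ee);   /* x = a[j+1] + ee   */
    x = vdivq_f32(x, dd);                      /* x = x / dd        */
    x = vdivq_f32(vaddq_f32(cc, x), dd);       /* x = (cc + x) / dd */
    v.val[0] = vaddq_f32(x, two);              /* a[j] = x + 2      */
    /* vst2 zips the two lanes back together; the odd lane is written
       back with its original values. */
    vst2q_f32(&a[j], v);
  }
  /* ... remaining iterations handled by the original scalar loop. */
}

Note that the store rewrites the unmodified elements with their original values, which is fine for single-threaded code but would not be a legal transformation if another thread could be writing those elements concurrently.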

2. Of the two loops below, the loop "works" gets vectorized, while the loop
"fails" does not get vectorized.
#define SIZE 10000
int a[SIZE+4] = {};
void works(unsigned long m, unsigned long n)
{
  long j;
  // SCEV can compute the exit value for this loop
  for (j = m; j < n+1; j += 1) {
    a[j] = a[j+1] + 2;
  }
}
void fails(unsigned long m, unsigned long n)
{
  long j;
  // SCEV cannot compute the exit value for this loop
  for (j = m; j <= n; j += 1) {
    a[j] = a[j+1] + 2;
  }
}

The only difference between the two loops is the loop exit condition;
the two are semantically the same.

I expect SCEV treats them differently because of MAX_INT handling.
Look at the definedness of both if n == MAX_INT. The first has
undefined behavior, the second does not.
If you change the second into the first, you introduce undefined behavior.
(or maybe it's implementation defined, but whatever)

This is the

  if (!getUnsignedRange(RHS).getUnsignedMax().isMaxValue()) {

check in that simplify function (SimplifyICmpOperands).

But you should file a bug anyway.
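
For concreteness, the MAX_INT point can be checked directly with the operand types from the example (j is long, n is unsigned long); a minimal sketch:

#include <limits.h>
#include <stdio.h>

/* With n == ULONG_MAX the two exit conditions disagree: "j < n + 1"
   wraps to "j < 0", an unsigned comparison that is never true (works()
   runs zero iterations), while "j <= n" is always true, so fails()
   never exits through its test and j + 1 can eventually wrap. */
int main(void)
{
    unsigned long n = ULONG_MAX;
    long j = 5;
    printf("j <  n + 1 : %d\n", (unsigned long)j < n + 1);  /* prints 0 */
    printf("j <= n     : %d\n", (unsigned long)j <= n);     /* prints 1 */
    return 0;
}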

I expect SCEV treats them differently because of MAX_INT handling.
Look at the definedness of both if n == MAX_INT. The first has
undefined behavior, the second does not.
If you change the second into the first, you introduce undefined behavior.
(or maybe it's implementation defined, but whatever)

To elaborate a little further on this:

In the first loop, you can never enter the loop with "j == INT_SMAX"
since INT_SMAX will never be < anything. This means j + 1 cannot
overflow. In the second loop you /can/ enter the loop with "j ==
INT_SMAX" if "n == INT_SMAX" so j + 1 can potentially overflow.

Ideally SCEV should be able to infer the nsw'ness of the additions
from the nsw bits in the source IR; but that's more complex than it
sounds since SCEV does not have a notion of control flow within the
loop and it hashes SCEVs by the operands and not by the nsw/nuw bits.
Crude example:

define void @x(i32 %a, i32 %b, i1 %c) {
entry:
  %m = add i32 %a, %b
  br i1 %c, label %do, label %dont

do:
  %m1 = add nsw i32 %a, %b
  br label %dont

dont:
  ret void
}

both %m and %m1 get mapped to the *same* SCEV, and you cannot mark
that SCEV as nsw even though %m1 is nsw.

-- Sanjoy

Thank you Sanjoy for the explanation. Is it worth filing a bug over this at this point?

Hi James,

Your first example is similar to the strided loops that Hao is working on vectorizing with his indexed load intrinsics.
I’m curious. For the example I mentioned, the legality check fails because the corresponding SCEV doesn’t have nsw set and hence isStridedPtr() returns false. In reality the induction variable has a statically known bound and cannot overflow, so it is really legal to vectorize the loop. Did you face this problem (and solve it)?

Thanks everyone for your response and clarification.

- Vaivaswatha
