Vectorization of loops with conditional dereferencing

Nadav, Arnold, et al.,

I have a number of loops that I would like us to be able to autovectorize (common, for example, in n-body inter-particle force kernels), and the problem is that they look like this:

for (int i = 0; i < N; ++i) {
  if (r[i] > 0)
    v += m[i]*...;
}

where, as written, m[i] is not accessed unless the condition is true. The general problem (as the loop vectorizer notes) is that we don't know that dereferencing m[i] is safe for indices where the guarding condition is false, and so if-converting the block directly is not allowed.
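
To make that concrete, here is roughly the if-converted form we'd like the vectorizer to be able to emit, written as a scalar C sketch (the weight array w[] stands in for the elided "..." and is not part of the original loop):

#include <stddef.h>

/* Hypothetical if-converted form, in scalar C for clarity. The load of
   m[i] is now unconditional, which is only legal if every m[i] with
   0 <= i < N is dereferenceable, not just those where r[i] > 0. */
double sum_if_converted(const double *r, const double *m, const double *w,
                        size_t N) {
  double v = 0.0;
  for (size_t i = 0; i < N; ++i) {
    double add = m[i] * w[i];      /* speculative load of m[i] */
    v += (r[i] > 0) ? add : 0.0;   /* select instead of a branch */
  }
  return v;
}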

One possibility, if we disallow the possibility that m[i] has been munmapped or page-protected somewhere in the middle, is to first find the first and last indices for which the condition is true, and then run the vector loop only on that range. For the loop above, I mean something like:

FirstI = -1, LastI = -1, I = 0;
for (; I < N; ++I)
  if (r[I] > 0) {
    FirstI = I;
    break;
  }

for (; I < N; ++I) {
  if (r[I] > 0)
    LastI = I;
}

[In this case, since the guard dominates all side effects, we don't need to do anything outside that range; otherwise, we'd need to run the scalar loop up to FirstI, and then again from LastI until N.]

for (I = FirstI; I <= LastI; ...) {
  // Run the scalar/vector loop sequence as before.
}
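
Putting the pieces together under that assumption, a scalar C sketch of the whole scheme (again with w[] standing in for the elided "...") would be:

#include <stddef.h>

/* Sketch of the "no munmap/mprotect in the middle" scheme: find the
   first and last indices where the guard holds, then run the
   if-converted loop only on [FirstI, LastI]. */
double sum_guard_range(const double *r, const double *m, const double *w,
                       size_t N) {
  size_t FirstI = N, LastI = 0;
  for (size_t i = 0; i < N; ++i)        /* read-only pre-scan */
    if (r[i] > 0) {
      if (FirstI == N) FirstI = i;
      LastI = i;
    }

  double v = 0.0;
  if (FirstI == N)
    return v;                           /* guard never holds */

  /* Every m[i] with FirstI <= i <= LastI is assumed dereferenceable,
     so the speculative load + select form is legal on this range. */
  for (size_t i = FirstI; i <= LastI; ++i) {
    double add = m[i] * w[i];
    v += (r[i] > 0) ? add : 0.0;
  }
  return v;
}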

On the other hand, if we can't make this range assumption generally, then I'd assume that we could make it per-page (where a page is 4kB or similar). Then the scheme becomes more complicated, because we need to break the ranges on page boundaries (maybe something like this):

Done = false;
LastI = -1;
while (!Done) {
  FirstI = -1;
  for (I = LastI + 1; I < N; ++I)
    if (r[I] > 0) {
      FirstI = LastI = I;
      break;
    }

  for (; I < N && !page_bound(&m[I]) && ...; ++I) {
    if (r[I] > 0)
      LastI = I;
  }

  Done = I == N;

  if (FirstI != -1)
    for (I = FirstI; I <= LastI; ...) {
      // Run the scalar/vector loop sequence as before.
    }
}
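
(For reference, the page_bound() helper above could be as simple as the following, assuming a fixed 4 KiB page size; a real implementation would query the actual page size, e.g. via sysconf(_SC_PAGESIZE).)

#include <stdint.h>

/* One possible definition of the page_bound() test used above;
   the 4 KiB page size is an assumption, not a guarantee. */
#define ASSUMED_PAGE_SIZE 4096u

static inline int page_bound(const void *p) {
  return ((uintptr_t)p & (ASSUMED_PAGE_SIZE - 1)) == 0;
}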

I hope that we don't need to do something like this, but I've been disappointed by C/POSIX semantics before. On the other hand, we might want to break these into groups anyway, and allocate a local stack buffer to hold the mask computed from the condition, so that we don't have to reevaluate the condition in the loop.
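
A rough scalar sketch of that blocked, mask-buffer variant (BLOCK and w[] are illustrative choices, not anything required by the scheme):

#include <stddef.h>

enum { BLOCK = 256 };

double sum_blocked(const double *r, const double *m, const double *w,
                   size_t N) {
  double v = 0.0;
  for (size_t base = 0; base < N; base += BLOCK) {
    size_t len = (N - base < BLOCK) ? N - base : BLOCK;

    unsigned char mask[BLOCK];          /* stack buffer holding the guard */
    for (size_t j = 0; j < len; ++j)
      mask[j] = r[base + j] > 0;

    for (size_t j = 0; j < len; ++j)    /* guard not reevaluated here */
      if (mask[j])
        v += m[base + j] * w[base + j];
  }
  return v;
}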

Also, I'm certainly open to other ways to do this. Are there any other ways? Thoughts?

Thanks again,
Hal

Hi Hal,

Yes, I agree that this is a problem that prevents vectorization in many loops. Another problem that we have is that sunk loads don’t preserve their control dependence properties. For example in the code below, if we sink the load into the branch then we can't vectorize the loop.

x = A[i];
if (cond) {
  sum += x;
}

I agree with you that checking the first and last element for each 4k page has many disadvantages and I would like to explore other options. Yesterday Pekka proposed marking some operations as 'no-trap'. Maybe we should consider marking some memory regions as 'mapped'. Every time we malloc memory or pass a pointer to statically allocated memory, we could propagate the information about the availability and size of the memory. What do you think?
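
Purely as an illustration of the bookkeeping (the names below are made up, and in a real compiler this information would live in the IR as attributes or metadata rather than in a runtime table):

#include <stdint.h>
#include <stdlib.h>

/* Record the base and size of every allocation so a later query can
   tell whether an entire range [p, p+n) is known to be mapped. */
typedef struct { uintptr_t base; size_t size; } MappedRegion;

static MappedRegion g_regions[1024];
static size_t g_nregions;

void *tracked_malloc(size_t size) {
  void *p = malloc(size);
  if (p && g_nregions < 1024)
    g_regions[g_nregions++] = (MappedRegion){ (uintptr_t)p, size };
  return p;
}

/* Returns 1 if [p, p+n) lies entirely inside a known allocation. */
int known_mapped(const void *p, size_t n) {
  uintptr_t a = (uintptr_t)p;
  for (size_t i = 0; i < g_nregions; ++i) {
    MappedRegion reg = g_regions[i];
    if (a >= reg.base && a - reg.base <= reg.size &&
        n <= reg.size - (a - reg.base))
      return 1;
  }
  return 0;
}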

Thanks,
Nadav

I forgot to mention that it would be interesting to disable this safety check in the vectorizer and run the LLVM test suite. I wonder how many tests will run faster and how many will crash.

Hi Hal,

Yes, I agree that this is a problem that prevents vectorization in
many loops. Another problem that we have is that sunk loads don’t
preserve their control dependence properties. For example in the
code below, if we sink the load into the branch then we can't
vectorize the loop.

x = A[i];
if (cond) {
  sum += x;
}

I agree with you that checking the first and last element for each 4k
page has many disadvantages and I would like to explore other
options.

Yes -- so far, I've failed to think of a better way that will work automatically (meaning without adding pragmas, etc.) :(

Yesterday Pekka proposed marking some operations as
'no-trap'. Maybe we should consider marking some memory regions as
'mapped'. Every time we malloc memory or pass a pointer to
statically allocated memory, we could propagate the information about
the availability and size of the memory. What do you think?

I think that this is a good idea (and I suspect that code already exists to do some of this; poolalloc perhaps?), although I fear that it generally depends on LTO to work for general codes.

In the short term, I think a pragma, and a compiler option that changes the default, may be the best solution.
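
For concreteness, such a pragma might be spelled something like this at the source level (the syntax is purely illustrative, not an existing extension; w[i] again stands in for the elided "..."):

#pragma vectorize(assume_safety)   /* hypothetical spelling */
for (int i = 0; i < N; ++i) {
  if (r[i] > 0)
    v += m[i] * w[i];
}

The point is simply that the programmer asserts that m[i] may be touched for every i the loop visits, so the guard can be if-converted.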

Realistically, I know of 0 real applications that use loop-dependent conditions to prevent dereferencing invalid pointers. As you say, it will be interesting to turn off the safety check, and see what in the test suite is affected.

Thanks again,
Hal

Sounds good. We’d also like to add other vectorization pragmas, such as #unroll, #vectorize, etc.

Hi Hal,

I may have missed the point here, but what about the "naive" solution of using a cascade of guarded, scalar loads in the vectorized code?

for (int i = 0; i < N; i += 4) {
   v0 = ...

   mask cond = r[i] > 0;
   vector l;
   if (cond[0])
     l[0] = m[i];
   if (cond[1])
     l[1] = m[i+1];
   ...

   v1 = v0 + l*...;
   blend(cond, v0, v1);
}

(I used array notation for accessing vector elements)
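
Spelled out in plain scalar C, the same idea looks roughly like this (just a sketch: w[] stands in for the elided multiplications, and N is assumed to be a multiple of 4 so there is no epilogue):

#include <stddef.h>

double sum_guarded_cascade(const double *r, const double *m, const double *w,
                           size_t N) {
  double v[4] = { 0.0, 0.0, 0.0, 0.0 };
  for (size_t i = 0; i < N; i += 4) {
    int    cond[4];
    double l[4] = { 0.0, 0.0, 0.0, 0.0 };

    for (int k = 0; k < 4; ++k)        /* vector compare */
      cond[k] = r[i + k] > 0;

    /* Guarded scalar loads: m[i+k] is read only when its lane is live. */
    for (int k = 0; k < 4; ++k)
      if (cond[k])
        l[k] = m[i + k];

    /* The arithmetic and the blend run on all four lanes. */
    for (int k = 0; k < 4; ++k)
      v[k] += cond[k] ? l[k] * w[i + k] : 0.0;
  }
  return v[0] + v[1] + v[2] + v[3];
}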

The rest of the code can remain vectorized (e.g. the += and *).
Or is it the case that the load is exactly the instruction you want vectorized because it dominates the runtime of the loop?

Oh, and sorry for the late reply, I don't have the time to check LLVMdev regularly atm :).

Cheers,
Ralf

Hi Hal,

Even if you could do something like this, there is no guarantee that (Last - First) will be a multiple of VF, or even larger than VF, and you'll have to cope with boundary conditions on every outer iteration, which, depending on the distribution of r[], could be as many as half the size of r[] itself.

I can't see an algorithmic (compile-time or run-time) way to guarantee that the number of clusters will be small without scanning the whole array. So, even if you add such a check at run time and only vectorize when the number of clusters is small, the worst-case scenario will run over the data twice (the initial check can be vectorized, as it's a read-only loop, but the latter pass cannot). In the best-case scenario, you'll have only a handful of loops, which will run in parallel.

Worst case: n/VF + n
Best case: n/VF + amortized n/VF

For VF == 2,
* best case is as fast as scalar code, considering overheads.
* worst case is 50% slower

For VF == 4,
* best case is 50% faster than scalar code
* worst case is 25% slower
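
To put some illustrative numbers on that (ignoring all overheads and assuming the pre-scan vectorizes perfectly): for n = 1024 and VF = 4, the best case is 1024/4 + 1024/4 = 512 element-steps against 1024 for the scalar loop, while the worst case is 1024/4 + 1024 = 1280, i.e. 25% slower.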

And all of that depends on the workload, so it'll change for every different set of arguments, which, in an n-body simulation, changes dynamically.

cheers,
--renato

I think that the best way to move forward with the vectorization of this loop is to make progress on the vectorization pragmas. The LoopVectorizer is already prepared for handling pragmas, and we just need to add the clang-side support. Is anyone planning to work on this?

I'm not. :(

What kind of pragmas would work for this loop? Something stating that it's
safe to speculatively read from m[] at any position? In this reduction case
that might be enough. But if this were an induction store like:

for () {
  if (a[i] > 0)
    x[i] = ... + m[i];
}

then the store would be a more complicated way to write to memory, and
you'd need the read pragma not to affect such cases.

cheers,
--renato

Hi all,

I'm not. :(

I think that this is probably the most important feature for the vectorizer right now. Other features require adding complexity to the vectorizer while this feature is relatively simple.

What kind of pragmas would work for this loop? Something stating that it's safe to speculatively read from m[] at any position? In this reduction case that might be enough. But if this were an induction store like:

for () {
  if (a[i] > 0)
    x[i] = ... + m[i];
}

Sure. Vectorization of stores is done by loading the current value from memory, blending the new value and saving it back to memory.
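
In scalar terms, and just as a sketch (with w[i] standing in for the elided "..."), the predicated store above becomes:

#include <stddef.h>

/* Load/blend/store scheme for the guarded store x[i] = ... + m[i].
   Both x[i] and m[i] are now accessed unconditionally, which is exactly
   what the pragma asserts to be safe. */
void predicated_store(const double *a, const double *m, const double *w,
                      double *x, size_t N) {
  for (size_t i = 0; i < N; ++i) {
    double oldv = x[i];               /* unconditional load of x[i]      */
    double newv = w[i] + m[i];        /* computed for every element      */
    x[i] = (a[i] > 0) ? newv : oldv;  /* blend, then unconditional store */
  }
}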

then the store would be a more complicated way to write to memory, and you'd need the read pragma not to affect such cases.

There is no need for a read pragma or even a special attribute. The 'vectorize' pragma tells the vectorizer that it is safe to access the predicated memory (read or write).

I guess the cost model could let you know if it's profitable or not with
the extra load+mask. But you need to teach it first.

I was worried that stating that m[] can be speculatively read/written does
not automatically propagate to x[], so maybe you'd have to add the
same pragma on both to make the induction case work, but only on one for
the reduction case.

cheers,
--renato

Hi Nadav,

Sure. Vectorization of stores is done by loading the current value from memory, blending the new value and saving it back to memory.

Just a side note: you may run into trouble with this approach when people start using the loop vectorizer in combination with multithreading, because the load/blend/store scheme introduces a race condition in that scenario.
I don't know whether this is a valid use case for you, but it may be worth keeping in mind.
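
To make the hazard concrete with a hypothetical interleaving: suppose this thread's guard is false for some x[i], so the original scalar loop would never touch it, but the vectorized loop still does load x[i]; blend; store x[i]. If another thread updates that same x[i] between the load and the store, its write is silently overwritten, a lost update that the scalar code could not produce.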

Cheers,
Ralf

It's a good point. We will need to document the semantics of the vectorization pragma well.

Nadav and Arnold,

What is the current status of vectorization pragmas? Do you think that’s something I might be able to take on if you two are busy with other things?

Stephen

Hi Stephen,

I was planning on having a look at that this month, but if you have the
bandwidth for it, I won't impose. Unfortunately, my feature-implementation
time is limited with so much admin going on, so you're likely to finish
before I even start. Feel free to keep me in the loop, though.

cheers,
--renato

If you were already planning on looking at it, then I think you should go
ahead with it, since you probably have a better understanding of the use
cases than I have. I was just looking for places where I could contribute
at the moment, but I have found others.

Thanks,
Stephen