[Polly] Parallelizing outer loop containing inner reduction loop

Oops, sorry for the message title, making it more descriptive now…

2013/2/3 Dmitry Mikushin <dmitry@kernelgen.org>

Hi Dmitry,

    [FORTRAN code]

    The only difference between the two versions is how the inner
    reduction loop is implemented. In the first case it is explicit; in
    the second case it uses the sum intrinsic, which is lowered to
    plain code by DragonEgg. In the first case Polly successfully
    detects the parallel outer loop; in the second case it cannot.

Thanks for the test case.

    After some attempts to modify Polly's behavior, I concluded that it
    cannot cope with the presence of a non-IV PHI node. Polly attempts
    to produce independent basic blocks, trying to gather dependent
    instructions into a single block and allocating scratch variables.
    Allocas make the situation even worse, because they are always
    placed into the first block of the function, completely killing
    potential parallelism.
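
In C terms, the effect is roughly the following (N, M and B are hypothetical stand-ins for the actual Fortran kernel):

double scratch;                  // one slot for the whole function,
                                 // like an alloca in the entry block
for (int i = 0; i < N; i++) {
    scratch = 0.0;               // every i-iteration reuses the slot
    for (int j = 0; j < M; j++)
        scratch += B[i * M + j]; // inner reduction
    A[i] = scratch;              // the shared slot creates loop-carried
}                                // dependences on the i-loop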

As a first step, I tried to reproduce your observations. When optimizing the kernels with Polly, possible aliasing prevents Polly from detecting the SCoP.

$ polly-opt -O3 -polly 1.ll -debug-only=polly-detect

Checking region: Loop Function Root => <Function Return>
  Top level region is invalid
Checking region: 3.cloned.header => return.loopexit.exitStub
  Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"
Checking region: 5.cloned => 6.cloned
  Possible aliasing: "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246749696 to [0 x double]*)"

$ polly-opt -O3 -polly 2.ll -debug-only=polly-detect
Checking region: Loop Function Root => <Function Return>
  Top level region is invalid
Checking region: 3.cloned.header => return.loopexit.exitStub
  Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"
Checking region: 4.cloned => 5.cloned
  Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"

The problematic aliasing is due to these inttoptr expressions with fixed base addresses. Basic alias analysis cannot analyze them at all. SCEV alias analysis understands that the base addresses cannot alias; however, it is impossible to prove complete independence of any derived address. Hence, to reason about the addresses, we would need to teach Polly to understand that there is no aliasing for the subset of the memory space that is accessed within the SCoP.
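
For illustration, with hypothetical fixed addresses in the spirit of the inttoptr casts above:

double *X = (double *)0xB00000000ULL; // fixed base address
double *Y = (double *)0xB00001000ULL; // another base, 4096 bytes away
// SCEV-AA proves the two bases differ. Yet the derived address &X[i]
// equals &Y[j] whenever i == j + 512, so without bounds on i and j
// an overlap cannot be ruled out; one would have to prove that the
// ranges actually accessed inside the SCoP are disjoint.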

In any case, you seem to have somehow convinced Polly to accept this code. Would you mind sharing what you did?

Regarding your problem: as far as I understand it, the problem is that the following code:

for (i = 0; i < N; i++) {
   A[i] = 0;
   for (j = 0; j < M; j++)
       A[i] += ...;
   ... = A[i];
}

is changed by gcc (and other optimizations) to:

for (i = 0; i < N; i++) {
   A[i] = 0;
   tmp = A[i];

   for (j = 0; j < M; j++)
       tmp += ...;

   A[i] = tmp;
   ... = A[i];
}

This is a common optimization that unfortunately introduces a lot of dependences on tmp that block any direct parallelization. To parallelize the loop anyway, we would need to expand the memory of tmp such that each parallel thread has a private copy of 'tmp'. Deciding where and how to expand memory to enable further transformations is in general difficult, so I would normally run Polly before such optimizations are performed. Though, in your case it may still be possible to parallelize the loop. To do this, you would need to ignore all dependences that can be removed by creating thread-private copies of 'tmp'. If you are interested in having this implemented, either open a bug report or give it a try yourself. I am happy to assist.

Cheers
Tobi

Hi Tobi,

Thanks for looking into this!

2013/2/4 Tobias Grosser <tobias@grosser.es>

    In any case, you seem to have somehow convinced Polly to accept
    this code. Would you mind sharing what you did?

Sure. Aliasing is simply ignored. Instead, we substitute pointers and sizes for arrays and use a special pass that converts memory accesses from every SCoP statement into ISL general form. Sorry, we are quite far from the standard Polly invocation process; maybe I should prepare a simplified plugin for testing purposes…

    Regarding your problem: as far as I understand it, the problem is
    that the following code:

    for (i = 0; i < N; i++) {
       A[i] = 0;
       for (j = 0; j < M; j++)
           A[i] += ...;
       ... = A[i];
    }

    is changed by gcc (and other optimizations) to:

    for (i = 0; i < N; i++) {
       A[i] = 0;
       tmp = A[i];

       for (j = 0; j < M; j++)
           tmp += ...;

       A[i] = tmp;
       ... = A[i];
    }

Yes, exactly!

    This is a common optimization that unfortunately introduces a lot
    of dependences on tmp that block any direct parallelization. To
    parallelize the loop anyway, we would need to expand the memory of
    tmp such that each parallel thread has a private copy of 'tmp'.
    Deciding where and how to expand memory to enable further
    transformations is in general difficult, so I would normally run
    Polly before such optimizations are performed. Though, in your case
    it may still be possible to parallelize the loop. To do this, you
    would need to ignore all dependences that can be removed by
    creating thread-private copies of 'tmp'. If you are interested in
    having this implemented, either open a bug report or give it a try
    yourself. I am happy to assist.

Hm, how would one create thread-private copies of tmp at that point, and how useful would that be? The problem is that the platform-dependent view of threads only enters the process once the loop is proved to be parallel. Before that, as far as I know, Polly/CLooG/ISL can't be aware of such things. I thought more about pushing AllocaInst-s down closer to the nested loop header - would that work?

Thanks,

- D.

    Hi Tobi,

    Thanks for looking into this!

    2013/2/4 Tobias Grosser <tobias@grosser.es>

        In any case, you seem to have somehow convinced Polly to
        accept this code. Would you mind sharing what you did?

    Sure. Aliasing is simply ignored. Instead, we substitute pointers
    and sizes for arrays and use a special pass that converts memory
    accesses from every SCoP statement into ISL general form.

Interesting, why did you do this?

    Sorry, we are quite far from the standard Polly invocation
    process; maybe I should prepare a simplified plugin for testing
    purposes...

Or at least state that this is not standard Polly. ;-) Then I would not keep trying to reproduce the problem.

        Regarding your problem: as far as I understand it, the problem
        is that the following code:

        for (i = 0; i < N; i++) {
           A[i] = 0;
           for (j = 0; j < M; j++)
               A[i] += ...;
           ... = A[i];
        }

        is changed by gcc (and other optimizations) to:

        for (i = 0; i < N; i++) {
           A[i] = 0;
           tmp = A[i];

           for (j = 0; j < M; j++)
               tmp += ...;

           A[i] = tmp;
           ... = A[i];
        }

    Yes, exactly!

        This is a common optimization that unfortunately introduces a
        lot of dependences on tmp that block any direct
        parallelization. To parallelize the loop anyway, we would need
        to expand the memory of tmp such that each parallel thread has
        a private copy of 'tmp'. Deciding where and how to expand
        memory to enable further transformations is in general
        difficult, so I would normally run Polly before such
        optimizations are performed. Though, in your case it may still
        be possible to parallelize the loop. To do this, you would
        need to ignore all dependences that can be removed by creating
        thread-private copies of 'tmp'. If you are interested in
        having this implemented, either open a bug report or give it a
        try yourself. I am happy to assist.

    Hm, how would one create thread-private copies of tmp at that
    point, and how useful would that be?

There are different approaches to creating thread-private copies. You could aim to get back to the code in '1.ll' by creating an array with as many elements as there are iterations in the loop. That would be very general and independent of the platform-specific parallel implementation. However, such a general implementation may introduce a lot of additional memory, so we need to be careful with such transformations.
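
For instance, a minimal C sketch of that expansion (N, M and B are hypothetical stand-ins for the actual kernel):

double *tmp = malloc(N * sizeof(double)); // needs <stdlib.h>
for (int i = 0; i < N; i++) {
    tmp[i] = 0.0;                // each iteration owns tmp[i], so the
    for (int j = 0; j < M; j++)  // i-loop no longer carries a
        tmp[i] += B[i * M + j];  // dependence through the scalar
    A[i] = tmp[i];
}
free(tmp);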

    The problem is that the platform-dependent view of threads only
    enters the process once the loop is proved to be parallel. Before
    that, as far as I know, Polly/CLooG/ISL can't be aware of such
    things.

True. In general, Polly does not know exactly how parallelism is implemented. Though we can perform some preparatory transformations.

    I thought more about pushing AllocaInst-s down closer to the
    nested loop header - would that work?

I don't think pushing AllocaInst-s into the loop is a better solution. In terms of memory behavior,

int A[N]; // stack allocated, one slot per loop iteration
for (i = 0; i < N; i++)
   A[i] = ...;

and

for (i = 0; i < N; i++) {
     int *A = alloca(sizeof(int)); // allocas in a loop are not freed
     *A = ...;                     // per iteration; the stack still
}                                  // grows by N slots

are the same. We really do not want to allocate a full temporary array on the stack.

I see three options:

  0) We avoid optimizations that yield '2.ll'

  If we can (in some way) retain the version '1.ll', this is the
  best case and we should try hard to stay here.

  1) We malloc a temporary array

  In some cases it may be OK to do full memory expansion, e.g.
  by allocating the temporary array on the heap.

  2) Mark loops conditionally parallel

  We could mark loops parallel under the condition that certain
  memory references are privatized. For your use case, you could
  use the new loop.parallel metadata and mark loops partially
  parallel, meaning you only mark the accesses parallel that
  are not involved in loop-carried dependences. The remaining
  accesses then either need to be privatized or they need to
  be removed by mem2reg. For your example, mem2reg should remove
  the scratch allocas, and the loop is then detected as parallel
  (a rough sketch follows below).
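
A rough C/OpenMP analogue of option 2 (illustrative only; the actual mechanism would be the loop metadata mentioned above, and N, M, B are again hypothetical):

double tmp;                           // the scalar introduced by gcc
#pragma omp parallel for private(tmp) // each thread gets its own copy
for (int i = 0; i < N; i++) {
    tmp = 0.0;
    for (int j = 0; j < M; j++)
        tmp += B[i * M + j];
    A[i] = tmp;                       // with tmp privatized, no
}                                     // loop-carried dependence remains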

Cheers
Tobi

What do we want to