[Polly] Parallelizing outer loop containing inner reduction loop

From: Tobias Grosser <tobias@grosser.es>
To: Dmitry Mikushin <dmitry@kernelgen.org>
Cc: polly-dev <polly-dev@googlegroups.com>, LLVM Developers Mailing List <llvmdev@cs.uiuc.edu>
Date: Mon, 04 Feb 2013 17:53:11 +0100
Subject: Re: [LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop

Hi Tobi,

Thanks for looking into this!

2013/2/4 Tobias Grosser <tobias@grosser.es>

    In any case, you seem to have somehow convinced Polly to accept
    this code. Would you mind sharing what you did?

Sure. Aliasing is simply ignored. Instead, we substitute pointers and
sizes for the arrays, and a special pass converts the memory accesses
of every SCoP statement into ISL's general form.

Interesting, why did you do this?

Sorry, we are quite far from the standard Polly invocation process; maybe
I should prepare a simplified plugin for testing purposes...

Or at least state that this is not standard Polly. ;-) Then I would not
keep trying to reproduce the problem.

    Regarding your problem. As far as I understand, the problem is that
    the following code:

    for (i = 0; i < N; i++) {
       A[i] = 0;
       for (j = 0; j < M; j++)
          A[i] += ...;
       ... = A[i];
    }

    is changed by gcc (and other optimizations) to:

    for (i = 0; i < N; i++) {
       A[i] = 0;
       tmp = A[i];

       for (j = 0; j < M; j++)
          tmp += ...;

       A[i] = tmp;
       ... = A[i];
    }

Yes, exactly!

    This is a common optimization that unfortunately introduces a lot of
    dependences on tmp that block any direct parallelization. To
    parallelize the loop anyway, we would need to expand the memory of
    tmp such that each parallel thread has a private copy of 'tmp'.
    Deciding where and how to expand memory to enable further
    transformations is in general difficult, so I would normally
    run Polly before such optimizations are performed. Though, in your
    case it may still be possible to parallelize the loop. To do this,
    you would need to ignore all dependences that can be removed by
    creating thread-private copies of 'tmp'. If you are interested in
    having this implemented, either open a bug report or give it a try
    yourself. I am happy to assist.

Hm, how would one create thread-private copies of tmp at that point, and
how useful would that be?

Many auto-vectorizing compilers do such a thing (sometimes called "scalar
expansion") as a way to attack loops that are otherwise parallel.

There are different approaches to creating thread-private copies. You
could aim to get back to the code in '1.ll' by creating an array with as
many elements as there are iterations in the loop. That should be very
general and independent of the platform-specific parallel implementation.
However, such a general implementation may introduce a lot of additional
memory, which requires us to be careful with such transformations.
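
To make that concrete, here is a rough C sketch of this full expansion
(the bounds N and M and the helper f are again placeholders): tmp gets one
slot per outer iteration, which removes the loop-carried scalar dependence
at the cost of O(N) extra memory.

    /* Sketch: full scalar expansion, one tmp slot per outer iteration. */
    extern float f(int i, int j);   /* placeholder for the reduced expression */

    void compute_expanded(int N, int M, float *A)
    {
        float tmp[N];                    /* C99 VLA: one slot per iteration */
        for (int i = 0; i < N; i++) {    /* i-iterations are now independent */
            tmp[i] = 0;
            for (int j = 0; j < M; j++)
                tmp[i] += f(i, j);
            A[i] = tmp[i];
        }
    }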

A better solution is to create an array with as many elements as there are
threads, or, in the case of vectorization, as many elements as the length
of the vector. Much less expensive!
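
A rough sketch of the per-thread variant, assuming an OpenMP runtime
(omp_get_max_threads and omp_get_thread_num are standard OpenMP calls;
the helper f is again a placeholder):

    /* Sketch: per-thread scalar expansion, one tmp slot per thread. */
    #include <omp.h>
    extern float f(int i, int j);   /* placeholder for the reduced expression */

    void compute_parallel(int N, int M, float *A)
    {
        float tmp[omp_get_max_threads()];   /* one slot per thread */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            int t = omp_get_thread_num();
            tmp[t] = 0;                     /* thread-private accumulator */
            for (int j = 0; j < M; j++)
                tmp[t] += f(i, j);
            A[i] = tmp[t];
        }
    }

(In a real implementation the slots would be padded to a cache line to
avoid false sharing; with OpenMP a plain private(tmp) clause achieves the
same effect.)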

Very true. If possible we should go this way.

Tobi