Vectorization: Next Steps

Hal wrote:

  1. Loop vectorization - It would be nice to have, in addition to
    basic-block vectorization, a more-traditional loop vectorization pass. I
    think that we’ll need a better loop analysis pass in order for this to
    happen. Some of this was started in LoopDependenceAnalysis, but that
    pass is not yet finished. We’ll need something like this to recognize
    affine memory references, etc.

I’ve recently started working on a dependence analyzer for LLVM. We’re more interested in parallelization than vectorization,
so are building a dependence graph for a complete function. Of course, such a thing is useful for vectorization and all sorts of other dependence-based loop transforms.

I’m looking at the problem in two parts:

  1. a dependence test that, given two memory references, decides if a dependence exists between them, and
  2. the dependence-graph builder that looks over the complete function and finds pairs of memory references to pass to the dependence test, using the results to build a dependence graph, with edges labeled as to the kind of dependence and direction vectors.

Currently I’m focused on (2), the simpler task I think, as a way to learn my way around the infrastructure. The idea, roughly, is to find all the alias sets associated with each load and store and compute what Wolfe calls factored redef-use chains for each alias set (similar to SSA, but supporting computation of all 4 kinds of dependences). By exploring these chains, we can find the right pairs to test, accounting for control flow and the confusion caused by procedure calls, etc.

I’m not yet sure how I’ll proceed for (1). It seems natural and wise to take advantage of the scalar evolutions, but they’re new to me and I’ll need to study them more. I’d like to provide an interface that’s independent of (2), so that it can be used independently, e.g., by a software pipeliner. I’d also like to provide a caching interface, so we can avoid re-testing pairs of references with identical subscript forms. It also seems desirable to start with relatively simple tests, and expand the coverage over time. We’ll see how it goes.
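
A hypothetical sketch of that two-part split (none of these names exist in LLVM; they only illustrate the intended separation, so that, e.g., a software pipeliner could reuse part (1) on its own):

    // Part (1): the pairwise dependence test, behind an interface that a
    // caching wrapper or a different client could also implement.
    #include <memory>
    #include <string>

    namespace llvm { class Function; class Instruction; }

    enum class DepKind { Flow, Anti, Output, Input };

    struct DepResult {
      bool MayDepend;          // conservative answer; false means proven independent
      DepKind Kind;
      std::string Directions;  // one of '<', '=', '>', '*' per common loop level
    };

    class DependenceTest {
    public:
      virtual ~DependenceTest() = default;
      virtual DepResult query(const llvm::Instruction *Src,
                              const llvm::Instruction *Dst) = 0;
    };

    // Part (2): the whole-function graph builder, which owns a pluggable test.
    class DependenceGraphBuilder {
    public:
      explicit DependenceGraphBuilder(std::unique_ptr<DependenceTest> T)
          : Test(std::move(T)) {}
      // Would walk the function, pair up loads/stores/calls via alias sets,
      // and record edges labeled with dependence kind and direction vector.
      void build(llvm::Function &F);
    private:
      std::unique_ptr<DependenceTest> Test;
    };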

Preston

Hal wrote:
> 3. Loop vectorization - It would be nice to have, in addition to
> basic-block vectorization, a more-traditional loop vectorization pass. I
> think that we'll need a better loop analysis pass in order for this to
> happen. Some of this was started in LoopDependenceAnalysis, but that
> pass is not yet finished. We'll need something like this to recognize
> affine memory references, etc.

I've recently started working on a dependence analyzer for LLVM.

Great!

  We're more interested in parallelization than vectorization,

I am also interested in parallelization, but that is another story :wink:

so are building a dependence graph for a complete function. Of
course, such a thing is useful for vectorization and all sorts of
other dependence-based loop transforms.

I'm looking at the problem in two parts:
     1. a dependence test that, given two memory references, decides
        if a dependence exists between them, and
     2. the dependence-graph builder that looks over the complete
        function and finds pairs of memory references to pass to the
        dependence test, using the results to build a dependence
        graph, with edges labeled as to the kind of dependence and
        direction vectors.
Currently I'm focused on (2), the simpler task I think, as a way to
learn my way around the infrastructure. The idea, roughly, is to find all
the alias sets associated with each load and store and compute what
Wolfe calls factored redef-use chains for each alias set (similar to
SSA, but supporting computation of all 4 kinds of dependences). By
exploring these chains, we can find the right pairs to test,
accounting for control flow and the confusion caused by procedure
calls, etc.

This sounds very useful.

I'm not yet sure how I'll proceed for (1). It seems natural and
wise to take advantage of the scalar evolutions, but they're new to me
and I'll need to study them more. I'd like to provide an interface
that's independent of (2), so that it can be used independently, e.g.,
by a software pipeliner.

This makes sense.

I'd also like to provide a caching interface, so we can avoid
re-testing pairs of references with identical subscript forms. It
also seems desirable to start with relatively simple tests, and expand
the coverage over time. We'll see how it goes.

For both (1) and (2), if you'd like to sketch out the interface you have
in mind, I'd be happy to provide some feedback.

On a practical note, I think that it is important that the mechanisms
developed for this support both direct multidimensional indexing
(a[i][j][k]) and also manually-expanded indexing (a[n*(j+n*i) + k]).
Recognizing the latter may require some specific idiom-recognition code
(and scalar evolution analysis should be very useful for this).
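
To make the two forms concrete, here is a small illustration (a hypothetical helper, not LLVM code): both statements address the same element of an n x n x n row-major array, so a dependence analyzer ideally treats them the same way.

    // Direct vs. manually-expanded indexing of the same element.
    void store_both_ways(double *a, unsigned n,
                         unsigned i, unsigned j, unsigned k) {
      a[(i * n + j) * n + k] = 1.0;  // what a[i][j][k] lowers to for a[n][n][n]
      a[n * (j + n * i) + k] = 2.0;  // the hand-linearized a[n*(j+n*i) + k]
    }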

-Hal

so are building a dependence graph for a complete function. Of
course, such a thing is useful for vectorization and all sorts of
other dependence-based loop transforms.

I'm looking at the problem in two parts:
1. a dependence test that, given two memory references, decides
if a dependence exists between them, and

Would you consider using Polly http://polly.grosser.es to avoid
writing this code? If you do not want to use polly, you could use ISL
http://freecode.com/projects/isl to set up the dependence problem and
use ISL's ILP to solve it.

These would be my preferred choices to implement the dependence test:
on the GCC side I have implemented a stand-alone dependence tester
that does not use ILP (i.e., it sets up the dependence test as a Diophantine
equation and then solves it), and I can say that it was quite painful
to write it and to clean the code of bugs.

  2. the dependence-graph builder that looks over the complete
     function and finds pairs of memory references to pass to the
     dependence test, using the results to build a dependence
     graph, with edges labeled as to the kind of dependence and
     direction vectors.

Currently I'm focused on (2), the simpler task I think, as a way to
learn my way around the infrastructure. The idea, roughly, is to find all
the alias sets associated with each load and store and compute what
Wolfe calls factored redef-use chains for each alias set (similar to
SSA, but supporting computation of all 4 kinds of dependences). By
exploring these chains, we can find the right pairs to test,
accounting for control flow and the confusion caused by procedure
calls, etc.

This sounds very useful.

I'm not yet sure how I'll proceed for (1). It seems natural and
wise to take advantage of the scalar evolutions, but they're new to me
and I'll need to study them more. I'd like to provide an interface
that's independent of (2), so that it can be used independently, e.g.,
by a software pipeliner.

This makes sense.

I'd also like to provide a caching interface, so we can avoid
re-testing pairs of references with identical subscript forms. It
also seems desirable to start with relatively simple tests, and expand
the coverage over time. We'll see how it goes.

For both (1) and (2), if you'd like to sketch out the interface you have
in mind, I'd be happy to provide some feedback.

On a practical note, I think that it is important that the mechanisms
developed for this support both direct multidimensional indexing
(a[i][j][k]) and also manually-expanded indexing (a[n*(j+n*i) + k]).

It would be good to have multidimensional arrays in LLVM: note that
"recognizing" the multidimensional access from a linearized access is
not trivial. This was the original reason why I have started working
on the gimple representation in GCC: the RTL has all the memory
accesses linearized as in gep http://llvm.org/docs/GetElementPtr.html
and so you have lost the type info about the array shape for memory
accesses.

Recognizing the latter may require some specific idiom-recognition code
(and scalar evolution analysis should be very useful for this).

Unfortunately the SCEV does not help much in recognizing a multi-dim
access.

Sebastian

[many things, but I’m only going to focus on one of them]
Would you consider using Polly http://polly.grosser.es to avoid
writing this code?

My impression is that Polly (and polyhedral analysis generally) doesn’t do what I want. But I’m happy to talk about it 'cause I might be wrong.

If I look at all the ways I know how to sort in parallel, they all look roughly like this simple counting sort:

for (unsigned i = 0; i < buckets; i++)
    count[i] = 0;

for (unsigned i = 0; i < n; i++)
    count[src[i]]++;

start[0] = 0;
for (unsigned i = 1; i < buckets; i++)
    start[i] = start[i - 1] + count[i - 1];

#pragma assert parallel
for (unsigned i = 0; i < n; i++) {
    unsigned loc = int_fetch_add(start + src[i], 1);
    dst[loc] = src[i];
}

The 1st loop is trivially parallel. I think Polly would recognize this and do good things.

The 2nd loop has a race condition that can be handled by using an atomic increment provided by the architecture, if the compiler knows about such things. I don’t think Polly would help here.
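
A minimal sketch of what that could look like (hand-written, not compiler output; __atomic_fetch_add is the GCC/Clang builtin, standing in for whatever intrinsic the compiler would emit):

    #pragma omp parallel for
    for (unsigned i = 0; i < n; i++)
      __atomic_fetch_add(&count[src[i]], 1u, __ATOMIC_RELAXED);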

The 3rd loop is a linear recurrence that can be rewritten and solved in parallel, if the compiler knows about such things. I don’t think Polly would help here.
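
For concreteness, one hand-parallelized form of that recurrence (an exclusive prefix sum of count[]) might look like the sketch below: each thread scans its own block, then blocks are offset by the totals of the blocks before them. OpenMP is used here only as a convenient threading mechanism, and the 256-thread cap is an assumption of the sketch; it is not what any existing pass emits.

    #include <omp.h>

    // Exclusive prefix sum of count[0..buckets): start[i] = sum of count[0..i).
    void parallel_exclusive_scan(const unsigned *count, unsigned *start,
                                 unsigned buckets) {
      unsigned block_sum[256];                      // per-thread block totals
      #pragma omp parallel
      {
        unsigned t  = omp_get_thread_num();
        unsigned nt = omp_get_num_threads();
        unsigned lo = buckets * t / nt, hi = buckets * (t + 1) / nt;

        unsigned sum = 0;                           // pass 1: local exclusive scan
        for (unsigned i = lo; i < hi; i++) {
          start[i] = sum;
          sum += count[i];
        }
        block_sum[t] = sum;
        #pragma omp barrier

        unsigned offset = 0;                        // pass 2: add preceding block totals
        for (unsigned j = 0; j < t; j++)
          offset += block_sum[j];
        for (unsigned i = lo; i < hi; i++)
          start[i] += offset;
      }
    }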

The 4th loop relies on the programmer to use both an explicit assertion and an explicit fetch-and-add intrinsic. I think both of these are outside of Polly’s scope.

This is the sort of code I want to handle – parallel code written by knowledgeable programmers who’re counting on the compiler to recognize parallel loops, recognize and rewrite recurrences and reductions, and insert synchronization where helpful, using a variety of directives to clarify their intent.

Preston

>> so are building a dependence graph for a complete function. Of
>> course, such a thing is useful for vectorization and all sorts of
>> other dependence-based loop transforms.
>>
>> I'm looking at the problem in two parts:
>> 1. a dependence test that, given two memory references, decides
>> if a dependence exists between them, and

Would you consider using Polly http://polly.grosser.es to avoid
writing this code?

I agree, we should use the (self-contained) passes developed in Polly
where possible.

If you do not want to use polly, you could use ISL
http://freecode.com/projects/isl to set up the dependence problem and
use ISL's ILP to solve it.

isl is an LGPL project. It is not clear to me what the general consensus
would be on having a core analysis pass carry an LGPL dependency.

These would be my preferred choices to implement the dependence test:
on the GCC side I have implemented a stand-alone dependence tester
that does not use ILP (i.e., it sets up the dependence test as a Diophantine
equation and then solves it), and I can say that it was quite painful
to write it and to clean the code of bugs.

>> 2. the dependence-graph builder that looks over the complete
>> function and finds pairs of memory references to pass to the
>> dependence test, using the results to build a dependence
>> graph, with edges labeled as to the kind of dependence and
>> direction vectors.
>> Currently I'm focused on (2), the simpler task I think, as a way to
>> learn my way around the infrastructure. The idea, roughly, is to find all
>> the alias sets associated with each load and store and compute what
>> Wolfe calls factored redef-use chains for each alias set (similar to
>> SSA, but supporting computation of all 4 kinds of dependences). By
>> exploring these chains, we can find the right pairs to test,
>> accounting for control flow and the confusion caused by procedure
>> calls, etc.
>>
>
> This sounds very useful.
>
>>
>> I'm not yet sure how I'll proceed for (1). It seems natural and
>> wise to take advantage of the scalar evolutions, but they're new to me
>> and I'll need to study them more. I'd like to provide an interface
>> that's independent of (2), so that it can be used independently, e.g.,
>> by a software pipeliner.
>
> This makes sense.
>
>> I'd also like to provide a caching interface, so we can avoid
>> re-testing pairs of references with identical subscript forms. It
>> also seems desirable to start with relatively simple tests, and expand
>> the coverage over time. We'll see how it goes.
>
> For both (1) and (2), if you'd like to sketch out the interface you have
> in mind, I'd be happy to provide some feedback.
>
> On a practical note, I think that it is important that the mechanisms
> developed for this support both direct multidimensional indexing
> (a[i][j][k]) and also manually-expanded indexing (a[n*(j+n*i) + k]).

It would be good to have multidimensional arrays in LLVM: note that
"recognizing" the multidimensional access from a linearized access is
not trivial. This was the original reason why I have started working
on the gimple representation in GCC: the RTL has all the memory
accesses linearized as in gep http://llvm.org/docs/GetElementPtr.html
and so you have lost the type info about the array shape for memory
accesses.

I agree, to some extent, but an analysis pass would also be quite
useful. There is a lot of real C/C++ code that uses linearized access,
and it will be very important to handle those cases.

> Recognizing the latter may require some specific idiom-recognition code
> (and scalar evolution analysis should be very useful for this).

Unfortunately the SCEV does not help much in recognizing a multi-dim
access.

I've not thought this through in detail, but it would seem helpful.
Let's say that I have a loop with two induction variables, i and j, and
I want to see if an expression looks like i*n+j for some constant n (and
I also need to determine n). Then I would take the expression, subtract
j (SE->getMinusSCEV) and divide by i. If the result is a SCEVConstant,
then I have n and I'm done. Otherwise, I can try something else, (or
fail). Maybe, for example, n is not a constant, but is a loop-invariant
scalar, etc.
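
A rough sketch of that check against LLVM's ScalarEvolution is below. Since SCEV has no general symbolic division, this version matches the i*n product structurally instead of literally dividing by i, and it assumes the expression has not already been folded into an add recurrence; treat it as an illustration of the idea (the function name is made up), not working delinearization.

    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/Analysis/ScalarEvolutionExpressions.h"
    using namespace llvm;

    // Try to recognize Expr == I*n + J for a constant n; return n or nullptr.
    static const SCEVConstant *matchRowStride(ScalarEvolution &SE,
                                              const SCEV *Expr,
                                              const SCEV *I, const SCEV *J) {
      const SCEV *Prod = SE.getMinusSCEV(Expr, J);   // Expr - j, as in the text
      const auto *Mul = dyn_cast<SCEVMulExpr>(Prod);
      if (!Mul || Mul->getNumOperands() != 2)
        return nullptr;
      for (unsigned Op = 0; Op < 2; ++Op) {          // look for constant * i, either order
        const auto *C = dyn_cast<SCEVConstant>(Mul->getOperand(Op));
        if (C && Mul->getOperand(1 - Op) == I)       // SCEVs are uniqued, so pointer
          return C;                                  // equality is enough here
      }
      return nullptr;
    }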

-Hal

Of course.

There's some literature that attacks the problem, e.g.,
http://dl.acm.org/citation.cfm?id=143130
and I expect there's more.

Preston

This is fine for something that wants to live out of tree or be a secondary subproject, but isn't acceptable for something that wants to go into the mainline llvm repository and be a default part of the compiler.

-Chris

If there is consensus that parts of Polly should be included in the core compiler, I am sure there are ways to solve these issues. Polly itself is BSD-licensed anyway; ISL and GMP are LGPL. As isl uses just a tiny subset of GMP, I am pretty sure we can either rewrite this subset ourselves or extend the existing arbitrary (but fixed) width integer functionality in LLVM.

This leaves us with ISL and CLooG. For both the copyright situation is pretty clear. The more complex library here is ISL. To my knowledge it is the only competitive open source integer set library. Rewriting it will be hard*. However, it was developed by one person (Sven Verdoolaege), with some smaller patches from me. AFAIK the copyrights are held by the academic institutions he was working for. This means the copyright situation is clear. It cannot be changed overnight, but the institutions that need to be addressed are known.

The main reason I did not look into this earlier is that I prioritized the development of Polly itself and I postponed copyright issues to the point where they actually need to be solved. This was not the case for Polly as a pure research project, but if parts of it may also be used in production that is obviously different.

If there is a consensus that people want isl to be used within core LLVM, I am very glad to investigate this further. (Otherwise it is still on my TODO list, but not highest priority.) For now I must admit I am very comfortable having Polly as a side project, but I can also see that it might be useful to move more mature parts of it into LLVM.

Cheers
Tobi

* Developing an equivalent library from scratch within LLVM will be at least as hard.

>>> If you do not want to use polly, you could use ISL
>>> http://freecode.com/projects/isl to set up the dependence problem and
>>> use ISL's ILP to solve it.
>>
>> isl is an LGPL project. It is not clear to me what the general consensus
>> would be on having a core analysis pass carry an LGPL dependency.
>
> This is fine for something that wants to live out of tree or be a secondary subproject, but isn't acceptable for something that wants to go into the mainline llvm repository and be a default part of the compiler.

If there is consensus that parts of Polly should be included in the core
compiler, I am sure there are ways to solve these issues. Polly itself
is BSD-licensed anyway; ISL and GMP are LGPL. As isl uses just a tiny subset
of GMP, I am pretty sure we can either rewrite this subset ourselves or
extend the existing arbitrary (but fixed) width integer functionality in
LLVM.

This leaves us with ISL and CLooG. For both the copyright situation is
pretty clear. The more complex library here is ISL. To my knowledge it
is the only competitive open source integer set library. Rewriting it
will be hard*. However, it was developed by one person (Sven
Verdoolaege), with some smaller patches from me. AFAIK the copyrights are
held by the academic institutions he was working for. This means the
copyright situation is clear. It cannot be changed overnight, but the
institutions that need to be addressed are known.

In my experience, it is better to investigate these kinds of things
sooner rather than later. The more contributors a project acquires the
harder it is to get the license changed.

-Hal

Preston Briggs <preston.briggs@gmail.com> writes:

[many things, but I'm only going to focus on one of them]
Would you consider using Polly http://polly.grosser.es to avoid
writing this code?

My impression is that Polly (and polyhedral analysis generally)
doesn't do what I want. But I'm happy to talk about it 'cause I might
be wrong.

I think you are right. However, Polly might be useful as one layer in a
dependence tester. I haven't thought too deeply about it but polyhedral
analysis can do some pretty cool things.

I'm really not clear on what something like CLooG would do with the
linear recurrence example. But the analysis side of Polly might be
quite useful.

                             -Dave

OK. I started looking into this.

Tobi

[many things, but I'm only going to focus on one of them] Would
you consider using Polly http://polly.grosser.es to avoid writing
this code?

My impression is that Polly (and polyhedral analysis generally)
doesn't do what I want. But I'm happy to talk about it 'cause I
might be wrong.

That depends what you want. :wink:

Polly may offer good possibilities (and maybe even code) or it may not
help at all. The best is to discuss this stuff.

If I look at all the ways I know how to sort in parallel, they all
look roughly like this simple counting sort:

for (unsigned i = 0; i < buckets; i++)
    count[i] = 0;

for (unsigned i = 0; i < n; i++)
    count[src[i]]++;

start[0] = 0;
for (unsigned i = 1; i < buckets; i++)
    start[i] = start[i - 1] + count[i - 1];

#pragma assert parallel
for (unsigned i = 0; i < n; i++) {
    unsigned loc = int_fetch_add(start + src[i], 1);

Should this be:

    unsigned loc = int_fetch_add(start[src[i]], 1);

    dst[loc] = src[i];
}

The 1st loop is trivially parallel. I think Polly would recognize
this and do good things.

This case is trivial.

But keep in mind that unsigned loop ivs and integers can cause modulo
wrapping, which is not trivial to handle. Both Polly and any other tool
need to take into account additional (possibly non-uniform) dependences
that may appear in addition to the dependences for non-wrapping integers.
(In LLVM even signed integers can have modulo wrapping, so we always need
to be careful here.)

isl can actually represent modulo constraints pretty well.

The 2nd loop has a race condition that can be handled by using an
atomic increment provided by the architecture, if the compiler knows
about such things. I don't think Polly would help here.

for (unsigned i = 0; i < n; i++)
    count[src[i]]++;

I do not see why it should be difficult to teach polyhedral tools to
model an increment and to understand that it can be done atomically.
Though, if you just want to do this specific transformation, Polly is
probably overhead.

The big selling point of polyhedral transformations is uniform handling
of several transformations. In case you want additional loop
transformations to expose such patterns Polly might be helpful.

Another reason may be GPU code generation. Here you often use techniques like tiling to create the thread groups. Within the polyhedral model this is a very simple transformation. On LLVM-IR that may become complex.

The 3rd loop is a linear recurrence that can be rewritten and solved
in parallel, if the compiler knows about such things. I don't think
Polly would help here.

Polly does not perform this transformation yet. But it may still be helpful as it abstracts away the low level details of LLVM-IR and allows you to focus on the higher level problem.

start[0] = 0;
for (unsigned i = 1; i < buckets; i++)
    start[i] = start[i - 1] + count[i - 1];

The previous code will be translated to:

Statements: Init[]; Body[i] : 1 <= i < buckets

Writes: Init[] -> start[0]; Body[i] -> start[i];
Reads: Body[i] -> start[i-1]; Body[i] -> count[i-1];

When we additionally extract the information about the operation performed by the body ...

The idea is to go away from low level pattern matching, but to use a
high level approach to implement the transformations proposed by you.
For code that does not fit into the polyhedral model, this is obviously
not possible, but the examples you have given so far can be represented
easily.

The 4th loop relies on the programmer to use both an explicit
assertion and an explicit fetch-and-add intrinsic. I think both of
these are outside of Polly's scope.

We do not yet use this within Polly, but I think asking the programmer
to provide additional knowledge (when available) is more than reasonable.

#pragma assert parallel
for (unsigned i = 0; i < n; i++) {
    unsigned loc = int_fetch_add(start[src[i]], 1);
    dst[loc] = src[i];
}

As the int_fetch_add is side-effect free, it is fully polyhedral. It can be
implemented with the relevant LLVM atomicrmw instruction [1]. Polly does not
yet allow atomicrmw instructions, but adding support for them should not be
too difficult.

If Polly gets the additional #pragma information, it can remove all
dependences and it can execute the code in parallel. This is also not
implemented, but seems doable.

This is the sort of code I want to handle -- parallel code written
by knowledgeable programmers who're counting on the compiler to
recognize parallel loops, recognize and rewrite recurrences and
reductions, and insert synchronization where helpful, using a
variety of directives to clarify their intent.

Sounds very much like what we want to handle with Polly. This does
not mean we handle all of this already, but we already have some high level optimizations based on the Pluto algorithm [2] and more importantly, we already are able to abstract away a lot of low level details. So I definitely expect significant overlap. Let's see where we can reuse each other's work.

(I am e.g. very interested in array delinearization or in how you pass loop pragma-annotations to LLVM-IR) :slight_smile:

Cheers
Tobi

[1] http://llvm.org/docs/LangRef.html#i_atomicrmw
[2] http://pluto-compiler.sourceforge.net/

Preston Briggs <preston.briggs@gmail.com> writes:

[many things, but I'm only going to focus on one of them]
Would you consider using Polly http://polly.grosser.es to avoid
writing this code?

My impression is that Polly (and polyhedral analysis generally)
doesn't do what I want. But I'm happy to talk about it 'cause I might
be wrong.

I think you are right. However, Polly might be useful as one layer in a
dependence tester. I haven't thought too deeply about it but polyhedral
analysis can do some pretty cool things.

I'm really not clear on what something like CLooG would do with the
linear recurrence example. But the analysis side of Polly might be
quite useful.

                             -Dave

CLooG does not do anything to it. It is just a layer to lower a high-level
abstraction back to imperative control flow. The question here is whether
the linear recurrence example and similar code can be easily optimized within
the high-level polyhedral abstraction without having to rewrite it at the
LLVM-IR level. If you provide the expected output, I may give this a try.

Tobi

I expect it's more of a research problem than simply a matter of coding.
I've looked around some, and don't find anything about handling recurrences
and reductions inside a polyhedral framework. But more eyes may see more...

The paper I refer to for solving these is here:
http://www.springerlink.com/content/x1346405k3834474/

Preston

for (unsigned i = 0; i < buckets; i++)
    count[i] = 0;

for (unsigned i = 0; i < n; i++)
    count[src[i]]++;

start[0] = 0;
for (unsigned i = 1; i < buckets; i++)
    start[i] = start[i - 1] + count[i - 1];

#pragma assert parallel
for (unsigned i = 0; i < n; i++) {
    unsigned loc = int_fetch_add(start + src[i], 1);
    dst[loc] = src[i];
}

Should this be:

    unsigned loc = int_fetch_add(start[src[i]], 1);

Our intrinsic wants a pointer, so either int_fetch_add(start + src[i], 1)
or int_fetch_add(&start[src[i]], 1) will work.

The 1st loop is trivially parallel. I think Polly would recognize
this and do good things.

This case is trivial.

But keep in mind that unsigned loop ivs and integers can cause modulo
wrapping, which is not trivial to handle. Both Polly and any other tool
need to take into account additional (possibly non-uniform) dependences
that may appear in addition to the dependences for non-wrapping integers.
(In LLVM even signed integers can have modulo wrapping, so we always need
to be careful here.)

Sure. In this case, it would affect the code that determines the
number of iterations.
If the loop terminates, there's no dependence.

The 2nd loop has a race condition that can be handled by using an
atomic increment provided by the architecture, if the compiler knows
about such things. I don't think Polly would help here.

for (unsigned i = 0; i < n; i++)
count[src[i]]++;

I do not see why it should be difficult to teach polyhedral tools to
model an increment and to understand that it can be done atomically.

When I read about polyhedral frameworks, the first thing I see is
the restriction to affine subscripts. For loops 2 and 4, this seems to be
the stumbling block.

The big selling point of polyhedral transformations is uniform handling
of several transformations.

Right. In olden times, people used a bunch of independent loop xforms,
composed in different ways to achieve different effects. Then came the
unimodular xforms, which could achieve the effect of combining several
xforms (but not all). Then came the polyhedral approach, which could do
quite a bit more, including the actual dependence analysis, as long as
your problem fit in the framework.

I think the sorting examples don't fit within the framework.
Indeed, I think it'd be tough to express a sort (something easy, like a
vector of unsigned values) in a way that Polly would parallelize all the loops.

Another reason may be GPU code generation. Here you often use techniques
like tiling to create the thread groups. Within the polyhedral model this is
a very simple transformation. On LLVM-IR that may become complex.

Yes, of course Polly is good for many things, but I use reductions and
recurrences every day; I must have a good way to handle them.

The 3rd loop is a linear recurrence that can be rewritten and solved
in parallel, if the compiler knows about such things. I don't think
Polly would help here.

Polly does not perform this transformation yet. But it may still be helpful
as it abstracts away the low level details of LLVM-IR and allows you to
focus on the higher level problem.

That's the way I think of the dependence graph. It gets me above the IR
and lets me study the flow of dependences.

start[0] = 0;
for (unsigned i = 1; i < buckets; i++)
start[i] = start[i - 1] + count[i - 1];

The previous code will be translated to:

Statements: Init[]; Body[i] : 1 <= i < buckets

Writes: Init[] -> start[0]; Body[i] -> start[i];
Reads: Body[i] -> start[i-1]; Body[i] -> count[i-1];

When we additionally extract the information about the operation performed
by the body ...

The idea is to go away from low level pattern matching, but to use a
high level approach to implement the transformations proposed by you.
For code that does not fit into the polyhedral model, this is obviously
not possible, but the examples you have given so far can be represented
easily.

The 4th loop relies on the programmer to use both an explicit
assertion and an explicit fetch-and-add intrinsic. I think both of
these are outside of Polly's scope.

We do not yet use this within Polly, but I think asking the programmer
to provide additional knowledge (when available) is more than reasonable.

Yep. My doubts arose because nobody ever mentions any way to take
directives into account when analyzing loops with the polyhedral approach.

#pragma assert parallel
for (unsigned i = 0; i < n; i++) {
unsigned loc = int_fetch_add(&start[src[i]], 1);
dst[loc] = src[i];
}

As the int_fetch_add is side effect free, it is fully
polyhedral. It can be implemented with the relevant LLVM
atomicrmw instruction [1]. Polly does not yet allow atomicrmw instructions,
but adding support for them should not be too difficult.

The non-affine subscript is a problem, isn't it?
And do we really consider the int_fetch_add side-effect free?
After all, memory is modified.

If Polly gets the additional #pragma information, it can remove all
dependences and it can execute the code in parallel. This is also not
implemented, but seems doable.

But is it? Nobody seems to write about it. Makes me wonder why not.

This is the sort of code I want to handle -- parallel code written
by knowledgeable programmers who're counting on the compiler to
recognize parallel loops, recognize and rewrite recurrences and
reductions, and insert synchronization where helpful, using a
variety of directives to clarify their intent.

Sounds very much like what we want to handle with Polly. This does
not mean we handle all of this already, but we already have some high level
optimizations based on the Pluto algorithm [2] and more importantly, we
already are able to abstract away a lot of low level details. So I
definitely expect significant overlap. Let's see where we can reuse each
other's work.

Sounds good.

(I am e.g. very interested in array delinearization or in how you pass loop
pragma-annotations to LLVM-IR)

I'm a long ways from doing these things,
so don't have much to say yet.
It's just the direction I'm headed.

Preston

The 1st loop is trivially parallel. I think Polly would recognize
this and do good things.

This case is trivial.

But keep in mind that unsigned loop ivs and integers can cause modulo
wrapping, which is not trivial to handle. Both Polly and any other tool
need to take into account additional (possibly non-uniform) dependences
that may appear in addition to the dependences for non-wrapping integers.
(In LLVM even signed integers can have modulo wrapping, so we always need
to be careful here.)

Sure. In this case, it would affect the code that determines the
number of iterations.
If the loop terminates, there's no dependence.

This becomes interesting if you have an upper bound that is a + b. Keep in
mind that you need to calculate this modulo some integer value that is
derived from the type size. This becomes tricky when fusing loops.

The 2nd loop has a race condition that can be handled by using an
atomic increment provided by the architecture, if the compiler knows
about such things. I don't think Polly would help here.

for (unsigned i = 0; i < n; i++)
    count[src[i]]++;

I do not see why it should be difficult to teach polyhedral tools to
model an increment and to understand that it can be done atomically.

When I read about polyhedral frameworks, the first thing I see is
the restriction to affine subscripts. For loops 2 and 4, this seems to be
the stumbling block.

You need affine subscripts to get exact dependence information (meaning to know statically which statement instance has written the value read by another statement instance). As this information is not available at compile time, neither a polyhedral framework nor any other framework can derive it statically. In such a case you can over approximate. For the example above this would be a read of count[p], where p is an unknown value. In case you have additional information you may get away with less over approximation. E.g. you could take into account that p is even or that you know src[i] is a bijective function.

Just because affine subscripts can be represented in great detail does not mean non-affine subscripts cannot be handled at all. They just provide less information.

The big selling point of polyhedral transformations is uniform handling
of several transformations.

Right. In olden times, people used a bunch of independent loop xforms,
composed in different ways to achieve different effects. Then came the
unimodular xforms, which could achieve the effect of combining several
xforms (but not all). Then came the polyhedral approach, which could do
quite a bit more, including the actual dependence analysis, as long as
your problem fit in the framework.


I think the sorting examples don't fit within the framework.

Actually they are pretty close. At least from the ones that I have seen.

Indeed, I think it'd be tough to express a sort (something easy, like a
vector of unsigned values) in a way that Polly would parallelize all the loops.

Knowing if Polly would parallelize some code is hard to judge without seeing the code, but as Polly has only recently grown its first complex transformations, I would doubt it. The more interesting question is how much of the stuff in Polly can be reused, even though Polly does not do this transformation (and may never do it). Maybe you can use the RegionInfo infrastructure from LLVM (which was developed for Polly), some of the canonicalization passes, or the dependency analysis, or you can steal ideas from the OpenMP code generation.

As there is a lot of overlap, we should make sure to share code. I just realized that I would love to use an independent, generic OpenMP code generation in Polly. Let me know if you start work in this area

Another reason may be GPU code generation. Here you often use techniques
like tiling to create the thread groups. Within the polyhedral model this is
a very simple transformation. On LLVM-IR that may become complex.

Yes, of course Polly is good for many things, but I use reductions and
recurrences every day; I must have a good way to handle them.

So what does such a representation look like?

From my point of view reductions are not very special. Basically a sequence of statement iterations that work on the same data element. If the operation applied is associative, some nice optimizations can be performed and some dependences can be removed.

I need to read the paper more thoroughly to get the exact definition and constraints of (bounded) recurrences, but from looking at your code I believe it can be detected and matched with Polly. The transformations you apply do not seem to be pure scheduling transformations, but rather introduce new statements and may even create non-affine loops. This is more complex to handle and maybe not even possible or clever to do in a polyhedral framework. I will try to have a close look into the paper to understand what transformations you are doing.

The 3rd loop is a linear recurrence that can be rewritten and solved
in parallel, if the compiler knows about such things. I don't think
Polly would help here.

Polly does not perform this transformation yet. But it may still be helpful
as it abstracts away the low level details of LLVM-IR and allows you to
focus on the higher level problem.

That's the way I think of the dependence graph. It gets me above the IR
and lets me study the flow of dependences.

OK. So let me know a little bit more about your dependence graph. What is a statement in your graph? An LLVM-IR instruction? Or do you group them somehow?

start[0] = 0;
for (unsigned i = 1; i < buckets; i++)
    start[i] = start[i - 1] + count[i - 1];

The previous code will be translated to:

Statements: Init[]; Body[i] : 1 <= i < buckets

Writes: Init[] -> start[0]; Body[i] -> start[i];
Reads: Body[i] -> start[i-1]; Body[i] -> count[i-1];

When we additionally extract the information about the operation performed
by body

The idea is to go away from low level pattern matching, but to use a
high level approach to implement the transformations proposed by you.
For code that does not fit into the polyhedral model, this is obviously
not possible, but the examples you have given so far can be represented
easily.

The 4th loop relies on the programmer to use both an explicit
assertion and an explicit fetch-and-add intrinsic. I think both of
these are outside of Polly's scope.

We do not yet use this within Polly, but I think asking the programmer
to provide additional knowledge (when available) is more than reasonable.

Yep. My doubts arose because nobody ever mentions any way to take
directives into account when analyzing loops with the polyhedral approach.

It is not directly obvious how you would pass this information to LLVM-IR. But for source to source tools people commonly use annotations to provide information about live-in, live-out values, the sizes of arrays, ... It makes perfect sense to state the absence of dependences.

#pragma assert parallel
for (unsigned i = 0; i < n; i++) {
    unsigned loc = int_fetch_add(&start[src[i]], 1);
    dst[loc] = src[i];
}

As the int_fetch_add is side effect free, it is fully
polyhedral. It can be implemented with the relevant LLVM
atomicrmw instruction [1]. Polly does not yet allow atomicrmw instructions,
but adding support for them should not be too difficult.

The non-affine subscript is a problem, isn't it?

See above. It just limits how detailed we can analyze the problem. As the user stated that there are no dependences in the loop, there is no need for exact dependences. The loop can be parallelized even with a conservative over approximation of the dependences.

And do we really consider the int_fetch_add side-effect free?
After all, memory is modified.

If you implement it with the atomicrmw instruction, only the memory that is pointed to is modified. Side-effect free does not mean nothing is touched at all, but that we know about the side effects.

If Polly gets the additional #pragma information, it can remove all
dependences and it can execute the code in parallel. This is also not
implemented, but seems doable.

But is it? Nobody seems to write about it. Makes me wonder why not.

The problem is not removing the dependences. This is simple. In a source-to-source tool, we could do it easily. Passing such information from a C loop to an LLVM-IR loop is more difficult (with or without Polly). There is no explicit loop construct in LLVM-IR, so you may need to add metadata to the loop header or the loop induction variable. That can work, but is a little difficult to maintain through transformations.

This is the sort of code I want to handle -- parallel code written
by knowledgeable programmers who're counting on the compiler to
recognize parallel loops, recognize and rewrite recurrences and
reductions, and insert synchronization where helpful, using a
variety of directives to clarify their intent.

Sounds very much like what we want to handle with Polly. This does
not mean we handle all of this already, but we already have some high level
optimizations based on the Pluto algorithm [2] and more importantly, we
already are able to abstract away a lot of low level details. So I
definitely expect significant overlap. Let's see where we can reuse each
other's work.

Sounds good.

(I am e.g. very interested in array delinearization or in how you pass loop
pragma-annotations to LLVM-IR)

I'm a long ways from doing these things,
so don't have much to say yet.
It's just the direction I'm headed.

Just keep me up to date.

Tobi

Very interesting. I just skimmed through it and won't have time to read it today, but I should really take a close look here. Thanks for the pointer.

Tobi

Knowing if Polly would parallelize some code is hard to judge without seeing the code,

Which reminds me...
I think that any compiler that's making non-trivial xforms should have
a feedback
mechanism to show the programmer what's going on, i.e., which loops are
vectorized or parallelized, loops that are fused or distributed,
dependences that
prevent xforms, and so forth. Sometimes we see a sort of annotated listing,
with notes as to what's going on and why.

Is there support for such a thing with Polly?

From my point of view reductions are not very special.
Basically a sequence of statement iterations that work on the same data element.
If the operation applied is associative some nice optimizations can be performed
and some dependences can be removed.

Right. The interesting things are that they come up a lot in user code
and they can be solved in parallel.

So let me know a little bit more about your dependence graph.
What is a statement in your graph? An LLVM-IR instruction?
Or do you group them somehow?

Hal wondered about this too.
I don't have an implementation, only a direction.
That said, here's what I'm thinking...

The dependence graph has nodes and edges.
There's a node for every memory reference (load, store, call)
and edges for dependences between them.
Edges are directed, with a source and a destination.
There are 3 kinds of edge, for True, Anti, and Output dependences,
depending on whether the source and destination are loads and/or stores.
(It may also prove useful to represent input dependences.)

An edge has several attributes:
- common, the number of loops around both references (might be 0 if
  they are in different loop nests)
- levels, the number of loops around each reference that were analyzed
  (might be greater than common when the loops are candidates for loop
  fusion)
- confused, a flag indicating no detailed info is available
- consistent, a flag indicating a simple distance vector (otherwise,
  we'll have a direction vector)

There's also a direction/distance vector with an entry for each loop level.
Entries are the usual disjunction of <, =, > plus a few special values:
- scalar, the address expression is invariant at this level
- unknown
and perhaps a couple more. All pretty classical stuff.
Then, given all this info, we look for opportunities to make a variety
of loop xforms, focusing on the dependence graph rather than the IR.
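
Restated as a hypothetical data structure (nothing like this exists in LLVM yet; it only mirrors the description above for concreteness):

    #include <vector>

    enum class EdgeKind { True, Anti, Output, Input };
    enum class Direction { LT, EQ, GT, Scalar, Unknown };

    struct DependenceEdge {
      EdgeKind Kind;                   // labels the edge
      unsigned Common;                 // loops surrounding both references (may be 0)
      unsigned Levels;                 // loop levels actually analyzed
      bool Confused;                   // no detailed information available
      bool Consistent;                 // a simple distance vector exists
      std::vector<Direction> Dir;      // one entry per analyzed loop level
      std::vector<int> Distance;       // meaningful only when Consistent is true
    };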

Just keep me up to date.

Will do.

Thanks,
Preston

It took about 7 months to get this issue resolved, but since the 2nd of
September isl trunk is available under the MIT license:

http://repo.or.cz/w/isl.git/commit/056289f285e62c52cc59ee826172e4d3092ef3fe

The next official isl release will consequently be under the MIT license. isl still depends on libgmp, but replacing gmp with another arbitrary precision library should be entirely possible.

Cheers
Tobi