[Polly] Analysis of extra compile-time overhead for simple nested loops

Hi all,

I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in the LLVM test-suite (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that this compile-time overhead results from the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions and thus significantly complicates the polly-dependence pass.

Let me show the results with a tiny code example:

int main(int argc, char *argv[]) {
   int n = ((argc == 2) ? atoi(argv[1]) : 46);
   int a, b, x=0;
   for (a=0; a<n; a++)
     for (b=0; b<n; b++)
         x++;
   printf("%d\n", x);
   return(0);
}

The basic LLVM IR code (produced by “clang -O1”) is shown as:

@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
; Function Attrs: nounwind uwtable
define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
%cmp = icmp eq i32 %argc, 2
br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:
%arrayidx = getelementptr inbounds i8** %argv, i64 1
%0 = load i8** %arrayidx, align 8
%call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0) #3
%cmp117 = icmp sgt i32 %call, 0
br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
for.cond2.preheader.lr.ph:
%cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
%cmp314 = icmp sgt i32 %cond22, 0
br label %for.cond2.preheader
for.cond2.preheader:
%x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]
%a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
br i1 %cmp314, label %for.body4, label %for.inc6
for.body4:
%x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
%b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
%inc = add nsw i32 %x.116, 1
%inc5 = add nsw i32 %b.015, 1
%cmp3 = icmp slt i32 %inc5, %cond22
br i1 %cmp3, label %for.body4, label %for.inc6
for.inc6:
%x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]
%inc7 = add nsw i32 %a.018, 1
%cmp1 = icmp slt i32 %inc7, %cond22
br i1 %cmp1, label %for.cond2.preheader, label %for.end8
for.end8:
%x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
%call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
ret i32 0
}
declare i32 @atoi(...)
declare i32 @printf(i8* nocapture readonly, ...)

Such code is very simple and there are no memory instructions at all, so the polly-dependence pass runs very fast on it. Unfortunately, when we run "opt -load LLVMPolly.so" on this basic LLVM IR code, the polly-prepare pass introduces a large number of store instructions like this:

define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
%cond22.reg2mem = alloca i32
%x.019.reg2mem = alloca i32
%x.1.lcssa.reg2mem = alloca i32
%x.1.lcssa.lcssa.reg2mem = alloca i32
%x.0.lcssa.reg2mem = alloca i32
br label %entry.split
entry.split:
%cmp = icmp eq i32 %argc, 2
store i32 46, i32* %cond22.reg2mem
br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:
%arrayidx = getelementptr inbounds i8** %argv, i64 1
%0 = load i8** %arrayidx, align 8
%call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0)
%cmp117 = icmp sgt i32 %call, 0
store i32 0, i32* %x.0.lcssa.reg2mem
store i32 %call, i32* %cond22.reg2mem
br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8

These store instructions significantly complicate the "polly-dependence" pass, thus leading to high compile-time overhead.

I have noticed that such memory instructions are eventually simplified back to scalar operations by the SROA pass, so one possible way to reduce this compile-time overhead is to move the SROA pass ahead of the polly-dependence analysis.

Can anyone give me some hints as to why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of the polly-dependence analysis?

Thanks,
Star Tan

Codeprepare and independent blocks are introducing these loads and stores.
These are prepasses that polly runs prior to building the dependence graph
to transform scalar dependences into data dependences.
Ether was working on eliminating the rewrite of scalar dependences.
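
For illustration, here is a minimal hand-written sketch of that rewrite (my own example with hypothetical names, using the old typed-pointer IR syntax seen elsewhere in this thread). A loop-carried scalar that lives in a PHI node is demoted to a stack slot, so the scalar dependence becomes load/store data dependences:

; Before the prepasses: the loop-carried scalar flows through a PHI node.
; (Assumes %n >= 1; the loop is in rotated form.)
define i32 @before(i32 %n) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.inc, %loop ]
  %x = phi i32 [ 0, %entry ], [ %x.inc, %loop ]
  %x.inc = add i32 %x, 1
  %i.inc = add i32 %i, 1
  %cmp = icmp slt i32 %i.inc, %n
  br i1 %cmp, label %loop, label %exit
exit:
  ret i32 %x.inc
}

; After a reg2mem-style demotion: the same value now flows through a
; stack slot, so the dependence is visible as load/store accesses.
; The induction-variable PHI %i is left alone.
define i32 @after(i32 %n) {
entry:
  %x.reg2mem = alloca i32
  store i32 0, i32* %x.reg2mem
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.inc, %loop ]
  %x = load i32* %x.reg2mem
  %x.inc = add i32 %x, 1
  store i32 %x.inc, i32* %x.reg2mem
  %i.inc = add i32 %i, 1
  %cmp = icmp slt i32 %i.inc, %n
  br i1 %cmp, label %loop, label %exit
exit:
  %x.ret = load i32* %x.reg2mem
  ret i32 %x.ret
}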

Hi Sebpop,

Thanks for your explanation.  
I noticed that Polly would finally run the SROA pass to transform these load/store instructions into scalar operations.  Is it possible to run such a pass before polly-dependence analysis?
Star Tan

I do not think that running SROA before polly is a good idea:
it would defeat the purpose of the code preparation passes that
polly intentionally schedules for the data dependence analysis.
If you remove the data references before polly runs, you would
miss them in the dependence graph: that could lead to incorrect
transforms.

I see. Thank you.
Do you have any ideas on how to reduce the expensive dependence analysis caused by those extra load/store instructions?
Any suggestion would be appreciated. Thanks~

Hi all,

Hi,

I tried to reproduce your findings, but could not do so.

I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in the LLVM test-suite (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that this compile-time overhead results from the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions and thus significantly complicates the polly-dependence pass.

Let me show the results with a tiny code example:

int main(int argc, char *argv[]) {
   int n = ((argc == 2) ? atoi(argv[1]) : 46);
   int a, b, x=0;
   for (a=0; a<n; a++)
     for (b=0; b<n; b++)
         x++;
   printf("%d\n", x);
   return(0);
}

This one misses some includes. Also, it is more convenient if you attach the actual C file you used.

The basic LLVM IR code (produced by "clang -O1") is shown as:
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
; Function Attrs: nounwind uwtable
define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
   %cmp = icmp eq i32 %argc, 2
   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:
   %arrayidx = getelementptr inbounds i8** %argv, i64 1
   %0 = load i8** %arrayidx, align 8
   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0) #3
   %cmp117 = icmp sgt i32 %call, 0
   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
for.cond2.preheader.lr.ph:
   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
   %cmp314 = icmp sgt i32 %cond22, 0
   br label %for.cond2.preheader
for.cond2.preheader:
   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]
   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
   br i1 %cmp314, label %for.body4, label %for.inc6
for.body4:
   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
   %inc = add nsw i32 %x.116, 1
   %inc5 = add nsw i32 %b.015, 1
   %cmp3 = icmp slt i32 %inc5, %cond22
   br i1 %cmp3, label %for.body4, label %for.inc6
for.inc6:
   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]
   %inc7 = add nsw i32 %a.018, 1
   %cmp1 = icmp slt i32 %inc7, %cond22
   br i1 %cmp1, label %for.cond2.preheader, label %for.end8
for.end8:
   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
   ret i32 0
}
declare i32 @atoi(...)
declare i32 @printf(i8* nocapture readonly, ...)

This test case misses the target data. If you would attach the original file, it would be easier to reproduce.

Such code is very simple and there are no memory instructions at all, so the polly-dependence pass runs very fast on it. Unfortunately, when we run "opt -load LLVMPolly.so" on this basic LLVM IR code, the polly-prepare pass introduces a large number of store instructions like this:

Which passes and commands have you actually run? The command "opt -load LLVMPolly.so" does not run the -polly-prepare pass.

Sorry for being picky here. Unfortunately, there are so many ways you could have run such passes that I am afraid I would not guess the right one. And if we are running different experiments, it is difficult to discuss them.

The way I propose to run the passes is as follows:

$ clang -O0 -S -emit-llvm test.c -o out.ll

# Get the polly passes that are run
$ polly-opt test.ll -O3 -polly -debug-passes=Structure

Pass Arguments: -datalayout -notti -basictti -x86tti -targetlibinfo -no-aa -tbaa -basicaa -preverify -domtree -verify -mem2reg -instcombine -simplifycfg -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa -loop-rotate -instcombine -scalar-evolution -loop-simplify -lcssa -iv-users -polly-indvars -polly-prepare -postdomtree -domfrontier -regions -scalar-evolution -polly-detect -polly-independent -polly-analyze-ir -polly-scops -polly-dependences -polly-opt-isl -polly-cloog -polly-codegen -simplifycfg -domtree -sroa -early-cse -lower-expect

[...]

# Run everything up to the offending pass
$ polly-opt test.ll -targetlibinfo -no-aa -tbaa -basicaa -preverify -domtree -verify -mem2reg -instcombine -simplifycfg -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa -loop-rotate -instcombine -scalar-evolution -loop-simplify -lcssa -iv-users -polly-indvars -o out.preopt.ll -S

# Show scops with and without preopt

~$ polly-opt -basicaa -view-scops out.preopt.ll -disable-output
Writing '/tmp/scops-f3a2fa.dot'... done.
Running 'xdot.py' program... done.
~$ polly-opt -basicaa -polly-prepare -view-scops out.preopt.ll -disable-output
Writing '/tmp/scops-37f35f.dot'... done.
Running 'xdot.py' program... done.

# Print the scop info:
$ polly-opt -basicaa -polly-prepare -polly-scops out.preopt.ll -analyze

# Time with and without preoptimization
~$ time polly-opt -basicaa -polly-prepare -polly-dependences out.preopt.ll -disable-output

real 0m0.040s
user 0m0.032s
sys 0m0.004s

~$ time polly-opt -basicaa -polly-dependences out.preopt.ll -disable-output

real 0m0.011s
user 0m0.004s
sys 0m0.004s

Looking at the timings, this does in fact not look too bad. Run within -O3

~$ time polly-opt -basicaa -O3 out.ll -disable-output

real 0m0.015s
user 0m0.004s
sys 0m0.008s

~$ time polly-opt -basicaa -O3 out.ll -disable-output -polly

real 0m0.029s
user 0m0.024s
sys 0m0.004s

Within -O3 the overhead is still far away from the 60x slowdown we have seen on the other test case. A 2x slowdown for a kernel that we fully analyse and code generate is not optimal, but probably not an issue. The question is whether the increasing depth of the loop nest can yield this 60x slowdown.

[..]

These store instructions significantly complicate the "polly-dependence" pass, thus leading to high compile-time overhead.

What do you mean by "complicated the 'polly-dependence' pass"? Without -polly-prepare the scop that is detected is a lot smaller. Hence, the two scops are not really comparable.

The point of the -polly-prepare pass is to transform some scalar dependences into memory dependences. Introducing explicit load/store instructions simplifies the later passes, as scalar dependences do not need to be handled specially. However, introducing store instructions does not (or should not) make the actual dependence analysis problem more complicated. If Polly were extended to detect SCoPs with scalar dependences, those scalar dependences should be the very same ones that are now created for the memory dependences introduced by the -polly-prepare pass.

I have noticed that such memory instructions are eventually simplified back to scalar operations by the SROA pass, so one possible way to reduce this compile-time overhead is to move the SROA pass ahead of the polly-dependence analysis.

Can anyone give me some hints as to why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of the polly-dependence analysis?

Moving the SROA pass is not the solution. It would just prevent Polly from detecting some SCoPs.

I think there are two angles we can have a look at:

1) Does -polly-prepare introduce unneeded dependences?

This means, could we model the scalar dependences more efficiently than how they are modeled by the array accesses introduced by -polly-prepare.

2) What triggers the 60x slowdown in dependence analysis

Is this due to the increasing loop nest depth? Or the increasing number of allocations introduced by -polly-prepare?

An interesting experiment would be to see if the dependence analysis runs faster on a loop that uses a single element array to implement the reduction:

for (i
   for (j
      ...
        X[0] = ...

Meaning one alloc instruction and a single load and store in the innermost loop, without any non-induction variable PHI nodes.

When implementing this in C, LLVM may introduce again PHI nodes. So it may possibly be necessary to write this directly in LLVM-IR.
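
As a sketch of what such hand-written IR could look like (my own minimal example with hypothetical names, only two loop levels): the only PHI nodes are the induction variables, and the innermost body is a single load/add/store to one location:

; Hand-written kernel: X[0] += 1 in a 2-deep loop nest, with no
; non-induction-variable PHI nodes. Assumes %n >= 1 (rotated loops).
define void @kernel(i32* %X, i32 %n) {
entry:
  br label %outer.header
outer.header:
  %i = phi i32 [ 0, %entry ], [ %i.inc, %outer.latch ]
  br label %inner.header
inner.header:
  %j = phi i32 [ 0, %outer.header ], [ %j.inc, %inner.header ]
  %v = load i32* %X                ; single read of X[0]
  %v.inc = add i32 %v, 1
  store i32 %v.inc, i32* %X        ; single write of X[0]
  %j.inc = add i32 %j, 1
  %cmp.j = icmp slt i32 %j.inc, %n
  br i1 %cmp.j, label %inner.header, label %outer.latch
outer.latch:
  %i.inc = add i32 %i, 1
  %cmp.i = icmp slt i32 %i.inc, %n
  br i1 %cmp.i, label %outer.header, label %exit
exit:
  ret void
}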

This became a long mail. Hope it gives you some ideas how to proceed.

Cheers,
Tobi

P.S: Sebastian, also thank you for your comments!

test.c (195 Bytes)

out.ll (1.5 KB)

out.preopt.ll (2.89 KB)

Hi,

I tried to reproduce your findings, but could not do so.

Sorry, I did not put all the code in my previous email because the code seemed a little too long and complicated.
You can refer to the detailed C code and LLVM IR code on http://llvm.org/bugs/show_bug.cgi?id=16843
There are four attachments for our C code and LLVM IR code:

nestedloop.c (http://llvm.org/bugs/attachment.cgi?id=11043): the simplified C code.
nestedloop.ll (http://llvm.org/bugs/attachment.cgi?id=11044): the basic LLVM IR code.
nestedloop.preopt.ll (http://llvm.org/bugs/attachment.cgi?id=11045): the preprocessed LLVM IR code.
nestedloop.prepare.ll (http://llvm.org/bugs/attachment.cgi?id=11046): the LLVM IR code after running the polly-prepare pass.

I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in the LLVM test-suite (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that this compile-time overhead results from the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions and thus significantly complicates the polly-dependence pass.

Let me show the results with a tiny code example:

int main(int argc, char *argv[]) {
   int n = ((argc == 2) ? atoi(argv[1]) : 46);
   int a, b, x=0;
   for (a=0; a<n; a++)
     for (b=0; b<n; b++)
         x++;
   printf("%d\n", x);
   return(0);
}

This one misses some includes. Also, it is more convenient if you attach
the actual C file you used.

Yes, it should include two header files, <stdio.h> and <stdlib.h>. You can refer to the source code on http://llvm.org/bugs/attachment.cgi?id=11043
Compared with the original nestedloop benchmark, I changed the accumulation operation to a single assignment as follows:
for (a=0; a<n; a++)
  for (b=0; b<n; b++)
    for (c=0; c<n; c++)
      for (d=0; d<n; d++)
        for (e=0; e<n; e++)
          for (f=0; f<n; f++)
            x=1; //original is x++;

This test case misses the target data. If you would attach the original
file, it would be easier to reproduce.

I see; you can refer to the different LLVM IR files on http://llvm.org/bugs/show_bug.cgi?id=16843.

Such code is very simple and there are no memory instructions at all, so the polly-dependence pass runs very fast on it. Unfortunately, when we run "opt -load LLVMPolly.so" on this basic LLVM IR code, the polly-prepare pass introduces a large number of store instructions like this:

Which passes and commands have you actually run? The command “opt -load
LLVMPolly.so” does not run the -polly-prepare pass.

Oops, I actually meant “opt -load LLVMPolly.so -polly -O3” here.

Sorry for being picky here. Unfortunately, there are so many ways you
could have run such passes that I am afraid I would not guess the right
one. And if we are running different experiments, it is difficult to
discuss them.

Yes, you are right. I should explain the detailed experimental steps.

The way I propose to run the passes is as follows:
[…]
$ polly-opt -basicaa -polly-prepare -polly-scops out.preopt.ll -analyze

Scop info varies in the two cases (with/without -polly-prepare).
No valid scops are detected if we run polly without -polly-prepare:
$ polly-opt -basicaa -polly-scops out.preopt.ll -analyze

However, a very large region "for.cond2.preheader => for.end31" is detected as a valid region if we run Polly with -polly-prepare:
$ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze

Note that if we run Polly without -polly-prepare, the region "for.cond2.preheader => for.end31" is reported as invalid because the exit BasicBlock has PHI nodes.

Time with and without preoptimization

~$ time polly-opt -basicaa -polly-prepare -polly-dependences
out.preopt.ll -disable-output
[…]

Within -O3 the overhead is still far away from the 60x slowdown we have
seen on the other test case. A 2x slowdown for a kernel that we fully
analyse and code generate is not optimal, but probably not an issue. The
question is whether the increasing depth of the loop nest can yield this
60x slowdown.

This is because the depth of the loop nest is only two. If we increase the depth of the loop nest, the compile time increases significantly. I have posted results for the original code, which contains 6 nested loops, on http://llvm.org/bugs/show_bug.cgi?id=16843#c5

In fact, the compile time grows exponentially as the depth of the loop nest increases:
2-nested: ~2X
4-nested: ~10X
6-nested: ~50X
8-nested: ~200X
10-nested: ~700X
12-nested: ~2000X

These store instructions significantly complicate the "polly-dependence" pass, thus leading to high compile-time overhead.

What do you mean by "complicated the 'polly-dependence' pass"? Without
-polly-prepare the scop that is detected is a lot smaller. Hence, the two
scops are not really comparable.

Yes, no valid scop is detected at all without -polly-prepare.

The point of the -polly-prepare pass is to transform some scalar
dependences into memory dependences. Introducing explicit load/store
instructions simplifies the later passes, as scalar dependences do not
need to be handled specially. However, introducing store instructions
does not (or should not) make the actual dependence analysis problem
more complicated. If Polly were extended to detect SCoPs with scalar
dependences, those scalar dependences should be the very same ones that
are now created for the memory dependences introduced by the
-polly-prepare pass.

At first, I thought -polly-dependence mainly does dependence analysis for memory instructions, so I dumped the intermediate LLVM IR code after each opt pass using "-print-after-all". Then I found that all store instructions are introduced by the -polly-prepare pass. That is why I asked about the -polly-prepare pass.

Now I see: -polly-prepare just transforms some scalar dependences into memory dependences, and it should not introduce new dependences. So -polly-prepare may not be the key problem ~

I have noticed that such memory instructions are eventually simplified back to scalar operations by the SROA pass, so one possible way to reduce this compile-time overhead is to move the SROA pass ahead of the polly-dependence analysis.

Can anyone give me some hints as to why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of the polly-dependence analysis?

Moving the SROA pass is not the solution. It would just prevent Polly
from detecting some SCoPs.

I see. We must find other ways.

I think there are two angles we can have a look at:

1) Does -polly-prepare introduce unneeded dependences?

This means, could we model the scalar dependences more efficiently than
how they are modeled by the array accesses introduced by -polly-prepare.

This is really necessary. In fact, there should be no dependence at all in our code example shown above: there is only a single, constant assignment "x=1" in the nested loop. Unfortunately, Polly still reports very complex data dependences.

2) What triggers the 60x slowdown in dependence analysis

Is this due to the increasing loop nest depth? Or the increasing number
of allocations introduced by -polly-prepare?

I have found that the compile time grows exponentially as the depth of the loop nest increases. However, increasing the loop nest depth also increases the number of allocations and store instructions.

An interesting experiment would be to see if the dependence analysis
runs faster on a loop that uses a single element array to implement the
reduction:

for (i
   for (j
      ...
        X[0] = ...

Yes, I have changed the original code to the form you suggested:
for (i
    for (j
        ...
            x=1

There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.

Meaning one alloc instruction and a single load and store in the
innermost loop, without any non-induction variable PHI nodes.

When implementing this in C, LLVM may introduce again PHI nodes. So it
may possibly be necessary to write this directly in LLVM-IR.

This became a long mail. Hope it gives you some ideas how to proceed.

Thank you so much, Tobias!
I sincerely appreciate your helpful suggestions!

Cheers,
Star Tan

Hi,

I tried to reproduce your findings, but could not do so.

Sorry, I did not put all the code in my previous email because the code seemed a little too long and complicated.
You can refer to the detailed C code and LLVM IR code on http://llvm.org/bugs/show_bug.cgi?id=16843
There are four attachments for our C code and LLVM IR code:

nestedloop.c (http://llvm.org/bugs/attachment.cgi?id=11043): the simplified C code.
nestedloop.ll (http://llvm.org/bugs/attachment.cgi?id=11044): the basic LLVM IR code.
nestedloop.preopt.ll (http://llvm.org/bugs/attachment.cgi?id=11045): the preprocessed LLVM IR code.
nestedloop.prepare.ll (http://llvm.org/bugs/attachment.cgi?id=11046): the LLVM IR code after running the polly-prepare pass.

Thanks, this works for me.

# Time with and without preoptimization
~$ time polly-opt -basicaa -polly-prepare -polly-dependences
out.preopt.ll -disable-output
[..]

Within -O3 the overhead is still far away from the 60x slowdown we have
seen on the other test case. A 2x slowdown for a kernel that we fully
analyse and code generate is not optimal, but probably not an issue. The
question is whether the increasing depth of the loop nest can yield this
60x slowdown.

This is because the depth of the loop nest is only two. If we increase the depth of the loop nest, the compile time increases significantly. I have posted results for the original code, which contains 6 nested loops, on http://llvm.org/bugs/show_bug.cgi?id=16843#c5

In fact, the compile time grows exponentially as the depth of the loop nest increases:
2-nested: ~2X
4-nested: ~10X
6-nested: ~50X
8-nested: ~200X
10-nested: ~700X
12-nested: ~2000X

Perfect. This is what I was expecting. Let's see what we can do about it.

I think there are two angles we can have a look at:

1) Does -polly-prepare introduce unneeded dependences?

This means, could we model the scalar dependences more efficiently than
how they are modeled by the array accesses introduced by -polly-prepare.

This is really necessary. In fact, there should be no dependence at all in our code example shown above: there is only a single, constant assignment "x=1" in the nested loop. Unfortunately, Polly still reports very complex data dependences.

Sure, LLVM may even be able to replace this entire loop by a closed-form expression. However, it is also a good test case to challenge Polly. Let's see what we can do.

2) What triggers the 60x slowdown in dependence analysis

Is this due to the increasing loop nest depth? Or the increasing number
of allocations introduced by -polly-prepare?

I have found that the compile time grows exponentially as the depth of the loop nest increases. However, increasing the loop nest depth also increases the number of allocations and store instructions.

An interesting experiment would be to see if the dependence analysis
runs faster on a loop that uses a single element array to implement the
reduction:

for (i
   for (j
      ...
        X[0] = ...

Yes, I have changed the original code to the form you suggested:
   for (i
       for (j
           ...
                x=1

Sorry, I meant
                  x[0] +=

There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.

Meaning one alloc instruction and a single load and store in the
innermost loop, without any non-induction variable PHI nodes.

When implementing this in C, LLVM may introduce again PHI nodes. So it
may possibly be necessary to write this directly in LLVM-IR.

This is the important one. I wonder what the performance of the dependence analysis is for increasing loop depth, if we have only a single statement that just reads and writes the very same memory location, but does not have all the mess from the non-induction-variable PHI nodes.

Tobi

Yes, I have changed the original code to the form you suggested:
   for (i
       for (j
           ...
                x=1

Sorry, I meant
x[0] +=

It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.
First, if the nested loop only contains a statement "X[0]+=1", like this:
// SingleAcc.c
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int a, b, c, d, e, f, x=0;
  int X[10];
  for (a=0; a<n; a++)
    for (b=0; b<n; b++)
      for (c=0; c<n; c++)
        for (d=0; d<n; d++)
          for (e=0; e<n; e++)
            for (f=0; f<n; f++)
              X[0]+=1; // could be X[0]+=1, X[0]=1 or x=1
  printf("%d\n", X[0]); // could be X[0] or x
  return 0;
}
Then the scop results would be:

$ clang -O0 -emit-llvm -S SingleAcc.c -o SingleAcc.ll

# Run everything up to the offending pass

$ polly-opt SingleAcc.ll -targetlibinfo -no-aa ... -polly-indvars -o SingleAcc.preopt.ll -S

$ polly-opt -basicaa -polly-prepare -polly-scops SingleAcc.preopt.ll -analyze
(*Note: you can download SingleAcc.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11059)

Printing analysis ‘Polly - Create polyhedral description of Scops’ for region: ‘for.cond2.preheader => for.end32’ in function ‘main’:
Context:
[cond.reload] → { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
p0: %cond.reload
Statements {
Stmt_for_body16
Domain :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
Scattering :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] → scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
ReadAccess :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] → MemRef_X[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] → MemRef_X[0] };
}

Similarly, the scop result for "X[0]=1" would be:

$ polly-opt -basicaa -polly-prepare -polly-scops SingleAssign.preopt.ll -analyze
(*Note: you can download SingleAssign.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11060)

Printing analysis ‘Polly - Create polyhedral description of Scops’ for region: ‘for.cond2.preheader => for.end32’ in function ‘main’:
Context:
[cond.reload] → { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
p0: %cond.reload
Statements {
Stmt_for_body16
Domain :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
Scattering :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] → scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_body16[i0, i1, i2, i3, i4, i5] → MemRef_X[0] };
}

Unfortunately, the scop for "x=1" is much more complicated, as follows:

$ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze
(*Note: you can download nestedloop.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11045)

Printing analysis ‘Polly - Create polyhedral description of Scops’ for region: ‘for.cond2.preheader => for.end31’ in function ‘main’:
Context:
[cond.reload] → { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
p0: %cond.reload
Statements {
Stmt_for_cond2_preheader
Domain :=
[cond.reload] → { Stmt_for_cond2_preheader[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
Scattering :=
[cond.reload] → { Stmt_for_cond2_preheader[i0] → scattering[0, i0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
ReadAccess :=
[cond.reload] → { Stmt_for_cond2_preheader[i0] → MemRef_x_018_reg2mem[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_cond2_preheader[i0] → MemRef_x_018_reload_s2a[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_cond2_preheader[i0] → MemRef_x_1_lcssa_reg2mem[0] };
Stmt_for_cond5_preheader_lr_ph
Domain :=
[cond.reload] → { Stmt_for_cond5_preheader_lr_ph[i0] : i0 >= 0 and i0 <= -1 + cond.reload and cond.reload >= 1 };
Scattering :=
[cond.reload] → { Stmt_for_cond5_preheader_lr_ph[i0] → scattering[0, i0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
ReadAccess :=
[cond.reload] → { Stmt_for_cond5_preheader_lr_ph[i0] → MemRef_x_018_reload_s2a[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_cond5_preheader_lr_ph[i0] → MemRef_x_114_reg2mem[0] };
Stmt_for_cond5_preheader
Domain :=
[cond.reload] → { Stmt_for_cond5_preheader[i0, i1] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and cond.reload >= 1 };
Scattering :=
[cond.reload] → { Stmt_for_cond5_preheader[i0, i1] → scattering[0, i0, 2, i1, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
ReadAccess :=
[cond.reload] → { Stmt_for_cond5_preheader[i0, i1] → MemRef_x_114_reg2mem[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_cond5_preheader[i0, i1] → MemRef_x_114_reload_s2a[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_cond5_preheader[i0, i1] → MemRef_x_2_lcssa_reg2mem[0] };
[…]

Domain :=
[cond.reload] → { Stmt_for_inc29[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
Scattering :=
[cond.reload] → { Stmt_for_inc29[i0] → scattering[0, i0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
ReadAccess :=
[cond.reload] → { Stmt_for_inc29[i0] → MemRef_x_1_lcssa_reg2mem[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_inc29[i0] → MemRef_x_1_lcssa_lcssa_reg2mem[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_inc29[i0] → MemRef_x_018_reg2mem[0] };
Stmt_for_cond_for_end31_crit_edge
Domain :=
[cond.reload] → { Stmt_for_cond_for_end31_crit_edge[] };
Scattering :=
[cond.reload] → { Stmt_for_cond_for_end31_crit_edge[] → scattering[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
ReadAccess :=
[cond.reload] → { Stmt_for_cond_for_end31_crit_edge[] → MemRef_x_1_lcssa_lcssa_reg2mem[0] };
MustWriteAccess :=
[cond.reload] → { Stmt_for_cond_for_end31_crit_edge[] → MemRef_x_0_lcssa_reg2mem[0] };
}

The compile time of polly-dependence analysis for the three cases would be:
loop with “X[0]+=1”: 0.130s
loop with “X[0] =1”: 0.053s
loop with “x=1”: 1.037s

There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.

Meaning one alloc instruction and a single load and store in the
innermost loop, without any non-induction variable PHI nodes.

When implementing this in C, LLVM may introduce again PHI nodes. So it
may possibly be necessary to write this directly in LLVM-IR.

This is the important one. I wonder what the performance of the
dependence analysis is for increasing loop depth, if we have only a
single statement that just reads and writes the very same memory
location, but does not have all the mess from the
non-induction-variable PHI nodes.

If we change the "x=1" to "X[0]+=1", then there is only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In such a case, the compile time increases exponentially as the depth of the loop nest increases:

nest-2: 0.063s
nest-4: 0.110s
nest-6: 0.221s
nest-8: 0.465s
nest-10: 0.893s
nest-12: 1.710s

Cheers,
Star Tan

Yes, I have changed the original code to the form you suggested:
    for (i
        for (j
            ...
                 x=1

Sorry, I meant
                  x[0] +=

It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

Indeed, this is interesting.

First, if the nested loop only contains a statement "X[0]+=1", like this:

[...]

The compile time of polly-dependence analysis for the three cases would be:
loop with "X[0]+=1": 0.130s
loop with "X[0] =1": 0.053s
loop with "x=1": 1.037s

OK. So it seems that when reductions are converted to PHI nodes, we see a lot slower compiles. This is possibly due to the increased number of statements as well as the increased number of data-locations modelled.

I see two directions we can approach this problem with:

1) Optimize isl to work faster

We can see if there are some optimization opportunities in isl to make the code faster

2) Generate a less complicated SCoP

If possible, detect that the SCoP can be modelled with less statements and less memory locations and do so.

I did not look into either of the two, so those are basically open questions (that may not be easy to solve). If you are interested, you
may want to take a look.

There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.

Meaning one alloc instruction and a single load and store in the
innermost loop, without any non-induction variable PHI nodes.

When implementing this in C, LLVM may introduce again PHI nodes. So it
may possibly be necessary to write this directly in LLVM-IR.

This is the important one. I wonder what the performance of the
dependence analysis is for increasing loop depth, if we have only a
single statement that just reads and writes the very same memory
location, but does not have all the mess from the
non-induction-variable PHI nodes.

If we change the "x=1" to "X[0]+=1", then there is only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In such a case, the compile time increases exponentially as the depth of the loop nest increases:

nest-2: 0.063s
nest-4: 0.110s
nest-6: 0.221s
nest-8: 0.465s
nest-10: 0.893s
nest-12: 1.710s

OK, so we still have exponential growth with the loop depth, but in general we seem to be a lot faster. It is not surprising that the complexity grows exponentially with the loop depth. However, seeing that even for a 12-level-deep loop nest we only spend two seconds seems to be OK.

Cheers,
Tobias

>>>>
>>>> Yes, I have changed the original code to the form you suggested:
>>>>     for (i
>>>>         for (j
>>>>             ...
>>>>                  x=1
>>>
>>> Sorry, I meant
>>>                   x[0] +=
>>>
>>
>>
>> It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.
>
>Indeed, this is interesting.

Experiments show that this is because their dependence results are different.

1) When the loop statement is an array operation such as "X[0]+=1", clang generates a store instruction and a load instruction within the same basic block, like this:
for (i=0;i<n;i++)
  for (j=0;j<n;j++)
    for (k=0;k<n;k++){
      load %a, memory(A,0)
      add  %a, 1
      store %a, memory(A,0)
    }

Such data dependence can be easily analysed by the polly-dependence pass.

2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and phi nodes like this:

	x0=0
	for (i=0; i<n; i++) {
		x1 = phi(x0, x6)
		for (j=0; j<n; j++) {
			x2 = phi(x1, x5)
			for (k=0;k<n;k++){
				x3=phi(x2, x4)
				x4=x3+1
			}
			x5 = phi(x2, x4)
		}
		x6 = phi(x1, x5)
	}
	x7 = phi(x0, x6)

As we can see, the data dependences become very complex as the depth of the loop nest increases, which slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any idea?

>> First, if the nested loop only contains a statement "X[0]+=1", like this:
>[...]
>
>> The compile time of polly-dependence analysis for the three cases would be:
>> loop with "X[0]+=1": 0.130s
>> loop with "X[0] =1": 0.053s
>> loop with "x=1": 1.037s
>
>OK. So it seems that when reductions are converted to PHI nodes, we see a
>lot slower compiles. This is possibly due to the increased number of
>statements as well as the increased number of data-locations modelled.
>
>I see two directions we can approach this problem with:
>
>1) Optimize isl to work faster
>
>We can see if there are some optimization opportunities in isl to make 
>the code faster

As shown in the above example, the key problem is the complex data dependences of the application, so I do not think it is time to turn to ISL now.

>
>2) Generate a less complicated SCoP
>
>If possible, detect that the SCoP can be modelled with less statements 
>and less memory locations and do so.

Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations?
For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.

>I did not look into either of the two, so those are basically open 
>questions (that may not be easy to solve). If you are interested, you
>may want to take a look.

Sure, I would like to investigate whether they are indeed useful for reducing the compile time of Polly.

Thanks for your suggestions!
Star Tan

Yes, I have changed the original code to the form you suggested:
     for (i
         for (j
             ...
                  x=1

Sorry, I meant
                   x[0] +=

It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

Indeed, this is interesting.

Experiments show that this is because their dependence results are different.

1) When the loop statement is an array operation such as "X[0]+=1", clang generates a store instruction and a load instruction within the same basic block, like this:
for (i=0;i<n;i++)
  for (j=0;j<n;j++)
    for (k=0;k<n;k++){
      load %a, memory(A,0)
      add  %a, 1
      store %a, memory(A,0)
    }

Such data dependence can be easily analysed by the polly-dependence pass.

2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and phi nodes like this:

  x0=0
  for (i=0; i<n; i++) {
    x1 = phi(x0, x6)
    for (j=0; j<n; j++) {
      x2 = phi(x1, x5)
      for (k=0;k<n;k++){
        x3=phi(x2, x4)
        x4=x3+1
      }
      x5 = phi(x2, x4)
    }
    x6 = phi(x1, x5)
  }
  x7 = phi(x0, x6)

As we can see, the data dependences become very complex as the depth of the loop nest increases, which slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any idea?

See below.

First, if the nested loop only contains a statement "X[0]+=1", like this:

[...]

The compile time of polly-dependence analysis for the three cases would be:
loop with "X[0]+=1": 0.130s
loop with "X[0] =1": 0.053s
loop with "x=1": 1.037s

OK. So it seems that when reductions are converted to PHI nodes, we see a
lot slower compiles. This is possibly due to the increased number of
statements as well as the increased number of data-locations modelled.

I see two directions we can approach this problem with:

1) Optimize isl to work faster

We can see if there are some optimization opportunities in isl to make
the code faster

As shown in the above example, the key problem is the complex data dependences of the application, so I do not think it is time to turn to ISL now.

Agreed.

2) Generate a less complicated SCoP

If possible, detect that the SCoP can be modelled with less statements
and less memory locations and do so.

Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations?
For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.

I mean we should detect the same SCoP, but we should check if there is an easy way to understand that those PHI nodes basically just model a reduction, and that we can represent this reduction with fewer SCoP statements (which in turn yields fewer dependences).
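
For instance (a sketch of the idea only, not Polly's actual representation): if the PHI web were recognized as a reduction, the nest could be modelled like the single-statement, single-location kernel already used earlier in this thread, whose dependences are trivial:

#include <stdio.h>

/* Hypothetical "reduction-aware" view of the nest: one statement and
 * one memory location, instead of one statement and one reg2mem slot
 * per loop level. */
int main(void) {
  int n = 46;
  int X[1] = {0};
  for (int a = 0; a < n; a++)
    for (int b = 0; b < n; b++)
      X[0] += 1; /* the whole PHI web collapses to this one statement */
  printf("%d\n", X[0]);
  return 0;
}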

Sorry, I am pretty busy for the next couple of days. Please feel free to look into this by yourself. Maybe you have an idea.

Tobi