[Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Hello Tobi,

I am interested in the Polly project. Polly seems to be a very promising tool for finding program parallelization opportunities on top of the LLVM infrastructure. However, I find that Polly's analysis and optimization can consume significant compile time, so I propose a GSoC project to reduce Polly's compile time, and I hope my work can make Polly more applicable for all LLVM users.

I have done some preliminary experiments. My experimental environment is built as follows:
First, I built LLVM, Clang and Polly with the -O3 option, so all of these tools run at their best;
Second, I evaluated the compile time of PolyBench using the following configurations:
Clang-O3: clang -O3 (the basic clang without Polly)
Polly-basic: clang -Xclang -load -Xclang LLVMPolly.so -O3 (load Polly but do not use Polly optimization)
Polly-optimize: clang -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -O3 (use Polly optimization)
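
For reference, here is a minimal sketch of how such per-file compile times could be collected from the shell (assuming GNU time; the file patterns and output names are placeholders, not part of the original setup):

    for f in *.c; do
      # append the elapsed wall-clock time of each compilation to a file
      /usr/bin/time -f "%e" -a -o clang-O3.times clang -O3 -c $f -o /dev/null
      /usr/bin/time -f "%e" -a -o polly-optimize.times clang -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -O3 -c $f -o /dev/null
    done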

The preliminary experimental results are as follows (benchmarks are collected from PolyBench):


| Benchmark | Clang (seconds) | Polly-basic (seconds) | Polly-optimize (seconds) | Polly-load overhead | Polly-optimize overhead |
| - | - | - | - | - | - |
| 2mm.c | 0.786 | 0.802 | 1.944 | 2.0% | 147.3% |
| correlation.c | 0.782 | 0.792 | 2.005 | 1.3% | 156.4% |
| gesummv.c | 0.583 | 0.603 | 1.08 | 3.4% | 85.2% |
| ludcmp.c | 0.787 | 0.806 | 2.475 | 2.4% | 214.5% |
| 3mm.c | 0.786 | 0.811 | 2.617 | 3.2% | 233.0% |
| covariance.c | 0.73 | 0.74 | 2.294 | 1.4% | 214.2% |
| gramschmidt.c | 0.63 | 0.643 | 1.134 | 2.1% | 80.0% |
| seidel.c | 0.632 | 0.645 | 2.036 | 2.1% | 222.2% |
| adi.c | 0.8 | 0.811 | 3.044 | 1.4% | 280.5% |
| doitgen.c | 0.742 | 0.752 | 2.32 | 1.3% | 212.7% |
| instrument.c | 0.445 | 0.45 | 0.495 | 1.1% | 11.2% |
| atax.c | 0.614 | 0.627 | 1.007 | 2.1% | 64.0% |
| gemm.c | 0.721 | 0.74 | 1.327 | 2.6% | 84.0% |
| jacobi-2d-imper.c | 0.721 | 0.735 | 2.211 | 1.9% | 206.7% |
| bicg.c | 0.577 | 0.597 | 1.01 | 3.5% | 75.0% |
| gemver.c | 0.799 | 0.857 | 1.296 | 7.3% | 62.2% |
| lu.c | 0.68 | 0.702 | 1.132 | 3.2% | 66.5% |
| Average | | | | 2.49% | 142.10% |

Experimental results show that Polly analysis and optimization can lead to 142% extra compile-time overhead, which may be unacceptable for building many large software projects. As a result, reducing the compile time of Polly analysis and optimization is an urgent task.

My plan for this proposal is to reduce Polly's compile-time overhead step by step:

  1. Investigate the details of Polly to find out how much time is spent on analysis and how much on optimization. Of course, it is very important to distinguish the overhead on code where Polly is applicable from code where it is not.
  2. Profile Polly to find the hot code and the sources of compile-time overhead (e.g. with a sampling profiler; see the sketch after this list). Based on the profiling, I will try to rewrite the hot code to improve the compilation process.
  3. Remove expensive analysis passes. For example, the scop detection currently requires both the post-dominance analysis as well as the dominance frontier analysis. Not requiring these up front (at all) would be great.
  4. Reduce the canonicalization passes scheduled before Polly. Before running Polly, we currently schedule a set of passes to canonicalize the LLVM-IR on which the scop detection is run. If I can reduce the number of preparation passes, the compile-time overhead will be reduced.
  5. Find other ways to reduce compile-time overhead.
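
As one possible way to do the profiling in step 2 (a sketch assuming the Linux perf tool; any sampling profiler would do, and 2mm.c is just an example input):

    # record call stacks while compiling one benchmark with Polly enabled
    perf record -g clang -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -O3 -c 2mm.c
    # inspect which functions the compile time was spent in
    perf report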

I hope you can give me further advice.

Best Regards,
Star Tan

  Hello Tobi,

I am interested in the Polly project. Polly seems to be a very promising tool for finding program parallelization opportunities on top of the LLVM infrastructure. However, I find that Polly's analysis and optimization can consume significant compile time, so I propose a GSoC project to reduce Polly's compile time, and I hope my work can make Polly more applicable for all LLVM users.

Dear Star Tan,

the project you propose sounds both interesting and important. As you correctly pointed out, maintaining fast compile times when Polly is enabled is a very important point, especially if we want to think about enabling Polly by default.

I have done some preliminary experiments. My experimental environment is built as follows:
First, I built LLVM, Clang and Polly with the -O3 option, so all of these tools run at their best;
Second, I evaluated the compile time of PolyBench using the following configurations:
       Clang-O3: clang -O3 (the basic clang without Polly)
       Polly-basic: clang -Xclang -load -Xclang LLVMPolly.so -O3 (load Polly but do not use Polly optimization)
       Polly-optimize: clang -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -O3 (use Polly optimization)

I believe this setup is a very good start. However, to draw useful conclusions we probably need more information:

1) Did you build LLVM/Polly in release mode?

The compile times with and without Polly look very high. Compiling 2mm with a current version of clang takes 0.076 seconds for me, whereas you
report a compile time of 0.780 seconds. This may be caused by performance
differences between our machines or, what I believe is more likely, by you
not having compiled LLVM and Polly in Release mode. Profiling of debug builds may give a wrong idea of where compile time is spent. Can you
verify you built in Release mode?

2) Compile time impact when Polly does not optimize the program code

The benchmarks you have chosen can be optimized well by Polly. This is
obviously a nice thing, but it also implies that Polly transforms and
optimizes a large percentage of this code. In cases where Polly can create an optimized version of the code, (limited) increases in compile time are often acceptable. However, one area where we should be super critical about compile time is the case when Polly is enabled but cannot perform any useful transformation.

For polybench, you can get a first idea of the current overhead by timing only the polly-detect part, i.e. by disabling the polly optimizations and the polly code generation. However, to get a more realistic picture, you probably want to run Polly on the LLVM test-suite. I know of some in-house nightly testers that track Polly
performance on the LLVM test suite. Your work would be yet another reason to get a public version of such a tester. I will see if I can get hardware to do so. Meanwhile, you can run it on your own machine.
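
A sketch of what timing only the detection could look like, using the -polly-optimizer=none and -polly-code-generator=none flags that appear later in this thread (2mm.c is a placeholder):

    clang -O3 -load LLVMPolly.so -mllvm -polly \
      -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -c 2mm.c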

The preliminary experimental results are as follows (benchmarks are collected from PolyBench):

> | Benchmark | Clang (seconds) | Polly-basic (seconds) | Polly-optimize (seconds) | Polly-load overhead | Polly-optimize overhead |
> | - | - | - | - | - | - |
> | 2mm.c | 0.786 | 0.802 | 1.944 | 2.0% | 147.3% |
> | correlation.c | 0.782 | 0.792 | 2.005 | 1.3% | 156.4% |
> | gesummv.c | 0.583 | 0.603 | 1.08 | 3.4% | 85.2% |
> | ludcmp.c | 0.787 | 0.806 | 2.475 | 2.4% | 214.5% |
> | 3mm.c | 0.786 | 0.811 | 2.617 | 3.2% | 233.0% |
> | covariance.c | 0.73 | 0.74 | 2.294 | 1.4% | 214.2% |
> | gramschmidt.c | 0.63 | 0.643 | 1.134 | 2.1% | 80.0% |
> | seidel.c | 0.632 | 0.645 | 2.036 | 2.1% | 222.2% |
> | adi.c | 0.8 | 0.811 | 3.044 | 1.4% | 280.5% |
> | doitgen.c | 0.742 | 0.752 | 2.32 | 1.3% | 212.7% |
> | instrument.c | 0.445 | 0.45 | 0.495 | 1.1% | 11.2% |

It is interesting to see that the only file that does not contain a kernel optimized by Polly, but just some auxiliary functions, has a very low compile-time overhead. This may imply that most of the compile-time overhead is due to optimizations Polly can perform.
However, we should only draw conclusions from this after we have verified those numbers are from a Release build.

Experimental results show that Polly analysis and optimization can lead to 142% extra compile-time overhead, which may be unacceptable for building many large software projects. As a result, reducing the compile time of Polly analysis and optimization is an urgent task.

My plan for this proposal is to reduce Polly's compile-time overhead step by step:
1) Investigate the details of Polly to find out how much time is spent on analysis and how much on optimization. Of course, it is very important to distinguish the overhead on code where Polly is applicable from code where it is not.
2) Profile Polly to find the hot code and the sources of compile-time overhead; based on the profiling, I will try to rewrite the hot code to improve the compilation process.
3) Remove expensive analysis passes. For example, the scop detection currently requires both the post-dominance analysis as well as the dominance frontier analysis. Not requiring these up front (at all) would be great.
4) Reduce the canonicalization passes scheduled before Polly. Before running Polly, we currently schedule a set of passes to canonicalize the LLVM-IR on which the scop detection is run. If I can reduce the number of preparation passes, the compile-time overhead will be reduced.
5) Find other ways to reduce compile-time overhead.

Very good points.

I believe the general direction is great. As a next step I propose to continue your profiling work to get a better idea of the performance characteristics of Polly and to understand where compile time is spent.
I have some ideas myself, but I would prefer that you do this analysis independent of my suggestions to double check my own findings. I invite you to regularly report your findings. I should be able to both help you to ensure that the numbers you get allow us to draw correct conclusions and also to point you to source code changes that you could perform to reduce the compile time.

Thanks again for this interesting proposal.

All the best,
Tobias

Dear Tobias Grosser,

Thank you so much for your kind reply. Your advice is very helpful and inspiring.

>On 03/17/2013 11:54 PM, Star Tan wrote:
>>   Hello Tobi,
>>
>> I am interested in the Polly project. Polly seems to be a very promising tool for finding program parallelization opportunities on top of the LLVM infrastructure. However, I find that Polly's analysis and optimization can consume significant compile time, so I propose a GSoC project to reduce Polly's compile time, and I hope my work can make Polly more applicable for all LLVM users.
>
>
>Dear Star Tan,
>
>the project you propose sounds both interesting and important. As you 
>correctly pointed out, maintaining fast compile times when Polly is 
>enabled is a very important point, especially if we want to think about 
>enabling Polly by default.
Yes, it is very important to reduce Polly's compile-time overhead, especially if Polly is enabled by default. That is why I proposed this GSoC project, and I hope my work can be helpful for the further development of Polly.
>
>> I have done some preliminary experiments. My experimental environment is built as follows:
>> First, I built LLVM, Clang and Polly with the -O3 option, so all of these tools run at their best;
>> Second, I evaluated the compile time of PolyBench using the following configurations:
>>        Clang-O3: clang -O3 (the basic clang without Polly)
>>        Polly-basic: clang -Xclang -load -Xclang LLVMPolly.so -O3 (load Polly but do not use Polly optimization)
>>        Polly-optimize: clang -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -O3 (use Polly optimization)
>
>I believe this setup is a very good start. However, to draw useful 
>conclusions we probably need more information:
>
>1) Did you build LLVM/Polly in release mode?
>
>The compile times with and without Polly look very high. Compiling 2mm 
>with a current version of clang takes 0.076 seconds for me, whereas you
>report a compile time of 0.780 seconds. This may be caused by performance
>differences between our machines or, what I believe is more likely, by you
>not having compiled LLVM and Polly in Release mode. Profiling of debug 
>builds may give a wrong idea of where compile time is spent. Can you
>verify you built in Release mode?
>
Thank you so much for your advice. You are right: I just used the polly.sh script (provided on the Polly website) to build LLVM-Polly, and it builds LLVM-Polly in Debug mode by default. I will rebuild LLVM-Polly in Release mode by configuring with the --enable-optimized option, and I will report the new data after I redo these experiments in the new environment.
>2) Compile time impact when Polly does not optimize the program code
>
>The benchmarks you have chosen can be optimized well by Polly. This is
>obviously a nice thing, but it also implies that Polly transforms and
>optimizes a large percentage of this code. In cases where Polly can 
>create an optimized version of the code, (limited) increases in compile 
>time are often acceptable. However, one area where we should be super 
>critical about compile time is the case when Polly is enabled but 
>cannot perform any useful transformation.
>
>For polybench, you can get a first idea of the current overhead by 
>timing only the polly-detect part, i.e. by disabling the polly 
>optimizations and the polly code generation. However, to get a more 
>realistic picture, you probably want to run Polly on the LLVM 
>test-suite. I know of some in-house nightly testers that track Polly
>performance on the LLVM test suite. Your work would be yet another 
>reason to get a public version of such a tester. I will see if I can get 
>hardware to do so. Meanwhile, you can run it on your own machine.
>
That is OK. I will evaluate Polly using both the LLVM test-suite and PolyBench in the following experiments.
>> The preliminary experimental results are as follows (benchmarks are collected from PolyBench):
>>
>> | Benchmark | Clang (seconds) | Polly-basic (seconds) | Polly-optimize (seconds) | Polly-load overhead | Polly-optimize overhead |
>> | - | - | - | - | - | - |
>> | 2mm.c | 0.786 | 0.802 | 1.944 | 2.0% | 147.3% |
>> | correlation.c | 0.782 | 0.792 | 2.005 | 1.3% | 156.4% |
>> | gesummv.c | 0.583 | 0.603 | 1.08 | 3.4% | 85.2% |
>> | ludcmp.c | 0.787 | 0.806 | 2.475 | 2.4% | 214.5% |
>> | 3mm.c | 0.786 | 0.811 | 2.617 | 3.2% | 233.0% |
>> | covariance.c | 0.73 | 0.74 | 2.294 | 1.4% | 214.2% |
>> | gramschmidt.c | 0.63 | 0.643 | 1.134 | 2.1% | 80.0% |
>> | seidel.c | 0.632 | 0.645 | 2.036 | 2.1% | 222.2% |
>> | adi.c | 0.8 | 0.811 | 3.044 | 1.4% | 280.5% |
>> | doitgen.c | 0.742 | 0.752 | 2.32 | 1.3% | 212.7% |
>> | instrument.c | 0.445 | 0.45 | 0.495 | 1.1% | 11.2% |
>
>It is interesting to see that the only file that does not contain a 
>kernel optimized by Polly, but just some auxiliary functions, has 
>a very low compile-time overhead. This may imply that most of the 
>compile-time overhead is due to optimizations Polly can perform.
>However, we should only draw conclusions from this after we have verified 
>those numbers are from a Release build.
>
Yes, I will report the new data from the Release build in the next few days, and then we can discuss the results.
>> Experimental results show that Polly analysis and optimization can lead to 142% extra compile-time overhead, which may be unacceptable for building many large software projects. As a result, reducing the compile time of Polly analysis and optimization is an urgent task.
>>
>> My plan for this proposal is to reduce Polly's compile-time overhead step by step:
>> 1) Investigate the details of Polly to find out how much time is spent on analysis and how much on optimization. Of course, it is very important to distinguish the overhead on code where Polly is applicable from code where it is not.
>> 2) Profile Polly to find the hot code and the sources of compile-time overhead; based on the profiling, I will try to rewrite the hot code to improve the compilation process.
>> 3) Remove expensive analysis passes. For example, the scop detection currently requires both the post-dominance analysis as well as the dominance frontier analysis. Not requiring these up front (at all) would be great.
>> 4) Reduce the canonicalization passes scheduled before Polly. Before running Polly, we currently schedule a set of passes to canonicalize the LLVM-IR on which the scop detection is run. If I can reduce the number of preparation passes, the compile-time overhead will be reduced.
>> 5) Find other ways to reduce compile-time overhead.
>
>Very good points.
>
>I believe the general direction is great. As a next step I propose to 
>continue your profiling work to get a better idea of the performance 
>characteristics of Polly and to understand where compile time is spent.
>I have some ideas myself, but I would prefer that you do this analysis 
>independent of my suggestions to double check my own findings. I invite 
>you to regularly report your findings. I should be able to both help you 
>to ensure that the numbers you get allow us to draw correct conclusions 
>and also to point you to source code changes that you could perform to 
>reduce the compile time.
>
>Thanks again for this interesting proposal.
>
>All the best,
>Tobias
>
>

Dear Tobias Grosser,

Today I have rebuilt LLVM-Polly in Release mode. The configuration of my testing machine is: Intel Pentium Dual CPU T2390 (1.86GHz) with 2GB DDR2 memory.

I evaluated Polly using PolyBench and MediaBench. It takes too long to evaluate the whole LLVM test-suite, so I just chose MediaBench from the LLVM test-suite.

The preliminary results of Polly's compile-time overhead are listed as follows:

Table 1: Compiling time overhead of Polly for PolyBench.

| Benchmark | Clang (seconds) | Polly-load (seconds) | Polly-optimize (seconds) | Polly-load penalty | Polly-optimize penalty |
| - | - | - | - | - | - |
| 2mm.c | 0.155 | 0.158 | 0.75 | 1.9% | 383.9% |
| correlation.c | 0.132 | 0.133 | 0.319 | 0.8% | 141.7% |
| gesummv.c | 0.152 | 0.157 | 0.794 | 3.3% | 422.4% |
| ludcmp.c | 0.157 | 0.159 | 0.391 | 1.3% | 149.0% |
| 3mm.c | 0.103 | 0.109 | 0.122 | 5.8% | 18.4% |
| covariance.c | 0.16 | 0.163 | 1.346 | 1.9% | 741.3% |
| gramschmidt.c | 0.159 | 0.167 | 1.023 | 5.0% | 543.4% |
| seidel.c | 0.125 | 0.13 | 0.285 | 4.0% | 128.0% |
| adi.c | 0.155 | 0.156 | 0.953 | 0.6% | 514.8% |
| doitgen.c | 0.124 | 0.128 | 0.298 | 3.2% | 140.3% |
| instrument.c | 0.149 | 0.151 | 0.837 | 1.3% | 461.7% |
| atax.c | 0.135 | 0.136 | 0.917 | 0.7% | 579.3% |
| gemm.c | 0.161 | 0.162 | 1.839 | 0.6% | 1042.2% |
| jacobi-2d-imper.c | 0.16 | 0.161 | 0.649 | 0.6% | 305.6% |
| bicg.c | 0.149 | 0.152 | 0.444 | 2.0% | 198.0% |
| gemver.c | 0.135 | 0.136 | 0.416 | 0.7% | 208.1% |
| lu.c | 0.143 | 0.148 | 0.398 | 3.5% | 178.3% |
| Average | | | | 2.20% | 362.15% |

Table 2: Compiling time overhead of Polly for Mediabench (Selected from LLVM-testsuite).

| Benchmark | Clang (seconds) | Polly-load (seconds) | Polly-optimize (seconds) | Polly-load penalty | Polly-optimize penalty |
| - | - | - | - | - | - |
| adpcm | 0.18 | 0.187 | 0.218 | 3.9% | 21.1% |
| g721 | 0.538 | 0.538 | 0.803 | 0.0% | 49.3% |
| gsm | 2.869 | 2.936 | 4.789 | 2.3% | 66.9% |
| mpeg2 | 3.026 | 3.072 | 4.662 | 1.5% | 54.1% |
| jpeg | 13.083 | 13.248 | 22.488 | 1.3% | 71.9% |
| Average | | | | 1.80% | 52.65% |

As shown in these two tables, Polly can significantly increase the compile time when it actually optimizes the code, as for PolyBench. On average, Polly increases the compile time by 4.5X for PolyBench. Even for MediaBench, where Polly does not actually improve the efficiency of the generated code, it still increases the compile time by 1.5X.

Based on the above observation, I think we should not only reduce the Polly analysis and optimization time, but also make Polly bail out early when it cannot improve the efficiency of the generated code. That is very important if Polly is enabled by default for LLVM users.

Thanks again for Tobi’s kind help.

Best Regards,
Star Tan

Dear Tobias Grosser,

Today I have rebuilt LLVM-Polly in Release mode. The configuration of my testing machine is: Intel Pentium Dual CPU T2390 (1.86GHz) with 2GB DDR2 memory.
I evaluated Polly using PolyBench and MediaBench. It takes too long to evaluate the whole LLVM test-suite, so I just chose MediaBench from the LLVM test-suite.

OK. This is a good baseline.

The preliminary results of Polly's compile-time overhead are listed as follows:

Table 1: Compiling time overhead of Polly for PolyBench.

> | Benchmark | Clang (seconds) | Polly-load (seconds) | Polly-optimize (seconds) | Polly-load penalty | Polly-optimize penalty |
> | - | - | - | - | - | - |
> | 2mm.c | 0.155 | 0.158 | 0.75 | 1.9% | 383.9% |
> | correlation.c | 0.132 | 0.133 | 0.319 | 0.8% | 141.7% |
> | gesummv.c | 0.152 | 0.157 | 0.794 | 3.3% | 422.4% |
> | ludcmp.c | 0.157 | 0.159 | 0.391 | 1.3% | 149.0% |
> | 3mm.c | 0.103 | 0.109 | 0.122 | 5.8% | 18.4% |
> | covariance.c | 0.16 | 0.163 | 1.346 | 1.9% | 741.3% |

This is a very large slowdown. On my system I get

0.06 sec for Polly-load
0.09 sec for Polly-optimize

What exact version of Polybench did you use? What compiler
flags did you use to compile the benchmark?
Also, did you run the executables several times? How large is the
standard deviation of the results? (You can use a tool like ministat to calculate these values [1])
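
For example, assuming each configuration's measurements are stored one per line, ministat can compare the two samples and report mean, standard deviation and statistical significance (file names are placeholders):

    ministat clang.times polly-optimize.times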

> | gramschmidt.c | 0.159 | 0.167 | 1.023 | 5.0% | 543.4% |
> | seidel.c | 0.125 | 0.13 | 0.285 | 4.0% | 128.0% |
> | adi.c | 0.155 | 0.156 | 0.953 | 0.6% | 514.8% |
> | doitgen.c | 0.124 | 0.128 | 0.298 | 3.2% | 140.3% |
> | instrument.c | 0.149 | 0.151 | 0.837 | 1.3% | 461.7% |

This number is surprising. In your last numbers you reported Polly-optimize as taking 0.495 sec in debug mode. The time you now
report for release mode is almost twice as much. Can you verify
this number, please?

> | atax.c | 0.135 | 0.136 | 0.917 | 0.7% | 579.3% |
> | gemm.c | 0.161 | 0.162 | 1.839 | 0.6% | 1042.2% |

This number also looks fishy. In debug mode you reported 1.327 seconds for Polly-optimize. This is again faster than in release mode.

> | jacobi-2d-imper.c | 0.16 | 0.161 | 0.649 | 0.6% | 305.6% |
> | bicg.c | 0.149 | 0.152 | 0.444 | 2.0% | 198.0% |
> | gemver.c | 0.135 | 0.136 | 0.416 | 0.7% | 208.1% |
> | lu.c | 0.143 | 0.148 | 0.398 | 3.5% | 178.3% |
> | Average | | | | 2.20% | 362.15% |

Otherwise, those numbers look like a good start. Maybe you can put them
on some website/wiki/document where you can extend them as you proceed with benchmarking.

Table 2: Compiling time overhead of Polly for Mediabench (Selected from LLVM-testsuite).
> | Benchmark | Clang (seconds) | Polly-load (seconds) | Polly-optimize (seconds) | Polly-load penalty | Polly-optimize penalty |
> | - | - | - | - | - | - |
> | adpcm | 0.18 | 0.187 | 0.218 | 3.9% | 21.1% |
> | g721 | 0.538 | 0.538 | 0.803 | 0.0% | 49.3% |
> | gsm | 2.869 | 2.936 | 4.789 | 2.3% | 66.9% |
> | mpeg2 | 3.026 | 3.072 | 4.662 | 1.5% | 54.1% |
> | jpeg | 13.083 | 13.248 | 22.488 | 1.3% | 71.9% |
> | Average | | | | 1.80% | 52.65% |

I ran jpeg myself to verify these numbers on my machine. I got:

A: -O3
B: -O3 -load LLVMPolly.so
C: -O3 -load LLVMPolly.so -mllvm -polly
D: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
E: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
    -mllvm -polly-code-generator=none

|      | A   | B   | C   | D   | E   |
| jpeg | 5.1 | 5.2 | 8.0 | 7.9 | 5.5 |

The overhead between A and C is similar to the one you report. Hence, the numbers seem to be correct.

I also added two more runs, D and E, to figure out where the slowdown comes from. As you can see, most of the slowdown disappears when we
do not do code generation. This either means that the Polly code generation itself is slow, or that the LLVM passes afterwards need more
time due to the code we generated (it contains many opportunities for scalar simplifications). It would be interesting to see if this holds for the other benchmarks and to investigate the actual reasons for the slowdown. It is also interesting to see that just running Polly, but without applying optimizations, does not slow down the compilation a lot. Does this also hold for other benchmarks?

As shown in these two tables, Polly can significantly increase the compile time when it actually optimizes the code, as for PolyBench. On average, Polly increases the compile time by 4.5X for PolyBench. Even for MediaBench, where Polly does not actually improve the efficiency of the generated code, it still increases the compile time by 1.5X.
Based on the above observation, I think we should not only reduce the Polly analysis and optimization time, but also make Polly bail out early when it cannot improve the efficiency of the generated code. That is very important if Polly is enabled by default for LLVM users.

Bailing out early is definitely something we can think about.

To get started here, you could e.g. look into the jpeg benchmark and investigate on which files Polly spends a lot of time, where exactly the time is spent, and what kind of SCoPs Polly is optimizing. In case we do not expect any benefit, we may skip code generation entirely.
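
A sketch of how the per-file investigation could start (the source path is a placeholder; this simply times each file of the benchmark separately with Polly enabled):

    for f in jpeg/*.c; do
      /usr/bin/time -f "$f %e" clang -O3 -load LLVMPolly.so -mllvm -polly -c $f -o /dev/null
    done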

Thanks again for your interesting analysis.

Cheers,
Tobi

[1] https://github.com/codahale/ministat

Dear Tobias,

Sorry for the late reply.

I have checked the experiment and found that some of the data was mismatched because of manual copy-and-paste errors, so I have written a shell script to collect the data automatically. The newest data is listed in the attached file.

Tobias, I have made a simple HTML page (attached polly-compiling-overhead.html) to show the experimental data and my plans for this project. I think a public webpage can be helpful for our further discussion. If possible, could you put it on the Polly website (either as a public link or a temporary webpage)?

I think I will try to remove unnecessary code transformations for canonicalization in the next step.

Thank you very much for your warm help.

Best Regards,
Star Tan

polly-compiling-overhead.html (8.48 KB)

polly_build.sh (1.15 KB)

polly_compile.sh (1.18 KB)

Dear Tobias,

Sorry for the late reply.

I have checked the experiment and found that some of the data was mismatched because of manual copy-and-paste errors, so I have written a shell script to collect the data automatically. The newest data is listed in the attached file.

Yes, automating those experiments to make them reproducible is a very
good idea. I did not yet verify the numbers, but I will as soon as your
script is online.

Two comments:

o Can you also run with the following flags:

D: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
E: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
   -mllvm -polly-code-generator=none

o Some numbers are again fishy:

adi: In your previous report you reported 0.953 seconds, the website now
     says 1.839 seconds.

ludcmp: In your previous report you reported 0.391 seconds, the website
        now says 1.346 seconds

instrument: It seems you rounded the previous numbers to one significant
            digit and calculated the performance difference from the
      rounded numbers. I would prefer if you would use the
      original numbers and you would only round when
      displaying/printing the results

Tobias, I have made a simple HTML page (attached polly-compiling-overhead.html) to show the experimental data and my plans for this project. I think a public webpage can be helpful for our further discussion. If possible, could you put it on the Polly website (either as a public link or a temporary webpage)?

Yes, I believe a website is a very good start to illustrate your
findings and to organize the information that you got. For now I propose
to host it yourself, as I expect it to change often and you waiting for
me to add changes would just add overhead (there are plenty of free hosting
services). We can later move it to the Polly website.

Some comments on the content:

- Just put your name as the person who runs the project

I appreciate that you put my name on the top, but this is work you
started and that you will use as a summer of code project application,
so you should be the only person mentioned there.

- Cite properly

Also, as this will later become an application, I believe it is
necessary to make clear what part of the document comes from you and
which part was something you got from reviews/external sources.
Specifically, if you copy a larger text from one of my emails, please
mark it accordingly.

- Typo

'memeory'

I think I will try to remove unnecessary code transformations for canonicalization in the next step.

Are you referring to the region simplification change I was proposing
earlier? I believe this is a good change to work on, as it is simple,
self-contained and also a conceptual cleanup.

After this patch, I believe it is necessary to get more details about
your performance numbers to understand better where your work will be
beneficial.

All the best,
Tobi

Hi all,

This is Star Tan, who proposed a project to reduce Polly's compile-time overhead several days ago. Since then, I have kept moving this project forward, and by now I am much more familiar with Polly and LLVM. Thanks to the help of the Polly and LLVM groups, I have also contributed some LLVM-Polly patches for this project, such as r179673, r179586, r179159, r179158, r179157 and r178530. I am confident that I am in the right position to propose a GSoC project.

I have written a GSoC proposal draft at https://gist.github.com/tanstar/5441808. I am pleased to answer any questions about this project. Any advice or comments would be appreciated.

Thank you very much!

Best regards,
Star Tan.

Hi all,

I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!

Is there any comment or advice about my proposal? I appreciate all your help and advice.

Thanks,
Star Tan
Proposal: https://gist.github.com/tanstar/5441808

Hi all,

When I was profiling Polly, I found some results that were beyond my understanding. I need your help.

To profile Polly in detail, I added some timers to count the compile time of each part of Polly. Attached is the patch file for this purpose. For each runOnRegion/runOnScop/runOnFunction/runOnLoop function in Polly, a timer is inserted to count its compile time. Since these functions usually account for the majority of the compile time, I expected those timers to catch most of Polly's compile-time overhead. Unfortunately, this is not true: my experimental results show that the compile time captured by those timers accounts for less than half of the total Polly compile time.

For example, when compiling doitgen.c from PolyBench with Polly, the total Polly compile-time overhead is about 0.7 seconds, but the overhead captured by our timers is only about 0.2 seconds. A lot of compile time is consumed by LLVM code outside of Polly. For example, RegisterPasses.cpp shows that PM.add(polly::createIslScheduleOptimizerPass()) is immediately followed by PM.add(polly::createCodeGenerationPass()), but our profiling shows that 0.4 seconds elapse between the two passes (ScheduleOptimizer and CodeGeneration). I have checked that the CodeGeneration pass only depends on a few LLVM passes, as follows:

AU.addRequired();
AU.addRequired();
AU.addRequired();
AU.addRequired();
AU.addRequired();
AU.addRequired<ScopDetection>();
AU.addRequired();
AU.addRequired();

How could I find out where the time is spent between two adjacent Polly passes? Can anyone give me some advice?

Thank you!

Best Regards
Star Tan.

PS: I have updated my GSoC proposal. You can access the application on https://gist.github.com/tanstar/5441808 or on https://gist.github.com/tanstar/5441808/raw/c041e001300e3502403eb4071e9556a6eb2b7fd5/%5BGSoc2013%5D%5BLLVM-Polly%5D+Reducing+Polly+Compiling+Overhead
I will submit the proposal in the next few days. Any comments or advice would be appreciated. Thank you!


PollyProfile.patch (19.5 KB)

Hi all,

[...]

How could I find out where the time is spent between two adjacent Polly passes? Can anyone give me some advice?

Hi Star Tan,

I propose to do the performance analysis using the 'opt' tool on LLVM-IR, instead of running it from within clang. For the 'opt' tool there are two options that should help you:

1) -debug-pass=Structure or -debug-pass=Details

This should give you the list of passes that is executed. You can compare the list to see at which point additional passes are scheduled.

2) -time-passes

This gives you the time spent in the different passes.

These two commands may remove the need for a Polly specific profiling infrastructure. Also, if you talk about performance issues you see, it would be great if you could attach the .ll file you use as well as the exact command line you profile.
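
Concretely, a sketch of such a session (foo.c is a placeholder; the opt flags follow the command used later in this thread):

    clang -O0 -emit-llvm -S foo.c -o foo.ll
    # show which passes are scheduled
    opt -load LLVMPolly.so -O3 -polly -debug-pass=Structure foo.ll -S -o foo.opt.ll
    # show how much time each pass takes
    opt -load LLVMPolly.so -O3 -polly -time-passes foo.ll -S -o foo.opt.ll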

Thanks,
Tobias

Hi all,

Hi,

thanks for the update and sorry for the delay in reviewing. I just had a look at your proposal.

I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!

Yes, figuring out the optimal pass ordering sequence is very good.

Is there any comment or advice about my proposal? I appreciate all your help and advice.

1. Summary:

LLVM-Polly is a promising polyhedral optimizer for data-locality and
parallelism, which takes advantage of multi-cores, cache hierarchies,
short vector instructions as well as dedicated accelerators. However,
Polly analysis and optimization can lead to significant compiling
overhead, which makes it much less attractive for LLVM users. I argue
that maintaining fast compiling time when Polly is enabled is very
important, especially if we want to think of enabling Polly by default.
Based on this assumption, I try to reduce Polly compiling overhead in
this project.

Sounds good.

2. Motivation:

LLVM is an incredible open-source project. It has been widely in C/C++

                                          You miss a verb here ^^^

compilers, high-level synthesis compilers, virtual machines, optimizing
tools, etc. As a graduate student, I am going to work on compiler
analysis and optimization, especially on program vectorization and
parallelization. I find Polly is a very useful and powerful polyhedral
optimizer. I would like to use this tool and contribute to this project.

When I was using Polly tool, I found that Polly optimization can lead to

No need for 'tool' here ^^^

significant compiling overhead. On average, polly optimization will
increase the compiling time by 393% for PolyBench benchmarks and by 53%
for MediaBench benchmarks compared with clang. That means if you want to
gain from Polly, you have to pay 4 times extra compiling overhead. Even
if you do not want to gain much from Polly, you still have to pay 53%
compiling overhead. Such expensive compiling overhead would make the
Polly much less attractive to LLVM users.

Good point.

In this project, I try to reduce Polly compiling overhead by removing

I would call it 'compile-time overhead' instead of 'compiling overhead'.

unnecessary passes and improving critical passes. For this purpose, I
firstly try to find out where the compiling overhead comes from. When
Polly optimizes a program, it takes the following steps: 1) Polly
canonicalization: prepare some basic information and do some basic
transformation, such as loop-simplify and region-simplify. 2) LLVM-IR
to Polly description: detect polly scops and translates the detected
scops into a polyhedral representation. 3) Polly optimization: analyze
and optimize polyhedral scops. 4) Polly description to LLVM-IR:
translates the polyhedral description back into new LLVM-IR.

In attched table 1 and 2, pBasic shows the overhead of loading the

      attached

LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
the overhead of step 1), 2) and 4). So the compiling overhead of Polly
can be divided into three parts:
PolyBench:  canonicalization (13% - 1% = 12%), code generation (248% - 13% = 235%) and optimization (393% - 248% = 145%)
MediaBench: canonicalization (9% - 1% = 8%), code generation (43% - 9% = 34%) and optimization (53% - 43% = 10%)

Thanks for adding numbers for pNoGen. Having only 10% runtime increase if Polly is not used is a good sign, especially for the amount of canonicalization passes we run. This makes me confident we can get it to an even smaller number.

The other numbers are large, but there are likely ways to improve on this significantly. Also, it would be good to show at least for one benchmark which passes the different numbers actually contain. (You can use -debug-pass=Structure for this). E.g. the code generation time looks
rather large. I suppose most of the time is not actually spent in code generation itself, but in the LLVM passes, such as common subexpression elimination, that have more LLVM-IR to work on or clean up after Polly was run.

Also, I believe the names of your columns, and the command line options
given above are a little out of sync. I could e.g. not find a description for pBasic

Based on these results, I plan to reduce Polly compiling overhead by the
following steps: First, I will try to remove unnecessary
canonicalization passes to reduce canonicalization time; Second, I will
try to remove or rewrite expensive analysis passes to reduce
optimization overhead; Third, I will try to improve the code generation
passes to reduce code generation overhead. Another interesting work is
to let the polly bail out early, which can be very helpful to save
compiling overhead if Polly cannot benefit the program.

OK, this sounds like a reasonable approach. Some more points may be worth adding:

1) It is important to pick criteria you can evaluate your work on

It is a good start that you identified two benchmarks. Especially looking into non-polybench code is very valuable. You should make sure that you evaluate your work throughout the project to see the benefit
of your changes. In fact, it may even be worthwhile to set up a Polly performance tester to track the compile time with Polly enabled and how
your changes influence it.

2) Add some specific bug reports you are planning to look into

This bug report shows a large performance problem in Polly that is mainly due to creating a very difficult dependency analysis problem:
llvm.org/PR15643

There was a larger discussion on the Polly mailing list that discusses this bug.

3. Details about the project:

StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]

Polly relies on some canonicalization passes to simplify the subsequent
analysis and optimization. Canonicalization passes include
loop-simplify, region-simplify, induction variable canonicalization and
independent blocks. For example, the region-simplify pass is run to simplify
the region to a single entry edge and a single exit edge before -polly-detect.
However, such an approach will introduce unnecessary modifications that
increase compile time even in the cases where Polly cannot optimize the
code.

A first step is to remove the -region-simplify pass. For this purpose, I
have modified the scop detection pass and the Polly code generation pass to
allow scops with multiple entry edges and multiple exit edges. Details
can be found in the following patches (thanks for all the help
from the Polly group):

r179673: Remove unneeded RegionSimplify pass
r179586: Support SCoPs with multiple entry edges
r179159: Support SCoPs with multiple exit edges
r179158: Codegen: Replace region exit and entries recursively
r179157: RegionInfo: Add helpers to replace entry/exit recursively
r178530: ScopDetection: Use isTopLevelRegion

In this project, I plan to spend two weeks on reducing canonicalization
overhead.

It was a good idea to write down what you plan to do each week.

Week 1: Profile the compile-time overhead of each canonicalization pass,
including PromoteMemoryToRegisterPass, CFGSimplificationPass,
ReassociatePass, LoopRotatePass, InstructionCombiningPass,
IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.
Week 2: Remove or improve the one or two most expensive canonicalization
passes. I will also try to revise the pass ordering to move some expensive
canonicalization passes later.

Instead of speeding up the canonicalization passes, your focus should really be on integrating Polly into the -O3 pass chain without the need for any additional canonicalization passes. This part is not so much about the patch itself that implements it. It rather requires careful analysis of how the number of detected scops changes when moving Polly.
At the moment we optimized for optimal scop coverage while neglecting compile time. Now we want both, optimal scop coverage and good compile time.

Another point that can be mentioned is removing the need for induction
variable canonicalization. We currently do this using the -polly-indvars
pass. However, the new option -polly-codegen-scev enables us to remove this
pass entirely. This could also be an interesting performance problem, as
-polly-codegen-scev produces much cleaner LLVM-IR at code generation time,
which may take more time to generate but may also require less time to be
cleaned up. This could also be interesting to investigate.
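
A sketch of how the two code generation variants could be compared (foo.ll is a placeholder; the flag spelling follows the opt command used later in this thread):

    # default code generation path
    opt -load LLVMPolly.so -O3 -polly -time-passes foo.ll -S -o foo.opt.ll
    # SCEV-based code generation, removing the need for -polly-indvars
    opt -load LLVMPolly.so -O3 -polly -polly-codegen-scev -time-passes foo.ll -S -o foo.opt.ll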

StageII -- Remove or rewrite expensive analysis passes for compiling
performance. [Week 3-5]

There are many optimization libraries for Polly, such as ScopLib, Pluto,
ISL and Jason optimization. To balance the tradeoff between code

           JSON

performance and compiling overhead, I will profile each optimization
library and try to improve some of these libraries to reduce compiling
overhead.

The only relevant one is currently isl. It may in some cases be useful to compare against Pluto, though. No need to optimize scoplib or JSON.

Week 3: Profile the compiling overhead of each Polly optimization
library, including ScopLib, Pluto, ISL and Jason.

Instead of profiling per library, I would rather profile per Polly pass
using --time-passes

You could do this later for several programs, but it would be good to have this already today for a single program to get an idea where time is spent and what needs optimization.

Week 4: Profile the compile-time overhead of each optimization pass for
one or two libraries (such as ISL and ScopLib). For example, ISL
optimization provides many optimization passes, such as dependence
simplification, schedule optimization, and various loop transformations.
Week 5: Remove some expensive optimization passes and rewrite some
critical but expensive optimization passes.

StageIII -- Improve code generation passes for compiling performance.
[Week 6-9]

Our evaluations show that the Polly code generation passes are very
expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
code generation passes can increase the compile time by 500% or more
(see table 1). My plan is to improve various code generation passes.

Can you verify your numbers here? You report the following for ludcmp:

|          | clang | pBasic | pNoOpt | pNoGen | pOPt   |
| ludcmp.c | 0.157 | 0.1602 | 0.2002 | 1.0761 | 1.3175 |

|          | pBasic% | pNoGen% | pNoOpt% | pOpt%   |
| ludcmp.c | 2.04%   | 27.52%  | 585.41% | 739.17% |

I have the feeling the headings of the pNoGen% and pNoOpt% columns have been switched accidentally. At least from the numbers above, I see an increase from 0.16 to 0.20 for code generation, which is far from being a 500% increase. On the other side, the optimization itself seems to add a larger amount of time, as well as the code generation of the optimized code.

Week 6: Profile the compile-time overhead of each Polly code generation
pass, especially for ISL code generation.
Week 7: Remove unnecessary analyses for code generation. Currently, the
Polly code generation pass depends on a lot of analysis passes such as
DominatorTree, IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection and
ScopInfo. I will try to remove some of the expensive analysis passes.

Those passes add little overhead to the code generation. In fact, the analysis results are normally already available, such that these analysis requirements are free. They have been added here mainly to allow the code generation to update them, so that we do not need to spend time rebuilding them later.

> Week 8-9: Rewrite some expensive functions for Polly code generation
> based on profiling information.

This is still very vague. I propose to

StageIV -- Let Polly bail out early. [Week 10]

Week 10: Add support in canicalization step or optimization step to

        Typo -----> canonicalization

allow Polly to bail out early if it cannot benefit programs.

StageV -- Improve other parts. [Week 11-12]

Week 11: Improve other parts of Polly. Especially, I will focus on some
expensive helper functions such as TempScop analysis. This helper
function is critical and expensive.

How do you know TempScop is expensive?

Week 12: Integrate all improvements and evaluate the whole of Polly with
multiple benchmarks.

I think the only way to do this project is to continuously evaluate your changes on Polybench and mediabench and to directly integrate them
into the svn repository. This should be made clear at the beginning and
I believe it is very fine to spend more time on the individual steps,
such that we can make sure the changes are properly evaluated and integrated.

4. Profit for LLVM users and Polly users

This project can benefit both LLVM users and Polly users. For LLVM
users, our project will make Polly more acceptable if it provides
extra performance gains with little extra compile-time overhead.
For Polly users, this project will make Polly more powerful by
significantly reducing compile-time overhead and improving code quality.

Nice.

You could make your goals more concrete by saying that we want to show that
by enabling Polly we can significantly optimize the polybench benchmarks, while at the same time no prohibitively large compile-time increase is seen for mediabench. Reaching this goal would be a great
step forward.

[Attachments]

Our evaluation is based on an Intel Pentium Dual CPU T2390 (1.86GHz) with
2GB DDR2 memory. Each benchmark is run multiple times and data are
collected using ministat (https://github.com/codahale/ministat). Results
are shown in table 1 and table 2. Five cases are tested
(alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly"):
* clang: clang -O3
* pLoad: clang -O3 -load LLVMPolly.so
* pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
* pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none
* polly: pollycc -O3

Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
times)

| Benchmark | clang | pBasic | pNoOpt | pNoGen | pOPt | pBasic% | pNoGen% | pNoOpt% | pOpt% |
| - | - | - | - | - | - | - | - | - | - |
| 2mm.c | 0.1521 | 0.1593 | 0.1711 | 0.3235 | 0.7247 | 4.73% | 12.49% | 112.69% | 376.46% |
| atax.c | 0.1386 | 0.1349 | 0.1449 | 0.2066 | 0.313 | 0.00% | 0.00% | 49.06% | 125.83% |
| covariance.c | 0.1498 | 0.1517 | 0.1526 | 0.3561 | 0.7706 | 1.27% | 1.87% | 137.72% | 414.42% |
| gemver.c | 0.1562 | 0.1587 | 0.1724 | 0.2674 | 0.3936 | 1.60% | 10.37% | 71.19% | 151.99% |
| instrument.c | 0.1062 | 0.1075 | 0.1124 | 0.123 | 0.1216 | 0.00% | 5.84% | 15.82% | 14.50% |
| ludcmp.c | 0.157 | 0.1602 | 0.2002 | 1.0761 | 1.3175 | 2.04% | 27.52% | 585.41% | 739.17% |
| 3mm.c | 0.1529 | 0.1559 | 0.1826 | 0.4134 | 1.0436 | 1.96% | 19.42% | 170.37% | 582.54% |
| bicg.c | 0.1244 | 0.1268 | 0.1353 | 0.1977 | 0.2828 | 1.93% | 8.76% | 58.92% | 127.33% |
| doitgen.c | 0.1492 | 0.1505 | 0.1644 | 0.3325 | 0.8971 | 0.00% | 10.19% | 122.86% | 501.27% |
| gesummv.c | 0.1224 | 0.1279 | 0.134 | 0.1999 | 0.2937 | 4.49% | 9.48% | 63.32% | 139.95% |
| jacobi.c | 0.1444 | 0.1506 | 0.1592 | 0.3912 | 0.8494 | 0.00% | 10.25% | 170.91% | 488.23% |
| seidel.c | 0.1337 | 0.1353 | 0.1462 | 0.6299 | 0.9155 | 0.00% | 9.35% | 371.13% | 584.74% |
| adi.c | 0.1593 | 0.1621 | 0.1835 | 1.4375 | 1.849 | 1.76% | 15.19% | 802.39% | 1060.70% |
| correlation.c | 0.1579 | 0.1596 | 0.1802 | 0.3393 | 0.6337 | 1.08% | 14.12% | 114.88% | 301.33% |
| gemm.c | 0.1407 | 0.1432 | 0.1576 | 0.2421 | 0.4477 | 1.78% | 12.01% | 72.07% | 218.20% |
| gramschmidt.c | 0.1331 | 0.1349 | 0.1509 | 0.3069 | 0.4138 | 0.00% | 13.37% | 130.58% | 210.89% |
| lu.c | 0.1419 | 0.1443 | 0.1581 | 0.3156 | 0.3943 | 1.69% | 11.42% | 122.41% | 177.87% |
| average | | | | | | 1.26% | 13.22% | 248.47% | 393.80% |

To improve readability, it may be worth ensuring this fits into 80 columns. You may be able to reduce the number of digits used here.

You could probably increase the readability of your proposal further if you use markdown. See here for an example of how a markdown file looks at github: https://gist.github.com/micmcg/976172 and here the raw version
https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md

You basically need to use the file ending '.md' and you can then use markdown syntax to format your text. The very same syntax will also improve the readability of the proposal on the mailing list.

All the best,

Tobias

Dear Tobias,

Thank you very much for your very helpful advice.

Yes, -debug-pass and -time-passes are two very useful and powerful options for evaluating the compile time of each compiler pass. They are exactly what I need! With these options, I can step into the details of the compile-time overhead of each pass. I have finished some preliminary testing based on two randomly selected files from PolyBench and MediaBench. Results are listed at https://gist.github.com/tanstar/5508153.

Thanks again for your timely advice and help.

Best regards,
Star Tan.

Dear Tobias,

Thank you very much for your very helpful advice.

Yes, -debug-pass and -time-passes are two very useful and powerful
options for evaluating the compile time of each compiler pass. They
are exactly what I need! With these options, I can step into the details
of the compile-time overhead of each pass. I have finished some
preliminary testing based on two randomly selected files from
PolyBench and MediaBench. Results are listed at
https://gist.github.com/tanstar/5508153.

Great. Thanks for the quick update and the nice formatting of your
proposal.

In this project, I try to reduce Polly compile-time overhead by
revising Polly passes. For this purpose, I firstly try to find out
where the compile-time overhead comes from. When Polly optimizes a
program, it takes the following steps:

* step1 - Polly canonicalization: prepare basic information and
complete some transformations.
* step2 - LLVM IR to Polly description: detect valid scops and
translates them into polyhedral representation.
* step3 - Polly optimization: analyze and optimize polyhedral scops.
* step4 - Polly description to LLVM-IR: translates the polyhedral
description back into LLVM-IR.

It may make sense to list in the attachments which passes exactly belong
to these different steps.

Based on these results, I plan to revise all canonicalization passes,
code generation passes and optimization passes in Polly. Then, I will
try my best to reduce the compile-time overhead for each part of them.

In the following 12 weeks, my schedule for this project includes the following stages:
* Stage1 (1 week): set up a Polly performance tester and integrate the automation testing environment.
* Stage2 (1 week): collect and analyze the detailed profiling information for Polly compile process;
* Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM passes that Polly depends on;
* Stage4 (3 weeks): revise canonicalization passes of Polly and try to remove most of canonicalization passes;
* Stage5 (1 week): revise the pass ordering problem and let Polly bail out if it cannot benefit programs;
* Stage6 (1 week): revise and improve Polly code generation passes;
* Stage7 (1 week): revise and improve Polly optimization passes;
* Stage8 (1 week): revise other parts of Polly.

I believe we should move Polly into the default -O3 optimization chain
rather early. You may take some time to profile Polly and to understand
it better, but we should not do extensive optimizations before moving it.
As long as we can ensure the number of scops detected remains the same,
this should be good enough. We can then spend the time optimizing the
Polly integration in the normal -O3 chain.

### 3. Project Details

#### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]

The first important task is to pick criteria to evaluate my work in
this project. This project aims to reduce compile-time overhead,
but at the same time it must not degrade performance. To simplify
the testing, I will use the number of scops optimized by Polly as
the performance criterion. As a result, our Polly performance tester
should contain two parts:

* Code performance: the number of scops optimized by Polly should not
be degraded.
* Compile-time overhead: both the total compile-time overhead and the
percentage taken by each Polly pass are important.

Furthermore, I currently use some scripts to collect the compile-time
statistics, but it still requires some tricky manual work. My plan
is to set up an automated testing environment and integrate it into
Polly.

Yes, this is nice. It would be great if we could get some hardware to
run such tests regularly. I will check whether we can find a sponsor for
this.

**Week 3**: revise the "Polly - Calculate dependence" pass. Since this
pass leads to significant compile-time overhead in both PolyBench and
MediaBench, I will investigate the details of this pass and try to
reduce the compile-time overhead.
**Week 4**: revise the "Polly - Detect static control parts (SCoPs)"
and "Polly - Create polyhedral description of Scops" pass. As shown in
table 3 and table 4, these two passes are expensive. However, they are
also very important to Polly because all Polly optimizations are based
on scops detected in these two passes. I will step into the details of
the two passes and make sure no performance loss happens no matter how
much compile-time overhead is reduced.
**Week 5**: revise the other top-ten expensive Polly passes. For example,
the "Combine redundant instructions" pass runs several times because of
Polly, and it can cause a large compile-time overhead. I will try to
eliminate or reduce the runs of this pass.

In this stage, I will also look into some existing bugs related to
those expensive passes discussed above. For example, the bug
llvm.org/PR15643 is related to the "Polly - Calculate dependence"
pass. Although it has brought some discussion on the LLVM/Polly mailing
list, there is still no feasible solution. I will try my best to fix
such bugs when I step into the details of those passes.

I actually pointed out a solution. We need to change Polly or
recanonicalize the SCEVs we analyse, as the current way the SCEVs are
canonicalized makes Polly introduce different parameters for the expressions
{p + 1, +, 1}_<l> and {p + 2, +, 1}_<l> in case the loop 'l' is not part
of the scop.

Our evaluation is based on an Intel Pentium Dual CPU T2390 (1.86GHz) with
2GB DDR2 memory. Each benchmark is run multiple times and data are
collected using ministat (https://github.com/codahale/ministat).
Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and
Cloog source code was checked out from SVN on April 24, 2013.

Can you give the svn revision / git hash for the checkouts of LLVM,
Polly, isl, cloog and polybench.

Table 1 and table 2 show the compile-time overhead of Polly. Five cases are tested:
(alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly)
* **clang**: clang -O3
* **pBasic**: clang -O3 -load LLVMPolly.so
* **pNoGen**: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
* **pNoOpt**: pollycc -O3 -mllvm -polly-optimizer=none
* **pOpt**: pollycc -O3

You probably want to add a couple of additional parameters here to
ensure we detect all scops in polybench. I would add -mllvm
-polly-ignore-aliasing and -DPOLYBENCH_USE_SCALAR_LB

You can use '-mllvm -debug-only=polly-cloog' to see if the kernels
are properly detected. I get the following output:

alias polly-clang
alias polly-clang='~/Projekte/polly/build/bin/clang -Xclang -load -Xclang ~/Projekte/polly/build/lib/LLVMPolly.so'
grosser@tobilaptop:~/Projekte/polybench$ polly-clang -O3 -mllvm -polly -mllvm -debug-only=polly-cloog linear-algebra/kernels/gemm/gemm.c -I utilities/ utilities/polybench.c -mllvm -polly-ignore-aliasing -DPOLYBENCH_USE_SCALAR_LB
:: init_array : entry.split => for.end56
if ((nj >= 1) && (nk >= 1) && (p_1 >= 1) && (p_4 >= 1)) {
   for (c2=0;c2<=p_4-1;c2+=32) {
     for (c3=max(-32*floord(p_1-12*p_4+10,32)-32*p_4,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) {
       for (c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_4-1);c4++) {
         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body41(c4,-20*c4-c5);
         }
       }
     }
   }
}
if ((ni >= 1) && (nk >= 1) && (p_0 >= 1) && (p_4 >= 1)) {
   for (c2=0;c2<=p_0-1;c2+=32) {
     for (c3=max(-32*floord(-12*p_0+p_4+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_4+10,32)-640);c3<=-20*c2;c3+=32) {
       for (c4=max(ceild(-c3-p_4-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) {
         for (c5=max(c3,-20*c4-p_4+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body18(c4,-20*c4-c5);
         }
       }
     }
   }
}
if ((ni >= 1) && (nj >= 1) && (p_0 >= 1) && (p_1 >= 1)) {
   for (c2=0;c2<=p_0-1;c2+=32) {
     for (c3=max(-32*floord(-12*p_0+p_1+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) {
       for (c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) {
         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body3(c4,-20*c4-c5);
         }
       }
     }
   }
}
Stmt_entry_split();
:: kernel_gemm : for.cond1.preheader => for.end28
for (c2=0;c2<=1023;c2+=32) {
   for (c3=0;c3<=1023;c3+=32) {
     for (c4=c2;c4<=c2+31;c4++) {
       for (c5=c3;c5<=c3+31;c5++) {
         Stmt_for_body3(c4,c5);
       }
     }
   }
}
for (c2=0;c2<=1023;c2+=32) {
   for (c3=0;c3<=1023;c3+=32) {
     for (c4=0;c4<=1023;c4+=32) {
       for (c5=c2;c5<=c2+31;c5++) {
         for (c6=c3;c6<=c3+31;c6++) {
           for (c7=c4;c7<=c4+31;c7++) {
             Stmt_for_body8(c5,c6,c7);
           }
         }
       }
     }
   }
}
:: polybench_flush_cache : for.inc => for.end
for (c1=0;c1<=4194559;c1++) {
   Stmt_for_inc(c1);
}

As a first step, we should ensure that Polly is fast for these flags. As
subsequent steps we should then ensure that the absence of these flags
does not increase compile-time.

Tables 3 and 4 show the compile-time overhead of the top 15 passes in
polly-opt. We first generate LLVM IR files with the command "clang -O3
-emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time
of each pass using the command "opt -basicaa -load LLVMPolly.so -O3
-polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes".
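
As a rough sketch, the per-pass report that -time-passes writes to stderr can be reduced to such a top-15 list as follows (the grep/sort step is an assumption about the default report layout):

    opt -basicaa -load LLVMPolly.so -O3 -polly -polly-codegen-scev \
        XX.ll -S -o XX.opt.ll -time-passes 2> XX.timing
    # Keep the per-pass rows and sort by the leading time column.
    grep '%' XX.timing | sort -rn | head -n 15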

You want to extract the .ll file with 'clang -O0'. Otherwise you run
polly on code that is already -O3 optimized, making the runs not
comparable to the clang integrated ones and also unrealistic as they do
not reflect what we do when running Polly from within clang.

It would be interesting to understand the doitgen results better. The
time in the optimizer is only 0.408 seconds, whereas the increase from
pBasic to pOpt is 0.897 - 0.151 = 0.746 seconds. This seems surprising.
Is this because of running Polly on -O3 optimized code, is Polly
producing bigger .ll files which lead to longer object file emission
time, or is there a measurement problem?

That's it for now. Thanks for the update!

Cheers,
Tobi

Dear Tobias,

Thanks for your timely reply. Your advice is really helpful.

I have updated the proposal on https://gist.github.com/tanstar/5508153. Major differences include:
(1) Add tables 3 and 4 to show the compile-time overhead of the top 15 hot passes;
(2) Describe a new schedule for this project. The new schedule pays more attention to reducing the compile-time overhead of hot passes and includes eight stages.
(3) Enrich the proposal with a lot of concrete work plans for each stage.
(4) Rewrite the proposal using markdown to make it more readable.

>On 04/26/2013 05:08 AM, tanmx_star wrote:
>> Hi all,
>
>Hi,
>
>thanks for the update and sorry for the delay in reviewing. I just had a 
>look at your proposal.
>
>
>> I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!
>
>Yes, figuring out the optimal pass ordering sequence is very good.
>
>> Is there any comment or advice about my proposal?  I appreciate all your help and advice.
>
>
>> 1. Summary:
>>
>> LLVM-Polly is a promising polyhedral optimizer for data-locality and
>> parallelism, which takes advantage of multi-cores, cache hierarchies,
>> short vector instructions as well as dedicated accelerators. However,
>> Polly analysis and optimization can lead to significant compiling
>> overhead, which makes it much less attractive for LLVM users. I argue
>> that maintaining fast compiling time when Polly is enabled is very
>> important, especially if we want to think of enabling Polly by default.
>> Based on this assumption, I try to reduce Polly compiling overhead in
>> this project.
>
>Sounds good.
>
>
>> 2. Motivation:
>>
>> LLVM is an incredible open-source project. It has been widely in C/C++
>
>                                          You miss a verb here ^^^
>
>> compilers, high-level synthesis compilers, virtual machines, optimizing
>> tools, etc. As a graduate student, I am going to work on compiler
>> analysis and optimization, especially on program vectorization and
>> parallelization. I find Polly is a very useful and powerful polyhedral
>> optimizer. I would like to use this tool and contribute to this project.
>>
>> When I was using Polly tool, I found that Polly optimization can lead to
>No need for 'tool' here  ^^^
>
>> significant compiling overhead. On average, polly optimization will
>> increase the compiling time by 393% for PolyBench benchmarks and by 53%
>> for MediaBench benchmarks compared with clang. That means if you want to
>> gain from Polly, you have to pay 4 times extra compiling overhead. Even
>> if you do not want to gain much from Polly, you still have to pay 53%
>> compiling overhead. Such expensive compiling overhead would make the
>> Polly much less attractive to LLVM users.
>
>Good point.
>
>> In this project, I try to reduce Polly compiling overhead by removing
>
>I would call it 'compile-time overhead' instead of 'compiling overhead'.
>
>> unnecessary passes and improving critical passes.  For this purpose, I
>> firstly try to find out where the compiling overhead comes from. When
>> Polly optimizes a program, it takes the following steps: 1) Polly
>> canonicalization: prepare some basic information and do some basic
>> transformation, such as loop-simplify and region-simplify.  2) LLVM-IR
>> to Polly description: detect polly scops and translates the detected
>> scops into a polyhedral representation.  3) Polly optimization: analyze
>> and optimize polyhedral scops.  4) Polly description to LLVM-IR:
>> translates the polyhedral description back into new LLVM-IR.
>>
>> In attched table 1 and 2, pBasic shows the overhead of loading the
>      attached
>
>> LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
>> the overhead of step 1), 2) and 4). So the compiling overhead of Polly
>> can be divided into three parts:
>> PolyBench:  canonicalization (13% - 1% = 12%), code generation (248% - 13% = 235%) and optimization (393% - 248% = 145%)
>> MediaBench: canonicalization (9% - 1% = 8%), code generation (43% - 9% = 34%) and optimization (53% - 43% = 10%)
>
>Thanks for adding numbers for pNoGen. Having only 10% runtime increase 
>if Polly is not used is a good sign, especially for the amount of 
>canonicalization passes we run. This makes me confident we can get it to 
>an even smaller number.
>
>The other numbers are large, but there are likely ways to improve on 
>this significantly. Also, it would be good to show at least for one 
>benchmark which passes the different numbers actually contain. (You can 
>use -debug-pass=Structure for this). E.g. the code generation time looks
>rather large. I suppose most of the time is not actually spent in code 
>generation, but also in the LLVM passes such as common subexpression 
>elimination that have more LLVM-IR to work on or clean up after Polly 
>was run.
>
>Also, I believe the names of your columns, and the command line options
>given above are a little out of sync. I could e.g. not find a 
>description for pBasic
Sorry, pBasic means pLoad. I have added the description for pBasic in the new proposal.
>
>> Based on these results, I plan to reduce Polly compiling overhead by the
>> following steps: First, I will try to remove unnecessary
>> canonicalization passes to reduce canonicalization time; Second, I will
>> try to remove or rewrite expensive analysis passes to reduce
>> optimization overhead; Third, I will try to improve the code generation
>> passes to reduce code generation overhead. Another interesting work is
>> to let the polly bail out early, which can be very helpful to save
>> compiling overhead if Polly cannot benefit the program.
>
>OK, this sounds like a reasonable approach. Some more points may be 
>worth adding:
>
>1) It is important to pick criteria you can evaluate your work on
>
>It is a good start that you identified two benchmarks. Especially 
>looking into non-polybench code is very valuable. You should make sure 
>that you evaluate your work throughout the project to see the benefit
>of your changes. In fact, it may even be worthwhile to set up a Polly 
>performance tester to track the compile time with Polly enabled and how
>your changes influence it.
Yes, you are right. It is very important prerequisite work to pick criteria for the continuous evaluation. I have added an extra stage (Stage1) for this work. In my opinion, the "number of scops optimized by Polly" can be used as the performance criterion, while the "total compile-time overhead" and the "compile-time overhead of each Polly pass" can be used as the compile-time criteria. I will set up the testing environment and integrate it into the Polly SVN repository as soon as possible.
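
A minimal sketch of how the scop count could be collected, assuming the polly-detect pass name and LLVM's -stats output (the exact statistic strings are an assumption):

    # The scop-detection statistics go to stderr together with all others.
    opt -load LLVMPolly.so -polly-detect -analyze -stats XX.ll 2>&1 | grep -i scop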
>
>2) Add some specific bug reports you are planning to look into
>
>This bug report shows a large performance problem in Polly that is 
>mainly due to creating a very difficult dependency analysis problem:
>llvm.org/PR15643
>
>There was a larger discussion on the Polly mailing list that discusses 
>this bug.
I have added such kind of work plans to stage3 in the new proposal.
>
>> 3. Details about the project:
>>
>> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
>>
>> Polly relies on some canonicalization passes to simplify the following
>> analysis and optimization. Canonicalization passes include
>> loop-simplify, region-simplify, Induction variable canonicalization and
>> block independent. For example, region-simplify pass is run to simplify
>> the region to single entry and single exit edge before -polly-detect.
>> However, such approach will introduce unnecessary modifications that
>> increase compile time even in the cases where Polly cannot optimize the
>> code.
>>
>> A first step is to remove -region-simplify pass. For this purpose, I
>> have modified the scop detection pass and polly code generation pass to
>> allow scops with multiple entry edges and multiple exit edges. Details
>> can be referred to the following patch files: (Thanks for all the help
>> from Polly group)
>
>> r179673: Remove unneeded RegionSimplify pass
>> r179586: Support SCoPs with multiple entry edges
>> r179159: Support SCoPs with multiple exit edges
>> r179158: Codegen: Replace region exit and entries recursively
>> r179157: RegionInfo: Add helpers to replace entry/exit recursively
>> r178530: ScopDetection: Use isTopLevelRegion
>>
>> In this project, I plan to spend two weeks to reduce canonicalization
>> overhead.
>
>It was a good idea to write down what you plan to do each week.
>
>> Week 1: Profile the compiling overhead of each canonicalization pass,
>> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
>> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
>> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.
>> Week 2: Remove or improve one or two of the most expensive
>> canonicalization passes. I will also try to revise the pass ordering to
>> move some expensive canonicalization passes later.
>
>Instead of speeding up the canonicalization passes your focus should 
>really be integrating Polly into the -O3 pass chain without the need to 
>have any additional canonicalization passes. This part is not so much 
>about the patch itself that implements it. It rather requires careful 
>analysis how the number of detected scops changes when moving Polly.
>At the moment we optimized for optimal scop coverage while neglecting 
>compile time. Now we want both, optimal scop coverage and good compile time.
>
>Another point that can be mentioned is removing the need for induction
>variable canonicalization. We currently do this using the -polly-indvars 
>pass. However, the new option -polly-codegen-scev enables us to remove 
>this pass entirely. This could also be an interesting performance
>problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code 
>generation time, which may take more time to generate but it may also
>require less time to be cleaned up. This could also be interesting to 
>investigate.
>
Work plans for this are added to stage4 in the new proposal.
You are right, it would be great if we can completely remove the canonicalization passes. I think I will first try to remove -polly-indvars, and then I will investigate the other canonicalization passes.
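
A rough sketch of how the effect of removing such a pass could be checked, reusing the -stats based scop count as the criterion (flag names as used elsewhere in this thread):

    opt -load LLVMPolly.so -O3 -polly -stats XX.ll -o /dev/null 2> before.stats
    opt -load LLVMPolly.so -O3 -polly -polly-codegen-scev -stats XX.ll -o /dev/null 2> after.stats
    # The scop-related counters should not decrease after the change.
    diff before.stats after.stats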
>
>
>> StageII -- Remove or rewrite expensive analysis passes for compiling
>> performance. [Week 3-5]
>>
>> There are many optimization libraries for Polly, such as ScopLib, Pluto,
>> ISL and Jason optimization. To balance the tradeoff between code
>           JSON
>> performance and compiling overhead, I will profile each optimization
>> library and try to improve some of these libraries to reduce compiling
>> overhead.
>
>The only relevant one is currently isl. It may in some cases be useful 
>to compare against Pluto so. No need to optimize scoplib or JSON.
Yes, your comments are right. However, it seems Polly uses Cloog to generate code by default, which is much slower than ISL. Do you mean we will use ISL as the default in the future?
>
>> Week 3:  Profile the compiling overhead of each Polly optimization
>> library, including ScopLib, Pluto, ISL and Jason.
>
>Instead of profiling per library, I would rather profile per Polly pass
>using --time-passes
>
>You could do this later for several programs, but it would be good to 
>have this already today for a single program to get an idea where time 
>is spent and what needs optimization.
Yes, the new proposal pays more attention to profiling and improving each Polly pass. "--time-passes" is really a very useful option!
>
>> Week 4: Profile the compiling overhead of each optimization pass for one
>> or two libraries (such as ISL and ScopLib). For example, ISL optimization
>> provides many optimization passes, such as dependence simplification,
>> schedule optimization, and various loop transformations.
>> Week 5: remove some expensive optimization passes and rewrite some
>> critical but expensive optimization passes.
>
>
>
>
>> StageIII -- Improve code generation passes for compiling performance.
>> [Week 6-9]
>>
>> Our evaluations show that Polly code generation passes are very
>> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
>> code generation passes can increase the compiling time by 500% or more
>> (see table 1). My plan is to improve various code generation passes.
>
>Can you verify your numbers here. You report for ludcmp the following:
>
>	   	clang	pBasic	pNoOpt	pNoGen	pOPt
>ludcmp.c	0.157	0.1602	0.2002	1.0761	1.3175
>
>			pBasic%	pNoGen%	pNoOpt%	pOpt%
>			2.04%	27.52%	585.41%	739.17%
>
>I have the feeling the headings of the pNoGen% and pNoOpt% columns have 
>been switched accidentally. At least from the numbers above, I see an 
>increase from 0.16 to 0.20 for code generation, which is far from being 
>a 500% increase. On the other side, the optimization itself seems to add 
>a larger amount of time as well as the code generation of the optimized 
>code. O
Sorry, the right order should be  "clang pBasic pNoGen pNoOpt pOPt pBasic% pNoGen% pNoOpt% pOpt%". I have fixed this problem in the new proposal.
>> Week 6: Profile the compiling overhead of each Polly code generation
>> pass, especially for ISL code generation.
>> Week 7: Remove unnecessary analysis for code generation. Currently, the
>> Polly code generation pass depends on a lot of analysis passes, such as
>> DominatorTree, IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection and
>> ScopInfo. I will try to remove some of the expensive analysis passes.
>
>Those passes add little overhead to the code generation. In fact the 
>analysis is normally already available, such that these analysis 
>requirements are for free. They have been added here mainly to allow the
>code generation to update them, such that we do not need to spend time 
>rebuilding them later.
>
> > Week 8-9: Rewrite some
>> expensive functions for Polly code generation based on profiling
>> information.
>
>This is still very vague. I propose to
>
>> StageIV -- Let Polly bail out early. [Week 10]
>>
>> Week 10: Add support in canicalization step or optimization step to
>        Typo ----->        canonicalization
>
>> allow Polly to bail out early if it cannot benefit programs.
>
>
>> StageV -- Improve other parts. [Week 11-12]
>>
>> Week 11: Improve other parts of Polly. Especially, I will focus on some
>> expensive helper functions such as TempScop analysis. This helper
>> function is critical and expensive.
>
>How do you know TempScop is expensive?
>
>> Week 12: Integrate all improvements
>> and evaluate the whole Polly with multiple benchmarks.
>
>I think the only way to do this project is to continuously evaluate your 
>changes on Polybench and mediabench and to directly integrate them
>into the svn repository. This should be made clear at the beginning and
>I believe it is very fine to spend more time on the individual steps,
>such that we can make sure the changes are properly evaluated and 
>integrated.
Yes, I will set up the environment as soon as possible and integrate it into the Polly SVN repository. Currently I have finished some scripts, and I think this work can be done in the next week.
>
>> 4. Profit for LLVM users and Polly users
>>
>> This project can benefit both LLVM users and Polly users. For LLVM
>> users, our project will make Polly more acceptable if it can
>> provide extra performance gains with little extra compiling overhead.
>> For Polly users, this project will make the Polly more powerful by
>> significantly reducing compiling overhead and improving code quality.
>
>Nice.
>
>You could make your goals more concrete saying that we want to show that
>by enabling Polly we can significantly optimize the polybench
>benchmarks, while at the same time no prohibitively large compile time 
>increase can be seen for mediabench. Reaching this goal would be a great
>step forward.
>
>> [Attachments]
>>
>> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat). Results
>> are shown in table 1 and table 2.  Five cases are tested: (alias
>> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly) *clang: clang -O3
>> *pLoad: clang -O3 -load LLVMPolly.so *pNoGen:pollycc -O3 -mllvm
>> -polly-optimizer=none -mllvm -polly-code-generatorr=none *pNoOpt:pollycc
>> -O3 -mllvm -polly-optimizer=none *polly: pollycc -O3
>>
>> Table 1: Compile time for PolyBench (seconds; each benchmark is run 10 times)
>>
>>                 clang    pBasic   pNoOpt   pNoGen   pOPt     pBasic%  pNoGen%  pNoOpt%  pOpt%
>> 2mm.c           0.1521   0.1593   0.1711   0.3235   0.7247   4.73%    12.49%   112.69%  376.46%
>> atax.c          0.1386   0.1349   0.1449   0.2066   0.313    0.00%    0.00%    49.06%   125.83%
>> covariance.c    0.1498   0.1517   0.1526   0.3561   0.7706   1.27%    1.87%    137.72%  414.42%
>> gemver.c        0.1562   0.1587   0.1724   0.2674   0.3936   1.60%    10.37%   71.19%   151.99%
>> instrument.c    0.1062   0.1075   0.1124   0.123    0.1216   0.00%    5.84%    15.82%   14.50%
>> ludcmp.c        0.157    0.1602   0.2002   1.0761   1.3175   2.04%    27.52%   585.41%  739.17%
>> 3mm.c           0.1529   0.1559   0.1826   0.4134   1.0436   1.96%    19.42%   170.37%  582.54%
>> bicg.c          0.1244   0.1268   0.1353   0.1977   0.2828   1.93%    8.76%    58.92%   127.33%
>> doitgen.c       0.1492   0.1505   0.1644   0.3325   0.8971   0.00%    10.19%   122.86%  501.27%
>> gesummv.c       0.1224   0.1279   0.134    0.1999   0.2937   4.49%    9.48%    63.32%   139.95%
>> jacobi.c        0.1444   0.1506   0.1592   0.3912   0.8494   0.00%    10.25%   170.91%  488.23%
>> seidel.c        0.1337   0.1353   0.1462   0.6299   0.9155   0.00%    9.35%    371.13%  584.74%
>> adi.c           0.1593   0.1621   0.1835   1.4375   1.849    1.76%    15.19%   802.39%  1060.70%
>> correlation.c   0.1579   0.1596   0.1802   0.3393   0.6337   1.08%    14.12%   114.88%  301.33%
>> gemm.c          0.1407   0.1432   0.1576   0.2421   0.4477   1.78%    12.01%   72.07%   218.20%
>> gramschmidt.c   0.1331   0.1349   0.1509   0.3069   0.4138   0.00%    13.37%   130.58%  210.89%
>> lu.c            0.1419   0.1443   0.1581   0.3156   0.3943   1.69%    11.42%   122.41%  177.87%
>> average                                              1.26%    13.22%   248.47%  393.80%
>
>To improve readability, it may be worth ensuring this fits into 80 
>columns. You may be able to reduce the number of digits used here.
>
>You could probably increase the readability of your proposal further if 
>you use markdown. See here for an example of how a markdown file looks 
>at github: https://gist.github.com/micmcg/976172 and here the raw version
>https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md
>
>You basically need to use the file ending '.md' and you can then use 
>markdown syntax to format your text. The very same syntax will also 
>improve the readability of the proposal on the mailing list.
Thank you so much for your very helpful advice. I have rewritten the proposal using markdown. This format is really interesting and powerful.
>
>All the best,
>
>Tobias
>

Best regards,
Star Tan

Dear Tobias and all LLVM/Polly developers,

Thank you very much for all your help and advice.

I have submitted my proposal to the GSoC 2013 application system:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/star/1.
Some tables and paragraphs have been simplified to make the proposal more readable on the GSoC pages. Any suggestion or comment would be appreciated.

>On 05/03/2013 11:39 AM, Star Tan wrote:
>> Dear Tobias,
>>
>>
>> Thank you very much for your very helpful advice.
>>
>>
>> Yes, -debug-pass and -time-passes are two very useful and powerful
>> options when evaluating the compile-time of each compiler pass. They
>> are exactly what I need! With these options, I can step into details
>> of the compile-time overhead of each pass. I have finished some
>> preliminary testing based on two randomly selected files from
>> PolyBench and MediaBench. Results are listed on
>> https://gist.github.com/tanstar/5508153 .
>
>Great. Thanks for the quick update and the nice formatting of your
>proposal.
>
>> In this project, I try to reduce Polly compile-time overhead by
>> revising Polly passes.  For this purpose, I firstly try to find out
>> where the compile-time overhead comes from. When Polly optimizes a
>> program, it takes the following steps:
>>
>> * step1 - Polly canonicalization: prepare basic information and
>> complete some transformations.
>> * step2 - LLVM IR to Polly description: detect valid scops and
>> translates them into polyhedral representation.
>> * step3 - Polly optimization: analyze and optimize polyhedral scops.
>> * step4 - Polly description to LLVM-IR: translates the polyhedral
>> description back into LLVM-IR.
>
>It may make sense to list in the attachments which passes exactly belong
>to these different steps.
OK, I will add further information in Table 5.
>
>> Based on these results, I plan to revise all canonicalization passes,
>> code generation passes and optimization passes in Polly. Then, I will
>> try my best to reduce the compile-time overhead for each part of them.
>>
>> In the following 12 weeks, my schedule for this project includes the following stages:
>> * Stage1 (1 week): set up a Polly performance tester and integrate the automation testing environment.
>> * Stage2 (1 week): collect and analyze the detailed profiling information for Polly compile process;
>> * Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM passes that Polly depends on;
>> * Stage4 (3 weeks): revise canonicalization passes of Polly and try to remove most of canonicalization passes;
>> * Stage5 (1 week): revise the pass ordering problem and let Polly bail out if it cannot benefit programs;
>> * Stage6 (1 week): revise and improve Polly code generation passes;
>> * Stage7 (1 week): revise and improve Polly optimization passes;
>> * Stage8 (1 week): revise other parts of Polly.
>
>I believe we should probably move Polly rather early into
>the default -O3 optimization chain. You may take some time to profile
>Polly and to understand it better, but we should not do extensive
>optimizations before moving it. As long as we can ensure the number of
>scops detected remains the same, this should be good enough. We can then
>spend the time optimizing the Polly integration in the normal -O3 chain.
I agree with you.  I will try to understand Polly better as soon as possible by profiling Polly and stepping into the details of some critical Polly passes. It may take a little time before we can  start optimizing the Polly integration in the normal -O3 chain.

>
>> ### 3. Project Details
>>
>> #### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]
>>
>> The first important work is to pick criteria to evaluate my work in
>> this project. This project aims to reduce compile-time overhead,
>> but at the same time it must not degrade performance. To simplify
>> the testing, I will use the number of scops that are optimized by Polly as
>> the performance criterion. As a result, our Polly performance tester
>> should contain two parts:
>>
>> * Code performance: the number of scops optimized by Polly should not
>> be degraded.
>> * Compile-time overhead: both the total compile-time overhead and the
>> percentage of each Polly pass are important.
>>
>> Furthermore, I currently use some scripts to collect the compile-time
>> statistics, but it still requires some tricky and manual work. My plan
>> is to set up an automated testing environment and integrate it into
>> Polly.
>
>Yes, this is nice. It would be great if we could get some hardware to
>run such tests regularly. I will check if we can find a sponsor for
>this?
Thank you. That would be great if we can find a sponsor.
Otherwise, I think I have to run all tests on my laptop.

>
>> **Week 3**: revise the "Polly - Calculate dependence" pass. Since this
>> pass leads to significant compile-time overhead in both PolyBench and
>> MediaBench, I will investigate the details of this pass and try to
>> reduce the compile-time overhead.
>> **Week 4**: revise the "Polly - Detect static control parts (SCoPs)"
>> and "Polly - Create polyhedral description of Scops" pass. As shown in
>> table 3 and table 4, these two passes are expensive. However, they are
>> also very important to Polly because all Polly optimizations are based
>> on scops detected in these two passes. I will dig into the details of
>> the two passes and make sure that no performance is lost, no matter how
>> much compile-time overhead is reduced.
>> **Week 5**: revise the other top-ten expensive Polly passes. For example,
>> the "Combine redundant instructions" pass runs several times because of
>> Polly, and it can cause a large compile-time overhead. I will try to
>> eliminate or reduce the runs of this pass.
>>
>> In this stage, I will also look into some existing bugs related to
>> the expensive passes discussed above. For example, the bug
>> llvm.org/PR15643 is related to the "Polly - Calculate dependence"
>> pass. Although it has prompted some discussion on the LLVM/Polly mailing
>> list, there is still no feasible solution. I will try my best to fix
>> such bugs as I dig into the details of those passes.
>
>I actually pointed out a solution. We need to change Polly or
>recanonicalize the SCEVs we analyse, as the current way the SCEVs are
>canonicalized makes Polly introduce different parameters for the expressions
>{p + 1, +, 1}_<l> and {p + 2, +, 1}_<l> in case the loop 'l' is not part
>of the scop.
>
>> Our evaluation is based on an Intel Pentium Dual CPU T2390 (1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat).
>> Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and
>> Cloog source code was checked out from SVN on April 24, 2013.
>
>Can you give the svn revision / git hash for the checkouts of LLVM,
>Polly, isl, cloog and polybench?
I use the following git clones: http://llvm.org/git/llvm.git, http://llvm.org/git/polly.git and http://llvm.org/git/clang.git.

I usually run the "git pull" command for all three repositories before I start a new evaluation. Of course, sometimes LLVM is not up to date, because rebuilding LLVM is really time-consuming. Their git hashes are:
Cloog: 77c44782f4cbf6d60a7dbfe996b0649336ec7205
Polly: 5f3abecad642a6646c53cbd39879091da87c3c37
LLVM: d050e96133fac8565e3bb1eabe9a587dd5a6ac4d
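
For future runs, a small sketch of recording these hashes automatically before each evaluation (the repository paths are assumptions):

    # Log the exact revision of every checkout used in a run.
    for repo in llvm llvm/tools/clang llvm/tools/polly cloog; do
      (cd "$repo" && echo "$repo: $(git rev-parse HEAD)")
    done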

>
>> Tables 1 and 2 show the compile-time overhead of Polly. Five cases are tested:
>> (alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly")
>> * **clang**:  clang -O3
>> * **pBasic**: clang -O3 -load LLVMPolly.so
>> * **pNoGen**: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
>> * **pNoOpt**: pollycc -O3 -mllvm -polly-optimizer=none
>> * **pOpt**:   pollycc -O3
>
>You probably want to add a couple of additional parameters here to
>ensure we detect all scops in polybench. I would add -mllvm
>-polly-ignore-aliasing and -DPOLYBENCH_USE_SCALAR_LB
>
>You can use '-mllvm -debug-only=polly-cloog' to see if the kernels
>are properly detected. I get the following output:
>
>alias polly-clang
>alias polly-clang='~/Projekte/polly/build/bin/clang -Xclang -load -Xclang ~/Projekte/polly/build/lib/LLVMPolly.so'
>grosser@tobilaptop:~/Projekte/polybench$ polly-clang -O3 -mllvm -polly -mllvm -debug-only=polly-cloog linear-algebra/kernels/gemm/gemm.c -I utilities/ utilities/polybench.c -mllvm -polly-ignore-aliasing -DPOLYBENCH_USE_SCALAR_LB
>:: init_array : entry.split => for.end56
>if ((nj >= 1) && (nk >= 1) && (p_1 >= 1) && (p_4 >= 1)) {
>   for (c2=0;c2<=p_4-1;c2+=32) {
>     for (c3=max(-32*floord(p_1-12*p_4+10,32)-32*p_4,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) {
>       for (c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_4-1);c4++) {
>         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
>           Stmt_for_body41(c4,-20*c4-c5);
>         }
>       }
>     }
>   }
>}
>if ((ni >= 1) && (nk >= 1) && (p_0 >= 1) && (p_4 >= 1)) {
>   for (c2=0;c2<=p_0-1;c2+=32) {
>     for (c3=max(-32*floord(-12*p_0+p_4+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_4+10,32)-640);c3<=-20*c2;c3+=32) {
>       for (c4=max(ceild(-c3-p_4-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) {
>         for (c5=max(c3,-20*c4-p_4+1);c5<=min(-20*c4,c3+31);c5++) {
>           Stmt_for_body18(c4,-20*c4-c5);
>         }
>       }
>     }
>   }
>}
>if ((ni >= 1) && (nj >= 1) && (p_0 >= 1) && (p_1 >= 1)) {
>   for (c2=0;c2<=p_0-1;c2+=32) {
>     for (c3=max(-32*floord(-12*p_0+p_1+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) {
>       for (c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) {
>         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
>           Stmt_for_body3(c4,-20*c4-c5);
>         }
>       }
>     }
>   }
>}
>Stmt_entry_split();
>:: kernel_gemm : for.cond1.preheader => for.end28
>for (c2=0;c2<=1023;c2+=32) {
>   for (c3=0;c3<=1023;c3+=32) {
>     for (c4=c2;c4<=c2+31;c4++) {
>       for (c5=c3;c5<=c3+31;c5++) {
>         Stmt_for_body3(c4,c5);
>       }
>     }
>   }
>}
>for (c2=0;c2<=1023;c2+=32) {
>   for (c3=0;c3<=1023;c3+=32) {
>     for (c4=0;c4<=1023;c4+=32) {
>       for (c5=c2;c5<=c2+31;c5++) {
>         for (c6=c3;c6<=c3+31;c6++) {
>           for (c7=c4;c7<=c4+31;c7++) {
>             Stmt_for_body8(c5,c6,c7);
>           }
>         }
>       }
>     }
>   }
>}
>:: polybench_flush_cache : for.inc => for.end
>for (c1=0;c1<=4194559;c1++) {
>   Stmt_for_inc(c1);
>}
>
>As a first step, we should ensure that Polly is fast for these flags. As
>subsequent steps we should then ensure that the absence of these flags
>does not increase compile-time.
Thanks for your advice. I will investigate more flags and enrich the content of table 1 and table 2.

>
>> Tables 3 and 4 show the compile-time overhead of the top 15 passes in
>> polly-opt. We first generate LLVM IR files with the command "clang -O3
>> -emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time
>> of each pass using the command "opt -basicaa -load LLVMPolly.so -O3
>> -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes".
>
>You want to extract the .ll file with 'clang -O0'. Otherwise you run
>polly on code that is already -O3 optimized, making the runs not
>comparable to the clang integrated ones and also unrealistic as they do
>not reflect what we do when running Polly from within clang.
>
>It would be interesting to understand the doitgen results better. The
>time in the optimizer is only 0.408 seconds, whereas the increase from
>pBasic to pOpt is 0.897 - 0.151 = 0.746 seconds. This seems surprising.
>Is this because of running Polly on -O3 optimized code, is Polly
>producing bigger .ll files which lead to longer object file emission
>time, or is there a measurement problem?
I run three groups of commands:
Group 1: "clang -O0 -emit-llvm" (0.082s) + "polly-opt -O3" (0.589s)
Group 2: "clang -O3 -emit-llvm" (0.128s) + "polly-opt -O3" (0.402s)
Group 3: "clang -O3 -mllvm -polly -emit-llvm" (0.848s)
*Note: I did not use -emit-llvm in the previous evaluation, which means it took a little extra time to translate LLVM IR to X86 assembly code, so the time in Group 3 (0.848s) is a little less than the pOpt time listed in table 1 (0.897s).
You are right. I should use "clang -O0 -emit-llvm" to generate LLVM IR for the subsequent Polly optimization.
However, I notice that there is still a gap between Group 1 (0.671s in total) and Group 3 (0.848s in total). I guess this is because Polly generates bigger and more complicated LLVM IR, which slows down the clang passes that run after Polly in the Group 3 commands.
Is there any option like -time-passes when running clang? Such an option could help to find out where the extra compile time is spent.

>
>That's it for now. Thanks for the update!
>
>Cheers,
>Tobi
>

Best wishes,
Star Tan

Star Tan wrote:

>> #### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]
>>
>> The first important work is to pick criteria to evaluate my work in
>> this project. This project aims to reduce compile-time overhead,
>> but at the same time it must not degrade performance. To simplify
>> the testing, I will use the number of scops that are optimized by Polly as
>> the performance criterion. As a result, our Polly performance tester
>> should contain two parts:
>>
>> * Code performance: the number of scops optimized by Polly should not
>> be degraded.
>> * Compile-time overhead: both the total compile-time overhead and the
>> percentage of each Polly pass are important.
>>
>> Furthermore, I currently use some scripts to collect the compile-time
>> statistics, but it still requires some tricky and manual work. My plan
>> is to set up an automated testing environment and integrate it into
>> Polly.
>
>Yes, this is nice. It would be great if we could get some hardware to
>run such tests regularly. I will check if we can find a sponsor for
>this?

I will ask if we can have a machine to run a polly perf build-bot.

Thank you. That would be great if we can find a sponsor.
Otherwise, I think I have to run all tests on my laptop.

An alternative is to set up an automatic tester in the gcc compile farm
http://gcc.gnu.org/wiki/CompileFarm

and run the perf measurements on several different machines: collect all the
data and use a noise filter to discard any hiccups due to the use of the system
by another process, etc.

http://repo.or.cz/w/gcc-perf-regression-tester.git/blob_plain/HEAD:/analyze.R
http://repo.or.cz/w/gcc-perf-regression-tester.git/blob_plain/HEAD:/analyze-core.R
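
For what it is worth, ministat itself already acts as a simple noise filter when comparing two sets of timings; a minimal sketch with placeholder file names:

    # Compare two timing samples at a 95% confidence level.
    ministat -c 95 baseline.times patched.times

It reports whether the difference between the two samples is statistically significant at the requested confidence level.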

Note that even on a dedicated system you would have the same noise problem.

Sebastian

>Star Tan wrote:
>> >> #### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]
>> >>
>> >> The first important work is to pick criteria to evaluate my work in
>> >> this project. This project aims to reduce compile-time overhead,
>> >> but at the same time it must not degrade performance. To simplify
>> >> the testing, I will use the number of scops that are optimized by Polly as
>> >> the performance criterion. As a result, our Polly performance tester
>> >> should contain two parts:
>> >>
>> >> * Code performance: the number of scops optimized by Polly should not
>> >> be degraded.
>> >> * Compile-time overhead: both the total compile-time overhead and the
>> >> percentage of each Polly pass are important.
>> >>
>> >> Furthermore, I currently use some scripts to collect the compile-time
>> >> statistics, but it still requires some tricky and manual work. My plan
>> >> is to set up an automated testing environment and integrate it into
>> >> Polly.
>> >
>> >Yes, this is nice. It would be great if we could get some hardware to
>> >run such tests regularly. I will check if we can find a sponsor for
>> >this?
>
>I will ask if we can have a machine to run a polly perf build-bot.
>
Thanks, Sebastian. I am waiting for further news.

>> Thank you. That would be great if we can find a sponsor.
>> Otherwise, I think I have to run all tests on my laptop.
>> 
>
>An alternative is to set up an automatic tester in the gcc compile farm
>http://gcc.gnu.org/wiki/CompileFarm
>
I have sent a request email to the CompileFarm committee. I hope my request can be approved!
>and run the perf measurements on several different machines: collect all the
>data and use a noise filter to discard any hiccups due to the use of the system
>by another process, etc.
>
>http://repo.or.cz/w/gcc-perf-regression-tester.git/blob_plain/HEAD:/analyze.R
>http://repo.or.cz/w/gcc-perf-regression-tester.git/blob_plain/HEAD:/analyze-core.R
>
>Note that even on a dedicated system you would have the same noise problem.
You are right. A good noise filter is important in our testing. Thanks for your helpful scripts.
>
>Sebastian
>-- 
>Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>hosted by The Linux Foundation


Best wishes!
Star Tan