[Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

Hello all,

I have done some basic experiments on Polly's canonicalization passes, and I found that SCEV canonicalization has a significant impact on both compile-time and execution-time performance.

Detailed results for SCEV and default canonicalization can be viewed on: http://188.40.87.11:8000/db_default/v4/nts/32 (or 33, 34)
*pNoGen with SCEV canonicalization (run 32): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
*pNoGen with default canonicalization (run 33): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
*pBasic without any canonicalization (run 34): -O3 -Xclang -load -Xclang LLVMPolly.so

Impact of SCEV canonicalization:
http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34
Impact of default canonicalization:
http://188.40.87.11:8000/db_default/v4/nts/33?compare_to=34&baseline=34
Comparison of SCEV canonicalization with default canonicalization:
http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33

As expected, both SCEV canonicalization and default canonicalization increase compile time (by at most 30%). They also lead to some execution-time regressions and improvements.

The only difference between SCEV canonicalization and default canonicalization is the “IndVarSimplify” pass, as shown at RegisterPasses.cpp:212:

       if (!SCEVCodegen)
         PM.add(polly::createIndVarSimplifyPass());

However, I find it interesting to look into the comparison between SCEV canonicalization and default canonicalization (http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33):

First of all, we can expect SCEV canonicalization to have better compile-time performance, since it avoids the “IndVarSimplify” pass. In fact, it gains more than 5% compile-time improvement on 32 benchmarks, most notably:

MultiSource/Applications/lemon/lemon -11.02%
SingleSource/Benchmarks/Misc/oourafft -10.53%
SingleSource/Benchmarks/Linpack/linpack-pc -10.00%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -8.31%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.18%

Second, we find that SCEV canonicalization causes both execution-time regressions and improvements compared with default canonicalization. There are many execution-time regressions, such as:
SingleSource/Benchmarks/Shootout/nestedloop +16363.64%
SingleSource/Benchmarks/Shootout-C++/nestedloop +16200.00%
SingleSource/UnitTests/Vectorizer/gcc-loops +107.35%
SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect +75.00%
SingleSource/Benchmarks/Misc/flops-6 +40.03%
SingleSource/Benchmarks/Misc/flops-5 +40.00%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan +30.00%
as well as many execution-time improvements such as:

SingleSource/Benchmarks/Shootout/ary3 -28.98%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog -26.97%
SingleSource/Benchmarks/CoyoteBench/lpbench -25.84%
MultiSource/Benchmarks/BitBench/drop3/drop3 -16.58%
MultiSource/Benchmarks/Ptrdist/yacr2/yacr2 -16.46%
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt -14.96%

I think the execution-time performance regression is mainly because of the unexpected performance improvements from non-SCEV canonicalization, as shown in the following bug: http://llvm.org/bugs/show_bug.cgi?id=17153. As a next step, I will try to find out why “IndVarSimplify” can produce better code. If we can eliminate the “IndVarSimplify” canonicalization but still produce high-quality code, then we can gain better compile-time performance without losing execution-time performance.
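For background, here is a minimal illustration of the kind of rewrite Polly's IndVarSimplify performs (my own hypothetical example, not taken from the test-suite): it replaces a non-canonical induction variable with a canonical one that starts at 0 and steps by 1, which simplifies Polly's analysis but changes the code that later passes see:

/* Before: down-counting, non-canonical induction variable. */
for (long i = n; i > 0; i--)
  A[i - 1] = 0;

/* After (conceptually): canonical induction variable 0, 1, 2, ... */
for (long iv = 0; iv < n; iv++)
  A[n - 1 - iv] = 0;

This is why eliminating the pass can change execution-time results in both directions.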

Best,
Star Tan

Hello all,

I have done some basic experiments on Polly's canonicalization passes, and I found that SCEV canonicalization has a significant impact on both compile-time and execution-time performance.

Interesting.

Detailed results for SCEV and default canonicalization can be viewed on: http://188.40.87.11:8000/db_default/v4/nts/32 (or 33, 34)
    *pNoGen with SCEV canonicalization (run 32): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
    *pNoGen with default canonicalization (run 33): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
    *pBasic without any canonicalization (run 34): -O3 -Xclang -load -Xclang LLVMPolly.so

Impact of SCEV canonicalization:
     http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34
Impact of default canonicalization:
     http://188.40.87.11:8000/db_default/v4/nts/33?compare_to=34&baseline=34
Comparison of SCEV canonicalization with default canonicalization:
     http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33

As expected, both SCEV canonicalization and default canonicalization increase compile time (by at most 30%). They also lead to some execution-time regressions and improvements.

The only difference between SCEV canonicalization and default canonicalization is the "IndVarSimplify" pass, as shown at RegisterPasses.cpp:212:
       if (!SCEVCodegen)
         PM.add(polly::createIndVarSimplifyPass());

There are actually more differences (see grep -R SCEVCodegen polly/), but the other differences will mainly be code generation differences.

However, I find it interesting to look into the comparison between SCEV canonicalization and default canonicalization (http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33):

Yes, this is definitely a good start.

First of all, we can expect SCEV canonicalization to have better compile-time performance, since it avoids the "IndVarSimplify" pass. In fact, it gains more than 5% compile-time improvement on 32 benchmarks, most notably:
         MultiSource/Applications/lemon/lemon -11.02%
         SingleSource/Benchmarks/Misc/oourafft -10.53%
         SingleSource/Benchmarks/Linpack/linpack-pc -10.00%
         MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -8.31%
         MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.18%

Second, we find that SCEV canonicalization causes both execution-time regressions and improvements compared with default canonicalization. There are many execution-time regressions, such as:
         SingleSource/Benchmarks/Shootout/nestedloop +16363.64%
         SingleSource/Benchmarks/Shootout-C++/nestedloop +16200.00%

Those two have a huge impact. Understanding what is going on here would be nice.

I think the execution-time performance regression is mainly because of the unexpected performance improvements from non-SCEV canonicalization, as shown in the following bug: http://llvm.org/bugs/show_bug.cgi?id=17153. As a next step, I will try to find out why "IndVarSimplify" can produce better code. If we can eliminate the "IndVarSimplify" canonicalization but still produce high-quality code, then we can gain better compile-time performance without losing execution-time performance.

Previous experience has shown that the indvars pass, as we run it in Polly, can unpredictably change performance both negatively and positively. It was disabled because people did not manage to eliminate all the regressions it introduced, so the positive performance changes could not really be counted in its favor.

So regarding performance tuning, I do not think we need to get this optimal. As soon as -polly-codegen-scev reaches performance similar to the original approach, we are fine.

Also, I wonder if your runs include the dependence analysis. If this is the case, the numbers are very good. Otherwise, 30% overhead still seems a little high.

Tobi

>On 09/08/2013 08:03 PM, Star Tan wrote:
>> Hello all,
>>
>>
>> I have done some basic experiments on Polly's canonicalization passes, and I found that SCEV canonicalization has a significant impact on both compile-time and execution-time performance.
>
>Interesting.
>
>> Detailed results for SCEV and default canonicalization can be viewed on: http://188.40.87.11:8000/db_default/v4/nts/32 (or 33, 34)
>>     *pNoGen with SCEV canonicalization (run 32): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>>     *pNoGen with default canonicalization (run 33): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
>>     *pBasic without any canonicalization (run 34): -O3 -Xclang -load -Xclang LLVMPolly.so
>>
>>
>> Impact of SCEV canonicalization:
>>      http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34
>> Impact of default canonicalization:
>>      http://188.40.87.11:8000/db_default/v4/nts/33?compare_to=34&baseline=34
>> Comparison of SCEV canonicalization with default canonicalization:
>>      http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33
>>
>>
>> As expected, both SCEV canonicalization and default canonicalization increase compile time (by at most 30%). They also lead to some execution-time regressions and improvements.
>>
>>
>> The only difference between SCEV canonicalization and default canonicalization is the "IndVarSimplify" pass, as shown at RegisterPasses.cpp:212:
>>        if (!SCEVCodegen)
>>          PM.add(polly::createIndVarSimplifyPass());
>
>There are actually more differences (see grep -R SCEVCodegen polly/), 
>but the other differences will mainly be code generation differences.
Thanks for the reminder. Since we are currently focusing on the canonicalization passes, the remaining code-generation differences do not matter here.

>> However, I find it interesting to look into the comparison between SCEV canonicalization and default canonicalization (http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33):
>
>Yes, this is definitely a good start.
>
>> First of all, we can expect SCEV canonicalization to have better compile-time performance, since it avoids the "IndVarSimplify" pass. In fact, it gains more than 5% compile-time improvement on 32 benchmarks, most notably:
>>          MultiSource/Applications/lemon/lemon -11.02%
>>          SingleSource/Benchmarks/Misc/oourafft -10.53%
>>          SingleSource/Benchmarks/Linpack/linpack-pc -10.00%
>>          MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -8.31%
>>          MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.18%
>>
>>
>> Second, we find that SCEV canonicalization causes both execution-time regressions and improvements compared with default canonicalization. There are many execution-time regressions, such as:
>>          SingleSource/Benchmarks/Shootout/nestedloop +16363.64%
>>          SingleSource/Benchmarks/Shootout-C++/nestedloop +16200.00%
>Those two have a huge impact. Understanding what is going on here would 
>be nice.
Yes, I am investigating these cases.
>> I think the execution-time performance regression is mainly because of the unexpected performance improvements from non-SCEV canonicalization, as shown in the following bug: http://llvm.org/bugs/show_bug.cgi?id=17153. As a next step, I will try to find out why "IndVarSimplify" can produce better code. If we can eliminate the "IndVarSimplify" canonicalization but still produce high-quality code, then we can gain better compile-time performance without losing execution-time performance.
>
>Previous experience has shown that the indvars pass, as we run it in 
>Polly, can unpredictably change performance both negatively and 
>positively. It was disabled because people did not manage to eliminate 
>all the regressions it introduced, so the positive performance changes 
>could not really be counted in its favor.
>
>So regarding performance tuning, I do not think we need to get this 
>optimal. As soon as -polly-codegen-scev reaches performance similar to
>the original approach, we are fine.
I see. I agree with you. I think we care more about compile-time performance for Polly's canonicalization passes since no Polly optimization or Polly code generation happens here.

>Also, I wonder if your runs include the dependence analysis. If this is 
>the case, the numbers are very good. Otherwise, 30% overhead seems still 
>to be a little bit much.
I think no Polly dependence analysis is involved, since our compile command is:
clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
SingleSource/Benchmarks/Misc/flops	28.57%
MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
When I look into the compile time of the flops benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
The top 5 passes when compiling with the Polly canonicalization passes are:
   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops

But the top 5 passes for clang are:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
We can see that "Combine redundant instructions" is invoked many times, and the extra invocation caused by Polly's canonicalization is the most significant one. We found this problem before, and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
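For reference, the timings above were collected by adding -ftime-report to the pNoGen command line; a hypothetical invocation (the source file name is a placeholder) looks like:

clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev -ftime-report -c flops.c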
BTW, I want to point out that although SCEV-based Polly canonicalization (with -polly-codegen-scev) runs faster than default canonicalization, it can lead to 5 extra compile errors and 3 extra runtime errors, as shown on http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
I have done some basic analysis for one of the compile errors (7zip-benchmark). Results can be viewed on http://llvm.org/bugs/show_bug.cgi?id=17159
Best,
Star Tan


Also, I wonder if your runs include the dependence analysis. If this is
the case, the numbers are very good. Otherwise, 30% overhead seems still
to be a little bit much.

I think no Polly dependence analysis is involved, since our compile command is:
clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:

I believe so too, but please verify with -debug-pass=Structure
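For example, a hypothetical invocation (the file name is a placeholder; -debug-pass is an LLVM option, so it needs the -mllvm prefix when invoked through clang):

clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev -mllvm -debug-pass=Structure -c file.c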

SingleSource/Benchmarks/Misc/flops 28.57%
MultiSource/Benchmarks/MiBench/security-sha/security-sha 22.22%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes 21.05%
When I look into the compile time of the flops benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
The top 5 passes when compiling with the Polly canonicalization passes are:
    ---User Time---   --User+System--   ---Wall Time---  --- Name ---
    0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
    0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
    0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
    0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
    0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops

But the top 5 passes for clang are:
    ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
    0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
    0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
    0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
    0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
    0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
We can see that "Combine redundant instructions" is invoked many times, and the extra invocation caused by Polly's canonicalization is the most significant one. We found this problem before, and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".

OK.

BTW, I want to point out that although SCEV-based Polly canonicalization (with -polly-codegen-scev) runs faster than default canonicalization, it can lead to 5 extra compile errors and 3 extra runtime errors, as shown on http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
I have done some basic analysis for one of the compile errors (7zip-benchmark). Results can be viewed on http://llvm.org/bugs/show_bug.cgi?id=17159

Great. I will help look into this starting this WE.

Cheers,
Tobias

>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none  -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure
I have verified it: indeed, no Polly dependence analysis is involved. The compile time of the "Polly Dependence" pass is still high for some benchmarks (such as flops), as we discussed before. 
>> SingleSource/Benchmarks/Misc/flops	28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
>> When I look into the compile time of the flops benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> The top 5 passes when compiling with the Polly canonicalization passes are:
>>     ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>>     0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
>>     0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang is:
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
>>     0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
>> We can see the "Combine redundant instructions" are invoked many times and the extra invoke resulted by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flops benchmark, I find the key is the first "InstructionCombining" pass in the series of canonicalization passes listed as follows:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flops benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());
  PM.add(polly::createCodePreparationPass());
}
If we remove the first "InstructionCombining" pass, the compile time is reduced by more than 10%. The results reported by -ftime-report become very similar to the case without Polly canonicalization:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)  Greedy Register Allocator
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)  Polly - Create polyhedral description of Scops
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)  Global Value Numbering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)  Combine redundant instructions
Similar results have been found for the whetstone benchmark. I will run a full LLVM test-suite evaluation tonight to see whether removing the pass is similarly effective for the other test-suite benchmarks.
@Tobias, do you have any idea about the performance impact and other consequences if we remove such a canonicalization pass? In my opinion, it should not be important, since we still run the "InstructionCombining" pass after the "createLoopRotatePass" pass, and in fact there are many more runs of the "InstructionCombining" pass after this point.
Best,
Star Tan

Hello all,

I have evaluated the compile-time and execution-time performance of Polly's canonicalization passes. Details can be found at http://188.40.87.11:8000/db_default/v4/nts/recent_activity. There are four runs:
pollyBasic (run 45): clang -O3 -Xclang -load -Xclang LLVMPolly.so
pollyNoGenSCEV (run 44): clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-codegen-scev
pollyNoGenSCEV_1comb (run 46): same options as pollyNoGenSCEV, but with the first “InstructionCombining” canonicalization pass removed when generating LLVMPolly.so
pollyNoGenSCEV_nocan (run 47): same options as pollyNoGenSCEV, but with all canonicalization passes removed (actually only keeping “createCodePreparationPass”) when generating LLVMPolly.so

First, let’s see the results of removing the first “InstructionCombining” pass like this:

static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
//  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flops benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  PM.add(polly::createCodePreparationPass());
}

Results are shown on http://188.40.87.11:8000/db_default/v4/nts/46?baseline=44&compare_to=44. As shown in the results, 13 benchmarks have >5% compile-time performance improvements by simply removing the first “createInstructionCombiningPass”. The top 5 benchmarks are listed as follows:

SingleSource/Regression/C++/2003-09-29-NonPODsByValue -38.46%
SingleSource/Benchmarks/Misc/flops -19.30%
SingleSource/Benchmarks/Misc/himenobmtxpa -12.94%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes -12.68%
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000 -10.68%

Unfortunately, there are also two serious execution-time performance regressions:

SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding +204.19%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog +44.58%

By looking into the simple_types_constant_folding benchmark, I find the regression is mainly caused by the unexpected impact of createPromoteMemoryToRegisterPass(). Removing “createPromoteMemoryToRegisterPass” eliminates the execution-time regression for the simple_types_constant_folding benchmark. Right now, I have no idea why “createPromoteMemoryToRegisterPass” would lead to such a large execution-time regression.

http://188.40.87.11:8000/db_default/v4/nts/46?baseline=45&compare_to=45 shows the extra compile-time overhead of the Polly canonicalization passes without the first “InstructionCombining” pass. By removing the first “InstructionCombining” pass, we see the extra compile-time overhead of Polly canonicalization is at most 13.5%, which is much smaller than the original Polly canonicalization overhead (>20%).

Second, let’s look into the total impact of the Polly canonicalization passes by removing all optional canonicalization passes as follows:

static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
//  PM.add(llvm::createPromoteMemoryToRegisterPass());
//  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flops benchmark
//  PM.add(llvm::createCFGSimplificationPass());
//  PM.add(llvm::createTailCallEliminationPass());
//  PM.add(llvm::createCFGSimplificationPass());
//  PM.add(llvm::createReassociatePass());
//  PM.add(llvm::createLoopRotatePass());
//  PM.add(llvm::createInstructionCombiningPass());
  PM.add(polly::createCodePreparationPass());
}

As shown on http://188.40.87.11:8000/db_default/v4/nts/47?baseline=45&compare_to=45, the extra compile-time overhead is very small (5.04% at most) when all optional Polly canonicalization passes are removed. However, I think it is not practical to remove all these canonicalization passes, for the sake of Polly’s optimization performance. I will further evaluate Polly’s performance (with optimization and code generation) in the case where all optional canonicalization passes are removed.

As a simple informal conclusion, I think we should revise Polly’s canonicalization passes. At least we should consider removing the first “InstructionCombining” pass!

Best,
Star Tan

Now, we come to more evaluations on http://188.40.87.11:8000/db_default/v4/nts/recent_activity

I mainly care about the compile-time and execution time impact for the following cases:
pBasic (run 45): clang -O3 -load LLVMPolly.so
pNoGenSCEV (run 44): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly -polly-optimizer=none -polly-code-generator=none
pNoGenSCEV_nocan (run 47): same options as pNoGenSCEV, but with LLVMPolly.so rebuilt with all Polly canonicalization passes removed
pNoGenSCEV_procomb (run 51): same options as pNoGenSCEV, but with LLVMPolly.so rebuilt with only the “InstructionCombining” and “PromoteMemoryToRegister” canonicalization passes removed
pOptSCEV (run 48): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly
pOptSCEV_nocan (run 50): same options as pOptSCEV, but with LLVMPolly.so rebuilt with all Polly canonicalization passes removed
pOptSCEV_procomb (run 52): same options as pOptSCEV, but with LLVMPolly.so rebuilt with only the “InstructionCombining” and “PromoteMemoryToRegister” canonicalization passes removed
pollyOpt (run 53): clang -O3 -load LLVMPolly.so -mllvm -polly

Discovery 1: Polly optimization and code generation heavily rely on the “InstructionCombining” and “PromoteMemoryToRegister” canonicalization passes.
http://188.40.87.11:8000/db_default/v4/nts/52?compare_to=45&baseline=45 shows the comparison of pOptSCEV_procomb with pBasic. As the results show, Polly optimization and code generation lead to a relatively small compile-time overhead (20% at most) compared with clang; the top four benchmarks are:

SingleSource/UnitTests/SignlessTypes/rem 20.37%
SingleSource/Benchmarks/Misc/oourafft 11.34%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl 10.22%
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset 10.21%

This means that most of the expensive Polly analysis/optimization/code-generation passes are not triggered without these two canonicalization passes. Of course, Polly then also brings only small performance gains. The top benchmarks for performance improvements are:
SingleSource/Benchmarks/Shootout/nestedloop -100.00%
SingleSource/Benchmarks/Shootout-C++/nestedloop -100.00%
MultiSource/Benchmarks/Ptrdist/anagram/anagram -14.26%
SingleSource/Benchmarks/Shootout/lists -10.77%
BTW, the bug (llvm.org/bugs/show_bug.cgi?id=17159) that showed up with general SCEV optimization does not appear any more.

Discovery 2: Removing Polly canonicalization passes significantly reduces compile time and may also reduce execution time.
http://188.40.87.11:8000/db_default/v4/nts/50?compare_to=48&baseline=48 shows the comparison between “full Polly canonicalization” and “no Polly canonicalization”. Removing the canonicalization passes can significantly reduce compile-time overhead, but it may decrease execution-time performance, since the canonicalization passes provide more opportunities for optimization. However, we find that removing the Polly canonicalization passes may also improve execution-time performance for some benchmarks, as shown below:

Performance Regressions - Execution Time
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 45.89%
SingleSource/Benchmarks/CoyoteBench/huffbench 22.24%
SingleSource/Benchmarks/Shootout/fib2 15.06%
SingleSource/Benchmarks/Stanford/FloatMM 13.98%
SingleSource/Benchmarks/Misc-C++/mandel-text 13.16%

Performance Improvements - Execution Time
SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect -37.50%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog -27.69%
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt -22.59%
SingleSource/Benchmarks/Misc/himenobmtxpa -21.98%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt -16.44%

This means that Polly’s optimization does not always improve performance; it can cause regressions at the same time. The same can be seen in the comparison between “clang -O3 with Polly” and “clang -O3 without Polly” on http://188.40.87.11:8000/db_default/v4/nts/48?compare_to=45&baseline=45: many benchmarks show execution-time regressions. So we need to further refine Polly’s optimizations; at the least, we should avoid the performance regressions.

In the next step, I will evaluate the Polly canonicalization passes without -polly-codegen-scev to understand their compile-time and execution-time impact.

Best,
Mingxing

Hi Star Tan,

thanks for this very extensive analysis. The results look very interesting. As you found out, just removing some canonicalization passes will reduce compile time, but this reduction may in large part be due to Polly not being able to optimise certain pieces of code.

Instead of removing canonicalization passes, I believe we may want to move Polly to a later place in the pass manager. Possibly at the beginning of the loop optimizer right before
PM.add(createLoopRotatePass());

We would then only need a very low number of canonicalization passes (possibly zero) and instead would put a couple of cleanup passes right
after Polly. What do you think?

Cheers,
Tobias

>On 09/17/2013 04:12 AM, Star Tan wrote:
>> Now, we come to more evaluations on http://188.40.87.11:8000/db_default/v4/nts/recent_activity
>
>Hi Star Tan,
>
>thanks for this very extensive analysis. The results look very 
>interesting. As you found out, just removing some canonicalization 
>passes will reduce compile time, but this reduction may in large part 
>be due to Polly not being able to optimise certain pieces of code.
>
>Instead of removing canonicalization passes, I believe we may want to 
>move Polly to a later place in the pass manager. Possibly at the 
>beginning of the loop optimizer right before
>PM.add(createLoopRotatePass());
>
>We would then only need a very low number of canonicalization passes 
>(possibly zero) and instead would put a couple of cleanup passes right
>after Polly. What do you think?

Sure, I agree with you. I did those previous evaluations to see the impact of each Polly canonicalization pass. The results show that the "InstructionCombining" and "PromoteMemoryToRegister" passes are critical to enabling Polly optimization. These passes are also run by LLVM itself, so I am trying to find a later point at which we can start Polly, reusing those existing LLVM passes instead of Polly's own canonicalization passes.
Thanks for your helpful suggestion. I will look into where we should start Polly. 
Best,
Star Tan

Hi Tobias,

I am trying to move Polly later.

LLVM provides some predefined ExtensionPointTy:

EP_EarlyAsPossible,
EP_ModuleOptimizerEarly,
EP_LoopOptimizerEnd,
EP_ScalarOptimizerLate,

Currently Polly uses “EP_EarlyAsPossible” to run as early as possible. As you suggested:

>Instead of removing canonicalization passes, I believe we may want to 
>move Polly to a later place in the pass manager. Possibly at the 
>beginning of the loop optimizer right before PM.add(createLoopRotatePass());
I want to move it to a point immediately after some loop optimization pass, e.g. MPM.add(createLoopRotatePass()). However, no predefined ExtensionPointTy is available for this purpose. Instead, "EP_ModuleOptimizerEarly" would move Polly before all loop optimization passes.

In my opinion, there are two solutions: one is to use "EP_ModuleOptimizerEarly" (modifying only tools/polly/lib/RegisterPasses.cpp) to move Polly before all loop optimization passes; the other is to add a new ExtensionPointTy, e.g. "EP_LoopRotateEnd", and move Polly immediately after the "LoopRotate" pass (this needs modifications to tools/polly/lib/RegisterPasses.cpp, include/llvm/Transforms/IPO/PassManagerBuilder.h and lib/Transforms/IPO/PassManagerBuilder.cpp). We can use the second way to investigate other points at which to start Polly.
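To make the second solution concrete, here is a rough sketch of the three modifications (my own, unverified; "EP_LoopRotateEnd" and "registerPollyPasses" are just the hypothetical names from above):

// include/llvm/Transforms/IPO/PassManagerBuilder.h: add the new extension point.
enum ExtensionPointTy {
  EP_EarlyAsPossible,
  EP_ModuleOptimizerEarly,
  EP_LoopRotateEnd,        // hypothetical new extension point
  EP_LoopOptimizerEnd,
  EP_ScalarOptimizerLate,
  // ...
};

// lib/Transforms/IPO/PassManagerBuilder.cpp: fire the new extension point
// right after the loop rotate pass in populateModulePassManager().
MPM.add(createLoopRotatePass());
addExtensionsToPM(EP_LoopRotateEnd, MPM);

// tools/polly/lib/RegisterPasses.cpp: register Polly at the new point
// instead of EP_EarlyAsPossible. The callback (definition omitted) would
// schedule Polly's passes into PM.
static void registerPollyPasses(const llvm::PassManagerBuilder &Builder,
                                llvm::PassManagerBase &PM);
static llvm::RegisterStandardPasses
    PollyRegister(llvm::PassManagerBuilder::EP_LoopRotateEnd,
                  registerPollyPasses);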

Is my understanding correct? Do you have any further suggestion?

Star Tan

Yes, fully correct. I would play with solution two. Why would you add Polly after the loop rotate pass?

I would possibly add it at the beginning of the loop optimizer (right before the loop rotation pass) possibly experimenting with a new extension point called EP_LoopOptimizerBegin.

Cheers,
Tobias

Hi Tobias,

>On 09/19/2013 04:46 PM, Star Tan wrote:
>> Hi Tobias,
>>
>>
>> I am trying to move Polly later.
>>
>>
>> LLVM provides some predefined ExtensionPointTy:
>>      EP_EarlyAsPossible,
>>      EP_ModuleOptimizerEarly,
>>      EP_LoopOptimizerEnd,
>>      EP_ScalarOptimizerLate,
>>      ...
>>
>>
>> Currently Polly uses "EP_EarlyAsPossible" to run as early as possible. As you suggested:
>>> Instead of removing canonicalization passes, I believe we may want to
>>> move Polly to a later place in the pass manager. Possibly at the
>>> beginning of the loop optimizer right before PM.add(createLoopRotatePass());
>> I want to move it to a point immediately after some loop optimization pass, e.g. MPM.add(createLoopRotatePass()). However, no predefined ExtensionPointTy is available for this purpose. Instead, "EP_ModuleOptimizerEarly" would move Polly before all loop optimization passes.
>>
>>
>> In my opinion, there are two solutions: one is to use "EP_ModuleOptimizerEarly" (modifying only tools/polly/lib/RegisterPasses.cpp) to move Polly before all loop optimization passes; the other is to add a new ExtensionPointTy, e.g. "EP_LoopRotateEnd", and move Polly immediately after the "LoopRotate" pass (this needs modifications to tools/polly/lib/RegisterPasses.cpp, include/llvm/Transforms/IPO/PassManagerBuilder.h and lib/Transforms/IPO/PassManagerBuilder.cpp). We can use the second way to investigate other points at which to start Polly.
>>
>> Is my understanding correct? Do you have any further suggestion?
>
>Yes, fully correct. I would play with solution two. Why would you add 
>Polly after the loop rotate pass?
Forget it. I just wanted to give a rough example of how Polly could be moved to other points.
>I would possibly add it at the beginning of the loop optimizer (right 
>before the loop rotation pass) possibly experimenting with a new 
>extension point called EP_LoopOptimizerBegin.
>
Thanks for your advice. I have tried to move Polly immediately before the loop rotation pass. The temporary patch file can be found at http://llvm.org/bugs/attachment.cgi?id=11262.
First of all, I evaluated Polly's canonicalization passes based on this patch file. Results are posted on http://188.40.87.11:8000, which contains two new runs:
	PollyNoGen_loop (run 56): move Polly later but keep all Polly canonicalization passes;
	PollyNoGen_loop_nocan (run 57): move Polly later and delete all Polly canonicalization passes;
Comparison is shown on http://188.40.87.11:8000/db_default/v4/nts/57?compare_to=56&baseline=56. It seems that even if we move Polly later, those canonicalization passes still have a significant impact on the final execution-time performance. 
The very bad news is that Polly produces wrong code if we move it immediately before the loop rotate pass: many benchmarks in the LLVM test-suite fail at execution time. I constructed a relatively simple test case for this bug, as shown on http://llvm.org/bugs/show_bug.cgi?id=17323. The key problem is that Polly produces wrong basic blocks like this:
polly.loop_header27:              ; preds = %vector.body, %polly.loop_header27
  br label %polly.loop_header27
For your information, the source code is posted at http://llvm.org/bugs/attachment.cgi?id=11263 and the LLVM IR is posted at http://llvm.org/bugs/attachment.cgi?id=11264
I have not yet found out why Polly produces such wrong code. However, maybe I should also investigate other points where we could move Polly. Do you have any suggestions?
Best,
Star Tan

Here is an update about moving Polly later.

  1. Why does Polly generate incorrect code when we move Polly immediately after the loop rotating pass?

It is mainly caused by a wrong Polly merge block. When Polly detects a valid loop for polyhedral transformations, it usually introduces a new basic block “polly.merge_new_and_old” after the original loop exit block. This new basic block contains a branch instruction that decides whether to exit or re-enter the loop, like this:

polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25
   br label %polly.merge_new_and_old
polly.merge_new_and_old: ; preds = %polly.loop_exit28, %for.body.i
   ...
   %cmp = icmp slt i32 %2, 0
   ...
   br i1 %cmp, label %for.body, label %for.end12
for.end12: ; preds = %loop.exit
   ...
   ret i32 0

Unfortunately, the new basic block “polly.merge_new_and_old” is marked as unreachable after a series of subsequent loop optimizations, run in the following pass order:
  Polly - Create LLVM-IR from SCoPs
  Loop Pass Manager
    Canonicalize natural loops
    Loop-Closed SSA Form Pass
    Rotate Loops
    Loop Invariant Code Motion
    Loop-Closed SSA Form Pass
    Unswitch loops
  Combine redundant instructions

After those loop optimization passes, “Combine redundant instructions” treats the basic block “polly.merge_new_and_old” as unreachable and turns the block “polly.loop_exit28” into an infinite loop:
polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25
   br label %polly.loop_exit28

Actually, I have no idea why this happens, but experiments show that this problem can be addressed by adding a pass “MPM.add(createCFGSimplificationPass())” after Polly. It seems it is necessary to run this pass after Polly code generation.
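In pass-manager terms, the workaround looks roughly like this (a sketch; I am assuming Polly's code generation is scheduled via polly::createCodeGenerationPass(), as declared in Polly's LinkAllPasses.h):

PM.add(polly::createCodeGenerationPass());   // Polly - Create LLVM-IR from SCoPs
PM.add(llvm::createCFGSimplificationPass()); // clean up the merge blocks Polly
                                             // introduces before the later loop
                                             // passes and InstCombine run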

  2. Where should we move Polly to?

There are many choices to move Polly later. Tobias suggests to move it immediately after the loop rotate pass (the first loop optimization pass), but unfortunately Polly would generate incorrect code in that case (http://llvm.org/bugs/show_bug.cgi?id=17323).

By investigating those Polly canonicalization passes (polly/lib/RegisterPasses.cpp):
PM.add(llvm::createPromoteMemoryToRegisterPass());
PM.add(llvm::createInstructionCombiningPass());
PM.add(llvm::createCFGSimplificationPass());
PM.add(llvm::createTailCallEliminationPass());
PM.add(llvm::createCFGSimplificationPass());
PM.add(llvm::createReassociatePass());
PM.add(llvm::createLoopRotatePass());
PM.add(llvm::createInstructionCombiningPass());

We can see that some of them are also called in common cases (lib/Transforms/IPO/PassManagerBuilder.cpp:181):

MPM.add(createEarlyCSEPass());
MPM.add(createJumpThreadingPass());
MPM.add(createCorrelatedValuePropagationPass());
MPM.add(createCFGSimplificationPass()); // also called in Polly
MPM.add(createInstructionCombiningPass()); // also called in Polly
MPM.add(createTailCallEliminationPass()); // also called in Polly
MPM.add(createCFGSimplificationPass()); // also called in Polly
MPM.add(createReassociatePass()); // also called in Polly
MPM.add(createLoopRotatePass()); // also called in Polly
MPM.add(createLICMPass());
MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));

The initial idea was to move Polly immediately after the LoopRotate pass, which maximizes the reuse of canonicalization passes; we would then only need to keep one canonicalization pass, “createPromoteMemoryToRegisterPass”, in Polly. Unfortunately, it leads to errors, as shown in the previous analysis (see bug http://llvm.org/bugs/show_bug.cgi?id=17323). We need to run “createCFGSimplificationPass” to ensure correct code.

The second possible solution is to move Polly immediately before “createCFGSimplificationPass”, and then remove all Polly canonicalization passes except “createTailCallEliminationPass” and “createCFGSimplificationPass”. Unfortunately, this still triggers an assertion failure when compiling some benchmarks in the LLVM test-suite (see http://llvm.org/bugs/show_bug.cgi?id=17351).

Similarly, I also investigated other points at which to place Polly, such as before “MPM.add(createTailCallEliminationPass())” or before “MPM.add(createCorrelatedValuePropagationPass())”, but all these cases trigger unexpected compile or execution errors. I think we need more effort to investigate why Polly cannot be placed at these points.

Lastly, I also investigated the basic solution of moving Polly immediately before the “CallGraph SCC passes”, which is a very early stage in the module-level optimizations. However, some basic passes (DeadArgElimination/InstructionCombining/CFGSimplification) are still executed before that point, which may reduce the Polly canonicalization passes needed. Preliminary testing shows no extra errors in the LLVM test-suite, and a detailed evaluation is running. I hope we can get detailed results after several hours of running.

Best,
Star Tan

Here is an update about moving Polly later.

Hi Star Tan,

thanks for your report.

1. Why does Polly generate incorrect code when we move Polly immediately after the loop rotating pass?

It is mainly caused by a wrong polly merge block. When Polly detects a valid loop for Polyhedral transformations, it usually introduces a new basic block "polly.merge_new_and_old" after the original loop exit block. This new basic block contains a branch instruction that decides whether to exit or re-enter the loop like this:

polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25
   br label %polly.merge_new_and_old
polly.merge_new_and_old: ; preds = %polly.loop_exit28, %for.body.i
   ...
   %cmp = icmp slt i32 %2, 0
   ...
   br i1 %cmp, label %for.body, label %for.end12
for.end12: ; preds = %loop.exit
   ...
   ret i32 0

Unfortunately, the new basic block "polly.merge_new_and_old" is marked as unreachable after a series of subsequent loop optimizations in the following pass order:
           Polly - Create LLVM-IR from SCoPs
         Loop Pass Manager
           Canonicalize natural loops
           Loop-Closed SSA Form Pass
           Rotate Loops
           Loop Invariant Code Motion
           Loop-Closed SSA Form Pass
           Unswitch loops
         Combine redundant instructions

After those loop optimization passes, "Combine redundant instructions" treats the basic block "polly.merge_new_and_old" as unreachable and turns the block "polly.loop_exit28" into an infinite loop:
polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25
   br label %polly.loop_exit28

Actually, I have no idea why this happens, but experiments show that this problem can be addressed by adding a pass "MPM.add(createCFGSimplificationPass())" after Polly. It seems it is necessary to run this pass after Polly code generation.

Any pass order should yield correct code. If this is not the case there is either a bug in Polly or we expose a bug in LLVM. We should track this down.

Our pass ordering decisions should not be driven by us trying to avoid triggering this bug.

As you already created a bug report, we should work on fixing it.
The next step would be to reduce the set of passes involved. You can do this by calling -debug-pass=Arguments

2. Where should we move Polly to?

There are many choices to move Polly later. Tobias suggests to move it immediately after the loop rotate pass (the first loop optimization pass), but unfortunately Polly would generate incorrect code in that case (http://llvm.org/bugs/show_bug.cgi?id=17323).

By investigating those Polly canonicalization passes (polly/lib/RegisterPasses.cpp):
     PM.add(llvm::createPromoteMemoryToRegisterPass());
     PM.add(llvm::createInstructionCombiningPass());
     PM.add(llvm::createCFGSimplificationPass());
     PM.add(llvm::createTailCallEliminationPass());
     PM.add(llvm::createCFGSimplificationPass());
     PM.add(llvm::createReassociatePass());
     PM.add(llvm::createLoopRotatePass());
     PM.add(llvm::createInstructionCombiningPass());

We can see that some of them are also called in common cases (lib/Transforms/IPO/PassManagerBuilder.cpp:181):

   MPM.add(createEarlyCSEPass());
   MPM.add(createJumpThreadingPass());
   MPM.add(createCorrelatedValuePropagationPass());
   MPM.add(createCFGSimplificationPass()); // also called in Polly
   MPM.add(createInstructionCombiningPass()); // also called in Polly
   MPM.add(createTailCallEliminationPass()); // also called in Polly
   MPM.add(createCFGSimplificationPass()); // also called in Polly
   MPM.add(createReassociatePass()); // also called in Polly
   MPM.add(createLoopRotatePass()); // also called in Polly
   MPM.add(createLICMPass());
   MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));

The initial idea was to move Polly immediately after the LoopRotate pass, which maximizes the reuse of canonicalization passes; we would then only need to keep one canonicalization pass, "createPromoteMemoryToRegisterPass", in Polly. Unfortunately, it leads to errors, as shown in the previous analysis (see bug http://llvm.org/bugs/show_bug.cgi?id=17323). We need to run "createCFGSimplificationPass" to ensure correct code.

In fact, I propose to move it _before_ the LoopRotate Pass. I think that is also what you tried, no?

The second possible solution is to move Polly immediately before "createCFGSimplificationPass", and then remove all Polly canonicalization passes except the "createTailCallEliminationPass" and "createCFGSimplificationPass"! Unfortunately, it would still trigger an assert error when compiling some benchmarks in LLVM test-suite (see http://llvm.org/bugs/show_bug.cgi?id=17351).

I think this is too early, as most of the canonicalization is not yet done. We probably don't need to investigate this bug immediately, but
it would be nice if we could make it reproducible without your changes to polly. For this please run the command with -debug-pass=Arguments
and replace the -O3 with the actual set of commands that is needed to trigger the bug. If you can reduce the set of commands using bugpoint, that would be even better.

Similarly I also investigated other points to move Polly, such as before "MPM.add(createTailCallEliminationPass())" or before "MPM.add(createCorrelatedValuePropagationPass())", but all these cases would trigger some unexpected compile error or execution error. I think we need more efforts to investigate why Polly cannot be placed in these points.

This is unfortunate. I wonder if they all have the same root cause.

In any case, I would focus on moving Polly _before_ the loop rotate pass and would start with fixing the blocking bug.

Tobias

Here is an update about moving Polly later.

Hi Star Tan,

thanks for your report.

  1. Why does Polly generate incorrect code when we move Polly immediately after the loop rotating pass?

It is mainly caused by a wrong Polly merge block. When Polly detects a valid loop for polyhedral transformations, it usually introduces a new basic block “polly.merge_new_and_old” after the original loop exit block. This new basic block contains a branch instruction that decides whether to exit or re-enter the loop, like this:

polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25
   br label %polly.merge_new_and_old
polly.merge_new_and_old: ; preds = %polly.loop_exit28, %for.body.i
   ...
   %cmp = icmp slt i32 %2, 0
   ...
   br i1 %cmp, label %for.body, label %for.end12
for.end12: ; preds = %loop.exit
   ...
   ret i32 0

Unfortunately, the new basic block “polly.merge_new_and_old” is marked as unreachable after a series of subsequent loop optimizations, run in the following pass order:
  Polly - Create LLVM-IR from SCoPs
  Loop Pass Manager
    Canonicalize natural loops
    Loop-Closed SSA Form Pass
    Rotate Loops
    Loop Invariant Code Motion
    Loop-Closed SSA Form Pass
    Unswitch loops
  Combine redundant instructions

After those loop optimization passes, “Combine redundant instructions” treats the basic block “polly.merge_new_and_old” as unreachable and turns the block “polly.loop_exit28” into an infinite loop:
polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25
   br label %polly.loop_exit28

Actually, I have no idea why this happens, but experiments show that this problem can be addressed by adding a pass “MPM.add(createCFGSimplificationPass())” after Polly. It seems it is necessary to run this pass after Polly code generation.

Any pass order should yield correct code. If this is not the case there
is either a bug in Polly or we expose a bug in LLVM. We should track
this down.

Our pass ordering decisions should not be driven by us trying to avoid
triggering this bug.

As you already created a bug report, we should work on fixing it.
The next step would be to reduce the set of passes involved. You can do
this by calling -debug-pass=Arguments

Thanks for your suggestion. I intended to get a quick idea of the compile-time and execution-time impact of moving Polly later. Of course, I will also work on fixing these bugs.

  2. Where should we move Polly to?

There are many choices to move Polly later. Tobias suggests to move it immediately after the loop rotate pass (the first loop optimization pass), but unfortunately Polly would generate incorrect code in that case (http://llvm.org/bugs/show_bug.cgi?id=17323).

By investigating those Polly canonicalization passes (polly/lib/RegisterPasses.cpp):
PM.add(llvm::createPromoteMemoryToRegisterPass());
PM.add(llvm::createInstructionCombiningPass());
PM.add(llvm::createCFGSimplificationPass());
PM.add(llvm::createTailCallEliminationPass());
PM.add(llvm::createCFGSimplificationPass());
PM.add(llvm::createReassociatePass());
PM.add(llvm::createLoopRotatePass());
PM.add(llvm::createInstructionCombiningPass());

We can see that some of them are also called in common cases (lib/Transforms/IPO/PassManagerBuilder.cpp:181):

MPM.add(createEarlyCSEPass());
MPM.add(createJumpThreadingPass());
MPM.add(createCorrelatedValuePropagationPass());
MPM.add(createCFGSimplificationPass()); // also called in Polly
MPM.add(createInstructionCombiningPass()); // also called in Polly
MPM.add(createTailCallEliminationPass()); // also called in Polly
MPM.add(createCFGSimplificationPass()); // also called in Polly
MPM.add(createReassociatePass()); // also called in Polly
MPM.add(createLoopRotatePass()); // also called in Polly
MPM.add(createLICMPass());
MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));

The initial idea was to move Polly immediately after the LoopRotate pass, which maximizes the reuse of canonicalization passes; we would then only need to keep one canonicalization pass, “createPromoteMemoryToRegisterPass”, in Polly. Unfortunately, it leads to errors, as shown in the previous analysis (see bug http://llvm.org/bugs/show_bug.cgi?id=17323). We need to run “createCFGSimplificationPass” to ensure correct code.

In fact, I propose to move it before the LoopRotate Pass. I think that
is also what you tried, no?

Yes, I actually meant before the LoopRotate pass. However, I wonder why you care about the LoopRotate pass so much; is this pass very important to loop optimization? It is also one of Polly’s canonicalization passes, so in that case the LoopRotate pass would be run twice: once as a Polly canonicalization pass and once as a loop canonicalization pass after Polly. This is just a discussion point; I am still working on moving Polly before the LoopRotate pass.

The second possible solution is to move Polly immediately before “createCFGSimplificationPass”, and then remove all Polly canonicalization passes except the “createTailCallEliminationPass” and “createCFGSimplificationPass”! Unfortunately, it would still trigger an assert error when compiling some benchmarks in LLVM test-suite (see http://llvm.org/bugs/show_bug.cgi?id=17351).

I think this is too early, as most of the canonicalization is not yet
done. We probably don’t need to investigate this bug immediately, but
it would be nice if we could make it reproducible without your changes
to polly. For this please run the command with -debug-pass=Arguments
and replace the -O3 with the actual set of commands that is needed to
trigger the bug. If you can reduce the set of commands using bugpoint,
that would be even better.

I see. I will try to reproduce the bug in this way.

Similarly I also investigated other points to move Polly, such as before “MPM.add(createTailCallEliminationPass())” or before “MPM.add(createCorrelatedValuePropagationPass())”, but all these cases would trigger some unexpected compile error or execution error. I think we need more efforts to investigate why Polly cannot be placed in these points.

This is unfortunate. I wonder if they all have the same root cause.

In any case, I would focus on moving Polly before the loop rotate pass
and would start with fixing the blocking bug.

Tobias

Thanks for your suggestion.

Star Tan

Hi Star Tan,

Thanks for the very interesting perf analyses.

Star Tan wrote:

We can see that "Combine redundant instructions" is invoked many times, and the
extra invocation caused by Polly's canonicalization is the most significant
one. We have found this problem before and I need to look into the details of
the canonicalization passes related to "Combine redundant instructions".

It could be that the scev codegen produces the same subexpression again and
again due to the fact that we are asking the same question again and again for
each array index: basically, in the original code we have a set of array access
functions A1(i), A2(i), ..., An(i), that get transformed by polly using a linear
transform function t: A1(t(i)), A2(t(i)), ..., An(t(i)), so you see that t(i)
appears again and again, and we probably do generate redundantly the same code
for it.
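To illustrate (a hypothetical example, not taken from the benchmarks): if the transform function is t(i) = 4*i + 7, the generated code may compute it once per array access before instcombine/CSE cleans it up, roughly:

for (long i = 0; i < n; i++) {
  long t1 = 4 * i + 7; /* t(i) materialized for the A1 access */
  A1[t1] = 0;
  long t2 = 4 * i + 7; /* the same t(i) recomputed for the A2 access */
  A2[t2] = 1;
  /* ... one redundant copy of t(i) for each of A3 ... An */
}

Each redundant copy is extra work for "Combine redundant instructions", which would explain the extra time spent in that pass.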

BTW, I want to point out that although SCEV based Polly canonicalization (with
-polly-codegen-scev) runs faster than default canonicalization, it can lead to
5 extra compile errors and 3 extra runtime errors

That's one of the reasons why we have not turned SCEV codegen on by default yet.
I will address all these issues and then we'll flip the default value of the
-polly-codegen-scev flag.

I have done some basic analysis for one of the compile errors
(7zip-benchmark). Results can be viewed on
http://llvm.org/bugs/show_bug.cgi?id=17159

Thanks for filing that bug report: I just assigned it to myself.

Sebastian

Tobias Grosser wrote:

Star Tan wrote:

I have not yet found out why Polly produces such wrong code.

IMO this is because Polly does not recognize a number of patterns that
occur after the scalar opts have transformed the code. There are a number
of assumptions that we took for granted, and these do not hold further down the pipeline.

However, maybe I should also investigate other points where we can move
Polly. Do you have any suggestion?

Yep, try to move Polly just after function inlining, that's better than the
current place, and will extend the loop coverage in many cases.

Sebastian

At 2013-09-27 04:05:07, “Sebastian Pop” <spop@codeaurora.org> wrote:

Hi Star Tan,

Thanks for the very interesting perf analyses.

Star Tan wrote:

We can see that “Combine redundant instructions” is invoked many times, and the
extra invocation caused by Polly’s canonicalization is the most significant
one. We have found this problem before and I need to look into the details of
the canonicalization passes related to “Combine redundant instructions”.

It could be that the scev codegen produces the same subexpression again and
again due to the fact that we are asking the same question again and again for
each array index: basically, in the original code we have a set of array access
functions A1(i), A2(i), …, An(i), that get transformed by polly using a linear
transform function t: A1(t(i)), A2(t(i)), …, An(t(i)), so you see that t(i)
appears again and again, and we probably do generate redundantly the same code
for it.

BTW, I want to point out that although SCEV based Polly canonicalization (with
-polly-codegen-scev) runs faster than default canonicalization, it can lead to
5 extra compile errors and 3 extra runtime errors

That’s one of the reasons why we have not turned SCEV codegen on by default yet.
I will address all these issues and then we’ll flip the default value of the
-polly-codegen-scev flag.

Great! I will try to investigate other errors and put them into LLVM bugzilla or try to fix them.
I also look forward to fixing these errors and flipping the default option value as soon as possible.

I have done some basic analysis for one of the compile errors
(7zip-benchmark). Results can be viewed on
http://llvm.org/bugs/show_bug.cgi?id=17159

Thanks for filing that bug report: I just assigned it to myself.

Sebastian

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Thanks,
Mingxing

Do you mean we should move it after “MPM.add(Inliner);”?
The inliner pass is an early pass, so moving Polly after the inliner may not reduce Polly’s canonicalization passes. On the contrary, it may make Polly more complex. However, as you said, moving Polly after the inliner pass will extend the loop coverage and thus improve Polly’s performance! I’ll try it.
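For what it is worth, here is a minimal hypothetical example of the coverage argument: the call in the loop body keeps Polly from modeling the loop as a SCoP, and only after inlining does the loop become analyzable:

static int get(const int *A, int i) { return A[i]; }

void copy(const int *A, int *B, int n) {
  for (int i = 0; i < n; i++)
    B[i] = get(A, i); /* after inlining this becomes B[i] = A[i],
                         a plain affine access that Polly can handle */
}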

Thanks for your suggestion,
Star Tan