[Polly][GSOC2013] FastPolly -- SCOP Detection Pass

Hi all,

I have investigated the compile-time overhead of “Polly Scop Detection” pass based on LNT testing results.
This mail is to share some results I have found.

(1) Analysis of “SCOP Detection Pass” for PolyBench (Attached file PolyBench_SCoPs.log)
Experimental results show that the “SCOP Detection pass” does not lead to significant extra compile-time overhead for compiling PolyBench. The percent of compile-time overhead caused by “SCOP Detection Pass” is usually less than 4% of total compile-time. Details for each benchmark can be seen in attached file SCoPs.tgz. I think this is because a lot of other Polly passes, such as “Cloog code generation” and “Induction Variable Simplification” are much more expensive than the “SCOP Detection Pass”.

(2) Analysis of “SCOP Detection Pass for two hot benchmarks (tramp3d and oggenc)
“SCOP Detection Pass” would lead to significant compile-time overhead for these two benchmarks: tramp3d and oggenc, both of which are included in LLVM test-suite/MultiSource.

The top 5 passes in compiling tramp3d are: (Attached file tramp3d-SCoPs.log)

—User Time— --System Time-- --User+System-- —Wall Time— — Name —
6.0720 ( 21.7%) 0.0400 ( 2.1%) 6.1120 ( 20.5%) 6.1986 ( 20.6%) Polly - Detect static control parts (SCoPs)
4.0600 ( 14.5%) 0.2000 ( 10.7%) 4.2600 ( 14.3%) 4.3655 ( 14.5%) X86 DAG->DAG Instruction Selection
1.9880 ( 7.1%) 0.2080 ( 11.1%) 2.1960 ( 7.4%) 2.2277 ( 7.4%) Function Integration/Inlining
1.7520 ( 6.3%) 0.0840 ( 4.5%) 1.8360 ( 6.2%) 1.8765 ( 6.2%) Global Value Numbering
1.2440 ( 4.4%) 0.1040 ( 5.5%) 1.3480 ( 4.5%) 1.2925 ( 4.3%) Combine redundant instructions

and the top 5 passes in compiling oggenc are: (Attached file oggenc-SCoPs.log)

—User Time— --System Time-- --User+System-- —Wall Time— — Name —
0.7760 ( 14.6%) 0.0280 ( 11.1%) 0.8040 ( 14.4%) 0.8207 ( 14.5%) X86 DAG->DAG Instruction Selection
0.7080 ( 13.3%) 0.0040 ( 1.6%) 0.7120 ( 12.8%) 0.7317 ( 13.0%) Polly - Detect static control parts (SCoPs)
0.4200 ( 7.9%) 0.0000 ( 0.0%) 0.4200 ( 7.5%) 0.4135 ( 7.3%) Polly - Calculate dependences
0.3120 ( 5.9%) 0.0200 ( 7.9%) 0.3320 ( 6.0%) 0.2947 ( 5.2%) Loop Strength Reduction
0.1720 ( 3.2%) 0.0080 ( 3.2%) 0.1800 ( 3.2%) 0.1992 ( 3.5%) Global Value Numbering

Results show that Polly spends a lot of time on detecting scops, but most of region scops are proved to be invalid at last. As a result, this pass waste a lot of compile-time.I think we should improve this pass by detect invalid scop early.

(3) About detecting scop regions in bottom-up order.
Detecting scop regions in bottom-up order can significantly speed up the scop detection pass. However, as I have discussed with Sebastian, detecting scops in bottom-up order and up-bottom order will lead to different results. As a result, we should not change the detection order.

Do you have any other suggestions that may speed up the scop detection pass?

Best,
Star Tan

PolyBench_SCoPs.log (18.9 KB)

SCoPs.tgz (35.7 KB)

tramp3d-SCoPs.log (23.5 KB)

oggenc-SCoPs.log (23.5 KB)

Hi all,

I have investigated the compile-time overhead of "Polly Scop Detection" pass based on LNT testing results.
This mail is to share some results I have found.

(1) Analysis of "SCOP Detection Pass" for PolyBench (Attached file PolyBench_SCoPs.log)
Experimental results show that the "SCOP Detection pass" does not lead to significant extra compile-time overhead for compiling PolyBench. The percent of compile-time overhead caused by "SCOP Detection Pass" is usually less than 4% of total compile-time. Details for each benchmark can be seen in attached file SCoPs.tgz. I think this is because a lot of other Polly passes, such as "Cloog code generation" and "Induction Variable Simplification" are much more expensive than the "SCOP Detection Pass".

Good.

(2) Analysis of "SCOP Detection Pass for two hot benchmarks (tramp3d and oggenc)
¡°SCOP Detection Pass" would lead to significant compile-time overhead for these two benchmarks: tramp3d and oggenc, both of which are included in LLVM test-suite/MultiSource.

The top 5 passes in compiling tramp3d are: (Attached file tramp3d-SCoPs.log)
    ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
    6.0720 ( 21.7%) 0.0400 ( 2.1%) 6.1120 ( 20.5%) 6.1986 ( 20.6%) Polly - Detect static control parts (SCoPs)
    4.0600 ( 14.5%) 0.2000 ( 10.7%) 4.2600 ( 14.3%) 4.3655 ( 14.5%) X86 DAG->DAG Instruction Selection
    1.9880 ( 7.1%) 0.2080 ( 11.1%) 2.1960 ( 7.4%) 2.2277 ( 7.4%) Function Integration/Inlining
    1.7520 ( 6.3%) 0.0840 ( 4.5%) 1.8360 ( 6.2%) 1.8765 ( 6.2%) Global Value Numbering
    1.2440 ( 4.4%) 0.1040 ( 5.5%) 1.3480 ( 4.5%) 1.2925 ( 4.3%) Combine redundant instructions

and the top 5 passes in compiling oggenc are: (Attached file oggenc-SCoPs.log)
    ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
    0.7760 ( 14.6%) 0.0280 ( 11.1%) 0.8040 ( 14.4%) 0.8207 ( 14.5%) X86 DAG->DAG Instruction Selection
    0.7080 ( 13.3%) 0.0040 ( 1.6%) 0.7120 ( 12.8%) 0.7317 ( 13.0%) Polly - Detect static control parts (SCoPs)
    0.4200 ( 7.9%) 0.0000 ( 0.0%) 0.4200 ( 7.5%) 0.4135 ( 7.3%) Polly - Calculate dependences
    0.3120 ( 5.9%) 0.0200 ( 7.9%) 0.3320 ( 6.0%) 0.2947 ( 5.2%) Loop Strength Reduction
    0.1720 ( 3.2%) 0.0080 ( 3.2%) 0.1800 ( 3.2%) 0.1992 ( 3.5%) Global Value Numbering

Results show that Polly spends a lot of time on detecting scops, but most of region scops are proved to be invalid at last. As a result, this pass waste a lot of compile-time.I think we should improve this pass by detect invalid scop early.

Great. Now we have two test cases we can work with. Can you
upload the LLVM-IR produced by clang -O0 (without Polly)?

The next step is to understand what is going on. Some ideas on how to
understand what is going on:

1) Reduce the amount of input code

At best, we can reduce this to the single function on which the Polly scop detection takes more than 20.6% of the overall time. To get there,
I propose to run the timings with 'opt -O3' (instead of clang). As a first step you disable inlining to make sure that your results are still reproducible. If this is the case, I would (semi-automatically?) try to reduce the test case by removing functions from it for which the removal does not reduce the Polly overhead.

2) Check why the Polly scop detection is failing

You can use 'opt -polly-detect -analyze' to see the most common reasons the scop detection failed. We should verify that we perform the most common and cheap tests early.

(3) About detecting scop regions in bottom-up order.
Detecting scop regions in bottom-up order can significantly speed up the scop detection pass. However, as I have discussed with Sebastian, detecting scops in bottom-up order and up-bottom order will lead to different results. As a result, we should not change the detection order.

Sebastian had a patch for this. Does his patch improve the scop detection time.

I agree with you that we can not just switch the order in which we detect scops, but I still believe that a bottom up detection is the right way to go. However, to abort early we need to classify the scop
detection failures into failures that will equally hold for larger scops
and ones that may disappear in larger scops. Only if a failure of the first class was detected, we can abort early without reducing scop coverage.

Do you have any other suggestions that may speed up the scop detection pass?

I believe bottom-up detection may be a good thing, but before drawing conclusions we need to understand where time is actually spent. The suggestions above should help us there.

Cheers
Tobi

P.S.: Please do not copy llvm-commits. as it is only for patches and commit messages.

>On 06/29/2013 05:04 PM, Star Tan wrote:
>> Hi all,
>>
>>
>>
>> I have investigated the compile-time overhead of "Polly Scop Detection" pass based on LNT testing results.
>> This mail is to share some results I have found.
>>
>>
>> (1) Analysis of "SCOP Detection Pass" for PolyBench (Attached file PolyBench_SCoPs.log)
>> Experimental results show that the "SCOP Detection pass" does not lead to significant extra compile-time overhead for compiling PolyBench. The percent of compile-time overhead caused by "SCOP Detection Pass" is usually less than 4% of total compile-time. Details for each benchmark can be seen in attached file SCoPs.tgz. I think this is because a lot of other Polly passes, such as "Cloog code generation" and "Induction Variable Simplification" are much more expensive than the "SCOP Detection Pass".
>
>Good.
>
>> (2) Analysis of "SCOP Detection Pass for two hot benchmarks (tramp3d and oggenc)
>> “SCOP Detection Pass" would lead to significant compile-time overhead for these two benchmarks: tramp3d and oggenc, both of which are included in LLVM test-suite/MultiSource.
>>
>>
>> The top 5 passes in compiling tramp3d are: (Attached file tramp3d-SCoPs.log)
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     6.0720 ( 21.7%)   0.0400 (  2.1%)   6.1120 ( 20.5%)   6.1986 ( 20.6%)  Polly - Detect static control parts (SCoPs)
>>     4.0600 ( 14.5%)   0.2000 ( 10.7%)   4.2600 ( 14.3%)   4.3655 ( 14.5%)  X86 DAG->DAG Instruction Selection
>>     1.9880 (  7.1%)   0.2080 ( 11.1%)   2.1960 (  7.4%)   2.2277 (  7.4%)  Function Integration/Inlining
>>     1.7520 (  6.3%)   0.0840 (  4.5%)   1.8360 (  6.2%)   1.8765 (  6.2%)  Global Value Numbering
>>     1.2440 (  4.4%)   0.1040 (  5.5%)   1.3480 (  4.5%)   1.2925 (  4.3%)  Combine redundant instructions
>>
>>
>> and the top 5 passes in compiling oggenc are: (Attached file oggenc-SCoPs.log)
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     0.7760 ( 14.6%)   0.0280 ( 11.1%)   0.8040 ( 14.4%)   0.8207 ( 14.5%)  X86 DAG->DAG Instruction Selection
>>     0.7080 ( 13.3%)   0.0040 (  1.6%)   0.7120 ( 12.8%)   0.7317 ( 13.0%)  Polly - Detect static control parts (SCoPs)
>>     0.4200 (  7.9%)   0.0000 (  0.0%)   0.4200 (  7.5%)   0.4135 (  7.3%)  Polly - Calculate dependences
>>     0.3120 (  5.9%)   0.0200 (  7.9%)   0.3320 (  6.0%)   0.2947 (  5.2%)  Loop Strength Reduction
>>     0.1720 (  3.2%)   0.0080 (  3.2%)   0.1800 (  3.2%)   0.1992 (  3.5%)  Global Value Numbering
>>
>>
>> Results show that Polly spends a lot of time on detecting scops, but most of region scops are proved to be invalid at last. As a result, this pass waste a lot of compile-time.I think we should improve this pass by detect invalid scop early.
>
>Great. Now we have two test cases we can work with. Can you
>upload the LLVM-IR produced by clang -O0 (without Polly)?
I have attached the LLVM-IR file for tramp3d and oggenc.
>
>The next step is to understand what is going on. Some ideas on how to
>understand what is going on:
>
>1) Reduce the amount of input code
>
>At best, we can reduce this to the single function on which the Polly 
>scop detection takes more than 20.6% of the overall time. To get there,
>I propose to run the timings with 'opt -O3' (instead of clang). As a 
>first step you disable inlining to make sure that your results are still 
>reproducible. If this is the case, I would (semi-automatically?) try to 
>reduce the test case by removing functions from it for which the removal 
>does not reduce the Polly overhead.
Yes, the original LLVM IR code is too complex.
I would try to reduce the code size by investigating some hot functions in these two testcases.
>
>2) Check why the Polly scop detection is failing
>
>You can use 'opt -polly-detect -analyze' to see the most common reasons 
>the scop detection failed. We should verify that we perform the most 
>common and cheap tests early.
I would dig into the details of these two testcases.
>
>> (3) About detecting scop regions in bottom-up order.
>> Detecting scop regions in bottom-up order can significantly speed up the scop detection pass. However, as I have discussed with Sebastian, detecting scops in bottom-up order and up-bottom order will lead to different results. As a result, we should not change the detection order.
>
>Sebastian had a patch for this. Does his patch improve the scop 
>detection time.
>
>I agree with you that we can not just switch the order in which we 
>detect scops, but I still believe that a bottom up detection is the 
>right way to go. However, to abort early we need to classify the scop
>detection failures into failures that will equally hold for larger scops
>and ones that may disappear in larger scops. Only if a failure of the 
>first class was detected, we can abort early without reducing scop coverage.
>
>> Do you have any other suggestions that may speed up the scop detection pass?
>
>I believe bottom-up detection may be a good thing, but before drawing 
>conclusions we need to understand where time is actually spent. The 
>suggestions above should help us there.
>
>Cheers
>Tobi
>
>P.S.: Please do not copy llvm-commits. as it is only for patches and 
>commit messages.

Thanks for your suggestions again.
Best wishes,
Star Tan

从网易yeah邮箱发来的云附件

oggenc_tramp3d_llvm_ir.tgz (2.11M, 2013年7月16日 10:58 到期)
下载

That did not seem to have worked. In general it seems better not not attach larger files to mails that go on a public mailing list. Just uploading them somewhere and linking to them should work. As soon as they are reduced a little, you could probably also open a bug report and attach them there.

Cheers,
Tobi

Great. Now we have two test cases we can work with. Can you

>upload the LLVM-IR produced by clang -O0 (without Polly)?
Since tramp3d-v4.ll is to large (19M with 267 thousand lines), I would focus on the oggenc benchmark at firat.
I attached the oggenc.ll (LLVM-IR produced by clang -O0 without Polly), which compressed into the file oggenc.tgz.
>2) Check why the Polly scop detection is failing
>
>You can use 'opt -polly-detect -analyze' to see the most common reasons 
>the scop detection failed. We should verify that we perform the most 
>common and cheap tests early.
>
I also attached the output file oggenc_polly_detect_analyze.log produced by "polly-opt -O3 -polly-detect -analyze oggenc.ll". Unfortunately, it only dumps valid scop regions. At first, I thought to dump all debugging information by "-debug" option, but it will dump too many unrelated information produced by other passes. Do you know any option that allows me to dump debugging information for the "-polly-detect" pass, but at the same time disabling debugging information for other passes?
Star Tan

oggenc.tgz (642 KB)

(3) About detecting scop regions in bottom-up order.

>> Detecting scop regions in bottom-up order can significantly speed up the scop detection pass. However, as I have discussed with Sebastian, detecting scops in bottom-up order and up-bottom order will lead to different results. As a result, we should not change the detection order.
>
>Sebastian had a patch for this. Does his patch improve the scop 
>detection time.

LNT testing results for Sebastian's patch file can be seen on [http://188.40.87.11:8000/db_default/v4/nts/recent_activity](http://188.40.87.11:8000/db_default/v4/nts/recent_activity) (Run Order: [ScopDetect130615](http://188.40.87.11:8000/db_default/v4/nts/order/6)). You can compare ScopDetect130615 (Polly with Sebastian's patch) to [pOpt130615](http://188.40.87.11:8000/db_default/v4/nts/order/5) (Polly without Sebastian's patch). The result seems not show significant performance improvements with the bottom-up patch for LLVM test-suite benchmarks.
You are right. I think I should better firstly focus on the some simple examples, such as oggenc, to understand where scop detection pass spend its time. After that, we can then investigate the scop detection order.
Bests,
Star Tan

Great. Now we have two test cases we can work with. Can you

upload the LLVM-IR produced by clang -O0 (without Polly)?

Since tramp3d-v4.ll is to large (19M with 267 thousand lines), I would focus on the oggenc benchmark at firat.
I attached the oggenc.ll (LLVM-IR produced by clang -O0 without Polly), which compressed into the file oggenc.tgz.

Sounds good.

2) Check why the Polly scop detection is failing

You can use 'opt -polly-detect -analyze' to see the most common reasons
the scop detection failed. We should verify that we perform the most
common and cheap tests early.

I also attached the output file oggenc_polly_detect_analyze.log produced by "polly-opt -O3 -polly-detect -analyze oggenc.ll". Unfortunately, it only dumps valid scop regions. At first, I thought to dump all debugging information by "-debug" option, but it will dump too many unrelated information produced by other passes. Do you know any option that allows me to dump debugging information for the "-polly-detect" pass, but at the same time disabling debugging information for other passes?

I really propose to not attach such large files. :wink:

To dump debug info of just one pass you can use -debug-only=polly-detect. However, for performance measurements, you want to use
a release build to get accurate numbers.

Another flag that is interesting is the flag '-stats'. It gives me the following information:

     4 polly-detect
                  - Number of bad regions for Scop: CFG too complex
   183 polly-detect
                  - Number of bad regions for Scop: Expression not affine
   103 polly-detect
                  - Number of bad regions for Scop: Found base address
                    alias
   167 polly-detect
                  - Number of bad regions for Scop: Found invalid region
                    entering edges
    59 polly-detect
                  - Number of bad regions for Scop: Function call with
                    side effects appeared
   725 polly-detect
                  - Number of bad regions for Scop: Loop bounds can not
                    be computed
    93 polly-detect
                  - Number of bad regions for Scop: Non canonical
                    induction variable in loop
     8 polly-detect
                  - Number of bad regions for Scop: Others
    53 polly-detect
                  - Number of regions that a valid part of Scop

This seems to suggest that we most scops fail due to loop bounds that can not be computed. It would be interesting to see what kind of expressions these are. In case SCEV often does not deliver a result,
this may be one of the cases where bottom up scop detection would help
a lot, as outer regions are automatically invalidated if we can not get a SCEV for the loop bounds of the inner regions.

However, I still have the feeling the test case is too large. You can reduce it I propose to first run opt with 'opt -O3 -polly -disable-inlining -time-passes'. You then replace all function definitions with
s/define internal/define/. After this preprocessing you can use a regexp such as "'<,'>s/define \([^{}]* \){\_[^{}]*}/declare \1" to replace function definitions with their declaration. You can use this to binary search for functions that have a large overhead in ScopDetect time.

I tried this a little, but realized that no matter if I removed the first or the second part of a module, the relative scop-detect time always went down. This is surprising. If you see similar effects, it would be interesting to investigate.

Cheers,
tobi

Cheers,
Tobi

This is actually surprising. I expected his patch to make a difference in the scop detect time.

Btw, if you rerun the experiments, it probably makes more sense to not modify the commit number, but instead use different lnt tester names
to report the changes. Changing the commit numbers causes crashes e.g.
when you click on the links for individual benchmarks.

Cheers,
Tobias

>On 07/01/2013 06:51 AM, Star Tan wrote:
>>> Great. Now we have two test cases we can work with. Can you
>>
>>> upload the LLVM-IR produced by clang -O0 (without Polly)?
>> Since tramp3d-v4.ll is to large (19M with 267 thousand lines), I would focus on the oggenc benchmark at firat.
>> I attached the oggenc.ll (LLVM-IR produced by clang -O0 without Polly), which compressed into the file oggenc.tgz.
>
>Sounds good.
>
>>> 2) Check why the Polly scop detection is failing
>>>
>>> You can use 'opt -polly-detect -analyze' to see the most common reasons
>>> the scop detection failed. We should verify that we perform the most
>>> common and cheap tests early.
>>>
>> I also attached the output file oggenc_polly_detect_analyze.log produced by "polly-opt -O3 -polly-detect -analyze oggenc.ll". Unfortunately, it only dumps valid scop regions. At first, I thought to dump all debugging information by "-debug" option, but it will dump too many unrelated information produced by other passes. Do you know any option that allows me to dump debugging information for the "-polly-detect" pass, but at the same time disabling debugging information for other passes?
>
>I really propose to not attach such large files. ;-)
>
>To dump debug info of just one pass you can use 
>-debug-only=polly-detect. However, for performance measurements, you 
>want to use
>a release build to get accurate numbers.
>
>Another flag that is interesting is the flag '-stats'. It gives me the 
>following information:
>
>     4 polly-detect
>                  - Number of bad regions for Scop: CFG too complex
>   183 polly-detect
>                  - Number of bad regions for Scop: Expression not affine
>   103 polly-detect
>                  - Number of bad regions for Scop: Found base address
>                    alias
>   167 polly-detect
>                  - Number of bad regions for Scop: Found invalid region
>                    entering edges
>    59 polly-detect
>                  - Number of bad regions for Scop: Function call with
>                    side effects appeared
>   725 polly-detect
>                  - Number of bad regions for Scop: Loop bounds can not
>                    be computed
>    93 polly-detect
>                  - Number of bad regions for Scop: Non canonical
>                    induction variable in loop
>     8 polly-detect
>                  - Number of bad regions for Scop: Others
>    53 polly-detect
>                  - Number of regions that a valid part of Scop
>
>This seems to suggest that we most scops fail due to loop bounds that 
>can not be computed. It would be interesting to see what kind of 
>expressions these are. In case SCEV often does not deliver a result,
>this may be one of the cases where bottom up scop detection would help
>a lot, as outer regions are automatically invalidated if we can not get 
>a SCEV for the loop bounds of the inner regions.
Thank you so much. This is what I need. I just want to know why these scops are invalid!
>
>However, I still have the feeling the test case is too large. You can 
>reduce it I propose to first run opt with 'opt -O3 -polly 
>-disable-inlining -time-passes'. You then replace all function 
>definitions with
>s/define internal/define/. After this preprocessing you can use a regexp 
>such as "'<,'>s/define \([^{}]* \){\_[^{}]*}/declare \1" to replace 
>function definitions with their declaration. You can use this to binary 
>search for functions that have a large overhead in ScopDetect time.
>
>I tried this a little, but realized that no matter if I removed the 
>first or the second part of a module, the relative scop-detect time 
>always went down. This is surprising. If you see similar effects, it 
>would be interesting to investigate.

No problem. I will try to reduce code size.
Bests,
Star Tan

Star Tan wrote:

I attached the oggenc.ll (LLVM-IR produced by clang -O0 without Polly), which compressed into the file oggenc.tgz.

Let me repeat what Tobi said: please do *not* send out large files to the
mailing lists.

Thanks,
Sebastian