[Polly] Update of Polly compile-time performance on LLVM test-suite

Hi Tobias and all Polly developers,

I have re-evaluated Polly's compile-time performance using the newest LLVM/Polly source code. You can view the results at http://188.40.87.11:8000.

In particular, I also evaluated our r187102 patch, which avoids expensive failure-string operations during normal execution. Specifically, I evaluated two cases for it:

Polly-NoCodeGen: clang -O3 -load LLVMPolly.so -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median
Polly-Opt: clang -O3 -load LLVMPolly.so -mllvm -polly
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=11&baseline=11&aggregation_fn=median
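Conceptually, the r187102 change avoids paying for failure diagnostics on the hot path. Below is a minimal sketch of that idea; all names here are invented for illustration and are not Polly's actual classes. Instead of eagerly formatting a rejection string every time a scop-detection check fails, the message is built by a callback that only runs when diagnostics are enabled.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Invented sketch of lazy failure reporting: the expensive message is only
// formatted when diagnostics are enabled, so normal runs skip the string work.
class RejectionLog {
  bool DiagnosticsEnabled;
  std::string LastMessage;
  unsigned Formatted = 0; // how many times we actually paid for formatting

public:
  explicit RejectionLog(bool Enabled) : DiagnosticsEnabled(Enabled) {}

  // Takes a callback instead of a string, so the caller never builds the
  // message unless we really need it.
  void report(const std::function<std::string()> &MakeMessage) {
    if (!DiagnosticsEnabled)
      return; // hot path: no allocation, no formatting
    LastMessage = MakeMessage();
    ++Formatted;
  }

  unsigned formattedCount() const { return Formatted; }
  const std::string &lastMessage() const { return LastMessage; }
};
```

On the common path (diagnostics off) no std::string is ever allocated, which is where the measured compile-time savings would come from.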

The “Polly-NoCodeGen” case is mainly used to compare the compile-time performance of the polly-detect pass. As shown in the results, our patch significantly reduces the compile-time overhead of some benchmarks, such as tramp3dv4 (24.2%), simple_types_constant_folding (12.6%), oggenc (9.1%), and loop_unroll (7.8%).

The “Polly-Opt” case is used to compare the overall compile-time performance of Polly. Since our patch mainly affects the Polly-Detect pass, it shows performance similar to “Polly-NoCodeGen”. As shown in the results, it reduces the compile-time overhead of some benchmarks, such as tramp3dv4 (23.7%), simple_types_constant_folding (12.9%), oggenc (8.3%), and loop_unroll (7.5%).

Finally, I also evaluated the performance of the ScopBottomUp patch, which changes the top-down scop detection into bottom-up scop detection. The results can be viewed at:

pNoCodeGen-ScopBottomUp: clang -O3 -load LLVMPolly.so (vs. LLVMPolly-ScopBottomUp.so) -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median
pOpt-ScopBottomUp: clang -O3 -load LLVMPolly.so (vs. LLVMPolly-ScopBottomUp.so) -mllvm -polly
http://188.40.87.11:8000/db_default/v4/nts/19?compare_to=18&baseline=18&aggregation_fn=median

(*Both of these results are based on LLVM r187116, which includes the r187102 patch discussed above.)

Please note that this patch leads to some failures in the Polly tests, so the data shown here cannot be regarded as reliable. For example, the patch significantly reduces the compile-time overhead of SingleSource/Benchmarks/Shootout/nestedloop only because it treats the nested loop as an invalid scop and skips all subsequent transformations and optimizations. However, I evaluated it here to gauge its potential performance impact. Based on the results shown at http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median, we can see that detecting scops bottom-up may further reduce Polly's compile time by more than 10%.
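For intuition, here is a toy model of why a bottom-up order can be cheaper. This is an invented sketch, not the actual ScopBottomUp patch: children of a region are validated first, so when a child has already failed, the parent's own validity check never runs.

```cpp
#include <cassert>
#include <vector>

// Toy model of region-based scop detection. A region is a valid scop iff its
// own constraints hold and every child region is valid too.
struct Region {
  bool LocallyValid;              // do this region's own constraints hold?
  std::vector<Region *> Children;
};

static unsigned ChecksRun = 0;    // stand-in for the expensive analysis cost

static bool isLocallyValid(const Region &R) {
  ++ChecksRun;
  return R.LocallyValid;
}

// Bottom-up order: children are validated first, so if any child already
// failed, the parent's own (expensive) check is skipped entirely. Returns
// true iff the whole subtree forms one scop; otherwise the maximal valid
// sub-scops are appended to Scops. The top-level caller records the root
// itself when true is returned.
static bool detectBottomUp(Region &R, std::vector<Region *> &Scops) {
  std::vector<Region *> ValidChildren, PartialScops;
  bool AllChildrenValid = true;
  for (Region *C : R.Children) {
    if (detectBottomUp(*C, PartialScops))
      ValidChildren.push_back(C);
    else
      AllChildrenValid = false;
  }
  if (AllChildrenValid && isLocallyValid(R))
    return true; // the parent may still be merged into an even larger scop
  // Not a single scop: keep the maximal valid children that we found.
  Scops.insert(Scops.end(), ValidChildren.begin(), ValidChildren.end());
  Scops.insert(Scops.end(), PartialScops.begin(), PartialScops.end());
  return false;
}
```

In this model an invalid subtree prunes every enclosing region's check, which is one plausible source of the measured savings; it also illustrates the failure mode above, since a region wrongly classified as invalid silently shrinks the set of detected scops.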

Best wishes,
Star Tan

Hi Tobias and all Polly developers,

I have re-evaluated the Polly compile-time performance using newest
LLVM/Polly source code. You can view the results on
http://188.40.87.11:8000
<http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median>.

Especially, I also evaluated our r187102 patch file that avoids expensive
failure string operations in normal execution. Specifically, I evaluated
two cases for it:

Polly-NoCodeGen: clang -O3 -load LLVMPolly.so -mllvm
-polly-optimizer=none -mllvm -polly-code-generator=none
http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median
Polly-Opt: clang -O3 -load LLVMPolly.so -mllvm -polly
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=11&baseline=11&aggregation_fn=median

The "Polly-NoCodeGen" case is mainly used to compare the compile-time
performance for the polly-detect pass. As shown in the results, our
patch file could significantly reduce the compile-time overhead for some
benchmarks such as tramp3dv4
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (24.2%), simple_types_constant_folding
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2> (12.6%),
oggenc
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2> (9.1%),
loop_unroll
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2> (7.8%)

Very nice!

Though I am surprised to also see performance regressions. They are all in very shortly executing kernels, so they may very well be measuring noise. Is this really the case?

Also, it may be interesting to compare against the non-polly case to see
how much overhead there is still due to our scop detection.

The "Polly-opt" case is used to compare the whole compile-time
performance of Polly. Since our patch file mainly affects the
Polly-Detect pass, it shows similar performance to "Polly-NoCodeGen". As
shown in results, it reduces the compile-time overhead of some
benchmarks such as tramp3dv4
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (23.7%), simple_types_constant_folding
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2> (12.9%),
oggenc
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2> (8.3%),
loop_unroll
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2> (7.5%)

At last, I also evaluated the performance of the ScopBottomUp patch that
changes the top-down scop detection into bottom-up scop detection.
Results can be viewed by:
pNoCodeGen-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
LLVMPolly-ScopBottomUp.so) -mllvm -polly-optimizer=none -mllvm
-polly-code-generator=none
http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median
pOpt-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
LLVMPolly-ScopBottomUp.so) -mllvm -polly
http://188.40.87.11:8000/db_default/v4/nts/19?compare_to=18&baseline=18&aggregation_fn=median
(*Both of these results are based on LLVM r187116, which has included
the r187102 patch file that we discussed above)

Please notice that this patch file will lead to some errors in
Polly-tests, so the data shown here cannot be regarded as reliable
results. For example, this patch can significantly reduce the
compile-time overhead of SingleSource/Benchmarks/Shootout/nestedloop
<http://188.40.87.11:8000/db_default/v4/nts/19/graph?test.17=2> only
because it regards the nested loop as an invalid scop and skips all
following transformations and optimizations. However, I evaluated it
here to see its potential performance impact. Based on the results
shown on
http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median,
we can see detecting scops bottom-up may further reduce Polly
compile-time by more than 10%.

Interesting. For some reason it also regresses huffbench quite a bit. :-( I think here an up-to-date non-polly to polly comparison would come in handy to see in which benchmarks we still see larger performance regressions, and whether the bottom-up scop detection actually helps here.
As this is a larger patch, we should really have a need for it before switching to it.

Cheers,
Tobias

Hi all,

I have also evaluated Polly's compile-time performance with our patch for the polly-dependence pass. Results can be viewed at:
http://188.40.87.11:8000/db_default/v4/nts/23?baseline=18&compare_to=18

With this patch, Polly creates only a single parameter for memory accesses that share the same loop variable but have different base addresses. As a result, it significantly reduces compile time for some array-intensive benchmarks, such as lu (reduced by 83.65%) and AMGMK (reduced by 56.24%).
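A rough sketch of the canonicalization idea as I understand it (hypothetical names, not the actual polly-dependence code): accesses that differ only in their base address are mapped to one shared parameter, so the dependence problem handed to the solver has far fewer parameter dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Invented sketch of the parameter-sharing idea: memory accesses whose index
// expressions are identical get one shared parameter, even when their base
// addresses differ, keeping the dependence problem's parameter space small.
class ParamTable {
  std::map<std::string, unsigned> Ids; // key: canonical text of the index expr

public:
  // Return one parameter id per distinct index expression. Accesses that
  // differ only in their base address (A[i + n] vs. B[i + n]) share an id.
  unsigned getParamId(const std::string &IndexExpr) {
    auto It = Ids.find(IndexExpr);
    if (It != Ids.end())
      return It->second;
    unsigned Id = static_cast<unsigned>(Ids.size());
    Ids.emplace(IndexExpr, Id);
    return Id;
  }

  std::size_t numParams() const { return Ids.size(); }
};
```

Since the dependence solver's cost grows quickly with the number of parameters, collapsing per-access parameters this way is a plausible explanation for the large wins on array-heavy benchmarks like lu and AMGMK.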

For our standard benchmark shown in http://llvm.org/bugs/show_bug.cgi?id=14240, the total compile time is reduced from 154.5389s to 0.0164s. In particular, the compile time of the polly-dependence pass drops from 148.8800s (96.3% of the total) to 0.0066s (40.5% of the total).

Cheers,
Star Tan

>On 07/30/2013 10:03 AM, Star Tan wrote:
>> Hi Tobias and all Polly developers,
>>
>> I have re-evaluated the Polly compile-time performance using newest
>> LLVM/Polly source code.  You can view the results on
>> http://188.40.87.11:8000
>> <http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median>.
>>
>> Especially, I also evaluated our r187102 patch file that avoids expensive
>> failure string operations in normal execution. Specifically, I evaluated
>> two cases for it:
>>
>> Polly-NoCodeGen: clang -O3 -load LLVMPolly.so -mllvm
>> -polly-optimizer=none -mllvm -polly-code-generator=none
>> http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median
>> Polly-Opt: clang -O3 -load LLVMPolly.so -mllvm -polly
>> http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=11&baseline=11&aggregation_fn=median
>>
>> The "Polly-NoCodeGen" case is mainly used to compare the compile-time
>> performance for the polly-detect pass. As shown in the results, our
>> patch file could significantly reduce the compile-time overhead for some
>> benchmarks such as tramp3dv4
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (24.2%), simple_types_constant_folding
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2>(12.6%),
>> oggenc
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2>(9.1%),
>> loop_unroll
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2>(7.8%)
>
>Very nice!
>
>Though I am surprised to also see performance regressions. They are all 
>in very shortly executing kernels, so they may very well be measuring 
>noise. Is this really the case?
Yes, it seems that shortly executing benchmarks always show huge unexpected noise, even when we run 10 samples per test.
I have changed the ignore_small abs value from the original 0.01 to 0.05, which means benchmarks with a performance delta of less than 0.05s are skipped. In that case, the results seem to be much more stable.
However, I have noticed that there are many other Polly patches between the two versions r185399 and r187116. They may also affect the compile-time performance. I will re-evaluate the LLVM test-suite to see the performance improvements caused only by our patch file.
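The ignore_small filtering described above amounts to dropping any benchmark whose absolute compile-time delta is below the threshold; a minimal sketch follows (invented structure, not LNT's actual implementation):

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Invented sketch of ignore_small filtering: a benchmark's delta is reported
// only when the absolute change exceeds the threshold, so tiny deltas on
// very short compiles are treated as noise and skipped.
struct Sample {
  std::string Name;
  double Before, After; // compile times in seconds
};

std::vector<Sample> significantDeltas(const std::vector<Sample> &Runs,
                                      double IgnoreSmallAbs) {
  std::vector<Sample> Kept;
  for (const Sample &S : Runs)
    if (std::fabs(S.After - S.Before) >= IgnoreSmallAbs)
      Kept.push_back(S);
  return Kept;
}
```

Raising the threshold from 0.01s to 0.05s simply widens the band of deltas treated as noise, which is why the short-running kernels drop out of the report while large changes like tramp3dv4 remain.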
>
>Also, it may be interesting to compare against the non-polly case to see
>how much overhead there is still due to our scop detection.
>
>> The "Polly-opt" case is used to compare the whole compile-time
>> performance of Polly. Since our patch file mainly affects the
>> Polly-Detect pass, it shows similar performance to "Polly-NoCodeGen". As
>> shown in results, it reduces the compile-time overhead of some
>> benchmarks such as tramp3dv4
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (23.7%), simple_types_constant_folding
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2>(12.9%),
>> oggenc
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2>(8.3%),
>> loop_unroll
>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2>(7.5%)
>>
>> At last, I also evaluated the performance of the ScopBottomUp patch that
>> changes the top-down scop detection into bottom-up scop detection.
>> Results can be viewed by:
>> pNoCodeGen-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
>> LLVMPolly-ScopBottomUp.so)  -mllvm -polly-optimizer=none -mllvm
>> -polly-code-generator=none
>> http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median
>> pOpt-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
>> LLVMPolly-ScopBottomUp.so)  -mllvm -polly
>> http://188.40.87.11:8000/db_default/v4/nts/19?compare_to=18&baseline=18&aggregation_fn=median
>> (*Both of these results are based on LLVM r187116, which has included
>> the r187102 patch file that we discussed above)
>>
>> Please notice that this patch file will lead to some errors in
>> Polly-tests, so the data shown here cannot be regarded as reliable
>> results. For example, this patch can significantly reduce the
>> compile-time overhead of SingleSource/Benchmarks/Shootout/nestedloop
>> <http://188.40.87.11:8000/db_default/v4/nts/19/graph?test.17=2> only
>> because it regards the nested loop as an invalid scop and skips all
>> following transformations and optimizations. However, I evaluated it
>> here to see its potential performance impact.  Based on the results
>> shown on
>> http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median,
>> we can see detecting scops bottom-up may further reduce Polly
>> compile-time by more than 10%.
>
>Interesting. For some reason it also regresses huffbench quite a bit. 
This is because the ScopBottomUp patch file invalidates scop detection for huffbench. The run-times of huffbench with different options are as follows:
clang: 19.1680s  (see runid=14)
polly without ScopBottomUp patch file: 14.8340s (see runid=16)
polly with ScopBottomUp patch file: 19.2920s (see runid=21)
As you can see, with the ScopBottomUp patch file huffbench shows almost the same execution performance as clang. That is because no valid scops are detected with this patch file at all.

>:-( I think here an up-to-date non-polly to polly comparison would come
>handy to see which benchmarks we still see larger performance 
>regressions. And if the bottom-up scop detection actually helps here.
>As this is a larger patch, we should really have a need for it before 
>switching to it.
>
I have evaluated Polly compile-time performance for the following options:
  clang: clang -O3  (runid: 14) 
  pBasic: clang -O3 -load LLVMPolly.so (runid:15) 
  pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none (runid:16) 
  pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none (runid:17) 
  pOpt: pollycc -O3 (runid:18)
For example, you can view the comparison between "clang" and "pNoGen" at:
http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14
However, I have noticed that there are many other Polly patches between the two version r185399 and r187116. They may also affect the compile-time performance. I would re-evaluate LLVM-testsuite to see the performance improvements caused only by our patch file.

The performance evaluation for our single “ScopDetection String Operation Patch” (r187102) can be viewed at:
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=24&baseline=24
The only difference between the two runs is whether they use the r187102 patch file, i.e., run_id=18 is with the r187102 patch file, while run_id=24 is without it.

The results show that this patch significantly reduces compile time for tramp3d-v4 (24.41%), simple_types_constant_folding (13.47%), and oggenc (9.68%). It does not affect execution performance at all, since it only removes some string operations used for debugging.

Cheers,
Star Tan

Very nice results indeed. It is also especially nice to see that there is no noise at all in the results!

Cheers,
Tobi

Hi Tobias and all Polly developers,

I have re-evaluated the Polly compile-time performance using newest
LLVM/Polly source code. You can view the results on
http://188.40.87.11:8000
<http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median>.

Especially, I also evaluated our r187102 patch file that avoids expensive
failure string operations in normal execution. Specifically, I evaluated
two cases for it:

Polly-NoCodeGen: clang -O3 -load LLVMPolly.so -mllvm
-polly-optimizer=none -mllvm -polly-code-generator=none
http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median
Polly-Opt: clang -O3 -load LLVMPolly.so -mllvm -polly
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=11&baseline=11&aggregation_fn=median

The "Polly-NoCodeGen" case is mainly used to compare the compile-time
performance for the polly-detect pass. As shown in the results, our
patch file could significantly reduce the compile-time overhead for some
benchmarks such as tramp3dv4
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (24.2%), simple_types_constant_folding
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2> (12.6%),
oggenc
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2> (9.1%),
loop_unroll
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2> (7.8%)

Very nice!

Though I am surprised to also see performance regressions. They are all
in very shortly executing kernels, so they may very well be measuring
noise. Is this really the case?

Yes, it seems that shortly executing benchmarks always show huge unexpected noise, even when we run 10 samples per test.

I have changed the ignore_small abs value from the original 0.01 to 0.05, which means benchmarks with a performance delta of less than 0.05s are skipped. In that case, the results seem to be much more stable.
However, I have noticed that there are many other Polly patches between the two versions r185399 and r187116. They may also affect the compile-time performance. I will re-evaluate the LLVM test-suite to see the performance improvements caused only by our patch file.

I doubt the Polly changes affected performance much. However, there have been huge numbers of patches to LLVM/clang. Those obviously changed performance. The rerun tests show that our results in fact filter noise out effectively. Can you check if this also holds for the original 0.01?

Also, it may be interesting to compare against the non-polly case to see
how much overhead there is still due to our scop detection.

The "Polly-opt" case is used to compare the whole compile-time
performance of Polly. Since our patch file mainly affects the
Polly-Detect pass, it shows similar performance to "Polly-NoCodeGen". As
shown in results, it reduces the compile-time overhead of some
benchmarks such as tramp3dv4
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (23.7%), simple_types_constant_folding
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2> (12.9%),
oggenc
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2> (8.3%),
loop_unroll
<http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2> (7.5%)

At last, I also evaluated the performance of the ScopBottomUp patch that
changes the top-down scop detection into bottom-up scop detection.
Results can be viewed by:
pNoCodeGen-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
LLVMPolly-ScopBottomUp.so) -mllvm -polly-optimizer=none -mllvm
-polly-code-generator=none
http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median
pOpt-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
LLVMPolly-ScopBottomUp.so) -mllvm -polly
http://188.40.87.11:8000/db_default/v4/nts/19?compare_to=18&baseline=18&aggregation_fn=median
(*Both of these results are based on LLVM r187116, which has included
the r187102 patch file that we discussed above)

Please notice that this patch file will lead to some errors in
Polly-tests, so the data shown here cannot be regarded as reliable
results. For example, this patch can significantly reduce the
compile-time overhead of SingleSource/Benchmarks/Shootout/nestedloop
<http://188.40.87.11:8000/db_default/v4/nts/19/graph?test.17=2> only
because it regards the nested loop as an invalid scop and skips all
following transformations and optimizations. However, I evaluated it
here to see its potential performance impact. Based on the results
shown on
http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median,
we can see detecting scops bottom-up may further reduce Polly
compile-time by more than 10%.

Interesting. For some reason it also regresses huffbench quite a bit.

This is because the ScopBottomUp patch file invalidates scop detection for huffbench. The run-times of huffbench with different options are as follows:

clang: 19.1680s (see runid=14)

polly without ScopBottomUp patch file: 14.8340s (see runid=16)

polly with ScopBottomUp patch file: 19.2920s (see runid=21)

As you can see, with the ScopBottomUp patch file huffbench shows almost the same execution performance as clang. That is because no valid scops are detected with this patch file at all.

I am still confused. So you are saying Polly reduces the run-time from 19 to 14 secs for huffbench? This is nice, but very surprising for the no-codegen runs, no?

:-( I think here an up-to-date non-polly to polly comparison would come
handy to see which benchmarks we still see larger performance
regressions. And if the bottom-up scop detection actually helps here.
As this is a larger patch, we should really have a need for it before
switching to it.

I have evaluated Polly compile-time performance for the following options:

   clang: clang -O3 (runid: 14)

   pBasic: clang -O3 -load LLVMPolly.so (runid:15)

   pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none (runid:16)

   pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none (runid:17)

   pOpt: pollycc -O3 (runid:18)

For example, you can view the comparison between "clang" and "pNoGen" with:

http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14

It shows that without the optimizer and code generator, Polly leads to less than 30% extra compile-time overhead.

This is a step in the right direction, especially as most runs show a lot less overhead.

Still, we need to improve on this. Ideally, we should not see more than 5% slowdown. I suspect we can get some general speed-ups by reducing the set of passes we schedule for canonicalization. However, before, it may be good to look at some of the slow kernels. lemon e.g. looks interesting - 20% slowdown.

For the execution performance, it is interesting that pNoGen not only significantly improves the execution performance for some benchmarks (nestedloop/huffbench) but also significantly reduces the execution performance for another set of benchmarks (gcc-loops/lpbench).

Yes, that is really interesting. I suspect a couple of our canonicalization passes enabled/blocked additional optimizations in LLVM. The huffbench kernel seems especially interesting. This is not your number one priority in GSoC, but understanding why the gcc-loops got so much worse may be interesting. I suspect this may be some kind of generic LLVM issue we are exposing, and we should report a bug explaining the issue.

Tobi

>On 07/31/2013 09:23 PM, Star Tan wrote:
>> At 2013-07-31 22:50:57,"Tobias Grosser" <tobias@grosser.es
>>
>>>On 07/30/2013 10:03 AM, Star Tan wrote:
>>>> Hi Tobias and all Polly developers,
>>>>
>>>> I have re-evaluated the Polly compile-time performance using newest
>>>> LLVM/Polly source code.  You can view the results on
>>>> http://188.40.87.11:8000
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median>.
>>>>
>>>> Especially, I also evaluated our r187102 patch file that avoids expensive
>>>> failure string operations in normal execution. Specifically, I evaluated
>>>> two cases for it:
>>>>
>>>> Polly-NoCodeGen: clang -O3 -load LLVMPolly.so -mllvm
>>>> -polly-optimizer=none -mllvm -polly-code-generator=none
>>>> http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median
>>>> Polly-Opt: clang -O3 -load LLVMPolly.so -mllvm -polly
>>>> http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=11&baseline=11&aggregation_fn=median
>>>>
>>>> The "Polly-NoCodeGen" case is mainly used to compare the compile-time
>>>> performance for the polly-detect pass. As shown in the results, our
>>>> patch file could significantly reduce the compile-time overhead for some
>>>> benchmarks such as tramp3dv4
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (24.2%), simple_types_constant_folding
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2>(12.6%),
>>>> oggenc
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2>(9.1%),
>>>> loop_unroll
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2>(7.8%)
>>>
>>>Very nice!
>>>
>>>Though I am surprised to also see performance regressions. They are all
>>>in very shortly executing kernels, so they may very well be measuring
>>>noise. Is this really the case?
>>
>> Yes, it seems that shortly executing benchmarks always show huge unexpected noise, even when we run 10 samples per test.
>>
>> I have changed the ignore_small abs value from the original 0.01 to 0.05, which means benchmarks with a performance delta of less than 0.05s are skipped. In that case, the results seem to be much more stable.
>> However, I have noticed that there are many other Polly patches between the two versions r185399 and r187116. They may also affect the compile-time performance. I will re-evaluate the LLVM test-suite to see the performance improvements caused only by our patch file.
>
>I doubt the Polly changes affected performance much. However, there
>have been huge numbers of patches to LLVM/clang. Those obviously changed 
>performance. The rerun test show that our results in fact filter noise 
>out effectively. Can you check if this also holds for the original 0.01?
No, it was set to 0.05 to filter out small deltas.
As you requested, I have now reset it to 0.01.
>
>>>Also, it may be interesting to compare against the non-polly case to see
>>>how much overhead there is still due to our scop detection.
>>>
>>>> The "Polly-opt" case is used to compare the whole compile-time
>>>> performance of Polly. Since our patch file mainly affects the
>>>> Polly-Detect pass, it shows similar performance to "Polly-NoCodeGen". As
>>>> shown in results, it reduces the compile-time overhead of some
>>>> benchmarks such as tramp3dv4
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (23.7%), simple_types_constant_folding
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2>(12.9%),
>>>> oggenc
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2>(8.3%),
>>>> loop_unroll
>>>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2>(7.5%)
>>>>
>>>> At last, I also evaluated the performance of the ScopBottomUp patch that
>>>> changes the top-down scop detection into bottom-up scop detection.
>>>> Results can be viewed by:
>>>> pNoCodeGen-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
>>>> LLVMPolly-ScopBottomUp.so)  -mllvm -polly-optimizer=none -mllvm
>>>> -polly-code-generator=none
>>>> http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median
>>>> pOpt-ScopBottomUp: clang -O3 -load LLVMPolly.so (v.s.
>>>> LLVMPolly-ScopBottomUp.so)  -mllvm -polly
>>>> http://188.40.87.11:8000/db_default/v4/nts/19?compare_to=18&baseline=18&aggregation_fn=median
>>>> (*Both of these results are based on LLVM r187116, which has included
>>>> the r187102 patch file that we discussed above)
>>>>
>>>> Please notice that this patch file will lead to some errors in
>>>> Polly-tests, so the data shown here cannot be regarded as reliable
>>>> results. For example, this patch can significantly reduce the
>>>> compile-time overhead of SingleSource/Benchmarks/Shootout/nestedloop
>>>> <http://188.40.87.11:8000/db_default/v4/nts/19/graph?test.17=2> only
>>>> because it regards the nested loop as an invalid scop and skips all
>>>> following transformations and optimizations. However, I evaluated it
>>>> here to see its potential performance impact.  Based on the results
>>>> shown on
>>>> http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median,
>>>> we can see detecting scops bottom-up may further reduce Polly
>>>> compile-time by more than 10%.
>>>
>>>Interesting. For some reason it also regresses huffbench quite a bit.
>>
>> This is because the ScopBottomUp patch file invalidates scop detection for huffbench. The run-times of huffbench with different options are as follows:
>>
>> clang: 19.1680s  (see runid=14)
>>
>> polly without ScopBottomUp patch file: 14.8340s (see runid=16)
>>
>> polly with ScopBottomUp patch file: 19.2920s (see runid=21)
>>
>> As you can see, with the ScopBottomUp patch file huffbench shows almost the same execution performance as clang. That is because no valid scops are detected with this patch file at all.
>
>I am still confused. So you are saying Polly reduces the run-time from 
>19 to 14 secs for huffbench? This is nice, but very surprising for the 
>no-codegen runs, no?
Yes, Polly reduces the run-time from 19 to 14 secs for the no-codegen run.
>
>>>:-( I think here an up-to-date non-polly to polly comparison would come
>>>handy to see which benchmarks we still see larger performance
>>>regressions. And if the bottom-up scop detection actually helps here.
>>>As this is a larger patch, we should really have a need for it before
>>>switching to it.
>>>
>> I have evaluated Polly compile-time performance for the following options:
>>
>>    clang: clang -O3  (runid: 14)
>>
>>    pBasic: clang -O3 -load LLVMPolly.so (runid:15)
>>
>>    pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none (runid:16)
>>
>>    pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none (runid:17)
>>
>>    pOpt: pollycc -O3 (runid:18)
>>
>> For example, you can view the comparison between "clang" and "pNoGen" with:
>> http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14
>>
>> It shows that without the optimizer and code generator, Polly leads to less than 30% extra compile-time overhead.
>
>This is a step in the right direction, especially as most runs show a 
>lot less overhead.
Yes, but this is based on the fact that I set the ignore_small threshold from the original 0.01 to 0.05.
If the ignore_small threshold is set to 0.01, then some benchmarks show more than 30% compile-time overhead.
As I said above, I have reset the ignore_small threshold to 0.01. Now you can view them with the same URL:
http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14
>
>Still, we need to improve on this. Ideally, we should not see more than 
>5% slowdown. I suspect we can get some general speed-ups by reducing the 
>set of passes we schedule for canonicalization.  However, before, it may 
>be good to look at some of the slow kernels. lemon e.g. looks 
>interesting - 20% slowdown.
>
>> For the execution performance, it is interesting that pNoGen not only significantly improves the execution performance for some benchmarks (nestedloop/huffbench) but also significantly reduces the execution performance for another set of benchmarks (gcc-loops/lpbench).
>
>Yes, that is really interesting. I suspect a couple of our 
>canonicalization passes enabled/blocked additional optimizations in 
>LLVM. The huffbench kernel seems especially interesting. This is not 
>your number one priority in GSoC, but understanding why the gcc-loops 
>got so much worse may be interesting. I suspect this may be some kind of
>generic LLVM issue we expose and we should report a bug explaining the 
>issue.
>
Certainly, I will try to investigate huffbench after I commit the ScopInfo patch in the coming days.
Best,
Star Tan

At 2013-07-31 22:50:57,"Tobias Grosser" <tobias@grosser.es

[..]

I doubt the Polly changes affected performance much. However, there
have been huge numbers of patches to LLVM/clang. Those obviously changed
performance. The rerun test show that our results in fact filter noise
out effectively. Can you check if this also holds for the original 0.01?

No, it was set to 0.05 to filter out small deltas.

As you requested, I have now reset it to 0.01.

Great. It seems that even with 0.01 we actually do not have a lot of noise.

http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=24&baseline=24

:-( I think here an up-to-date non-polly to polly comparison would come
handy to see which benchmarks we still see larger performance
regressions. And if the bottom-up scop detection actually helps here.
As this is a larger patch, we should really have a need for it before
switching to it.

I have evaluated Polly compile-time performance for the following options:

   clang: clang -O3 (runid: 14)

   pBasic: clang -O3 -load LLVMPolly.so (runid:15)

   pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none (runid:16)

   pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none (runid:17)

   pOpt: pollycc -O3 (runid:18)

For example, you can view the comparison between "clang" and "pNoGen" with:
http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14

It shows that without the optimizer and code generator, Polly leads to less than 30% extra compile-time overhead.

This is a step in the right direction, especially as most runs show a
lot less overhead.

Yes, but this is based on the fact that I set the ignore_small threshold from the original 0.01 to 0.05.

If the ignore_small threshold is set to 0.01, then some benchmarks show more than 30% compile-time overhead.

As I said above, I have reset the ignore_small threshold to 0.01. Now you can view them with the same URL:

http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14

We probably want to look at http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=15&baseline=15 which just measures the difference between enabling and disabling polly, but not overhead introduced due to
loading a shared object file.

For the execution performance, it is interesting that pNoGen not only significantly improves the execution performance for some benchmarks (nestedloop/huffbench) but also significantly reduces the execution performance for another set of benchmarks (gcc-loops/lpbench).

Yes, that is really interesting. I suspect a couple of our
canonicalization passes enabled/blocked additional optimizations in
LLVM. The huffbench kernel seems especially interesting. This is not
your number one priority in GSoC, but understanding why the gcc-loops
got so much worse may be interesting. I suspect this may be some kind of
generic LLVM issue we expose and we should report a bug explaining the
issue.

Certainly, I will try to investigate huffbench after I commit the ScopInfo patch in the coming days.

Great. gcc-loops may also be interesting to look at (especially as it is an unwanted regression that we would like to fix).

Cheers,
Tobi