[FastPolly]: Update of Polly's performance on LLVM test-suite

Hi all,

I have evaluated Polly's performance on the LLVM test-suite with the latest LLVM (r188054) and Polly (r187981). Results can be viewed at: http://188.40.87.11:8000.

There are five new test configurations, each run with 10 samples:
clang (run id = 27): clang -O3
pollyBasic (run id = 28): clang -O3 -load LLVMPolly.so
pollyNoGen (run id = 29): pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
pollyNoOpt (run id = 30): pollycc -O3 -mllvm -polly-optimizer=none
pollyOpt (run id = 31): pollycc -O3

Here is the performance comparison for the newest Polly:
http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=18&baseline=18

Overall, 198 benchmarks improved and 16 benchmarks regressed. In particular, with the recent performance-oriented patches for ScopDetect/ScopInfo/ScopDependences/…, we have significantly reduced the compile-time overhead of Polly for a large number of benchmarks, such as:
SingleSource/Benchmarks/Misc/salsa20 -97.84%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu -85.01%
MultiSource/Applications/obsequi/Obsequi -57.12%
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d -50.00%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm -40.09%
MultiSource/Benchmarks/mediabench/gsm/toast/toast -39.91%
SingleSource/Benchmarks/Misc/whetstone -39.02%
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg -38.07%
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg -37.70%
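
For intuition, a reduction of -97.84% means the new compile time is only 2.16% of the old one, i.e. roughly a 46x speedup. A quick awk sketch of the conversion (the percentages are taken from the list above; everything else is plain arithmetic):

```shell
# Convert the compile-time reductions above into speedup factors:
# a change of -R% leaves (100 - R)% of the original time,
# so the speedup is 100 / (100 - R).
for pct in 97.84 85.01 57.12 50.00; do
  awk -v r="$pct" 'BEGIN { printf "-%.2f%% => %.1fx faster\n", r, 100 / (100 - r) }'
done
# first line printed: -97.84% => 46.3x faster
```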

However, Polly can still lead to significant compile-time overhead for many benchmarks.
As shown on:
http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28
there are 11 benchmarks whose compile time is more than 2x that of clang. Furthermore, it seems that the PollyDependence pass is still one of the most expensive passes in Polly.

Even without optimization and code generation, Polly still increases the compile time for some benchmarks.
As shown on:
http://188.40.87.11:8000/db_default/v4/nts/29?compare_to=28&baseline=28
there are 10 benchmarks that incur more than 10% extra compile-time overhead compared with clang.

Next, I will keep focusing on improving the expensive Polly passes, such as PollyDependence.

Cheers,
Star Tan

> Hi all,
>
> I have evaluated Polly's performance on the LLVM test-suite with the latest LLVM (r188054) and Polly (r187981). Results can be viewed at: http://188.40.87.11:8000.

Hi Star Tan,

thanks for the update.

> There are five new test configurations, each run with 10 samples:
> clang (run id = 27): clang -O3
> pollyBasic (run id = 28): clang -O3 -load LLVMPolly.so
> pollyNoGen (run id = 29): pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
> pollyNoOpt (run id = 30): pollycc -O3 -mllvm -polly-optimizer=none
> pollyOpt (run id = 31): pollycc -O3


> Here is the performance comparison for the newest Polly:
>      http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=18&baseline=18

It seems the machine is down/unreachable at the moment?

> Overall, 198 benchmarks improved and 16 benchmarks regressed. In particular, with the recent performance-oriented patches for ScopDetect/ScopInfo/ScopDependences/..., we have significantly reduced the compile-time overhead of Polly for a large number of benchmarks, such as:
>      SingleSource/Benchmarks/Misc/salsa20 -97.84%
>      SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu -85.01%
>      MultiSource/Applications/obsequi/Obsequi -57.12%
>      SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d -50.00%
>      MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm -40.09%
>      MultiSource/Benchmarks/mediabench/gsm/toast/toast -39.91%
>      SingleSource/Benchmarks/Misc/whetstone -39.02%
>      MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg -38.07%
>      MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg -37.70%

Very nice work!

> However, Polly can still lead to significant compile-time overhead for many benchmarks.
> As shown on:
>      http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28
> there are 11 benchmarks whose compile time is more than 2x that of clang. Furthermore, it seems that the PollyDependence pass is still one of the most expensive passes in Polly.

We need to look at these on a case-by-case basis. A 2x compile-time increase for large programs, where Polly is just run on small parts, is not what we want. However, for small micro-kernels (e.g. Polybench), where we can significantly increase the performance of the generated code, this is in fact a good baseline, especially as we have not yet spent much time optimising this.

> Even without optimization and code generation, Polly still increases the compile time for some benchmarks.
> As shown on:
>      http://188.40.87.11:8000/db_default/v4/nts/29?compare_to=28&baseline=28
> there are 10 benchmarks that incur more than 10% extra compile-time overhead compared with clang.

Having to pay at most a 10% slowdown to decide whether Polly should be run (including all the canonicalization) is actually not bad, especially as the average on normal programs is probably a lot lower.

Still, we should have a look into why this happens for some of the biggest slowdowns.

Can you ping me when the server is up again? I would like to see which kernels are slowed down most.

> Next, I will keep focusing on improving the expensive Polly passes, such as PollyDependence.

Sure. Please keep me posted.

Tobi

>> Hi all,
>>
>> I have evaluated Polly's performance on the LLVM test-suite with the latest LLVM (r188054) and Polly (r187981). Results can be viewed at: http://188.40.87.11:8000.

> Hi Star Tan,
>
> thanks for the update.

>> There are five new test configurations, each run with 10 samples:
>> clang (run id = 27): clang -O3
>> pollyBasic (run id = 28): clang -O3 -load LLVMPolly.so
>> pollyNoGen (run id = 29): pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
>> pollyNoOpt (run id = 30): pollycc -O3 -mllvm -polly-optimizer=none
>> pollyOpt (run id = 31): pollycc -O3

>> Here is the performance comparison for the newest Polly:
>> http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=18&baseline=18

> It seems the machine is down/unreachable at the moment?

I have restarted the LNT server. It is available now.

>> Overall, 198 benchmarks improved and 16 benchmarks regressed. In particular, with the recent performance-oriented patches for ScopDetect/ScopInfo/ScopDependences/…, we have significantly reduced the compile-time overhead of Polly for a large number of benchmarks, such as:
>> SingleSource/Benchmarks/Misc/salsa20 -97.84%
>> SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu -85.01%
>> MultiSource/Applications/obsequi/Obsequi -57.12%
>> SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d -50.00%
>> MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm -40.09%
>> MultiSource/Benchmarks/mediabench/gsm/toast/toast -39.91%
>> SingleSource/Benchmarks/Misc/whetstone -39.02%
>> MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg -38.07%
>> MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg -37.70%

> Very nice work!

>> However, Polly can still lead to significant compile-time overhead for many benchmarks.
>> As shown on:
>> http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28
>> there are 11 benchmarks whose compile time is more than 2x that of clang. Furthermore, it seems that the PollyDependence pass is still one of the most expensive passes in Polly.

> We need to look at these on a case-by-case basis. A 2x compile-time
> increase for large programs, where Polly is just run on small parts,
> is not what we want. However, for small micro-kernels (e.g. Polybench),
> where we can significantly increase the performance of the generated
> code, this is in fact a good baseline, especially as we have not yet
> spent much time optimising this.

Yes, we should look into the trade-off between compile time and execution performance.
I have summarized the benchmarks whose compile-time overhead is more than 200% as follows:

SingleSource/Benchmarks/Shootout/nestedloop,
  compile_time(+6355.56%), execution_time(-99.21%)
SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d,
  compile_time(+1275.00%), execution_time(0%)
SingleSource/Benchmarks/Shootout-C++/nestedloop,
  compile_time(+1155.56%), execution_time(-99.23%)
MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk,
  compile_time(+491.80%), execution_time(0%)
SingleSource/UnitTests/Vector/multiplies,
  compile_time(+350.00%), execution_time(-13.64%)
SingleSource/Benchmarks/Stanford/Puzzle,
  compile_time(+345.45%), execution_time(-40.91%)
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm,
  compile_time(+278.95%), execution_time(0%)
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm,
  compile_time(+270.73%), execution_time(0%)
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syrk/syrk,
  compile_time(+208.57%), execution_time(0%)
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm,
  compile_time(+202.63%), execution_time(0%)
SingleSource/Regression/C/test_indvars,
  compile_time(+200.00%), execution_time(0%)

These results show that Polly can lead to significant compile-time overhead without any execution-time improvement.
I have reported a bug for nestedloop (http://llvm.org/bugs/show_bug.cgi?id=16843), and I will report bugs for the other benchmarks whose compile time increases significantly without any execution-time improvement.
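
For reference, the overhead figures in the list relate Polly's compile time to clang's; a minimal awk sketch of the calculation (the timings below are made up to reproduce the nestedloop figure, not measured values):

```shell
# overhead% = (polly_time - clang_time) / clang_time * 100
# Illustrative timings only; 0.09s vs 5.81s yields the +6355.56%
# reported for Shootout/nestedloop above.
clang_time=0.09
polly_time=5.81
awk -v c="$clang_time" -v p="$polly_time" \
    'BEGIN { printf "compile_time(+%.2f%%)\n", (p - c) / c * 100 }'
# prints: compile_time(+6355.56%)
```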

Furthermore, you can view the top 10 compiler passes when compiling with Polly here:
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10
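
Such a ranking can be rebuilt from per-pass timing output (e.g. what clang emits with -ftime-report). A rough sketch over a simplified stand-in format, one "seconds pass-name" pair per line; the pass names are taken from this thread, but the timings here are invented, not the measured ones from the gist:

```shell
# Rank passes by execution time; passes.txt mimics (in a very
# simplified form) a per-pass timing report. Numbers are illustrative.
cat > passes.txt <<'EOF'
0.0133 Combine redundant instructions
0.0412 Polly - Create polyhedral description of Scops
0.0208 Polly - Detect static control parts (SCoPs)
0.0097 Induction Variable Simplification (Polly version)
EOF
sort -rn passes.txt | head -n 3
# first line printed: 0.0412 Polly - Create polyhedral description of Scops
```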

>> Even without optimization and code generation, Polly still increases the compile time for some benchmarks.
>> As shown on:
>> http://188.40.87.11:8000/db_default/v4/nts/29?compare_to=28&baseline=28
>> there are 10 benchmarks that incur more than 10% extra compile-time overhead compared with clang.

> Having to pay at most a 10% slowdown to decide whether Polly should be
> run (including all the canonicalization) is actually not bad,
> especially as the average on normal programs is probably a lot lower.
>
> Still, we should have a look into why this happens for some of the
> biggest slowdowns.
>
> Can you ping me when the server is up again? I would like to see which
> kernels are slowed down most.

The server is up now.

For your information, you can view the top 10 compiler passes when compiling with "Polly without optimization and code generation" here:
https://gist.github.com/tanstar/40ece0e4e2bf9d052ca0/raw/9e892fe50544acca9609004941d4b3d4921cb302/Polly-NoGenOpt-top10

I have checked some benchmarks. It seems that the extra compile-time overhead mainly results from the following Polly passes:
     Polly - Create polyhedral description of Scops
     Combine redundant instructions
     Polly - Detect static control parts (SCoPs)
     Induction Variable Simplification (Polly version)

Cheers,
Star Tan

>>> Overall, 198 benchmarks improved and 16 benchmarks regressed. In particular, with the recent performance-oriented patches for ScopDetect/ScopInfo/ScopDependences/..., we have significantly reduced the compile-time overhead of Polly for a large number of benchmarks, such as:
>>>      SingleSource/Benchmarks/Misc/salsa20 -97.84%
>>>      SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu -85.01%
>>>      MultiSource/Applications/obsequi/Obsequi -57.12%
>>>      SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d -50.00%
>>>      MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm -40.09%
>>>      MultiSource/Benchmarks/mediabench/gsm/toast/toast -39.91%
>>>      SingleSource/Benchmarks/Misc/whetstone -39.02%
>>>      MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg -38.07%
>>>      MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg -37.70%

>> Very nice work!

>>> However, Polly can still lead to significant compile-time overhead for many benchmarks.
>>> As shown on:
>>>      http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28
>>> there are 11 benchmarks whose compile time is more than 2x that of clang. Furthermore, it seems that the PollyDependence pass is still one of the most expensive passes in Polly.

>> We need to look at these on a case-by-case basis. A 2x compile-time
>> increase for large programs, where Polly is just run on small parts,
>> is not what we want. However, for small micro-kernels (e.g. Polybench),
>> where we can significantly increase the performance of the generated
>> code, this is in fact a good baseline, especially as we have not yet
>> spent much time optimising this.

> Yes, we should look into the trade-off between compile time and execution performance.
> I have summarized the benchmarks whose compile-time overhead is more than 200% as follows:
>
> SingleSource/Benchmarks/Shootout/nestedloop,
>   compile_time(+6355.56%), execution_time(-99.21%)
> SingleSource/Benchmarks/Shootout-C++/nestedloop,
>   compile_time(+1155.56%), execution_time(-99.23%)

This is a huge overhead on a single kernel. We should analyse how we can do better.

> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm,
>   compile_time(+278.95%), execution_time(0%)
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm,
>   compile_time(+270.73%), execution_time(0%)
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syrk/syrk,
>   compile_time(+208.57%), execution_time(0%)
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm,
>   compile_time(+202.63%), execution_time(0%)
> SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d,
>   compile_time(+1275.00%), execution_time(0%)

The Polybench kernels as committed to the test suite are currently not
properly detected by Polly. The reason is that there are different preprocessor flags that can be used to compile Polybench, and for a certain set of them it is more challenging to optimize the kernels. We should probably first set this up to use both fixed-size arrays and non-parametric loop iterators.

> MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk,
>   compile_time(+491.80%), execution_time(0%)
> SingleSource/UnitTests/Vector/multiplies,
>   compile_time(+350.00%), execution_time(-13.64%)
> SingleSource/Benchmarks/Stanford/Puzzle,
>   compile_time(+345.45%), execution_time(-40.91%)
> SingleSource/Regression/C/test_indvars,
>   compile_time(+200.00%), execution_time(0%)

> These results show that Polly can lead to significant compile-time overhead without any execution-time improvement.
> I have reported a bug for nestedloop (http://llvm.org/bugs/show_bug.cgi?id=16843), and I will report bugs for the other benchmarks whose compile time increases significantly without any execution-time improvement.

Great. We can then look at them one at a time.

> Furthermore, you can view the top 10 compiler passes when compiling with Polly here:
> https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Interesting. This will be very helpful for further performance analysis.

>>> Even without optimization and code generation, Polly still increases the compile time for some benchmarks.
>>> As shown on:
>>>      http://188.40.87.11:8000/db_default/v4/nts/29?compare_to=28&baseline=28
>>> there are 10 benchmarks that incur more than 10% extra compile-time overhead compared with clang.

>> Having to pay at most a 10% slowdown to decide whether Polly should be
>> run (including all the canonicalization) is actually not bad,
>> especially as the average on normal programs is probably a lot lower.
>>
>> Still, we should have a look into why this happens for some of the
>> biggest slowdowns.
>>
>> Can you ping me when the server is up again? I would like to see which
>> kernels are slowed down most.

> The server is up now.

> For your information, you can view the top 10 compiler passes when compiling with "Polly without optimization and code generation" here:
>      https://gist.github.com/tanstar/40ece0e4e2bf9d052ca0/raw/9e892fe50544acca9609004941d4b3d4921cb302/Polly-NoGenOpt-top10

> I have checked some benchmarks. It seems that the extra compile-time overhead mainly results from the following Polly passes:
>      Polly - Create polyhedral description of Scops
>      Combine redundant instructions
>      Polly - Detect static control parts (SCoPs)

If there are still cases where the scop detection is slow, we may want to look into them.

>      Induction Variable Simplification (Polly version)

Otherwise, I propose to look into this based on the benchmarks with the highest overhead.

Tobi