Hi all,
Hi,
thanks for the update and sorry for the delay in reviewing. I just had a look at your proposal.
I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!
Yes, figuring out the optimal pass ordering sequence is very good.
Is there any comment or advice about my proposal? I appreciate all your help.
1. Summary:
LLVM-Polly is a promising polyhedral optimizer for data-locality and
parallelism, which takes advantage of multi-cores, cache hierarchies,
short vector instructions as well as dedicated accelerators. However,
Polly analysis and optimization can lead to significant compiling
overhead, which makes it much less attractive for LLVM users. I argue
that maintaining fast compiling time when Polly is enabled is very
important, especially if we want to think of enabling Polly by default.
Based on this assumption, I try to reduce Polly compiling overhead in
this project.
Sounds good.
2. Motivation:
LLVM is an incredible open-source project. It has been widely in C/C++
You miss a verb here ^^^
compilers, high-level synthesis compilers, virtual machines, optimizing
tools, etc. As a graduate student, I am going to work on compiler
analysis and optimization, especially on program vectorization and
parallelization. I find Polly is a very useful and powerful polyhedral
optimizer. I would like to use this tool and contribute to this project.
When I was using Polly tool, I found that Polly optimization can lead to
No need for 'tool' here ^^^
significant compiling overhead. On average, Polly optimization increases the compiling time by 393% for PolyBench benchmarks and by 53% for MediaBench benchmarks compared with plain clang. That means if you want to gain from Polly, you have to pay roughly four times the compile time. Even if you do not gain much from Polly, you still have to pay the 53% compile-time overhead. Such an expensive overhead makes Polly much less attractive to LLVM users.
Good point.
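As a sanity check, the quoted percentages can be recomputed from the raw compile times in the attached table 1. A minimal sketch (the ludcmp.c times are taken from the table; the helper name is my own, not part of Polly):

```python
# Derive the compile-time overhead percentage from two raw timings.
# Times are in seconds, taken from the ludcmp.c row of table 1.

def overhead_percent(t_polly, t_clang):
    """Extra compile time relative to plain clang -O3, in percent."""
    return (t_polly - t_clang) / t_clang * 100

t_clang = 0.157    # clang -O3
t_polly = 1.3175   # pollycc -O3 (full Polly pipeline)

print(round(overhead_percent(t_polly, t_clang), 2))  # 739.17, matching table 1
```

The same arithmetic applied to every benchmark yields the per-column percentages in the attachment.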
In this project, I try to reduce Polly compiling overhead by removing
I would call it 'compile-time overhead' instead of 'compiling overhead'.
unnecessary passes and improving critical passes. For this purpose, I
first try to find out where the compiling overhead comes from. When
Polly optimizes a program, it takes the following steps: 1) Polly
canonicalization: prepare some basic information and do some basic
transformation, such as loop-simplify and region-simplify. 2) LLVM-IR
to Polly description: detect Polly scops and translate the detected
scops into a polyhedral representation. 3) Polly optimization: analyze
and optimize polyhedral scops. 4) Polly description to LLVM-IR:
translate the polyhedral description back into new LLVM-IR.
In attched table 1 and 2, pBasic shows the overhead of loading the
attached
LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
the overhead of step 1), 2) and 4). So the compiling overhead of Polly
can be divided into three parts:
PolyBench:  canonicalization (13% - 1% = 12%), code generation (248% - 13% = 235%), optimization (393% - 248% = 145%)
MediaBench: canonicalization ( 9% - 1% =  8%), code generation ( 43% -  9% =  34%), optimization ( 53% -  43% =  10%)
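The decomposition above is just successive subtraction of the cumulative overheads of the test configurations; sketched out with the rounded PolyBench percentages quoted above:

```python
# Split Polly's total compile-time overhead into its three parts.
# Each value is the cumulative overhead (percent over clang -O3):
# pBasic = loading LLVMPolly.so, pNoGen = steps 1)+2),
# pNoOpt = steps 1), 2) and 4), pOpt = the full pipeline.

pBasic, pNoGen, pNoOpt, pOpt = 1, 13, 248, 393

canonicalization = pNoGen - pBasic   # step 1)
code_generation  = pNoOpt - pNoGen   # step 4)
optimization     = pOpt   - pNoOpt   # step 3)

print(canonicalization, code_generation, optimization)  # 12 235 145
```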
Thanks for adding numbers for pNoGen. Having only 10% runtime increase if Polly is not used is a good sign, especially for the amount of canonicalization passes we run. This makes me confident we can get it to an even smaller number.
The other numbers are large, but there are likely ways to improve on this significantly. Also, it would be good to show, at least for one benchmark, which passes the different numbers actually contain (you can use -debug-pass=Structure for this). E.g. the code generation time looks rather large. I suppose most of the time is not actually spent in code generation itself, but in LLVM passes such as common subexpression elimination that have more LLVM-IR to work on or clean up after Polly was run.
Also, I believe the names of your columns and the command line options given above are a little out of sync. I could e.g. not find a description for pBasic.
Based on these results, I plan to reduce Polly compiling overhead by the
following steps: First, I will try to remove unnecessary
canonicalization passes to reduce canonicalization time; Second, I will
try to remove or rewrite expensive analysis passes to reduce
optimization overhead; Third, I will try to improve the code generation
passes to reduce code generation overhead. Another interesting piece of work is to let Polly bail out early, which can be very helpful in saving compile time if Polly cannot benefit the program.
OK, this sounds like a reasonable approach. Some more points may be worth adding:
1) It is important to pick criteria you can evaluate your work on
It is a good start that you identified two benchmarks. Especially looking into non-polybench code is very valuable. You should make sure that you evaluate your work throughout the project to see the benefit
of your changes. In fact, it may even be worthwhile to set up a Polly performance tester to track the compile time with Polly enabled and how
your changes influence it.
2) Add some specific bug reports you are planning to look into
This bug report shows a large performance problem in Polly that is mainly due to creating a very difficult dependency analysis problem:
llvm.org/PR15643
There was a larger discussion on the Polly mailing list that discusses this bug.
3. Details about the project:
StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
Polly relies on some canonicalization passes to simplify the subsequent
analysis and optimization. Canonicalization passes include
loop-simplify, region-simplify, induction variable canonicalization and
independent blocks. For example, the region-simplify pass is run to simplify
each region to a single entry edge and a single exit edge before -polly-detect.
However, such an approach introduces unnecessary modifications that
increase compile time even in cases where Polly cannot optimize the
code.
A first step is to remove the -region-simplify pass. For this purpose, I
have modified the scop detection pass and the Polly code generation pass to
allow scops with multiple entry edges and multiple exit edges. Details
can be found in the following patches (thanks for all the help
from the Polly group):
r179673: Remove unneeded RegionSimplify pass
r179586: Support SCoPs with multiple entry edges
r179159: Support SCoPs with multiple exit edges
r179158: Codegen: Replace region exit and entries recursively
r179157: RegionInfo: Add helpers to replace entry/exit recursively
r178530: ScopDetection: Use isTopLevelRegion
In this project, I plan to spend two weeks to reduce canonicalization
overhead.
It was a good idea to write down what you plan to do each week.
Week 1: Profile the compiling overhead of each canonicalization pass,
including PromoteMemoryToRegisterPass, CFGSimplificationPass,
ReassociatePass, LoopRotatePass, InstructionCombiningPass,
IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.
Week 2: Remove or improve the one or two most expensive canonicalization
passes. I will also try to revise the pass ordering to move some
expensive canonicalization passes later.
Instead of speeding up the canonicalization passes, your focus should really be on integrating Polly into the -O3 pass chain without the need for any additional canonicalization passes. This part is not so much about the patch itself that implements it. It rather requires a careful analysis of how the number of detected scops changes when moving Polly.
At the moment we optimized for optimal scop coverage while neglecting compile time. Now we want both, optimal scop coverage and good compile time.
Another point that can be mentioned is removing the need for induction
variable canonicalization. We currently do this using the -polly-indvars pass. However, the new option -polly-codegen-scev enables us to remove this pass entirely. This could also be an interesting performance
question, as -polly-codegen-scev produces a lot cleaner LLVM-IR at code generation time, which may take more time to generate but may also
require less time to be cleaned up. This could be interesting to investigate.
StageII -- Remove or rewrite expensive analysis passes for compiling
performance. [Week 3-5]
There are many optimization libraries for Polly, such as ScopLib, Pluto,
ISL and Jason optimization. To balance the tradeoff between code
JSON
performance and compiling overhead, I will profile each optimization
library and try to improve some of these libraries to reduce compiling
overhead.
The only relevant one is currently isl. It may in some cases be useful to compare against Pluto, though. No need to optimize scoplib or JSON.
Week 3: Profile the compiling overhead of each Polly optimization
library, including ScopLib, Pluto, ISL and JSON.
Instead of profiling per library, I would rather profile per Polly pass
using --time-passes
You could do this later for several programs, but it would be good to have this already today for a single program to get an idea where time is spent and what needs optimization.
Week 4: Profile the compiling overhead of each optimization pass for one
or two libraries (such as ISL and ScopLib). For example, ISL optimization
provides many optimization passes, such as dependence simplification,
schedule optimization, and various loop transformations.
Week 5: Remove some expensive optimization passes and rewrite some
critical but expensive optimization passes.
StageIII -- Improve code generation passes for compiling performance.
[Week 6-9]
Our evaluations show that the Polly code generation passes are very
expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
code generation passes can increase the compiling time by 500% or more
(see table 1). My plan is to improve various code generation passes.
Can you verify your numbers here? You report for ludcmp the following:
          clang    pBasic   pNoOpt   pNoGen   pOpt
ludcmp.c  0.157    0.1602   0.2002   1.0761   1.3175
          pBasic%  pNoGen%  pNoOpt%  pOpt%
          2.04%    27.52%   585.41%  739.17%
I have the feeling the headings of the pNoGen% and pNoOpt% columns have been switched accidentally. At least from the numbers above, I see an increase from 0.16 to 0.20 for code generation, which is far from being a 500% increase. On the other side, the optimization itself seems to add a larger amount of time, as does the code generation of the optimized code.
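The suspected swap can be checked directly from the raw times; recomputing the percentages (a quick sketch, using the ludcmp.c numbers quoted above) shows each reported percentage matches the other column's time:

```python
# Recompute the ludcmp.c percentages from the raw times quoted above
# to check whether the pNoGen% / pNoOpt% headings were swapped.

t = {"clang": 0.157, "pNoOpt": 0.2002, "pNoGen": 1.0761}

def pct(name):
    """Overhead of one configuration over clang -O3, in percent."""
    return round((t[name] - t["clang"]) / t["clang"] * 100, 2)

print(pct("pNoOpt"))  # 27.52  -- the value reported under pNoGen%
print(pct("pNoGen"))  # 585.41 -- the value reported under pNoOpt%
```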
Week 6: Profile the compiling overhead of each Polly code generation
pass, especially for ISL code generation.
Week 7: Remove unnecessary analyses for code generation. Currently, the
Polly code generation pass depends on a lot of analysis passes such as
DominatorTree, IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection,
ScopInfo. I will try to remove some of the expensive analysis passes.
Those passes add little overhead to the code generation. In fact, the analysis is normally already available, such that these analysis requirements are essentially free. They have been added here mainly to allow the code generation to update them, such that we do not need to spend time rebuilding them later.
Week 8-9: Rewrite some expensive functions for Polly code generation
based on profiling information.
This is still very vague.
StageIV -- Let Polly bail out early. [Week 10]
Week 10: Add support in canicalization step or optimization step to
Typo -----> canonicalization
allow Polly to bail out early if it cannot benefit a program.
StageV -- Improve other parts. [Week 11-12]
Week 11: Improve other parts of Polly. In particular, I will focus on
some expensive helper functions such as the TempScop analysis, which is
critical and expensive.
How do you know TempScop is expensive?
Week 12: Integrate all improvements and evaluate the whole of Polly
with multiple benchmarks.
I think the only way to do this project is to continuously evaluate your changes on Polybench and mediabench and to directly integrate them
into the svn repository. This should be made clear at the beginning and
I believe it is very fine to spend more time on the individual steps,
such that we can make sure the changes are properly evaluated and integrated.
4. Profit for LLVM users and Polly users
This project can benefit both LLVM users and Polly users. For LLVM
users, our project will make Polly more acceptable if it provides extra
performance gains with little extra compile-time overhead. For Polly
users, this project will make Polly more powerful by significantly
reducing compile-time overhead and improving code quality.
Nice.
You could make your goals more concrete by saying that we want to show
that by enabling Polly we can significantly optimize the polybench
benchmarks, while at the same time no prohibitively large compile-time
increase is seen for mediabench. Reaching this goal would be a great
step forward.
[Attachments]
Our evaluation is based on an Intel Pentium Dual CPU T2390 (1.86GHz) with
2GB DDR2 memory. Each benchmark is run multiple times and data are
collected using ministat (https://github.com/codahale/ministat). Results
are shown in table 1 and table 2. Five cases are tested
(alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly"):
*clang:  clang -O3
*pLoad:  clang -O3 -load LLVMPolly.so
*pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
*pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none
*polly:  pollycc -O3
Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
times)
              clang   pBasic  pNoOpt  pNoGen  pOpt    pBasic% pNoGen% pNoOpt% pOpt%
2mm.c         0.1521  0.1593  0.1711  0.3235  0.7247  4.73%   12.49%  112.69% 376.46%
atax.c        0.1386  0.1349  0.1449  0.2066  0.313   0.00%   0.00%   49.06%  125.83%
covariance.c  0.1498  0.1517  0.1526  0.3561  0.7706  1.27%   1.87%   137.72% 414.42%
gemver.c      0.1562  0.1587  0.1724  0.2674  0.3936  1.60%   10.37%  71.19%  151.99%
instrument.c  0.1062  0.1075  0.1124  0.123   0.1216  0.00%   5.84%   15.82%  14.50%
ludcmp.c      0.157   0.1602  0.2002  1.0761  1.3175  2.04%   27.52%  585.41% 739.17%
3mm.c         0.1529  0.1559  0.1826  0.4134  1.0436  1.96%   19.42%  170.37% 582.54%
bicg.c        0.1244  0.1268  0.1353  0.1977  0.2828  1.93%   8.76%   58.92%  127.33%
doitgen.c     0.1492  0.1505  0.1644  0.3325  0.8971  0.00%   10.19%  122.86% 501.27%
gesummv.c     0.1224  0.1279  0.134   0.1999  0.2937  4.49%   9.48%   63.32%  139.95%
jacobi.c      0.1444  0.1506  0.1592  0.3912  0.8494  0.00%   10.25%  170.91% 488.23%
seidel.c      0.1337  0.1353  0.1462  0.6299  0.9155  0.00%   9.35%   371.13% 584.74%
adi.c         0.1593  0.1621  0.1835  1.4375  1.849   1.76%   15.19%  802.39% 1060.70%
correlation.c 0.1579  0.1596  0.1802  0.3393  0.6337  1.08%   14.12%  114.88% 301.33%
gemm.c        0.1407  0.1432  0.1576  0.2421  0.4477  1.78%   12.01%  72.07%  218.20%
gramschmidt.c 0.1331  0.1349  0.1509  0.3069  0.4138  0.00%   13.37%  130.58% 210.89%
lu.c          0.1419  0.1443  0.1581  0.3156  0.3943  1.69%   11.42%  122.41% 177.87%
average                                               1.26%   13.22%  248.47% 393.80%
To improve readability, it may be worth ensuring this fits into 80 columns. You may be able to reduce the number of digits used here.
You could probably increase the readability of your proposal further if you use markdown. See here for an example of how a markdown file looks at github: https://gist.github.com/micmcg/976172 and here the raw version
https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md
You basically need to use the file ending '.md' and you can then use markdown syntax to format your text. The very same syntax will also improve the readability of the proposal on the mailing list.
All the best,
Tobias