This is the proposal for parallelizing the post-IPO stage. See the following for details.
I also attach a toy-grade rudimentary implementation. This implementation can be used to illustrate some concepts here. This patch is not meant to be committed.
Unfortunately, this weekend I will be too busy to read emails. Please do not take a delayed response as being rude :-).
Thanks a lot in advance for your time and insightful comments!
It is organized as follows:
1) background info; if you have heard of "/usr/bin/ls", please skip it
2) the motivation for parallelizing the post-IPO stage
3) how to parallelize post-IPO
4) the linker problems.
5) the toy-grade rudimentary implementation
1. Some background
Interprocedural-optimization compilation, a.k.a. IPO or IPA, typically consists of three stages:
S1) pre-IPO
Each function goes through some analysis and not-very-aggressive optimizations. Some information is collected during this stage; this info will be fed to the IPO stage, and is usually called summary info.
The result of this stage is "fake-objects": binary files that use some known object format to encapsulate the IR as well as the summary info.
S2) IPO
The compiler works with the linker to resolve and merge symbols in the "fake-objects". Then interprocedural analyses (IPA) are invoked to perform analysis, either based on the summary info or directly on the IR. Interprocedural optimizations (IPO) are applied based on the IPA results.
In some compilers, IPA and IPO are separated. One reason is that many IPAs can be conducted directly on the concise summary info, while many IPOs need to load the IR and bulky annotation/metadata into memory.
S3) post-IPO
This typically consists of Loop-Nest-Opt, Scalar Opt, Code-Gen, etc. While these are intra-procedural analyses/optimizers, they may directly benefit from the info collected in the IPO stage and passed down the road.
LLVM collectively calls S2 and S3 "LTO CodeGen", which is very confusing.
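To make S2's symbol-resolution/merge step concrete, below is a minimal sketch using LLVM's IR linker. It is illustrative only: the API shown is the modern llvm::Linker interface (older releases spell it Linker::LinkModules), the input file names are made up, and a real LTO driver would also consult the system linker's symbol resolution.

    // Sketch: merge two bitcode "fake-objects" into one module (S2's merge step).
    // Assumes a recent LLVM; a.bc/b.bc are hypothetical pre-IPO outputs.
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IRReader/IRReader.h"
    #include "llvm/Linker/Linker.h"
    #include "llvm/Support/SourceMgr.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    int main() {
      LLVMContext Ctx;
      SMDiagnostic Err;
      std::unique_ptr<Module> Dst = parseIRFile("a.bc", Err, Ctx);
      std::unique_ptr<Module> Src = parseIRFile("b.bc", Err, Ctx);
      if (!Dst || !Src) {
        Err.print("lto-sketch", errs());
        return 1;
      }
      // Resolve and merge symbols; returns true on error.
      if (Linker::linkModules(*Dst, std::move(Src))) {
        errs() << "link failed\n";
        return 1;
      }
      outs() << "merged module has " << Dst->size() << " functions\n";
      return 0;
    }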
2. Why parallelize post-IPO stage
R1) To improve scalability
It is quite obvious that we are not able to put everything about a
program in memory at once.
Even if you can afford an expensive computer, the address space of a single compiler process cannot accommodate a monster program.
R2) To take advantage of ample HW resources to shorten compile time.
R3) To make debugging a lot easier.
One can triage problems in a much smaller partition rather than in the huge monster program.
This proposal is not able to achieve goal R1 at this moment, because during the IPO stage the compiler currently brings everything into memory at once.
3. How to parallelize post-IPO stage
From 5,000 feet up, the concept is very simple:
step 1) divide the merged IR into small pieces,
step 2) compile each of these pieces independently,
step 3) feed the objects of the pieces back to the linker, to be linked into an executable or a dynamic lib.
Sections 3.1 through 3.3 describe these three steps respectively.
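As a rough illustration of step 1, here is a sketch that splits a merged module into N pieces by round-robin over function definitions, using CloneModule with a should-clone predicate (the helper name and the round-robin policy are mine, not part of the proposal). It is a toy: it leaves every piece in the same LLVMContext (a real driver would round-trip each piece through bitcode into its own context or process), and it duplicates global variables into every piece rather than handling internal linkage properly.

    // Toy sketch of step 1: round-robin partition of a merged module.
    // A real partitioner must handle internal linkage, aliases, and
    // initializers that reference functions.
    #include "llvm/IR/Module.h"
    #include "llvm/Support/Casting.h"
    #include "llvm/Transforms/Utils/Cloning.h"
    #include "llvm/Transforms/Utils/ValueMapper.h"
    #include <memory>
    #include <vector>

    using namespace llvm;

    std::vector<std::unique_ptr<Module>>
    partitionModule(const Module &Merged, unsigned NumPieces) {
      std::vector<std::unique_ptr<Module>> Pieces;
      for (unsigned I = 0; I < NumPieces; ++I) {
        unsigned Idx = 0;
        ValueToValueMapTy VMap;
        // Clone only every NumPieces-th function definition into piece I;
        // functions that are not cloned become declarations.
        Pieces.push_back(CloneModule(
            Merged, VMap, [&](const GlobalValue *GV) {
              if (const auto *F = dyn_cast<Function>(GV))
                if (!F->isDeclaration())
                  return (Idx++ % NumPieces) == I;
              return true; // toy: duplicate globals into every piece
            }));
      }
      return Pieces;
    }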
Yes, this is one approach. I think others at Google have looked into this sort of partitioning with GCC and found that the one thing which really helps you pick the partitions is to profile the program and make the partitions based on the actual paths of functions seen by the profile. I don't think they saw much improvement without that. See
I do want to mention some other things we can do to parallelize. You may already know of these and have considered them and rejected them before deciding on the design you emailed out here, but I want to list them since there's already another thread with a different approach on the mailing list:
* Use what we have. LTO partitions at a time, directories for instance, on the premise that LTO'ing them will produce something smaller than the sum of their inputs. Then when you do the final whole-program step, it will receive smaller inputs than it otherwise would. The change to LLVM here is to fix the inliner (or other optimizations?) to reduce the chance that LTO produces an output larger than its input.
* Parallelize the per-function code generator within a single LLVMContext.
CodeGen presently operates on a per-function basis, and is structured as an
analysis over LLVM IR. There shouldn't be any global state, and there
shouldn't be any need for locking accesses to the IR since nothing will be
mutating it. This will even speed up clang -O0, and is a solid first step
that gets a thread-creation API into LLVM.
- What happened to "one LLVMContext per thread"? Okay, that rule is a
little white lie. Always was. LLVMContext allows two libraries to use llvm
under the hood without interfering with each other (even to the point of
separate maps of types to keep one library from causing slow type lookups in the other). LLVM also doesn't have locking around accesses to the IR, and makes few guarantees about how many things a single mutating operation will need to look at or change, but with those caveats in mind it is possible to share a
context across threads. Because CodeGen is structured as an analysis over
the IR without mutating the IR, it should work. There's probably still global state in the code generator somewhere, but it's not a structural problem.
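To illustrate the "read-only analyses over a shared context" claim, here is a small sketch: several threads walk disjoint slices of one module in one LLVMContext without any locking, computing a trivial per-function statistic. The file name and the slicing policy are made up; this only stays safe as long as nothing mutates the IR, which is the premise of the bullet above.

    // Sketch: lock-free *read-only* traversal of one module from many
    // threads sharing a single LLVMContext, mirroring how per-function
    // CodeGen could behave. Safe only because no thread mutates the IR.
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IRReader/IRReader.h"
    #include "llvm/Support/SourceMgr.h"
    #include "llvm/Support/raw_ostream.h"
    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    using namespace llvm;

    int main() {
      LLVMContext Ctx;
      SMDiagnostic Err;
      std::unique_ptr<Module> M = parseIRFile("merged.bc", Err, Ctx);
      if (!M) { Err.print("codegen-sketch", errs()); return 1; }

      std::vector<Function *> Fns;
      for (Function &F : *M)
        if (!F.isDeclaration())
          Fns.push_back(&F);

      const unsigned NThreads = 4;
      std::atomic<uint64_t> TotalInsts{0};
      std::vector<std::thread> Workers;
      for (unsigned T = 0; T < NThreads; ++T)
        Workers.emplace_back([&, T] {
          // Each worker reads a disjoint slice; no locks needed.
          for (unsigned I = T; I < Fns.size(); I += NThreads) {
            uint64_t N = 0;
            for (const BasicBlock &BB : *Fns[I])
              N += BB.size();
            TotalInsts += N;
          }
        });
      for (auto &W : Workers) W.join();
      outs() << "instructions: " << TotalInsts.load() << "\n";
      return 0;
    }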
* Parallelize the function passes, and SCCs that are siblings in the call tree (again within a single LLVMContext). The gnarly part of this is that globals have shared use-lists which are updated as we modify each function individually. Those use-lists either need to be protected by locks, replaced with a lockless list, or removed entirely. Probably both, as GlobalVariables have use-lists we actually use in the optimizers, but we don't actually need the use-list for "i32 0".
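As a sketch of the "lockless list" option, here is a minimal lock-free singly linked prepend in plain C++. It is a stand-in for a use-list node, not LLVM's actual Use machinery; it only shows the compare-and-swap idiom such a list could be built on.

    // Sketch: lock-free prepend to a shared list, as a use-list stand-in.
    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct UseNode {
      void *User;   // the user of the global (opaque here)
      UseNode *Next;
    };

    struct LocklessUseList {
      std::atomic<UseNode *> Head{nullptr};

      void push(void *User) {
        auto *N = new UseNode{User, nullptr};
        N->Next = Head.load(std::memory_order_relaxed);
        // Retry until Head is atomically swung from N->Next to N;
        // compare_exchange_weak reloads N->Next on failure.
        while (!Head.compare_exchange_weak(N->Next, N,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
      }
    };

    int main() {
      LocklessUseList L;
      std::vector<std::thread> Ts;
      for (int T = 0; T < 4; ++T)
        Ts.emplace_back([&, T] {
          for (int I = 0; I < 1000; ++I)
            L.push(reinterpret_cast<void *>(static_cast<intptr_t>(T)));
        });
      for (auto &T : Ts) T.join();
      int Count = 0;
      for (UseNode *N = L.Head.load(); N; N = N->Next) ++Count;
      std::printf("%d uses recorded\n", Count); // expect 4000
      return 0;
    }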
* Parallelize ModulePasses by splitting them into an analysis phase and an optimization phase. Make each per-TU build emit the .bc as usual plus an analysis-file (for instance, the call graph, or "which functions read/modify which globals"). Merge all the analysis-files and farm them back out to be used as input to the programs optimizing each .bc individually -- but now they have total knowledge of the whole-program call graph and other AA information.
- You can combine this with an RPC layer to give each worker the ability
to ask for the definition of a function from another worker. LLVM already
supports "unmaterialized" functions where the definition is loaded lazily.
The analysis part should arrange to give us enough information to decide whether we want to do the inlining; then, if we decide to materialize the function, we get its body from another worker.
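The lazy-materialization support mentioned above can be seen in miniature below: the module is opened lazily, and a function body is only pulled in when explicitly materialized. The RPC layer is not shown; in this sketch the "remote" body simply comes from the local bitcode file that getLazyIRFileModule reads, and both the file name and the function name are made up.

    // Sketch: lazy function bodies (modern API; error handling trimmed).
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IRReader/IRReader.h"
    #include "llvm/Support/Error.h"
    #include "llvm/Support/SourceMgr.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    int main() {
      LLVMContext Ctx;
      SMDiagnostic Err;
      // Bodies stay on disk until materialized.
      std::unique_ptr<Module> M = getLazyIRFileModule("a.bc", Err, Ctx);
      if (!M) { Err.print("lazy-sketch", errs()); return 1; }

      if (Function *F = M->getFunction("hot_callee")) { // hypothetical name
        // A summary-driven inlining decision would go here;
        // materialize the body only once we decide we need it.
        if (Error E = F->materialize()) {
          errs() << toString(std::move(E)) << "\n";
          return 1;
        }
        outs() << F->getName() << " now has " << F->size() << " blocks\n";
      }
      return 0;
    }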
* Parallelize by splitting into different LLVMContexts. This spares us the
difficulty of adding locks or otherwise changing the internals of LLVM,
gets us the ability to spread the load across multiple machines, and if
combined with the RPC idea above you can also get good inlining without
necessarily loading the whole program into memory on a single machine.
I'm not planning to do the work any time soon so count my vote with that in
mind, but if you ask me I think the first step should be to parallelize the
backend within a single LLVMContext first, then to parallelize the function
passes and CGSCC passes (across siblings only of course) second. Removing
the use-list from simple constants is a very interesting thing to do to
decrease lock contention, but we may want to do something smarter than just
remove it -- consider emitting a large constant that is only used by an
inline function. It is possible to emit a table of constants in the same COMDAT group as the function; then, if the inline function is discarded by the linker, the constants are discarded with it. I don't have a concrete
proposal for that.
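For what it's worth, the mechanics of tying a constant table to a function's COMDAT can be sketched with the C++ API. The names here are hypothetical, and whether the optimizers should ever produce this layout is exactly the open question above.

    // Sketch: put a constant table in the same COMDAT group as the
    // function that uses it, so discarding the function discards the data.
    #include "llvm/IR/Comdat.h"
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/GlobalVariable.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    void addTableInFunctionComdat(Module &M) {
      LLVMContext &Ctx = M.getContext();

      // Hypothetical inline function "inlinee" with linkonce_odr linkage.
      FunctionType *FTy = FunctionType::get(Type::getVoidTy(Ctx), false);
      Function *F = Function::Create(FTy, GlobalValue::LinkOnceODRLinkage,
                                     "inlinee", &M);

      Comdat *C = M.getOrInsertComdat("inlinee");
      C->setSelectionKind(Comdat::Any);
      F->setComdat(C);

      // The constant table lives in the same group: if the linker drops
      // this copy of the function, it drops the table with it.
      ArrayType *Ty = ArrayType::get(Type::getInt32Ty(Ctx), 4);
      auto *Tbl = new GlobalVariable(
          M, Ty, /*isConstant=*/true, GlobalValue::LinkOnceODRLinkage,
          ConstantAggregateZero::get(Ty), "inlinee.table");
      Tbl->setComdat(C);
    }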