[Proposal] Parallelize post-IPO stage.

Hi there:

   This is a proposal for parallelizing the post-IPO stage; see the following for details.

   I have also attached a toy-grade, rudimentary implementation that can be used to
illustrate some of the concepts here. This patch is not going to be committed.

   Unfortunately, this weekend I will be too busy to read emails. Please do not construe
a delayed response as rudeness :-).

Thanks a lot in advance for your time and insightful comments!

Shuxin

The proposal

post-ipo.patch (28.2 KB)

While I am a self-proclaimed multi-process red-neck, in this case I would prefer to see a multi-threaded implementation because I want to verify that LLVMContext can be used as advertised. I’m sure some extra care will be needed to report failures/diagnostics, but we should start with the assumption that this approach is not significantly harder than multi-process because that’s how we advertise the design.

If any of the multi-threaded disadvantages you point out are real, I would like to find out about it.

  1. Race Conditions: We should be able to verify that the thread-parallel vs. sequential or multi-process compilation generate the same result. If they diverge, we would like to know about the bug so it can be fixed–independent of LTO.

  2. Small Address Space with LTO. We don’t need to design around this hypothetical case.

  3. Expensive thread-safe runtime lib. We should not speculate that platforms that we, as the LLVM community, care about have this problem. Let’s assume that our platforms are well implemented unless we have data to the contrary. (Personally, I would even love to use TLS in the compiler to vastly simplify API design in the backend, but I am not going to be popular for saying so).

We should be able to decompose each step of compilation for debugging. So the multi-process “implementation” should just be a degenerate form of threading with a bit of driver magic if you want to automate it.

-Andy

I don’t know if it’s feasible, but stable linker output, independent of the partitioning, is highly desirable. One of the most irritating performance regressions to track down involves different versions of the host linker. If partitioning decisions are thrown into the mix, this could be annoying. Is it possible for the final link to do a better job cleaning up?

-Andy

While I haven't yet read the rest of the proposal I'm going to comment
on this in particular. In my view this is an absolute requirement as
the compiler should produce the same output given the same input every
time with no deviation.

-eric

The partitioning should be deterministic. It’s just that the linker output now depends on the partitioning heuristics. As long as that decision is based on the input (not the host system), then it still meets Eric’s requirements. I just think it’s unfortunate that post-IPO partitioning (or more generally, parallel codegen) affects the output, but that may be hard to avoid. It would be nice to be able to tune the partitioning for compile time without worrying about code quality.

Sorry for the tangential thought here… it seems that most of Shuxin’s proposal is actually independent of LTO, even though the prototype and primary goal is enabling LTO.

-Andy

Yes, of course. We should be able to save the IR before and after each major step, and when troubleshooting we should be able to focus on one smaller partition. Once the problem in that partition is fixed, we can manually link all the partitions and the other libs into the final executable/dyn-lib. This is one important reason for partitioning.

Yes! That is why I’d like to implement both. There is one difference, though: with the multi-process implementation, we need to pass the /path/to/{llc/opt} to libLTO.{so|dylib} so that it can invoke these tools from the right place, while in the multi-thread implementation we don’t need this info.

In theory, it is not possible. But I guess in practice it is almost stable: if the partitioning's tuning parameters remain unchanged, the partition should remain unchanged as well.

Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline.

3.1.1 Figure out Partition scheme
----------------------------------
we randomly pick some functions and put them in a partition.
It would be nice to perform some optimization at this moment. One opt
in my mind is to reorder functions in order to reduce the working set
and improve locality.

Unfortunately, this opt seems to be a bit blind at this time, because
   - the CallGraph is not annotated with estimated or profiled frequency, and
   - some linkers don't respect the order. It seems they just
     remember the function order of the pristine input obj/fake-obj,
     and enforce this order at the final link (link-exec/shared-lib) stage.

Anyway, I try to ignore all these problems, and perform the partitioning
via the following steps. Maybe we will have some luck on some platforms:

o. DFS the call-graph, ignoring the self-recursive edges; if frequency info
    is available, prioritize the edges (i.e. those corresponding to call-sites)
    such that frequent edges are visited first.

o. Cut the DFS spanning tree obtained from the previous step bottom-up.
    Each cut/partition contains a reasonable # of functions, and the aggregate
    size of the functions in the partition should not exceed a predefined
    threshold.

I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?

o. Repeat the previous step until the call-graph's DFS spanning tree
    is empty.
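
To make the cut step concrete, here is a rough sketch of the bottom-up cut.
The container type, size helper, and thresholds are illustrative only, not
the actual patch:

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Function.h"
  #include <vector>

  struct Partition {
    std::vector<llvm::Function *> Funcs;
    unsigned Size = 0; // aggregate instruction count
  };

  // Rough size metric: count the IR instructions in the function.
  static unsigned funcSize(const llvm::Function &F) {
    unsigned N = 0;
    for (const llvm::BasicBlock &BB : F)
      N += BB.size();
    return N;
  }

  // Walk the bottom-up DFS order produced by the previous step and close
  // the current partition whenever either threshold would be crossed.
  std::vector<Partition>
  cutBottomUp(const std::vector<llvm::Function *> &DFSBottomUp,
              unsigned MaxFuncs, unsigned MaxSize) {
    std::vector<Partition> Parts;
    Partition Cur;
    for (llvm::Function *F : DFSBottomUp) {
      unsigned FSize = funcSize(*F);
      if (!Cur.Funcs.empty() &&
          (Cur.Funcs.size() >= MaxFuncs || Cur.Size + FSize > MaxSize)) {
        Parts.push_back(std::move(Cur));
        Cur = Partition();
      }
      Cur.Funcs.push_back(F);
      Cur.Size += FSize;
    }
    if (!Cur.Funcs.empty())
      Parts.push_back(std::move(Cur));
    return Parts;
  }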

3.1.2 Partition transformation
------------------------------

This is a bit involved. There are a bunch of problems we have to tackle.
1) When the use and the def of a symbol are separated into different modules,
    its attributes, like linkage and visibility, need to be changed
    as well.

     [Example 1] If a symbol is flagged as "internal" to the module where
    it is defined, its linkage needs to be changed into "internal to
    the executable/lib being compiled".

     [Example 2] For compile-time constants, the initializer
    needs to be cloned into the partitions where it is referenced.
    The rationale is to let the post-IPO passes take advantage
    of the initialized value to squeeze out some performance.

     In order not to bloat the code size, the cloned constant should
    be marked "don't emit". [end of eg2]

      Being able to precisely update symbols' attributes is not only
    vital to correctness, it has a significant impact on
    performance as well.

      I have not yet investigated this issue thoroughly. My
    rudimentary implementation simply flags a symbol "external" when its
    use and def are separated into different modules. I believe this is one
    of the most difficult parts of this work. I guess it is going to
    take a long time to become stable.
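
For a flavor of what the attribute rewriting could look like, here is a
minimal sketch. The two predicates are hypothetical, and available_externally
is just one plausible way to spell "don't emit":

  #include "llvm/IR/GlobalValue.h"
  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/Module.h"

  bool isUsedByOtherPartition(const llvm::GlobalVariable &GV); // hypothetical
  bool isClonedConstant(const llvm::GlobalVariable &GV);       // hypothetical

  // Sketch: fix up symbol attributes in one partition after splitting.
  void adjustSymbolAttrs(llvm::Module &Part) {
    for (llvm::GlobalVariable &GV : Part.globals()) {
      // A formerly module-internal symbol whose def and uses were split
      // apart must become visible to the other partitions.
      if (GV.hasLocalLinkage() && isUsedByOtherPartition(GV))
        GV.setLinkage(llvm::GlobalValue::ExternalLinkage);
      // A cloned compile-time constant: keep the initializer visible to
      // the post-IPO optimizers, but let only one partition emit it.
      else if (isClonedConstant(GV))
        GV.setLinkage(llvm::GlobalValue::AvailableExternallyLinkage);
    }
  }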

2) In order to compile each partition in a separate thread (see
    Section 3.2), we have to put the partitions in distinct LLVMContexts.

    I could be wrong, but I cannot find code that is able to
    perform function cloning across LLVMContexts.

    My workaround in the patch is to perform the function cloning within
   one LLVMContext (but in a different Module, of course), then
   save the module to a disk file, and load it back into memory using a
   new LLVMContext.

    It is a bit circuitous and expensive.
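
The same workaround can go through an in-memory buffer instead of a disk
file. Roughly (a sketch; most error handling omitted, and the exact bitcode
entry points have moved around between LLVM versions):

  #include "llvm/Bitcode/BitcodeReader.h"
  #include "llvm/Bitcode/BitcodeWriter.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/Error.h"
  #include "llvm/Support/MemoryBuffer.h"
  #include "llvm/Support/raw_ostream.h"

  // Sketch: "clone" a module into a fresh LLVMContext by serializing it
  // to bitcode and parsing it back under the new context.
  std::unique_ptr<llvm::Module>
  cloneIntoNewContext(const llvm::Module &M, llvm::LLVMContext &NewCtx) {
    std::string Buf;
    llvm::raw_string_ostream OS(Buf);
    llvm::WriteBitcodeToFile(M, OS);
    OS.flush();
    auto MB = llvm::MemoryBuffer::getMemBuffer(Buf, "partition", false);
    auto MOrErr = llvm::parseBitcodeFile(MB->getMemBufferRef(), NewCtx);
    if (!MOrErr) {
      llvm::consumeError(MOrErr.takeError());
      return nullptr;
    }
    return std::move(*MOrErr);
  }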

Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?

Evan

+1

I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?

Say, each partition should contain:
     - no more than 100 functions, and
     - functions whose total size does not exceed a predefined threshold.

These thresholds can be overridden on the command line.
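
If we go the command-line route, plain cl::opt flags would do. The flag
names below are made up:

  #include "llvm/Support/CommandLine.h"

  // Hypothetical tuning knobs for the partitioner.
  static llvm::cl::opt<unsigned> MaxFuncsPerPartition(
      "lto-partition-max-funcs", llvm::cl::init(100),
      llvm::cl::desc("Max number of functions per post-IPO partition"));

  static llvm::cl::opt<unsigned> MaxPartitionSize(
      "lto-partition-max-size", llvm::cl::init(50000),
      llvm::cl::desc("Max aggregate instruction count per partition"));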

Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?

We may fix it; I don't know for sure whether it is a big gain at this moment.

The issue is that, as far as I can tell, the current code base does not have
functions that support copying IR across different LLVMContexts.

For example, when it copies an instruction from src to dest,
it checks the "src", takes a look at its Type, derives the LLVMContext from
that Type, and uses the same context for the dest. So we would need to change
the existing code.
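
The heart of the problem is that types (and constants) are uniqued per
context, so a Type* fished out of the source value cannot be reused when
building the destination. A tiny illustration:

  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Type.h"

  // The "same" type is a different object in each context, so cross-context
  // cloning must re-derive every type in the destination context instead of
  // reusing Src->getType() directly.
  void twoContexts() {
    llvm::LLVMContext A, B;
    llvm::Type *I32A = llvm::Type::getInt32Ty(A);
    llvm::Type *I32B = llvm::Type::getInt32Ty(B);
    // I32A != I32B, even though both print as "i32".
    (void)I32A;
    (void)I32B;
  }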

Say, each partition should contain:
   - no more than 100 functions, and
   - functions whose total size does not exceed a predefined threshold.

These thresholds can be overridden on the command line.

But how do you arrive at the thresholds? And are they fixed thresholds, or determined based on an analysis of function size, etc.?

Evan

Yes, we just count the total # of instructions.
I don't know which threshold is "comfortable" at this moment. We can keep
adjusting it until we feel comfortable with it.
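
Counting is cheap; for a whole partition it is just (sketch):

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Module.h"

  // Measure a partition's size as its total IR instruction count.
  unsigned moduleInstructionCount(const llvm::Module &M) {
    unsigned N = 0;
    for (const llvm::Function &F : M)
      for (const llvm::BasicBlock &BB : F)
        N += BB.size();
    return N;
  }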

I recall another reason for not adding functionality for cloning functions across LLVMContexts.

Currently, LTO merges everything into a "merged module" before IPO and post-IPO start.

I believe this will have to change in the future if we need to improve scalability. As with other compilers,
some passes only care about the "merged symbol table" and the "merged call-graph", and parts of the
merged module can be loaded on demand.

I guess function cloning would need some changes if we switch to that mode. So for now, we are better off using
whatever we have today, instead of messing around implementing something that is about to change in the future.

A third approach is to decouple the backend compilation and the
parallelism strategy from the partitioning. The partitioning can
spit out partition BC files and some action records in a standard
format. All of this can be fed into a driver tool that converts
the compilation action file into a make/build file for the underlying
build system of your choice:

1) it can simply be a compiler driver that does thread-level parallelism;
2) or a tool that generates Makefiles which are fed into parallel make
to exploit single-node parallelism;
3) or a tool that generates BUILD files that feed into a distributed
build system (such as Google's blaze:
http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html)
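
For illustration, the Makefile a tool like 2) emits might look roughly like
this; the file names, variables, and flags are all made up (and recipe lines
must be indented with tabs):

  # Hypothetical generated Makefile: one rule per partition, so that
  # "make -j<N>" supplies the parallelism.
  PARTS := part0.o part1.o part2.o

  merged.o: $(PARTS)
          $(LD) -r -o $@ $(PARTS)

  part%.o: part%.bc
          $(LLC) -filetype=obj -o $@ $<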

Another benefit is that it will make compiler debugging easier.

thanks,

David

I actually came up with the 3 approaches to build the post-IPO objects independently.

The "3rd approach" here is the 1st solution in my original proposal. Almost all my coworkers said it sucks :-)
Now I accept that, because it has no way to be adaptive.

Consider the scenario where we compile the LLVM compiler itself. We use "make -j16" on a
computer with 8 processors; each make job invokes a compiler which may blindly spawn 16 threads!
So we end up with 16*16 threads.

Being adaptive makes it possible to pick the right parallelism factor judiciously.
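
By "adaptive" I mean, at the very least, clamping to what the machine
reports. A minimal sketch:

  #include <algorithm>
  #include <thread>

  // Never spawn more threads than the machine has hardware threads,
  // whatever parallelism was requested. hardware_concurrency() may
  // return 0, hence the max().
  unsigned pickThreadCount(unsigned Requested) {
    unsigned HW = std::max(1u, std::thread::hardware_concurrency());
    return std::min(Requested, HW);
  }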

In any case, I will support this approach (i.e. the 3rd approach you mentioned), at the very least at the beginning.

Determining the right parallelism is not the job of the compiler
(builtin) nor that of a developer -- the underlying build system
should take care of the scheduling :-)

David

People told me there are some magic libs which are able to figure out how
heavily loaded the system is; the compiler just needs to link against them.
I have not yet had a chance to try these libs, nor do I know their names at
this moment.
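
One candidate on Unix-like systems is getloadavg(3). A sketch of backing
off when the machine is already busy; the policy here is purely illustrative:

  #include <algorithm>
  #include <cstdlib> // getloadavg() on BSD/Linux/macOS
  #include <thread>

  // Shrink the worker count when the 1-minute load average says the
  // machine is already busy.
  unsigned adaptiveWorkers() {
    unsigned HW = std::max(1u, std::thread::hardware_concurrency());
    double Load[1] = {0.0};
    if (getloadavg(Load, 1) == 1 && Load[0] >= HW)
      return 1; // machine saturated: stay out of the way
    return HW - static_cast<unsigned>(Load[0]); // leave headroom for others
  }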

Since how we compile the post-LTO partitions is a small part of this project, I'd like to pick
whichever is easier first. We may (very likely) end up supporting all three approaches.

The proposal
------------
  It is organized as follows:
   1) background info; if you have heard of "/usr/bin/ls", please skip it
   2) the motivation for parallelizing the post-IPO stage
   3) how to parallelize post-IPO
   4) the linker problems
   5) the toy-grade rudimentary implementation
   6) misc

1. Some background
------------------

  Interprocedural-optimization compilation, aka IPO or IPA, typically
consists of three stages:

  S1) pre-IPO
    Each function goes through some analysis and not-very-aggressive
    optimizations. Some information is collected during this stage; this
    info will be passed on to the IPO stages. This info is usually called
    summary info.

    The result of this stage is "fake objects", which are binary files using
    some known object format to encapsulate the IR as well as the summary
    info, along with other stuff.

  S2) IPO:
    The compiler works with the linker to resolve and merge symbols in the
    "fake objects".

    Then interprocedural analyses (IPA) are invoked to perform
    interprocedural analysis, either based on the summary info or directly
    on the IR.

    Interprocedural optimizations (IPO) are invoked based on the IPA results.

    In some compilers, IPA and IPO are separated. One reason is that many
    IPAs can be conducted directly on the concise summary info, while many
    IPOs need to load IRs and bulky annotation/metadata into memory.

  S3) post-IPO:
    Typically consists of Loop-Nest-Opt, Scalar Opt, Code-Gen, etc. While
    these are intra-procedural analyses/optimizers, they may directly
    benefit from the info collected in the IPO stages and passed down the
    road.

  LLVM collectively calls S2 and S3 "LTO CodeGen", which is very confusing.

2. Why parallelize the post-IPO stage
--------------------------------------

  R1) To improve scalability.
    It is quite obvious that we are not able to hold everything about a
    monster program in memory at once.

    Even if you can afford an expensive computer, the address space of a
    single compiler process cannot accommodate a monster program.

  R2) To take advantage of ample HW resources to shorten compile time.
  R3) To make debugging a lot easier.
     One can triage problems in a much smaller partition rather than
     in the huge monster program.

  This proposal is not able to achieve goal R1 at this moment, because
during the IPO stage the compiler currently brings everything into memory
at once.

3. How to parallelize post-IPO stage
=====================================

  From 5k' high, the concept is very simple:
   step 1) divide the merged IR into small pieces,
   step 2) compile each of these pieces independently, and
   step 3) feed the objects of each piece back to the linker, to be linked
           into an executable or a dynamic lib.

  Sections 3.1 through 3.3 describe these three steps respectively.
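
In the multi-thread flavor, the driver for these three steps might be shaped
like the sketch below; partitionModule() and codeGenPartition() are
placeholders for the real partitioner and per-partition codegen:

  #include "llvm/IR/Module.h"
  #include <memory>
  #include <string>
  #include <thread>
  #include <vector>

  // Hypothetical helpers: split the merged module, and compile one
  // partition to an object file, returning its path.
  std::vector<std::unique_ptr<llvm::Module>> partitionModule(llvm::Module &M);
  std::string codeGenPartition(llvm::Module &Part);

  // Step 1: split. Step 2: compile partitions in parallel. Step 3: the
  // returned object files go back to the linker.
  std::vector<std::string>
  runPostIPO(llvm::Module &Merged, unsigned NThreads) {
    auto Parts = partitionModule(Merged);
    std::vector<std::string> Objects(Parts.size());
    std::vector<std::thread> Pool;
    for (size_t I = 0; I < Parts.size(); ++I) {
      Pool.emplace_back(
          [&Objects, &Parts, I] { Objects[I] = codeGenPartition(*Parts[I]); });
      if (Pool.size() == NThreads) { // crude throttling
        for (auto &T : Pool)
          T.join();
        Pool.clear();
      }
    }
    for (auto &T : Pool)
      T.join();
    return Objects;
  }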

Yes, this is one approach. I think others at Google have looked into this
sort of partitioning with GCC and found that the one thing which really
helps you pick the partitions is to profile the program and make the
partitions based on the actual paths of functions seen by the profile. I
don't think they saw much improvement without that. See
http://research.google.com/pubs/pub36355.html .

I do want to mention some other things we can do to parallelize. You may
already know of these and have considered them and rejected them before
deciding on the design you emailed out here, but I want to list them since
there's already another thread with a different approach on the mailing
list.

* Use what we have. LTO parts at a time, directories for instance, on
the premise that LTO'ing them will produce something smaller than the sum
of their inputs. Then, when you do the final whole-program step, it will be
receiving smaller inputs than it otherwise would. The change to LLVM here
is to fix the inliner (or other optimizations?) to reduce the chance that
LTO produces an output larger than its input.

* Parallelize the per-function code generator within a single LLVMContext.
CodeGen presently operates on a per-function basis, and is structured as an
analysis over llvm IR. There shouldn't be any global state, and there
shouldn't be any need for locking accesses to the IR since nothing will be
mutating it. This will even speed up clang -O0, and is a solid first step
that gets a thread-creation API into LLVM.
  - What happened to "one LLVMContext per thread"? Okay, that rule is a
little white lie. Always was. LLVMContext allows two libraries to use llvm
under the hood without interfering with each other (even to the point of
separate maps of types to avoid one library from causing slow type lookups
in the other). LLVM also doesn't have locking around accesses to the IR,
and few guarantees how many things a single mutating operation will need to
look at or change, but with those caveats in mind it is possible to share a
context across threads. Because CodeGen is structured as an analysis over
the IR without mutating the IR, it should work. There's probably still
global state in the code generator somewhere, but it's not a structural
problem.
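
As a shape for that first step, the fan-out might look like the following
sketch; emitFunction() is a stand-in for the per-function backend entry
point, and the whole thing is safe only under the assumption above that
codegen does not mutate the IR:

  #include "llvm/IR/Function.h"
  #include <atomic>
  #include <thread>
  #include <vector>

  void emitFunction(llvm::Function &F); // hypothetical per-function codegen

  // One LLVMContext, one module, N workers pulling functions off a
  // shared atomic counter.
  void parallelCodeGen(std::vector<llvm::Function *> &Work,
                       unsigned NThreads) {
    std::atomic<size_t> Next{0};
    std::vector<std::thread> Pool;
    for (unsigned T = 0; T < NThreads; ++T)
      Pool.emplace_back([&] {
        for (size_t I = Next++; I < Work.size(); I = Next++)
          emitFunction(*Work[I]);
      });
    for (auto &Th : Pool)
      Th.join();
  }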

* Parallelize the function passes, and SCCs that are siblings in the call
tree (again within a single LLVMContext). The gnarly part of this is that
globals have shared use-lists which are updated as we modify each function
individually. Those globals either need to have locks on their use-lists,
replaced with a lockless list, or removed entirely. Probably both, as
GlobalVariables have use-lists we actually use in the optimizers, but we
don't actually need the use-list for "i32 0".

* Parallelize ModulePasses by splitting them into an analysis phase and an
optimization phase. Make each per-TU build emit the .bc as usual plus an
analysis-file (for instance, the call graph, or "which functions read/modify
which globals"). Merge all the analysis-files and farm them back out to be
used as input to the programs optimizing each .bc individually -- but now
they have total knowledge of the whole-program call graph and other AA
information, etc.
  - You can combine this with an RPC layer to give each worker the ability
to ask for the definition of a function from another worker. LLVM already
supports "unmaterialized" functions where the definition is loaded lazily.
The analysis part should arrange to give us enough information to determine
whether we want to do the inlining, then if we decide to materialize the
function we get its body from another worker.

* Parallelize by splitting into different LLVMContexts. This spares us the
difficulty of adding locks or otherwise changing the internals of LLVM,
gets us the ability to spread the load across multiple machines, and if
combined with the RPC idea above you can also get good inlining without
necessarily loading the whole program into memory on a single machine.

I'm not planning to do the work any time soon so count my vote with that in
mind, but if you ask me I think the first step should be to parallelize the
backend within a single LLVMContext first, then to parallelize the function
passes and CGSCC passes (across siblings only of course) second. Removing
the use-list from simple constants is a very interesting thing to do to
decrease lock contention, but we may want to do something smarter than just
remove it -- consider emitting a large constant that is only used by an
inline function. It is possible to emit a table of constants in the same
COMDAT group as the function, then if the inline function is discarded by
the linker the constants are discarded with it. I don't have a concrete
proposal for that.

Nick

Thank you for sharing your enlightening thoughts. I had heard some of these ideas before; some are quite new to me. I will take them into account as the project moves on. The motivation of this project is not merely to make compilation faster. It is also to:
 - significantly ease troubleshooting. I was asked to fix LTO bugs several times, and it almost drove me mad to pinpoint a bug in a huge merged module. It is definitely an unglamorous and painstaking undertaking :-)
 - take one step toward better scalability.
For now, I don't want to parallelize only CodeGen, as post-IPO scalar opt is compile-time hogging as well. On the other hand, HPC folks may invoke Loop-Nest-Opt/autopar/etc. in the post-IPO stage too, and those are intrinsically very slow breeds, so parallelizing the entire post-IPO stage will make them happy. Finer-grained parallelism could be promising; however, it is too error-prone :-)

Shuxin presented the same idea to me. It offers the best opportunity to scale while sidestepping concurrency issues. It avoids the problem of loading all functions during the IPO phase and then cloning them into separate LLVMContexts later. I don't see this as part of the first milestone, though.

-Andy