3. How to parallelize the post-IPO stage
From 5,000 feet up, the concept is very simple:
step 1) divide the merged IR into small pieces,
step 2) compile each of these pieces independently, and
step 3) feed the objects of each piece back to the linker, which links them
into an executable or a dynamic library.
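Those three steps could be sketched in a toy Python form. All names here are illustrative stand-ins, not real LLVM APIs; the "IR" is just a list of function names and the "objects" are strings.

```python
# Toy sketch of the three post-IPO steps: partition the merged IR,
# compile each piece independently (in parallel), link the objects.
from concurrent.futures import ThreadPoolExecutor

def partition(merged_ir, n_pieces):
    """Step 1: split the merged IR (modeled as a list of functions)
    into n_pieces round-robin slices."""
    return [merged_ir[i::n_pieces] for i in range(n_pieces)]

def compile_piece(piece):
    """Step 2: compile one piece independently into an 'object file'
    (modeled as a string naming the functions it contains)."""
    return "obj(" + ",".join(piece) + ")"

def link(objects):
    """Step 3: feed the per-piece objects back to the linker."""
    return "exe[" + ";".join(objects) + "]"

merged_ir = ["main", "foo", "bar", "baz"]
pieces = partition(merged_ir, 2)
with ThreadPoolExecutor() as pool:  # each piece compiles independently
    objects = list(pool.map(compile_piece, pieces))
print(link(objects))
```

A real driver would balance the pieces by estimated code size rather than round-robin, and would spawn separate compiler processes rather than threads.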
You seem to be describing GCC's strategy for whole program
Thank you for the comment. Quite honestly, I'm not very familiar with
GCC's LTO implementation, and right now I'm not allowed to read any GPLv3 code.
What I'm describing here is somewhat similar to Open64's IPA implementation.
I did port GCC before and, of course, read its documentation. Based on my
limited knowledge of its internals, I think GCC's "whole-program"
mode is almost identical to Open64's IPA in spirit; I guess the GCC community
likely borrowed some ideas from Open64.
On the other hand, from what I understood of the older GCC implementation
before I joined Apple, I think the "whole-program mode" in GCC is a misnomer.
I don't want to call this work "whole-program mode"; I'd rather call it "big-program mode"
or something like that, since I don't care whether the binary being built sees the whole
program or not, while the analyses/optimizations
do care about the difference.
What is a bit confusing in your description is that you seem to be
wanting to do more work after IPO is *done*.
IPO itself is difficult to parallelize. What I'm trying to do is to parallelize the post-LTO compilation
stage, including (optionally) LNO, the scalar optimizations, and the compile-time-hogging CodeGen.
If the optimizations are
done, then there is no need to parallelize anything.
In GCC, we have two parallel stages: the generation of bytecode (which
is naturally parallelized by your build system) and the final
optimization phase, which is parallelized by partitioning the
callgraph into disjoint subsets. These subsets are then parallelized
by spawning the compiler on each of the partitions.
I call the 1st "parallel stage" pre-IPO and the 2nd "parallel stage" post-IPO :-),
and the IPO (which you call WPA) is sandwiched in the middle.
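That sandwich could be illustrated with a toy pipeline. Everything here is made up for the example (the node layout, the "inline if small" decision); it only shows the phase structure: parallel pre-IPO per source, a sequential IPO/WPA phase, then parallel post-IPO per partition.

```python
# Toy "sandwich": parallel pre-IPO, sequential IPO/WPA, parallel post-IPO.
from concurrent.futures import ThreadPoolExecutor

def pre_ipo(source):
    """Parallel per source file: produce a callgraph node with a summary."""
    return {"name": source, "size": len(source)}

def wpa(nodes, n_parts):
    """Sequential: make optimization decisions, annotate each node,
    then partition the callgraph into disjoint subsets."""
    for n in nodes:
        n["inline"] = n["size"] < 4  # toy whole-program decision
    return [nodes[i::n_parts] for i in range(n_parts)]

def post_ipo(part):
    """Parallel per partition: act on the annotated decisions."""
    return [n["name"] for n in part if n["inline"]]

sources = ["main", "foo", "bar", "quux"]
with ThreadPoolExecutor() as pool:            # 1st parallel stage
    nodes = list(pool.map(pre_ipo, sources))
parts = wpa(nodes, 2)                         # sequential middle
with ThreadPoolExecutor() as pool:            # 2nd parallel stage
    inlined = list(pool.map(post_ipo, parts))
print(inlined)
```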
The only sequential part is the actual analysis (what we call WPA or
whole program analysis).
While not included in the proposal, I was considering another, earlier level of partitioning
in order to build outrageously huge programs, precisely to parallelize the "WPA". The benefit
is parallelizing the IPO/"WPA"; the cost, however, is not seeing the "whole program".
One of my coworkers told me we care more about the benefit reaped from seeing the "whole program"
than about the improved compile time bought by compiling a bunch of "incomplete program"s independently.
That is a single threaded phase that makes
all the optimization decisions, annotates the summaries on each
callgraph node, partitions the callgraph and then spawns all the
optimizers to work on each section of the callgraph.
This is what I called "post-IPO".
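The partition-then-spawn step described above could look like the following deliberately simplified greedy sketch. This is not GCC's actual partitioning algorithm; it only shows the idea of splitting a callgraph into disjoint, roughly balanced subsets that separate compiler processes can then work on.

```python
# Toy greedy callgraph partitioning: assign each node (largest first)
# to the currently smallest partition, keeping partitions disjoint.
def partition_callgraph(nodes, sizes, n_parts):
    parts = [[] for _ in range(n_parts)]
    totals = [0] * n_parts
    for node in sorted(nodes, key=lambda n: -sizes[n]):
        i = totals.index(min(totals))  # least-loaded partition so far
        parts[i].append(node)
        totals[i] += sizes[node]
    return parts

sizes = {"main": 10, "foo": 7, "bar": 5, "baz": 2}
print(partition_callgraph(sizes, sizes, 2))
```

In a real implementation each subset would be handed to a spawned optimizer process, and the size estimate would come from the summaries annotated during the analysis phase.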