[Proposal] Parallelize post-IPO stage.

You seem to be describing GCC's strategy for whole program
optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).

What is a bit confusing in your description is that you seem to be
wanting to do more work after IPO is *done*. If the optimizations are
done, then there is no need to parallelize anything.

In GCC, we have two parallel stages: the generation of bytecode (which
is naturally parallelized by your build system) and the final
optimization phase, which is parallelized by partitioning the
callgraph into disjoint subsets. These subsets are then parallelized
by spawning the compiler on each of the partitions.

The only sequential part is the actual analysis (what we call WPA or
whole program analysis). That is a single threaded phase that makes
all the optimization decisions, annotates the summaries on each
callgraph node, partitions the callgraph and then spawns all the
optimizers to work on each section of the callgraph.

Diego.

3. How to parallelize post-IPO stage

   From 5,000 feet up, the concept is very simple:
    step 1) divide the merged IR into small pieces,
    step 2) compile each of these pieces independently,
    step 3) feed the objects of each piece back to the linker, which links them
            into an executable or a dynamic lib.
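
(For concreteness, here is a minimal driver-level sketch of those three steps. partitionModule(), the file names, and the llc/ld command lines are illustrative placeholders, not the proposed implementation, and a real driver would run the per-piece compiles in parallel rather than in a serial loop.)

```cpp
// Sketch only: partition the merged IR, compile each piece independently,
// then hand the resulting objects back to the linker.
#include <cstdlib>
#include <string>
#include <vector>

struct Piece { std::string Bitcode, Object; };

// Step 1 (placeholder): divide merged.bc into smaller bitcode files.
static std::vector<Piece> partitionModule(const std::string &MergedBC) {
  (void)MergedBC;
  return {{"part0.bc", "part0.o"}, {"part1.bc", "part1.o"}};
}

int main() {
  std::vector<Piece> Pieces = partitionModule("merged.bc");

  // Step 2: compile each piece independently (serial here for brevity).
  for (const Piece &P : Pieces) {
    std::string Cmd = "llc -filetype=obj " + P.Bitcode + " -o " + P.Object;
    if (std::system(Cmd.c_str()) != 0)
      return 1;
  }

  // Step 3: feed the objects back to the linker.  Here they are merged with
  // "ld -r" as a stand-in; in the proposed flow the objects would be handed
  // back to the invoking linker, which produces the executable or dylib.
  std::string Link = "ld -r -o post_ipo.o";
  for (const Piece &P : Pieces)
    Link += " " + P.Object;
  return std::system(Link.c_str());
}
```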

You seem to be describing GCC's strategy for whole program
optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).

Hi, Diego:

   Thank you for the comment. Quite honestly, I'm not very familiar with
gcc's LTO implementation, and now I'm not allowed to read any GPLv3 code.

What I'm describing here is somewhat similar to Open64's IPA implementation.
I did port gcc before and of course read its documentation. Based on my
limited knowledge of its internals, I think gcc's "whole-program"
mode is almost identical to Open64's IPA in spirit; I guess the gcc community likely
borrowed some ideas from Open64.

On the other hand, from what I could understand of the oldish gcc implementation
before I joined Apple, I guess the "whole-program mode" in gcc is a misnomer.
I don't want to call this work "whole-program mode"; I'd rather call it "big program mode"
or something like that, because I don't care whether the binary being built sees the whole
program or not, while the analyses/optimizations do care about the difference.

What is a bit confusing in your description is that you seem to be
wanting to do more work after IPO is *done*.

IPO is difficult to parallelize. What I'm trying to do is parallelize the post-LTO compilation
stage, including (optionally) LNO, the scalar optimizations, and the compile-time-hogging CodeGen.

If the optimizations are
done, then there is no need to parallelize anything.

In GCC, we have two parallel stages: the generation of bytecode (which
is naturally parallelized by your build system) and the final
optimization phase, which is parallelized by partitioning the
callgraph into disjoint subsets. These subsets are then parallelized
by spawning the compiler on each of the partitions.

I call the 1st "parallel stage" pre-IPO and the 2nd "parallel stage" post-IPO :-),
and the IPO (which you call WPA) is sandwiched in the middle.

The only sequential part is the actual analysis (what we call WPA or
whole program analysis).

While not included in the proposal, I was considering another, earlier level of partitioning
in order to build outrageously huge programs, which would parallelize the "WPA" as well. The benefit
is parallelizing the IPO/"WPA"; the cost is not seeing the "whole program".

One of my coworkers told me we care more about the benefit reaped from seeing the "whole program"
than about the compile time gained by compiling a bunch of "incomplete program"s independently.

That is a single threaded phase that makes
all the optimization decisions, annotates the summaries on each
callgraph node, partitions the callgraph and then spawns all the
optimizers to work on each section of the callgraph.

This is what I called "post-IPO" :slight_smile:

OK, yeah, this was horribly confusing. IPA would probably have been a
better choice of term (or WPA) with the 'A' for analysis and the 'O'
for optimization where we're doing the work.

It makes your entire proposal make some sense now :wink:

I'll need to go back and reread it with that in mind.

-eric

There are so many related terms that are usually confused or
misused; they should be normalized before the discussion. Here is one
version:

o IPA --> interprocedural analysis and optimization
o IPO/CMO --> interprocedural optimization (cross-module optimization). In
different contexts, IPO may mean slightly different things. For
instance, it may mean the whole compilation pipeline, or just the
serial part where all IR files are merged and analyzed. IPO/CMO can
operate in both whole-program mode and non-whole-program mode.

Note that IPA is more general than IPO/CMO, as it covers the single-module
case too.

o LTO --> one type of implementation of IPO/CMO, which involves the
linker or a linker plugin. It means the whole IPO pipeline.

The following are a list of GCC specific terms:

o WHOPR --> GCC LTO in whole-program mode
o WPA --> the serial/merge part of the IPO pipeline
o LGEN --> pre-IPO compilation
o LTRANS --> post-IPO compilation

Shuxin's proposal is about parallelizing 'LTRANS' in LLVM.

thanks,

David

The partitioning should be deterministic. It’s just that the linker output now depends on the partitioning heuristics. As long that decision is based on the input (not the host system), then it still meets Eric’s requirements. I just think it’s unfortunate that post-IPO partitioning (or more generally, parallel codegen) affects the output, but may be hard to avoid. It would be nice to be able to tune the partitioning for compile time without worrying about code quality.

I also want to chime in on the importance of stable binary outputs. And not just same compiler and same sources produces same binary, but that minor changes to either should cause minor changes to the output binary. For software updates, Apple updater tries to download only the delta to the binaries, so we want those to be as small as possible. In addition, it often happens late in an OS release cycle that some critical bug is found and the fix is in the compiler. To qualify it, we rebuild the whole OS with the new compiler, then compare all the binaries in the OS, making sure only things related to the bug are changed.

Sorry for the tangential thought here… it seems that most of Shuxin’s proposal is actually independent of LTO, even though the prototype and primary goal is enabling LTO.

This is very insightful, Andrew! Rather than think of this (post-IPO parallelization) as an LTO enhancement, it should be that the backend simply has some threshold (e.g. number of functions) which causes it to start parallelizing the last steps.

There are two camps: one camp advocates compiling the partitions via multiple processes,
the other favors multiple threads.

There is also a variant of multi-threading that is popular at Apple. Our OSs have libdispatch, which makes it easy to queue up chunks of work. The OS looks at the overall system balance and uses the ideal number of threads to process the work queue.
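
(As a point of reference, a minimal sketch of that libdispatch style: dispatch_apply_f fans a placeholder compilePartition() job out over a global queue and returns once every partition has been processed. The partition count and the work function are made up for illustration, not part of the proposal.)

```cpp
// Sketch only: queue one codegen job per partition on a libdispatch global
// queue and let the OS pick the number of worker threads.
#include <dispatch/dispatch.h>
#include <cstddef>
#include <cstdio>

static const size_t kNumPartitions = 8;  // placeholder partition count

// Placeholder for "run post-IPO codegen on partition i".
static void compilePartition(void * /*ctx*/, size_t i) {
  std::printf("compiling partition %zu\n", i);
}

int main() {
  dispatch_queue_t q =
      dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
  // dispatch_apply_f blocks until all iterations have completed.
  dispatch_apply_f(kNumPartitions, q, /*ctx=*/nullptr, compilePartition);
  return 0;
}
```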

The compiler used to generate a single object file from the merged
IR; now it will generate multiple object files, one for each partition.

I have not studied the MC interface, but why does each partition need to generate a separate object file? Why can’t the first partition that finishes create an object file, and then, as the other partitions finish, have them append to that object file?

-Nick

We can view partitioning as a “transformation”. Unless the transformation is absolutely a no-op, it will change something. If we care about consistency in the binaries, we either consistently use partitioning or consistently don’t. We could append to the object files as an alternative. However, how do we know the /path/to/ld from the existing interface ABIs? How do we know the flags to feed to that ld (more often than not “-r” alone is enough, but some linkers may need more)? In my rudimentary implementation, I hack around this by hardcoding /usr/bin/ld. I think adding the objects one by one back to the linker is better, as the linker already has enough information.

But doesn’t "consistently not use partitioning” mean “don’t use the optimization you are working on”? Isn’t there some way to get the same output no matter how it is partitioned?

I think you missed my point, or you are really thinking from the multi-process point of view. In LLVM there is an MCWriter used to produce object files. Your model is that if there are three partitions, then there will be three MCWriter objects, each producing an object file. What I am saying is to have only one MCWriter object and have all three partitions stream their content out through the one MCWriter, producing one object file.

-Nick
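
(For illustration, a rough sketch of the single-writer scheme under a made-up ObjectWriter type; this is not LLVM's actual MC interface. All partition threads stream their output through one shared writer, which yields a single object file at the price of synchronizing the threads on that writer.)

```cpp
// Hypothetical illustration only: ObjectWriter is a made-up stand-in, not
// LLVM's real MC classes.
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct ObjectWriter {
  explicit ObjectWriter(const std::string &P) : Path(P) {}
  void emit(const std::string &Chunk) { Buffer += Chunk; }  // stand-in for MC emission
  std::string Path, Buffer;
};

// Every partition thread funnels its output through the same ObjectWriter,
// so the link sees one object file, but the threads must synchronize on it.
void emitThroughOneWriter(const std::vector<std::string> &Partitions) {
  ObjectWriter Writer("post_ipo.o");
  std::mutex WriterLock;
  std::vector<std::thread> Workers;
  for (const std::string &P : Partitions)
    Workers.emplace_back([&Writer, &WriterLock, P] {
      // ... run post-IPO codegen for partition P here ...
      std::lock_guard<std::mutex> Lock(WriterLock);
      Writer.emit(P);
    });
  for (std::thread &T : Workers)
    T.join();
}

int main() {
  emitThroughOneWriter({"partition0", "partition1", "partition2"});
  return 0;
}
```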

Yes and no. Just like “cc --without-inline” and “cc --with-inline” yield different binaries. And no, I don’t want to sync all the threads at this point; it makes things all the more complex.

[Xiaofei] If you only target parallelizing the post-LTO passes, there is no need to partition; parallelizing the function passes is enough and could achieve better performance (no need to link partitions together, since they share the AsmPrinter & MCWriter).
