Updated RFC: ThinLTO Implementation Plan

Quick question: Is the work required to support ThinLTO using LLVM’s native tools orthogonal to that required to support non-LLVM tools? If not, would it make sense to start with a deployment of entirely LLVM-based tools - since there seems to be general interest in that - and then come back to the non-LLVM tools separately?

Personally, I see both sides here. I can understand why you want to minimize build configuration changes - they tend to be painful - but I am also reluctant to design a major enhancement to LLVM under the assumption that LLVM’s own tools aren’t adequate for the purpose. That seems like it would be a major problem from the perspective of the project as a whole.

(I realize that LLVM’s tools could simply extract the bitcode out of the wrapper file, but that seems unnecessarily complex for an initial LLVM-only solution.)

Really late on this thread, but I wanted to second Philip’s position: wrapped bitcode can be interesting and we should consider it, but a bitcode-only solution should probably come first.

The things ThinLTO seems to need are generally desirable bitcode features: an easy-to-read symbol table, the ability to read a single function, etc.
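
As a concrete illustration of that second point, LLVM's lazy bitcode reader already allows materializing one function at a time. Below is a minimal sketch, assuming the getLazyBitcodeModule/materialize() APIs from llvm/Bitcode/BitcodeReader.h as they exist in recent LLVM (error handling abbreviated); it is an illustration, not something the RFC mandates:

```cpp
// Sketch: load a module lazily, then read in the body of a single
// function by name. Only that function's bitcode is deserialized.
#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/MemoryBuffer.h"

llvm::Error readOneFunction(llvm::MemoryBufferRef BitcodeBuf,
                            llvm::LLVMContext &Ctx, llvm::StringRef Name) {
  // Function bodies are left unread until materialize() is called.
  auto MOrErr = llvm::getLazyBitcodeModule(BitcodeBuf, Ctx);
  if (!MOrErr)
    return MOrErr.takeError();
  std::unique_ptr<llvm::Module> M = std::move(*MOrErr);
  if (llvm::Function *F = M->getFunction(Name))
    return F->materialize(); // Pulls in just this function's body.
  return llvm::Error::success();
}
```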

Cheers,
Rafael

Sorry for the late reply; this came in the midst of some summer vacations
last month. I've come up with some data structures and interfaces that are
independent of the underlying file format, which should hopefully make the
implementation in each format easier and the implementation order less
critical. I will send that out later when it is fully spec'd.

I've been focusing more on the symbol linkage and renaming specification,
which I emailed out to the list a couple days ago, since that is
fundamental to the ThinLTO design.

Thanks,
Teresa

Hi Teresa,

Thanks for laying down a detailed proposal on top of the slides; it is
very instructive and very pleasant to read.

Hi Mehdi,

Thanks!

I have a few questions, none of which touches the ELF aspect! :)
I apologize if you already addressed them and I missed it (if you can
refer me to any past discussion that would be nice).

1) I haven’t seen any mention of how you plan to split the pipeline (what
will be done at compile time vs. what will be done at link time). Is the
plan to do it the same way as -flto does today (AFAIK almost the complete
-O3 pipeline during compilation, and the specific LTO pipeline at link time)?

Yes, the first stage is very similar to LTO, just like a "-c -flto"
compile (which could do -O2 or -O3 optimization before emitting the
bitcode); the main difference is the function summary and index emitted
along with the module bitcode. Unlike LTO, the initial link of the bitcode
files just produces the combined function summary and index, but doesn't do
any optimization. After that, the LTO pipeline will be executed during the
parallel backend stage, where each module is compiled separately and does
importing.
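
Not part of the proposal itself, but to make the phase split concrete, here is a rough guess at the kind of information a per-function summary entry and the combined index might carry; every field name here is invented for illustration:

```cpp
// Illustrative only; the actual summary contents are defined by the RFC
// and the patches under review, not by this sketch.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct FunctionSummaryEntry {
  std::string ModulePath;          // Module that defines the function.
  uint64_t BitcodeOffset;          // Where the body lives in that module.
  unsigned InstCount;              // Rough size, for import heuristics.
  std::vector<std::string> Calls;  // Callees, to walk the import chain.
};

// The combined index produced by the thin link: symbol -> summaries
// (a multimap because e.g. linkonce_odr symbols can have several copies).
using CombinedIndex = std::multimap<std::string, FunctionSummaryEntry>;
```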

2) It seems to me that there is a major difference in the optimization
pipeline between the regular pipeline with LTO and your proposal. When the
inliner runs, we usually have all the callees (and their callees,
transitively) already “optimized” (not completely, but fairly intensively).

I assume you are referring to the optimization/inlining done in the initial
(-c -flto -O2) compile. That part will be the same with ThinLTO.

In your flow, a function imported from another module will be considered
by the inliner *before* it has inlined its callees cross-module.

The importing pass will be prior to inlining in the backend LTO pipeline.
So let's say we have a case where module A.cc contains a() which calls b()
from module B.cc, which in turn calls c() from module C.cc. When compiling
A.o in the backend phase, the importer will see the call to b(), and if it
thinks inlining is likely it will import it, exposing the call to c(),
which will then similarly be considered for importing. So when the inliner
pass is run it will be similar to LTO for the a()->b()->c() call chain,
assuming importing was deemed profitable for all of the cross-module
callees.
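
In source form, the chain Teresa describes looks like this; the function bodies are invented for illustration, only the call structure matters:

```cpp
// C.cc
int c() { return 42; }

// B.cc
int c();                    // Defined in C.cc.
int b() { return c() + 1; }

// A.cc
int b();                    // Defined in B.cc.
int a() { return b() + 1; }

// When A.o's backend runs, importing b() from B.cc exposes the call to
// c(), which can then be imported from C.cc in turn, so the inliner sees
// the whole a() -> b() -> c() chain much as monolithic LTO would.
```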

3) The incremental aspect is not addressed, is it something that you
thought about or plan to improve in a future version? I have some ideas on
this topic that I’d like to explore, and ThinLTO seems an interesting
design to play with that.

For the bitcode files produced by the first -c compile phase, incremental
compiles work as normal. But I assume you are talking about incremental
compiles for the backend LTO part. Some incremental compilation can be
achieved by computing and saving the dependence set of modules each module
is allowed to import from, using profile information or heuristics in the
linker/plugin stage.

I'd be interested in any thoughts you have on enabling incremental
compilation for ThinLTO or LTO in general.
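
One way such a scheme could look, purely as a sketch (nothing below is part of the proposal, and all names are hypothetical): save a hash over each module and its allowed import set, and skip the backend compile when the combined hash is unchanged:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: combine the content hash of a module with the
// hashes of every module it was allowed to import from (assumes the
// import list is kept in a stable order between links).
uint64_t combinedHash(uint64_t ModuleHash,
                      const std::vector<uint64_t> &ImportHashes) {
  uint64_t H = ModuleHash;
  for (uint64_t IH : ImportHashes) // boost-style hash_combine step
    H ^= IH + 0x9e3779b97f4a7c15ULL + (H << 6) + (H >> 2);
  return H;
}

// The backend compile of a module can be reused if neither the module nor
// anything in its import set changed since the previous link.
bool canSkipBackend(uint64_t SavedHash, uint64_t ModuleHash,
                    const std::vector<uint64_t> &ImportHashes) {
  return SavedHash == combinedHash(ModuleHash, ImportHashes);
}
```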

4) Is your prototype implementation available online?

I haven't made it available as it needed a cleanup and had some prototype
aspects like writing the function index to a side text file instead of in
the module with the bitcode. I've been working instead on a cleaner
implementation that I've started to send for review. I saw you added
yourself to a few of the patches and left some review comments - thanks. I
will be working on responding to the comments and updating the patches next.

Teresa

Hi Teresa,

  2. It seems to me that there is a major difference in the optimization pipeline between the regular pipeline with LTO and your proposal. When the inliner runs, we usually have all the callees (and their callees, transitively) already “optimized” (not completely, but fairly intensively).

I assume you are referring to the optimization/inlining done in the initial (-c -flto -O2) compile. That part will be the same with ThinLTO.

I was really thinking about the LTO phase, when you do cross-module inlining. This is done in parallel backends, if I understand correctly how ThinLTO works, which means you can’t intrinsically follow the SCC order that is used by LTO.
But basically this is what I describe in more detail just below.

In your flow, a function imported from another module will be considered by the inliner before it has inlined its callees cross-module.

The importing pass will be prior to inlining in the backend LTO pipeline. So let’s say we have a case where module A.cc contains a() which calls b() from module B.cc, which in turn calls c() from module C.cc. When compiling A.o in the backend phase, the importer will see the call to b(), and if it thinks inlining is likely it will import it, exposing the call to c(), which will then similarly be considered for importing. So when the inliner pass is run it will be similar to LTO for the a()->b()->c() call chain, assuming importing was deemed profitable for all of the cross-module callees.

So to be clear about the sequence of what happens if you’re doing what the regular pipeline does (assuming O3):

  1. Function c() will be considered by the inliner (assuming no callees available).

  2. It will be optimized by running the following passes (+ implicit analyses):

SROA, Early CSE, Jump Threading, Value Propagation, Simplify the CFG, Combine redundant instructions, Tail Call Elimination, Simplify the CFG, Reassociate expressions, Canonicalize natural loops, Loop-Closed SSA Form Pass, Rotate Loops, Loop Invariant Code Motion, Unswitch loops, Combine redundant instructions, Scalar Evolution Analysis, Canonicalize natural loops, Loop-Closed SSA Form Pass, Induction Variable Simplification, Recognize loop idioms, Delete dead loops, Unroll loops, MergedLoadStoreMotion, Memory Dependence Analysis, Global Value Numbering, MemCpy Optimization, Sparse Conditional Constant Propagation, Combine redundant instructions, Jump Threading, Value Propagation, Dead Store Elimination, Aggressive Dead Code Elimination, Simplify the CFG, Combine redundant instructions.

  3. Function b() will be considered by the inliner, and may or may not inline c().

  4. Function b() will be optimized using the same set of passes as function c() during step 2.

  5. Only now will function a() be considered by the inliner, and it may or may not inline b().

  6. Function a() will be optimized using the same set of passes as function c() during step 2.

Note that with ThinLTO, module B.cc is being processed in parallel at the same time, and steps 1, 2, 3, and 4 are also performed on the same IR. They may not end up with the same result, as c() may have some callees available that were not available during the A.cc compilation. It means that c() might be inlined into b() when processing B.cc but not when processing A.cc, and as a consequence maybe the version of b() in B.cc could have been inlined into a(), but not the version of b() in A.cc.
(I hope it’s not too confusing; I may have to provide an example.)

And of course steps 1 and 2 are also performed at the same time for module C.cc, and may again give a different result.

My observation is that the increased parallelism you get with ThinLTO comes with some (non-negligible?) redundant work, and you need this redundant work to keep some quality for the inliner.

  3. The incremental aspect is not addressed, is it something that you thought about or plan to improve in a future version? I have some ideas on this topic that I’d like to explore, and ThinLTO seems an interesting design to play with that.

For the bitcode files produced by the first -c compile phase the incremental compiles work as normal. But I assume you are talking about incremental compiles for the backend LTO part. Some incremental compilation can be achieved by computing and saving the dependence set of modules each module is allowed to import from, using profile information or heuristics in the linker/plugin stage.

I’d be interested in any thoughts you have on enabling incremental compilation for ThinLTO or LTO in general.

I’ll let you know if I manage to get a reasonable sketch on this topic!

  4. Is your prototype implementation available online?

I haven’t made it available as it needed a cleanup and had some prototype aspects like writing the function index to a side text file instead of in the module with the bitcode. I’ve been working instead on a cleaner implementation that I’ve started to send for review. I saw you added yourself to a few of the patches and left some review comments - thanks. I will be working on responding to the comments and updating the patches next.

I saw the patches on Phabricator after asking the question here :)
I was more interested in hacking on a working prototype to experiment with the potential for incremental compilation. But it’s not a big deal, I’ll follow the landing of the patches!

Thanks,

Mehdi, you have a good observation. In this respect, ThinLTO is similar to LIPO. See also the inlined response below.

Hi Teresa,

Hi Mehdi,

Saw David’s response but wanted to add a bit more below.

Good point, I didn’t see before that the LTO pipeline is different. It hasn’t been touched in years and I wonder if it is not just a bug; I don’t see any reason for it to be different from the regular pipeline. It seems I have some experiments to run…

You understood correctly :)

Trying to understand the limitations of ThinLTO compared to (Fat)LTO, I was trying to figure out what result ThinLTO would give on the basic clang example: http://llvm.org/docs/LinkTimeOptimization.html

The regular LTO performs this (copy/pasted from the page):

• In this example, the linker recognizes that foo2() is an externally visible symbol defined in LLVM bitcode file. The linker completes its usual symbol resolution pass and finds that foo2() is not used anywhere. This information is used by the LLVM optimizer and it removes foo2().

• As soon as foo2() is removed, the optimizer recognizes that condition i < 0 is always false, which means foo3() is never used. Hence, the optimizer also removes foo3().

• And this in turn, enables linker to remove foo4().
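
For readers following along, the example on that page is shaped roughly like this (condensed from memory; see the link above for the exact code):

```cpp
// a.c (compiled to bitcode), condensed from the linked page:
void foo4(void);               // Defined in main.c, compiled to native code.
static int i = 0;              // foo2() is the only writer of i.
void foo2(void) { i = -1; }    // Externally visible but never called.
static int foo3(void) {
  foo4();
  return 10;
}
int foo1(void) {
  int data = 0;
  if (i < 0)                   // With foo2() dead, i is always 0.
    data = foo3();
  return data + 42;
}
```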

In ThinLTO, it seems to me that to be able to import foo1() into main.c, the static in a.c has to be promoted to a global, which would kill the optimization described. Is that correct?

Thanks,

Mehdi, ThinLTO is capable of doing any summary-based whole-program analysis (no transformation), as mentioned in the original document. Depending on the analysis overhead, it can either be turned on by default or enabled with an additional option.

The example you mentioned is a trivial one: global const var analysis. Using linker feedback, the plugin should be able to detect that ‘i’ is never modified, and that result can be saved in the combined summary file.

In the later phase-3 stage, the simplifier can use the summary data to simplify references to those read-only globals.
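
To tie the two answers together, here is a heavily hedged sketch of what this could look like; the promoted symbol name and the summary flag are both invented, not part of the proposal:

```cpp
// After promotion, the static 'i' from a.c becomes an ordinary global
// with a module-unique suffix (name invented for illustration) so that
// importing foo1() into main.c keeps its reference valid:
int i_from_a_1a2b3c = 0;   // was: static int i = 0;

// Summary-side information in David's scheme: with linker feedback
// showing that the sole writer foo2() is dead, the thin link can mark
// the variable as effectively read-only in the combined summary.
struct GlobalVarSummary {
  bool ReadOnlyAfterLink; // set for 'i' once foo2() is known dead
};

// A phase-3 simplifier that trusts ReadOnlyAfterLink can then fold
// 'i < 0' to false and drop foo3(), recovering the (Fat)LTO result.
```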

David