Thanks for layout down a detailed proposal on top of the slides, it is very instructive and very pleasant to read.
I have a few questions, none of which touches the ELF aspect!
I apologize if you already addressed them and I missed it (if you can refer me to any past discussion that would be nice).
- I haven’t seen mention of how you plan to split the pipeline (what will be done at compile-time vs what will be done at link time), is the plan do it the same way as -flto does today (AFAIK almost complete -O3 during compilation, and the specific LTO pipeline during link time)?
Yes, the first stage is very similar to LTO, so just like a “-c -flto” compile (which could do -O2 or -O3 optimization before emitting the bitcode), the main difference being the function summary and index emitted along with the module bitcode. Unlike LTO the initial link of the bitcode files just produces the combined function summary and index, but doesn’t do any optimization. After that, the LTO pipeline will be executed during the parallel backend stage, where each module is compiled separately and does importing.
- It seems to me that there is a major difference in the optimization pipeline with the regular pipeline and LTO vs your proposal. When the inliner runs, we usually have all the callees (and their callees, transitively) already “optimized” (not completely, but fairly intensively).
I assume you are referring to the optimization/inlining done in the initial (-c -flto -O2) compile. That part will be the same with ThinLTO.
I was really thinking about the LTO phase, when you do cross-module inlining. This is done in parallel backends if I understand correctly how ThinLTO works. Which means you can’t intrinsically follow the SCC order that is used by LTO.
But basically this is what I describe in more details just below.
In your flow, a function imported from another module will be considered by the inliner before it has inlined cross-module its callees.
The importing pass will be prior to inlining in the backend LTO pipeline. So let’s say we have a case where module A.cc contains a() which calls b() from module B.cc, which in turn calls c() from module C.cc. When compiling A.o in the backend phase, the importer will see the call to b(), and if it thinks inlining is likely it will import it, exposing the call to c(), which will then similarly be considered for importing. So when the inliner pass is run it will be similar to LTO for the a()->b()->c() call chain, assuming importing was deemed profitable for all of the cross-module callees.
So to be clear with the sequence of what happens, if you’re doing what the regular pipeline does (assuming O3):
Function c() will be considered by the inliner (assuming no calllees available)
It will be optimized by running the following passes (+ implicit analyses):
SROA, Early CSE, Jump Threading, Value Propagation, Simplify the CFG, Combine redundant instructions, Tail Call Elimination, Simplify the CFG, Reassociate expressions, Canonicalize natural loops, Loop-Closed SSA Form Pass, Rotate Loops, Loop Invariant Code Motion, Unswitch loops, Combine redundant instructions, Scalar Evolution Analysis, Canonicalize natural loops, Loop-Closed SSA Form Pass, Induction Variable Simplification, Recognize loop idioms, Delete dead loops, Unroll loops, MergedLoadStoreMotion, Memory Dependence Analysis, Global Value Numbering, MemCpy Optimization, Sparse Conditional Constant Propagation, Combine redundant instructions, Jump Threading, Value Propagation, Dead Store Elimination, Aggressive Dead Code Elimination, Simplify the CFG, Combine redundant instructions.
Function b will be considered by the inliner, and may or may not inline c()
Function b will be optimized using the same set of passes as function c() during step 2).
Only now Function a() will be considered by the inliner and may or may not inline b().
Function a will be optimized using the same set of passes as function c() during step 2).
Note that with ThinLTO, you have in parallel at the same time the module B.cc processing and steps 1, 2, 3, and 4 are also performed on the same IR. They may not end-up with the same result as c() may have some callees available that were not available during A.cc compilation. It means that c() might be inlined in b() when processing B.cc but not when processing A.cc, and as a consequence maybe the version of b() in B.cc could have been inlined in a() but not the version of b() in A.cc.
(I hope it’s not too confused, I may have to provide an example).
And of course step 1 and 2 are also performed at the same time for module C.cc, and may give again a different result.
My observation is that the increased parallelism you have with ThinLTO comes with some (non-negligible?) redundant work, and you need this redundant work to have keep some quality for the inliner.
- The incremental aspect is not addressed, is it something that you thought about or plan to improve in a future version? I have some ideas on this topic that I’d like to explore, and ThinLTO seems an interesting design to play with that.
For the bitcode files produced by the first -c compile phase the incremental compiles work as normal. But I assume you are talking about incremental compiles for the backend LTO part. Some incremental compilation can be achieved by computing and saving the dependence set of modules each module is allowed to import from, using profile information or heuristics in the linker/plugin stage.
I’d be interested in any thoughts you have on enabling incremental compilation for ThinLTO or LTO in general.
I’ll let you know if I manage to get a reasonable sketch on this topic!
- Is your prototype implementation available online?
I haven’t made it available as it needed a cleanup and had some prototype aspects like writing the function index to a side text file instead of in the module with the bitcode. I’ve been working instead on a cleaner implementation that I’ve started to send for review. I saw you added yourself to a few of the patches and left some review comments - thanks. I will be working on responding to the comments and updating the patches next.
I saw the patches on Phabricator after asking the question here
I was more interesting to hack around a working prototype to experiment the potential for incremental compilation. But not a big deal, I’ll follow the landing of patches!