[RFC] A Unified LTO Bitcode Frontend

Then that really doesn’t look like noise to me, especially since it’s skewed onto one specific side.

I think the testing we’ve done on real applications is more instructive. The CTMark tests are very small, and that magnifies the impact of lengthening the pipeline. We’ve seen very little difference in compile-time or run-time performance in our internal usage, and you can see the same results in the WebKit build we posted here.

Would it be possible to test a larger application, such as building Clang itself?

Yes, we can do that. We’ve run that test internally several times. I’ll run it again and post the results here.

Here are the compile-time results for clang when built with current ThinLTO and with unified ThinLTO.

Run Current Unified
1 2348.32 2384.87
2 2349.90 2385.76
3 2377.53 2385.98
AVG 2358.58 2385.53

%diff = (2385.53 - 2358.58) / 2358.58 × 100 ≈ 1.14%

This fits well with our internal tests. There is a slight penalty incurred by running the Full LTO pipeline pre-link, but it’s ~1%.

@petrhosek @LebedevRI Any further thoughts on pre-link compile time performance? I should probably rebase this at some point…

I’ve rebased all patches. Other than removing the changes related to the legacy pipeline, the most significant changes are related to calculating the Module ID. The new approach only takes weak symbols into account if the current approach hasn’t produced a hash, indicating that no externally visible symbols are available. The weak symbol names are combined with the module identifier to avoid any potential hash conflicts.
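
For anyone who wants that fallback spelled out, here is a minimal sketch in plain C++. It is not the code from the patches: the function name, the parameters, and the use of std::hash are illustrative stand-ins for however the patch actually walks the module’s symbol table.

#include <functional>
#include <string>
#include <vector>

// Illustrative sketch of the Module ID fallback: hash the externally visible
// symbol names; only if that produces nothing, fall back to the weak symbol
// names combined with the module identifier, so that two modules defining the
// same weak symbols still get distinct IDs.
std::string computeModuleHash(const std::vector<std::string> &ExternalSyms,
                              const std::vector<std::string> &WeakSyms,
                              const std::string &ModuleIdentifier) {
  std::hash<std::string> Hasher;
  size_t Hash = 0;
  auto Combine = [&](const std::string &S) {
    // Boost-style hash mixing; the constant is the usual golden-ratio one.
    Hash ^= Hasher(S) + 0x9e3779b9 + (Hash << 6) + (Hash >> 2);
  };

  for (const std::string &Name : ExternalSyms)
    Combine(Name);

  if (ExternalSyms.empty()) {
    // No externally visible symbols, so the usual approach produced no hash:
    // take the weak symbols into account, mixed with the module identifier
    // to avoid collisions.
    Combine(ModuleIdentifier);
    for (const std::string &Name : WeakSyms)
      Combine(Name);
  }
  return std::to_string(Hash);
}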

We’ve run a few different tests at this point and unified LTO seems to perform reasonably well. We’re hoping to commit this as an experimental option. Is there anything else that needs to be looked at?

Can you provide the invocation you’re using (and the set of patches to apply) to reproduce the compile-time results mentioned for clang in [RFC] A Unified LTO Bitcode Frontend - #45 by ormris? (That is, ThinLTO vs. “Unified ThinLTO”.)

Sure. The patches I used are attached. They’re based on commit 27a8735a444fb311838f06f8d0d5b10ca9b541f6. To configure the patched and unpatched compilers for comparison:

cmake -G Ninja -DCMAKE_C_COMPILER=clang-12 -DCMAKE_CXX_COMPILER=clang++-12 -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="llvm;clang;lld" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_ENABLE_ASSERTIONS=OFF <repo-dir>/llvm

To configure the test build with Unified LTO:

cmake -G Ninja -DCMAKE_C_COMPILER=<install-dir>/clang -DCMAKE_CXX_COMPILER=<install-dir>/clang++ -DLLVM_USE_LINKER=<install-dir>/ld.lld -DLLVM_ENABLE_ASSERTIONS=OFF -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_VERSION_SUFFIX= -DLLVM_BUILD_RUNTIME=ON -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCLANG_ENABLE_OPAQUE_POINTERS=OFF -DCMAKE_CXX_FLAGS="-flto=thin -funified-lto -Wl,--lto=thin -fuse-ld=<install-dir>/ld.lld" -DCMAKE_C_FLAGS="-flto=thin -funified-lto -Wl,--lto=thin -fuse-ld=<install-dir>/ld.lld" -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_TESTS=ON -DLLVM_ENABLE_PROJECTS="clang;llvm;lld" -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_AR=<install-dir>/llvm-ar -DCMAKE_RANLIB=<install-dir>/llvm-ranlib <repo-dir>/llvm

Remove “-funified-lto -Wl,--lto=thin” to configure a test build without Unified LTO.

clanglld.patch (20.1 KB)
llvm.patch (48.1 KB)
moduleid.patch (15.6 KB)
ps4.patch (1.6 KB)

Thanks, I was able to reproduce and look deeper. It made me figure out something that wasn’t clear when you posted the RFC:

When I was trying to understand your compile-time results before, I thought the only change you had made was to “use the Full LTO pipeline for pre-link optimization”, but you also seem to be using the FullLTO pipeline for ThinLTO post-link optimization, which changes the equation quite significantly!

So now I’m wondering: why go in this direction? Wouldn’t the opposite actually be a better choice, that is, building the UnifiedLTO pipeline to be aligned with the ThinLTO pipeline instead (which itself is more aligned with a regular O2/O3)? That wouldn’t change anything for ThinLTO, but it would make a FullLTO link slower. That may be a better tradeoff overall, though; at least it won’t degrade ThinLTO performance. Have you considered this?

No, that’s not quite right. We use the ThinLTO and FullLTO default post-link pipelines for their respective post-link optimization steps. Unfortunately, there was a testing-related part of the patch (now corrected) that did seem to indicate that we use FullLTO for ThinLTO post-link. Sorry for the confusion.

There was some concern around changing the FullLTO pipeline. Many of our users prefer it and we didn’t want to harm its optimization effectiveness. When we figured out that we could use the FullLTO pipeline pre-link without losing ThinLTO optimization performance (and with minimal compile-time impact), we decided that would be a good way to go. Moving to the ThinLTO pre-link pipeline for everything would remove a significant number of optimizations from the Unified FullLTO pipeline as compared to the default FullLTO pipeline. I wouldn’t want someone running with Unified LTO enabled to see worse runtime performance in either mode.

To second this, I’m still using FullLTO, and planning to ship DXC using FullLTO in the near future because it still produces better performing output than ThinLTO.

Earlier this year I ran a comparison on CTMark with FullLTO vs. ThinLTO on my M2 MacBook Pro and found that the FullLTO binaries consistently performed around 4% faster across the board. When I’m building for a distribution, I’ll always burn the extra link time once to save 4% of compile time for every invocation of that binary.

Regressing ThinLTO link-time to preserve FullLTO output quality seems like the right tradeoff to me. Making FullLTO produce worse binaries just seems like the wrong call.

I think there was a misunderstanding, as I believe @ormris clarified in another comment. The LTO link will continue to use the optimization pipeline of the selected LTO type.

(Also note that using the wrong pipeline for ThinLTO would degrade its performance, not link time, but in any case this isn’t what is happening).

The main pipeline change that is being contemplated is the pre-LTO compile. I believe what is being considered is using the full LTO pre-LTO compilation pipeline for the thin LTO pre-LTO compile. We need to ensure that doesn’t degrade ThinLTO performance either.

I ran my tests with the set of 4 patches you posted here, just 2 comments above. And this is the diff I see in there:

diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 8de919d24291..e84692886a1e 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -1140,7 +1140,10 @@ Error PassBuilder::parseModulePass(ModulePassManager &MPM,
     } else if (Matches[1] == "thinlto-pre-link") {
       MPM.addPass(buildThinLTOPreLinkDefaultPipeline(L));
     } else if (Matches[1] == "thinlto") {
-      MPM.addPass(buildThinLTODefaultPipeline(L, nullptr));
+      if (!PTO.UnifiedLTO)
+        MPM.addPass(buildThinLTODefaultPipeline(L, nullptr));
+      else
+        MPM.addPass(buildLTOPreLinkDefaultPipeline(L));
     } else if (Matches[1] == "lto-pre-link") {
       MPM.addPass(buildLTOPreLinkDefaultPipeline(L));
     } else {

Shouldn’t the right diff instead be the following?

diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 8de919d24291..f8f85b3d0dad 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -1138,7 +1138,10 @@ Error PassBuilder::parseModulePass(ModulePassManager &MPM,
     if (Matches[1] == "default") {
       MPM.addPass(buildPerModuleDefaultPipeline(L));
     } else if (Matches[1] == "thinlto-pre-link") {
-      MPM.addPass(buildThinLTOPreLinkDefaultPipeline(L));
+      if (PTO.UnifiedLTO)
+        MPM.addPass(buildLTOPreLinkDefaultPipeline(L));
+      else
+        MPM.addPass(buildThinLTOPreLinkDefaultPipeline(L));
     } else if (Matches[1] == "thinlto") {
       MPM.addPass(buildThinLTODefaultPipeline(L, nullptr));
     } else if (Matches[1] == "lto-pre-link") {

Is this the only change? I’d like to be sure I run the right set of tests.

What kind of data do you have on this? It’s not clear to me, actually: I spent a significant amount of time trying to just use the ThinLTO pipeline for FullLTO when we designed it. The only reason we didn’t pursue it at the time (as far as I remember) was compile-time regressions. On the other hand, a performance regression shouldn’t be expected; I’d be curious to understand it if that is the case.
(I’m sure there are edge cases: any kind of change to the pipeline, optimization passes, or heuristics in the compiler will be “unlucky” in some cases, of course.)

Sure; to be clear, we should absolutely not regress FullLTO, but I don’t believe that is what is at hand here.

The benefits you get from FullLTO aren’t because of a very nice pipeline (otherwise we’d adopt it in ThinLTO). On the contrary, in general the FullLTO pipeline can’t be optimal because of link-time concerns. So you get the benefits of FullLTO because you have the entire IR, and you should keep those benefits even when using the ThinLTO pipeline :slight_smile:

If you’re running your tests with opt, you’ll want this change as well.

diff --git a/llvm/tools/opt/NewPMDriver.cpp b/llvm/tools/opt/NewPMDriver.cpp
index 57d3d2e86aa3..8d90bc64c407 100644
--- a/llvm/tools/opt/NewPMDriver.cpp
+++ b/llvm/tools/opt/NewPMDriver.cpp
@@ -416,6 +417,7 @@ bool llvm::runPassPipeline(StringRef Arg0, Module &M, TargetMachine *TM,
   // to false above so we shouldn't necessarily need to check whether or not the
   // option has been enabled.
   PTO.LoopUnrolling = !DisableLoopUnrolling;
+  PTO.UnifiedLTO = UnifiedLTO;
   PassBuilder PB(TM, PTO, P, &PIC);
   registerEPCallbacks(PB);

I’ll need to think more about your second question…

To summarize and make sure we’re on the same page, I’d like to cross-check my understanding of what we have:

Regular O2/O3 is structured around:

  1. ModuleSimplificationPipeline: there are a few module-level simplifications before going to the inliner, which operates on SCCs in the call graph, applying a function-level pass pipeline and inlining callees as it moves through the graph.
  2. ModuleOptimizationPipeline: this is where we get aggressive with vectorization and loop transformations (but also outlining, function merging, etc.). We intentionally avoid these before inlining has finished, to avoid messing with the inliner heuristics.
  3. Each backend does whatever fits in CodeGen preparation.

ThinLTO basically intercepts between 1 and 2 above:

  • pre-link: we run approximately the ModuleSimplificationPipeline above, and stop there.
  • link-time: run the cross-module specific passes, and then run ModuleOptimizationPipeline and let the backend take over.
    → very aligned with O2/O3.

FullLTO:

  • pre-link: runs approximately ModuleSimplificationPipeline and ModuleOptimizationPipeline.
  • link-time: runs the cross-module-specific passes, then runs a custom pipeline that is a bit weird.
    For example, it runs the inliner (which now runs after the ModuleOptimizationPipeline from pre-link…), but right now I don’t see any simplification interleaved with the inlining? (That seems really unexpected to me!)
    Then after inlining it tries to re-do some optimization, but it can’t be as aggressive as the ModuleOptimizationPipeline, again for compile/link-time reasons, and it also means redoing optimizations already done during pre-link.

The proposal here is to use the FullLTO pipeline during pre-link, that is, to perform the ModuleOptimizationPipeline that ThinLTO does not run right now. The link-time step wouldn’t change: each mode would keep its own pipeline.
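
To make that mapping concrete, here is a small C++ sketch of which PassBuilder entry points correspond to each phase, as I read the discussion. The builder functions are the real PassBuilder ones; the wrapper function and variable names are only for illustration, and this is not code from the patches.

#include "llvm/Passes/PassBuilder.h"
using namespace llvm;

// Illustration only: which default-pipeline builder each phase maps to.
void sketchPipelines(PassBuilder &PB, OptimizationLevel L) {
  // Regular O2/O3: simplification + optimization in a single pipeline.
  ModulePassManager Default = PB.buildPerModuleDefaultPipeline(L);

  // ThinLTO today: ~simplification pre-link; cross-module work plus
  // simplification/optimization post-link.
  ModulePassManager ThinPreLink = PB.buildThinLTOPreLinkDefaultPipeline(L);
  ModulePassManager ThinPostLink =
      PB.buildThinLTODefaultPipeline(L, /*ImportSummary=*/nullptr);

  // FullLTO today: simplification + optimization pre-link; a custom
  // pipeline at link time.
  ModulePassManager FullPreLink = PB.buildLTOPreLinkDefaultPipeline(L);
  ModulePassManager FullPostLink =
      PB.buildLTODefaultPipeline(L, /*ExportSummary=*/nullptr);

  // Unified LTO as proposed at this point in the thread: the FullLTO
  // pre-link pipeline for both modes, with each mode keeping its own
  // post-link pipeline.
  ModulePassManager UnifiedPreLink = PB.buildLTOPreLinkDefaultPipeline(L);
}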

I’d like to mention this patch from @nikic, which actually stops running the ModuleOptimizationPipeline during pre-link for FullLTO, bringing the FullLTO pre-link closer to ThinLTO’s (they mentioned in the review that some work is needed in the FullLTO link-time pipeline first).

@nikic’s patch would go in a direction where we wouldn’t have to think too much about UnifiedLTO: we’d have a natural convergence of the pre-link pipelines for all modes.

Let me know if I missed something!

@mehdi_amini Your description sounds basically right to me. I’ll just add a few minor notes:

It’s not just a matter of the inliner heuristic – we also want to have maximal information for the vectorization/unrolling heuristics. E.g. trip counts may become known post-link, and we’ll be able to make more profitable vectorization/unrolling choices with that information.

The “cross-module” part of ThinLTO runs outside the pipeline itself. The post-link ThinLTO pipeline will run both ModuleSimplification (which e.g. inlines imported functions) and ModuleOptimization. (The pre-link and post-link simplification pipelines have some differences in configuration, but they’re the same at a high level.)

Yes, the full LTO pipeline is indeed quite odd. It basically runs a very reduced version of the function simplification pipeline without the CGSCC interleaving with the inliner. I assume this is done for compile-time reasons, as it ensures each function only gets simplified once.

I think it’s also worth mentioning that if we talk about ThinLTO vs FullLTO, there are two components: The compilation model and the optimization pipelines. In terms of optimization power, FullLTO has the better compilation model (all code visible), but ThinLTO has the better optimization pipeline.

That’s my understanding of the proposal, and the part I am mainly concerned about, because it seems like we’re enforcing the use of the “worse” pre-link pipeline. Of course, until these pipelines can be unified, both combining ThinLTO pre-link with FullLTO post-link and combining FullLTO pre-link with ThinLTO post-link will cause regressions in some cases.

Right, I was trying to frame this in terms of “why isn’t regular O2/O3 running this during the simplification phase (interleaved with the inliner)?”: if it weren’t messing with the inliner flow, we’d just run it there, right?

Yes, inlining is what it comes down to, one way or another. We want most transforms to run before (or interleaved with) inlining, but some transforms only after inlining, which is how we get the simplification/optimization split. For LTO there is additional inlining post-link, so optimization should also only happen post-link (as is the case in the ThinLTO pipeline).

I wanted to run some preliminary tests using ThinLTO as the pre-link pipeline in FullLTO mode. The test compared lld build times using a version of clang and lld built with two different unified FullLTO pipelines, one using the FullLTO pre-link pipeline and the other using the ThinLTO pre-link pipeline. The ThinLTO pre-link pipeline doesn’t cause a huge regression here and actually does slightly better. It also avoids the compile time regressions we noted earlier. After some internal discussion, we’re happy to move forward with the ThinLTO pre-link pipeline for this RFC. I’m sure that there will be more pipeline tuning down the road, but this seems like a reasonable place to start.

Pre-link pipeline

Run Full Thin
1 1067.46 1062.12
2 1068.25 1064.11
3 1068.71 1064.46
4 1068.80 1064.69
5 1069.35 1065.45
AVG 1068.51 1064.17
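
From the averages, that works out to (1068.51 - 1064.17) / 1068.51 × 100 ≈ 0.41% in favor of the ThinLTO pre-link pipeline.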