Supporting Regular and Thin LTO with a Single LTO Bitcode Format

We’re using regular LTO for our Clang toolchain build because we don’t mind spending more resources to squeeze out as much performance as possible. However, when looking into our build performance, I noticed that we only spend about a third of the total build time building the distribution components; the rest is spent building unit tests and tools that are only used by lit tests. For the latter, we don’t care about performance, so it’d be nice to avoid doing regular LTO to speed up the build.

The idea I had is to use a single LTO bitcode format for all translation units, and then decide only at link time whether to use regular LTO (for the distribution components) or ThinLTO (for everything else).
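To make this concrete, the workflow I have in mind would look roughly like the following (the flags are purely illustrative assumptions; as far as I can tell, today the link-time LTO mode is effectively dictated by how the objects were compiled, which is exactly the limitation):

    # Compile every TU once, emitting a single LTO bitcode format
    clang -flto=thin -O2 -c foo.cpp -o foo.o

    # Distribution components: regular (monolithic) LTO at link time
    clang -flto=full -fuse-ld=lld foo.o bar.o -o clang

    # Test-only tools: ThinLTO link of the very same objects
    clang -flto=thin -fuse-ld=lld foo.o baz.o -o SomeUnitTest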

After doing some research, I found the “Supporting Regular and Thin LTO with a Single LTO Bitcode Format” talk presented by Matthew Voss at the 2019 LLVM Developers’ Meeting, which describes exactly what I have in mind, but it seems like this was only implemented downstream.

Has there been any progress on upstreaming the implementation? Is there any way to do what I described using the in-tree LTO implementation?

+Matthew and Teresa for any context they might have

At a high level, this sounds like a reasonable thing to me, for what it’s worth.

This is a really good thread to read: https://lists.llvm.org/pipermail/llvm-dev/2018-April/122469.html

There is no fundamental technical reason why this cannot happen, but it requires a lot of work to fine-tune the pipeline (yes, full LTO and ThinLTO use different pipelines) so that it reaches a good balance of performance and build overhead for general users.
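For illustration, with the new pass manager the two pipelines can be driven explicitly through opt, which makes the divergence easy to see and experiment with (pipeline alias names as I remember them; worth double-checking against your opt version):

    # Full LTO: pre-link pipeline at compile time, heavyweight pipeline at link time
    opt -passes='lto-pre-link<O2>' input.ll -o input.prelink.bc
    opt -passes='lto<O2>' merged.bc -o merged.opt.bc

    # ThinLTO: a different pre-link pipeline, plus a per-module backend pipeline
    opt -passes='thinlto-pre-link<O2>' input.ll -o input.prelink.bc
    opt -passes='thinlto<O2>' input.prelink.bc -o input.opt.bc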

Steven

Hi Petr,

This does sound like a good use case for our pipeline. We’ve seen good runtime performance overall, as we stated in the talk. I’ve been working on upstreaming our patches off and on for a couple of months now. Our pipeline needs to be ported to the new pass manager (NPM), but once that work is done, the patch is ready for review. I should be able to finish that work within the next month or two and would love to get some feedback on our approach.

Thanks,

Matthew

Hi,

I actually just watched the presentation, but it is a bit too high level for me to really understand what this new “format” is in practice.
As far as I remember, ThinLTO bitcode should be a superset of the information in FullLTO bitcode. It isn’t clear to me what prevents a linker implementation from taking ThinLTO bitcode and performing FullLTO on it; this was originally designed to allow that.

The optimization pipelines are set up differently, but I don’t believe that really affects the format. A good FullLTO link implementation based on ThinLTO bitcode would likely involve running a heavier pipeline during LTO. This is something I tried at the time, in order to align the pipelines and ensure there is only one, but it made the LTO link slower and we decided it wasn’t a good tradeoff.
We could also run part of the pipeline on individual modules (in parallel) before linking them together during FullLTO, which would address the performance question. It wouldn’t, however, help when the same modules are linked into different binaries, as is the case for the LLVM test binaries, which link the same set of libraries over and over. For that setup (which may not be common, though), it is better to do as much work as possible in the first compilation phase and as little as possible during the link phase.

the rest is spent building unit tests and tools that are only used by lit tests. For the latter, we don’t care about performance, so it’d be nice to avoid doing regular LTO to speed up the build.

For such a case, the “ideal” setup would be to link these in a ThinLTO mode where we disable cross-module importing and just optimize/codegen in parallel. That way we could guarantee that each file is codegened once and cached, and every test binary would get cache hits on those files. This makes it “almost zero” cost over non-LTO, I think.
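Concretely, I’d expect something along these lines to get that behavior (lld flags; -import-instr-limit=0 is one way I’d expect cross-module importing to be effectively disabled, and the cache path is arbitrary):

    # ThinLTO compiles as usual
    clang -flto=thin -O2 -c foo.cpp -o foo.o

    # Link every test binary against a shared ThinLTO cache with importing suppressed,
    # so each object is optimized/codegened once and then reused from the cache
    clang -flto=thin -fuse-ld=lld \
          -Wl,--thinlto-cache-dir=/path/to/lto.cache \
          -Wl,-mllvm,-import-instr-limit=0 \
          foo.o bar.o -o SomeUnitTest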

Best,

Since we do get a module summary even for regular LTO (with metadata added to indicate whether the LTO link should do regular or ThinLTO), I think the issue is just the divergent pass managers, which is also discussed in the thread Steven pointed to. Matthew - can you remind me of the different summary issue?
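(One way to see that metadata: a bitcode object is just IR, so it can be disassembled to inspect the module flags; the flag names below are from memory, so treat them as approximate.)

    # A regular-LTO compile that still emits a summary should carry a "ThinLTO" = 0
    # module flag, telling the LTO link to treat the module as regular LTO
    llvm-dis foo.o -o - | grep -E 'llvm.module.flags|ThinLTO|EnableSplitLTOUnit'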

Petr, in your case, where you don’t care (so much) about performance when running in a ThinLTO mode since it is just tests, presumably the different-pipeline issue isn’t a big deal.

Teresa

Yes, I don’t think that using a different pipeline for test binaries would be a problem.


The optimization pipelines are set up differently, but I don’t believe that really affects the format. A good FullLTO link implementation based on ThinLTO bitcode would likely involve running a heavier pipeline during LTO. This is something I tried at the time, in order to align the pipelines and ensure there is only one, but it made the LTO link slower and we decided it wasn’t a good tradeoff.

This is something I’m interested in separately from the original topic. A while back, during one of our discussions, Chandler suggested that it may be worth looking into using the ThinLTO pipeline for FullLTO, since FullLTO hasn’t really been getting much attention in the last few years and could benefit from the work done on the ThinLTO pipeline. Aside from slower links, did you see any improvements in terms of runtime performance?

Hi Teresa,

Matthew - can you remind me of the different summary issue?

In broad strokes, the two formats are compatible, but we had to make several small changes related to symbol resolution and pass ordering to get everything working.

Thanks,

Matthew

Yes, at the time (5 years ago, though) there were improvements and regressions, as usual. Here is a revision I opened later (4 years ago), after we shipped ThinLTO: https://reviews.llvm.org/D29376 ; some folks timed it.

See also this previous thread on llvm-dev@ on this same topic: https://lists.llvm.org/pipermail/llvm-dev/2018-April/122469.html

We also have this problem, and we are considering using -fembed-bitcode: https://crbug.com/1196260

This way, we wouldn’t do any kind of LTO for test binaries; we’d just link the native objects and discard the bitcode sections. ThinLTO has the advantage that it may do less code generation of non-prevailing inline functions, but it is also more complex overall. It is too early for me to say which of these approaches is best, but we’ll keep in touch.
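Roughly, the setup we’re considering looks like this (nothing is settled yet; in particular, the stripping step is just one option for getting rid of the embedded bitcode):

    # Compile once: native code plus the IR embedded in a .llvmbc section
    clang -O2 -fembed-bitcode -c foo.cpp -o foo.o

    # Test binaries: a plain native link, no LTO at all
    clang -fuse-ld=lld foo.o bar.o -o SomeUnitTest

    # The embedded bitcode (and embedded command line) can be stripped if size matters
    llvm-objcopy --remove-section=.llvmbc --remove-section=.llvmcmd SomeUnitTest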

We also have this problem, and we are considering using -fembed-bitcode: https://crbug.com/1196260

Interesting. I have a comment/question on the v2 suggestion listed there, will add a comment to the bug.

Is Chromium using distributed ThinLTO? I can’t recall if that has been enabled yet. If so, there are other ways to reduce the time for test targets when building with ThinLTO (along the lines of what we do for certain statically linked test targets in Google’s internal builds).
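(For context, the distributed ThinLTO flow I’m referring to looks roughly like this; flags shown for lld:)

    # 1. Normal ThinLTO compiles
    clang -flto=thin -O2 -c foo.c bar.c

    # 2. Thin link that only writes per-module index files (foo.o.thinlto.bc, ...)
    clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only foo.o bar.o -o /dev/null

    # 3. Per-module backend compiles that can be distributed across build machines
    clang -O2 -x ir foo.o -c -fthinlto-index=foo.o.thinlto.bc -o foo.native.o
    clang -O2 -x ir bar.o -c -fthinlto-index=bar.o.thinlto.bc -o bar.native.o

    # 4. Final native link
    clang -fuse-ld=lld foo.native.o bar.native.o -o output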