[RFC] A Unified LTO Bitcode Frontend

Hi Everyone,

I’ve been working on a usability improvement for LTO workflows. We (Sony) have
been using this in production for some time and we’d like to offer it here as a
non-default LTO configuration option for clang.

How it works

Currently, LTO mode is chosen during the pre-link phase and can’t
be changed at a later stage. Thin and Full LTO bitcode may share a binary
format (LLVM bitcode), but are explicitly made incompatible. When summaries
were added to Full LTO, they were given a different name, to ensure that they
were never confused with ThinLTO summaries.

Our LTO pipeline creates a single LTO bitcode structure that can be used by
Thin or Full LTO. This means that the LTO mode can be chosen at link time. In
addition, this means that all LTO bitcode is compatible, from an optimization
perspective. Currently, if a compilation has both Thin and Full LTO bitcode
files, they will be optimized separately with no information shared between the
Full and Thin backends. Since the internal structure of a bitcode file isn’t
visible to most users, this may not be the expected behavior. This
compatibility also means that deploying bitcode libraries can be a simpler
process. A normalized set of features over all bitcode files helps to ensure
that users get the optimizations they expect when these libraries are included.

We implemented this feature by making every LTO module identical in structure
to a split ThinLTO bitcode module. We then use the Full LTO pipeline for
pre-link optimisation. This allows for maximum optimisation and compatibility.
It also, as you’ll see below, leads to increased file sizes. This is due to the
fact that type information is always available when using this scheme, which
means that more split modules are created.
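
As a rough illustration of the workflow described above, the sketch below builds (but does not run) command lines for one compile step and two alternative link steps. Only `-funified-lto` is named in this thread; pairing it with clang's existing `-flto`, `-flto=thin`, and `-flto=full` options in this way is an assumption, as are the file names and both helper functions, so treat it as a sketch rather than the exact interface of the patches.

```python
def compile_cmd(source, obj):
    # Pre-link step: identical for every translation unit, regardless of
    # which LTO mode is used later (assumed flag combination).
    return ["clang", "-c", "-flto", "-funified-lto", source, "-o", obj]


def link_cmd(objects, output, mode):
    # Link step: the LTO mode is picked here, not at compile time.
    if mode not in ("thin", "full"):
        raise ValueError(f"unknown LTO mode: {mode}")
    return ["clang", f"-flto={mode}", "-funified-lto", *objects, "-o", output]


# Hypothetical file names, for illustration only.
objects = ["a.o", "b.o"]
print(" ".join(compile_cmd("a.c", "a.o")))
print(" ".join(link_cmd(objects, "app-thin", "thin")))
print(" ".join(link_cmd(objects, "app-full", "full")))
```

The point of the sketch is that the same object files feed both link commands; only the mode passed at link time differs.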

Performance

This feature does come with a build time cost. Here are WebKit build times when
using ThinLTO, comparing the unified pipeline with the existing (distinct) one.

Configuration   Run #   Build Time (s)
Unified         0       2866.34
Unified         1       2868.64
Unified         2       2872.39
Distinct        3       2831.95
Distinct        4       2826.11
Distinct        5       2830.02

Unified AVG (s):  2869.12
Distinct AVG (s): 2829.36
% Diff:           1.40%

This also comes with a slight increase in file size when compared to a standard
ThinLTO build. This is because loops are unrolled and some vectorization also
occurs.

Unified: 376878.7998
Distinct: 373361.0738
% diff: 0.938%

When compared to a normal Full LTO build, the difference is negligible.

Unified: 376878.7998
Distinct: 376878.8841
Diff: -0.084303938
% diff: 0.000%
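
For anyone wanting to check the arithmetic, here is a small sketch that recomputes the averages and percentages from the figures above. The "% diff" values appear to be symmetric percentage differences (the gap divided by the mean of the two values); that reading is inferred from the numbers and is not stated in the post. The function and variable names are illustrative.

```python
from statistics import mean


def percent_diff(a, b):
    """Symmetric percentage difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


# Build times in seconds, copied from the table above.
unified_runs = [2866.34, 2868.64, 2872.39]
distinct_runs = [2831.95, 2826.11, 2830.02]
unified_avg = mean(unified_runs)
distinct_avg = mean(distinct_runs)
print(f"build time: {unified_avg:.2f} vs {distinct_avg:.2f}, "
      f"diff {percent_diff(unified_avg, distinct_avg):.2f}%")

# File sizes, copied from the figures above (units as reported in the post).
unified_size = 376878.7998
thin_size = 373361.0738
full_size = 376878.8841
print(f"vs ThinLTO:  {percent_diff(unified_size, thin_size):.3f}%")
print(f"vs Full LTO: {percent_diff(unified_size, full_size):.3f}%")
```

With this definition the script reproduces the reported 1.40%, 0.938%, and 0.000%. Measuring relative to the distinct baseline instead would give roughly 1.41% and 0.942% for the first two.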

Conclusion

We’ve been using this feature in production for some time and it’s been stable.
We’ve had the opportunity to work out the kinks (mostly related to symbol
resolution). At this point, we’d like to get some feedback on contributing this
system. We think it could be a useful workflow for others in the project.
Comments and feedback welcome. I’ve posted the patch in three parts, with
one diff for each project changed (clang, llvm, lld).

Patches:
https://reviews.llvm.org/D123803
https://reviews.llvm.org/D123804
https://reviews.llvm.org/D123805

Thanks,
Matt

Hi Matt,

Thanks for sending this! I remember discussing this with you a few years back after you presented at the LLVM Dev Meeting. Since it has been a while and my memory of the details is fuzzy, I wanted to get clarification on one piece you mention:

making every LTO module identical in structure to a split ThinLTO bitcode module

Can you clarify this and the implications? For CFI, for example, I know that the module is typically split into a regular LTO module containing the vtable defs and a few other things, and a ThinLTO module containing the rest. Then the LTO link does a combination of regular and thin LTO. What is the implication for the LTO link performed in your model when all the modules are split? Why is that desirable?

Thanks,
Teresa

Hi Teresa,

In order for bitcode files to provide uniform features and be compatible with each other, every module must have split LTO units enabled. This doesn’t mean that all bitcode files contain two modules, but rather that the module flag is always set.

Thanks,
Matthew

Hi Matt

I remember the discussion we had during the dev meeting about the schema. I remember we all agreed this format is very doable, and the main concern was the build time and runtime impact of merging two currently different pipelines.

I see you collected the build time and size impact for one project (WebKit), but it would be good to see more data. Runtime performance data will be a very important metric too.

From a quick peek at the patch, it seems you are creating a new LTO mode in clang, so I am less concerned about compatibility with Full LTO or ThinLTO, but it would still be good to have more numbers so we can guide users on the pros and cons of the different LTO modes.

Steven

Hi Steven,

I see you collected the build time and size impact for one project (WebKit), but it would be good to see more data. Runtime performance data will be a very important metric too.

Running this on a couple of other open-source projects wouldn’t be a bad idea. Clang is the first project that comes to mind, but I’m open to other suggestions.

Thanks,
Matt

I’m surprised that there isn’t more impact on the compile time (phase 1) of ThinLTO when you add all the optimization passes; it used to be significant (like a double-digit percentage).
Also: this means that all of the “late optimizations” would run before ThinLTO can consider inlining, which could change the heuristics used there (basically, it is somewhat like running the late passes that normally run post-inlining inside the CGSCC phase).
Am I missing something in what you’re doing?

Hi Mehdi,

I’m surprised that there isn’t more impact on the compile time (phase 1) of ThinLTO when you add all the optimization passes; it used to be significant (like a double-digit percentage).

Same here, honestly. This was something we looked at early on. All we’re doing is using the regular LTO frontend passes. We were also concerned about phase ordering issues, but the performance of the executables we compiled was the same or better.

Thanks,
Matt

In order for bitcode files to provide uniform features and be compatible with each other, every module must have split LTO units enabled. This doesn’t mean that all bitcode files contain two modules, but rather that the module flag is always set.

Perhaps I just need to look more closely at the patches, but I’m still a little confused about this. How are split modules which currently have a regular and a thin LTO module handled in the “unified” mode?

How are split modules which currently have a regular and a thin LTO module handled in the “unified” mode?

When we’re using the ThinLTO backend, we can handle all modules as usual. No customization required. When we’re using the regular LTO backend, both modules are interpreted as regular LTO modules. With some small changes to internalization and metadata, they link together without issue. The regular LTO compile then proceeds as normal.
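
To make the two cases concrete, here is a toy model of how the parts of a split module might be routed depending on the backend chosen at link time. It only mirrors the description in this thread (including the later confirmation that the ThinLTO backend gives a "hybrid" link); the function name, the part names, and the string labels are hypothetical and do not correspond to LLVM APIs.

```python
def plan_link(parts, backend):
    """Map each part of a split module to the pipeline that handles it.

    parts:   list of (name, kind) pairs, where kind is "regular" or "thin".
    backend: "thin" or "full", chosen at link time.
    """
    if backend == "full":
        # Regular LTO backend: every part is treated as a regular LTO module.
        return {name: "regular" for name, _ in parts}
    if backend == "thin":
        # ThinLTO backend: each part keeps its kind, giving a hybrid link.
        return {name: kind for name, kind in parts}
    raise ValueError(f"unknown backend: {backend}")


# Hypothetical split module: vtable definitions in the regular part,
# everything else in the thin part.
split_module = [("foo.o:vtables", "regular"), ("foo.o:rest", "thin")]
print(plan_link(split_module, "thin"))
print(plan_link(split_module, "full"))
```

The model leaves out the internalization and metadata adjustments mentioned above, which are what make the regular LTO case link cleanly in practice.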

ormris, April 15:

How are split modules which currently have a regular and a thin LTO module handled in the “unified” mode?

When we’re using the ThinLTO backend, we can handle all modules as usual.

Does this mean that both “split” parts are built through the backend with pure ThinLTO, or that you get “hybrid” LTO as happens today with split LTO modules?

We do not use split modules internally because of scalability issues with the regular LTO split portion.

Does this mean that both “split” parts are built through the backend with pure ThinLTO, or that you get “hybrid” LTO as happens today with split LTO modules?

You get “hybrid” LTO. We wanted some of the features that mode offers.

ormris, April 15:

Does this mean that both “split” parts are built through the backend with pure ThinLTO, or that you get “hybrid” LTO as happens today with split LTO modules?

You get “hybrid” LTO. We wanted some of the features that mode offers.

I see. Unfortunately, that’s not something we were able to use successfully, which is why, for example, I added the non-split LTO unit mode and an index-only version of WPD.

I see. Unfortunately, that’s not something we were able to use successfully, which is why, for example, I added the non-split LTO unit mode and an index-only version of WPD.

Makes sense. CFI was a big reason we chose hybrid mode. Given the size of our projects, the performance is acceptable.

I spent some time looking through the clang and llvm patches and have a few comments there that I will send shortly. However, I wanted to discuss here a higher level question.

Right now the patch’s new UnifiedLTO model implies 2 things:

  1. A unified optimization pipeline.
  2. Split LTO units.

AFAICT, these are orthogonal. I.e. I believe the unified optimization pipeline mode could just as easily be used with non-split ThinLTO. Is that correct? Also, if you are using CFI, that already implies/requires split LTO units, so it seems like the change to force enable that with unified LTO shouldn’t be needed in your case (and could always be enabled under its separate -fsplit-lto-unit option in any case). Additionally, the llvm change made to include more symbols in the module hash to get better splitting is currently gated on the UnifiedLTO mode, but that also seems like it should be orthogonal - why not always do that? It seems like that would help the non-unified default pipeline case with split LTO (e.g. for CFI) as well.

Can the split LTO unit changes from the clang driver be removed from the unified LTO handling, and the module hash changes be sent as a different patch that affects whenever a split ThinLTO unit is being generated?

Thanks,
Teresa

Hi Teresa,

Yes, the unified optimization pipeline and the split LTO units are orthogonal. But we provide pre-built bitcode libraries to our customers, and so for us to support WPD and CFI, we need to enable them both. With that in mind, we could redo the patch series as follows:

  1. Unified Pipeline in LLVM
  2. Unified Pipeline in Clang+LLD
  3. ModuleID changes
  4. SIE-specific settings (split LTO units, etc)

This would preserve the compatibility of bitcode files on our platform, while allowing for non-split units.

Thanks,
Matthew

But we provide pre-built bitcode libraries to our customers, and so for us to support WPD and CFI, we need to enable them both.

AFAIK WPD and CFI options both imply -fsplit-lto-unit by default already, so I’m still not sure why the unified LTO option would need to additionally enable that. Sorry if I am missing something.

SIE-specific settings (split LTO units, etc)

With that change, would you still need the unified LTO option to enable split LTO units?

AFAIK WPD and CFI options both imply -fsplit-lto-unit by default already, so I’m still not sure why the unified LTO option would need to additionally enable that

My understanding is that the “EnableSplitLTOUnit” module flag needs to be set consistently. In addition, compiling a partially split set of bitcode files can result in an error from the backend. Since we know that some (but not all) users want CFI/WPD, we need bitcode libraries that work whether or not these flags are enabled. This means that we need to enable split LTO units in our libraries, as this is the only setting that works in both situations. Since this flag needs to be consistent, we then need to enable it by default so that all bitcode modules are compatible with our bitcode libraries.
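
The consistency argument can be modelled in a few lines. This is a deliberately simplified sketch: it treats any mix of split and non-split modules as an error, whereas the post only says a partially split set can produce a backend error. The class, the function, and the file names are hypothetical and are not LLVM APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitcodeModule:
    name: str
    split_lto_unit: bool  # stand-in for the module flag discussed above


def check_split_consistency(modules):
    """Return the shared setting, or raise if the modules disagree."""
    settings = {m.split_lto_unit for m in modules}
    if len(settings) > 1:
        detail = ", ".join(f"{m.name}={m.split_lto_unit}" for m in modules)
        raise ValueError(f"mixed split LTO unit settings: {detail}")
    return settings.pop() if settings else None


# A prebuilt library shipped with split LTO units enabled.
prebuilt_lib = BitcodeModule("libvendor.a", split_lto_unit=True)
user_split = BitcodeModule("user_cfi.o", split_lto_unit=True)
user_plain = BitcodeModule("user_plain.o", split_lto_unit=False)

print(check_split_consistency([prebuilt_lib, user_split]))
try:
    check_split_consistency([prebuilt_lib, user_plain])
except ValueError as err:
    print(err)
```

Under this model, a library built with split units only links cleanly with user code that also enables them, which is why the setting ends up being enabled by default on the platform.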

ormris, April 18:

AFAIK WPD and CFI options both imply -fsplit-lto-unit by default already, so I’m still not sure why the unified LTO option would need to additionally enable that

My understanding is that the “EnableSplitLTOUnit” module flag needs to be set consistently.

Yes, that is true.

In addition, compiling a partially split set of bitcode files can result in an error from the backend. Since we know that some (but not all) users want CFI/WPD, we need bitcode libraries that work whether or not these flags are enabled. This means that we need to enable split LTO units in our libraries, as this is the only setting that works in both situations. Since this flag needs to be consistent, we then need to enable it by default so that all bitcode modules are compatible with our bitcode libraries.

Since presumably you would need to force enable the new -funified-lto added in this patch to get split units enabled consistently, can you just force enable both -funified-lto and -fsplit-lto-unit, and leave the behaviors distinct?

can you just force enable both -funified-lto and -fsplit-lto-unit, and leave the behaviors distinct?

Yes, that was my plan.

ormris, April 18:

can you just force enable both -funified-lto and -fsplit-lto-unit, and leave the behaviors distinct?

Yes, that was my plan.

Ah - got it! That sounds great, thanks.