[RFC] -ffat-lto-objects support

LLVM users currently have to choose at build time whether to generate object code or bitcode. Only having these options can be limiting since translation units can often be shared between many targets. These different targets may have different requirements in terms of compile time or performance. For instance, the benefits of LTO might be worth the overhead for main targets whereas this might not be the case for the tests.

GCC allows GIMPLE bytecode to be saved alongside object code if the -ffat-lto-objects option is passed. This “fat” object format allows one to build one set of fat objects which could be used for targets with different requirements since the decision to use LTO can be made at link time.

Our goal is to implement support for emitting bitcode alongside object code in LLVM and use that support to implement the -ffat-lto-objects option in Clang. Doing so would bring Clang to parity with GCC. We will initially focus on ELF, but this support could be extended to other formats later.

The difference between the existing -fembed-bitcode option and -ffat-lto-objects is the pass manager configuration; the rest should be the same. We will need a new LLD option to choose whether to perform traditional linking or LTO.

An open question is whether to reuse the existing .llvmbc section or to introduce a new section. The purpose of the .llvmbc section has already been discussed in D116995 ([gold] Ignore bitcode from sections inside object files), and it’s not clear whether this section is intended for LTO or not, and whether its use for -ffat-lto-objects would be appropriate.

How does this relate to this work: [RFC] A Unified LTO Bitcode Frontend?

Can you describe better the rationale for why this is necessary? Building two separate output files, .lto.o with bitcode, and a separate .o with native objects, and depending on the correct one from the correct target doesn’t seem super-tricky, and seems generally preferable.

IMO, the Unified LTO format should be considered a hard prerequisite for this, and always enabled for this mode.

I think the correct path forward now is to deprecate and remove all of the existing command-line options and related infrastructure which currently deals with non-LTO embedded bitcode.

I say that because:

  • The current options and sections were originally intended for – and only really worked properly for – embedded bitcode for non-LTO re-compilation of the original object files (e.g. Apple’s App Store, where Apple can rebuild existing IR with a modified backend configuration).
  • Apple is, to my knowledge, the only entity that has actually used this.
  • Apple announced that embedded bitcode is no longer supported by their appstore infrastructure, and nobody should be uploading embedded bitcode anymore.

Once we’ve dropped all the existing stuff, we can consider and introduce new section names and new options for doing LTO-targeted fat bitcode, without any of the current baggage.

  • Apple is, to my knowledge, the only entity that has actually used this.

No, we use the 2 sections in MLGO to collect the pre-optimization “shape” of the IR and, respectively, the command line options that we can use to re-invoke clang. We use the 2 sections in both non-LTO and ThinLTO cases (see LTOBackend.cpp, lto-embed-bitcode and thinlto-assume-merged, and related tests).

We are happy to move to an alternative as long as it supports the 2 scenarios:

  • non LTO: have somehow available the pre-optimization IR and the command line options used
  • thinlto: same, but post-merge, pre-opt

All lld ports supporting LTO decide upfront whether an input file is a native object file or an LLVM bitcode file. I know LLVMgold.so somehow postpones the decision, but postponing it really inconveniences some linker performance optimizations (not optimizations for the output). Supporting something like -ffat-lto-objects or embedded bitcode would require very intrusive changes. I want to see very solid arguments for it to be considered.

I am uncertain -ffat-lto-objects is a desired interface. Such a change can be at the build system layer.

I agree with @jyknight that all the existing command line options and related infrastructure which currently deals with non-LTO embedded bitcode should probably be deprecated and removed. If MLGO uses embedded bitcode somehow, there needs to be an alternative.

I think the hard split means ld --relocatable with a mix of LTO and non-LTO inputs has to drop the LTO information.

ld.lld -r supports bitcode files. The LTO compilation is conservative and retains all non-local symbols (VisibleToRegularObj).

The MLGO requirements are relatively simple: we want 1) the command line used to invoke clang, and 2) the IR pre any optimization. In the case of thinLTO, we want the pre-optimization but post-merging IR. This allows us to collect a corpus and then replay it as follows:

  • build a target, adding to compilation the additional flags to embed bitcode
  • run a tool that goes over every .o file and scrapes out the .llvmbc and .llvmcmd sections, creating files with the same relative name as that of the .o file.
  • now we know that if we re-run clang on an IR file obtained above, with a command line based on the cmd file associated with it, we can replicate the optimizations that happened originally; so to that command line we can append the training flags we need that exercise a particular model and produce a log; extract metrics (like regalloc scores) and contrast them with those obtained under “default” compilation; etc.
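
The scrape-and-replay steps above can be sketched roughly as follows. This is only an illustration, not the actual MLGO tooling: the helper names `dump_argv`, `parse_cmd_section` and `replay_argv` are hypothetical, the .llvmcmd payload is assumed to be a NUL-separated argument list, and a real tool would also have to rewrite the original input and output arguments before replaying.

```python
from pathlib import Path


def dump_argv(obj, out_dir, objcopy="llvm-objcopy"):
    """Build an llvm-objcopy command that dumps both embedded sections.

    `obj` is assumed to be a path relative to the build root, so the
    dumped files keep the same relative name under `out_dir`.
    """
    stem = Path(out_dir) / Path(obj).with_suffix("")
    return [
        objcopy,
        f"--dump-section=.llvmbc={stem}.bc",
        f"--dump-section=.llvmcmd={stem}.cmd",
        str(obj),
        "/dev/null",  # discard the copied object; only the dumps matter
    ]


def parse_cmd_section(raw):
    """Split the dumped .llvmcmd bytes into arguments (assumed NUL-separated)."""
    return [arg.decode() for arg in raw.split(b"\x00") if arg]


def replay_argv(clang, raw_cmd, ir_file, training_flags):
    """Compose a replay command: recorded args, then training flags, then the IR."""
    return [clang, *parse_cmd_section(raw_cmd), *training_flags, str(ir_file)]
```

Each list returned by `dump_argv` can be handed to `subprocess.run`; since nothing here consults the build system, the corpus can be collected and replayed independently of it, which is the property described above.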

Key thing is, all training can happen detached from any build system; and, in the case of thinlto, because it’s post-merge, without needing to understand “merging”: all that training cares about is that it can run clang on some files in a canonical way.

The current feature has extras we don’t need, for instance, the command line passed to the driver is filtered. That doesn’t hurt us much - we so far typically dealt with build systems invoking clang -cc1 - but it doesn’t help us either, and I bet it’s a source of the complexity @jyknight and @MaskRay mentioned. So, for instance, if we ended up being the only inheritors of Apple’s original feature, we would be happy to lower its complexity by removing that part of the code.

Another annoyance is that the .llvmbc and .llvmcmd can’t be associated with each other post-link: so if they survive linking, we’d have no way to tell which .llvmcmd corresponded to which .llvmbc.

Finally, the .llvmbc is uncompressed => linking large binaries just got harder. We can skip linking, but… seems kind of hacky.

We do like that we can satisfy our scenario via sections (rather than extra output files), mainly because this greatly simplifies our scenario on hermetic build systems (bazel-based).

(needless to say, happy to collaborate on evolving / changing these sections and/or their internals)

I would rather have a design where the LLVM layer can be set up totally independently of clang: that is, LLVM shouldn’t need to know whether we’re compiling IR generated by clang or Rust to set up an optimizer/codegen.

The primary use case we have in mind for -ffat-lto-objects is improving build performance when sharing translation units between distribution and test binaries. For example, in our build of the Clang distribution, which uses LTO, having to perform LTO for binaries only used by tests doubles the total test time.

I agree that this could in theory be addressed through the build system, but it’s not something I’m particularly keen on implementing. It would likely require an extra layer of indirection to choose either the native or LTO target for every dependency, and this selection would need to be done transitively.

I know of a way to implement this idea in a build system like GN, since we already use a similar approach elsewhere in Fuchsia, but it’s one of the most complicated parts of our build: few people understand it and it is a source of significant maintenance overhead. I don’t know yet of a way to implement this in CMake; there might be one, but it would likely be non-trivial.

The biggest advantage of supporting this feature directly in the compiler and the linker is simplicity and convenience. Doing this inside the build system is going to require a lot of complexity that’s going to be duplicated across every project that wishes to use it, and it would be difficult to maintain.

Supporting -ffat-lto-objects would improve compatibility with GCC and ICC which already implement this option. I’ve done a quick search and found a number of projects that already use this feature (including Linux). It appears to be especially common in the embedded domain.

Using -ffat-lto-objects is also potentially more efficient since you don’t need to parse the code twice (although the potential benefit depends on the input and its complexity).

I agree that [RFC] A Unified LTO Bitcode Frontend should be a prerequisite for this feature.

The last I checked, it compiled the bitcode (instead of deferring). GCC was able to defer the compilation to the target code in my experiments.

I am currently working on enabling embedding bitcode for the purpose of LTO on AIX, and was planning on posting a similar RFC to spur the discussion of upstreaming and eventually enabling for ELF as well. I’ve posted a draft patch of the AIX enablement, D130777 (Enable embedded lto for XCOFF). I am glad to see I am not the only one interested in embedded-lto.

My motivation for adding this feature is compatibility with the system compiler on AIX. The xlc compiler emits fat objects by default for LTO, and some users of xlc have build environments designed around the fact that they can produce both formats in a single file and use each as appropriate. In those cases trying to restructure the build to migrate them to llvm is prohibitively expensive, and even if restructuring the build is feasible, a typical observation is that it works with both xlc and gcc, so clang’s feature set is deficient, not their build environment.

This is exactly what I am experiencing with trying to migrate users that use LTO to llvm. The customer I am working with now builds hundreds of archives, and then those are linked against some basic functionality testing that will stop the build if it fails, while selectively using LTO on some targets when building shared objects and executables out of the same archives. Enabling LTO for everything is way too time consuming, and restructuring their complex build environment to build both and selectively link against one or the other depending on context is too difficult to implement now.

On AIX we have to pass the linker an option to indicate we are doing an LTO link, and bitcode files are unrecognized otherwise. The nice thing is that we don’t pay any cost for having to look for embedded bitcode in native links, and LTO links are usually long enough that the extra overhead of looking for the special sections in the XCOFF files used in an LTO link is almost unnoticeable. With LLD, would it be acceptable to not bother looking for embedded bitcode by default, and only do so with an extra option? In that case the overhead becomes opt-in. I’m not sure how the other ELF linkers behave with respect to fat objects, so I’m not sure whether it’s a divergence from their behaviour.

I used -fembed-lto to mimic the existing clang option for embedding bitcode, and expected to alias -ffat-lto-objects to it. If we are OK with deprecating and removing -fembed-bitcode, though, I’ll change my patch to use -ffat-lto-objects instead.

I’ll have to get caught up on this.

Adding an option to recognize a section which contains BitcodeFile is fine. Driver.cpp needs to construct a BitcodeFile (with offset/size information locating the bitcode content) instead of ObjFile<...>.
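
As a rough illustration of the offset/size lookup mentioned above (this is not LLD’s code, and `find_section` is a hypothetical helper), the sketch below walks the section header table of a little-endian ELF64 file and returns where a named section’s contents live. The section name is a parameter because the thread has not settled on one; extended section numbering is deliberately ignored.

```python
import struct

# Elf64_Ehdr and Elf64_Shdr layouts (little-endian), 64 bytes each.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
SHDR = struct.Struct("<IIQQQQIIQQ")


def find_section(data, name):
    """Return (file offset, size) of the section called `name`, or None.

    Assumes a little-endian ELF64 image in `data` and does not handle
    extended section numbering (e_shnum == 0 or SHN_XINDEX).
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    fields = EHDR.unpack_from(data, 0)
    shoff, shentsize, shnum, shstrndx = fields[6], fields[11], fields[12], fields[13]

    def shdr(index):
        return SHDR.unpack_from(data, shoff + index * shentsize)

    # sh_offset of the section-name string table.
    strtab_off = shdr(shstrndx)[4]
    for index in range(shnum):
        sh = shdr(index)
        start = strtab_off + sh[0]          # sh_name is an offset into the table
        end = data.index(b"\x00", start)    # names are NUL-terminated
        if data[start:end].decode() == name:
            return sh[4], sh[5]             # sh_offset, sh_size
    return None
```

A linker taking the opt-in route would only run a lookup like this when the LTO option is given, and would otherwise treat the file as a plain native object.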

We met with @mtrofin to discuss the requirements for MLGO. We think that trying to combine -fembed-bitcode and -ffat-lto-objects isn’t the right strategy. Even though both of these flags have similar goals, that is, embedding IR inside the object file, there are also important differences, specifically the point in the pipeline at which we take the IR. Trying to satisfy all of the different requirements with a single flag is likely going to lead to unnecessary complexity.

Our proposal is to keep -fembed-bitcode but evolve it to better meet the needs of MLGO. The first step would be to drop -fembed-bitcode from the driver and keep it only as a cc1 flag. That’s going to eliminate a lot of the complexity related to flag filtering. After that, we can start evolving -fembed-bitcode, implementing changes such as combining .llvmbc and .llvmcmd into a single section, support for compression, etc.
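
To make the “single section plus compression” idea concrete, here is one possible record layout, sketched purely as an assumption: the `MLGO` magic, the header fields and the helper names are all hypothetical and not a proposed format. Each record carries its own command line next to its compressed bitcode, so records concatenated into one output section can still be paired up after linking (assuming no padding is inserted between them).

```python
import struct
import zlib

MAGIC = b"MLGO"                   # hypothetical record marker
HEADER = struct.Struct("<4sII")   # magic, command-line length, compressed length


def pack_record(cmd_args, bitcode):
    """Serialize one translation unit's command line and bitcode together."""
    cmd = b"".join(arg.encode() + b"\x00" for arg in cmd_args)
    compressed = zlib.compress(bitcode)
    return HEADER.pack(MAGIC, len(cmd), len(compressed)) + cmd + compressed


def unpack_records(section):
    """Yield (cmd_args, bitcode) for every record in a concatenated section."""
    pos = 0
    while pos < len(section):
        magic, cmd_len, bc_len = HEADER.unpack_from(section, pos)
        if magic != MAGIC:
            raise ValueError(f"bad record header at offset {pos}")
        pos += HEADER.size
        cmd = section[pos:pos + cmd_len]
        pos += cmd_len
        bitcode = zlib.decompress(section[pos:pos + bc_len])
        pos += bc_len
        # Every argument is NUL-terminated, so the last split piece is empty.
        yield [arg.decode() for arg in cmd.split(b"\x00")[:-1]], bitcode
```

Because each record is self-describing, a reader never has to guess which command line belongs to which module, which is the association problem raised earlier in the thread.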

Separately, we are also prototyping a -ffat-lto-objects implementation in D131618 ([WIP][Do NOT review] LLD related changes for -ffat-lto-objects support), although that change is not yet ready for review (it also depends on Unified LTO, which hasn’t landed yet).

If we can drop -fembed-bitcode as a driver flag ASAP, that’ll be great. It’d remove potential user-confusion between the features, and give us the ability to refactor/simplify the internals for MLGO’s use-case over the next months without impacting a publicly-exposed feature.

(BTW, I think it’d be most useful not to frame this proposal as “keep -fembed-bitcode”. The proposal here is to remove this feature, only temporarily preserving some of the underlying implementation as a basis for the implementation MLGO actually wants.)

I’ll look at removing it from the driver next week. I think we’d want a staged approach, like first warning if it’s used from the driver, on the off chance there are users who weren’t represented here.

I agree that [RFC] A Unified LTO Bitcode Frontend should be a prerequisite for this feature.

I’m not sure I understand why the Unified LTO frontend needs to be a prerequisite? They seem like orthogonal aspects to me. I.e. whether to have a fat object vs what type of LTO was requested. It will be clear from the summary type what type of LTO pipeline was used to produce the IR in the fat object (full LTO, thin LTO, unified LTO), and the LTO backend can proceed accordingly. What am I missing?