RFC: Addressing Deficiencies in LLVM’s LTO Implementation

Link-Time Optimization (LTO) has been supported in LLVM for a long time, and for most user-level applications it functions well. However, LTO has many deficiencies with respect to linker semantics, which show up more frequently in kernel or embedded code bases. For now, we will focus only on ELF linking and how LTO breaks many of the subtle but load-bearing cases for ELF. All of the issues we discuss here affect Full LTO, though several also impact Thin LTO.

In this post we will highlight several cases where things break down, many of which are well known to the community. Where possible, we’ve linked the related issues in the GitHub issue tracker and tried to provide concrete examples for discussion. Our goal is to bring these issues to the broader community and hopefully agree on the next steps for addressing them. In a few cases, there has already been significant discussion or prior RFCs on how to address the issue. We’d like to avoid rehashing all of that here, but we also don’t want to ignore those issues, since they are at least somewhat related and in some cases may share a root cause. We’ve specifically called out issues where we think a dedicated discussion or new RFC is more appropriate, and we hope to defer detailed discussion of those particular issues until then.

For a few cases, we’ve outlined steps we think may address the problem and provided details about our rationale. Based on feedback here, we plan to write dedicated RFCs for these as appropriate, but given the degree of overlap between these problems, it seems prudent to first have a broader discussion about some of our design and implementation decisions regarding LTO.

Problems

Incomplete IR Symbol Table

Problem 1. Because the compiler can introduce new symbol references (e.g. function calls) through libcalls or other optimizations (e.g. replacing a call to memcmp with a call to bcmp), it is possible for the compiler to delete a function through DCE only to later reintroduce a call to that function. This happens more frequently when libraries like libc or compiler-rt are provided as bitcode, since they frequently supply the symbols used by libcalls or that the compiler otherwise assumes knowledge of. Similar problems can arise when a symbol’s ABI is changed (for example during partial inlining) or when a symbol is misclassified as safe to internalize (for example because it is referenced by something lazily brought into the link). A related problem occurs when the main binary provides a definition for a symbol brought in lazily from an archive, but that definition was DCE’d.

An example of this can be found in Lazy files extracted post LTO compilation might reference other lazy bitcode files, leading to incorrect absolute defined symbols · Issue #127284 · llvm/llvm-project · GitHub and some related tests in [ELF][LTO] Add baseline test for invalid relocations against runtime calls by arichardson · Pull Request #127286 · llvm/llvm-project · GitHub.

As another example, consider the case where your libc is participating in LTO. This means that symbols like memcmp and bcmp are provided as bitcode. If during the link all the calls to bcmp are inlined and bcmp is marked as internal, then GlobalDCE will remove bcmp’s definition. This is the right thing to do, except that later passes like InstCombine can transform a call to memcmp into a call to bcmp, which can no longer be satisfied, since its definition has been deleted. Here, neither memcmp nor bcmp are libcalls; they are instead classified as LibFuncs, which means they are not treated the same way as other functions the compiler could emit calls to later in the optimization pipeline.
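The memcmp-to-bcmp hazard above can be sketched with a tiny TU. This is illustrative only: whether the rewrite fires depends on the target and pass pipeline, and the function name here is made up.

```c
#include <string.h>

/* If libc participates in LTO and bcmp's definition was removed by GlobalDCE
 * (all its callers inlined, then internalized), InstCombine may still rewrite
 * this equality-only memcmp into a call to bcmp, leaving a reference that no
 * longer has a definition in the merged module. */
int keys_equal(const void *a, const void *b, unsigned long n) {
    return memcmp(a, b, n) == 0; /* may be lowered to: bcmp(a, b, n) == 0 */
}
```

The rewrite is legal precisely because only the zero/non-zero result is used, which is why it can appear arbitrarily late in the pipeline.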

Problem 2. Any libcall in a bitcode archive is eagerly extracted and code-generated, thereby breaking traditional archive semantics, because symbols that should never participate in the link are now brought in. This can lead to undefined symbols being pulled into shared libraries, for example.

A concrete example of this is building a shared library from a single TU with no unresolved symbol references (i.e. the TU provides all the definitions it requires itself, which could be a single function), where libunwind is one of the linker inputs, provided as a bitcode archive via --as-needed -lunwind --no-as-needed, meaning it should only participate in the link if required. In the course of normal ELF linking, the archive is never consulted, since there are no unresolved symbols needed by the shared library. However, we found that when libunwind is bitcode, libcalls like _Unwind_Resume are eagerly extracted from the archive, which in turn brings in several other symbols, none of which were needed by the original library and which should not have participated in the link under normal ELF archive semantics.
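A minimal reproduction of that scenario is a TU like the following (the function name and the link line in the comment are illustrative, not taken from the original report):

```c
/* A single-TU "shared library" with no unresolved symbol references.
 * Linked as, e.g.:
 *   clang -shared -flto lib.c -Wl,--as-needed -lunwind -Wl,--no-as-needed
 * normal ELF archive semantics say the libunwind archive is never consulted,
 * since nothing here is undefined. With libunwind provided as bitcode,
 * however, unwinder libcalls were eagerly extracted anyway. */
int the_answer(void) {
    return 42; /* fully self-contained: no calls, no unwinding */
}
```

Nothing in this TU can throw or unwind, so any libunwind member appearing in the output is purely an artifact of the eager libcall extraction.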

A related problem was brought up in [RFC][LTO] Handling math libcalls with LTO, where Joseph Huber outlines issues they run into using LTO on the GPU, since all math functions from libm are considered libcalls. This forces code generation for a significant number of functions that are never used or referenced, and those functions often survive into the final binary. Since libcalls are introduced late in codegen, they also never get a chance to be inlined.

Problem 3. Inline assembly can generate symbols that won’t be in the LTO symbol table. This is an issue because the compiler/linker could mistakenly assume that no reference to the symbol exists. There are at least a few instances of this issue: LTO scan of module-level inline assembly does not respect CPU · Issue #67698 · llvm/llvm-project · GitHub and Incorrect branch targets in RISC-V executables built with LTO · Issue #65090 · llvm/llvm-project · GitHub

Target Features and ABI

Problem 4. Module level inline assembly is parsed to search for symbol references, but the correct target features may not be used.

Under LTO, there isn’t a good mechanism for ensuring target features are used correctly at link time. This has been particularly problematic for RISC-V, which makes heavy use of these features. Patches like https://github.com/llvm/llvm-project/pull/73721, [RISCV] Use 'riscv-isa' module flag to set ELF flags and attributes. by topperc · Pull Request #85155 · llvm/llvm-project · GitHub, and https://github.com/llvm/llvm-project/pull/100833 are good examples of ways we’ve tried to address this issue.

While a lot of these problems show up in RISC-V, they are potentially an issue for all target architectures.

Module Boundaries

Problem 5. Full LTO breaks file globbing rules in linker scripts, since the merged module is only operated on via a monolithic LTO.o file, while the globbing rules are written to reference the original object files. This can be worked around via section attributes, but it is not always feasible to migrate a codebase to use these attributes. Previous RFCs on the topic: [RFC] (Thin)LTO with Linker Scripts and LTO with Linker Scripts.

Potential Solutions

Fixing the Symbol Table

Problems 1-3 can be considered to have the same root cause: the symbol table is not fixed when performing LTO, but LTO code generation needs to act as though it is. To address the underlying problem, we could modify how both the compiler and linker behave w.r.t. emitting new references to symbols and deciding when it is safe to remove them. To begin with, LLVM already has a mechanism for marking functions that must be preserved for libcalls. This should be leveraged as the only source of truth for emitting new function calls to code outside the module, and should probably be extended to include functions that could be introduced by an optimization, such as replacing a call to memcmp with a call to bcmp. This is an important detail if we want to avoid incorrectly removing function definitions.

We can use this conservative list of APIs to preserve these symbols, thereby ensuring they won’t be removed if a later libcall references them. This would also require linker support, since bringing in a TU due to a libcall may transitively cause more symbols to be referenced. The linker would need to supply a list of all symbols that might become referenced as a consequence of a libcall being emitted. There is also potential to scan each function and annotate it with a conservative list of any callees that could be introduced.
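The shape of such a "single source of truth" can be sketched as a conservative preservation check consulted before internalizing or DCE'ing a definition. Everything here is hypothetical: the list contents, the function name, and the flat-array representation are illustrative stand-ins for whatever LLVM's real libcall tables would provide.

```c
#include <string.h>

/* Hypothetical sketch: symbols the optimizer or backend might still introduce
 * calls to after this point (libcalls plus optimization-introduced functions
 * like bcmp). The entries are illustrative, not LLVM's actual table. */
static const char *kMayBeIntroducedLate[] = {
    "memcpy", "memset", "memcmp", "bcmp", "_Unwind_Resume",
};

/* LTO would refuse to internalize/DCE a definition whose name matches. */
int must_preserve(const char *sym) {
    unsigned n = sizeof(kMayBeIntroducedLate) / sizeof(kMayBeIntroducedLate[0]);
    for (unsigned i = 0; i < n; ++i)
        if (strcmp(sym, kMayBeIntroducedLate[i]) == 0)
            return 1;
    return 0;
}
```

The key design point is that the list is consulted at internalization/DCE time, not after the fact, so a definition like bcmp survives long enough for InstCombine's late rewrite to resolve against it.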

Linking with LTO can continue without modification until we reach the phase that could bring in new bitcode, either via new libcalls or via lazy extraction of archive members. At that point, any new bitcode object brought into the link would be compiled as a relocatable link (ld -r), which generates a new .o for the entire TU being extracted from the archive. After the new objects are generated, linking can continue normally, as all potential code generation has completed. This approach may lose some performance through missed inlining, but it should be safe and fairly straightforward to implement in LLD.

While we’ve thought a great deal about how this works for ELF, we are not sure if this should/would apply to other object file formats, or needs to. Our intuition is that these are still problems regardless of the target, but we are unsure if that is an accurate assessment, given that there are many subtle details to ELF linking that would not hold true for COFF or Mach-O.

Module level assembly

Problem 4 requires a different approach, one that allows the module-level flags and features to be used when parsing the inline assembly. In RISC-V, we’ve run into similar problems with LTO builds not getting the correct target features. Today we have an ad-hoc solution that plumbs the arch string to the TargetMachine so that it can be instantiated with the same flags as the surrounding module. We can either follow suit for other architectures, or we can design a new mechanism. @jrtc27 has mentioned that they have ideas in this area, and I don’t have a good sense for what a good long-term solution would look like, so feedback here is most welcome.

We should note that https://github.com/llvm/llvm-project/pull/100833 may help, but it’s unclear whether module-level ASM can be addressed this way; others have suggested it could be solved by moving the asm parsing logic into the frontends: https://discourse.llvm.org/t/rfc-target-cpu-and-features-for-module-level-inline-assembly/74713/2.

Linker scripts

Addressing the lack of LTO support for file name globbing in linker scripts requires more extensive changes to the compiler and/or linker, and we consider those details out of scope for this discussion. This is an issue our team has considered for quite some time in order to better support our embedded users, and it is sufficiently complex to require its own RFC and dedicated discussion, especially given that there are at least two previous RFCs addressing this specific problem ([RFC] (Thin)LTO with Linker Scripts and LTO with Linker Scripts).

Related GitHub issues/errata

Incorrect branch targets in RISC-V executables built with LTO · Issue #65090 · llvm/llvm-project · GitHub

LTO plugin uses wrong ABI for LTO objects on riscv · Issue #50591 · llvm/llvm-project · GitHub


cc: @petrhosek @teresajohnson @MaskRay @nikic @davidxl @aeubanks

cc: @mysterymath

As far as I’m aware, the current status quo is to extract everything and ‘preserve’ it so long as its libcall name is not set to null. Ideally we’d just be able to prevent these kinds of optimizations from happening in the first place. I took a stab at this where linking something with nobuiltin on it would propagate to the entire TU, but that caused some weird runtime bugs I didn’t have time to reduce, so I just hacked around it.

For my use-case, it’s imperative that we don’t keep symbols around longer than they need to. Right now LTO just prevents internalization on these symbols, mostly because of the assumption that the backend could emit direct calls to them at any time.

LTO of libc breaks fundamental assumptions we make in the compiler: for example, we assume we can’t see the implementation of malloc. If you’re doing this, you need to be very careful to ensure the IR is marked up correctly for all phases of compilation. It would be very easy to end up with something that appears to work for small examples, but is actually subtly broken.

I’m not sure how multiple LTO phases interacts with internalization; if you don’t know which objects are going into the link, it seems hard to prove the correct symbol visibility?

What’s the proper way to work around this? This is liable to happen depending on how you link the LLVM libc. Currently we hack around this by passing -fno-builtin-<func>. I’m guessing that’s sufficient?

Currently we prevent internalization for anything the backend states it might use.

What’s the proper way to work around this? This is liable to happen depending on how you link the LLVM libc. Currently we hack around this by passing -fno-builtin-. I’m guessing that’s sufficient?

-fno-builtin is sufficient… not sure if you can get away with suppressing specific functions. I guess it depends what assumptions you make about the way libc itself is structured/optimized.

-fno-builtin is, basically, the currently accepted solution. It might be possible to do something more fine-grained, but I’m not exactly sure what that would look like. Maybe we can define some subset of libc functions that don’t interact badly with interprocedural optimizations.


Currently we prevent internalization for anything the backend states it might use.

There might be some weird interactions here: if a libc object exposes both a backend symbol, and some other symbol, we need to make sure we don’t internalize the other symbol early.

-fno-builtin isn’t really desirable for the vast majority of cases because it prevents all inlining + internalization.

The specific interaction, from what I understand, is that if you have two functions with different “-fno-builtin” markings, we can’t inline? I think we discussed making LTO do something to force the markings to be consistent.


But anyway, the reason I’m bringing it up in the context of this RFC is to show this isn’t just an issue with the symbols themselves… we also need to consider the associated semantics.

So, you’re suggesting disabling the libcall mechanism entirely? I’m not sure if that is feasible, given that they’re often required.

Do we document these assumptions anywhere? For malloc specifically, is the main issue its interactions w/ MemoryBuiltins? or do you have other concerns?

Agreed. This is certainly less straightforward than I thought it would be when we started this work, and trying to understand the semantics (of LTO, of ELF linking, etc.) has been a big part of that.

I think most of the malloc-related stuff goes through MemoryBuiltins? I think we also do similar optimizations on calls where the return value is “noalias”.

Note that having any malloc optimizations enabled infects the rest of libc because you don’t know which libc-internal functions touch the allocator.

We don’t really have any central documentation that would determine which functions are, and are not, special. Everything should go through TargetLibraryInfo, I think, unless we’re making some very subtle implicit assumption. Skimming the list, I would be suspicious of anything related to allocation, anything related to unwinding, and maybe floating-point status/exception registers on soft-float targets. But the one with the widest impact is definitely allocation optimizations.

Maybe there’s some way to mark up a libc implementation with builtins to make it LTO-friendly, but I haven’t really thought about it.

Yeah, MemoryBuiltins was the main thing I could think of or find, plus any knock-ons for aliasing or things that check AllocType.

IIRC glibc has some kind of annotations for a bunch of libc things so that GCC can handle them better. Perhaps there’s a good set of attributes we could use (or invent) to communicate that to compilers.