I've included below an RFC for implementing ThinLTO in LLVM. I'm looking
forward to feedback and questions.
Thanks!
Teresa
RFC to discuss plans for implementing ThinLTO upstream. Background can
be found in slides from EuroLLVM 2015:
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
As described in the talk, we have a prototype implementation, and
would like to start staging patches upstream. This RFC describes a
breakdown of the major pieces. We would like to commit upstream
gradually in several stages, with all functionality off by default.
The core ThinLTO importing support and tuning will require frequent
change and iteration during testing and tuning, and for that part we
would like to commit rapidly (off by default). See the proposed staged
implementation described in the Implementation Plan section.
ThinLTO Overview
See the talk slides linked above for more details. The following is a
high-level overview of the motivation.
Cross Module Optimization (CMO) is an effective means for improving
runtime performance, by extending the scope of optimizations across
source module boundaries. Without CMO, the compiler is limited to
optimizing within the scope of single source modules. Two solutions
for enabling CMO are Link-Time Optimization (LTO), which is currently
supported in LLVM and GCC, and Lightweight-Interprocedural
Optimization (LIPO). However, each of these solutions has limitations
that prevent it from being enabled by default. ThinLTO is a new
approach that attempts to address these limitations, with a goal of
being enabled more broadly. ThinLTO is designed with many of the same
principles as LIPO, and therefore its advantages, without any of its
inherent weaknesses. Unlike in LIPO where the module group decision is
made at profile training runtime, ThinLTO makes the decision at
compile time, but in a lazy mode that facilitates large scale
parallelism. The serial linker plugin phase is designed to be razor
thin and blazingly fast. By default this step only does minimal
preparation work to enable the parallel lazy importing performed
later. ThinLTO aims to be scalable like a regular O2 build, enabling
CMO on machines without large memory configurations, while also
integrating well with distributed build systems. Results from early
prototyping on SPEC CPU2006 C++ benchmarks are in line with
expectations that ThinLTO can scale like O2 while enabling much of the
CMO performed during a full LTO build.
A ThinLTO build is divided into 3 phases, which are referred to in the
following implementation plan:
phase-1: IR and Function Summary Generation (-c compile)
phase-2: Thin Linker Plugin Layer (thin archive linker step)
phase-3: Parallel Backend with Demand-Driven Importing
Implementation Plan
This section gives a high-level breakdown of the ThinLTO support that
will be added, in roughly the order that the patches would be staged.
The patches are divided into three stages. The first stage contains a
minimal amount of preparation work that is not ThinLTO-specific. The
second stage contains most of the infrastructure for ThinLTO, which
will be off by default. The third stage includes
enhancements/improvements/tunings that can be performed after the main
ThinLTO infrastructure is in.
The second and third implementation stages will initially be very
volatile, requiring a lot of iterations and tuning with large apps to
get stabilized. Therefore it will be important to do fast commits for
these implementation stages.
1. Stage 1: Preparation
-------------------------------
The first planned sets of patches are enablers for ThinLTO work:
a. LTO directory structure:
Restructure the LTO directory to remove a circular dependence when the
ThinLTO pass is added. Because ThinLTO is being implemented as an SCC pass
within Transforms/IPO, and leverages the LTOModule class for linking
in functions from modules, IPO then requires the LTO library. This
creates a circular dependence between LTO and IPO. To break that, we
need to split the lib/LTO directory/library into lib/LTO/CodeGen and
lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
respectively. Only LTOCodeGenerator has a dependence on IPO, removing
the circular dependence.
I wonder whether LTOModule is a good fit (it might be; I'm not sure).
We still use it in libLTO, but gold-plugin.cpp no longer uses it,
instead using lib/Object and lib/Linker directly.
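As an illustration of the dependence argument in (a), here is a small standalone C++ sketch (standard library only, not LLVM code) that models the library graph before and after the proposed split and checks each for a cycle. The edges are taken from the description above. The node names LTOCodeGen and LTOModule stand in for lib/LTO/CodeGen and lib/LTO/Module, and all helper names are made up for this example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Library name -> libraries it depends on.
using DepGraph = std::map<std::string, std::vector<std::string>>;

// Depth-first search; returns true if a cycle is reachable from Node.
static bool hasCycleFrom(const DepGraph &G, const std::string &Node,
                         std::set<std::string> &OnStack,
                         std::set<std::string> &Done) {
  if (OnStack.count(Node))
    return true; // back edge to a node still being explored: cycle
  if (Done.count(Node))
    return false; // already fully explored, no cycle through it
  OnStack.insert(Node);
  auto It = G.find(Node);
  if (It != G.end())
    for (const std::string &Dep : It->second)
      if (hasCycleFrom(G, Dep, OnStack, Done))
        return true;
  OnStack.erase(Node);
  Done.insert(Node);
  return false;
}

bool hasCycle(const DepGraph &G) {
  std::set<std::string> OnStack, Done;
  for (const auto &Entry : G)
    if (hasCycleFrom(G, Entry.first, OnStack, Done))
      return true;
  return false;
}

// Single LTO library: LTOCodeGenerator needs IPO, and a ThinLTO pass
// living in IPO would need LTOModule, which is also in LTO.
DepGraph beforeSplit() {
  DepGraph G;
  G["LTO"] = {"IPO"};
  G["IPO"] = {"LTO"};
  return G;
}

// After the split only the CodeGen half depends on IPO.
DepGraph afterSplit() {
  DepGraph G;
  G["LTOCodeGen"] = {"IPO"};
  G["IPO"] = {"LTOModule"};
  G["LTOModule"]; // no outgoing edges in this model
  return G;
}
```

Under this model hasCycle(beforeSplit()) reports a cycle and hasCycle(afterSplit()) does not, which is the whole point of the restructuring.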
b. ELF wrapper generation support:
(From elsewhere in the thread, it looks like you're just using ELF
as a short-hand for "native".)
Implement ELF wrapped bitcode writer. In order to more easily interact
with tools such as $AR, $NM, and "$LD -r" we plan to emit the phase-1
bitcode wrapped in ELF via the .llvmbc section, along with a symbol
table. The goal is both to interact with these tools without requiring
a plugin, and also to avoid doing partial LTO/ThinLTO across files
linked with "$LD -r" (i.e. the resulting object file should still
contain ELF-wrapped bitcode to enable ThinLTO at the full link step).
Shouldn't `ld -r` change symbol visibility and such? How do you plan
to handle that when you concatenate sections?
For reference, ld64 (through libLTO) merges all the bitcode together
with lib/Linker, gives all "hidden" symbols local linkage (by running
-internalize with OnlyHidden=1), and writes out a new bitcode file.
I will send a separate design document for these changes, but the
following is a high-level overview.
Support was added to LLVM for reading ELF-wrapped bitcode
(rG10039c02ea1d), but there is not yet
support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
add support for optionally generating bitcode in an ELF file
containing a single .llvmbc section holding the bitcode. Specifically,
the patch would add new options "emit-llvm-bc-elf" (object file) and
corresponding "emit-llvm-elf" (textual assembly code equivalent).
If we decide to go this way -- wrapping the bitcode in the native
object format -- wouldn't emit-llvm-native or emit-llvm-object be
better? The native object format is implied by the triple.
Eventually these would be automatically triggered under "-fthinlto -c"
and "-fthinlto -S", respectively.
Additionally, a symbol table will be generated in the ELF file,
holding the function symbols within the bitcode. This facilitates
handling archives of the ELF-wrapped bitcode created with $AR, since
the archive will have a symbol table as well. The archive symbol table
enables gold to extract and pass to the plugin the constituent
ELF-wrapped bitcode files. To support the concatenated llvmbc section
generated by "$LD -r", some handling needs to be added to gold and to
the backend driver to process each original module's bitcode.
The function index/summary will later be added as a special ELF
section alongside the .llvmbc sections.
2. Stage 2: ThinLTO Infrastructure
----------------------------------------------
The next set of patches adds the base implementation of the ThinLTO
infrastructure, specifically those required to make ThinLTO functional
and generate correct but not necessarily high-performing binaries. It
also does not include the work needed to make debug info under -g
efficient with ThinLTO.
I think we should at least have a vague plan...
a. Clang/LLVM/gold linker options:
An early set of clang/llvm patches is needed to provide options to
enable ThinLTO (off by default), so that the rest of the
implementation can be disabled by default as it is added.
Specifically, the clang option -fthinlto (used instead of -flto) will
cause clang to invoke the phase-1 emission of LLVM bitcode and
function summary/index on a compile step, and pass the appropriate
option to the gold plugin on a link step. The -thinlto option will be
added to the gold plugin and llvm-lto tool to launch the phase-2 thin
archive step. The -thinlto option will also be added to the "opt" tool
to invoke it as a phase-3 parallel backend instance.
I'm not sure I follow the `opt` part of this. That's a developer
tool, not something we ship. It also doesn't have a backend (doesn't
do CodeGen). What am I missing?
b. Thin-archive linking support in Gold plugin and llvm-lto:
Under the new plugin option (see above), the plugin needs to perform
the phase-2 (thin archive) link which simply emits a combined function
map from the linked modules, without actually performing the normal
link. Corresponding support should be added to the standalone llvm-lto
tool to enable testing/debugging without involving the linker and
plugin.
c. ThinLTO backend support:
Support for a phase-3 backend invocation (including importing) on a
module should be added to the "opt" tool under the new option. The main
changes under the option are to instantiate a Linker object that manages
linking imported functions into the module, to read the combined
function map efficiently, and to enable the ThinLTO import pass.
d. Function index/summary support:
This includes infrastructure for writing and reading the function
index/summary section. As noted earlier this will be encoded in a
special ELF section within the module, alongside the .llvmbc section
containing the bitcode. The thin archive generated by phase-2 of
ThinLTO simply contains all of the function index/summary sections
across the linked modules, organized for efficient function lookup.
Each function available for importing from the module contains an
entry in the module's function index/summary section and in the
resulting combined function map. Each function entry contains that
function's offset within the bitcode file, used to efficiently locate
and quickly import just that function.
I don't think you'll actually buy anything here over the lazy-loading
feature in the BitcodeReader (although perhaps you can help improve
it if you have some ideas). In practice, to correctly load a
Function you need to load constants (including declarations for other
GlobalValues) and metadata that it references.
The entry also contains summary
information (e.g. basic information determined during parsing such as
the number of instructions in the function), that will be used to help
guide later import decisions. Because the contents of this section
will change frequently during ThinLTO tuning, it should also be marked
with a version id for backwards compatibility or version checking.
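To make the shape of the data in (d) concrete, here is a rough standalone C++ sketch of a per-module index and the combined function map built from it. It is not the prototype's format: the struct and field names, the version constant, and the "first definition wins" policy for duplicate names are all assumptions made for illustration. Only the bitcode offset, the instruction count, and the presence of a version id come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Assumed version tag; the RFC only says the section is marked with one.
constexpr std::uint32_t kSummaryVersion = 1;

// One entry per importable function in a module's index/summary section.
struct FunctionSummary {
  std::string Name;
  std::uint64_t BitcodeOffset; // offset of the function in the bitcode file
  std::uint32_t NumInsts;      // example summary datum: instruction count
};

struct ModuleIndex {
  std::uint32_t Version;
  std::string ModulePath;
  std::vector<FunctionSummary> Functions;
};

// What the combined map records for each function name.
struct CombinedEntry {
  std::string ModulePath;
  std::uint64_t BitcodeOffset;
  std::uint32_t NumInsts;
};

using CombinedFunctionMap = std::map<std::string, CombinedEntry>;

// Phase-2 analogue: merge the per-module indexes into one lookup table.
// Indexes with an unexpected version are skipped and counted in the
// return value. If a name occurs in several modules the first one seen
// is kept, which is an arbitrary choice made for this sketch.
std::size_t buildCombinedMap(const std::vector<ModuleIndex> &Indexes,
                             CombinedFunctionMap &Out) {
  std::size_t Skipped = 0;
  for (const ModuleIndex &MI : Indexes) {
    if (MI.Version != kSummaryVersion) {
      ++Skipped;
      continue;
    }
    for (const FunctionSummary &FS : MI.Functions)
      Out.emplace(FS.Name, CombinedEntry{MI.ModulePath, FS.BitcodeOffset,
                                         FS.NumInsts});
  }
  return Skipped;
}
```

A phase-3 backend would then look a callee up by name in the map to learn which module to open and at what offset to start reading.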
e. ThinLTO importing support:
Support for the mechanics of importing functions from other modules,
which can go in gradually as a set of patches since it will be off by
default. Separate patches can include:
- BitcodeReader changes to use the function index to import/deserialize
a single function of interest (small changes, leverages existing lazy
streamer support).
Ah, here it is. Should have read ahead.
How do you plan to handle references to other GlobalValues (global
variables, functions, and aliases)? If you're going to keep loading
the symbol table (which I think you need to?), then the lazy loader
already creates a function index. Or do you have some other plan?
If an imported function references functions with internal linkage,
will you pull in copies of those functions as well?
If an imported function references global variables with internal
linkage... actually, that doesn't seem legal. Will you disallow
importing such functions? How will you mark them?
- Minor LTOModule changes to pass the ThinLTO function to import and
its index into the bitcode reader.
- Marking of imported functions (for use in ThinLTO-specific symbol
linking and global DCE, for example).
Marking how? Do you mean giving them internal linkage, or something
else?
What's your plan for ThinLTO-specific symbol linking?
This can be in-memory initially,
but IR support may be required in order to support streaming bitcode
out and back in again after importing.
- ModuleLinker changes to do ThinLTO-specific symbol linking and
static promotion when necessary. The linkage type of imported
functions changes to AvailableExternallyLinkage, for example. Statics
must be promoted in certain cases, and renamed in consistent ways.
Ah, could have read ahead again; this answers my questions about
referencing global variables with local linkage.
It also sounds pretty hairy. Details welcome.
- GlobalDCE changes to support removing imported functions that were
not inlined (very small changes to existing pass logic).
If you give them "available_externally" linkage, won't this already
happen?
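Since the linkage changes and static promotion in (e) are easy to misread, here is a toy standalone C++ model of them. It does not use LLVM's GlobalValue classes. The renaming scheme (appending a module identifier) is purely an assumption, since the RFC only says statics are "renamed in consistent ways", and the Imported flag is just one possible in-memory form of the marking.

```cpp
#include <cassert>
#include <string>

enum class Linkage { External, Internal, AvailableExternally };

struct Symbol {
  std::string Name;
  Linkage Link;
  bool Imported; // in-memory "imported" marking (assumed representation)
};

// Hypothetical renaming scheme: append an identifier of the defining
// module, so the exporting and importing sides compute the same name.
std::string promotedName(const std::string &Name,
                         const std::string &DefiningModuleId) {
  return Name + ".thinlto." + DefiningModuleId;
}

// A function copied in from another module is marked as imported and
// given available_externally linkage.
Symbol importFunction(const Symbol &Src) {
  return Symbol{Src.Name, Linkage::AvailableExternally, true};
}

// A static (internal) symbol referenced by an imported function is
// renamed and made external; any other symbol is returned unchanged.
Symbol promoteStatic(const Symbol &S, const std::string &DefiningModuleId) {
  if (S.Link != Linkage::Internal)
    return S;
  return Symbol{promotedName(S.Name, DefiningModuleId), Linkage::External,
                S.Imported};
}
```

The property that matters is that promoteStatic yields the same name whether it runs while compiling the defining module or a module that imports from it, so the reference in the importing module resolves to the single promoted definition.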
f. ThinLTO Import Driver SCC pass:
Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
an SCC pass, enabled only under -fthinlto options. The pass includes
utilizing the thin archive (global function index/summary), import
decision heuristics, invocation of LTOModule/ModuleLinker routines
that perform the import, and any necessary callgraph updates and
verification.
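For a sense of what an import decision heuristic in (f) might look like, here is a deliberately simple standalone C++ sketch. It filters a module's external callees by consulting the combined map and an instruction-count limit. The MaxInsts threshold and all names here are hypothetical. The real heuristics are expected to change a lot during tuning.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Summary data the pass would read from the combined function map.
struct SummaryInfo {
  std::string ModulePath;
  unsigned NumInsts;
};

using CombinedFunctionMap = std::map<std::string, SummaryInfo>;

// Returns the callees worth importing: those that have an entry in the
// combined map and are no larger than MaxInsts (a made-up tuning knob).
// Callees without an entry, such as library functions, are left alone.
std::vector<std::string>
selectImports(const std::vector<std::string> &ExternalCallees,
              const CombinedFunctionMap &Map, unsigned MaxInsts) {
  std::vector<std::string> ToImport;
  for (const std::string &Callee : ExternalCallees) {
    auto It = Map.find(Callee);
    if (It == Map.end())
      continue; // not available for importing
    if (It->second.NumInsts <= MaxInsts)
      ToImport.push_back(Callee);
  }
  return ToImport;
}
```

With a map holding a 5-instruction "small" and a 500-instruction "big", a limit of 100 selects only "small" and ignores a callee such as printf that has no entry.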
g. Backend Driver:
For a single node build, the gold plugin can simply write a makefile
and fork the parallel backend instances directly via parallel make.
This doesn't seem like the way we'd want to test this, and it
seems strange for the toolchain to require a build system...
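As a sketch of the single-node approach in (g), the following standalone C++ function produces makefile text with one independent rule per module, so that a parallel make can run the backends concurrently. The output naming (".thinlto.o" suffix) and the BACKEND make variable are placeholders invented for this example. The actual backend command line is not specified here.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical naming: the backend output for "a.o" is "a.o.thinlto.o".
static std::string outputFor(const std::string &Module) {
  return Module + ".thinlto.o";
}

// Emits an "all" target depending on every backend output, followed by
// one rule per module. The rules do not depend on each other, so make
// is free to run them in parallel.
std::string writeBackendMakefile(const std::vector<std::string> &Modules) {
  std::ostringstream OS;
  OS << "all:";
  for (const std::string &M : Modules)
    OS << ' ' << outputFor(M);
  OS << "\n";
  for (const std::string &M : Modules) {
    OS << outputFor(M) << ": " << M << "\n";
    // $< is the input module and $@ the output file (standard make
    // automatic variables); BACKEND is supplied by whoever invokes make.
    OS << "\t$(BACKEND) $< -o $@\n";
  }
  return OS.str();
}
```

The plugin would write this text to a file and then run make with -j and BACKEND set to the phase-3 backend command.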
3. Stage 3: ThinLTO Tuning and Enhancements
----------------------------------------------------------------
This refers to the patches that are not required for ThinLTO to work,
but rather to improve compile time, memory, run-time performance and
usability.
a. Lazy Debug Metadata Linking:
The prototype implementation included lazy importing of module-level
metadata during the ThinLTO pass finalization (i.e. after all function
importing is complete). This actually applies to all module-level
metadata, not just debug, although it is the largest. This can be
added as a separate set of patches, with changes to BitcodeReader,
ValueMapper, and ModuleLinker.
It sounds like this would work well with the "full" LTO implemented
by tools/gold-plugin right now. What exactly did you do to improve
this?