RFC: ThinLTO Implementation Plan

echristo · May 14, 2015, 9:12pm

The binutils part

I took it as the more general: “we want to simply work with native
toolchains”, not as something specific to binutils.

Been since clarified by Teresa

Oh, I understood. I just don’t know that I agree.

Fair enough. I just wanted to make sure there wasn’t a misunderstanding here

To do anything with the
tools will require some knowledge of bitcode anyhow or need the plugin.

This is certainly true, but that’s part of the point - the ability to
pass through native tools without them breaking, or worrying about
the bitcode there.

Except I’m saying that those tools are mostly going to be useless

Anyhow, see other reply to Teresa I think for how I’m laying this out.

(I actually have no real dog in this fight, just trying to make sure
everyone is on the same page ;P)

FWIW I honestly don’t either, just trying to figure out what’s the best set of implementation choices for the project.

-eric

Xinliang_David_Li · May 14, 2015, 9:28pm

>>>> > I'm not sure this is a particularly great assumption to make.
>>>>
>>>> Which part?
>>>
>>> The binutils part
>>>
>>>> > We have to support a lot of different build systems and tools and
>>>> > concentrating on something that just binutils uses isn't
>>>> > particularly friendly here.
>>>>
>>>> I think you may have misunderstood.
>>>> His point was exactly that they want to be transparent to *all of*
>>>> these tools.
>>>> You are saying "we should be friendly to everyone". He is saying the
>>>> same thing.
>>>> We should be friendly to everyone. The friendly way to do this is to
>>>> not require all of these tools build plugins to handle bitcode.
>>>>
>>>> Hence, elf-wrapped bitcode.
>>>
>>> Oh, I understood. I just don't know that I agree. To do anything with
>>> the tools will require some knowledge of bitcode anyhow or need the
>>> plugin. I'm saying that as a baseline start we should look at how to
>>> do this using the tools we've got rather than wrapping things for no
>>> real gain.
>>
>> That doesn't seem strictly true - the ar situation (which I'm led to
>> believe is in use in our build system & others, one would assume).
>> With the symbol table included as proposed, ar can be used without
>> any knowledge of the bitcode or need for a plugin.
>
> For some bits, sure. Optimizing for ar seems a bit silly, why not
> 'ld -r'?

But as mentioned, ld -r can work on native object wrapped bitcode
without a plugin as well.

How? It's not like any partial linking is going to go on inside the
bitcode if the linker doesn't understand bitcode.

Why do we want the plugin to do anything here? We just need the linker to
concatenate the bitcode sections and produce a combined bitcode file.

> Agreed. The ar situation is interesting because one thing we discussed
> after you wandered off was just adding a ToC section to bitcode as it is
> and then having the tools handle that. Would seem to accomplish at least
> the goals as I've seen them up to this point without worrying too much.

The ToC section is a way we can encode the function index/summary into
bitcode, but won't help integrate with existing tools. The main issue
we are trying to solve is integrating transparently with existing
binutils tools in use in our build system and probably elsewhere.

Right. I'm not entirely sure what use we're going to see in the existing
tools that we want to encompass here. There's some of it for convenience
(i.e. nm etc for developers), but they can use a tool that understands
bitcode and we can make the existing llvm tools suffice for these needs.

I think the way of looking at this is that we can:

a) go with wrapping things in native object formats, this means
- some tools continue to work at the cost of additional I/O and space at
compile/link time

Are you sure about the additional I/O? With a native symtab, existing tools
just need to read that, while the plugin-based approach needs to read the
bitcode section to feed symbols back to the tool.

- we still have to update some tools to work at all

If any, it will be minimal.

b) we extend those tools/our own tools and have them be drop in
replacements to the existing tools. They'll understand the bitcode format
natively, they'll be smaller, and we'll be able to push the state of the
art in tooling/analysis a bit more in the future without having to rework
thin lto.

It's basically a set of trade-offs and for llvm we've historically gone
the b direction.

I am fine with making the llvm tools work with it, but we should not require
or force users to use them. I think this is an orthogonal feature.

David

teresajohnson · May 14, 2015, 9:31pm

But as mentioned, ld -r can work on native object wrapped bitcode
without a plugin as well.

How? It's not like any partial linking is going to go on inside the bitcode
if the linker doesn't understand bitcode.

It allows us to delay the actual linking until the full link step,
thereby enabling ThinLTO on those modules.

As we discussed offline, the current ld -r behavior with the plugin is
to compile all the way down to machine code. The alternative if we use
straight bitcode is to tell the plugin to stop early after combining
the bitcode and emit bitcode back out, with the thinlto function info
also combined.

> Agreed. The ar situation is interesting because one thing we discussed
> after
> you wandered off was just adding a ToC section to bitcode as it is and
> then
> having the tools handle that. Would seem to accomplish at least the
> goals as
> I've seen them up to this point without worrying too much.

The ToC section is a way we can encode the function index/summary into
bitcode, but won't help integrate with existing tools. The main issue
we are trying to solve is integrating transparently with existing
binutils tools in use in our build system and probably elsewhere.

Right. I'm not entirely sure what use we're going to see in the existing
tools that we want to encompass here. There's some of it for convenience
(i.e. nm etc for developers), but they can use a tool that understands
bitcode and we can make the existing llvm tools suffice for these needs.

My understanding from our discussion is that the llvm versions of
those tools do not accept native object files, so that is not
something that will work in the short term.

The best alternative to native wrapped bitcode seems to be relying on
the plugin (and changing its behavior for ld -r). Which means that we
lose the ability to use some of the tools from the native toolchain out
of the box, and build systems such as ours have to be taught to use the
plugin.

echristo · May 14, 2015, 9:46pm

But as mentioned, ld -r can work on native object wrapped bitcode
without a plugin as well.

How? It’s not like any partial linking is going to go on inside the bitcode
if the linker doesn’t understand bitcode.

It allows us to delay the actual linking until the full link step,
thereby enabling ThinLTO on those modules.

As we discussed offline, the current ld -r behavior with the plugin is
to compile all the way down to machine code. The alternative if we use
straight bitcode is to tell the plugin to stop early after combining
the bitcode and emit bitcode back out, with the thinlto function info
also combined.

I think this is what should happen anyhow. ld -r that doesn’t do a partial link is misleading.

Right. I’m not entirely sure what use we’re going to see in the existing
tools that we want to encompass here. There’s some of it for convenience
(i.e. nm etc for developers), but they can use a tool that understands
bitcode and we can make the existing llvm tools suffice for these needs.

My understanding from our discussion is that the llvm versions of
those tools do not accept native object files, so that is not
something that will work in the short term.

We have tools that do understand native object files and it’s pretty easy to use the libraries that they’re built upon.

The best alternative to native wrapped bitcode seems to be relying on
the plugin (and changing its behavior for ld -r). Which means that we
lose the ability to use some of the tools from the native toolchain out
of the box, and build systems such as ours have to be taught to use the
plugin.

I don’t think that natively wrapped bitcode gets you as much as you think it does anyhow, unless you’re duplicating a lot of information (ar, as discussed earlier, aside). I’m not too worried about the build system as far as a wrapping mechanism and I think more traditional LTO schemes with LLVM have just used bitcode/IR output as an input to the LTO link step. I think what we’re talking about here is the best way to encode the data that thin lto needs/wants in order to handle summary information etc right?

-eric

dexonsmith · May 14, 2015, 10:19pm

FWIW, in "full" LTO on Darwin, ld64 emits bitcode from `ld -r` as long
as all inputs are bitcode.

Xinliang_David_Li · May 14, 2015, 10:23pm

>>
>> >>
>> >> But as mentioned, ld -r can work on native object wrapped bitcode
>> >> without a plugin as well.
>> >>
>> >
>> > How? It's not like any partial linking is going to go on inside the
>> > bitcode if the linker doesn't understand bitcode.
>>
>> It allows us to delay the actual linking until the full link step,
>> thereby enabling ThinLTO on those modules.
>>
>> As we discussed offline, the current ld -r behavior with the plugin is
>> to compile all the way down to machine code. The alternative if we use
>> straight bitcode is to tell the plugin to stop early after combining
>> the bitcode and emit bitcode back out, with the thinlto function info
>> also combined.
>
>
> I think this is what should happen anyhow. ld -r that doesn't do a
> partial link is misleading.

FWIW, in "full" LTO on Darwin, ld64 emits bitcode from `ld -r` as long
as all inputs are bitcode.

That is what we need in ThinLTO too.

David

dexonsmith · May 14, 2015, 11:29pm

I've included below an RFC for implementing ThinLTO in LLVM, looking
forward to feedback and questions.
Thanks!
Teresa

RFC to discuss plans for implementing ThinLTO upstream. Background can
be found in slides from EuroLLVM 2015:
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
As described in the talk, we have a prototype implementation, and
would like to start staging patches upstream. This RFC describes a
breakdown of the major pieces. We would like to commit upstream
gradually in several stages, with all functionality off by default.
The core ThinLTO importing support and tuning will require frequent
change and iteration during testing and tuning, and for that part we
would like to commit rapidly (off by default). See the proposed staged
implementation described in the Implementation Plan section.

ThinLTO Overview

See the talk slides linked above for more details. The following is a
high-level overview of the motivation.

Cross Module Optimization (CMO) is an effective means for improving
runtime performance, by extending the scope of optimizations across
source module boundaries. Without CMO, the compiler is limited to
optimizing within the scope of single source modules. Two solutions
for enabling CMO are Link-Time Optimization (LTO), which is currently
supported in LLVM and GCC, and Lightweight-Interprocedural
Optimization (LIPO). However, each of these solutions has limitations
that prevent it from being enabled by default. ThinLTO is a new
approach that attempts to address these limitations, with a goal of
being enabled more broadly. ThinLTO is designed with many of the same
principles as LIPO, and therefore its advantages, without any of its
inherent weaknesses. Unlike in LIPO where the module group decision is
made at profile training runtime, ThinLTO makes the decision at
compile time, but in a lazy mode that facilitates large scale
parallelism. The serial linker plugin phase is designed to be razor
thin and blazingly fast. By default this step only does minimal
preparation work to enable the parallel lazy importing performed
later. ThinLTO aims to be scalable like a regular O2 build, enabling
CMO on machines without large memory configurations, while also
integrating well with distributed build systems. Results from early
prototyping on SPEC cpu2006 C++ benchmarks are in line with
expectations that ThinLTO can scale like O2 while enabling much of the
CMO performed during a full LTO build.

A ThinLTO build is divided into 3 phases, which are referred to in the
following implementation plan:

phase-1: IR and Function Summary Generation (-c compile)
phase-2: Thin Linker Plugin Layer (thin archive linker step)
phase-3: Parallel Backend with Demand-Driven Importing

Implementation Plan

This section gives a high-level breakdown of the ThinLTO support that
will be added, in roughly the order that the patches would be staged.
The patches are divided into three stages. The first stage contains a
minimal amount of preparation work that is not ThinLTO-specific. The
second stage contains most of the infrastructure for ThinLTO, which
will be off by default. The third stage includes
enhancements/improvements/tunings that can be performed after the main
ThinLTO infrastructure is in.

The second and third implementation stages will initially be very
volatile, requiring a lot of iterations and tuning with large apps to
get stabilized. Therefore it will be important to do fast commits for
these implementation stages.

1. Stage 1: Preparation
-------------------------------

The first planned sets of patches are enablers for ThinLTO work:

a. LTO directory structure:

Restructure the LTO directory to remove a circular dependence when the
ThinLTO pass is added. Because ThinLTO is being implemented as an SCC pass
within Transforms/IPO, and leverages the LTOModule class for linking
in functions from modules, IPO then requires the LTO library. This
creates a circular dependence between LTO and IPO. To break that, we
need to split the lib/LTO directory/library into lib/LTO/CodeGen and
lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
respectively. Only LTOCodeGenerator has a dependence on IPO, removing
the circular dependence.

I wonder whether LTOModule is a good fit (it might be; I'm not sure).
We still use it in libLTO, but gold-plugin.cpp no longer uses it,
instead using lib/Object and lib/Linker directly.

b. ELF wrapper generation support:

(From elsewhere in the thread, it looks like you're just using ELF
as a short-hand for "native".)

Implement an ELF-wrapped bitcode writer. In order to more easily interact
with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
bitcode wrapped in ELF via the .llvmbc section, along with a symbol
table. The goal is both to interact with these tools without requiring
a plugin, and also to avoid doing partial LTO/ThinLTO across files
linked with “$LD -r” (i.e. the resulting object file should still
contain ELF-wrapped bitcode to enable ThinLTO at the full link step).

Shouldn't `ld -r` change symbol visibility and such? How do you plan
to handle that when you concatenate sections?

For reference, ld64 (through libLTO) merges all the bitcode together
with lib/Linker, gives all "hidden" symbols local linkage (by running
-internalize with OnlyHidden=1), and writes out a new bitcode file.
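
The merge-then-internalize behaviour described above can be sketched with a toy symbol model. This is only an illustration: `Symbol` and `partial_link` are hypothetical names, no real LLVM or ld64 API is used, and duplicate definitions are simply rejected rather than resolved the way a real linker would.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    visibility: str = "default"  # "default" or "hidden"
    linkage: str = "external"    # "external" or "internal"


def partial_link(modules):
    """Merge the symbols of several modules into one combined module.

    Hidden symbols cannot be referenced from outside the partially
    linked output, so they are given internal (local) linkage.
    """
    merged = {}
    for module in modules:
        for sym in module:
            if sym.name in merged:
                # Simplification: a real linker has rules for weak/linkonce.
                raise ValueError(f"duplicate definition of {sym.name}")
            linkage = "internal" if sym.visibility == "hidden" else sym.linkage
            merged[sym.name] = Symbol(sym.name, sym.visibility, linkage)
    return merged


a = [Symbol("main"), Symbol("helper", visibility="hidden")]
b = [Symbol("util"), Symbol("detail", visibility="hidden")]
combined = partial_link([a, b])
```

A plain concatenation of sections would skip the linkage change in the middle of the loop, which is the concern raised in the question above.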

I will send a separate design document for these changes, but the
following is a high-level overview.

Support was added to LLVM for reading ELF-wrapped bitcode
(rG10039c02ea1d), but there does not yet exist
support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
add support for optionally generating bitcode in an ELF file
containing a single .llvmbc section holding the bitcode. Specifically,
the patch would add new options “emit-llvm-bc-elf” (object file) and
corresponding “emit-llvm-elf” (textual assembly code equivalent).

If we decide to go this way -- wrapping the bitcode in the native
object format -- wouldn't emit-llvm-native or emit-llvm-object be
better? The native object format is implied by the triple.

Eventually these would be automatically triggered under “-fthinlto -c”
and “-fthinlto -S”, respectively.

Additionally, a symbol table will be generated in the ELF file,
holding the function symbols within the bitcode. This facilitates
handling archives of the ELF-wrapped bitcode created with $AR, since
the archive will have a symbol table as well. The archive symbol table
enables gold to extract and pass to the plugin the constituent
ELF-wrapped bitcode files. To support the concatenated llvmbc section
generated by “$LD -r”, some handling needs to be added to gold and to
the backend driver to process each original module’s bitcode.

The function index/summary will later be added as a special ELF
section alongside the .llvmbc sections.
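
To illustrate why a native symbol table is enough for archive handling, here is a rough sketch in which the archive index is built from symbol lists alone and the bitcode payload is never inspected. The member layout and the function names are assumptions made for the example, not the ar or gold implementation.

```python
def build_archive_index(members):
    """Map each defined symbol to the archive member that defines it.

    `members` maps a member name to a tuple of
    (defined_symbols, undefined_symbols, payload). Only the symbol
    lists are read; the payload (the wrapped bitcode) stays opaque.
    """
    index = {}
    for member, (defined, _undefined, _payload) in members.items():
        for sym in defined:
            index.setdefault(sym, member)
    return index


def members_to_extract(index, members, undefined):
    """Return the members a linker would pull in to resolve `undefined`."""
    needed, seen = [], set()
    worklist = list(undefined)
    while worklist:
        sym = worklist.pop()
        member = index.get(sym)
        if member is None or member in seen:
            continue
        seen.add(member)
        needed.append(member)
        # Extracting a member may introduce new undefined symbols.
        worklist.extend(members[member][1])
    return needed


members = {
    "a.o": (["foo"], ["bar"], b"<bitcode>"),
    "b.o": (["bar"], [], b"<bitcode>"),
    "c.o": (["unused"], [], b"<bitcode>"),
}
index = build_archive_index(members)
extracted = members_to_extract(index, members, ["foo"])
```

Here `c.o` is never extracted, and nothing in either function needed to understand bitcode.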

2. Stage 2: ThinLTO Infrastructure
----------------------------------------------

The next set of patches adds the base implementation of the ThinLTO
infrastructure, specifically those required to make ThinLTO functional
and generate correct but not necessarily high-performing binaries. It
also does not include support to make debug support under -g efficient
with ThinLTO.

I think we should at least have a vague plan...

a. Clang/LLVM/gold linker options:

An early set of clang/llvm patches is needed to provide options to
enable ThinLTO (off by default), so that the rest of the
implementation can be disabled by default as it is added.
Specifically, clang options -fthinlto (used instead of -flto) will
cause clang to invoke the phase-1 emission of LLVM bitcode and
function summary/index on a compile step, and pass the appropriate
option to the gold plugin on a link step. The -thinlto option will be
added to the gold plugin and llvm-lto tool to launch the phase-2 thin
archive step. The -thinlto option will also be added to the ‘opt’ tool
to invoke it as a phase-3 parallel backend instance.

I'm not sure I follow the `opt` part of this. That's a developer
tool, not something we ship. It also doesn't have a backend (doesn't
do CodeGen). What am I missing?

b. Thin-archive linking support in Gold plugin and llvm-lto:

Under the new plugin option (see above), the plugin needs to perform
the phase-2 (thin archive) link which simply emits a combined function
map from the linked modules, without actually performing the normal
link. Corresponding support should be added to the standalone llvm-lto
tool to enable testing/debugging without involving the linker and
plugin.
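
As a rough sketch of what "emits a combined function map" could mean, the following merges per-module function indexes into one lookup table keyed by function name. The tuple layout (offset, instruction count) and the first-definition-wins policy are assumptions for illustration only.

```python
def combine_function_maps(per_module):
    """Build a combined map: function name -> (module, offset, inst_count).

    `per_module` maps a module path to {function: (offset, inst_count)}.
    No IR is linked here; only the small index entries are merged.
    """
    combined = {}
    for module_path, functions in per_module.items():
        for name, (offset, inst_count) in functions.items():
            # Assumption: first definition wins. A real implementation
            # would need a policy for duplicate (e.g. linkonce) symbols.
            combined.setdefault(name, (module_path, offset, inst_count))
    return combined


per_module = {
    "a.o": {"foo": (128, 40), "bar": (512, 6)},
    "b.o": {"baz": (64, 12)},
}
combined = combine_function_maps(per_module)
```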

c. ThinLTO backend support:

Support for invoking a phase-3 backend invocation (including
importing) on a module should be added to the ‘opt’ tool under the new
option. The main change under the option is to instantiate a Linker
object used to manage the process of linking imported functions into
the module, efficient read of the combined function map, and enable
the ThinLTO import pass.

d. Function index/summary support:

This includes infrastructure for writing and reading the function
index/summary section. As noted earlier this will be encoded in a
special ELF section within the module, alongside the .llvmbc section
containing the bitcode. The thin archive generated by phase-2 of
ThinLTO simply contains all of the function index/summary sections
across the linked modules, organized for efficient function lookup.

Each function available for importing from the module contains an
entry in the module’s function index/summary section and in the
resulting combined function map. Each function entry contains that
function’s offset within the bitcode file, used to efficiently locate
and quickly import just that function.

I don't think you'll actually buy anything here over the lazy-loading
feature in the BitcodeReader (although perhaps you can help improve
it if you have some ideas). In practice, to correctly load a
Function you need to load constants (including declarations for other
GlobalValues) and metadata that it references.

The entry also contains summary
information (e.g. basic information determined during parsing such as
the number of instructions in the function), that will be used to help
guide later import decisions. Because the contents of this section
will change frequently during ThinLTO tuning, it should also be marked
with a version id for backwards compatibility or version checking.
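
One possible shape for such a versioned index section, sketched as a simple binary encoding. The field widths, the layout, and the names `INDEX_VERSION`, `write_index` and `read_index` are all made up for this example; the RFC does not specify an encoding.

```python
import struct

INDEX_VERSION = 1  # hypothetical version id, bumped when the layout changes


def write_index(entries):
    """Serialize a list of (name, bitcode_offset, instruction_count)."""
    out = [struct.pack("<II", INDEX_VERSION, len(entries))]
    for name, offset, insts in entries:
        encoded = name.encode("utf-8")
        out.append(struct.pack("<H", len(encoded)) + encoded)
        out.append(struct.pack("<QI", offset, insts))
    return b"".join(out)


def read_index(blob):
    """Parse an index written by write_index, checking the version first."""
    version, count = struct.unpack_from("<II", blob, 0)
    if version != INDEX_VERSION:
        raise ValueError(f"unsupported function index version {version}")
    pos, entries = 8, []
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        offset, insts = struct.unpack_from("<QI", blob, pos)
        pos += 12
        entries.append((name, offset, insts))
    return entries


entries = [("foo", 128, 40), ("bar", 512, 6)]
blob = write_index(entries)
```

Checking the version before reading any entry lets a reader reject a section written by an incompatible compiler instead of misinterpreting it.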

e. ThinLTO importing support:

Support for the mechanics of importing functions from other modules,
which can go in gradually as a set of patches since it will be off by
default. Separate patches can include:

- BitcodeReader changes to use function index to import/deserialize
single function of interest (small changes, leverages existing lazy
streamer support).

Ah, here it is. Should have read ahead.

How do you plan to handle references to other GlobalValues (global
variables, functions, and aliases)? If you're going to keep loading
the symbol table (which I think you need to?), then the lazy loader
already creates a function index. Or do you have some other plan?

If an imported function references functions with internal linkage,
will you pull in copies of those functions as well?

If an imported function references global variables with internal
linkage... actually, that doesn't seem legal. Will you disallow
importing such functions? How will you mark them?

- Minor LTOModule changes to pass the ThinLTO function to import and
its index into bitcode reader.

- Marking of imported functions (for use in ThinLTO-specific symbol
linking and global DCE, for example).

Marking how? Do you mean giving them internal linkage, or something
else?

What's your plan for ThinLTO-specific symbol linking?

This can be in-memory initially,
but IR support may be required in order to support streaming bitcode
out and back in again after importing.

- ModuleLinker changes to do ThinLTO-specific symbol linking and
static promotion when necessary. The linkage type of imported
functions changes to AvailableExternallyLinkage, for example. Statics
must be promoted in certain cases, and renamed in consistent ways.
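
A rough sketch of what "renamed in consistent ways" might look like: every backend that imports from the same module has to derive the same new name for a promoted static without coordinating. The `.thinlto.<hash>` suffix scheme and both function names are assumptions for illustration, not something the RFC specifies.

```python
import hashlib


def promoted_name(name, module_id):
    """Derive a deterministic, module-specific name for a promoted static."""
    suffix = hashlib.sha1(module_id.encode("utf-8")).hexdigest()[:8]
    return f"{name}.thinlto.{suffix}"


def promote_statics(module_id, symbols, referenced_by_exported):
    """Promote internal symbols that importable functions reference.

    `symbols` maps name -> linkage. Statics that are not referenced by
    any function available for import keep their name and linkage.
    """
    result = {}
    for name, linkage in symbols.items():
        if linkage == "internal" and name in referenced_by_exported:
            result[promoted_name(name, module_id)] = "external"
        else:
            result[name] = linkage
    return result


symbols = {"counter": "internal", "scratch": "internal", "api": "external"}
promoted = promote_statics("a.c", symbols, {"counter"})
```

Because the suffix depends only on the module identifier, the defining module and every importing module compute the same name independently.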

Ah, could have read ahead again; this answers my questions about
referencing global variables with local linkage.

It also sounds pretty hairy. Details welcome.

- GlobalDCE changes to support removing imported functions that were
not inlined (very small changes to existing pass logic).

If you give them "available_externally" linkage, won't this already
happen?

f. ThinLTO Import Driver SCC pass:

Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
an SCC pass, enabled only under -fthinlto options. The pass includes
utilizing the thin archive (global function index/summary), import
decision heuristics, invocation of LTOModule/ModuleLinker routines
that perform the import, and any necessary callgraph updates and
verification.
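
A minimal sketch of a summary-driven import decision, assuming the only summary available is an instruction count and that small callees are the ones worth importing. The threshold value, the entry layout and `select_imports` are hypothetical; the actual heuristics are expected to change during tuning.

```python
def select_imports(callees, combined_map, current_module, max_insts=30):
    """Pick callees to import into `current_module`.

    `combined_map` maps a function name to a dict with the keys
    "module", "offset" and "insts" (instruction count).
    """
    imports = []
    for callee in callees:
        entry = combined_map.get(callee)
        if entry is None:
            continue  # no bitcode available, e.g. a native library function
        if entry["module"] == current_module:
            continue  # already defined locally
        if entry["insts"] <= max_insts:
            imports.append((callee, entry["module"], entry["offset"]))
    return imports


combined_map = {
    "small": {"module": "b.o", "offset": 64, "insts": 5},
    "big": {"module": "b.o", "offset": 256, "insts": 500},
    "local": {"module": "a.o", "offset": 0, "insts": 3},
}
picked = select_imports(["small", "big", "local", "printf"], combined_map, "a.o")
```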

g. Backend Driver:

For a single node build, the gold plugin can simply write a makefile
and fork the parallel backend instances directly via parallel make.
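
For illustration, a plugin could generate a makefile along these lines, with one independent rule per module so that `make -j` runs the backends in parallel. The `thinlto-backend` command and the `.thinlto.o` output suffix are placeholders invented for this sketch, not real tool names.

```python
def backend_makefile(modules, backend_cmd="thinlto-backend"):
    """Return makefile text with one backend rule per input module."""
    objs = [module + ".thinlto.o" for module in modules]
    lines = ["all: " + " ".join(objs), ""]
    for module, obj in zip(modules, objs):
        lines.append(f"{obj}: {module}")
        # Recipe lines in a makefile must start with a tab character.
        lines.append(f"\t{backend_cmd} {module} -o {obj}")
        lines.append("")
    return "\n".join(lines)


makefile_text = backend_makefile(["a.o", "b.o"])
```

Since no rule depends on another rule's output, the degree of parallelism is limited only by the `-j` value passed to make.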

This doesn't seem like the way we'd want to test this, and it
seems strange for the toolchain to require a build system...

3. Stage 3: ThinLTO Tuning and Enhancements
----------------------------------------------------------------

This refers to the patches that are not required for ThinLTO to work,
but rather to improve compile time, memory, run-time performance and
usability.

a. Lazy Debug Metadata Linking:

The prototype implementation included lazy importing of module-level
metadata during the ThinLTO pass finalization (i.e. after all function
importing is complete). This actually applies to all module-level
metadata, not just debug, although it is the largest. This can be
added as a separate set of patches. Changes to BitcodeReader,
ValueMapper, ModuleLinker

It sounds like this would work well with the "full" LTO implemented
by tools/gold-plugin right now. What exactly did you do to improve
this?

Xinliang_David_Li · May 15, 2015, 5:05am

>> But as mentioned, ld -r can work on native object wrapped bitcode
>> without a plugin as well.
>>
>
> How? It's not like any partial linking is going to go on inside the
> bitcode if the linker doesn't understand bitcode.

It allows us to delay the actual linking until the full link step,
thereby enabling ThinLTO on those modules.

As we discussed offline, the current ld -r behavior with the plugin is
to compile all the way down to machine code. The alternative if we use
straight bitcode is to tell the plugin to stop early after combining
the bitcode and emit bitcode back out, with the thinlto function info
also combined.

I think this is what should happen anyhow. ld -r that doesn't do a partial
link is misleading.

> Right. I'm not entirely sure what use we're going to see in the existing
> tools that we want to encompass here. There's some of it for convenience
> (i.e. nm etc for developers), but they can use a tool that understands
> bitcode and we can make the existing llvm tools suffice for these needs.

My understanding from our discussion is that the llvm versions of
those tools do not accept native object files, so that is not
something that will work in the short term.

We have tools that do understand native object files and it's pretty easy
to use the libraries that they're built upon.

The best alternative to native wrapped bitcode seems to be relying on
the plugin (and changing its behavior for ld -r). Which means that we
lose the ability to use some of the tools from the native toolchain out
of the box, and build systems such as ours have to be taught to use the
plugin.

I don't think that natively wrapped bitcode gets you as much as you think
it does anyhow, unless you're duplicating a lot of information (ar, as
discussed earlier, aside). I'm not too worried about the build system as
far as a wrapping mechanism

Do not underestimate the importance of build system integration. Tools
used in the build can include ar, nm, ranlib, objcopy, strip, etc. The
latest binutils supports plugins for ar, nm and ranlib, but not the others.
objcopy can actually change the visibility of symbols. It is conceivably easy
to support this with a native object wrapper (i.e. propagate the visibility
change at IR reading time), but it is unclear whether the existing plugin
interfaces are enough for it.

and I think more traditional LTO schemes with LLVM have just used
bitcode/IR output as an input to the LTO link step. I think what we're
talking about here is the best way to encode the data that thin lto
needs/wants in order to handle summary information etc right?

Not entirely. We want both functionality and usability. Easy integration
with the build is what I mean by usability. Asking users to use wrapper
tools in order to pass the plugin path is not something I consider
highly usable -- but I can be convinced otherwise.

thanks,

David

bigboze · May 15, 2015, 12:11pm

Are you sure about the additional I/O? With a native symtab, existing tools just need to read that, while the plugin-based approach needs to read the bitcode section to feed symbols back to the tool.

The additional I/O will be quite big if you are going to emit the full symbol table. Looking at some of our real-world links, the symbol tables and string tables of all the inputs seen by the linker add up to about 50-100 MB.

teresajohnson · May 15, 2015, 2:30pm

Thanks for all the feedback and questions, answers below.
Teresa

I've included below an RFC for implementing ThinLTO in LLVM, looking
forward to feedback and questions.
Thanks!
Teresa

RFC to discuss plans for implementing ThinLTO upstream. Background can
be found in slides from EuroLLVM 2015:
https://drive.google.com/open?id=0B036uwnWM6RWWER1ZEl5SUNENjQ&authuser=0
As described in the talk, we have a prototype implementation, and
would like to start staging patches upstream. This RFC describes a
breakdown of the major pieces. We would like to commit upstream
gradually in several stages, with all functionality off by default.
The core ThinLTO importing support and tuning will require frequent
change and iteration during testing and tuning, and for that part we
would like to commit rapidly (off by default). See the proposed staged
implementation described in the Implementation Plan section.

ThinLTO Overview

See the talk slides linked above for more details. The following is a
high-level overview of the motivation.

Cross Module Optimization (CMO) is an effective means for improving
runtime performance, by extending the scope of optimizations across
source module boundaries. Without CMO, the compiler is limited to
optimizing within the scope of single source modules. Two solutions
for enabling CMO are Link-Time Optimization (LTO), which is currently
supported in LLVM and GCC, and Lightweight-Interprocedural
Optimization (LIPO). However, each of these solutions has limitations
that prevent it from being enabled by default. ThinLTO is a new
approach that attempts to address these limitations, with a goal of
being enabled more broadly. ThinLTO is designed with many of the same
principles as LIPO, and therefore its advantages, without any of its
inherent weaknesses. Unlike in LIPO where the module group decision is
made at profile training runtime, ThinLTO makes the decision at
compile time, but in a lazy mode that facilitates large scale
parallelism. The serial linker plugin phase is designed to be razor
thin and blazingly fast. By default this step only does minimal
preparation work to enable the parallel lazy importing performed
later. ThinLTO aims to be scalable like a regular O2 build, enabling
CMO on machines without large memory configurations, while also
integrating well with distributed build systems. Results from early
prototyping on SPEC cpu2006 C++ benchmarks are in line with
expectations that ThinLTO can scale like O2 while enabling much of the
CMO performed during a full LTO build.

A ThinLTO build is divided into 3 phases, which are referred to in the
following implementation plan:

phase-1: IR and Function Summary Generation (-c compile)
phase-2: Thin Linker Plugin Layer (thin archive linker step)
phase-3: Parallel Backend with Demand-Driven Importing

Implementation Plan

This section gives a high-level breakdown of the ThinLTO support that
will be added, in roughly the order that the patches would be staged.
The patches are divided into three stages. The first stage contains a
minimal amount of preparation work that is not ThinLTO-specific. The
second stage contains most of the infrastructure for ThinLTO, which
will be off by default. The third stage includes
enhancements/improvements/tunings that can be performed after the main
ThinLTO infrastructure is in.

The second and third implementation stages will initially be very
volatile, requiring a lot of iterations and tuning with large apps to
get stabilized. Therefore it will be important to do fast commits for
these implementation stages.

1. Stage 1: Preparation
-------------------------------

The first planned sets of patches are enablers for ThinLTO work:

a. LTO directory structure:

Restructure the LTO directory to remove a circular dependence when the
ThinLTO pass is added. Because ThinLTO is being implemented as a SCC pass
within Transforms/IPO, and leverages the LTOModule class for linking
in functions from modules, IPO then requires the LTO library. This
creates a circular dependence between LTO and IPO. To break that, we
need to split the lib/LTO directory/library into lib/LTO/CodeGen and
lib/LTO/Module, containing LTOCodeGenerator and LTOModule,
respectively. Only LTOCodeGenerator has a dependence on IPO, removing
the circular dependence.

I wonder whether LTOModule is a good fit (it might be; I'm not sure).
We still use it in libLTO, but gold-plugin.cpp no longer uses it,
instead using lib/Object and lib/Linker directly.

b. ELF wrapper generation support:

(From elsewhere in the thread, it looks like you're just using ELF
as a short-hand for "native".)

Right, I should have written this as native object wrapper. I had
focused on ELF since that was what I have been looking at most
closely, but the support can be more general.

Implement ELF wrapped bitcode writer. In order to more easily interact
with tools such as $AR, $NM, and “$LD -r” we plan to emit the phase-1
bitcode wrapped in ELF via the .llvmbc section, along with a symbol
table. The goal is both to interact with these tools without requiring
a plugin, and also to avoid doing partial LTO/ThinLTO across files
linked with “$LD -r” (i.e. the resulting object file should still
contain ELF-wrapped bitcode to enable ThinLTO at the full link step).

Shouldn't `ld -r` change symbol visibility and such? How do you plan
to handle that when you concatenate sections?

If we use native object wrapped bitcode, ld -r would not do any
changing of symbols or merging. It would be more like an archive in
that it packages the bitcode and delays merging until the backend.
That way its constituents are still bitcode available for importing
into other modules.

For the non-wrapped bitcode option, using the gold plugin, we would
want to change the behavior for ld -r to be similar to what you are
describing for ld64, i.e. emit bitcode.

For reference, ld64 (through libLTO) merges all the bitcode together
with lib/Linker, gives all "hidden" symbols local linkage (by running
-internalize with OnlyHidden=1), and writes out a new bitcode file.
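
As a rough illustration of the ld64 behavior described above, here is a toy Python model. It is not LLVM code: the Symbol class and the merge_and_internalize helper are invented for this sketch, and duplicate-definition handling is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    visibility: str = "default"  # "default" or "hidden"
    linkage: str = "external"    # "external" or "internal"


def merge_and_internalize(modules):
    """Toy model: merge symbol lists, then give hidden symbols local linkage."""
    merged = {}
    for module in modules:
        for sym in module:
            # A real linker resolves duplicate definitions properly; this
            # sketch simply keeps the first definition it sees.
            merged.setdefault(
                sym.name, Symbol(sym.name, sym.visibility, sym.linkage)
            )
    for sym in merged.values():
        if sym.visibility == "hidden":
            sym.linkage = "internal"
    return merged


a = [Symbol("main"), Symbol("helper", visibility="hidden")]
b = [Symbol("api"), Symbol("detail", visibility="hidden")]
out = merge_and_internalize([a, b])
print({name: sym.linkage for name, sym in out.items()})
```

After a merge like this the hidden symbols can no longer be referenced from outside the merged file, which is why the wrapped-bitcode approach above prefers to delay merging until the backend.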

I will send a separate design document for these changes, but the
following is a high-level overview.

Support was added to LLVM for reading ELF-wrapped bitcode
(rG10039c02ea1d), but there does not yet exist
support in LLVM/Clang for emitting bitcode wrapped in ELF. I plan to
add support for optionally generating bitcode in an ELF file
containing a single .llvmbc section holding the bitcode. Specifically,
the patch would add new options “emit-llvm-bc-elf” (object file) and
corresponding “emit-llvm-elf” (textual assembly code equivalent).

If we decide to go this way -- wrapping the bitcode in the native
object format -- wouldn't emit-llvm-native or emit-llvm-object be
better? The native object format is implied by the triple.

Yes, that is better.

Eventually these would be automatically triggered under “-fthinlto -c”
and “-fthinlto -S”, respectively.

Additionally, a symbol table will be generated in the ELF file,
holding the function symbols within the bitcode. This facilitates
handling archives of the ELF-wrapped bitcode created with $AR, since
the archive will have a symbol table as well. The archive symbol table
enables gold to extract and pass to the plugin the constituent
ELF-wrapped bitcode files. To support the concatenated llvmbc section
generated by “$LD -r”, some handling needs to be added to gold and to
the backend driver to process each original module’s bitcode.

The function index/summary will later be added as a special ELF
section alongside the .llvmbc sections.

2. Stage 2: ThinLTO Infrastructure
----------------------------------------------

The next set of patches adds the base implementation of the ThinLTO
infrastructure, specifically those required to make ThinLTO functional
and generate correct but not necessarily high-performing binaries. It
also does not include support to make debug support under -g efficient
with ThinLTO.

I think we should at least have a vague plan...

Sorry, I should have been clearer here. I do have a plan for this and
know how to do it (it is implemented in my prototype). It's discussed
below under Stage 3. I was debating whether to put the metadata
handling under Stage 2, but it isn't strictly necessary to get the
ThinLTO pipeline working. You just end up with a lot of duplicate
metadata/debug as you have to import it multiple times. But really the
metadata (incl debug) handling should be the next thing after the
basic ThinLTO pipeline is done.

a. Clang/LLVM/gold linker options:

An early set of clang/llvm patches is needed to provide options to
enable ThinLTO (off by default), so that the rest of the
implementation can be disabled by default as it is added.
Specifically, clang options -fthinlto (used instead of -flto) will
cause clang to invoke the phase-1 emission of LLVM bitcode and
function summary/index on a compile step, and pass the appropriate
option to the gold plugin on a link step. The -thinlto option will be
added to the gold plugin and llvm-lto tool to launch the phase-2 thin
archive step. The -thinlto option will also be added to the ‘opt’ tool
to invoke it as a phase-3 parallel backend instance.

I'm not sure I follow the `opt` part of this. That's a developer
tool, not something we ship. It also doesn't have a backend (doesn't
do CodeGen). What am I missing?

For the prototype I was using llvm-lto as my backend driver. I
realized that this was probably not the best option as we don't need
all of the LTO handling built into that driver, and it isn't listed as
a tool on the LLVM Command Guide page, so my feeling was that
'opt' was better supported and a better alternative. Unfortunately
when I was writing this up I forgot that 'opt' generates bitcode not
an object file.

Another option would be to use clang and allow it to accept bitcode
and bypass parsing under an appropriate ThinLTO option. AFAICT there
isn't currently an option for clang to accept bitcode. Do you think
this is the right approach?

b. Thin-archive linking support in Gold plugin and llvm-lto:

Under the new plugin option (see above), the plugin needs to perform
the phase-2 (thin archive) link which simply emits a combined function
map from the linked modules, without actually performing the normal
link. Corresponding support should be added to the standalone llvm-lto
tool to enable testing/debugging without involving the linker and
plugin.

c. ThinLTO backend support:

Support for invoking a phase-3 backend invocation (including
importing) on a module should be added to the ‘opt’ tool under the new
option. The main change under the option is to instantiate a Linker
object used to manage the process of linking imported functions into
the module, efficient read of the combined function map, and enable
the ThinLTO import pass.

d. Function index/summary support:

This includes infrastructure for writing and reading the function
index/summary section. As noted earlier this will be encoded in a
special ELF section within the module, alongside the .llvmbc section
containing the bitcode. The thin archive generated by phase-2 of
ThinLTO simply contains all of the function index/summary sections
across the linked modules, organized for efficient function lookup.

Each function available for importing from the module contains an
entry in the module’s function index/summary section and in the
resulting combined function map. Each function entry contains that
function’s offset within the bitcode file, used to efficiently locate
and quickly import just that function.
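
A minimal sketch of the relationship between the per-module function index and the combined function map, assuming a simple name-to-offset mapping. The IndexEntry type, the build_combined_map helper, and the file names and offsets are all hypothetical; the actual on-disk format is not specified in this RFC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    module_path: str     # which bitcode file holds the function body
    bitcode_offset: int  # offset of the function body within that file


def build_combined_map(module_indexes):
    """module_indexes: {module_path: {function_name: bitcode_offset}}"""
    combined = {}
    for path, index in module_indexes.items():
        for name, offset in index.items():
            # Assumption: names are unique across modules once statics
            # have been promoted and renamed.
            combined[name] = IndexEntry(path, offset)
    return combined


per_module = {
    "a.o": {"foo": 128, "bar": 512},
    "b.o": {"baz": 64},
}
combined = build_combined_map(per_module)
entry = combined["bar"]
print(entry.module_path, entry.bitcode_offset)  # prints: a.o 512
```

A backend that decides to import "bar" can then open a.o and seek straight to the recorded offset instead of scanning the whole module.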

I don't think you'll actually buy anything here over the lazy-loading
feature in the BitcodeReader (although perhaps you can help improve
it if you have some ideas). In practice, to correctly load a
Function you need to load constants (include declarations for other
GlobalValues) and metadata that it references.

As you saw below, it is leveraging the lazy loading support. The
metadata handling is discussed later on in 3a.

The entry also contains summary
information (e.g. basic information determined during parsing such as
the number of instructions in the function), that will be used to help
guide later import decisions. Because the contents of this section
will change frequently during ThinLTO tuning, it should also be marked
with a version id for backwards compatibility or version checking.
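
To make the version-id point concrete, here is a small sketch of a versioned summary blob. The layout (a 4-byte version followed by fixed-size offset/instruction-count records) and the SUMMARY_VERSION value are assumptions made up for this example, not the proposed section format.

```python
import struct

SUMMARY_VERSION = 1            # hypothetical version id
HEADER = struct.Struct("<I")   # version
RECORD = struct.Struct("<QI")  # bitcode offset, instruction count


def write_summary(records):
    """Serialize (offset, instruction_count) pairs behind a version header."""
    out = HEADER.pack(SUMMARY_VERSION)
    for offset, inst_count in records:
        out += RECORD.pack(offset, inst_count)
    return out


def read_summary(blob):
    """Reject blobs written by a different version, then decode the records."""
    (version,) = HEADER.unpack_from(blob, 0)
    if version != SUMMARY_VERSION:
        raise ValueError(f"unsupported summary version {version}")
    body = blob[HEADER.size:]
    return [RECORD.unpack_from(body, i) for i in range(0, len(body), RECORD.size)]


blob = write_summary([(128, 12), (512, 340)])
records = read_summary(blob)
print(records)  # prints: [(128, 12), (512, 340)]
```

Checking the version before decoding lets a reader fail cleanly on objects built with an older or newer summary layout rather than misinterpreting the records.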

e. ThinLTO importing support:

Support for the mechanics of importing functions from other modules,
which can go in gradually as a set of patches since it will be off by
default. Separate patches can include:

- BitcodeReader changes to use function index to import/deserialize
single function of interest (small changes, leverages existing lazy
streamer support).

Ah, here it is. Should have read ahead.

How do you plan to handle references to other GlobalValues (global
variables, functions, and aliases)? If you're going to keep loading
the symbol table (which I think you need to?), then the lazy loader
already creates a function index. Or do you have some other plan?

We do have to reload the declarations and other symbol table info.
Where it differs from the lazy loader is that we don't need to keep
parsing the module to build up the function index
(DeferredFunctionInfo), with repeated calls to
FindFunctionInStream/ParseModule. Once we hit the first function body
we stop, then when materializing we simply set up the
DeferredFunctionInfo entry from the bitcode index that was saved in
the ThinLTO function index.

If an imported function references functions with internal linkage,
will you pull in copies of those functions as well?

There are two possibilities in this case: promotion (along with
renaming to avoid name clashing with other modules), or force import.
As you note later on, I talk about promotion just below here. To limit
the required static promotions I have implemented a strategy where we
attempt to force import referenced functions that have internal
linkage. But we still must do static promotion if the local function
(or global) is potentially imported to another module (in the combined
function map) and is address exposed.
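
The promotion rule above can be sketched as follows. This is only an illustration: the dictionary-based symbol records, the needs_promotion and promoted_name helpers, and the hash-based suffix are assumptions, not the scheme used in the prototype.

```python
import hashlib


def needs_promotion(sym, combined_map):
    """A local must be promoted if it may be imported elsewhere and is address exposed."""
    return (
        sym["linkage"] == "internal"
        and sym["name"] in combined_map
        and sym["address_taken"]
    )


def promoted_name(name, module_path):
    """Derive a module-unique name; the suffix scheme here is hypothetical."""
    suffix = hashlib.sha1(module_path.encode()).hexdigest()[:8]
    return f"{name}.{suffix}"


combined_map = {"helper", "callback"}
helper = {"name": "helper", "linkage": "internal", "address_taken": False}
callback = {"name": "callback", "linkage": "internal", "address_taken": True}

print(needs_promotion(helper, combined_map))    # prints: False
print(needs_promotion(callback, combined_map))  # prints: True
print(promoted_name("callback", "a.o") == promoted_name("callback", "a.o"))  # prints: True
```

Deriving the new name only from the original name and the home module means every backend process computes the same promoted name independently, which is one way to get the consistent renaming the plan calls for.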

If an imported function references global variables with internal
linkage... actually, that doesn't seem legal. Will you disallow
importing such functions? How will you mark them?

Static promotion handles this.

- Minor LTOModule changes to pass the ThinLTO function to import and
its index into bitcode reader.

- Marking of imported functions (for use in ThinLTO-specific symbol
linking and global DCE, for example).

Marking how? Do you mean giving them internal linkage, or something
else?

Mentioned just after this: either an in-memory flag on the Function
class, or potentially in the IR. For the prototype I just had a flag
on the Function class.

What's your plan for ThinLTO-specific symbol linking?

Mentioned just below here as you note.

This can be in-memory initially,
but IR support may be required in order to support streaming bitcode
out and back in again after importing.

- ModuleLinker changes to do ThinLTO-specific symbol linking and
static promotion when necessary. The linkage type of imported
functions changes to AvailableExternallyLinkage, for example. Statics
must be promoted in certain cases, and renamed in consistent ways.

Ah, could have read ahead again; this answers my questions about
referencing global variables with local linkage.

It also sounds pretty hairy. Details welcome.

It has to be well thought out for sure. We had to do this for LIPO as
well so already knew what needed to be done here. I will put together
more details in a follow-on email.

- GlobalDCE changes to support removing imported functions that were
not inlined (very small changes to existing pass logic).

If you give them "available_externally" linkage, won't this already
happen?

There were only a couple minor tweaks required here (under the flag I
added to the Function indicating that this was imported). Only
promoted statics are remarked available_externally. For a
non-discardable symbol that was imported, we can discard here since we
are done with inlining (it is non-discardable in its home module).

f. ThinLTO Import Driver SCC pass:

Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via
an SCC pass, enabled only under -fthinlto options. The pass includes
utilizing the thin archive (global function index/summary), import
decision heuristics, invocation of LTOModule/ModuleLinker routines
that perform the import, and any necessary callgraph updates and
verification.

g. Backend Driver:

For a single node build, the gold plugin can simply write a makefile
and fork the parallel backend instances directly via parallel make.

This doesn't seem like the way we'd want to test this, and it
seems strange for the toolchain to require a build system...

The idea is to make this all transparent to the user. So you can just
do something like:
% clang -fthinlto -O2 *.cc -c
% clang -fthinlto -O2 *.o

the second command will do everything transparently (phase-2 thin
plugin layer, launch parallel backend processes, hand back resulting
native object code to linker, produce a.out). So somehow the plugin
needs to launch the parallel backend processes.
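
One way to picture the single-node case is a generated makefile with one independent rule per module. The sketch below only builds the makefile text; the "thinlto-backend" command and its arguments are placeholders invented for this example, since the backend driver had not been settled at this point in the thread.

```python
def write_backend_makefile(modules, combined_index, backend_cmd="thinlto-backend"):
    """Return Makefile text with one independent rule per module.

    backend_cmd and its argument order are hypothetical placeholders.
    """
    outputs = [f"{m}.native.o" for m in modules]
    lines = ["all: " + " ".join(outputs), ""]
    for module, out in zip(modules, outputs):
        # Each output depends only on its own module and the combined index,
        # so the rules can run in parallel.
        lines.append(f"{out}: {module} {combined_index}")
        lines.append(f"\t{backend_cmd} {module} {combined_index} -o {out}")
        lines.append("")
    return "\n".join(lines)


text = write_backend_makefile(["a.o", "b.o"], "combined.idx")
print(text)
```

The plugin could then run something like "make -j N -f <generated file>" and hand the resulting native objects back to the linker.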

3. Stage 3: ThinLTO Tuning and Enhancements
----------------------------------------------------------------

This refers to the patches that are not required for ThinLTO to work,
but rather to improve compile time, memory, run-time performance and
usability.

a. Lazy Debug Metadata Linking:

The prototype implementation included lazy importing of module-level
metadata during the ThinLTO pass finalization (i.e. after all function
importing is complete). This actually applies to all module-level
metadata, not just debug, although it is the largest. This can be
added as a separate set of patches, with changes to BitcodeReader,
ValueMapper, and ModuleLinker.

It sounds like this would work well with the "full" LTO implemented
by tools/gold-plugin right now. What exactly did you do to improve
this?

I don't think it will help with full LTO. The parsing of the metadata
is only delayed until the ThinLTO pass finalization, and the delayed
metadata import is necessary to avoid reading and linking in the
metadata multiple times (for each function imported from that module).
Coming out of the ThinLTO pass you still have all the metadata
necessary for each function that was imported. For a full LTO that
would end up being all of the metadata in the module.

The high level summary is that during the initial import it leaves the
temporary metadata on the instructions that were imported, but saves
the index used by the bitcode reader used to correlate with the
metadata when it is ready (i.e. the MDValuePtrs index), and skips the
metadata parsing. During finalization we parse just the metadata, and
suture it up during metadata value mapping using the saved index.
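
A toy model of the deferred metadata scheme described above, assuming instructions reference metadata by index into their source module's metadata table. The LazyMetadataImporter class and the dictionary-based instructions are hypothetical stand-ins for the BitcodeReader and ValueMapper machinery.

```python
class LazyMetadataImporter:
    def __init__(self, parse_metadata):
        # parse_metadata: callable taking a module name and returning the
        # module's metadata table as a list (the expensive step).
        self.parse_metadata = parse_metadata
        self.pending = []  # (instruction, module, metadata index)

    def import_function(self, module, instructions):
        """Import instructions but leave metadata as a placeholder."""
        for inst in instructions:
            idx = inst.get("md_index")
            if idx is not None:
                inst["md"] = None  # temporary placeholder
                self.pending.append((inst, module, idx))

    def finalize(self):
        """Parse each module's metadata once and patch the placeholders."""
        tables = {}
        for inst, module, idx in self.pending:
            if module not in tables:
                tables[module] = self.parse_metadata(module)
            inst["md"] = tables[module][idx]
        self.pending.clear()


parse_calls = []


def fake_parse(module):
    parse_calls.append(module)
    return [f"{module}:md{i}" for i in range(3)]


importer = LazyMetadataImporter(fake_parse)
f1 = [{"op": "call", "md_index": 0}, {"op": "ret"}]
f2 = [{"op": "load", "md_index": 2}]
importer.import_function("b.o", f1)
importer.import_function("b.o", f2)
importer.finalize()
print(parse_calls)  # prints: ['b.o']
```

Two functions were imported from b.o, but its metadata was parsed only once, at finalization.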

Xinliang_David_Li · May 15, 2015, 3:26pm

There is no need for emitting the full symtab. I checked the overhead with a huge internal C++ source. The overhead of symtab + str table compared with byte code with debug is about 3%.

More importantly, it is also possible to use the symtab also for index/summary purpose, which makes the space usage completely ‘unwasted’. That gets into the details which will follow when patches are in.

David

dblaikie · May 15, 2015, 4:18pm

>> - Marking of imported functions (for use in ThinLTO-specific symbol
>> linking and global DCE, for example).
>
> Marking how? Do you mean giving them internal linkage, or something
> else?

Mentioned just after this: either an in-memory flag on the Function
class, or potentially in the IR. For the prototype I just had a flag
on the Function class.

Would this be anything other than "available externally" linkage? (either
this module is responsible for emitting the definition of this function for
other modules to call, or it's not - if it's not, either this module
depends on another module to provide a definition (this would be "available
externally") or it doesn't (this would be "internal" linkage")) - I
think... though I'm sort of jumping into the middle of some of this &
might've missed some context, sorry.

- David

dblaikie · May 15, 2015, 4:20pm

>> - Marking of imported functions (for use in ThinLTO-specific symbol
>> linking and global DCE, for example).
>
> Marking how? Do you mean giving them internal linkage, or something
> else?

Mentioned just after this: either an in-memory flag on the Function
class, or potentially in the IR. For the prototype I just had a flag
on the Function class.

Would this be anything other than "available externally" linkage? (either
this module is responsible for emitting the definition of this function for
other modules to call, or it's not - if it's not, either this module
depends on another module to provide a definition (this would be "available
externally") or it doesn't (this would be "internal" linkage")) - I
think... though I'm sort of jumping into the middle of some of this &
might've missed some context, sorry.

& I see your follow up further down in that thread:

"There were only a couple minor tweaks required here (under the flag I
added to the Function indicating that this was imported). Only
promoted statics are remarked available_externally. For a
non-discardable symbol that was imported, we can discard here since we
are done with inlining (it is non-discardable in its home module)."

&, like Duncan, I'll wait for more details on that front. (may or may not
be useful to split some of these subthreads into separate email threads to
keep discussion clear - but I'm not sure)

Xinliang_David_Li · May 15, 2015, 4:47pm

(resent as the previous message got bounced)

There is no need for emitting the full symtab. I checked the overhead with
a huge internal C++ source. The overhead of symtab + str table compared
with byte code with debug is about 3%.

More importantly, there is a plan to also use the symtab for ThinLTO indexing
purposes, which makes the space usage completely 'unwasted'. That gets into
the details which will follow when the patches are in (with design docs).

thanks,

David

teresajohnson · May 15, 2015, 4:53pm

>> - Marking of imported functions (for use in ThinLTO-specific symbol
>> linking and global DCE, for example).
>
> Marking how? Do you mean giving them internal linkage, or something
> else?

Mentioned just after this: either an in-memory flag on the Function
class, or potentially in the IR. For the prototype I just had a flag
on the Function class.

Would this be anything other than "available externally" linkage? (either
this module is responsible for emitting the definition of this function for
other modules to call, or it's not - if it's not, either this module depends
on another module to provide a definition (this would be "available
externally") or it doesn't (this would be "internal" linkage")) - I think...
though I'm sort of jumping into the middle of some of this & might've missed
some context, sorry.

& I see your follow up further down in that thread:

"There were only a couple minor tweaks required here (under the flag I
added to the Function indicating that this was imported). Only
promoted statics are remarked available_externally. For a
non-discardable symbol that was imported, we can discard here since we
are done with inlining (it is non-discardable in its home module)."

&, like Duncan, I'll wait for more details on that front. (may or may not be
useful to split some of these subthreads into separate email threads to keep
discussion clear - but I'm not sure)

I just went back and looked at my prototype and I had remembered this
wrong. An imported function is always marked
AvailableExternallyLinkage, unless it has link once linkage.

As far as using that to indicate that it is an aux function, I was
concerned about overloading the meaning of that linkage type. See the
next para for an example where I am unsure about doing this...

Looking back through my GlobalDCE changes, it looks like one of the
places I had changed (where we mark defined globals in runOnModule)
already has a guard for !hasAvailableExternallyLinkage and
!isDiscardableIfUnused, so my additional guard against marking
imported functions is unnecessary. But the other place I had to change
was in GlobalIsNeeded where it walks through the function and
recursively marks any referenced global as needed. Here there was no
guard against marking a global that is available externally as needed
if it is referenced. I had added a check here to not mark imported
functions as needed on reference unless they were discardable (i.e.
link once). Is this a bug - should this have a guard against marking
available externally function refs as needed?

There was one other change to GlobalDCE that I had to make, where we
erase the list of DeadFunctions from the module. In my case we may
have a function body that was eliminated, but still have references to
it (i.e. in the case I am talking about above). In that case it is now
just a declaration and we don't erase it from the function list on the
module. If the GlobalIsNeeded code is changed to avoid marking
available externally function refs as needed then this change would be
needed for that case as well.
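
The marking question above can be illustrated with a toy liveness walk. This is not the GlobalDCE implementation: the dictionary-based globals, the DISCARDABLE set, and the live_globals helper are simplifications made up for this sketch.

```python
DISCARDABLE = {"linkonce", "linkonce_odr", "internal", "private"}


def live_globals(globals_, skip_available_externally_refs):
    """Return the set of globals marked as needed.

    Roots are globals that are neither available_externally nor discardable
    if unused; references are then followed recursively.
    """
    needed = set()

    def mark(name):
        if name in needed:
            return
        needed.add(name)
        for ref in globals_[name]["refs"]:
            # The guard under discussion: do not keep an available_externally
            # function alive just because something still references it.
            if (skip_available_externally_refs
                    and globals_[ref]["linkage"] == "available_externally"):
                continue
            mark(ref)

    for name, g in globals_.items():
        if g["linkage"] != "available_externally" and g["linkage"] not in DISCARDABLE:
            mark(name)
    return needed


module = {
    "main": {"linkage": "external", "refs": ["imported_fn"]},
    "imported_fn": {"linkage": "available_externally", "refs": []},
}
print(sorted(live_globals(module, skip_available_externally_refs=False)))
# prints: ['imported_fn', 'main']
print(sorted(live_globals(module, skip_available_externally_refs=True)))
# prints: ['main']
```

With the guard enabled, imported_fn is not marked as needed even though main still calls it, so its body can be dropped while a declaration must remain for the call.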

Thanks,
Teresa

dblaikie · May 15, 2015, 5:01pm

That direction ends up more heavily leaning on this model, though. Keeping
all the LLVM stuff (including summary info) in the IR means that on
platforms with bitcode-aware tools (like, by the sounds of it, OSX with ld
being bitcode aware, etc) we can support a nice bitcode-only solution.
Wrapping that in native object files for backwards compatibility for a few
tools seems OK, but the more features we build on top of that foundation
the harder it is to get out of that business when/where the backwards
compatibility isn't needed.

Also, leaving the wrapping as a separate backwards compatibility thing
would, I imagine, ease testing by making more parts testable without the
added complexity of the wrapping.

It'd be useful to see the sorts of build system scenarios that use these
native object tools so we can look at what we can/can't reasonably support.
(I assume tools aren't generally expecting a symtab where the symbols
aren't actually in the .text section - I don't know what/if any/how some of
these tools might do the wrong thing when presented with such info - but
this is all outside of my depth/area, so don't worry about explaining it to
me, but it seems other people care about what we're supporting here, at
least)

dblaikie · May 15, 2015, 5:04pm

>>> >> - Marking of imported functions (for use in ThinLTO-specific symbol
>>> >> linking and global DCE, for example).
>>> >
>>> > Marking how? Do you mean giving them internal linkage, or something
>>> > else?
>>>
>>> Mentioned just after this: either an in-memory flag on the Function
>>> class, or potentially in the IR. For the prototype I just had a flag
>>> on the Function class.
>>
>> Would this be anything other than "available externally" linkage? (either
>> this module is responsible for emitting the definition of this function
>> for other modules to call, or it's not - if it's not, either this module
>> depends on another module to provide a definition (this would be
>> "available externally") or it doesn't (this would be "internal"
>> linkage)) - I think... though I'm sort of jumping into the middle of
>> some of this & might've missed some context, sorry.
>
> & I see your follow up further down in that thread:
>
> "There were only a couple minor tweaks required here (under the flag I
> added to the Function indicating that this was imported). Only
> promoted statics are remarked available_externally. For a
> non-discardable symbol that was imported, we can discard here since we
> are done with inlining (it is non-discardable in its home module)."
>
> &, like Duncan, I'll wait for more details on that front. (may or may
> not be useful to split some of these subthreads into separate email
> threads to keep discussion clear - but I'm not sure)

I just went back and looked at my prototype and I had remembered this
wrong. An imported function is always marked
AvailableExternallyLinkage, unless it has link once linkage.

As far as using that to indicate that it is an aux function, I was
concerned about overloading the meaning of that linkage type. See the
next para for an example where I am unsure about doing this...

Looking back through my GlobalDCE changes, it looks like one of the
places I had changed (where we mark defined globals in runOnModule)
already has a guard for !hasAvailableExternallyLinkage and
!isDiscardableIfUnused, so my additional guard against marking
imported functions is unnecessary. But the other place I had to change
was in GlobalIsNeeded where it walks through the function and
recursively marks any referenced global as needed. Here there was no
guard against marking a global that is available externally as needed
if it is referenced. I had added a check here to not mark imported
functions as needed on reference unless they were discardable (i.e.
link once). Is this a bug - should this have a guard against marking
available externally function refs as needed?

Duncan's probably got a better idea of what the right answer is here. I
suspect "yes".

The trick with available_externally is to ensure we keep these around for
long enough that their definitions are useful (for inlining, constant prop,
all that good stuff) but remove them before we actually do too much work on
them, or at least before we emit them into the final object file.

I imagine if GlobalDCE isn't removing available_externally functions it's
because they're still useful in the optimization pipeline and something
further down the pipe removes them (because they do, ultimately, get
removed). Any idea what problematic behavior you were seeing that was
addressed by changing DCE to remove these functions? I imagine whatever
the bad behavior was, it could be demonstrated without ThinLTO and with a
normal module with an available_externally function & would be worth
discussing what the right behavior is in that context, regardless of
ThinLTO.

bigboze · May 15, 2015, 5:07pm

There is no need for emitting the full symtab. I checked the overhead with a huge internal C++ source. The overhead of symtab + str table compared with byte code with debug is about 3%.

It’s still sizable and could be noticeable if thinLTO can deliver compile times close to those of builds without LTO, as your results suggest.

More importantly, it is also possible to use the symtab also for index/summary purpose, which makes the space usage completely ‘unwasted’. That gets into the details which will follow when patches are in.

There is symbol information in both the native object symbol table and the bitcode file? Isn’t that a waste? I understand the reasons for using the native object wrapper (compatibility with other tools) and am happy with that. But I’d also like to see the option for function index/summary data to be produced without the wrapper, so that bitcode aware tools do not need to use this wrapped format. If you mix the native object wrapper symbol information with the function index/summary data then that would end up being impossible.

Also won’t having the native object data with the function index/summary have a cost on testing for all of the supported native object formats?

Xinliang_David_Li · May 15, 2015, 5:32pm

(resend a bounced message).

I don't think that natively wrapped bitcode gets you as much as you think
it does anyhow, unless you're duplicating a lot of information (ar, as
discussed earlier, aside). I'm not too worried about the build system as
far as a wrapping mechanism

Do not underestimate the importance of build system integration. Tools
used in the build can include ar, nm, ranlib, objcopy, strip, etc. The
latest binutils supports plugins for ar, nm and ranlib, but not the others.
objcopy can actually change visibility of symbols. It is conceivably easy
to support this with a native object wrapper (i.e. propagate the visibility
change at IR reading time), but it is unclear whether existing plugin
interfaces are enough for it.

and I think more traditional LTO schemes with LLVM have just used
bitcode/IR output as an input to the LTO link step. I think what we're
talking about here is the best way to encode the data that thin lto
needs/wants in order to handle summary information etc right?

Not entirely. We want both functionality and usability. Easy integration
with the build system is what I mean by usability. Asking users to use
wrapper tools in order to pass a plugin path is not something I consider
highly usable -- but I can be convinced otherwise.

David

Xinliang_David_Li · May 15, 2015, 5:40pm

> There is no need for emitting the full symtab. I checked the overhead
> with a huge internal C++ source. The overhead of symtab + str table
> compared with byte code with debug is about 3%.

It's still sizable and could be noticeable if ThinLTO can deliver compile
times that are closer to those of builds without LTO, as your results
suggest.

If the cost is part of the index/summary, then it is avoidable.

> More importantly, it is also possible to use the symtab for
> index/summary purposes, which makes the space usage completely 'unwasted'.
> That gets into the details which will follow when patches are in.

There is symbol information in both the native object symbol table and the
bitcode file? Isn't that a waste? I understand the reasons for using the
native object wrapper (compatibility with other tools) and am happy with
that. But I'd also like to see the option for function index/summary data
to be produced without the wrapper, so that bitcode-aware tools do not need
to use this wrapped format.

I agree.

If you mix the native object wrapper symbol information with the
function index/summary data, then that would end up being impossible.

It is possible. The summary data is still kept separately, in its own
section. Under the bitcode-only option, the symtab will be replaced with a
bitcode form of the index, while the summary remains the same.
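
The layout described above can be sketched roughly as follows. This is
only an illustration under assumed names: the section labels (`bitcode`,
`summary`, `symtab`, `bitcode_index`) and the `build_container` helper are
hypothetical, and the actual format is left to the patches.

```python
def build_container(bitcode, summary, index, native_wrapper):
    """Assemble the sections of one ThinLTO input file (illustrative only).

    The summary always gets its own section. Only the carrier of the index
    changes: the native symbol table when a wrapper is used, or a bitcode
    encoding of the index when the file is bitcode only.
    """
    sections = {"bitcode": bitcode, "summary": summary}
    if native_wrapper:
        sections["symtab"] = index          # index doubles as native symtab
    else:
        sections["bitcode_index"] = index   # same index, bitcode form
    return sections


index = {"foo": 0, "bar": 1}  # hypothetical: function name -> summary slot
wrapped = build_container(b"BC", b"SUMMARY", index, native_wrapper=True)
plain = build_container(b"BC", b"SUMMARY", index, native_wrapper=False)
```

In both variants the `summary` section is byte-for-byte the same, which is
why a bitcode-aware tool would not need the wrapped form.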

Also, won't having the native object data with the function index/summary
add a testing cost for all of the supported native object formats?

yes.

thanks,

David

