RFC [ThinLTO]: Promoting more aggressively in order to reduce incremental link time and allow sharing between linkage units


Hi all,

I'd like to propose changes to how we do promotion of global values
in ThinLTO. The goal here is to make it possible to pre-compile parts of
the translation unit to native code at compile time. For example, if we
know that:

1) A function is a leaf function, so it will never import any other
functions, and

It still may be imported somewhere else right?

2) The function's instruction count falls above a threshold
specified at compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may itself import and
inline other functions.

or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time
decisions, and we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly like the “non-LTO” case?

Yes, basically the point of this proposal is to be able to split the
linkage unit into LTO and non-LTO parts.

This will also allow the object code to be shared between linkage
units (this should hopefully help solve a major scalability problem for
Chromium, as that project contains a large number of test binaries based on
common libraries).

This can be done with a change to the intermediate object file
format. We can represent object files as native code containing statically
compiled functions and global data in the .text, .data, .rodata (etc.)
sections, with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when
targeting Mach-O) containing bitcode for functions to be compiled at link
time.

In order to make this work, we need to make sure that references
from link-time compiled functions to statically compiled functions work
correctly in the case where the statically compiled function has internal
linkage. We can do this by promoting every global value with internal
linkage, using a hash of the external names (as I mentioned in [1]).

Mehdi - I know you were keen to reduce the amount of promotion. Is
that still an issue for you assuming linker GC (dead stripping)?

Yes: we do better optimization on internal functions in general.

The inliner is one of the affected optimizations -- however, this sounds
like a matter of tuning: teaching the inliner about promoted static
functions.

The inliner computes a tradeoff between estimated runtime cost and binary
size; the existing bonus for static functions applies when there is a
single call site, because then inlining causes no binary size increase
(the static function is dropped after inlining). We promote a function
because we think we are likely to introduce a reference to it somewhere
else, so “lying” to the inliner is not necessarily a good idea.

It is not lying to the inliner. If a static (before promotion) function
is a candidate to be inlined in the original defining module, it is
probably even more likely to be inlined in other importing modules, where
more context is available. In other words, the inliner can apply the same
bonus to 'promoted' static functions, as if references in other modules
will also disappear. Of course, we cannot assume it has a single call site.

Comdat functions can be handled similarly.

That said, we (actually Bruno) already prototyped it, with somewhat
good results :slight_smile:
I’m not convinced yet that it should be independent of whether the
function is promoted or not, though.

Generally true (see the comdat case).

Assuming we solve the inliner issue, there remain the optimizations
other than the inliner. We can probably solve most of them, but I suspect
it won’t be “trivial” either.

Any such optimizations in mind?

I don’t have the details, but in short:

For promoted functions: IPSCCP, dead arg elimination
For promoted global variables: anything that is impacted somehow by
aliasing

When are you imagining that promotion would happen? If it happens just
before codegen (or bitcode emission), it wouldn't inhibit these
optimizations, right?

For ThinLTO it has to happen before the link-time optimizations, because
of cross-module importing.

Are you referring to the fact that these optimizations would be inhibited
versus regular LTO, since we cannot internalize? Yes, that does seem like
an issue.

Yes, this is an issue I'm fighting with currently with ThinLTO.
(And I haven't reached the tuning stage yet because I can't nail down
the infrastructure these days...)

Sorry to chime in late here, away from my email most of the day.

I think the early promotion being proposed by Peter introduces fewer
optimization issues than the missing internalization on globals in ThinLTO.
For example, I would anticipate that the inline bonus for static functions
with a single call site would likely provide any intended benefit during
the inlining performed in the -O2 -c compile step, before bitcode/text
emission, which is presumably when the early promotion would occur. I am
not sure about the other places mentioned by Mehdi, as I am less familiar
with those, but presumably some could/should be done on static functions
during a -O2 compile step (e.g. dead argument elimination?).

For internalization, when I implemented the ThinLTO prototype I played
with applying the single-call-site static function bonus to functions
noted as having a single call in the summary (along with linker GC). It
sounds like
Mehdi/Bruno are also looking at that. I've also not yet had a chance to do
optimization tuning on the upstream implementation, hopefully starting that
very soon though.

Teresa

I was wondering why the “precompiled” function can’t be embedded in the IR instead of the bitcode being embedded in the object file?
The codegen would still emit a single object file out of this IR file, containing the code for both the IR and the precompiled function.

It seems to me that this way the scheme would work with any existing LTO implementation.


You'd still have the same problem. No matter whether you put the native
object inside the IR file or vice versa, you still have a file containing a
native object and some IR. That's the scenario that I found the gold
plugin interface wouldn't support.

Supporting IR embedded in a native object section inside a linker should be
pretty trivial, if you control the linker. My prototype implementation in
lld is about 10 lines of code.

Peter

It is not clear to me why it is a problem for gold: it does not need to know that the IR file contains some native precompiled code; it only needs to know that this is an “LLVM file” that will be passed to LLVM for LTO, and it will get a single object file in return.
Can you elaborate on why the linker needs to know beforehand and differentiate?


(There wouldn't just be one object file, there would be N native objects
and 1 (or N if ThinLTO) combined LTO objects.)

In principle, it doesn't need to know. In practice, I found that in my
prototype I couldn't persuade gold to accept what I was doing without
giving undefined symbol errors.

I suppose I could have debugged it further, but I couldn't justify spending
more time on it, since the projects I care about are interested in
switching to lld for other reasons.

Peter

No, really, that’s not what I described. I described a mode where even llc would be able to process this input IR file with an embedded precompiled function and spit out a single object file.
Conceptually, it should be similar to a naked IR function with inline assembly, for instance, or module-level inline assembly, i.e. totally transparent to the linker.


That doesn't seem like a good idea to me. Each embedded precompiled
function would need relocations, debug info, unwind info etc. in some form
or another. Given that the whole point of embedding precompiled code is to
reduce the work required by the linker, they would probably need to be in
a form close to what will appear in the native object file. At that point
you're basically just inventing another native object format, and we
already have enough of them.

Peter


(There wouldn't just be one object file, there would be N native objects
and 1 (or N if ThinLTO) combined LTO objects.)

If the system can already handle the N case for ThinLTO, that seems like
it would solve the problem here, right? (LTO, when asked by the linker to
produce the N object files, would just build N/2 of them from IR, and
produce the other N/2 that were already object files by spitting the
embedded object code out of the IR into new files without touching any of
the bits.) But perhaps I'm not understanding something.

I think that's what Mehdi means by not having to modify existing linkers -
it seems anything that can cope with ThinLTO could cope with a few more
files being created, no? (I don't know too much about this stuff, though)


What about translation units that have no external names? I hit this
problem with DWARF Fission hashing recently, where two files had code
equivalent to this:

  struct foo { foo(); };
  static foo f;

Thus no external symbols, and indeed exactly the same set of symbols for
two instances of this file (& I have seen examples of this in Google's
codebase - though I haven't searched extensively, and it may be that the
linker never actually picks two of these together, but the DWP tool doesn't
have the same kind of "skip this library if no symbols are needed from it"
behavior as the linker).

Also, (I haven't read the whole thread, but I assume) you're considering
doing this with debug info too? All type information could pretty easily be
emitted up-front and just reduced to declarations (again, on non-LLDB
platforms... :/) for the rest of the debug info. The extra declarations
might make object files a bit bigger, though. (eg: if there were types that
weren't used in any of the ahead-of-time compiled code, but were used in
the ThinLTO'd code - the naive approach would still produce the type info
up front and a declaration in ThinLTO which would make for bigger output
than just putting the type in the ThinLTO'd code - but it would potentially
improve parallelism by reducing the amount of type goo needing to be
imported/exported/emitted during ThinLTO)


Yes, I came across this case in my prototype. This can happen if two such
TUs appear directly as linker inputs (rather than as library members). This
is a rare case, and the code in such a TU is most likely initialization
code that does not require extensive optimization, so the solution I
decided on was to inhibit ThinLTO for such modules. In my prototype, I
caused such modules to be compiled with regular LTO, but there are other
possible solutions, such as compiling to a native object.


That's an interesting idea. I hadn't thought about just emitting type
declarations in the ThinLTO'd code but yes, that's something we could
consider doing. It would be interesting to see what the tradeoff would be
in terms of the edit/compile/debug cycle time, as we'd be exchanging linker
work for debugger work.

Peter


Fair enough


Potentially, yes - it's certainly a tradeoff we already make to reduce
debug info size (the default (on non-LLDB platforms) of
-fno-standalone-debug causes type definitions to be emitted only along with
the vtable for types with vtables, only where a type is required to be
complete (so if you just use pointers to a type, etc you get a
declaration), and only along with a template explicit instantiation
definition if the type has an explicit instantiation declaration) - so I
think this would be consistent with that, but it would be more aggressive,
to be sure. I'm not sure what sort of GDB performance infrastructure we
have available to get better numbers on this.

- David