RFC [ThinLTO]: Promoting more aggressively in order to reduce incremental link time and allow sharing between linkage units

Hi all,

I’d like to propose changes to how we do promotion of global values in ThinLTO. The goal here is to make it possible to pre-compile parts of the translation unit to native code at compile time. For example, if we know that:

  1. A function is a leaf function, so it will never import any other functions, and
  2. The function’s instruction count falls above a threshold specified at compile time, so it will never be imported.
    or
  3. The compile-time threshold is zero, so there is no possibility of functions being imported (What’s the utility of this? Consider a program transformation that requires whole-program information, such as CFI. During development, the import threshold may be set to zero in order to minimize the incremental link time while still providing the same CFI enforcement that would be used in production builds of the application.)

then the function’s body will not be affected by link-time decisions, and we might as well produce its object code at compile time. This will also allow the object code to be shared between linkage units (this should hopefully help solve a major scalability problem for Chromium, as that project contains a large number of test binaries based on common libraries).
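The eligibility test sketched by conditions 1–3 above can be written as a small predicate. This is a hypothetical sketch: the summary fields (`is_leaf`, `instruction_count`) and the threshold argument are invented names for illustration, not real LLVM APIs.

```python
# Hypothetical sketch of the "safe to pre-compile" test described above.
# The summary fields ("is_leaf", "instruction_count") and the threshold
# argument are invented names for illustration, not real LLVM APIs.

def can_precompile(summary, import_threshold):
    """True if link-time import decisions cannot affect this function."""
    # Condition 3: importing disabled entirely for this build.
    if import_threshold == 0:
        return True
    # Conditions 1 and 2: a leaf (never imports anything) whose size
    # exceeds the threshold (so it will never be imported elsewhere).
    return summary["is_leaf"] and summary["instruction_count"] > import_threshold
```
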

This can be done with a change to the intermediate object file format. We can represent object files as native code containing statically compiled functions and global data in the .text, .data, .rodata (etc.) sections, with an .llvmbc section (or, I suppose, “__LLVM, __bitcode” when targeting Mach-O) containing bitcode for functions to be compiled at link time.
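As a rough illustration of how a tool on the linker side might tell the two halves apart: raw LLVM bitcode begins with the magic bytes 'BC' 0xC0 0xDE. Section extraction itself (ELF .llvmbc vs. Mach-O "__LLVM,__bitcode") and the bitcode wrapper-header case are deliberately elided here.

```python
# Rough sketch of distinguishing the embedded bitcode payload from native
# code: raw LLVM bitcode begins with the magic bytes 'BC' 0xC0 0xDE.
# Section extraction (ELF .llvmbc vs. Mach-O "__LLVM,__bitcode") and the
# bitcode wrapper-header case are deliberately elided here.

BITCODE_MAGIC = b"BC\xc0\xde"

def section_holds_bitcode(section_bytes: bytes) -> bool:
    return section_bytes.startswith(BITCODE_MAGIC)
```
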

In order to make this work, we need to make sure that references from link-time compiled functions to statically compiled functions work correctly in the case where the statically compiled function has internal linkage. We can do this by promoting every global value with internal linkage, using a hash of the external names (as I mentioned in [1]).
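The key property of such a naming scheme is determinism: the same module always yields the same promoted name, so separately compiled linkage units agree on it. The sketch below is illustrative only; the ".llvm.<hash>" suffix shape and the use of SHA-256 are assumptions for demonstration, not the exact scheme from [1].

```python
# Illustrative sketch only: the ".llvm.<hash>" suffix shape and the use of
# SHA-256 here are assumptions for demonstration, not the exact scheme
# from [1]. The point is determinism: the same module always yields the
# same promoted name, so separately compiled linkage units agree.

import hashlib

def promoted_name(local_name, external_names):
    digest = hashlib.sha256("\0".join(sorted(external_names)).encode()).hexdigest()
    return f"{local_name}.llvm.{digest[:16]}"
```
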

I imagine that for some linkers, it may not be possible to deal with this scheme. For example, I did some investigation last year and discovered that I could not use the gold plugin interface to load a native object file if we had already claimed it as an IR file. I wouldn’t be surprised to learn that ld64 has similar problems.

In cases where we completely control the linker (e.g. lld), we can easily support this scheme, as the linker can directly do whatever it wants. But for linkers that cannot support this, I suggest that we promote consistently under ThinLTO rather than having different promotion schemes for different linkers, in order to reduce overall complexity.

Thanks for your feedback!


Hi all,

I’d like to propose changes to how we do promotion of global values in ThinLTO. The goal here is to make it possible to pre-compile parts of the translation unit to native code at compile time. For example, if we know that:

  1. A function is a leaf function, so it will never import any other functions, and

It still may be imported somewhere else right?

  2. The function’s instruction count falls above a threshold specified at compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may import and inline itself.

or
3) The compile-time threshold is zero, so there is no possibility of functions being imported (What’s the utility of this? Consider a program transformation that requires whole-program information, such as CFI. During development, the import threshold may be set to zero in order to minimize the incremental link time while still providing the same CFI enforcement that would be used in production builds of the application.)

then the function’s body will not be affected by link-time decisions, and we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly the “non-LTO” case?

This will also allow the object code to be shared between linkage units (this should hopefully help solve a major scalability problem for Chromium, as that project contains a large number of test binaries based on common libraries).

This can be done with a change to the intermediate object file format. We can represent object files as native code containing statically compiled functions and global data in the .text, .data, .rodata (etc.) sections, with an .llvmbc section (or, I suppose, “__LLVM, __bitcode” when targeting Mach-O) containing bitcode for functions to be compiled at link time.

In order to make this work, we need to make sure that references from link-time compiled functions to statically compiled functions work correctly in the case where the statically compiled function has internal linkage. We can do this by promoting every global value with internal linkage, using a hash of the external names (as I mentioned in [1]).

I imagine that for some linkers, it may not be possible to deal with this scheme. For example, I did some investigation last year and discovered that I could not use the gold plugin interface to load a native object file if we had already claimed it as an IR file. I wouldn’t be surprised to learn that ld64 has similar problems.

I suspect ld64 would need to be updated to handle this scheme. Somehow it needs to be able to extract the object from the section.

Otherwise this should work, but I suspect the applicability (leaving CFI aside) may not concern that many functions, so I’m not sure about the impact.

Oh there was an “and”, so forget about my comments above, since this is acknowledged.

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and

It still may be imported somewhere else right?

2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may import and inline
itself.

or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time decisions, and
we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly the “non-LTO” case?

Yes, basically the point of this proposal is to be able to split the
linkage unit into LTO and non-LTO parts.

This will also allow the object code to be shared between linkage units
(this should hopefully help solve a major scalability problem for Chromium,
as that project contains a large number of test binaries based on common
libraries).

This can be done with a change to the intermediate object file format. We
can represent object files as native code containing statically compiled
functions and global data in the .text, .data, .rodata (etc.) sections,
with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when targeting
Mach-O) containing bitcode for functions to be compiled at link time.

In order to make this work, we need to make sure that references from
link-time compiled functions to statically compiled functions work
correctly in the case where the statically compiled function has internal
linkage. We can do this by promoting every global value with internal
linkage, using a hash of the external names (as I mentioned in [1]).

I imagine that for some linkers, it may not be possible to deal with this
scheme. For example, I did some investigation last year and discovered that
I could not use the gold plugin interface to load a native object file if
we had already claimed it as an IR file. I wouldn't be surprised to learn
that ld64 has similar problems.

I suspect ld64 would need to be updated to handle this scheme. Somehow it
needs to be able to extract the object from the section.

Do you mean the bitcode object? There's already support in LLVM for this (
http://llvm-cs.pcc.me.uk/lib/Object/IRObjectFile.cpp#269). I suspect the
tricky part (which I was unsuccessful at doing with gold) will be
convincing the linker that the bitcode object and the regular object are
two separate things.

Otherwise this should work, but I suspect the applicability (leaving CFI
aside) may not concern that many functions, so I’m not sure about the
impact.

I'm also curious about the applicability in the regular ThinLTO case. One
thing that may help with applicability is that I suspect that not only leaf
functions but also functions that only call (directly or indirectly)
functions in the same TU, as well as functions that only make calls via
function pointers or vtables may fall under the criteria for non-importing.
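The broadened criterion above can be phrased as a reachability check over the module's direct-call graph; the sketch below uses an invented graph encoding for illustration, and indirect calls (function pointers, vtables) simply never appear in the direct-call map, matching the observation that they never trigger importing.

```python
# Sketch of the broadened "non-importing" criterion: a function qualifies
# if every direct callee it transitively reaches is defined in the same
# TU. Indirect calls (function pointers, vtables) simply don't appear in
# the direct-call map, matching the observation that they never trigger
# importing. The graph encoding here is invented for illustration.

def non_importing(fn, direct_callees, defined_in_tu):
    """direct_callees: name -> set of directly called names;
    defined_in_tu: set of names defined in this translation unit."""
    seen, stack = set(), [fn]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for callee in direct_callees.get(cur, ()):
            if callee not in defined_in_tu:
                return False  # a callee that could be imported from elsewhere
            stack.append(callee)
    return True
```
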

Peter

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and
2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.
or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

Do you know of any use case that is not as an aid for developers? I.e.
would this be a user-visible feature?

-- Sean Silva

It would indeed be a user-visible feature. The developers in this case are
the developers of the program that uses the CFI feature, not LLVM
developers.

Peter

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and
2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.
or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

Do you know of any use case that is not as an aid for developers? I.e.
would this be a user-visible feature?

It would indeed be a user-visible feature. The developers in this case are
the developers of the program that uses the CFI feature, not LLVM
developers.

Ah, sorry. I misinterpreted what you wrote.

-- Sean Silva

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and

It still may be imported somewhere else right?

2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may import and inline
itself.

or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time decisions, and
we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly the “non-LTO” case?

Yes, basically the point of this proposal is to be able to split the
linkage unit into LTO and non-LTO parts.

This will also allow the object code to be shared between linkage units
(this should hopefully help solve a major scalability problem for Chromium,
as that project contains a large number of test binaries based on common
libraries).

This can be done with a change to the intermediate object file format. We
can represent object files as native code containing statically compiled
functions and global data in the .text, .data, .rodata (etc.) sections,
with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when targeting
Mach-O) containing bitcode for functions to be compiled at link time.

In order to make this work, we need to make sure that references from
link-time compiled functions to statically compiled functions work
correctly in the case where the statically compiled function has internal
linkage. We can do this by promoting every global value with internal
linkage, using a hash of the external names (as I mentioned in [1]).

Mehdi - I know you were keen to reduce the amount of promotion. Is that
still an issue for you assuming linker GC (dead stripping)? With this
proposal we will need to stick with the current promote everything scheme.

I imagine that for some linkers, it may not be possible to deal with this
scheme. For example, I did some investigation last year and discovered that
I could not use the gold plugin interface to load a native object file if
we had already claimed it as an IR file. I wouldn't be surprised to learn
that ld64 has similar problems.

I suspect ld64 would need to be updated to handle this scheme. Somehow it
needs to be able to extract the object from the section.

Do you mean the bitcode object? There's already support in LLVM for this (
http://llvm-cs.pcc.me.uk/lib/Object/IRObjectFile.cpp#269). I suspect the
tricky part (which I was unsuccessful at doing with gold) will be
convincing the linker that the bitcode object and the regular object are
two separate things.

Otherwise this should work, but I suspect the applicability (leaving CFI
aside) may not concern that many functions, so I’m not sure about the
impact.

I'm also curious about the applicability in the regular ThinLTO case. One
thing that may help with applicability is that I suspect that not only leaf
functions but also functions that only call (directly or indirectly)
functions in the same TU, as well as functions that only make calls via
function pointers or vtables may fall under the criteria for non-importing.

With indirect call profiling the function pointer case should not be a
blocker for importing.

Teresa

Mehdi - I know you were keen to reduce the amount of promotion. Is that still an issue for you assuming linker GC (dead stripping)?

Yes: we do better optimization on internal functions in general. Our benchmarks showed that it can really make a difference, and many cases where ThinLTO didn’t perform as well as full LTO were because of this promotion.
(binary size has never been my concern here)

With this proposal we will need to stick with the current promote everything scheme.

I don’t think so: you would need to do it only for “internal functions that are leaves and aren’t likely to be imported/inlined”.
That said, any function whose object code we emit at compile time instead of link time will inhibit optimizations for LTO/ThinLTO. The gain in compile time has to be really significant to make it worth it.
(Of course the CFI use-case is a totally different tradeoff).

Peter: have you thought about debug info by the way?

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and

It still may be imported somewhere else right?

2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may import and inline
itself.

or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time decisions,
and we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly the “non-LTO” case?

Yes, basically the point of this proposal is to be able to split the
linkage unit into LTO and non-LTO parts.

This will also allow the object code to be shared between linkage units
(this should hopefully help solve a major scalability problem for Chromium,
as that project contains a large number of test binaries based on common
libraries).

This can be done with a change to the intermediate object file format.
We can represent object files as native code containing statically compiled
functions and global data in the .text, .data, .rodata (etc.) sections,
with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when targeting
Mach-O) containing bitcode for functions to be compiled at link time.

In order to make this work, we need to make sure that references from
link-time compiled functions to statically compiled functions work
correctly in the case where the statically compiled function has internal
linkage. We can do this by promoting every global value with internal
linkage, using a hash of the external names (as I mentioned in [1]).

Mehdi - I know you were keen to reduce the amount of promotion. Is that
still an issue for you assuming linker GC (dead stripping)?

Yes: we do better optimization on internal functions in general. Our
benchmarks showed that it can really make a difference, and many cases
where ThinLTO didn’t perform as well as full LTO were because of this
promotion.
(binary size has never been my concern here)

With this proposal we will need to stick with the current promote
everything scheme.

I don’t think so: you would need to do it only for “internal functions that
are leaves and aren’t likely to be imported/inlined”.

I suppose you could do promotion in two different places - during the
compile step for those functions that will be emitted to text, and in the
back ends for the rest if they have references imported elsewhere.

That said, any function whose object code we emit at compile time instead of
link time will inhibit optimizations for LTO/ThinLTO.

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and
2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.
or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time decisions, and
we might as well produce its object code at compile time. This will also
allow the object code to be shared between linkage units (this should
hopefully help solve a major scalability problem for Chromium, as that
project contains a large number of test binaries based on common libraries).

3) sounds like a good use case (though quite specific to CFI; generally, a
function body will be affected by link-time decisions, including
whole-program analyses). In this mode, do we really need to emit the bitcode
section at all? There is no need for static promotions if we assume
no cross-module transformations will happen in this mode.

David

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and

It still may be imported somewhere else right?

2) The function's instruction count falls above a threshold specified at
compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may import and inline
itself.

or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time decisions,
and we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly the “non-LTO” case?

Yes, basically the point of this proposal is to be able to split the
linkage unit into LTO and non-LTO parts.

This will also allow the object code to be shared between linkage units
(this should hopefully help solve a major scalability problem for Chromium,
as that project contains a large number of test binaries based on common
libraries).

This can be done with a change to the intermediate object file format.
We can represent object files as native code containing statically compiled
functions and global data in the .text, .data, .rodata (etc.) sections,
with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when targeting
Mach-O) containing bitcode for functions to be compiled at link time.

In order to make this work, we need to make sure that references from
link-time compiled functions to statically compiled functions work
correctly in the case where the statically compiled function has internal
linkage. We can do this by promoting every global value with internal
linkage, using a hash of the external names (as I mentioned in [1]).

Mehdi - I know you were keen to reduce the amount of promotion. Is that
still an issue for you assuming linker GC (dead stripping)?

Yes: we do better optimization on internal functions in general.

The inliner is one of the affected optimizations -- however this sounds like
a matter of tuning to teach the inliner about promoted static functions.

David

The inliner computes a tradeoff between pseudo runtime cost and binary size; the existing bonus for static functions applies when there is a single call site, because it makes the binary increase nonexistent (the static is dropped after inlining). We promote a function because we think we are likely to introduce a reference to it somewhere else, so “lying” to the inliner is not necessarily a good idea.

That said, we (actually Bruno) prototyped it already with somewhat good results :)
I’m not convinced yet that it should be independent of whether the function is promoted, though.

Assuming we solve the inliner issue, there remain the “optimizations other than the inliner”. We can probably solve most of them, but I suspect it won’t be “trivial” either.

I imagine that for some linkers, it may not be possible to deal with this
scheme. For example, I did some investigation last year and discovered that
I could not use the gold plugin interface to load a native object file if we
had already claimed it as an IR file. I wouldn't be surprised to learn that
ld64 has similar problems.

Can't you just call add_input_file with the original file in addition
to any .o files you created during linking?

Cheers,
Rafael

We still need some way to generate the correct code for the CFI intrinsics,
which rely on cross module information. That information would come
directly from the combined summary, rather than being imported from other
modules.

Peter

I think that almost worked, but I kept running into issues relating to the
fact that the plugin needed to inform the linker about symbols in the
native object file via add_symbols. What I ended up with was kind of
unsound and still didn't work correctly for (at least) some parts of
Chromium.

I suppose I can try to pull up that work again and go into more detail on
what exactly the issues were, if you're interested.

Can't you just call add_input_file with the original file in addition
to any .o files you created during linking?

I think that almost worked, but I kept running into issues relating to the
fact that the plugin needed to inform the linker about symbols in the native
object file via add_symbols. What I ended up with was kind of unsound and
still didn't work correctly for (at least) some parts of Chromium.

I suppose I can try to pull up that work again and go into more detail on
what exactly the issues were, if you're interested.

Not directly interested myself, but I think it is something that is
supposed to work, so patches to fix it (llvm or gold) would probably
be welcome.

If I remember correctly plugins are tested in gold with a test plugin
that reads the output of readelf in claim_file_hook and passes the
original files to gold in all_symbols_read.

Cheers,
Rafael

Hi all,

I'd like to propose changes to how we do promotion of global values in
ThinLTO. The goal here is to make it possible to pre-compile parts of the
translation unit to native code at compile time. For example, if we know
that:

1) A function is a leaf function, so it will never import any other
functions, and

It still may be imported somewhere else right?

2) The function's instruction count falls above a threshold specified
at compile time, so it will never be imported.

It won’t be imported, but unless it is a “leaf” it may import and
inline itself.

or
3) The compile-time threshold is zero, so there is no possibility of
functions being imported (What's the utility of this? Consider a program
transformation that requires whole-program information, such as CFI. During
development, the import threshold may be set to zero in order to minimize
the incremental link time while still providing the same CFI enforcement
that would be used in production builds of the application.)

then the function's body will not be affected by link-time decisions,
and we might as well produce its object code at compile time.

Reading this last sentence, it seems exactly the “non-LTO” case?

Yes, basically the point of this proposal is to be able to split the
linkage unit into LTO and non-LTO parts.

This will also allow the object code to be shared between linkage units
(this should hopefully help solve a major scalability problem for Chromium,
as that project contains a large number of test binaries based on common
libraries).

This can be done with a change to the intermediate object file format.
We can represent object files as native code containing statically compiled
functions and global data in the .text, .data, .rodata (etc.) sections,
with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when targeting
Mach-O) containing bitcode for functions to be compiled at link time.

In order to make this work, we need to make sure that references from
link-time compiled functions to statically compiled functions work
correctly in the case where the statically compiled function has internal
linkage. We can do this by promoting every global value with internal
linkage, using a hash of the external names (as I mentioned in [1]).

Mehdi - I know you were keen to reduce the amount of promotion. Is that
still an issue for you assuming linker GC (dead stripping)?

Yes: we do better optimization on internal functions in general.

The inliner is one of the affected optimizations -- however this sounds like
a matter of tuning to teach the inliner about promoted static functions.

The inliner computes a tradeoff between pseudo runtime cost and binary
size; the existing bonus for static functions applies when there is a single
call site, because it makes the binary increase nonexistent (the static is
dropped after inlining). We promote a function because we think we are likely
to introduce a reference to it somewhere else, so “lying” to the inliner is
not necessarily a good idea.

It is not lying to the inliner. If a static (before promotion) function is
a candidate to be inlined in the original defining module, it is probably
more likely to be inlined in other importing modules where more context is
available. In other words, the inliner can apply the same bonus to
'promoted' static functions as if references in other modules will also
disappear. Of course, we cannot assume it has a single call site.

Comdat functions can be handled similarly.

That said, we (actually Bruno) prototyped it already with somewhat good
results :)
I’m not convinced yet that it should be independent of whether the function
is promoted, though.

Generally true (see the comdat case).

Assuming we solve the inliner issue, there remain the “optimizations other
than the inliner”. We can probably solve most of them, but I suspect it
won’t be “trivial” either.

Any such optimizations in mind?

David

I don’t have the details, but in short:

For promoted functions: IPSCCP, dead arg elimination
For promoted global variables: anything that is impacted somehow by aliasing

Yes, I suspect we'll have to duplicate the debug info between the non-LTO
and the LTO part like I was doing with parallel LTO codegen. In practice,
that means we'll end up with two DWARF compile units per TU. That's
probably better than it being a factor of the number of threads, and since
only one of the two compile units will be codegen'd at any one time, we
hopefully shouldn't see the sort of memory consumption we were seeing with
parallel LTO codegen.

Peter