[RFC] (Thin)LTO with Linker Scripts

At the last US LLVM Developers' Meeting, we presented [1] a proposal for linker
script support in (Thin)LTO. In this RFC, I would like to describe the
proposal in more detail and invite the community's feedback, so we can build
consensus on the upstream implementation.

The end goal of this effort is to extend the benefits of (Thin)LTO, including
significant code size and performance improvements, to the many embedded and
system-level software projects that rely on linker scripts to control (ELF)
image layout.

In particular, this proposal seeks to:

1. Ensure that ELF sections emitted by LTO match the same path-based linker
script rules that they would have matched if the project had been compiled
without LTO.

2. Make module optimization passes aware of the final output sections of
symbols in order to limit inter-section (e.g. inlining) or enable
intra-section (e.g. constant merging) optimizations where needed.

3. Implement these features without changing the behavior of the compiler when
linker script information is *not* available, particularly on source files
that contain symbols carrying explicit section attributes.

This proposal only addresses changes to Clang/LLVM. The linker also needs to be
enhanced to support LTO with linker scripts; so far, this has only been done
for qcld, the linker shipped with the Hexagon SDK.

The proposed implementation involves small changes throughout the compilation
flow, so the rest of this document follows the progression from source file to
linking. Individual changes, which could map to patches, are marked using
"(X.Y)" to help with referencing.

Step 1: Compilation of individual files

Hello Tobias,

Thanks very much for the RFC, I think that this will be useful in
persuading embedded developers to use LTO in their projects. I think
the overall approach for communication between the linker and code
generator sounds reasonable. I've got some questions/comments based on
some experience with Arm's proprietary linker, which supports LTO but
has a different linker script mechanism than GNU ld compatible
linkers. I'm not hugely familiar with the details of LTO at the moment
so apologies in advance for any misunderstandings on my part.

My understanding from the RFC is:
- All global objects in the bitcode file will be assigned a section name.
- A linker will communicate the output section of all global objects.
- Certain transformations won't be performed if the output section is different.

The common use cases that I can see that might not fit perfectly into
that model:
- Code that is in different OutputSections but it will be logically
correct and in many cases desirable to perform transformations on as
if they were in the same output section.
- Output section placement rules that are not based on names, for
example Arm's linker can assign sections to an output section until
the output section size limit is reached, then a different output
section is used. I admit that this may be more of a problem for
linkers that have a different linker script model.

I think both cases are illustrative of a use case where the precise
output section does not matter, but there is a vaguer goal of placing
a subset of the input sections in a subset of the output sections.

From what I can tell there isn't a way for the code generator to tell
the difference between code that is placed in different output
sections and it is not correct or beneficial to optimize and code that
is placed in different output sections and it is correct and
beneficial to optimize together.

I think that this kind of use case could be supported by doing something like:
- Linker informs code generator the output sections that must not use
any information from another module and may not contribute any
information to another module. For example an output section that is
representing an overlay.
- Linker can omit the output section information for sections that the
user doesn't care where they go, and let the linker decide based on
some size constraint later.
I think the latter case could be made to work by assigning a "don't
care" module id, assuming -ffunction-sections. The former case is a
bit more difficult as we would still want some way to distinguish the
module id for placement.

I think that these are mostly details rather than fundamental problems though.

Peter

Hi Peter,

My understanding from the RFC is:

  • All global objects in the bitcode file will be assigned a section name.

… which is equal to the section name that they would have been emitted to if this was a regular compilation. In addition to allowing the linker to read section names from the bitcode, this also helps support mixing -ffunction-sections and -fno-function-sections and similar options (forgot to mention that in the RFC).

  • A linker will communicate the output section of all global objects.

Correct. (Global objects in the LLVM sense, so that includes objects with local linkage).

  • Certain transformations won’t be performed if the output section is different.

Correct. Plus, others can be enabled if they’re safe to apply when we know things are going to the same output section.

The common use cases that I can see that might not fit perfectly into
that model:

  • Code that is in different OutputSections but it will be logically
    correct and in many cases desirable to perform transformations on as
    if they were in the same output section.

Right. The output section that the linker communicates for a symbol doesn’t need to correspond to a “physical” output section. If the linker knows (or the user somehow tells it) that two output sections should be considered equivalent, it can communicate the same output section identifier for symbols in either of the two physical output sections. This is perfectly safe since the output section info is only ever used to enable/inhibit optimizations, not for actual symbol emission by LTO.
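
To illustrate the idea of reporting a shared identifier for equivalent output sections, here is a minimal sketch in Python. It is not part of the proposal or of any linker: the function names, the section names, and the use of small integers as identifiers are all assumptions made for illustration.

```python
def assign_domains(output_sections, equivalent_groups):
    """Map physical output section names to optimization-domain IDs.

    Hypothetical helper: IDs are plain integers chosen by the linker.
    """
    domain_of = {}
    next_id = 0
    # Sections the linker (or user) declared equivalent share one ID.
    for group in equivalent_groups:
        for name in group:
            domain_of[name] = next_id
        next_id += 1
    # Every remaining section forms its own domain.
    for name in output_sections:
        if name not in domain_of:
            domain_of[name] = next_id
            next_id += 1
    return domain_of


def same_domain(sym_a, sym_b, section_of, domain_of):
    """True if both symbols were reported with the same domain ID."""
    return domain_of[section_of[sym_a]] == domain_of[section_of[sym_b]]
```

With this shape, an optimization such as constant merging would only consult `same_domain`, never the physical section names, which is why reporting one ID for two physical sections does not affect where LTO emits the symbols.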

  • Output section placement rules that are not based on names, for
    example Arm’s linker can assign sections to an output section until
    the output section size limit is reached, then a different output
    section is used. I admit that this may be more of a problem for
    linkers that have a different linker script model.

That should actually just work in the existing model. Before LTO runs, we don’t know the size of symbols anyway, so the linker will just communicate the original output section for all of them and we apply optimizations across them as if they all fitted in the same section. After LTO, some may end up in the ‘overflow’ section but LTO doesn’t need to know about that since it wouldn’t have been correct for the user to make any assumptions about what ends up in the original section vs overflow in the first place.

I think both cases are illustrative of a use case where the precise
output section does not matter, but there is a vaguer goal of placing
a subset of the input sections in a subset of the output sections.

From what I can tell there isn’t a way for the code generator to tell
the difference between code that is placed in different output
sections and it is not correct or beneficial to optimize and code that
is placed in different output sections and it is correct and
beneficial to optimize together.

Perhaps we should rename the “output section” that is communicated to LTO to something less specific to make it clear that it can be used for exactly this purpose. Optimization domain? Partition?

I think that this kind of use case could be supported by doing something like:

  • Linker informs code generator the output sections that must not use
    any information from another module and may not contribute any
    information to another module. For example an output section that is
    representing an overlay.

It’s not so much about other modules (files) - you could have multiple files contributing input sections to the same overlay, for instance, and you would want to optimize across them. But you wouldn’t want to de-duplicate a constant from another overlay. I think the OutputSectionID-as-optimization-domain idea captures this use case, no?

  • Linker can omit the output section information for sections that the
    user doesn’t care where they go, and let the linker decide based on
    some size constraint later.

That’s an interesting idea to allow a ‘don’t care’ output section ID; we would have to be pretty careful in defining what that means on a per-optimization basis. That is, am I allowed to inline a function with a defined output section into a function without one (probably no)? Vice versa (probably yes)?
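
As a rough sketch of what such a per-optimization rule could look like for inlining, the Python snippet below encodes the two tentative answers given above ("probably no" and "probably yes"). The `DONT_CARE` marker, the function name, and the integer IDs are hypothetical; the actual semantics would still need to be agreed on for each optimization.

```python
from typing import Optional

# Hypothetical marker for "the linker gave no output section for this symbol".
DONT_CARE = None


def may_inline(callee_domain: Optional[int], caller_domain: Optional[int]) -> bool:
    """Decide whether the callee's body may be copied into the caller."""
    if callee_domain == caller_domain:
        # Same domain (or both unplaced): no placement constraint is crossed.
        return True
    if callee_domain is DONT_CARE:
        # Unplaced callee into a placed caller: "probably yes".
        return True
    # Placed callee into an unplaced caller ("probably no"),
    # or two different placed domains.
    return False
```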

I think that these are mostly details rather than fundamental problems though.

Thank you very much for your comments!

Tobias

Hello Tobias,

Thanks very much for the response.

I think both cases are illustrative of a use case where the precise
output section does not matter, but there is a vaguer goal of placing
a subset of the input sections in a subset of the output sections.
From what I can tell there isn't a way for the code generator to tell
the difference between code that is placed in different output
sections and it is not correct or beneficial to optimize and code that
is placed in different output sections and it is correct and
beneficial to optimize together.

Perhaps we should rename the "output section" that is communicated to LTO to
something less specific to make it clear that it can be used for exactly
this purpose. Optimization domain? Partition?

Yes, I think it would help; either of those names makes sense to me.

I think that this kind of use case could be supported by doing something
like:
- Linker informs code generator the output sections that must not use
any information from another module and may not contribute any
information to another module. For example an output section that is
representing an overlay.

It's not so much about other modules (files) - you could have multiple files
contributing input sections to the same overlay, for instance, and you would
want to optimize across them. But you wouldn't want to de-duplicate a
constant from another overlay. I think the
OutputSectionID-as-optimization-domain idea captures this use case, no?

Yes, I think each overlay could be captured by its own domain. What I
think would be useful is to allow some kind of suffix to the
optimization domain that a linker could use to recover preferences
between two Output Sections, for example:
OS1 : { my_objects*(*.text) }
OS2 : { other_objects*(*.text) }

I want OS1 and OS2 to be in the same optimization domain, but I do
want to recover the preference for OS1 and OS2 at link time. If only
the optimization domain is added to the section name (the ^^) that
information will be lost. If there is some kind of way of recovering
the original OS then that would be great. For example, I could pass
the code generator "OS.1" and "OS.2", where OS is the optimization
domain and anything after the . is preserved in the section name so
that the linker can map it back to the original output section.
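
A minimal sketch of the suffix idea in Python, assuming the tag is split at the first "." and that domain names themselves contain no dot. Both assumptions, and the function names, are illustrative only and not part of the RFC.

```python
def split_domain_tag(tag):
    """Split a tag like "OS.1" into (domain, suffix).

    The suffix is None when the tag carries no "." separator.
    """
    domain, sep, suffix = tag.partition(".")
    return domain, (suffix if sep else None)


def tags_share_domain(tag_a, tag_b):
    """The optimizer compares only the domain part of each tag."""
    return split_domain_tag(tag_a)[0] == split_domain_tag(tag_b)[0]
```

Under this scheme the code generator would treat "OS.1" and "OS.2" as one optimization domain, while the linker could read the suffix back from the section name to place each section in its original output section.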

I think that this is more important to linkers that don't give as
much control over section ordering within an OutputSection, so it is
fairly common to use multiple consecutive OutputSections to group
InputSections together. Although I could be missing something about
how I could do this with the original proposal?

- Linker can omit the output section information for sections that the
user doesn't care where they go, and let the linker decide based on
some size constraint later.

That's an interesting idea to allow a 'don't care' output section ID; we
would have to be pretty careful in defining what that means on a
per-optimization basis. That is, am I allowed to inline a function with a
defined output section into a function without one (probably no)? Vice versa
(probably yes)?

Yes, in Arm's linker a section in an overlay was allowed to rely on
code/data in "normal" output sections being present, so it could share
things like range-extension thunks, but obviously not vice versa. I
couldn't guarantee that the advantages would outweigh the extra
implementation complexity though.

Peter