Richer symbol dependency information for LTO

Richer symbol dependency information for LTO

Description
C and C++ programs are often composed of various object files produced from separately-compiled source files that are then linked together. When compiling one source file, knowledge that can be derived from the logic contained within the other source files would normally not be available. Link-time optimization, also known as LTO, is a way for optimization to be done using information from more than one source file.

In LLVM, LTO is achieved by using LLVM bitcode objects as the output from the “compile” step and feeding those objects into the link step. LLVM’s LTO operates in conjunction with the linker. The linker is invoked by the user and the linker in turn drives LLVM’s LTO when it encounters LLVM bitcode files, getting information from LTO about what symbols a bitcode object defines or references. Information about what symbols are defined in or referenced from an object is necessary for the linker to perform symbol resolution, and a linker is normally able to extract such information from regular (non-bitcode) object files.

The implied consequences of LLVM’s LTO implementation with respect to linker GC (linker garbage collection) can be improved, especially for aggressive forms of linker GC with lazy inclusion of objects and sections. In particular, the symbols referenced but undefined by an LTO module are, to the linker, monolithic at the module level. At the same time, the symbols referenced but undefined by regular (non-LTO) objects are monolithic to LTO. Together, this means that the inclusion of an LTO module into the overall process potentially leads, in the linker’s initial symbol resolution, to all the undefined symbols in that module being considered as referenced; in turn, additional artifacts (e.g., archive members) may be added into the resolution, which further leads to references that may resolve to symbols defined in LTO modules and a premature conclusion that the definition of these symbols are needed. This at least means potentially unnecessary codegen is being done for functions that will be garbage-collected in the end (waste of electricity and time).

We acknowledge that an ideal implementation probably involves a “coroutine” like interaction between the linker and LTO codegen where information flows back and forth; however, such an endeavour is invasive to both linkers and to LLVM.

We believe that by

  • having the linker register, via an API to LTO, symbol reference “nodes” modelling the relationship between a symbol and the symbols that are referenced in turn from (the object file section containing) its linker-selected definition, and
  • using that information in LTO processing,

the LTO processing will be able to effectively identify a more accurate set of LTO symbols that are visible outside of the LTO unit. The linker merely needs to identify only exported symbols and entry points (such as the entry point for an executable and functions involved in the initialization and finalization).
Having the LLVM opt/codegen understand the dependency implications from the “outside world” is strictly better than the other direction: the symbols referred to by relocations in non-LTO code are pretty much fixed as compiled (whereas some references in LTO code may disappear with optimization).

Preparation resources
TBD

Expected results

  1. Modification of the C++ LTO interface used by LLD to implement an interface to record the symbol reference dependency data (incorporating awareness of sections and comdats). This may additionally include a method to add LTO objects provisionally, simulating behaviours where linkers only add objects as needed.
  2. Modification of LTO to use new symbol reference information for definitions in regular objects when visiting definitions in the IR prior to the internalization pass to discover (transitive) symbol references and record the so-referenced symbols as being visible to regular objects. This may additionally include the “late” incorporation of LTO objects added provisionally into the merged LTO module.
  3. Modification of LLD (for ELF) to modify initial resolution to use the new interface as a replacement for setting VisibleToRegularObj except for entry point functions (including C++ dynamic initialization and finalization).

Confirmed mentors
Sean Fertile, Hubert Tong, Wael Yehia

Desirable skills
Intermediate C++; basic linker concepts (e.g., symbols, sections, and relocations)

Project size
350 hours

Difficulty
Medium/Hard

@teresajohnson @Peter_Collingbourne2

cc: @Wael_Yehia1 @Sean_Fertile

Hi Teresa and Peter,

In our usage of “full” LTO, we have noticed what appear to be limitations in terms of the granularity of information accessible and communicable by the linker for symbol resolution purposes. We have been using the C API; however, it does not appear that LTO, in general, captures information about where references are made from (for example, relocations associated with a section). We have a thought about a potential Google Summer of Code project and your feedback would be appreciated. If you have other thoughts about approaching the problem, those would also be helpful.

Please see below for our draft.

Thanks,

Hubert Tong

@MaskRay for comments on the linker side.

Thanks for sending this!

Can you give a small example of the current issue? I’m not sure I’m completely following the problem and proposed solution. E.g. the first part of the description seems to be talking about references to and from modules included in the LTO process, but the latter part is talking about objects visible to outside the LTO unit.

Also, does this affect ThinLTO and would any proposed solution address the issue there too?

I don’t think we’d want the linker to synthesize bitcode, but perhaps extending the current symbol resolution info provided by the linker to LTO could help?

One last note that any fix done via LLD and the resolution based LTO interface will not affect the C API as that uses the legacy non-linker resolution based LTO implementation (not used by lld).

Hi Teresa,

Thank you for your quick response.

I’ve used ThinLTO in my example below, so I think it affects ThinLTO as well. I believe the general strategy also covers ThinLTO.

The following demonstrates what I see to be the issue.

When running the following, zip is not going to be in the final linker output; however, it is clear that zip was compiled.

Script:

#!/bin/bash

set -o pipefail
set -e

rm -f shr.o1.lto.o shr.o2.lto.o

clang -fvisibility=hidden -ffunction-sections -c -o bar.o -xc -<<'EOF'
void zip(void);
void bar(void) { zip(); }
EOF

clang -fvisibility=hidden -flto=thin -ffunction-sections \
    -c -o main_foo.o -xc -<<'EOF'
__attribute__((__visibility__("default") ))
int main(void) { }

extern void bar(void);
void foo(void) { bar(); }
EOF

clang -fvisibility=hidden -flto=thin -ffunction-sections \
    -c -o zip.o -xc -<<'EOF'
extern volatile int dangling;
void zip(void) { ++dangling; }
EOF

clang -flto=thin -ffunction-sections -fuse-ld=lld \
    -Wl,--gc-sections -Wl,-plugin-opt=save-temps \
    -nostdlib --shared -o shr.o main_foo.o bar.o zip.o

llvm-objdump -t shr.o1.lto.o shr.o2.lto.o shr.o

The output:


shr.o1.lto.o:   file format elf64-powerpcle

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 -
0000000000000000 l    d  .text.main     0000000000000000 .text.main
0000000000000000 g     F .text.main     0000000000000014 main

shr.o2.lto.o:   file format elf64-powerpcle

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 -
0000000000000000 l    d  .text.zip      0000000000000000 .text.zip
0000000000000000 l    d  .toc   0000000000000000 .toc
0000000000000000 g     F .text.zip      0000000000000034 0x62 zip
0000000000000000         *UND*  0000000000000000 .TOC.
0000000000000000         *UND*  0000000000000000 dangling

shr.o:  file format elf64-powerpcle

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 -
0000000000000000 l    df *ABS*  0000000000000000 -
0000000000000000 l    df *ABS*  0000000000000000 -
0000000000028358 l       .got   0000000000000000 .hidden .TOC.
00000000000202f8 l       .dynamic       0000000000000000 .hidden _DYNAMIC
00000000000102e0 g     F .text  0000000000000014 main

Thanks for that. I will try to update the description to focus on using calls to LTO instead of synthesizing bitcode. The proposal is, in a way, extending the resolution info provided by the linker. Instead of having the linker resolve symbols based on limited information from the LTO modules—which results in the LTO process (as a system) prematurely concluding that some symbols referred to by regular objects are actually needed—it has LTO do a form of symbol resolution.

There are some complications in that the linker should either accurately provide the resulting-reference information about the definition it would choose if LTO does generate a reference to a native symbol or provide a conservatively correct merging of the definitions. If the LTO definition is known to be prevailing, then the linker should omit the information associated with the native definitions of the same symbol. The linker should also be able to inform LTO that the

Again, thanks for pointing that out. I notice that your description switched between “interface” and “implementation”. Is the underlying implementation sufficiently disjoint that implementing a solution for the resolution-based LTO interface will not necessarily help towards being able to implement a solution using the C API? If the implementation is sufficiently shared, then surfacing the functionality in the C API and developing unit tests for that would probably make sense to include with the project.

With appreciation,


	Hubert Tong

@MaskRay question for you below on lld
@cachemeifyoucan question below for you on legacy LTO and C API

@reinterpretcast
Ok I see, thanks for the example. So in this case, bar.c is not compiled with LTO, and therefore the references to zip() in zip.c are opaque to LTO and the linker just tells LTO that it is visible to a regular object, which forces LTO to keep it (Thin or regular).

More specifically, within ThinLTO’s thin link we do not have a summary for anything defined in bar.c, so in the call/reference graph we have an edge from foo()'s summary to the GUID for bar(), but no summary for bar(), since it is defined in an ELF object. Then the linker has to conservatively tell LTO that zip() is VisibleToRegularObj, which forces it to be kept. In regular LTO it is similar but with the definition of foo() referencing an external decl for bar(), for which we have no more info.

So I guess the proposal here is to provide more fine grained info from the linker to LTO for those symbols visible to a regular obj than just a big hammer flag VisibleToRegularObj. E.g. in ThinLTO if the linker provided the transitive set of regular object symbols referencing a bitcode definition, it could presumably synthesize a summary entry for those regular object symbols with summary call edges to the bitcode symbol. Then ThinLTO’s global dead code elimination would work properly. For example, if the linker told LTO that zip() was referenced by a regular object symbol foo(), and ThinLTO then created a summary for foo that had a call edge to zip, when we discovered that bar was dead code we would also discover that foo and in turn zip were also unreachable from any preserved/exported symbols, allowing zip’s removal. For regular LTO some other mechanism would need to be created to capture this info.

However, 2 main questions:

  1. How often does this situation arise and how important is it to fix? In our use of ThinLTO we either have all files built with ThinLTO (in which case we can remove all symbols from both bar.o and zip.o), or the leaf objects/libraries are ELF. In the latter case, e.g. if bar.c was built with ThinLTO and instead zip.c was built without, we would be able to remove all of the symbols in bar.o during (Thin)LTO. Is it common to have ELF objects referencing bitcode object symbols?

  2. Does lld even have the necessary info? I.e. if I nm bar.o I just see:


0000000000000000 T bar
                 U zip

Does the linker have more fine grained info to know that the reference to zip comes from symbol bar? @MaskRay can probably help answer this if needed. My expertise is more on the LTO side than the linker side.

Again, thanks for pointing that out. I notice that your description switched between “interface” and “implementation”. Is the underlying implementation sufficiently disjoint that implementing a solution for the resolution-based LTO interface will not necessarily help towards being able to implement a solution using the C API?

Much of the underlying low level LTO optimization support is shared. However, the C API currently uses a legacy LTO interface and layer that does not consume linker resolutions. There has been periodic discussion of reimplementing the legacy interfaces more closely on top of the resolution LTO API, but it hasn’t happened yet. @cachemeifyoucan is the main legacy LTO developer and can probably answer this question better than I can.

2 Likes

@teresajohnson

Thank you for all the pointers.

I think we’re on about the same page insofar as how the solution can apply to ThinLTO.

I think you meant “referenced by regular object symbol foo() bar()”. Yes, I think that would be the case.

For the general case of a generic object file format, it would not know that the reference comes from the symbol bar (but it would know that the reference comes from the section that bar resides in). Since --gc-sections operates at the section level and static offsets may be used by code within the section or assumed by external code for some reason, that’s as good as it can be. If the user built with -ffunction-sections, it is pretty much equivalent to knowing that it came from bar.

This is the output from llvm-objdump -r bar.o:

bar.o:  file format elf64-powerpcle

RELOCATION RECORDS FOR [.text.bar]:
OFFSET           TYPE                     VALUE
0000000000000000 R_PPC64_REL16_HA         .TOC.
0000000000000004 R_PPC64_REL16_LO         .TOC.+0x4
000000000000001c R_PPC64_REL24            zip

RELOCATION RECORDS FOR [.eh_frame]:
OFFSET           TYPE                     VALUE
000000000000001c R_PPC64_REL32            .text.bar

I can’t speak to how often the situation arises except to anecdotally describe the situation. What I do know is that we now have an LLVM-based compiler solution that is being used in place of a proprietary compiler. With the proprietary compiler, the LTO processing has direct access to the native object files (so it does not have the issue being described).

My understanding (and I am not quite directly involved) is that the sort of issue described here occurs because of builds that incorporate static libraries built by different groups (with different compilers or build options) where the relationships between libraries are… complicated (co-dependent or partially redundant).

There is an additional aspect around archive members that may be more specific to AIX than not: On AIX, the linker garbage collection is fairly aggressive, and it can deduce (for native object files) that the reference to bar() from foo() is dead (and thus omit bar() and, in turn zip()). If the file that zip() is defined in is an archive member, then the file would be omitted and, if that file contained initialization functions, those functions (given certain linker options) would not be run. Because the linker does not know that the reference to bar() comes from foo() (which is unneeded), it currently needs to include bar() and to tell LTO to include the file containing zip(). Once that happens, the linker has no real way to recover from the possibility of initialization functions in the file containing zip() being present unnecessarily. The ability for the linker to let LTO know about an LTO module without committing to indicating that said module is needed would, in combination with providing (as proposed) LTO with the regular object section/symbol dependency graph, allow LTO to incorporate the module only as needed.

Let me rephrase the proposal to check my understanding.

Current linker-LTO protocol is symbol-based. In all ports of lld, LTO happens at the end of symbol processing.

  • The linker knows references within native (COFF, ELF, Mach-O, …) relocatable object files. But the information is not passed to LLVMLTO.
  • With a (ThinLTO) module summary, LLVMLTO knows references within LLVM bitcode files. But the information is not known by the linker.

LLVMLTO does dead stripping. Currently references from native relocatable object files are approximated by the coarse-grained VisibleToRegularObj.
bar.o is an example that the inaccuracy leads to LTO work which is destined to be wasted (and may lead to more waste if zip.o has, say, initialization functions).
This proposal is to extend VisibleToRegularObj with references within native relocatable object files.

  1. Does lld even have the necessary info? I.e. if I nm bar.o I just see:

Yes. The references are represented as relocations (section->symbol). However, the information is not immediately usable. We need to transform section->symbol references to symbol->symbol references. There will be complexity for aliases and non-local symbols not defined at section offset 0. In addition, examining relocations in the symbol processing breaks ideal phase ordering and is therefore less elegant.

I have looked into this in the past and have some notes on https://maskray.me/blog/2021-02-28-linker-garbage-collection#link-time-optimization.
My gut feeling (may be similar to @tejohnson’s): for many groups, they can ensure that most files are LLVM bitcode files, so VisibleToRegularObj is a not-too-bad compromise considering the complexity that the linker passes relocation information to LLVMLTO.

--gc-sections --print-gc-sections roughly gives the upper bound of how many sections a more accurate approach will give.
For a -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_ENABLE_LTO=Thin build of clang, a more accurate approach will not eliminate more than 2206/127949=1.7% sections.

% ld.lld @response.txt -Map a.map --gc-sections --print-gc-sections > gc.txt
% grep -c '\.text' gc.txt
2206
% grep -c '\.text' a.map   
127949

The ratio will be higher if the link includes more relocatable object files, perhaps in some AIX deployment models.

For clang, in a non-LTO build, I measured:

# of non-local symbols: 137266
# of non-local symbol occurrences in all input files (one symbol may occur in multiple files): 397750
# of relocations: 4230452

For default build of Chrome:

# of non-local symbols: 1922002
# of non-local symbol occurrences in all input files (one symbol may occur in multiple files): 4209388
# of relocations: 9171068

We need to take into account the cost of passing the relocation information to LLVMLTO.

There is an additional aspect around archive members that may be more specific to AIX than not: On AIX, the linker garbage collection is fairly aggressive, and it can deduce (for native object files) that the reference to bar() from foo() is dead (and thus omit bar() and, in turn zip()). If the file that zip() is defined in is an archive member, then the file would be omitted and, if that file contained initialization functions, those functions (given certain linker options) would not be run. Because the linker does not know that the reference to bar() comes from foo() (which is unneeded), it currently needs to include bar() and to tell LTO to include the file containing zip().

In GNU ld, gold, and ld.lld, archive member selection (part of symbol resolution) happens before section based garbage collection.
The linker knows the reference from foo to bar is dead. It is just after all archive member selection was done.

Does AIX ld interleave archive member selection with section based garbage collection?

Considering that it leads to additional initialization in the resulting program and not just a more expensive linking process, I would say the wastage multiplies with the usage of that program.

This sounds a lot like trying to get symbol->symbol references in a way that I am not sure is entirely valid. Within a section, opaque usage of static offsets may be in play.

Not sure if I can get a more definitive answer from someone who has more detailed documents, access to the source code, or design knowledge over what is a long weekend here. Experimentally, yes, it does interleave archive member selection with section-based garbage collection.

I think you meant “referenced by regular object symbol foo() bar()”. Yes, I think that would be the case.

Woops, yep, looks like I got foo() and bar() switched up in this sentence. It is bar() defined in a regular object for which we don’t have a summary in ThinLTO, and therefore don’t know about its call edge to zip() without the enhancement you are suggesting.

Even if it is not the norm for the various linkers listed in the above, using option control is a possibility. I think I should be posting the project on the webpage and, if there are refinements to make, then we can make it after the posting.

I will do need to some catch up for the entire thread later but to answer this question, there isn’t much plan to improve legacy LTO in terms of API. We try to move legacy LTO reuse the same infrastructure as much as possible under various bug fixes/new PM move but it is still fundamentally limited by its API which doesn’t capture the symbol resolution per object file, which is the problem we are trying to solve here.

To fully modernize the C API, we need to redesign a complete new set of stable API, which will not be legacy API anymore. If anyone has good proposal for a new stable LTO API, I am happy to help/review on the RFC.

This part would be my main concern:

  1. Does lld even have the necessary info? I.e. if I nm bar.o I just see:

Yes. The references are represented as relocations (section->symbol). However, the information is not immediately usable. We need to transform section->symbol references to symbol->symbol references. There will be complexity for aliases and non-local symbols not defined at section offset 0. In addition, examining relocations in the symbol processing breaks ideal phase ordering and is therefore less elegant.

Claiming the possibly little LTO optimization saving will need to do some less elegant thing in the linker.

(I have also been thinking about how to speed up lld recently (in response to mold’s challenge). A bunch of ad-hoc symbol processing things (e.g. --fortran-common) make our parallelism much more complex. )

@MaskRay, do you have some suggested material for “preparation resources”?

Does AIX ld interleave archive member selection with section based garbage collection?

It loads archive members on demand when doing a liveness DFS of csects. Apart from few exceptions, sections that are not marked visited by the end of the search are dead and can be GCed.