Proposal: MCLinker - an LLVM integrated linker

Hi all,

We are developing a linker, MCLinker.

MCLinker is a linker for LLVM. It leverages the LLVM machine code (MC) layer to link object files and bitcodes, and generate shared objects and executable files.

Motivation

Cool! A linker has been the largest remaining gap for FreeBSD to move to a
completely BSD-licensed toolchain by default, now that libc++ is ported.

At our developers summit at EuroBSDCon we brainstormed a list of
features we'd like to see in an eventual linker. In case it's of
interest, here they are:

- linker scripts (or equivalent)
- LTO framework
- Link time optimization against IR or machine code
- Incremental linking
- Support for IR in ELF
- GNU ld compatibility
- IR processing by plugin
- Limited non-ELF support (for boot blocks, etc)
- Alternative hash table support
- Crunching support
- Be fast
- Native cross-architecture support
- Multipass lookup
- Unit tests
- Coded to LLVM standards (to allow inclusion in LLVM)
- linker is a library
- C and C++ support
- Architecture support: i386, x86_64, ARM, PPC(64), MIPS(64), PNaCl
- Possible architecture support: sparc64

You might notice that some are really obvious, but we were trying to
capture as many requirements as possible.

-- Brooks

Hi, Brooks,

Since this project is helped by many BSD guys in Taiwan, one of
MCLinker's main objectives is to make a direct contribution to the BSD
community. Please feel free to give us suggestions to make sure we can
achieve this goal. Any comments are appreciated.

We realize open discussion on the mailing list is necessary, and we
hope this thread can be the beginning of an open discussion of the
project's scope, features, and the why and how of MCLinker.

I've read the list, and here are some ideas from our group.

- LTO framework
- Link time optimization against IR or machine code
- Support for IR in ELF

LLVM already supports LTO on bitcode, and IMHO it may be good enough.

In GCC, LTO produces 'fat' object files, because GCC has to serialize
its IR into an 'intermediate language' (IL) and compress the IL into
the object files. In our experience, the 'fat' object files are about
10x bigger than the originals and slow down the linking process
significantly, while the generated code gains only about 7%~13%.

IMHO, LLVM provides a better solution than GCC. With LLVM, users can
compile source files into many small bitcode files. LTO can then be
performed while linking these small bitcode files into one 'big
bitcode', and MCLinker reads the 'big bitcode' and generates
executables/DSOs. Since the 'big bitcode' is only a little bigger than
the generated file, we avoid producing 'fat' objects while still
getting a worthwhile performance improvement.
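
For reference, this is a minimal sketch of that flow using the present-day
upstream LLVM C++ API (not MCLinker code; the API at the time differed,
and the input file names are made up):

// Merge several small bitcode files into one "big bitcode" module,
// which a linker back end can then lower to an executable or DSO.
#include "llvm/Bitcode/BitcodeWriter.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

int main() {
  llvm::LLVMContext Ctx;
  llvm::SMDiagnostic Err;
  // Hypothetical inputs; in practice these come from the command line.
  const char *Inputs[] = {"a.bc", "b.bc", "c.bc"};

  auto Big = std::make_unique<llvm::Module>("big", Ctx);
  llvm::Linker L(*Big);
  for (const char *Path : Inputs) {
    std::unique_ptr<llvm::Module> M = llvm::parseIRFile(Path, Err, Ctx);
    if (!M || L.linkInModule(std::move(M)))
      return 1; // parse or link error
  }

  std::error_code EC;
  llvm::raw_fd_ostream OS("big.bc", EC, llvm::sys::fs::OF_None);
  llvm::WriteBitcodeToFile(*Big, OS); // the "big bitcode" the linker consumes
  return EC ? 1 : 0;
}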

Apart from LTO, we also have some good ideas on link-time
optimization. I will open another thread to discuss this later.

- linker scripts (or equivalent)

Linker scripts are a thorny problem. The grammar of the linker script
language in GNU ld is context-sensitive and hard to implement.
Maybe we can list the necessary requirements first and try to define
a simpler grammar.

- Incremental linking
- GNU ld compatibility
- IR processing by plugin
- Limited non-ELF support (for boot blocks, etc)
- Alternative hash table support
- Crunching support
- Be fast
- Native cross-architecture support
- Multipass lookup
- Unit tests
- Coded to LLVM standards (to allow inclusion in LLVM)
- linker is a library
- C and C++ support
- Architecture support: i386, x86_64, ARM, PPC(64), MIPS(64), PNaCl
- Possible architecture support: sparc64

We still have some ideas about the features above. In order to keep the
discussion easy to follow, I will discuss them in other threads.

BTW, sorry for the appearance of the "Email Confidentiality Notice". I
have asked our IT staff to remove it from all our emails immediately.
Also, sorry for the scrambled characters in some names; I have asked
all my team members to use English names.

Best regards,
Luba

It is not necessary to preserve compatibility with GNU linker scripts.
There are many good reasons not to, but the functionality has to exist.
Some of the issues to be addressed:

(1) Mapping sections to fixed offsets.
(2) Ordering of sections and aggregation into PT_LOAD segments.
(3) Setting non-default attributes on segments, e.g. making debug
information loadable for specific applications.
(4) Adding start-of-section/end-of-section markers.

Joerg

Apart from LTO, we also have some good ideas on link-time
optimization. I will open another thread to discuss this later.

Sorry, I made a stupid mistake. I meant "some good ideas on the optimizations that can be done by linkers, such as instruction relaxation and how to use the IP register efficiently".

Hello, I'm glad to see other people are interested in building an LLVM linker.

I have been working in the direction of an LLVM linker for some time
now. The first key component of this is the libObject library in tree.
See my talk from the November 2010 Dev Meeting called "Object Files in
LLVM" <http://www.llvm.org/devmtg/2010-11/&gt;\.

We are currently seeking to remove the inherent duplication of object
file handling in all of the llvm projects.

In your design a lot of the structure seems to be dedicated to the
idea of integrated linking. While we want to support this kind of use,
I believe it can be accomplished without as much structure. Most
projects are built as libraries, and the final executable is generally
one source file which links to a bunch
of rather large libraries. The small speedup you get from not writing
this one object file to disk is negligible, considering it's going to
be in cache when the link starts, and it will be the first object
processed.

This structure might make it harder to use the facilities the linker
provides to do dynamic loading, and linking JITed code. Both of these
are cases where having the linker integrated makes a huge performance
and complexity difference.

A major feature of the linker design I have been working on is the
object file IR it is based on. It takes the concept of an atom (a
named range of data) and formalizes it into a graph based data
structure where edges represent the relations between atoms such as
relocations, groupings, and layout constraints. This provides a format
for the core linker algorithms to work on, and also for plugins and
file-format specific processing to happen. This can also enable more
efficient intermediate object and debug file formats.
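
To make the shape of this IR concrete, here is a rough hypothetical
sketch (the names Atom, Reference, and RefKind are illustrative only,
not the actual classes from this work):

#include <cstdint>
#include <string>
#include <vector>

// The kinds of edges mentioned above: relocations, groupings (e.g. a
// function and its exception-table entry), and layout constraints
// (e.g. "must follow" for fall-through code).
enum class RefKind { Relocation, Group, LayoutAfter };

struct Atom;

struct Reference {
  RefKind Kind;
  Atom *Target;      // the atom this edge points at
  uint64_t Offset;   // where in the source atom the fixup applies
  int64_t Addend;    // relocation addend, if any
};

// An atom: a named, indivisible range of data plus its outgoing edges.
struct Atom {
  std::string Name;
  std::vector<uint8_t> Content;      // raw bytes of the range
  std::vector<Reference> References; // edges to other atoms
};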

We would all like to avoid duplication and have the best linker
possible for LLVM. So I think it's important to discuss these issues
and decide how to best combine our efforts.

- Michael Spencer

In GCC, LTO produces 'fat' object files, because GCC has to serialize
its IR into an 'intermediate language' (IL) and compress the IL into
the object files. In our experience, the 'fat' object files are about
10x bigger than the originals and slow down the linking process
significantly, while the generated code gains only about 7%~13%.

Right, though GCC 4.7 will offer an option to emit just the bytecode in
object files. Additionally, the biggest gains we generally observe
with LTO are when it's coupled with FDO, and almost always the
biggest wins are in the inliner
(LightweightIpo - GCC Wiki).

Apart from LTO, we also have some good ideas on link-time
optimization. I will open another thread to discuss this later.

You may want to look at Diablo (http://diablo.elis.ugent.be/). An
optimizing linker that has been around for a while. I'm not sure
whether it is still being developed, but they had several interesting
ideas in it.

Diego.

You may want to look at Diablo (http://diablo.elis.ugent.be/). An
optimizing linker that has been around for a while. I'm not sure
whether it is still being developed, but they had several interesting
ideas in it.

  Diablo is still being maintained. I checked its status a few days ago
on the Diablo mailing list.

Regards,
chenwj

This is, in fact, how Apple's linker works.

We realized a few years back that the traditional way of linking (section merging) made advanced linker features very difficult. The kinds of features Apple wanted were (and still are): dead code stripping, weak symbol coalescing, and order files (functions and data are re-ordered to minimize runtime RAM footprint). A better model is based on Atoms (nodes). An Atom is an indivisible subrange of a section. Atoms have References/Fixups (edges) to other Atoms. As Michael said, once you have your object file data in this representation, all the linker algorithms are much simpler.

The hard part of linking is parsing object files into Atoms. ELF should be a little easier because the symbol table has the st_size field. But there are lots of cases (like dwarf unwind info) where symbol table entries are not created.
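
For ELF, a first approximation is to cut each section at its defined
symbols and use st_value/st_size to bound each piece. A rough sketch,
with simplified stand-in types, that ignores the zero-sized and missing
symbol cases just mentioned:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal stand-ins for the ELF bits we need.
struct Sym { std::string Name; uint64_t Value; uint64_t Size; unsigned Shndx; };
struct Piece { std::string Name; uint64_t Offset; uint64_t Size; };

// Split one section (index Shndx, total size SectionSize) into atom-sized
// pieces at the boundaries of its defined symbols.
std::vector<Piece> splitSection(std::vector<Sym> Syms, unsigned Shndx,
                                uint64_t SectionSize) {
  std::vector<Piece> Out;
  Syms.erase(std::remove_if(Syms.begin(), Syms.end(),
                            [&](const Sym &S) { return S.Shndx != Shndx; }),
             Syms.end());
  std::sort(Syms.begin(), Syms.end(),
            [](const Sym &A, const Sym &B) { return A.Value < B.Value; });
  for (std::size_t I = 0; I < Syms.size(); ++I) {
    uint64_t End = (I + 1 < Syms.size()) ? Syms[I + 1].Value : SectionSize;
    // Prefer st_size when it is set; otherwise run to the next symbol.
    uint64_t Size = Syms[I].Size ? Syms[I].Size : End - Syms[I].Value;
    Out.push_back({Syms[I].Name, Syms[I].Value, Size});
  }
  return Out;
}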

Nick Kledzik
Linker Engineer
Apple Inc.

Thanks for the useful information. We noticed that the idea of LIPO can also help LLVM LTO once LLVM has FDO/PGO. Regarding Diablo, we will study it, and I think we will get some good ideas from it.

In MCLinker, the details of the instructions and data in bitcode are still kept during linking, so some opportunities to optimize the instructions in bitcode become straightforward. Instruction relaxation is one such case. (Since ARM is one of the targets we focus on, I'm going to use ARM to illustrate the problem.)

When linking bitcode and other object files, stubs are necessary if a branch target is out of range or for ARM/Thumb mode switching. The Google gold linker basically uses two kinds of stubs: one is a sequence of consecutive branch instructions, and the other is one branch instruction followed by an instruction (e.g., ldr) that changes the PC directly.

An example of the latter case:

1: bl <stub_address>

2: ldr pc, [pc, #-4] ; stub
3: dcd R_ARM_ABS32(X)

In MCLinker, we can optimize it as follows:

X: ldr ip, [pc, #-4]
Y: dcd R_ARM_ABS32(X)
Z: bx ip

Before the optimization, some processors suffer from flushing the ROB/Q because their pipelines are filled with the invalid instructions that immediately follow the ldr. None of these instructions should be executed, so the processor must flush them when the ldr commits.

Since all details of the instructions and data are preserved, MCLinker can directly rewrite the program without inserting a stub. It can replace the 1: bl instruction with the longer branch sequence ending at Z: bx, and the program's performance is therefore improved by more efficient use of the branch target buffer (BTB).
This is just one case, and there are other optimizations we can do…

Thanks,
Chinyen

A helpful link-time optimization would be to place subroutines that
are used close together in time also close together in the executable
file. That also goes for data that is in the executable file, whether
initialized (.data segment) or zero-initialized (.bss).

If the unit of linkage of code is the function rather than the
compilation module, and the unit of linkage of data is the individual
data item rather than all the .data and .bss items together that are
in a compilation unit, you could rearrange them at will.

For architectures such as ARM that cannot make jumps to faraway
addresses, you could make the destinations of subroutine calls close
to the caller so you would not need so many trampolines.

The locality improves the speed because the program would use the code
and data caches more efficiently, and would page in data and code from
disk less often.

Having fewer physically resident pages also saves on precious kernel
memory. I read in O'Reilly's "Understanding the Linux Kernel" that on
the i386 architecture, the kernel's page tables consume most of the
physical memory in the computer, leaving very little physical memory
for user processes!

A first cut would be to start with the runtime program startup code,
which for a C program then calls main(). The subroutines that main
calls would be placed next in the file. Suppose main calls Foo() and
then Bar(). One would then place each of the subroutines that Foo()
calls all together, then each of the subroutines that Bar() calls.
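
As a sketch of that first cut, assuming a call graph is already
available (all the names below are made up), a breadth-first walk from
the entry point produces such an ordering:

#include <deque>
#include <map>
#include <set>
#include <string>
#include <vector>

// Breadth-first layout order starting from the entry point: main first,
// then everything main calls, then everything those call, and so on.
std::vector<std::string>
layoutOrder(const std::map<std::string, std::vector<std::string>> &Calls,
            const std::string &Entry) {
  std::vector<std::string> Order;
  std::set<std::string> Seen{Entry};
  std::deque<std::string> Work{Entry};
  while (!Work.empty()) {
    std::string F = Work.front();
    Work.pop_front();
    Order.push_back(F);
    auto It = Calls.find(F);
    if (It == Calls.end())
      continue;
    for (const std::string &Callee : It->second)
      if (Seen.insert(Callee).second)
        Work.push_back(Callee);
  }
  return Order;
}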

It would be best if some static analysis were performed to determine
in what order subroutines are called, and in what order .data and .bss
memory is accessed.

Getting that analysis right for the general case would not be easy, as
the time-order in which subroutines are called may of course depend on
the input data. To improve the locality, one could produce an
instrumented executable which saved a stack trace at the entry of each
subroutine. Examination of all the stack traces would enable a
post-processing tool to generate a linker script that would be used
for a second pass of the linker. This is a form of profiler-guided
optimization.

For extra credit one could prepare multiple input files (or for
interactive programs, several distinctly different GUI robot scripts).
Then the tool that prepared the linker script would try to optimize
for the average case for most code.

Regards,

Don Quixote

A helpful link-time optimization would be to place subroutines that
are used close together in time also close together in the executable
file. That also goes for data that is in the executable file, whether
initialized (.data segment) or zero-initialized (.bss).

If the unit of linkage of code is the function rather than the
compilation module, and the unit of linkage of data is the individual
data item rather than all the .data and .bss items together that are
in a compilation unit, you could rearrange them at will.

This is exactly what the atom model provides. And some of the use
cases you describe were actually discussed at the social tonight.

- Michael Spencer

Hi, Nick,

Thanks for sharing. We also think the atom model is a good idea, and
thanks for your work; the code of the Apple linker is clean and easy to
understand.

In the ELF world, section merging is a relatively mature and popular
technique. To the best of my knowledge, no one has defined atoms for
ELF before. (Now I know Spencer is doing this, and I will reply to his
mail soon :p) Most of us were not familiar with the concept of atoms at
first, and most papers and textbooks describe section merging. So
please forgive our conservatism at the beginning: in order to make
sure we can achieve a workable linker, my team members and I decided
to adopt section merging.

I think we can put the atom model on the roadmap and explore it after
we have a workable linker.

Besides atoms, I think the features of the Apple linker you mentioned
are also important to us. Our first emphasis is on section reordering.
In our previous experiments, the order of the sections strongly
affected the performance of multimedia codecs, so an easier way to
change the order of sections is also one of our goals. I think
separating a section's layout information into its offset within the
file and its size, as AsmLayout and MCSectionData do, can make section
reordering easier.
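
In other words, if an output section only records its size and
alignment and its file offset is assigned late, reordering becomes
almost free. A hypothetical sketch of the idea (not MCLinker code):

#include <cstdint>
#include <vector>

struct OutSection {
  uint64_t Size = 0;   // known after section merging
  uint64_t Align = 1;  // required alignment (assumed power of two)
  uint64_t Offset = 0; // assigned only at layout time
};

// Assign file offsets in whatever order the sections currently appear.
// Reordering sections is then just permuting the vector and calling
// this again; nothing else has to change.
void assignOffsets(std::vector<OutSection> &Sections, uint64_t Start) {
  uint64_t Off = Start;
  for (OutSection &S : Sections) {
    Off = (Off + S.Align - 1) & ~(S.Align - 1); // round up to alignment
    S.Offset = Off;
    Off += S.Size;
  }
}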

Garbage collection of sections is another problem that needs a good
approach. Taking the transitive closure of sections is the approach
used in Google gold. We're looking for a better solution.

Best regards,
Luba

Hi

Thank you for the reply; Object Files is really impressive work. We have studied it as important related work. Sorry that our related-work discussion did not mention Object Files; I sent the proposal in a rush.

In fact, I have previously implemented readers based on Object Files in MCLinker, and these readers are used to parse libraries into MCLinker IR. That is, we designed these structures to be used as an IR for the core linking algorithms, not just for "integrated" linking.

We intend to design unified abstractions which support the common linking algorithms. For instance, you can find the definition of LDSymbol at http://code.google.com/p/mclinker/source/browse/include/mcld/LD/LDSymbol.h. The readers in MCLinker convert ELF Elf_Sym, Mach-O nlist, and COFF symbol entries into LDSymbol and then run the same symbol resolution algorithm. Currently we are implementing the symbol tables and symbol resolution in MCLinker, and we will release our code as soon as possible.
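
To illustrate the intent with a much-simplified hypothetical sketch (this is not the real LDSymbol interface): once ELF, Mach-O, and COFF symbols are normalized into one record, a single resolution rule can be shared by all formats.

#include <map>
#include <string>

// A simplified, format-neutral symbol record: readers would fill this
// in from Elf_Sym, Mach-O nlist, or COFF symbol entries.
struct Symbol {
  std::string Name;
  enum Binding { Undefined, Weak, Global } Bind = Undefined;
};

// One resolution rule shared by every format: a global definition beats
// a weak one, and any definition beats an undefined reference. (A real
// linker would also diagnose duplicate global definitions.)
void resolve(std::map<std::string, Symbol> &Table, const Symbol &New) {
  auto It = Table.find(New.Name);
  if (It == Table.end()) {
    Table[New.Name] = New;
    return;
  }
  if (New.Bind > It->second.Bind) // Global > Weak > Undefined
    It->second = New;
}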

As LDSymbol.h shows, LDSymbol has a pointer to MCSectionData. Once we replace MCSectionData here with smaller data units, I think LDSymbol can be used in much the same way as an atom. As Nick mentioned, ELF will be easier to parse into such units thanks to fields like st_size, while it may take more effort to parse Mach-O and COFF into similarly small units.

Jush

I am curious: is it better to do these link-time optimizations at the section level?

For example, invoking Clang with the "-ffunction-sections" and "-fdata-sections"
options places each function and each data item into its own section in the
output file for the ELF backend. So this is exactly what the two options
already provide at the section level.

This is not new stuff; AFAIK there are already some link-time optimizations
that work with function sections. What are the advantages of creating atoms
at link time rather than using function-sections and data-sections prior to linking?

Thanks,
ChengWei Chen

Hi,

I traced dyld and some related components again. I think MCLinker can provide its facilities to dyld and the MC-JIT just as a library can, even though it is an integrated linker.

MCLinker will provide a number of clearly separated APIs, including readFile, symbol resolution, layout, section merging, reordering, instruction relaxation, applying relocations, writeFile, and so on. The only requirement for using these public APIs is passing MCLinker IR to them, which means they will be easy to reuse by other components such as dyld and the MC-JIT.
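
As a hypothetical sketch of what such a library-style interface could look like to a client such as dyld or the MC-JIT (none of these names are MCLinker's real API; they only show the shape):

#include <memory>
#include <string>
#include <vector>

struct Module {}; // placeholder standing in for "MCLinker IR"

// The clearly separated phases described above, sketched as an interface.
struct LinkerLib {
  virtual ~LinkerLib() = default;
  virtual std::unique_ptr<Module> readFile(const std::string &Path) = 0;
  virtual void resolveSymbols(Module &M) = 0;
  virtual void layout(Module &M) = 0;
  virtual void relax(Module &M) = 0; // e.g. ARM instruction relaxation
  virtual void applyRelocations(Module &M) = 0;
  virtual void writeFile(const Module &M, const std::string &Out) = 0;
};

// A dyld- or MC-JIT-like client drives only the phases it needs: when
// loading at run time it can stop after applyRelocations and never call
// writeFile at all.
void loadForExecution(LinkerLib &L, const std::vector<std::string> &Inputs) {
  for (const std::string &Path : Inputs) {
    std::unique_ptr<Module> M = L.readFile(Path);
    L.resolveSymbols(*M);
    L.layout(*M);
    L.relax(*M);
    L.applyRelocations(*M);
    // hand the relocated image to the runtime instead of writing a file
  }
}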

For dyld, I think it would look similar to the function SectLinker::doFinalization(Module &pM) in SectLinker.cpp (the file is in the MCLinker source code; please see http://code.google.com/p/mclinker/source/browse/lib/CodeGen/SectLinker.cpp). SectLinker is a function pass used as a driver: it invokes readers to parse files into IR and calls a number of reusable APIs to perform the core linking algorithms, and these linking algorithms are implemented in a linker library.

The various readers in MCLinker are all implemented as a library. As I mentioned in my last mail, there is an Object Files-based version of the readers too. These readers generate MCLinker IR for performing the core linking algorithms, such as symbol resolution and relocation, which we are working on. Finally, MCLinker calls a DSOwriter or an EXEWriter to generate a .so or an executable. Like the readers, these writers will be implemented as a library.

Jush

A helpful link-time optimization would be to place subroutines that
are used close together in time also close together in the executable
file. That also goes for data that is in the executable file, whether
initialized (.data segment) or zero-initialized (.bss).

If the unit of linkage of code is the function rather than the
compilation module, and the unit of linkage of data is the individual
data item rather than all the .data and .bss items together that are
in a compilation unit, you could rearrange them at will.

For architectures such as ARM that cannot make jumps to faraway
addresses, you could make the destinations of subroutine calls close
to the caller so you would not need so many trampolines.

The locality improves the speed because the program would use the code
and data caches more efficiently, and would page in data and code from
disk less often.

Having fewer physically resident pages also saves on precious kernel
memory. I read in O'Reilly's "Understanding the Linux Kernel" that on
the i386 architecture, the kernel's page tables consume most of the
physical memory in the computer, leaving very little physical memory
for user processes!

A first cut would be to start with the runtime program startup code,
which for a C program then calls main(). The subroutines that main
calls would be placed next in the file. Suppose main calls Foo() and
then Bar(). One would then place each of the subroutines that Foo()
calls all together, then each of the subroutines that Bar() calls.

This static analysis does not capture virtual calls (either C++ or
Objective-C). It may also cause error-handling code to be moved into
the "hot" area.

It would be best if some static analysis were performed to determine
in what order subroutines are called, and in what order .data and .bss
memory is accessed.

Getting that analysis right for the general case would not be easy, as
the time-order in which subroutines are called may of course depend on
the input data. To improve the locality, one could produce an
instrumented executable which saved a stack trace at the entry of each
subroutine. Examination of all the stack traces would enable a
post-processing tool to generate a linker script that would be used
for a second pass of the linker. This is a form of profiler-guided
optimization.

At Apple we generate "order files" by running a program under dtrace.
We use a feature of dtrace that sets a "one shot" break point on the
start of every function. You then run the program (under dtrace) and
you get a list of functions in the order they were first called, with
no need to build a special version of the program.

Given that the optimal ordering is dependent on what the user does
to exercise different parts of the program, we've concluded the minimal
ordering is to just get initialization functions ordered. This also helps
programs launch faster (less paging).