Make LLD output a COFF relocatable object file (like ELF's -r does). How much work is required to implement this?

Hi,

How far are we from having '-r' in the LLD COFF linker?
I'd be willing to try implementing this if it doesn't require too much effort.
Any suggestions and/or pointers?

Cheers,
Kyra

As far as I know, no one has ever tried to add the -r option to the lld COFF linker. It shouldn't be super hard to add it to the COFF linker, but from our experience of implementing it in the lld ELF linker, I can say that it was tricky and somewhat fragile. We had to add a number of small pieces of code here and there.

We wanted to support it in the ELF linker because that's an existing feature and people are actually using it. Otherwise, we wouldn't have added it. So, what is the motivation for adding the feature to the COFF linker? I don't think the MSVC linker supports it.

(For those who are not familiar with -r, the option makes the linker emit a .o file instead of an executable or a shared library. With it, you can combine multiple object files into one object file.)
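For example, with GNU ld (the file names below are just placeholders):

    # combine several relocatable objects into a single relocatable object
    ld -r -o combined.o foo.o bar.o baz.o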

TL;DR:
I'm trying to evaluate if LLD can be used with GHC (Glasgow Haskell Compiler) on Windows.

Haskell binary code is usually deployed in "packages". A package typically provides static library(ies) and, optionally, shared library(ies) and/or a prelinked ('ld -r') object file. The latter is the best way to satisfy the GHC runtime linker, since it requires no separate compile/link pass (as a shared library does), and is much faster for the GHC runtime linker to consume than a static library.
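As a rough sketch (the package and module names are made up), a package build might produce both artifacts from the very same objects:

    ar rcs libHSfoo.a A.o B.o C.o   # static library, for ordinary batch linking
    ld -r -o HSfoo.o  A.o B.o C.o   # prelinked object, for the GHC runtime linker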

Long story:

To avoid linking unused code, GHC has long supported splitting the intermediate assembly, which is horribly slow at compile time. GHC now supports a direct analogue of '-ffunction-sections' ('-split-sections' in GHC parlance), which dramatically improves compile times, but the BFD linker is horribly slow on files with a *lot* of sections. In the *nix world there is the gold linker; in the Windows world we have nothing other than GNU BFD ld ATM.
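Roughly, '-split-sections' plays the same role for GHC that '-ffunction-sections'/'-fdata-sections' play for GCC (a sketch; the file names are made up):

    ghc -c -split-sections Foo.hs                     # each definition gets its own section
    gcc -c -ffunction-sections -fdata-sections foo.c  # the C-world equivalent

The linker can then drop unreferenced sections (e.g. with '--gc-sections'), at the cost of having to process many more sections per object file.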

GHC on Windows uses MinGW tools, and LLD doesn't fit into the MinGW ecosystem yet (I know that some support has crept into LLD recently, but it is still far from complete). Moreover, when assembling GHC native codegen output, the GNU assembler produces peculiar non-standard COFF files (with 0x11 relocations), and finally binutils doesn't (and probably never will) support the bigobj extension in the 32-bit case.

Windows GHC relies heavily on GCC; in particular, its runtime system's code is full of GNU-isms. But Clang has the unique ability to combine a GNU-ish frontend with an MS-ish backend. I've experimented a bit and have concluded that replacing GCC with Clang as the C compiler/system assembler in GHC on Windows is very much doable.
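For instance, the same Clang binary can target either flavour; the triples and file name below are only illustrative:

    clang --target=x86_64-w64-windows-gnu -c rts.c   # GNU-flavoured (MinGW) environment
    clang --target=x86_64-pc-windows-msvc -c rts.c   # MSVC-compatible object output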

GHC uses object file combining ('ld -r') when C stub/wrapper generation is triggered: these stubs/wrappers are compiled with GCC and linked back into the 'main' object file. In the MS world this use case can easily be satisfied by packing the object files into a library, since the MS linker looks into libraries both when linking a final exe/dll *and* when creating another library (i.e. when creating another library it unpacks all object files from all libraries it is fed and repacks them into the output library; llvm-lib doesn't support this ATM, and AFAIR the LLVM developers are aware of this).
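A sketch of the two flavours (file names made up):

    # GNU-style: merge the stub object back into the main object with -r
    ld -r -o Foo_merged.o Foo.o Foo_stub.o
    # MS-style: pack them into a library instead; lib.exe also absorbs the members
    # of any input .lib when producing the output library
    lib /OUT:foo.lib Foo.obj Foo_stub.obj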

But my question is motivated by another important use case: when packaging compiled Haskell code, it is very desirable to provide not only a static library, but also to partially link that library's object modules into one big object file, which can then be consumed by the GHC runtime linker. The GHC runtime linker can link binary code in any form, but linking a static library is much slower than linking a single object file.

Thank you for your detailed explanation!

I'm not sure if I understand correctly. If my understanding is correct, you
are saying that GHC can link either .o or .so at runtime, which sounds a
bit odd because .o is not designed for dynamic linking. Am I missing
something?

I also do not understand why only static libraries need a "compile/link pass" -- they at least don't need a compile pass, as they contain compiled .o files, and they indeed need a link pass, but that's also true for a single big .o file generated by -r, no? After all, in order to link against a .a file, I think you need to pull out a .o file from the .a and do whatever you need to do to link a single big .o file.

I have an in-progress patch to add that library-repacking behavior to llvm-lib. I didn't have time to finish it, but it is on the table, and it needs to be done for compatibility with MSVC lib.exe.

IIUC, GHC is faster when handling .a files compared to a prelinked big .o file, even if they contain the same binary code/data. But it sounds like an artifact of the current implementation of GHC, because, in theory, there's no reason the former should be much less efficient than the latter. If that's the case, doesn't it make more sense to improve GHC?

I'm not sure if I understand correctly. If my understanding is correct, you are saying that GHC can link either .o or .so at runtime, which sounds a bit odd because .o is not designed for dynamic linking. Am I missing something?

Yes, the GHC runtime linker *does* link .o files, not only performing all necessary relocations but also creating trampolines for "far" code to satisfy the "small" memory model.

I also do not understand why only static libraries need a "compile/link pass" -- they at least don't need a compile pass, as they contain compiled .o files, and they indeed need a link pass, but that's also true for a single big .o file generated by -r, no? After all, in order to link against a .a file, I think you need to pull out a .o file from the .a and do whatever you need to do to link a single big .o file.

I don't quite understand this.
The idea is that when creating a package you should *at the very least* provide a static library a client can statically link against. You may optionally create a shared library for a client to link against, but to do so you have to *recompile* the whole package, because things differ now (this is how GHC works); you can't simply link all your existing object code (what you produced the static library from) into this shared library. But if you want to provide a single prelinked *.o file (for GHC runtime linker consumption), you don't need to perform any extra compile step: you simply link all your object files (exactly those that went into the package's static library) into this *.o file with 'ld -r'.
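In other words, something like this (module and package names are made up, and the exact GHC flags may differ):

    # shared library: the package's modules have to be rebuilt as dynamic/position-independent code
    ghc -c -dynamic -fPIC A.hs B.hs
    ghc -shared -o HSfoo.dll A.o B.o
    # prelinked object: just relink the objects that already went into libHSfoo.a
    ld -r -o HSfoo.o A.o B.o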

IIUC, GHC is faster when handling .a files compared to a prelinked big .o file, even if they contain the same binary code/data. But it sounds like an artifact of the current implementation of GHC, because, in theory, there's no reason the former should be much less efficient than the latter. If that's the case, doesn't it make more sense to improve GHC?

No. The GHC **runtime** linker is much slower when handling *.a files than when linking an already prelinked big *.o file (and this is exactly the crux of this whole story), since it goes through the whole archive and links each object module separately, doing all the resolutions, relocations, and trampolines.

There is, perhaps, some confusion about what the GHC *runtime* linker is. The GHC runtime linker comes into play either when GHC is used interactively, or when GHC encounters code that it has to execute at compile time (Template Haskell/quasiquotations). Thus the GHC compiler must link some external code during its own run time.
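For example (the file name is made up):

    ghci                   # interactive use: package object code is loaded by the runtime linker
    ghc -c UsesSplices.hs  # a Template Haskell splice makes GHC load and run code while compiling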

HTH.

Cheers,
Kyra

Looks like I still do not understand why a .a can be much slower than a prelinked .o. As far as I understand, "ld -r" doesn't reduce the amount of data that much. It doesn't reduce the number of relocations, as relocations in input object files are basically passed through to the output. It doesn't reduce the number of symbols that much, as the combined object file contains a union of all symbols that appear in the input files. So, I think the amount of data in a .a is essentially the same as in a prelinked .o. I wonder what could make a difference in speed.

I can't speak for Haskell, but ld -r can be useful for speeding up C++ links, because it acts as a pre-merging step for duplicate comdats. Consider a library that uses many instantiations of the same template with the same type. An archive will contain many copies of the template, but a relocatable object file will only contain one.
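One way to see this (file names made up; this relies on the relocatable link folding identical comdat groups):

    # a.o and b.o each carry their own copy of the same template instantiation
    nm -C a.o | grep 'vector<int>'
    nm -C b.o | grep 'vector<int>'
    # after 'ld -r' only one copy of each comdat group remains
    ld -r -o merged.o a.o b.o
    nm -C merged.o | grep 'vector<int>'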

Ah, good point.

Only now have I realized that my perception of link times was formed when no '-split-sections' option existed. The corresponding option was '-split-objs', and a typical package's static library contained thousands of object modules.

For example:
The latest official GHC 8.2.1 release "base" package's static library, built with '-split-objs', contains 25631 object modules. The static library size is 28MB; the prelinked object file size is 15MB.
My own custom-built GHC 8.3.20170619 "base" package's static library, built with '-split-sections' (instead of '-split-objs'), contains only 228 object modules. The static library size is 22MB; the prelinked object file size is 15MB.

Thus, when working with '-split-sections' libraries we won't, perhaps, see as big a difference in link times (remember, we mean the GHC runtime linker here) between these libraries and their prelinked object counterparts.

Thus, perhaps, having the '-r' option in COFF LLD is becoming much less important than I thought before.

Cheers,
Kyra