[RFC] Extending lib/Linker to support bitcode "shared objects"

Hi llvm team!

I’m currently working on an extended version of llvm-ld that can check whether all symbols are present (and fail if some are unresolved), treat archives in the right way (link all the object files in an archive if it’s specified as a regular input, not as -l) and, most important for my project, link against bitcode “shared objects”. The semantics are pretty simple: a module treated as a shared object is used only as a source of symbol names, which are excluded from the list of unresolved symbols. For example:

foo.c:

int bar(void);

int main(void) {
    return bar();
}

bar.c:

int bar() {
return 123;
}

In the native case, you have three options to link these files.

  1. Classic static linking:

$ clang foo.c bar.c

  2. Static linking with the archive:

$ clang -c bar.c
$ ar q libbar.a bar.o
$ clang foo.c -lbar -L.

  3. Dynamic linking:

$ clang -c foo.c
$ clang -shared bar.c -o libbar.so
$ clang foo.c -lbar -L.
$ nm a.out

U bar
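The bitcode case is meant to mirror the dynamic-linking option above. As a rough illustration of the rule (this is not the actual lib/Linker code; the `Module` class and `unresolved_symbols` function are hypothetical names made up for this sketch, and a module is reduced to two sets of symbol names), the check could look like this in Python:

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a bitcode module: only its symbol tables."""
    name: str
    defined: set = field(default_factory=set)
    undefined: set = field(default_factory=set)


def unresolved_symbols(regular, shared):
    """Return the symbols still undefined after linking `regular` against `shared`.

    Regular modules contribute definitions and references. Modules treated as
    shared objects contribute only names that satisfy references; their own
    references are ignored (an assumption of this sketch).
    """
    defined = set()
    referenced = set()
    for module in regular:
        defined |= module.defined
        referenced |= module.undefined
    provided = set()
    for module in shared:
        provided |= module.defined
    return referenced - defined - provided


foo = Module("foo.bc", defined={"main"}, undefined={"bar"})
bar = Module("bar.bc", defined={"bar"})

print(sorted(unresolved_symbols([foo], [])))     # ['bar']  -> the link should fail
print(sorted(unresolved_symbols([foo], [bar])))  # []       -> bar comes from the "shared object"
```

In the second call nothing from bar.bc is copied into the output; it only removes `bar` from the unresolved list.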

Hi llvm team!

I'm currently working on the extended version of llvm-ld, which has an
ability to check if all the symbols are present (and fail if some symbols
are not resolved), treat archives in the right way (link all the object
files in the archive if it's specified as the regular input, not as -l)

Is that the "right way"? I think all native linkers require
--whole-archive or similar to do that, no?

and the most important to my project feature: to link against bitcode
"shared objects". The semantics is pretty simple. The module treated as
shared object is only used as the source of symbol names which will be
excluded from the list of unresolved symbols.

In summary, linking foo.bc and bar.bcso (do you use a different
extension to differentiate from a "static" bc?) would be equivalent (but
faster, easier) to

$ llc bar.bc -filetype=obj -o bar.o
$ clang -shared -o bar.so bar.o
$ clang -use-gold-plugin foo.o bar.so -o t

Is that correct? In particular, "ldd t" should show a dependency on bar.
Any particular reason for not adding this to the plugin api?

I feel like the similar functionality might be useful for other projects
as well. For example, Emscripten (as of 3 months ago, when I have
checked it) does not check for undefined symbols at link time, because
it does not have a way to link against glibc (the symbols from which
would be added at runtime as usual javascript functions). This adds
additional overhead to the developer who ports the software to
javascript using Emscripten.

Is that the same case? It seems that it would be easier to just include
a stub ELF libc.so with the symbols from libc that Emscripten supports.

Any objections? Comments?

Ivan Krasin

Cheers,
Rafael

Hi llvm team!

I’m currently working on the extended version of llvm-ld, which has an
ability to check if all the symbols are present (and fail if some symbols
are not resolved), treat archives in the right way (link all the object
files in the archive if it’s specified as the regular input, not as -l)

Is that the “right way”? I think all native linker require a
--whole-archive or similar to do that, no?

My mistake. Please read the phrase as “support --whole-archive”.
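To spell out the difference between the two archive modes, here is a small Python sketch. It is only an illustration under simplifying assumptions, not code from my branch: `select_members` is a hypothetical name, an archive is modelled as a list of `(member, defined, undefined)` tuples, and the caller is expected to pass only symbols that the regular inputs leave undefined.

```python
def select_members(archive, undefined, whole_archive=False):
    """Pick the archive members that take part in the link.

    Default mode: pull in a member only if it defines a symbol that is still
    undefined, and repeat until no further member helps.
    whole_archive mode: take every member, needed or not.
    """
    if whole_archive:
        return [name for name, _, _ in archive]

    wanted = set(undefined)
    defined_so_far = set()
    selected = []
    progress = True
    while progress:
        progress = False
        for name, defs, refs in archive:
            if name in selected:
                continue
            if defs & wanted:
                selected.append(name)
                defined_so_far |= defs
                # The new member may introduce references of its own.
                wanted = (wanted | set(refs)) - defined_so_far
                progress = True
    return selected


libbar = [
    ("bar.o", {"bar"}, {"helper"}),
    ("helper.o", {"helper"}, set()),
    ("unused.o", {"unused"}, set()),
]

print(select_members(libbar, {"bar"}))                      # ['bar.o', 'helper.o']
print(select_members(libbar, {"bar"}, whole_archive=True))  # ['bar.o', 'helper.o', 'unused.o']
```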

and the most important to my project feature: to link against bitcode
“shared objects”. The semantics is pretty simple. The module treated as
shared object is only used as the source of symbol names which will be
excluded from the list of unresolved symbols.

In summary, linking foo.bc and bar.bcso (do you use a different
extension to differentiate from a “static” bc?) would be equivalent (but
faster, easier) to

$ llc bar.bc -filetype=obj -o bar.o
$ clang -shared -o bar.so bar.o
$ clang -use-gold-plugin foo.o bar.so -o t

Is that correct? In particular, “ldd t” should show a dependency on bar.
Any particular reason for not adding this to the plugin api?

The result is a native .so here. My goal is to have a bitcode result.

gold plugin with mods supports that as well, but it would be nice to avoid dependency on gold and gold plugin to simplify things.

I feel like the similar functionality might be useful for other projects
as well. For example, Emscripten (as of 3 months ago, when I have
checked it) does not check for undefined symbols at link time, because
it does not have a way to link against glibc (the symbols from which
would be added at runtime as usual javascript functions). This adds
additional overhead to the developer who ports the software to
javascript using Emscripten.

Is that the same case? It seems that it would be easier to just include
a stub ELF libc.so with the symbols from libc that Emscripten supports.

I believe that Emscripten does not use gold and gold plugin at the moment. They use llvm-link: https://github.com/kripken/emscripten/wiki/Building-Projects

I fully agree that all the features I want from the bitcode linker can be achieved with gold + gold plugin (with mods to both parts), but I would like to stay away from this dependency: partly because maintaining a modified gold and gold plugin is no fun, and partly because gold + gold plugin is badly broken under cygwin. See http://code.google.com/p/nativeclient/issues/detail?id=2286 for more details. In short, it’s because cygwin does not fully support dlopen: https://www.google.com/search?sourceid=chrome&ie=UTF-8&q=cygwin+dlopen

This could be fixed by linking the gold plugin into gold statically, as in http://codereview.chromium.org/8713008/, but that only adds complexity.

    $ llc bar.bc -filetype=obj -o bar.o
    $ clang -shared -o bar.so bar.o
    $ clang -use-gold-plugin foo.o bar.so -o t

    Is that correct? In particular, "ldd t" should show a dependency on bar.
    Any particular reason for not adding this to the plugin api?

The result is a native .so here. My goal is to have a bitcode result.

gold plugin with mods supports that as well, but it would be nice to
avoid dependency on gold and gold plugin to simplify things.

Sorry, I meant equivalent only in the produced executable (t). In the
above example the bar.bc is your "shared IL library" which I assume
already exists. You want to produce an executable (t) which has a
dependency on a shared library with the symbols defined in foo.bc.

Is that the use case?

I believe that Emscripten does not use gold and gold plugin at the
moment. They use
llvm-link: https://github.com/kripken/emscripten/wiki/Building-Projects

I think that is the current implementation, yes.

I fully agree that all the features I want from the bitcode linker may
be achieved with gold + gold plugin (with mods to both parts), but I
would like to stay away from this dependency. Partly, because
maintaining modified gold and gold plugin is no fun, partly because gold
+ gold plugin is badly broken under cygwin.

I understand. I was just pointing out that the Emscripten case looks different
from what you have. I don't see where an LLVM IL library (libc in your
example) would fit. All that is needed is a stub with the functions that
will be defined in JS in the end.

Cheers,
Rafael

$ llc bar.bc -filetype=obj -o bar.o
$ clang -shared -o bar.so bar.o
$ clang -use-gold-plugin foo.o bar.so -o t

Is that correct? In particular, “ldd t” should show a dependency on bar.
Any particular reason for not adding this to the plugin api?

The result is a native .so here. My goal is to have a bitcode result.

gold plugin with mods supports that as well, but it would be nice to
avoid dependency on gold and gold plugin to simplify things.

Sorry, I meant equivalent only in the produced executable (t). In the
above example the bar.bc is your “shared IL library” which I assume
already exists. You want to produce an executable (t) which has a
dependency on a shared library with the symbols defined in foo.bc.

Is that the use case?

Almost. The difference is that I want to have a bitcode module as the result of linking, not an executable file.

I believe that Emscripten does not use gold and gold plugin at the
moment. They use
llvm-link: https://github.com/kripken/emscripten/wiki/Building-Projects

I think that is the current implementation, yes.

I fully agree that all the features I want from the bitcode linker may
be achieved with gold + gold plugin (with mods to both parts), but I
would like to stay away from this dependency. Partly, because
maintaining modified gold and gold plugin is no fun, partly because gold
+ gold plugin is badly broken under cygwin.

I understand. I was just pointing out that the Emscripten case looks different
from what you have. I don’t see where an LLVM IL library (libc in your
example) would fit. All that is needed is a stub with the functions that
will be defined in JS in the end.

A bitcode “so” is supposed to have this form:

define void @printf() {
ret void
}

In this case, Emscripten might want to generate such stubs from the list of JavaScript functions it provides. In my case, generating this LL stub is a simple script based on nm, sed and llvm-as that takes a native .so as input and produces a bitcode file with stub definitions of the public symbols.

This is about the same use case as Emscripten; the only difference is the source of the defined symbols (native .so vs. js file).
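For illustration, the text-processing part of such a stub generator could be sketched as follows. This is a Python approximation, not the nm/sed script itself: `stubs_from_nm` is a hypothetical name, the input is assumed to be `nm`-style text (address, type letter, name), and only function symbols of type `T` or `W` are turned into stubs; data symbols are skipped here. The resulting text would then be fed to llvm-as.

```python
import re

# Names matching this pattern can be written unquoted in LLVM IR.
_PLAIN_NAME = re.compile(r"[-a-zA-Z$._][-a-zA-Z$._0-9]*\Z")


def stubs_from_nm(nm_output):
    """Turn nm-style lines into LLVM IR stub function definitions."""
    names = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols have three fields: address, type, name.
        # Undefined ones ("U printf") have no address and are skipped.
        if len(fields) != 3:
            continue
        _, kind, name = fields
        if kind not in ("T", "W"):
            continue
        # Drop a symbol-version suffix such as "@@GLIBC_2.2.5".
        name = name.split("@", 1)[0]
        if name and name not in names:
            names.append(name)

    stubs = []
    for name in names:
        ident = name if _PLAIN_NAME.match(name) else '"%s"' % name
        stubs.append("define void @%s() {\n  ret void\n}\n" % ident)
    return "\n".join(stubs)


sample = """\
0000000000001139 T bar
                 U printf
0000000000004010 D counter
0000000000001150 T baz@@LIB_1.0
"""

print(stubs_from_nm(sample))
```

Every stub gets the signature `void ()`, as in the printf example above, since only the symbol name matters for resolution.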

Is it more clear now? If not, I would like to give it another try and write much more details and examples.

krasin

Is it more clear now? If not, I would like to give it another try and
write much more details and examples.

I am still not completely sure I understand the use case. In particular
since you say you want a bitcode module in the end, I don't understand
how this is different from using llvm-link, it doesn't complain about
undefined symbols...

Can you provide an example? What are your inputs. Are they IL or ELF?
What are the outputs that you want? Are they IL or ELF?

krasin

Cheers,
Rafael

Is it more clear now? If not, I would like to give it another try and
write much more details and examples.

I am still not completely sure I understand the use case. In particular
since you say you want a bitcode module in the end, I don’t understand
how this is different from using llvm-link, it doesn’t complain about
undefined symbols…

Complaining about undefined symbols is the next step. It’s already implemented in my branch: https://github.com/krasin/bitlink/commit/be222a2863a989666d4925e5344d0c84cac8e06b

Can you provide an example? What are your inputs. Are they IL or ELF?
What are the outputs that you want? Are they IL or ELF?

All the inputs and all the outputs are IL. Some IL inputs are just stubs which have only functions with empty bodies.
The output IL file is supposed to have all the symbols resolved, statically or dynamically (i.e. there’s a stub that has those symbols defined).
In case of Emscripten, that’s enough.
In our case (PNaCl), we also add metadata to the output IL file, so that it would be possible to track which shared libraries are actually needed.

As a “real life” example, consider a developer on a Windows machine who wants to write a portable program (one that will run inside NaCl) using libpng and glibc. The PNaCl toolchain would provide portable bitcode stubs for glibc, and the libpng port from the naclports repositories would come with link-time portable stubs as well. The developer wants to create a bitcode file that PNaCl will treat as an executable (translated to the target architecture before the actual run), linked dynamically with glibc and libpng. He can achieve this by using the modified gold + modified gold plugin from the PNaCl toolchain (but Windows would block this path; see the cygwin notes in my previous messages), or he can use the bitcode linker I’m currently working on. With the latter, the developer will know whether all the symbols are resolved and will not depend on the gold linker.