[RFC] LLVM Busybox Proposal

Hello all,

When building LLVM tools, including Clang and lld, it’s currently possible to use either static or shared linking for LLVM libraries. The latter can significantly reduce the size of the toolchain since we aren’t duplicating the same code in every binary, but the dynamic relocations can affect performance. The former doesn’t affect performance but significantly increases the size of our toolchain.

We would like to implement support for a third approach, which we call, for lack of a better term, the “busybox” feature, where everything is compiled into a single binary which then dispatches to the appropriate tool depending on the first command. This approach can significantly reduce the size by deduplicating all of the shared code without affecting performance.

In terms of implementation, the build would produce a single binary called llvm, and the first command would identify the tool. For example, instead of invoking llvm-nm you’d invoke llvm nm. Ideally we would also support creating an llvm-nm symlink that redirects to llvm for backwards compatibility.
This functionality would ideally be implemented as an option in the CMake build that toolchain vendors can opt into.

The implementation would have to replace the main function of each tool with a regular entrypoint function that is registered in a tool registry. This could be wrapped in a macro for convenience. When the “busybox” feature is disabled, the macro would expand to a main function as before and redirect to the entrypoint function. When the feature is enabled, it would register the entrypoint function in the registry, which would be responsible for dispatching based on the tool name. Ideally, toolchain maintainers would also be able to control which tools are added to the “busybox” binary via CMake build options, so toolchains only include the tools they use.
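To make the registry idea concrete, here is a minimal sketch of how the macro-plus-registry scheme could look in the “busybox” configuration. All names here (LLVM_TOOL_MAIN, ToolRegistrar, the stand-in entrypoints and their return values) are hypothetical illustrations, not existing LLVM APIs:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry mapping tool names to entrypoint functions.
using ToolMain = int (*)(int argc, char **argv);

static std::map<std::string, ToolMain> &toolRegistry() {
  static std::map<std::string, ToolMain> Registry;
  return Registry;
}

// Registers one tool's entrypoint at static-initialization time.
struct ToolRegistrar {
  ToolRegistrar(const char *Name, ToolMain Fn) { toolRegistry()[Name] = Fn; }
};

// With the "busybox" feature enabled, the macro registers the entrypoint;
// with it disabled, it would instead expand to a real main() that forwards
// to the entrypoint function.
#define LLVM_TOOL_MAIN(NAME, ENTRY)                                          \
  static ToolRegistrar NAME##Registrar(#NAME, ENTRY)

// Two stand-in "tools" (return values chosen only so the dispatch is
// observable).
static int nmMain(int, char **) { return 10; }
static int objdumpMain(int, char **) { return 20; }
LLVM_TOOL_MAIN(nm, nmMain);
LLVM_TOOL_MAIN(objdump, objdumpMain);

// The dispatching main for `llvm <tool> [args...]`: look up the first
// command and forward the remaining arguments to that tool.
int dispatch(int argc, char **argv) {
  if (argc < 2)
    return 1; // no tool named
  auto It = toolRegistry().find(argv[1]);
  if (It == toolRegistry().end())
    return 1; // unknown tool
  return It->second(argc - 1, argv + 1);
}
```

A symlink-compatibility shim would then only need to map argv[0] onto the same registry lookup before falling back to the explicit subcommand.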

One implementation detail we think will be an issue is merging arguments in individual tools that use cl::opt. cl::opt works by maintaining global flag state, and we aren’t confident what the resulting behavior will be when merging the tools together under the dispatching main. What we would like to avoid is flags used by one specific tool becoming available in other tools. To address this issue, we would like to migrate all tools to OptTable, which doesn’t have this issue and is the general direction most tools have already been moving in.
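A hand-rolled simulation of the global-registration pattern behind cl::opt (this is not the real LLVM API, and the flag names are illustrative) shows why this is a concern: once two tools are linked into one binary, their statically registered flags all land in the same process-wide table, regardless of which tool was actually invoked:

```cpp
#include <cassert>
#include <map>
#include <string>

// A stand-in for cl::opt's global registration (NOT the real LLVM API):
// each Flag object adds itself to one process-wide table at
// static-initialization time.
static std::map<std::string, std::string *> &flagRegistry() {
  static std::map<std::string, std::string *> Registry;
  return Registry;
}

struct Flag {
  std::string Value;
  Flag(const std::string &Name) { flagRegistry()[Name] = &Value; }
};

// Imagine these definitions live in two different tools' source files
// (flag names are made up for illustration).
static Flag NmDemangle("demangle");         // "owned" by llvm-nm
static Flag ObjcopyFormat("output-format"); // "owned" by llvm-objcopy

// Once both tools are linked into one busybox binary, a shared parser
// sees both flags no matter which tool the user ran.
static bool flagVisible(const std::string &Name) {
  return flagRegistry().count(Name) != 0;
}
```

OptTable avoids this because each tool parses against its own statically generated option table rather than a shared global registry.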

A second issue would be resolving symlinks. For example, llvm-objcopy will check argv[0] and behave as llvm-strip (i.e. use the right flags and configuration) if it is called via a symlink that “looks like” a strip tool; in all other cases it will run in the default objcopy mode. The “looks like” function is usually an Is function copied across multiple tools that is essentially a substring check: symlinks like llvm-strip, strip.exe, and gnu-llvm-strip-10 all result in the strip “mode”, while all other names use the objcopy mode. To replicate this behavior, we will need to take great care that symlinks to the busybox tool dispatch correctly to the appropriate llvm tool, which might mean exposing and merging these Is functions.
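A simplified sketch of this kind of substring-based dispatch (the real tools also strip extensions and version suffixes via sys::path helpers; isStripTool and selectMode are illustrative names, not the actual functions):

```cpp
#include <cassert>
#include <string>

// The "Is" check is essentially a substring test on the program name, so
// llvm-strip, strip.exe, and gnu-llvm-strip-10 all count as strip tools.
static bool isStripTool(const std::string &Argv0) {
  return Argv0.find("strip") != std::string::npos;
}

// Busybox dispatch for the objcopy family: honor the symlink name if it
// looks like a strip tool, otherwise fall back to the default objcopy mode.
static std::string selectMode(const std::string &Argv0) {
  return isStripTool(Argv0) ? "strip" : "objcopy";
}
```

The open question is whether these checks stay duplicated per tool or get hoisted into the busybox dispatcher itself.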

Some open questions:

  • People’s initial thoughts/opinions?

  • Are there existing tools in LLVM that already do this?

  • Other implementation details/global states that we would also need to account for?

- Leonard

Some open questions:
- People's initial thoughts/opinions?

I think it's an interesting idea. My main concern is that adding a new CMake
option for this is going to complicate the build system and make future CMake
improvements more difficult.

Do you have any idea of how much performance /
toolchain size gain you will get from this approach?

-Tom

I think it’s an interesting idea. My main concern is that adding a new CMake
option for this is going to complicate the build system and make future CMake
improvements more difficult.

That’s fair. I’m working on a WIP version now and attempting to minimize the amount of CMake changes. Ideally, this would be controlled behind a single CMake option that doesn’t change end-user behavior, and we would have an upstream buildbot that enables this flag and ensures tools dispatched through busybox work as-is.

Do you have any idea of how much performance /
toolchain size gains you will get from this approach?

Locally we’ve found that resolving dynamic relocations takes about 20% of the runtime for various dynamically linked LLVM tools. We’d have to double-check whether this is still the case, because recently there have been some changes around semantic interposition that may help with this. I’m working on a WIP version that we can compare against for size (that is, the size of separate tools + LLVM shared libs vs. the combined busybox size).

Hello all,

When building LLVM tools, including Clang and lld, it's currently possible
to use either static or shared linking for LLVM libraries. The latter can
significantly reduce the size of the toolchain since we aren't duplicating
the same code in every binary, but the dynamic relocations can affect
performance. The former doesn't affect performance but significantly
increases the size of our toolchain.

The dynamic relocation claim is not true.

A thin executable using just -Bsymbolic libLLVM-13git.so is almost
identical to a mostly statically linked PIE.

I added -Bsymbolic-functions to libLLVM.so and libclang-cpp.so which
has claimed most of the -Bsymbolic benefits.

The shared object approach *can be* inferior to static linking plus
-Wl,--gc-sections, because with libLLVM.so and libclang-cpp.so we are
making many, many APIs dynamic, and that inhibits the --gc-sections
benefits. However, if clang and lld are shipped together with
llvm-objdump/llvm-readobj/llvm-objcopy/..., I expect the non-GCable
code due to shared objects will be significantly smaller.

I am conservative about adding yet another mechanism.


Some open questions:
- People's initial thoughts/opinions?
- Are there existing tools in LLVM that already do this?
- Other implementation details/global states that we would also need to
account for?

crunchgen. As you said, the argv[0] checking code needs to be taken care of.
We should make these executables' main files not have colliding symbols;
I have cleaned up a lot of files.

Dear Leonard et al.,

Will Dietz built a multiplexing tool using LLVM that does just this: it takes several programs and merges them together into one busybox-esque program that determines which main() function to call based on the argv[0] string.

The relevant paper is here: https://dl.acm.org/doi/abs/10.1145/3276524.

Will included the multiplexer code in the ALLVM code base. You can look at it here: https://publish.illinois.edu/allvm-project/software/. I believe the GitHub link is https://github.com/allvm/allvm-tools. I’ve been told that the code was built with LLVM 4.0, so it’d need to be updated to mainline.

I haven’t used it myself, but the idea of having LLVM multiplex itself seems cool, and it might make sense to give LLVM the ability to multiplex programs instead of expending effort doing it manually for LLVM and only getting the benefit in LLVM.

Regards,

John Criswell

A few points.

In an ideal ELF world only external function calls need PLT entries.
Currently shared objects have PLT entries for in-dso function calls
because default visibility non-local symbols are preemptible by default
and the linker will produce PLT entries. -Bsymbolic-functions suppresses
PLT entries for in-dso symbols.

Do you have a plan for Windows? Symlinks on Windows are mostly limited to administrators and developer mode.

For pure compatibility purposes, in place of symlinks we could have facade executables on Windows. But that isn’t favorable in terms of performance: the cost of launching additional executables is quite high on Windows. I wonder if the LLVM installer could have a way to switch between both schemes: if admin mode is available, create symlinks; otherwise fall back to facades.

Hello Leonard,

That is a very interesting idea! This will particularly favor Windows, where the LLVM bin/ folder is huge (3.5 GiB) since we don’t have working symlinks out of the box. This also goes in the direction we are pursuing: having Clang and LLD together in an embedded application as suggested by llvm-buildozer [1], though we’re also considering the multi-threading aspect. We took a different route for now, which is loading the existing executables as shared libraries inside our application, but our concern was less the binary size on disk and more the runtime performance (build time).

Regarding migrating every option to OptTable, are you suggesting removing cl::opt and CommandLineParser altogether? I count 3,597 instances of cl::opt in the whole monorepo. This could be a tedious task even with automation, since it would need some level of classification into the appropriate .td file. What would be the approach for the migration? To alleviate the issue of cl::opts crossing tool domains, could we temporarily auto-generate a dictionary of the cl::opts available for each tool? That could be a quick intermediary step while waiting for a complete migration.

One other issue I can see is symbols clashing at link time. Having everything in the same executable requires internal ABI compatibility throughout, i.e. compiling with the same #defines and linking with the same (system) libraries. I’m wondering whether any analysis was done in that regard? But maybe that is not an issue.

Best,

Alex.

[1] https://reviews.llvm.org/D86351

Small update: I have a WIP prototype of the tool at https://reviews.llvm.org/D104686. The prototype only includes llvm-objcopy and llvm-objdump packed together, but we’re seeing size benefits from busyboxing those two compared against having two separate tools. (More details in the prototype’s description.) I don’t plan on landing this as-is anytime soon, and there are still some things I’d like to improve/change and get feedback on.

To answer some replies:

  • Ideally, we could start off with an incremental approach and not package large tools like clang/lld off the bat. The llvm-* tools seem like a good place to start since they’re generally a bunch of relatively small binaries that all share a subset of functions in libLLVM, but don’t necessarily use all of libLLVM, so statically linking them together (with --gc-sections) can help dedup a lot of shared components vs having separate statically compiled tools. In my measurements, the busybox tool containing llvm-objcopy+objdump is negligibly larger than llvm-objdump on its own (a couple KB difference) indicating a lot of shared code between objdump and objcopy.

  • Will Dietz’s multiplexing tool looks like a good place to start from. The only concern I can see though is mostly the amount of work needed to update it to LLVM 13.

  • We don’t have plans for Windows support right now, but it’s not off the table. (We’ve been mostly focusing on *nix for now.) Depending on overall traction for this idea, we could approach it incrementally and add support for different platforms over time.

  • I’m starting to think the cl::opt to OptTable issue might be orthogonal to the busybox implementation. The tool essentially dispatches to different “main” functions in different tools, but as long as we don’t do anything within busybox after exiting that tool’s main, then the global state issues we weren’t sure of with cl::opt might not be of any concern now. It may be an issue down the line if, let’s say, the tool flags moved from being “owned” by the tools themselves to instead being “owned” by busybox, and then we’d have to merge similarly-named flags together. In that case, migrating these tools to use OptTable may be necessary since (I think) OptTable should handle this. This may be a tedious task, but this is just to say that busybox won’t need to be immediately blocked on it.

  • I haven’t seen any issues with colliding symbols when linking (although I’ve only merged two tools for now). I suspect that with small-ish llvm-* tools, the bulk of their code is shared from libLLVM, and they have their own distinct logic built on top of it, which could mean a low chance of conflicting internal ABIs.

Small update: I have a WIP prototype of the tool at
https://reviews.llvm.org/D104686. The prototype only includes llvm-objcopy
and llvm-objdump packed together, but we're seeing size benefits from
busyboxing those two compared against having two separate tools. (More
details in the prototype's description.) I don't plan on landing this as-is
anytime soon and there's still some things I'd like to improve/change and
get feedback on.

To answer some replies:

- Ideally, we could start off with an incremental approach and not package
large tools like clang/lld off the bat. The llvm-* tools seem like a good
place to start since they're generally a bunch of relatively small binaries
that all share a subset of functions in libLLVM, but don't necessarily use
all of libLLVM, so statically linking them together (with --gc-sections)
can help dedup a lot of shared components vs having separate statically
compiled tools. In my measurements, the busybox tool containing
llvm-objcopy+objdump is negligibly larger than llvm-objdump on its own (a
couple KB difference) indicating a lot of shared code between objdump and
objcopy.

- Will Dietz's multiplexing tool looks like a good place to start from. The
only concern I can see though is mostly the amount of work needed to update
it to LLVM 13.

- We don't have plans for Windows support now, but it's not off the table.
(Been mostly focusing on *nix for now). Depending on overall traction for
this idea, we could approach incrementally and add support for different
platforms over time.

-DLLVM_LINK_LLVM_DYLIB=on -DCLANG_LINK_CLANG_DYLIB=on -DLLVM_TARGETS_TO_BUILD=X86 (custom1)
vs
-DLLVM_TARGETS_TO_BUILD=X86 (custom2)

# This is the lower bound for any multiplexing approach. clang is the largest executable.
% stat -c %s /tmp/out/custom2/bin/clang-13
102900408

I have built clang, lld and a bunch of ELF binary utilities.

% stat -c %s /tmp/out/custom1/lib/libLLVM-13git.so /tmp/out/custom1/lib/libclang-cpp.so.13git /tmp/out/custom1/bin/{clang-13,lld,llvm-{ar,cov,cxxfilt,nm,objcopy,objdump,readobj,size,strings,symbolizer}} | awk '{s+=$1}END{print s}'
138896544

% stat -c %s /tmp/out/custom2/bin/{clang-13,lld,llvm-{ar,cov,cxxfilt,nm,objcopy,objdump,readobj,size,strings,symbolizer}} | awk '{s+=$1}END{print s}'
209054440

The -DLLVM_LINK_LLVM_DYLIB=on -DCLANG_LINK_CLANG_DYLIB=on build is doing a really good job.

A multiplexing approach can squeeze some bytes from 138896544 toward 102900408,
but how much can it do?
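Working those numbers through: the dylib build already saves roughly 70 MB over the fully static build, and a multiplexed binary could at most close the remaining gap down to the size of clang alone, about 36 MB (roughly a quarter of the dylib total). A quick sketch of the arithmetic, with the measurements above restated in comments:

```cpp
#include <cassert>

// custom2: sum of the fully static clang, lld, and binary utilities.
const long long StaticTotal = 209054440;
// custom1: the same tools plus libLLVM-13git.so and libclang-cpp.so.
const long long DylibTotal = 138896544;
// clang-13 alone: the lower bound for any multiplexed binary, since it is
// the largest single executable.
const long long LowerBound = 102900408;

// What the dylib build already saves relative to static linking.
long long dylibSavings() { return StaticTotal - DylibTotal; }

// The most a multiplexing approach could still recover on top of that.
long long multiplexHeadroom() { return DylibTotal - LowerBound; }
```
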

- I'm starting to think the `cl::opt` to `OptTable` issue might be
orthogonal to the busybox implementation. The tool essentially dispatches
to different "main" functions in different tools, but as long as we don't
do anything within busybox after exiting that tool's main, then the global
state issues we weren't sure of with `cl::opt` might not be of any concern
now. It may be an issue down the line if, let's say, the tool flags moved
from being "owned" by the tools themselves to instead being "owned" by
busybox, and then we'd have to merge similarly-named flags together. In
that case, migrating these tools to use `OptTable` may be necessary since
(I think) `OptTable` should handle this. This may be a tedious task, but
this is just to say that busybox won't need to be immediately blocked on it.

Such improvement is useful even if we don't do multiplexing.
I switched llvm-symbolizer. thakis switched llvm-objdump.
I can look at some binary utilities.

From our perspective as a toolchain vendor, even if using shared libraries could get us closer to static linking in terms of performance, we’d still prefer static linking for the ease of distribution. Dealing with a single statically linked executable is much easier than dealing with multiple shared libraries. This is especially important in distributed compilation environments like Goma.

When comparing performance between static and dynamic linking, I’d also recommend doing a comparison between binaries built with PGO+LTO. Plain -O3 leaves a lot of performance on the table and as far as I’m aware, most toolchain vendors use PGO+LTO.

From our perspective as a toolchain vendor, even if using shared libraries could get us closer to static linking in terms of performance, we’d still prefer static linking for the ease of distribution. Dealing with a single statically linked executable is much easier than dealing with multiple shared libraries. This is especially important in distributed compilation environments like Goma.

What makes it especially complicated for distributed compilation environments? (I’d expect a toolchain contains so many files that whether it’s one binary, or a binary and a handful of shared libraries wouldn’t change the general implementation complexity of a distributed build system?)

I guess this depends on a particular implementation of the distributed build system. In the case of Goma, we only supply the compiler binary which was invoked as the command (that binary links glibc as a shared library but we assume that one is supplied by the host system), all other files like headers are passed together with the compiler invocation as inputs. If we used dynamic linking, Goma would need to figure out what other shared libraries need to be sent to the server. It’s certainly doable but it’s an extra complexity we would like to avoid.

I guess this depends on a particular implementation of the distributed build system. In the case of Goma, we only supply the compiler binary which was invoked as the command (that binary links glibc as a shared library but we assume that one is supplied by the host system), all other files like headers are passed together with the compiler invocation as inputs. If we used dynamic linking, Goma would need to figure out what other shared libraries need to be sent to the server. It’s certainly doable but it’s an extra complexity we would like to avoid.

Curious/fair enough - good to know!

I guess this depends on a particular implementation of the distributed build system. In the case of Goma, we only supply the compiler binary which was invoked as the command (that binary links glibc as a shared library but we assume that one is supplied by the host system), all other files like headers are passed together with the compiler invocation as inputs. If we used dynamic linking, Goma would need to figure out what other shared libraries need to be sent to the server. It's certainly doable but it's an extra complexity we would like to avoid.

For non-clang executables, -DLLVM_LINK_LLVM_DYLIB=on just adds one
more DT_NEEDED entry.
The DT_NEEDED entry can use a $ORIGIN-based DT_RUNPATH. Can Goma
detect the libraries shipped with the tools?
I ask because I feel this could be an artificial limitation which
could be straightforwardly addressed in Goma.
A toolchain executable using an accompanying shared object is not rare
(think of plugins).

Multiplexing LLVM tools is one alternative but I am a bit concerned
with the extra complexity and the new configuration the build system
needs to support.

https://lists.llvm.org/pipermail/llvm-dev/2021-June/151338.html
mentioned another approach which doesn't require intrusive
modification to the tools.

As for PGO+LTO, you can apply them to libLLVM-13git.so as well.

Is that a problem? Installers generally run with administrator rights (choco, for example, requires running from an Administrator PowerShell and that's how most folks I know install LLVM on Windows).

Developers generally need to enable developer mode if they want to run things that they've built (and doing so is a single toggle switch in Settings, so it's not a massive obstacle). It should be fairly easy to try running mklink during CMake if this option is enabled and, if it fails, error out and tell the person running the build to either enable developer mode or switch to separate-program builds.

David

I agree that the official installation case probably isn't an issue.

There are unofficial installation cases that are more annoying. I wouldn't be able to just zip up my llvm dir and hand it to someone else to unzip like I can today.

The just-built case is a bigger deal. I do most of my development on Windows from a standard account (non-admin, non-developer). That's largely by choice, but some IT departments are much more picky. If I need to install something, then I open a distinct admin command prompt.

Requiring development mode to be turned on for LLVM dev is similar to requiring Linux devs to build as root (or at least making a few new programs setuid root).

I agree that the official installation case probably isn’t an issue.

There are unofficial installation cases that are more annoying. I wouldn’t be able to just zip up my llvm dir and hand it to someone else to unzip like I can today.

The just-built case is a bigger deal. I do most of my development on Windows from a standard account (non-admin, non-developer). That’s largely by choice, but some IT departments are much more picky. If I need to install something, then I open a distinct admin command prompt.

Requiring development mode to be turned on for LLVM dev is similar to requiring Linux devs to build as root (or at least making a few new programs setuid root).

None of this would be required - it looks like the discussion is only about an optional build mode that would be opt-in and beneficial to some folks.

Some thoughts: if we’re getting into PGO+LTO territory, I feel that both methods presented here will be at a disadvantage compared to building clang and lld as their own binaries.
For example, I remember that on Mac an important optimization for clang builds was to order the functions in the binary roughly in the order in which they are first encountered during execution; assuming the same behavior for lld, you can see the conflicting optimization goals… You can also think about how libSupport may be differently “hot” in a clang PGO profile compared to an lld one, resulting in different optimizations.

LTO also benefits from “internalizing”: building a static binary where only main is exported and everything else gets internal linkage is the best case, since pointer-escape analysis, global analysis, etc. all become more powerful. Optimizing a shared library effectively makes every symbol public, and I suspect the busybox approach may be better in this respect (you get back to a single public main, though it can reach much more code).