Improving the reproducibility of linker benchmarking

A side project I’ve been working on recently is to make it easier to benchmark lld (as well as other linkers like mold and wild) by creating a common set of reproducer benchmarks. The idea is to create a new version of the lld-speed-test benchmarks, which were used in the early years of lld development and are still mentioned in the README. Those benchmarks were essentially a tree of reproducers collected from the binaries of various large projects built with lld’s --reproduce flag.

The goals for the new version of the benchmark suite are:

  1. More architectures. The previous benchmarks were for x86 only but other architectures may have different performance characteristics due to features like range extension thunks (AArch64, RISC-V) and size relaxation (RISC-V).
  2. Easy to maintain. It should be possible to build all of the benchmarks with a single command that can work on any Linux machine without needing to manually install dependencies, and it should be easy to update the benchmarks to use new projects or new versions of projects as needed.
  3. Reproducible. Any machine should be able to build a benchmark suite that is bit-identical to one built on any other machine, so there is consensus about the contents of the benchmarks. Ideally, the only variables when reproducing benchmark results should be the linker code itself and the hardware.

I think these goals are achievable with the approach I decided on. The idea is to use Nix to build the projects with a custom linker wrapper that collects the reproducers. Nix builds are (at least in theory) reproducible, and all projects are built in a fairly uniform way, which makes it easy to inject the linker wrapper into any project without knowing project-specific details. Nix also supports cross-compilation, so the same machine can be used to build reproducers for all architectures (although some bugs currently mean that not all projects can be cross-compiled and not all projects end up using the linker wrapper). And we can upgrade project versions by moving to a new version of Nixpkgs (Nix’s “ports tree”).

I made all of this work in this LLVM fork and you can try it by following the instructions in the collect.nix file. (N.B. I’ve only tried building the projects on an aarch64 machine so far.) I’d like to upstream the reproducer builder code once it works with the upstream Nixpkgs (I still have some outstanding PRs to Nixpkgs).

That sounds amazing, and very exciting.

At the risk of derailing, I would love to collaborate on an extension to the ELF relocatable object file format. The idea is really simple: we should content-hash .strtab, and the static linker should use those hashes instead of hashing .strtab online, since it’s such a large fraction of the input. A more disruptive, potentially optional, second step is to compress .strtab. Output images would remain the same; this is merely a transparent optimization handshake between the compiler and linker, in the spirit of CREL. However, real performance engineering requires profiling and benchmarks rather than guesswork, so what you’ve proposed here is the right first step.
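
To make the handshake concrete, here is a rough sketch of the kind of per-string record I have in mind; the section name and layout are purely illustrative, not a proposal for the actual encoding:

```cpp
// Hypothetical companion section emitted by the compiler next to .strtab.
// The ".strtab.hash" name and this record layout are illustrative only; the
// point is that the linker receives one precomputed content hash per string,
// so it never has to hash the raw bytes itself on the hot path.
#include <cstdint>

struct StrtabHashEntry {
  uint64_t Hash;   // content hash of the string's bytes
  uint32_t Offset; // byte offset of the string within .strtab
};

struct StrtabHashHeader {
  uint32_t Version;    // format version of the handshake
  uint32_t EntryCount; // number of StrtabHashEntry records that follow
};
```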

Great idea. The lld-speed-test benchmarks were very useful as far as they went, but having the flexibility of a source build would be a large improvement.

In particular, being able to enable options such as debug information on large examples would be useful.

That sounds interesting. I guess the idea is that we would use the hashes for hash table insertion and only read/uncompress .strtab to diagnose errors? Since we would need the input .strtab to produce the output .strtab, it’s unclear that compressing it would be beneficial. So far, lld, mold and wild have each shown that major performance improvements are possible without changing the file format, so I would probably continue to look at algorithmic improvements to begin with.
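
For illustration, here is roughly how I would expect hash-keyed insertion with deferred byte access to look; the names, and the assumption that a string length is carried alongside its hash, are made up for this sketch:

```cpp
// Sketch: merge input strings into the output .strtab keyed by the
// precomputed content hash, without touching the string bytes until we copy
// them out at the end or need to report an error. All names are illustrative.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct InputString {
  uint64_t Hash;   // precomputed content hash from the object file
  uint32_t Length; // assumed to be carried alongside the hash
  uint32_t File;   // index of the input file (to find the bytes later)
  uint32_t Offset; // offset of the string within that file's .strtab
};

class StringTableBuilder {
public:
  // Returns the output .strtab offset for this string, reserving space if new.
  uint32_t add(const InputString &S) {
    auto [It, Inserted] = OffsetForHash.try_emplace(S.Hash, NextOffset);
    if (Inserted) {
      Pending.push_back(S);        // remember where the bytes live
      NextOffset += S.Length + 1;  // reserve space, including the NUL
    }
    return It->second;
  }

private:
  std::unordered_map<uint64_t, uint32_t> OffsetForHash;
  std::vector<InputString> Pending; // strings to read/uncompress when writing
  uint32_t NextOffset = 1;          // offset 0 is the empty string
};
```

Note that this treats the 64-bit hash as an identity; a real implementation would have to decide whether to verify the bytes on apparent collisions or accept the small risk of merging distinct strings.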

By the way, one profiling technique that I recently became aware of is Dmitry Vyukov’s latency profiling; since lld only uses parallelism in certain places, it could be useful for identifying the latency hot spots that would benefit most from optimization.

This is certainly possible with Nix. For debug information specifically, I think we should have the reproducer package be built with debug info for all projects by default, and users may pass -S at link time to benchmark without debug info. Actually, I thought that this was what Nix was already doing by default, but I just noticed that only Hello, Firefox and Ladybird have debug info in their object files, so it looks like it’s actually up to the project and we may want to override the other projects.


What about extending -fcrash-diagnostics= to have a -fcrash-diagnostics=linker mode and then using -fcrash-diagnostics=linker -gen-reproducer=always -fcrash-diagnostics-dir=/some/place/known? Plumbing those few switches seems preferable to using a wrapper script for the linker.

If we wrote the reproducers to a directory, I think that would require custom integration for every project (for reproducibility, Nix runs all builds inside a chroot with limited filesystem access, so something would need to copy them out and add them to the store). All we can really rely on is that the final binaries are copied to the store, so the simplest way to smuggle the reproducer out of a project’s build system, one that should work for ~all projects, is to add the reproducer to a section of the binary.

Collecting reproducers in this way is a niche use case, so it seems unclear that it’s worth extending the tools themselves to support it. A wrapper script gives us the flexibility to define and change the contract between reproducer collection and extraction without needing to fix those details in the compiler.

My theory is that, to produce the output string table, you never need to decompress all input string tables at once in memory. You can determine which strings you need up front, dealing only in content hashes, and then run streaming decompression over the input string tables, selecting only the strings you need to carry over into the output string table.
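
Concretely, something like the following two-pass scheme (the names are made up, and decompress() stands in for a real streaming codec such as zstd):

```cpp
// Sketch of the two-pass idea: decide which strings are needed purely by
// hash, then walk the inputs one at a time, decompressing each .strtab into
// a scratch buffer, copying over only the needed strings, and discarding the
// buffer before moving on. At no point are all input string tables resident.
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct InputStrtab {
  std::vector<uint64_t> Hashes;      // one precomputed hash per string
  std::vector<char> CompressedBytes; // compressed .strtab payload
};

// Stand-in for real streaming decompression (zstd/zlib); here it just copies.
static std::vector<char> decompress(const std::vector<char> &In) { return In; }

std::string buildOutputStrtab(const std::vector<InputStrtab> &Inputs,
                              const std::unordered_set<uint64_t> &Needed) {
  std::string Out(1, '\0');                     // offset 0 = empty string
  std::unordered_map<uint64_t, size_t> Emitted; // hash -> output offset

  for (const InputStrtab &In : Inputs) {
    std::vector<char> Buf = decompress(In.CompressedBytes);
    const char *P = Buf.data();
    for (uint64_t H : In.Hashes) {
      std::string_view S(P); // NUL-terminated string
      P += S.size() + 1;
      if (Needed.count(H) && !Emitted.count(H)) {
        Emitted.emplace(H, Out.size());
        Out.append(S);
        Out.push_back('\0');
      }
    }
    // Buf goes out of scope here, so only one decompressed table is live.
  }
  return Out;
}
```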

This should result in more compact inputs, less FS cache memory usage, and a lower working set size for the overall link job. At Google, linker input size is a major constraint, so compressing the string table has a lot of value. The LLVM community at large might prefer to leave strings uncompressed and spend disk / FS cache memory to save compute.

Still, offloading the string hashing from the linker to the compiler, where the work is inherently parallel, is attractive, and for inputs that arrive without precomputed hashes it’s easy to compute them up front in parallel before proceeding with string table merging.
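
For inputs without precomputed hashes, something along these lines would do; FNV-1a and the per-file task scheme are just placeholders for whatever hash and parallelism the implementation settled on:

```cpp
// Sketch: for object files that don't carry precomputed hashes (older
// compilers, assembler output), hash their strings up front, one task per
// input, before the serial string-table merge begins.
#include <cstdint>
#include <future>
#include <string_view>
#include <vector>

static uint64_t fnv1a(std::string_view S) { // placeholder content hash
  uint64_t H = 0xcbf29ce484222325ull;
  for (unsigned char C : S) {
    H ^= C;
    H *= 0x100000001b3ull;
  }
  return H;
}

// One entry per input file: the strings from its .strtab.
using Strings = std::vector<std::string_view>;

std::vector<std::vector<uint64_t>>
hashMissingInputs(const std::vector<Strings> &Inputs) {
  std::vector<std::future<std::vector<uint64_t>>> Tasks;
  Tasks.reserve(Inputs.size());
  for (const Strings &In : Inputs)
    Tasks.push_back(std::async(std::launch::async, [&In] {
      std::vector<uint64_t> Hashes;
      Hashes.reserve(In.size());
      for (std::string_view S : In)
        Hashes.push_back(fnv1a(S));
      return Hashes;
    }));

  std::vector<std::vector<uint64_t>> Result;
  Result.reserve(Tasks.size());
  for (auto &T : Tasks)
    Result.push_back(T.get());
  return Result;
}
```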