`llvm-dsymutil` performance

Hey everyone,

Recently, on our project, where we release a dozen or so binaries targeting macOS and Linux, we noticed that llvm-dsymutil is a huge bottleneck when producing our macOS release builds.

The general order of magnitude we’re seeing can be reproduced by building parts of LLVM as well. For example, linking clang in RelWithDebInfo takes a second or two on my Mac. Running dsymutil afterwards takes over 1.5 minutes:

$ time bin/dsymutil bin/clang --flat -o /tmp/default
dsymutil bin/clang --flat -o   133.40s user 7.59s system 143% cpu 1:38.37 total

I did a quick test with the --linker=parallel mode, which still seems to take more time than I would expect:

$ time bin/dsymutil bin/clang --flat -o /tmp/parallel --linker=parallel
bin/dsymutil bin/clang --flat -  216.44s user 24.04s system 531% cpu 45.275 total

It also looks like the parallel mode has a few bugs today, which likely means it should be avoided for now.

I’m curious to hear from folks if anyone is actively working on performance improvements here, if there are any quick wins folks have found to improve things in the meantime, or if this performance is inherent and something we have to live with.

Thanks!

I’ve been the maintainer of dsymutil since I joined Apple, so I think I’m probably best positioned to provide some context.

The story for debugging on Apple platforms was purposely designed to make the compile-link-debug cycle as fast as possible. Not having to deal with DWARF during linking (which used to be the least parallelizable step) played a key role in that.

For those that are not familiar: the idea is that when you’re debugging locally, we don’t link DWARF at all, and the debugger finds the unlinked DWARF in the object files. However, that doesn’t work if you don’t have the object files around, so for releases, you need a way to archive the debug info. That’s where dsymutil comes in. But it’s doing more than just linking the DWARF together; it’s an optimizing linker which uses the One-Definition Rule to unique the linked debug info. That’s where most of the time is spent. The result is fast development builds, and slower archive builds with smaller DWARF.
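To make the ODR-based uniquing concrete, here is a deliberately tiny sketch of the idea, not dsymutil's actual code: type definitions that appear in many compilation units are collapsed to one canonical copy, and references from the duplicates are redirected to it. The DIE representation (an id plus a qualified type name) is a stand-in; real DIEs are trees of tags and attributes.

```python
# Hypothetical sketch of ODR-based uniquing; real DIEs are trees, and real
# uniquing must also rewrite every attribute reference that pointed at a
# duplicate so it points at the canonical copy instead.

def odr_unique(dies):
    """Keep one canonical copy per ODR-unique type name and build a
    remapping table from duplicate DIEs to their canonical copy."""
    canonical = {}   # qualified type name -> canonical DIE id
    remap = {}       # duplicate DIE id -> canonical DIE id
    kept = []
    for die_id, type_name in dies:
        if type_name in canonical:
            remap[die_id] = canonical[type_name]  # duplicate: redirect refs
        else:
            canonical[type_name] = die_id
            kept.append((die_id, type_name))
    return kept, remap

# Example: std::string's debug info emitted by three compilation units.
dies = [(1, "std::string"), (2, "MyType"), (3, "std::string"), (4, "std::string")]
kept, remap = odr_unique(dies)
# kept  -> [(1, "std::string"), (2, "MyType")]
# remap -> {3: 1, 4: 1}
```

The payoff is exactly the trade-off described above: the archive build pays for this deduplication pass, and in exchange the linked DWARF is much smaller than the sum of its inputs.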

Another design goal of the original dsymutil implementation was to stream the DWARF, both its input and output. The input DWARF can easily exceed your available memory. I don’t think the output is realistically a concern anymore, and despite its name, MCStreamer isn’t truly streaming anyway. The DWARF linking algorithm used to be single-threaded and is now done by two threads in lockstep. We also process different architectures in parallel.

The biggest challenge with dsymutil is its qualification. When Fred upstreamed our original internal implementation, he did so by generating bug-for-bug identical DWARF. I did the same thing when implementing the lockstep algorithm mentioned above. We have no good way to assess the quality of the generated DWARF. It’s relatively easy to spot-check small things, like we do in our tests, but the really tricky issues only start showing up at debug-time, when the debugger starts misbehaving.

I see two options to speed up dsymutil:

  1. Stick to binary compatibility of the generated DWARF. Over the years, I’ve spent quite some time thinking about this, and I can’t come up with a plan that would move the needle significantly enough to be worth the investment. I brainstormed with the folks that worked on the linker, and we had some ideas, but nothing that would really help unless you change the generated output or are willing to keep potentially very large amounts of DWARF in memory.
  2. Find a way to qualify the generated DWARF by semantically comparing the debug info. Alexey’s parallel linker doesn’t rely on binary compatibility. However, I don’t have a plan for this mythical debug info semantic diffing tool.

As Alexey mentions in the issue you linked, ODR uniquing in parallel dwarflinker is still in an experimental state. The generated DWARF is non-deterministic, which is a show-stopper for us and anyone that cares about reproducible builds. I vaguely remember that we had a plan to solve it, but Alexey’s contract ended before he was able to get around to it.

Personally, I would recommend pursuing option (2). The parallel linker is passing all the existing tests, and sticking to binary compatibility won’t scale forever. I also believe that such a tool to semantically compare DWARF would benefit the whole debug info community, not just dsymutil.

I don’t have the bandwidth to invest in this in the short term. I would be more motivated to explore the qualification strategy if ODR uniquing was working, or work on the ODR uniquing if I had a qualification strategy. But without a concrete plan for either, it’s hard to prioritize this over my other responsibilities.


Thanks for all the context. Can you share some more thoughts on the specifics of how the validation tool would work?

In the past, when I was debugging this type of thing, I diffed literal dwarfdump output (which obviously isn’t resilient to many types of innocuous changes). Would this tool work similarly in spirit, after filtering out things that aren’t meaningful changes? On the surface this sounds OK, but I’m curious which subtle differences actually cause meaningful differences (for example, is the same info in a different order considered identical?).

Do you imagine this as a new tool or part of one of the existing tools? Would you expect this tool to act like dwarfdump but output normalized data that the user could diff, or take multiple binaries for comparison and do the diffing internally?

Sorry for the delay. I’m still catching up after the holidays.

Conceptually, I think such a comparison tool could be part of llvm-dwarfdump, but it could also stand on its own (llvm-dwarfdiff?). I imagine the audiences for the two tools would overlap significantly, but the implementations likely not so much. While it’s certainly possible to generate an intermediate representation, dump it, and then diff the output, I would expect it to be more efficient and flexible for the tool itself to perform the comparison. There are arguments for each approach.

Although DWARF looks like a tree, it is really a directed acyclic graph, due to the way attributes can reference one another. I would expect the diffing tool to rely on some form of graph-matching algorithm. A naive (and horribly inefficient) approach would be to take every node (DW_TAG_*) and compare all nodes reachable via either parent–child relationships or attribute references (basically how lookForDIEsToKeep walks the DWARF). There are certainly more efficient algorithms that avoid revisiting the same nodes repeatedly.
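The naive walk described above can be sketched in a few lines. This is a toy model, not LLVM's API: the `DIE` class, attribute names, and matching rules are all simplified stand-ins, and memoizing visited pairs is the minimum needed to avoid re-walking shared subgraphs of the DAG.

```python
# Toy sketch of the naive DIE comparison described above. "Same" here means
# identical tag, identical scalar attributes, and pairwise-same nodes
# reachable via children and attribute references.

class DIE:
    def __init__(self, tag, attrs=None, children=None, refs=None):
        self.tag = tag                  # e.g. "DW_TAG_structure_type"
        self.attrs = attrs or {}        # scalar attributes, e.g. DW_AT_name
        self.children = children or []  # parent-child edges
        self.refs = refs or []          # attribute references (DW_AT_type, ...)

def same_die(a, b, seen=None):
    """Compare two DIEs and everything reachable from them, via both
    parent-child edges and attribute references. Memoize visited pairs so
    shared subgraphs (the DAG part) aren't walked repeatedly."""
    seen = seen if seen is not None else set()
    key = (id(a), id(b))
    if key in seen:
        return True  # this pair is already on the comparison path
    seen.add(key)
    if a.tag != b.tag or a.attrs != b.attrs:
        return False
    if len(a.children) != len(b.children) or len(a.refs) != len(b.refs):
        return False
    return all(same_die(x, y, seen) for x, y in zip(a.children, b.children)) \
       and all(same_die(x, y, seen) for x, y in zip(a.refs, b.refs))

int_t = DIE("DW_TAG_base_type", {"DW_AT_name": "int"})
a = DIE("DW_TAG_variable", {"DW_AT_name": "x"}, refs=[int_t])
b = DIE("DW_TAG_variable", {"DW_AT_name": "x"},
        refs=[DIE("DW_TAG_base_type", {"DW_AT_name": "int"})])
# same_die(a, b) -> True: structurally identical, despite distinct objects
```

Note that this sketch compares children positionally; a real tool would also have to decide whether reordered-but-equivalent children count as identical, which is exactly one of the open questions raised above.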

I’m also glossing over some of the implementation details, such as how to determine that nodes are the same. There are also some special cases to handle (like forward declarations), though likely those things are already special-cased/handled in dsymutil.

Yeah, I’d love a good semantic DWARF diffing tool for this and other reasons.

You could rely on linkage names or some other unique identifiers (qualified names (have to rebuild those from the namespace scopes, maybe with template parameters added if simplified template names is used, etc) of entities that have external linkage) to pin certain nodes together between the two graphs - that might make the diffing easier.
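The pinning idea above might look something like the following sketch. The attribute name and the flat dictionary "DIEs" are illustrative (this is not LLVM's API): it just pairs up entities by `DW_AT_linkage_name` so each pair can seed an anchored graph comparison, and reports entities present on only one side.

```python
# Rough sketch of pinning nodes between two DWARF graphs by linkage name.
# Each "DIE" is a plain dict here; real code would index actual DIEs.

def pin_by_linkage_name(dies_a, dies_b):
    """Pair DIEs that carry the same DW_AT_linkage_name; these pairs can
    seed a graph diff so each comparison starts from an anchored root."""
    index_a = {d["DW_AT_linkage_name"]: d
               for d in dies_a if "DW_AT_linkage_name" in d}
    pinned, only_b = [], []
    matched = set()
    for d in dies_b:
        name = d.get("DW_AT_linkage_name")
        if name and name in index_a:
            pinned.append((index_a[name], d))  # anchored pair for the diff
            matched.add(name)
        elif name:
            only_b.append(d)                   # present only on side B
    only_a = [d for n, d in index_a.items() if n not in matched]
    return pinned, only_a, only_b

# Example with mangled names as the unique identifiers.
dies_a = [{"DW_AT_linkage_name": "_Z3foov", "name": "foo"},
          {"DW_AT_linkage_name": "_Z3barv", "name": "bar"}]
dies_b = [{"DW_AT_linkage_name": "_Z3foov", "name": "foo"},
          {"DW_AT_linkage_name": "_Z3bazv", "name": "baz"}]
pinned, only_a, only_b = pin_by_linkage_name(dies_a, dies_b)
# pinned pairs foo with foo; bar and baz fall out as one-sided entries
```

For entities without a linkage name, you'd fall back to the rebuilt qualified names mentioned above, which is where the namespace-scope and template-parameter reconstruction comes in.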

(The thing I really want in a graph diffing tool is the ability to focus on the “weight” of subgraphs, and especially any pinch-points/bottlenecks into such heavy subgraphs, particularly heavy subgraphs that aren’t present in one input or the other. That applies to DWARF, but also to the library dependency graph in a build, the linker symbol resolution graph, and maybe even the graph of reachable code with a profile. You could overlay these to see which parts of the library graph aren’t needed by the linker (so you could split files that have two unrelated functions with different dependencies, so callers of one don’t need to take a build dependency on the implementation of the other function), or compare the profile reachability graph to the link graph to see which parts of the code aren’t actually being run/maybe are dynamically unreachable and could be changed to be statically unreachable. Though even visualizing a single graph, without the comparison, could be useful for this: oh, I have this whole subgraph of DWARF that’s only reachable through this one/few references; perhaps I can figure out how to avoid that reference, either as a compiler user, or as a compiler developer looking to reduce the cost of DWARF generally.)
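The "weight" half of that wishlist is cheap to sketch for an acyclic reference graph: count how many nodes each node can reach. Nodes with a heavy reachable set but few incoming references are the candidate pinch-points. The graph shape and names here are illustrative, assuming a DAG as DWARF references are.

```python
# Sketch of the "subgraph weight" idea: for each node in an acyclic
# reference graph, count how many nodes are reachable from it (including
# itself). Memoization makes shared subgraphs cheap to account for.

def reachable_weight(graph):
    """graph: {node: [successors]} (assumed acyclic).
    Returns {node: size of the reachable set from that node}."""
    memo = {}
    def walk(n):
        if n not in memo:
            s = {n}
            for succ in graph.get(n, []):
                s |= walk(succ)
            memo[n] = s
        return memo[n]
    return {n: len(walk(n)) for n in graph}

# A CU referencing two variables that share one heavy type subgraph:
graph = {"cu": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
weights = reachable_weight(graph)
# weights -> {"cu": 4, "a": 2, "b": 2, "t": 1}
```

Spotting the pinch-points themselves is essentially a dominator computation on top of these weights, but even the raw reachable sizes would show where the heavy subgraphs hang.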


I used Claude to prototype a diffing tool: [llvm-dwarfdump] Add --compare for semantic DWARF diffing by JDevlieghere · Pull Request #194089 · llvm/llvm-project · GitHub

While I’ve reviewed the implementation and manually fixed some issues, I don’t want to claim by any means that it’s anywhere near production ready. Most of my testing was done on small binaries with a limited number of DIEs that I could easily reason about.

I ran it on clang and compared the output from dsymutil using the classic and parallel algorithms. Out of roughly 5 million DIEs, the tool identified about 50k as different. I haven’t verified those results, but assuming they are not off by an order of magnitude, it was sufficient encouragement to take a stab at the non-determinism in the parallel linker.

My plan is to investigate the differences and use that to decide whether the parallel linker is ready to become the new default for dsymutil. I haven’t decided yet whether I think the diffing support for dwarfdump warrants being upstreamed, but if folks think it would be useful for them, please let me know.

PS: For comparison, running dsymutil on clang takes roughly 3 minutes with the classic linker, compared to 40 seconds with the (deterministic) parallel linker.
