[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

  • Status: Draft

  • Company: Bytedance

Summary

Currently, the debug information updates slowly for large binaries, as it only performs concurrent processing for DWO. This RFC proposes to upstream a set of changes that significantly reduce the time cost of --update-debug-sections by parallelizing DWARF update work including mainbinaryCU and splitCU, while preserving deterministic output.

The proposal is based on an implementation currently carried in a PR 195058 and is intended to be discussed and iterated in the LLVM community before landing.

The approximate performance results are shown in the table below:

Split dwarf:

clang:

thread original(batchsize) ours speedup
8 29.07 7.96 3.6x
16 23.69 4.97 4.7x
32 19.85 4.21 4.7x
64 17.91 4.18 4.3x

Bytedance Program one(debug information size almost 6GB):

thread original(batchsize) ours speedup
8 882.30 228.71 3.9x
16 789.64 153.99 5.1x
32 691.84 152.67 4.5x
48 658.89 163.93 4.0x

Bytedance Program two:

thread original(batchsize) ours speedup
8 76.87 24.46 3.1x
16 82.42 23.58 3.5x
32 76.87 25.01 3.1x
64 71.45 27.20 2.6x

Main binary:

Bydance Program one:

thread original(batchsize) ours speedup
8 119.85 36.28 3.3x
16 112.81 36.98 3.0x
32 115.02 39.80 2.9x
64 108.42 41.82 2.6x

Motivation

  • Lack of Parallel processing for main Binary CUs. The current implementation has some parallelism in the split DWARF path (generating DWO output), but the main binary CU update path is still mostly sequential. On binaries with few split units or where the main CU rewrite dominates, the speedup is limited.

  • Inefficient bucketing algorithm. The current level of concurrency remains low, constrained by both the parallelism strategy and the bucketing algorithm.

  • Fix existing bugs. The current implementation contains several bugs that need to be resolved, including correctness issues in CU rewriting logic and potential data races in shared state, which can lead to incorrect debug information output order in some testcase or crashes on certain inputs.

Proposed Design

The core idea is to restructure the DWARF update pipeline so that CU partitions (buckets) are processed in parallel, each with its own independent DIEBuilder and local range/loc writers, followed by a in-order merge that guarantees deterministic output. Additionally, CUs that are referenced (excluding DW_FORM_ref_addr) are not considered. The design consists of three key components:

1. Equivalence-Class Based CU Partitioning

We use llvm::EquivalenceClasses to union CUs that are connected by cross-CU references into the same equivalence class. Each class becomes one processing bucket; CUs with no cross-CU references become singleton buckets. Compared to the upstream’s fixed-batch-size grouping, this ensures only truly interdependent CUs are serialized together, maximizing parallelism.

2. Per-Bucket Independent Parallel Processing

Each bucket is dispatched to a thread task with its own DIEBuilder and local range/loc writers, eliminating all inter-bucket shared state:

  • DIE construction and range data emission run entirely within bucket-local buffers, requiring no locks.

  • Split DWO handling: only buildTypeUnits (shared DWO type context) is mutex-protected; all other DWO processing is lock-free.

3. In-Order Sequential Merge

After parallel processing, a merge loop consumes buckets in their original partition order via a condition-variable signaling mechanism. The merge phase finalizes skeleton/str_offsets, applies loclist/range offset fixups, appends local buffers to global writers, and emits compile units. This guarantees a deterministic debug information layout regardless of thread scheduling.

User-facing options

The prototype uses the following knobs:--debug-thread-count to control threads used for DWARF update, and the effective thread count is min(--debug-thread-count, --thread-count).

--cu-processing-batch-size will be deprecated.

Correctness Testing

  • Existing BOLT test suite: All existing llvm-bolt LIT tests must pass with the new implementation.

Open Questions

  1. Should --debug-thread-count default XX ? High thread counts may increase peak memory. Community feedback is requested on a sensible default.

  2. Is ordering guarantee required? When multiple threads produce results concurrently, should the output preserve the original input order? Enforcing strict ordering may introduce synchronization overhead and reduce throughput.

  3. Would CU size-ordered merging improve performance? The current merge phase processes buckets in their original partition order. An alternative strategy is to sort partitions by Bucket CUs size before merging, which may allow better pipelining with the parallel phase. This trades deterministic output layout for potential throughput gains.

References

4 Likes

Thanks for the RFC! The results look very promising - DWARF processing is currently the main build-time bottleneck for large applications.

Preserving determinism is still the top priority for reproducible builds. We can add an option to disable it if it shows significant processing-time improvement.

@maksfb,
Thanks for your reply and suggestion! This is still a draft PR, and I’ll continue refining it based on your suggestion. Once the complete implementation is ready, I’d like to invite you to review the code.

1 Like

Thanks for RFC.
Preserving original order does make figuring out what went wrong much easier if incorrect debug information is produced.

Regarding ā€œEquivalent-Class Based CU Partitioningā€
So with LTO we basically end up with two buckets?
One for CUs with cross CU references that are part of LTO, and the other for non-lto CUs?

Any chance of using/reusing/generalizing the DWARF linking implementation for dsymutil here?

Umm original implementation was based on DWARF linking, but there was some duplication to fit into BOLT. At that time DWARF link was also more limited in what it supported.
There is probably a way to do it (a lot of work), but I think it’s outside of scope of this RFC.

Not exactly two buckets — it’s based on connected components in the cross-CU reference graph.

CUs with cross-CU references are grouped by their connectivity. For example, if A→B, B→C, D (no references), E→F, then:

  • Bucket 1: A, B, C (connected component)
  • Bucket 2: D (isolated, no cross-CU references)
  • Bucket 3: E, F (connected component)

So LTO partitioning produces as many buckets as there are connected components in the reference graph — not a fixed two.

dsymutil’s DWARF linker doesn’t have an explicit partitioning step — it discovers inter-CU dependencies dynamically during liveness analysis and iterates until convergence. My approach does static pre-partitioning via equivalence classes (union-find over the DW_FORM_ref_addr reference graph), which gives a complete picture upfront. My partition decides which CUs can be processed as independent groups. The underlying mechanism (reference graph connectivity) is the same, but the implementations aren’t directly reusable.

By default, for order preservation, is it sufficient to follow the order of bucket partitioning, or must it strictly follow the original input order of the debug info — is there an explicit requirement for this? @maksfb @ayermolo @dblaikie

Not sure of the particular ordering you have in mind (order of what? CUs? DIEs in CUs?) But generally DWARF is order independent. CU order certainly doesn’t matter, DIE order doesn’t matter so long as inter-DIE references are correctly resolved.

Hi @dblaikie ,
This means that after BOLT optimization, the order of CUs may differ from that of the original program, but the multi-threaded output (the order of CUs) is guaranteed to be strictly deterministic. This is because after bucketing, the CUs finalization follows the order of buckets rather than the order of original .debug_info.

@Thrrreeee, once you confirm that gdb and lldb don’t have any issues with the new sorting order, it should be fine to continue with the implementation. On the BOLT side, we certainly don’t expect the output CU order to match the input. As I mentioned previously, we do care about about determinism, but even that could be overwritten by an option if build-time gains are worth it.

Yeah, can’t see any reason that’d be a problem.

Only concern is correlating debug info that got moved around. @maksfb I think your company has some tooling that does it using llvm-dwarf --verify. Don’t quite remember details under the hood.

Also I remember it really helped when debugging the old patching mechanism.

I’m not at all familiar with the DWARF processing code in BOLT, but coming from LLDB & dsymutil, I have some high level thoughts:

  • Great to see that this will be deterministic. It’s hard enough to debug these kind of tools when they are deterministic, I can’t imagine having to deal with non-deterministic output.
  • There seems to be a lot of (at least conceptual) overlap with the DWARF linker. The latter is already pretty modular to support both the ā€œclassicā€ and ā€œparallelā€ variant. I’d be interested to understand why this couldn’t build on that and what the concrete limitations are. While I’m certainly sympathetic to the fact that that would be more work than landing the current implementation, it’s also a pretty weak (technical) argument. The more logic we can share, the more this benefits the community overall.
  • I can confirm that the order doesn’t matter as far as the spec is concerned. LLDB should be able to handle that, though I’d expect something like TestSBTypeClassMembers.py to fail with a different layout. To make qualifying the parallel linker easier for dsymutil, I chose to preserve the order.
1 Like

Implementation was heavily inspired by DWAR linker.
I don’t quite remember all the details (it’s been 2+ years), but I think at that time support for DWARF5/split dwarf/type units wasn’t quite there. There were some issues with data structures. Also I think ā€œparallelā€ variant came after, but could be mistaken.

I think DWARF linker now supports DWARF5, (split dwarf/type units)?

It’s certainly worth looking into it if @Thrrreeee is up for it, but I don’t think this should be a blocker for his PR.

The current implementation provides an option to choose whether to guarantee deterministic ordering. Internal testing results show that the non-deterministic version achieves approximately 30% speedup for smaller programs.We also prefer to guarantee deterministic output.

I’m curious whether the PR 194089 you mentioned can be used to verify the debugging information after the BOLT update.

The draft PR has now been split into two separate PRs

[BOLT][DebugInfo] Make parallel DWARF debug names generation deterministic(1/2) (#197670)

[RFC][BOLT] Add a new parallel DWARF processing(2/2) (#197859)