[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

Thrrreeee · May 6, 2026, 2:39am

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

Status: Draft
Company: Bytedance

Summary

Currently, the debug information updates slowly for large binaries, as it only performs concurrent processing for DWO. This RFC proposes to upstream a set of changes that significantly reduce the time cost of --update-debug-sections by parallelizing DWARF update work including mainbinaryCU and splitCU, while preserving deterministic output.

The proposal is based on an implementation currently carried in a PR 195058 and is intended to be discussed and iterated in the LLVM community before landing.

The approximate performance results are shown in the table below:

Split dwarf:

clang:

thread	original(batchsize)	ours	speedup
8	29.07	7.96	3.6x
16	23.69	4.97	4.7x
32	19.85	4.21	4.7x
64	17.91	4.18	4.3x

Bytedance Program one（debug information size almost 6GB）:

thread	original(batchsize)	ours	speedup
8	882.30	228.71	3.9x
16	789.64	153.99	5.1x
32	691.84	152.67	4.5x
48	658.89	163.93	4.0x

Bytedance Program two:

thread	original(batchsize)	ours	speedup
8	76.87	24.46	3.1x
16	82.42	23.58	3.5x
32	76.87	25.01	3.1x
64	71.45	27.20	2.6x

Main binary:

Bydance Program one:

thread	original(batchsize)	ours	speedup
8	119.85	36.28	3.3x
16	112.81	36.98	3.0x
32	115.02	39.80	2.9x
64	108.42	41.82	2.6x

Motivation

Lack of Parallel processing for main Binary CUs. The current implementation has some parallelism in the split DWARF path (generating DWO output), but the main binary CU update path is still mostly sequential. On binaries with few split units or where the main CU rewrite dominates, the speedup is limited.
Inefficient bucketing algorithm. The current level of concurrency remains low, constrained by both the parallelism strategy and the bucketing algorithm.
Fix existing bugs. The current implementation contains several bugs that need to be resolved, including correctness issues in CU rewriting logic and potential data races in shared state, which can lead to incorrect debug information output order in some testcase or crashes on certain inputs.

Proposed Design

The core idea is to restructure the DWARF update pipeline so that CU partitions (buckets) are processed in parallel, each with its own independent DIEBuilder and local range/loc writers, followed by a in-order merge that guarantees deterministic output. Additionally, CUs that are referenced (excluding DW_FORM_ref_addr) are not considered. The design consists of three key components:

1. Equivalence-Class Based CU Partitioning

We use llvm::EquivalenceClasses to union CUs that are connected by cross-CU references into the same equivalence class. Each class becomes one processing bucket; CUs with no cross-CU references become singleton buckets. Compared to the upstream’s fixed-batch-size grouping, this ensures only truly interdependent CUs are serialized together, maximizing parallelism.

2. Per-Bucket Independent Parallel Processing

Each bucket is dispatched to a thread task with its own DIEBuilder and local range/loc writers, eliminating all inter-bucket shared state:

DIE construction and range data emission run entirely within bucket-local buffers, requiring no locks.
Split DWO handling: only buildTypeUnits (shared DWO type context) is mutex-protected; all other DWO processing is lock-free.

3. In-Order Sequential Merge

After parallel processing, a merge loop consumes buckets in their original partition order via a condition-variable signaling mechanism. The merge phase finalizes skeleton/str_offsets, applies loclist/range offset fixups, appends local buffers to global writers, and emits compile units. This guarantees a deterministic debug information layout regardless of thread scheduling.

User-facing options

The prototype uses the following knobs:--debug-thread-count to control threads used for DWARF update, and the effective thread count is min(--debug-thread-count, --thread-count).

--cu-processing-batch-size will be deprecated.

Correctness Testing

Existing BOLT test suite: All existing llvm-bolt LIT tests must pass with the new implementation.

Open Questions

Should --debug-thread-count default XX ? High thread counts may increase peak memory. Community feedback is requested on a sensible default.
Is ordering guarantee required? When multiple threads produce results concurrently, should the output preserve the original input order? Enforcing strict ordering may introduce synchronization overhead and reduce throughput.
Would CU size-ordered merging improve performance? The current merge phase processes buckets in their original partition order. An alternative strategy is to sort partitions by Bucket CUs size before merging, which may allow better pipelining with the parallel phase. This trades deterministic output layout for potential throughput gains.

References

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT#195058

maksfb · May 6, 2026, 5:11am

Thanks for the RFC! The results look very promising - DWARF processing is currently the main build-time bottleneck for large applications.

Preserving determinism is still the top priority for reproducible builds. We can add an option to disable it if it shows significant processing-time improvement.

Thrrreeee · May 6, 2026, 2:47pm

@maksfb,
Thanks for your reply and suggestion! This is still a draft PR, and I’ll continue refining it based on your suggestion. Once the complete implementation is ready, I’d like to invite you to review the code.

ayermolo · May 6, 2026, 8:28pm

Thanks for RFC.
Preserving original order does make figuring out what went wrong much easier if incorrect debug information is produced.

Regarding “Equivalent-Class Based CU Partitioning”
So with LTO we basically end up with two buckets?
One for CUs with cross CU references that are part of LTO, and the other for non-lto CUs?

dblaikie · May 6, 2026, 8:42pm

Any chance of using/reusing/generalizing the DWARF linking implementation for dsymutil here?

ayermolo · May 7, 2026, 6:30pm

Umm original implementation was based on DWARF linking, but there was some duplication to fit into BOLT. At that time DWARF link was also more limited in what it supported.
There is probably a way to do it (a lot of work), but I think it’s outside of scope of this RFC.

Thrrreeee · May 9, 2026, 2:43am

Not exactly two buckets — it’s based on connected components in the cross-CU reference graph.

CUs with cross-CU references are grouped by their connectivity. For example, if A→B, B→C, D (no references), E→F, then:

Bucket 1: A, B, C (connected component)
Bucket 2: D (isolated, no cross-CU references)
Bucket 3: E, F (connected component)

So LTO partitioning produces as many buckets as there are connected components in the reference graph — not a fixed two.

Thrrreeee · May 9, 2026, 3:19am

dsymutil’s DWARF linker doesn’t have an explicit partitioning step — it discovers inter-CU dependencies dynamically during liveness analysis and iterates until convergence. My approach does static pre-partitioning via equivalence classes (union-find over the DW_FORM_ref_addr reference graph), which gives a complete picture upfront. My partition decides which CUs can be processed as independent groups. The underlying mechanism (reference graph connectivity) is the same, but the implementations aren’t directly reusable.

Sockke · May 11, 2026, 6:18am

By default, for order preservation, is it sufficient to follow the order of bucket partitioning, or must it strictly follow the original input order of the debug info — is there an explicit requirement for this? @maksfb @ayermolo @dblaikie

dblaikie · May 11, 2026, 4:18pm

Not sure of the particular ordering you have in mind (order of what? CUs? DIEs in CUs?) But generally DWARF is order independent. CU order certainly doesn’t matter, DIE order doesn’t matter so long as inter-DIE references are correctly resolved.

Thrrreeee · May 12, 2026, 2:57am

Hi @dblaikie ,
This means that after BOLT optimization, the order of CUs may differ from that of the original program, but the multi-threaded output (the order of CUs) is guaranteed to be strictly deterministic. This is because after bucketing, the CUs finalization follows the order of buckets rather than the order of original .debug_info.

maksfb · May 12, 2026, 6:28pm

@Thrrreeee, once you confirm that gdb and lldb don’t have any issues with the new sorting order, it should be fine to continue with the implementation. On the BOLT side, we certainly don’t expect the output CU order to match the input. As I mentioned previously, we do care about about determinism, but even that could be overwritten by an option if build-time gains are worth it.

dblaikie · May 12, 2026, 6:38pm

Yeah, can’t see any reason that’d be a problem.

ayermolo · May 18, 2026, 8:07pm

Only concern is correlating debug info that got moved around. @maksfb I think your company has some tooling that does it using llvm-dwarf --verify. Don’t quite remember details under the hood.

Also I remember it really helped when debugging the old patching mechanism.

JDevlieghere · May 18, 2026, 11:13pm

I’m not at all familiar with the DWARF processing code in BOLT, but coming from LLDB & dsymutil, I have some high level thoughts:

Great to see that this will be deterministic. It’s hard enough to debug these kind of tools when they are deterministic, I can’t imagine having to deal with non-deterministic output.
There seems to be a lot of (at least conceptual) overlap with the DWARF linker. The latter is already pretty modular to support both the “classic” and “parallel” variant. I’d be interested to understand why this couldn’t build on that and what the concrete limitations are. While I’m certainly sympathetic to the fact that that would be more work than landing the current implementation, it’s also a pretty weak (technical) argument. The more logic we can share, the more this benefits the community overall.
I can confirm that the order doesn’t matter as far as the spec is concerned. LLDB should be able to handle that, though I’d expect something like TestSBTypeClassMembers.py to fail with a different layout. To make qualifying the parallel linker easier for dsymutil, I chose to preserve the order.

ayermolo · May 19, 2026, 12:50am

Implementation was heavily inspired by DWAR linker.
I don’t quite remember all the details (it’s been 2+ years), but I think at that time support for DWARF5/split dwarf/type units wasn’t quite there. There were some issues with data structures. Also I think “parallel” variant came after, but could be mistaken.

I think DWARF linker now supports DWARF5, (split dwarf/type units)?

It’s certainly worth looking into it if @Thrrreeee is up for it, but I don’t think this should be a blocker for his PR.

Thrrreeee · May 19, 2026, 2:35am

The current implementation provides an option to choose whether to guarantee deterministic ordering. Internal testing results show that the non-deterministic version achieves approximately 30% speedup for smaller programs.We also prefer to guarantee deterministic output.

I’m curious whether the PR 194089 you mentioned can be used to verify the debugging information after the BOLT update.

Thrrreeee · May 19, 2026, 2:51am

The draft PR has now been split into two separate PRs

Topic		Replies	Views
[Proposal][Debuginfo] dsymutil-like tool for ELF. LLVM Dev List Archives	46	739	November 2, 2020
Fragmented DWARF LLVM Dev List Archives	20	406	November 10, 2020
Verification of Accurate Debug Information in BOLT After Optimizations BOLT	11	332	November 19, 2024
LTO debug info LLDB	10	207	May 16, 2017
`llvm-dsymutil` performance LLDB debuginfo	9	531	June 5, 2026

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

Summary

Split dwarf:

Main binary:

Motivation

Proposed Design

1. Equivalence-Class Based CU Partitioning

2. Per-Bucket Independent Parallel Processing

3. In-Order Sequential Merge

User-facing options

Correctness Testing

Open Questions

References

[BOLT][DebugInfo] Make parallel DWARF debug names generation deterministic(1/2) (#197670)

[RFC][BOLT] Add a new parallel DWARF processing(2/2) (#197859)

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT

Summary

Split dwarf:

Main binary:

Motivation

Proposed Design

1. Equivalence-Class Based CU Partitioning

2. Per-Bucket Independent Parallel Processing

3. In-Order Sequential Merge

User-facing options

Correctness Testing

Open Questions

References

[BOLT][DebugInfo] Make parallel DWARF debug names generation deterministic(1/2) (#197670)

[RFC][BOLT] Add a new parallel DWARF processing(2/2) (#197859)

Related topics