[RFC][BOLT] A New Parallel DWARF Processing Approach in BOLT
-
Status: Draft
-
Company: Bytedance
Summary
Currently, the debug information updates slowly for large binaries, as it only performs concurrent processing for DWO. This RFC proposes to upstream a set of changes that significantly reduce the time cost of --update-debug-sections by parallelizing DWARF update work including mainbinaryCU and splitCU, while preserving deterministic output.
The proposal is based on an implementation currently carried in a PR 195058 and is intended to be discussed and iterated in the LLVM community before landing.
The approximate performance results are shown in the table below:
Split dwarf:
clang:
| thread | original(batchsize) | ours | speedup |
|---|---|---|---|
| 8 | 29.07 | 7.96 | 3.6x |
| 16 | 23.69 | 4.97 | 4.7x |
| 32 | 19.85 | 4.21 | 4.7x |
| 64 | 17.91 | 4.18 | 4.3x |
Bytedance Program oneļ¼debug information size almost 6GBļ¼:
| thread | original(batchsize) | ours | speedup |
|---|---|---|---|
| 8 | 882.30 | 228.71 | 3.9x |
| 16 | 789.64 | 153.99 | 5.1x |
| 32 | 691.84 | 152.67 | 4.5x |
| 48 | 658.89 | 163.93 | 4.0x |
Bytedance Program two:
| thread | original(batchsize) | ours | speedup |
|---|---|---|---|
| 8 | 76.87 | 24.46 | 3.1x |
| 16 | 82.42 | 23.58 | 3.5x |
| 32 | 76.87 | 25.01 | 3.1x |
| 64 | 71.45 | 27.20 | 2.6x |
Main binary:
Bydance Program one:
| thread | original(batchsize) | ours | speedup |
|---|---|---|---|
| 8 | 119.85 | 36.28 | 3.3x |
| 16 | 112.81 | 36.98 | 3.0x |
| 32 | 115.02 | 39.80 | 2.9x |
| 64 | 108.42 | 41.82 | 2.6x |
Motivation
-
Lack of Parallel processing for main Binary CUs. The current implementation has some parallelism in the split DWARF path (generating DWO output), but the main binary CU update path is still mostly sequential. On binaries with few split units or where the main CU rewrite dominates, the speedup is limited.
-
Inefficient bucketing algorithm. The current level of concurrency remains low, constrained by both the parallelism strategy and the bucketing algorithm.
-
Fix existing bugs. The current implementation contains several bugs that need to be resolved, including correctness issues in CU rewriting logic and potential data races in shared state, which can lead to incorrect debug information output order in some testcase or crashes on certain inputs.
Proposed Design
The core idea is to restructure the DWARF update pipeline so that CU partitions (buckets) are processed in parallel, each with its own independent DIEBuilder and local range/loc writers, followed by a in-order merge that guarantees deterministic output. Additionally, CUs that are referenced (excluding DW_FORM_ref_addr) are not considered. The design consists of three key components:
1. Equivalence-Class Based CU Partitioning
We use llvm::EquivalenceClasses to union CUs that are connected by cross-CU references into the same equivalence class. Each class becomes one processing bucket; CUs with no cross-CU references become singleton buckets. Compared to the upstreamās fixed-batch-size grouping, this ensures only truly interdependent CUs are serialized together, maximizing parallelism.
2. Per-Bucket Independent Parallel Processing
Each bucket is dispatched to a thread task with its own DIEBuilder and local range/loc writers, eliminating all inter-bucket shared state:
-
DIE construction and range data emission run entirely within bucket-local buffers, requiring no locks.
-
Split DWO handling: only
buildTypeUnits(shared DWO type context) is mutex-protected; all other DWO processing is lock-free.
3. In-Order Sequential Merge
After parallel processing, a merge loop consumes buckets in their original partition order via a condition-variable signaling mechanism. The merge phase finalizes skeleton/str_offsets, applies loclist/range offset fixups, appends local buffers to global writers, and emits compile units. This guarantees a deterministic debug information layout regardless of thread scheduling.
User-facing options
The prototype uses the following knobs:--debug-thread-count to control threads used for DWARF update, and the effective thread count is min(--debug-thread-count, --thread-count).
--cu-processing-batch-size will be deprecated.
Correctness Testing
- Existing BOLT test suite: All existing
llvm-boltLIT tests must pass with the new implementation.
Open Questions
-
Should --debug-thread-count default XX ? High thread counts may increase peak memory. Community feedback is requested on a sensible default.
-
Is ordering guarantee required? When multiple threads produce results concurrently, should the output preserve the original input order? Enforcing strict ordering may introduce synchronization overhead and reduce throughput.
-
Would CU size-ordered merging improve performance? The current merge phase processes buckets in their original partition order. An alternative strategy is to sort partitions by Bucket CUs size before merging, which may allow better pipelining with the parallel phase. This trades deterministic output layout for potential throughput gains.
