Hi,
As part of our ongoing work on debug information quality at Sapienza University of Rome, in collaboration with Google (cc @snehasish), we authored a paper that will appear at CGO’26: Towards Threading the Needle of Debuggable Optimized Binaries.
We would like to share our results for LLVM/clang, hoping to highlight concrete directions and help inform future efforts on improving debug information quality while retaining competitive performance.
Short Summary of Results
By selectively disabling a small set of optimization passes identified by our analysis framework, DebugTuner, we obtain clang custom levels (Ox-dy = standard Ox level with top-y passes disabled) that significantly improve debug information quality for a relatively modest performance cost. For example, at O1 the O1-d5 custom level improves our debug information product metric by 12.8% while reducing SPEC CPU 2017 performance by only 4.7%, and at higher optimization levels custom levels like O3-d3 and O2-d5 achieve 7–16% better debuggability for performance penalties in the 0.8–6% range. These custom levels sit on a Pareto front that spans from “near-baseline performance with better debug info” (e.g., O3-d3) to “aggressively improved debug info with a higher performance cost” (e.g., O1-d9), giving a spectrum of options depending on how much debuggability one wants to trade for speed.
Methodology
We designed DebugTuner, a framework for testing the impact of optimization passes on debug information quality and for tuning the compiler towards generating more debuggable binaries at a low performance overhead.
DebugTuner systematically disables individual optimization passes from clang’s O1/O2/O3 pipelines, measures their impact on debuggability of real-world programs, and ranks passes by average loss effect.
Studied debug quality metrics (proposed in our prior ASPLOS’23 work [1]):
- Availability of variables: Fraction of variables whose values are visible when stepping through source code in a debugger (relative to the -O0 baseline)
- Line coverage: Fraction of source lines that can be stepped through in a debugger (relative to the -O0 baseline)
- Product score (main metric): availability of variables × line coverage = overall “debug information quality.” A score of 1.0 means everything works perfectly; 0.0 means total loss. The product is an effective metric as the availability of variables inherently depends on line coverage.
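As a minimal illustration of how the three metrics relate, the sketch below computes them from already-summarized debugger traces. The function names, the simple counting scheme, and the sample numbers are assumptions made for this example; the exact counting rules used by DebugTuner are described in the paper.

```python
def availability_of_variables(visible_opt: int, visible_o0: int) -> float:
    """Fraction of variables shown with a value, relative to the -O0 trace."""
    return visible_opt / visible_o0 if visible_o0 else 0.0


def line_coverage(stepped_opt: set[int], stepped_o0: set[int]) -> float:
    """Fraction of the lines stepped at -O0 that are still steppable."""
    if not stepped_o0:
        return 0.0
    return len(stepped_opt & stepped_o0) / len(stepped_o0)


def product_score(availability: float, coverage: float) -> float:
    """Overall debug information quality, between 0.0 and 1.0."""
    return availability * coverage


# Hypothetical trace summary, not data from the paper.
o0_lines = {10, 11, 12, 13, 14, 15, 16, 17}
opt_lines = {10, 11, 13, 14, 16, 17}
coverage = line_coverage(opt_lines, o0_lines)       # 6 of 8 lines
availability = availability_of_variables(12, 20)    # 12 of 20 variables
score = product_score(availability, coverage)
print(f"coverage={coverage:.2f} availability={availability:.2f} product={score:.2f}")
```

In this made-up case, losing a quarter of the steppable lines and 40% of the variables leaves a product score of about 0.45, which shows how losses on the two axes compound.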
Key steps:
- Build: each program in the dataset is built using the standard optimization levels (O1, O2 and O3) and custom levels that each disable a single optimization pass from those.
- Input corpus construction: to assemble a set of representative inputs that achieve adequate coverage of each program’s code base, we use and minimize fuzzing corpora for real-world programs tested daily by Google’s OSS-Fuzz continuous testing initiative.
- Hybrid approach for metrics computation: We compute the product metric above from debugger traces, obtained by putting breakpoints on all steppable lines from DWARF information, fixing known cases of inaccurate DWARF information (details in the paper), and executing the test inputs.
- Ranking: For each pass activated in an optimization level, we compute the relative increment of the debug information metric when disabling the pass, and construct per-program rankings. Then, we distill a global rank from average per-program rankings. Finally, we construct modified Ox-dy optimization levels by disabling the top-y passes in the global rank (more details below).
- Performance and AutoFDO evaluation: We measure the performance penalty of the more debuggable Ox-dy levels on SPEC CPU 2017. We then conduct an AutoFDO case study on how these levels improve AutoFDO’s mapping of profile data back to IR constructs using debug information, measuring the performance improvements for SPEC CPU 2017 and, as a proxy for a large workload, for a self-compilation of clang (the AutoFDO results have been discussed in a recent thread).
We refer to the paper for more details about the methodology.
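To make the ranking step concrete, here is a small sketch under stated assumptions: we read "average per-program rankings" as the mean rank position of each pass across programs, and we ignore ties and passes that are missing from some programs. The helper names and the product scores are hypothetical and are not taken from DebugTuner or from our measurements.

```python
from collections import defaultdict


def relative_increment(base: float, disabled: float) -> float:
    """Percentage change of the product score when one pass is disabled."""
    return 100.0 * (disabled - base) / base


def per_program_ranking(base: float, disabled_scores: dict[str, float]) -> list[str]:
    """Order passes from most to least harmful for a single program."""
    gains = {p: relative_increment(base, s) for p, s in disabled_scores.items()}
    return sorted(gains, key=lambda p: gains[p], reverse=True)


def global_ranking(rankings: list[list[str]]) -> list[str]:
    """Order passes by their average position across per-program rankings."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, pass_name in enumerate(ranking, start=1):
            positions[pass_name].append(position)
    average = {p: sum(v) / len(v) for p, v in positions.items()}
    return sorted(average, key=lambda p: average[p])


# Hypothetical data: baseline score, then the score with one pass disabled.
programs = {
    "prog_a": (0.50, {"SimplifyCFG": 0.54, "InstCombine": 0.52, "SROA": 0.51}),
    "prog_b": (0.40, {"SimplifyCFG": 0.42, "InstCombine": 0.43, "SROA": 0.41}),
    "prog_c": (0.60, {"SimplifyCFG": 0.66, "InstCombine": 0.63, "SROA": 0.61}),
}
rankings = [per_program_ranking(base, scores) for base, scores in programs.values()]
ranked = global_ranking(rankings)
print(ranked)
```

Ranking by average position rather than by average percentage means that a pass with a large effect on a single program does not automatically outrank a pass with a moderate effect everywhere.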
LLVM patch: To allow disabling one or more passes within an optimization level, we developed a command-line option; the patch was recently merged into LLVM: 81eb7de
Results
RQ1: Which passes hurt debug info most?
The table below represents the rankings for the 3 tested optimization levels. The number associated with each pass indicates the average percentage improvement in debug information quality over the standard optimization level.
Clang top-10:
| Rank | O1 | %Δ | O2 | %Δ | O3 | %Δ |
|---|---|---|---|---|---|---|
| 1 | Inliner | 10.56 | Inliner | 12.53 | Inliner | 12.83 |
| 2 | SimplifyCFG | 6.72 | JumpThreading | 4.43 | Machine code sinking | 3.10 |
| 3 | Machine code sinking | 2.22 | Machine code sinking | 2.87 | JumpThreading | 4.22 |
| 4 | InstCombine | 2.90 | SimplifyCFG | 4.59 | LoopStrengthReduce | 2.14 |
| 5 | Control Flow Optimizer | 1.30 | LoopStrengthReduce | 1.91 | SimplifyCFG | 4.33 |
| 6 | EarlyCSE | 2.84 | Control Flow Optimizer | 1.65 | Branch Prob BB Placement | 1.76 |
| 7 | LoopStrengthReduce | 1.46 | DSE | 1.61 | DSE | 1.67 |
| 8 | Branch Prob BB Placement | 0.85 | GVN | 3.18 | LoopUnroll | 1.79 |
| 9 | LoopRotate | 1.03 | LoopRotate | 1.22 | Control Flow Optimizer | 1.74 |
| 10 | SROA | 2.25 | SROA | 2.03 | SROA | 1.89 |
The rankings reveal that the Inliner consistently ranks first across all optimization levels, with its relative impact growing from 10.56% at O1 to 12.83% at O3, as (more aggressive) inlining creates opportunities for downstream passes to harm or disrupt accurate debug mappings. SimplifyCFG appears prominently across levels (rank 2 at O1, 4 at O2, 5 at O3), reflecting its pervasive control flow restructuring that breaks line mappings. Several other passes recur in the top-10 across O1/O2/O3, including Machine code sinking (rank 3 at both O1 and O2, 2 at O3), LoopStrengthReduce (7 at O1, 5 at O2, 4 at O3), Control Flow Optimizer (5 at O1, 6 at O2, 9 at O3), and SROA (10 at all levels), indicating that these transformations consistently challenge debug info preservation.
Due to the enabling effects discussed above, we leave out Inliner from what Ox-dy levels can disable.
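The construction of an Ox-dy level from the ranking can be sketched as follows. The pass list is copied from the O1 column of the table above; the helper name is hypothetical, and the sketch assumes that y counts passes after the Inliner has been skipped, which is our reading for this example rather than something the table itself states.

```python
# Top-10 global rank for O1, copied from the ranking table.
O1_RANK = [
    "Inliner", "SimplifyCFG", "Machine code sinking", "InstCombine",
    "Control Flow Optimizer", "EarlyCSE", "LoopStrengthReduce",
    "Branch Prob BB Placement", "LoopRotate", "SROA",
]


def passes_to_disable(rank, y, excluded=("Inliner",)):
    """Return the top-y ranked passes, skipping passes that must stay enabled."""
    eligible = [p for p in rank if p not in excluded]
    return eligible[:y]


o1_d5 = passes_to_disable(O1_RANK, 5)
print(o1_d5)
```

Under that assumption, O1-d5 would disable SimplifyCFG, Machine code sinking, InstCombine, Control Flow Optimizer, and EarlyCSE, and the nine non-Inliner entries of the top-10 would be exactly enough to build a -d9 level.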
RQ2: Debug-friendly custom levels?
The table below shows the average increment in debug information quality and the associated performance penalty, measured on the custom levels constructed from the above ranking.
| Config | O1 debug info quality (%Δ↑) | O1 speedup (%Δ↓) | O2 debug info quality (%Δ↑) | O2 speedup (%Δ↓) | O3 debug info quality (%Δ↑) | O3 speedup (%Δ↓) |
|---|---|---|---|---|---|---|
| -d3 | +10.0 | -3.6 | +12.9 | -6.3 | +7.0 | -0.8 |
| -d5 | +12.8 | -4.7 | +14.6 | -5.4 | +15.8 | -6.6 |
| -d7 | +19.7 | -12.2 | +15.8 | -5.9 | +16.1 | -6.6 |
| -d9 | +23.2 | -18.6 | +19.3 | -13.6 | +22.8 | -15.2 |
The Inliner was excluded from these custom levels due to its substantial performance benefits and primarily indirect impact on debug information (i.e., much of its measured harm stems from creating opportunities for other passes rather than direct debug information destruction).
Debuggability improvements range from 7.0% (O3-d3) to 23.2% (O1-d9) over baseline levels, with corresponding SPEC CPU 2017 performance penalties of 0.8% to 18.6%.
The O1-d5 custom level stands out with a 12.8% debuggability gain for only a 4.7% performance penalty, making it, in our opinion, an attractive candidate for a clang equivalent of gcc’s Og level. For higher optimization levels, O3-d3 achieves 7.0% better debuggability with just a 0.8% slowdown, while O2-d5 and O2-d7 deliver 14.6-15.8% debug information gains for 5.4-5.9% performance costs.
When plotting the average debug information quality against SPEC CPU 2017 performance, the custom levels trace a clear Pareto front that captures the trade-off between debuggability and speed. At one end, O3-d3 offers an appreciable 7.0% improvement in debug information quality at just 0.8% performance loss, showing that even highly optimized builds can be made more debug-friendly at almost no cost. Moving along the front, O2-d5 and O2-d7 deliver 14.6-15.8% debug information gains for 5.4-5.9% overhead, while at the more aggressive end O1-d9 recovers 23.2% of debug information at the expense of an 18.6% slowdown. Within this spectrum, O1-d5 stands out as a well-balanced point, delivering a substantial improvement in debuggability with a low performance penalty, which is why in the paper we propose it as a starting point for reasoning about an Og configuration for clang.
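For readers who want to explore the trade-off from the table alone, the sketch below checks which custom levels are dominated by another custom level of the same base optimization level. This is a simplification and not the Pareto front from the paper: the paper's front is drawn on absolute quality and performance across all levels, whereas the table's deltas are relative to each level's own baseline, so the sketch only compares configurations within one level. Slowdowns are written as positive numbers, and the helper names are hypothetical.

```python
# (debug info gain %, slowdown %) per custom level, copied from the RQ2 table.
RESULTS = {
    "O1": {"d3": (10.0, 3.6), "d5": (12.8, 4.7), "d7": (19.7, 12.2), "d9": (23.2, 18.6)},
    "O2": {"d3": (12.9, 6.3), "d5": (14.6, 5.4), "d7": (15.8, 5.9), "d9": (19.3, 13.6)},
    "O3": {"d3": (7.0, 0.8), "d5": (15.8, 6.6), "d7": (16.1, 6.6), "d9": (22.8, 15.2)},
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and better on one."""
    gain_a, cost_a = a
    gain_b, cost_b = b
    no_worse = gain_a >= gain_b and cost_a <= cost_b
    strictly_better = gain_a > gain_b or cost_a < cost_b
    return no_worse and strictly_better


def non_dominated(configs):
    """Names of configurations that no other configuration dominates."""
    return [
        name for name, point in configs.items()
        if not any(dominates(other, point)
                   for other_name, other in configs.items() if other_name != name)
    ]


for level, configs in RESULTS.items():
    print(level, non_dominated(configs))
```

On the table's numbers, all four O1 levels are non-dominated, O2-d3 is dominated by O2-d5 (higher gain at a lower cost), and O3-d5 is dominated by O3-d7 (higher gain at the same cost).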
Takeaways
- O1-d5 as Og candidate: +12.8% debuggability for only -4.7% performance cost vs. standard O1
Code and evaluation materials of DebugTuner are available here.
– Cristian (on behalf of @snehasish, @dcdelia and the other co-authors)
References
[1] C. Assaiante, D. C. D’Elia, G. A. Di Luna, and L. Querzoni. 2023. Where Did My Variable Go? Poking Holes in Incomplete Debug Information. In Proceedings of ASPLOS 2023.



