Constructing Debug-Friendly Optimization Pipelines from Individual Pass Evaluation

Hi,

As part of our ongoing work on debug information quality at Sapienza University of Rome, in collaboration with Google (cc @snehasish), we authored a paper that will appear at CGO’26: Towards Threading the Needle of Debuggable Optimized Binaries.
We would like to share our results for LLVM/clang, hoping to highlight concrete directions and help inform future efforts on improving debug information quality while retaining competitive performance.

Short Summary of Results

By selectively disabling a small set of optimization passes identified by our analysis framework, DebugTuner, we obtain clang custom levels (Ox-dy = standard Ox level with top-y passes disabled) that significantly improve debug information quality for a relatively modest performance cost. For example, at O1 the O1-d5 custom level improves our debug information product metric by 12.8% while reducing SPEC CPU 2017 performance by only 4.7%, and at higher optimization levels custom levels like O3-d3 and O2-d5 achieve 7–16% better debuggability for performance penalties in the 0.8–6% range. These custom levels sit on a Pareto front that spans from “near-baseline performance with better debug info” (e.g., O3-d3) to “aggressively improved debug info with a higher performance cost” (e.g., O1-d9), giving a spectrum of options depending on how much debuggability one wants to trade for speed.

Methodology

We designed DebugTuner, a framework for testing the impact of optimization passes on debug information quality and tuning the compiler towards the generation of more debuggable binaries for a low performance overhead.
DebugTuner systematically disables individual optimization passes from clang’s O1/O2/O3 pipelines, measures their impact on debuggability of real-world programs, and ranks passes by average loss effect.

Studied debug quality metrics (proposed in our prior ASPLOS’23 work [1]):

  • Availability of variables: Fraction of variables one can actually see with values when stepping through source code in a debugger (relative to -O0 baseline for debugging)
  • Line coverage: Fraction of source lines you can actually step through (relative to -O0 baseline)
  • Product score (main metric): availability of variables × line coverage = overall “debug information quality.” A score of 1.0 means everything works perfectly; 0.0 means total loss. The product is an effective metric as the availability of variables inherently depends on line coverage.
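As a rough illustration of how the three metrics relate, the sketch below computes them from two debugger traces. The trace format (a mapping from source line to the set of variables shown with a value at that line), the per-line averaging used for availability, and the name debug_metrics are assumptions made for this example; the precise definitions are in the paper and in [1].

```python
def debug_metrics(o0_trace, opt_trace):
    """Return (availability, line_coverage, product) relative to the -O0 trace.

    Each trace maps a source line number to the set of variable names that the
    debugger showed with a value when stopped at that line (assumed format).
    """
    # Line coverage: share of lines stepped at -O0 that are still stepped.
    common = [line for line in o0_trace if line in opt_trace]
    line_coverage = len(common) / len(o0_trace)

    # Availability: on each line stepped in both builds, the share of the
    # -O0-visible variables that are still visible, averaged over those lines.
    ratios = [
        len(opt_trace[line] & o0_trace[line]) / len(o0_trace[line])
        for line in common
        if o0_trace[line]
    ]
    availability = sum(ratios) / len(ratios) if ratios else 0.0

    return availability, line_coverage, availability * line_coverage


# Made-up traces, for illustration only.
o0 = {10: {"a", "b"}, 11: {"a", "b"}, 12: {"a", "b", "c"}, 13: {"c"}}
opt = {10: {"a"}, 12: {"a", "b", "c"}, 13: set()}

result = debug_metrics(o0, opt)
print(result)
```

In this toy case line 11 is no longer steppable, and some variables are missing on lines 10 and 13, so both factors drop below 1.0 and the product drops further.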

Key steps:

  • Build: each program in the dataset is built using the standard optimization levels (O1, O2, and O3) and custom levels derived from them by disabling a single optimization pass.
  • Input corpus construction: to assemble a set of representative inputs that achieve adequate coverage of each program’s code base, we use and minimize fuzzing corpora for real-world programs tested daily by Google’s OSS-Fuzz continuous testing initiative.
  • Hybrid approach for metrics computation: We compute the product metric above from debugger traces, obtained by putting breakpoints on all steppable lines from DWARF information, fixing known cases of inaccurate DWARF information (details in the paper), and executing the test inputs.
  • Ranking: For each pass activated in an optimization level, we compute the relative increment of the debug information metric when disabling the pass, and construct per-program rankings. Then, we distill a global rank from average per-program rankings. Finally, we construct modified Ox-dy optimization levels by disabling the top-y passes in the global rank (more details below).
  • Performance and AutoFDO evaluation: We measure the performance penalty for the more debuggable Ox-dy levels on SPEC CPU 2017. We then conduct an AutoFDO case study on how these levels improve how AutoFDO maps profile data back to IR constructs using debug information, measuring the performance improvements for SPEC CPU 2017 and, as a proxy for a large workload, for a self-compilation of clang (the results regarding AutoFDO have been discussed in a recent thread).
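The ranking step can be sketched as follows. This is a minimal illustration, not the DebugTuner implementation: it assumes the global rank is obtained by averaging each pass's position in the per-program rankings, and the program names, scores, and function names are all made up for the example.

```python
from statistics import mean


def relative_increment(baseline, without_pass):
    """Relative gain in the debug metric when one pass is disabled."""
    return (without_pass - baseline) / baseline


def per_program_ranking(baseline, scores_without):
    """Map each pass to its rank for one program (1 = largest gain when disabled)."""
    gains = {p: relative_increment(baseline, s) for p, s in scores_without.items()}
    ordered = sorted(gains, key=gains.get, reverse=True)
    return {p: pos for pos, p in enumerate(ordered, start=1)}


def global_ranking(programs):
    """programs: name -> (baseline score, {pass: score with that pass disabled}).

    Assumes every program was evaluated on the same set of passes.
    """
    ranks = [per_program_ranking(b, s) for b, s in programs.values()]
    avg_rank = {p: mean(r[p] for r in ranks) for p in ranks[0]}
    return sorted(avg_rank, key=avg_rank.get)


def custom_level(ranking, y, excluded=("Inliner",)):
    """Passes disabled by an Ox-dy level: the top-y of the rank, minus exclusions."""
    return [p for p in ranking if p not in excluded][:y]


# Hypothetical product scores for two programs.
programs = {
    "progA": (0.50, {"Inliner": 0.56, "SimplifyCFG": 0.54,
                     "InstCombine": 0.52, "EarlyCSE": 0.51}),
    "progB": (0.60, {"Inliner": 0.66, "SimplifyCFG": 0.61,
                     "InstCombine": 0.63, "EarlyCSE": 0.62}),
}

ranking = global_ranking(programs)
print(ranking)
print(custom_level(ranking, 2))
```

Under this assumption a pass's position in the global rank depends on its average rank rather than its average gain, so the per-pass averages reported next to a ranking need not decrease monotonically.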

We refer to the paper for more details about the methodology.

LLVM patch: To support disabling one or more passes from activation in an optimization level, we developed a command-line option with a patch that has been recently merged in LLVM: 81eb7de

Results

RQ1: Which passes hurt debug info most?

The table below shows the rankings for the three tested optimization levels. The number next to each pass is the average percentage improvement in debug information quality over the standard optimization level when that pass is disabled.

Clang top-10:

| Rank | O1 | O2 | O3 |
|---|---|---|---|
| 1 | Inliner 10.56 | Inliner 12.53 | Inliner 12.83 |
| 2 | SimplifyCFG 6.72 | JumpThreading 4.43 | Machine code sinking 3.10 |
| 3 | Machine code sinking 2.22 | Machine code sinking 2.87 | JumpThreading 4.22 |
| 4 | InstCombine 2.90 | SimplifyCFG 4.59 | LoopStrengthReduce 2.14 |
| 5 | Control Flow Optimizer 1.30 | LoopStrengthReduce 1.91 | SimplifyCFG 4.33 |
| 6 | EarlyCSE 2.84 | Control Flow Optimizer 1.65 | Branch Prob BB Placement 1.76 |
| 7 | LoopStrengthReduce 1.46 | DSE 1.61 | DSE 1.67 |
| 8 | Branch Prob BB Placement 0.85 | GVN 3.18 | LoopUnroll 1.79 |
| 9 | LoopRotate 1.03 | LoopRotate 1.22 | Control Flow Optimizer 1.74 |
| 10 | SROA 2.25 | SROA 2.03 | SROA 1.89 |

The rankings reveal that the Inliner consistently ranks first across all optimization levels, with its relative impact growing from 10.56% at O1 to 12.83% at O3, as (more aggressive) inlining creates opportunities for downstream passes to harm or disrupt accurate debug mappings. SimplifyCFG appears prominently across levels (rank 2 at O1, 4 at O2, 5 at O3), reflecting its pervasive control flow restructuring that breaks line mappings. Several other passes recur in the top-10 across O1/O2/O3, including Machine code sinking (rank 3 at both O1 and O2, 2 at O3), LoopStrengthReduce (7 at O1, 5 at O2, 4 at O3), Control Flow Optimizer (5 at O1, 6 at O2, 9 at O3), and SROA (10 at all levels), indicating that these transformations consistently challenge debug info preservation.

Due to the enabling effects discussed above, we leave out Inliner from what Ox-dy levels can disable.

RQ2: Debug-friendly custom levels?

The table below reports the average increment in debug information quality and the associated performance penalty, measured on the custom levels constructed from the above ranking.

| Config | O1 debug info quality (%Δ↑) | O1 speedup (%Δ↓) | O2 debug info quality (%Δ↑) | O2 speedup (%Δ↓) | O3 debug info quality (%Δ↑) | O3 speedup (%Δ↓) |
|---|---|---|---|---|---|---|
| -d3 | +10.0 | -3.6 | +12.9 | -6.3 | +7.0 | -0.8 |
| -d5 | +12.8 | -4.7 | +14.6 | -5.4 | +15.8 | -6.6 |
| -d7 | +19.7 | -12.2 | +15.8 | -5.9 | +16.1 | -6.6 |
| -d9 | +23.2 | -18.6 | +19.3 | -13.6 | +22.8 | -15.2 |

The Inliner was excluded from these custom levels due to its substantial performance benefits and primarily indirect impact on debug information (i.e., much of its measured harm stems from creating opportunities for other passes rather than direct debug information destruction).

Debuggability improvements range from 7.0% (O3-d3) to 23.2% (O1-d9) over baseline levels, with corresponding SPEC CPU 2017 performance penalties of 0.8% to 18.6%.

The O1-d5 custom level stands out with a 12.8% debuggability gain for only a 4.7% performance penalty, making it, in our opinion, an attractive candidate for a clang equivalent of gcc’s Og level. For higher optimization levels, O3-d3 achieves 7.0% better debuggability with just 0.8% slowdown, while O2-d5 and O2-d7 deliver 14.6-15.8% debug information gains for 5.4-5.9% performance costs.

When plotting the average debug information quality against SPEC CPU 2017 performance, the custom levels trace a clear Pareto front that captures the trade-off between debuggability and speed. At one end, O3-d3 offers an appreciable 7.0% improvement in debug information quality at just 0.8% performance loss, showing that even highly optimized builds can be made more debug-friendly at almost no cost. Moving along the front, O2-d5 and O2-d7 deliver 14.6-15.8% debug information gains for 5.4-5.9% overhead, while at the more aggressive end O1-d9 improves debug information quality by 23.2% at the expense of an 18.6% slowdown. Within this spectrum, O1-d5 stands out as a well-balanced point, with a substantial improvement in debuggability for a low performance penalty, which is why in the paper we propose it as a basis for reasoning about an Og configuration for clang.

Takeaways

  • O1-d5 as Og candidate: +12.8% debuggability for only -4.7% performance cost vs. standard O1

Code and evaluation materials of DebugTuner are available here.

– Cristian (on behalf of @snehasish, @dcdelia and the other co-authors)

References

[1] C. Assaiante, D. C. D’Elia, G. A. Di Luna, and L. Querzoni. 2023. Where Did My Variable Go? Poking Holes in Incomplete Debug Information. In Proceedings of ASPLOS 2023.

If I’m reading the graph correctly, O1-d5 appears marginally better than current Og but at bigger perf cost. But O3-d9 seems better all around, is this right (feels strange that it’s better than the equivalent O2, but still)?

Just want to point out that many of these passes are more canonicalizations than optimizations (e.g. LoopRotate), so I believe more optimizations got (indirectly) disabled as a consequence.

Are cross-module optimizations via link-time optimization considered as part of this work? Maybe as future work?

Did this work provide any insight about how current optimization passes can be made more debug-able?

How I read it, it’s an improvement against the optimization level started from. As O3 starts off worse than O2/O1 for debug experience, there are many more opportunities to improve the debug information.

At the same time, O3 contains the slow steps that have very limited impact on performance, though are still relevant if you need every fraction of a percentage improvement. Compared to the passes in O1, the same absolute improvement in debug information would give O3 a much higher percentage.

As such, I would expect O1 to have limited removal candidates while O2 has more and O3 even more.

I’m wondering about the passes that are at the other end. Do we have passes in O2/O3 that barely impact debuggability while adding a lot of performance gain? This might be relevant for extending Og: it could mitigate the performance loss at the cost of longer compile times.

So basically the bottom 10.

In the evaluation, we did not consider the available Og pipeline (which is not present in the chart) as it landed after we performed our evaluation. We will run experiments on our side to make an adequate comparison.

Cross-module optimizations via LTO are not considered as part of this work; thank you for pointing this out, as it could be an interesting follow-up! Regarding how current optimization passes can be made more debuggable, this is not directly in the scope of our work, as it is more a problem for bug-finding methodologies. But we think that the rankings may be helpful for understanding where the major loss of debug information is introduced.

Thanks a lot for this comment! The performance experiments we performed covered only the configurations we constructed, so we have no data ready to measure the performance gain of passes at the bottom of the rankings. But this sounds like a promising direction for further work to improve the effectiveness of Og in both debuggability and performance, and we will follow up with an experimental evaluation of this.

@StephenTozer I wonder if there may be interest in revisiting the definition of Og given the results @cristianassaiante shared?

cc for visibility: @OCHyams @rastogishubham @adrian.prantl

I’m happy to see more attention come to the tuning of optimization levels for -Og - although there has been much effort put towards improving the quality of debug information in optimized builds, optimization and debuggability are still a tradeoff, and supporting a variety of options for developers with different priorities is important. I’m broadly supportive of efforts to more clearly define what tradeoffs the different optimization levels are making, to give a firm definition to how we measure these tradeoffs, and to ensure that we support a set of options that collectively meet the needs of most developers.
For the results reported here, I have a few thoughts, which I’ll summarize briefly:

  • As of Clang 20, Og applies -fextend-variable-liveness, which in my own experiments showed a much more efficient tradeoff of speed vs debug information availability than disabling optimization passes outright; as this flag may have non-uniform effects on the speed/debug info impacts of the various optimization passes, it may be worth repeating these experiments with the extended liveness flag enabled.

  • I’m glad to see that you’ve experimented with optimization levels other than just O1 - one prior proposal I made was to add an O2g optimization level, which would aim for a level of optimization close to O2 (speed within 5%) while trying to maximize debug info availability; if a feature along these lines is relevant to your interests, it might be worth reviving the idea.

  • For determining passes to disable, interactions between passes may be relevant as noted by @mshockwave and as has been described in the paper. While this is a computationally expensive problem to solve, it would be good to see, if possible, a brute force approach of disabling the highest impact pass, recomputing the impact of all passes, then disabling the next highest impact pass, and so on, for at least one configuration (O1-d5 would be good) and seeing if the results are significantly different from the original.
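The iterative procedure described in the last point could look roughly like the sketch below. Everything here is hypothetical: evaluate stands for a full build plus debug-metric measurement with a given set of passes disabled, and toy_evaluate is a synthetic stand-in with one hand-made interaction between two passes.

```python
def greedy_disable(passes, evaluate, y):
    """Pick y passes one at a time, re-measuring every remaining pass each round.

    evaluate(disabled) must return the debug score of a build in which the
    passes in the frozenset `disabled` are turned off.
    """
    disabled = frozenset()
    order = []
    for _ in range(y):
        current = evaluate(disabled)
        # Gain of each remaining pass, measured on top of what is already disabled.
        gains = {
            p: evaluate(disabled | {p}) - current
            for p in passes
            if p not in disabled
        }
        best = max(gains, key=gains.get)
        order.append(best)
        disabled = disabled | {best}
    return order


def toy_evaluate(disabled):
    """Synthetic scores: the two CFG passes overlap, so their gains do not add up."""
    score = 0.50
    if "SimplifyCFG" in disabled:
        score += 0.06
    if "JumpThreading" in disabled:
        score += 0.02 if "SimplifyCFG" in disabled else 0.05
    if "EarlyCSE" in disabled:
        score += 0.03
    return score


passes = ["SimplifyCFG", "JumpThreading", "EarlyCSE"]
order = greedy_disable(passes, toy_evaluate, 2)
print(order)
```

With these made-up numbers, a one-shot ranking would pick SimplifyCFG and JumpThreading, while the recomputed ranking picks SimplifyCFG and then EarlyCSE. Each round costs one evaluation per remaining pass, which is where the computational expense comes from.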

Overall I’m interested to see how this approach develops, and whether this might lead to a more principled and less ad-hoc long-term approach to deciding how to build our optimization pass pipelines.

@StephenTozer thank you very much for your thoughtful comments and suggestions! :slightly_smiling_face:

We are currently setting up the experiments on our side. Our plan is to regenerate the same chart presented above, this time enabling -fextend-variable-liveness, and compare our configurations against the available -Og. We will share the results as soon as they are ready.

We fully agree that studying interactions between passes is crucial to improving the methodology. The brute-force approach you suggest is certainly feasible from a debug information evaluation perspective. However, performing performance evaluation at each step would be significantly more demanding. For now, we believe it is more practical to conduct performance measurements only after the configuration has been finalized, as we did in the paper. Once the experiments with the extended liveness flag are complete, we plan to explore pass interaction analysis as well.

Hi everyone,

Sorry for the long wait! We are back with the results of evaluating our DebugTuner configurations together with the -fextend-variable-liveness flag suggested by @StephenTozer, in relation to potential improvements to the available Og and the creation of an O2g level.

The experiments were performed using clang 23.0.0git (commit id: 7bf2d5f).

Configuration Naming:
In the presented results, configuration names follow the pattern {LVL}-{CONFIG}[-ext], where:

  • {LVL}: Standard optimization level baseline for the configuration
  • {CONFIG}: DebugTuner configuration (std for standard, d3/d5/d7/d9 as previously described)
  • -ext: Suffix indicating that the -fextend-variable-liveness flag is enabled
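For anyone post-processing the results, the naming pattern can be parsed with a small helper like the one below. This is only a convenience sketch: the Config type, the parse_config name, and the restriction to O1/O2/O3 and d3/d5/d7/d9 are assumptions based on the configurations listed in this thread.

```python
import re
from typing import NamedTuple


class Config(NamedTuple):
    level: str       # standard optimization level used as baseline, e.g. "O1"
    variant: str     # "std" or a DebugTuner configuration such as "d5"
    extended: bool   # True when the -ext suffix is present


# {LVL}-{CONFIG}[-ext], restricted to the values used in this thread.
_NAME = re.compile(r"^(O[123])-(std|d[3579])(-ext)?$")


def parse_config(name):
    """Split a configuration name into its three components."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid configuration name: {name!r}")
    return Config(match.group(1), match.group(2), match.group(3) is not None)


print(parse_config("O1-d5-ext"))
print(parse_config("O2-std"))
```

Labels carrying extra annotations, such as "O1-std-ext (Og)" in the tables, would need the annotation stripped before parsing.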

Experiment Scope:
We evaluated Og-related and O2g-related configurations to measure the tradeoff between debug information availability and performance. Results are presented as both absolute values and relative percentages compared to the baseline standard configuration.

Since the -fextend-variable-liveness flag mostly affects the availability of variables, we also report the individual metrics rather than only the product of metrics (availability of variables × line coverage) that we used in the paper evaluation results provided above. As described below, across both O1 and O2 configurations, enabling -fextend-variable-liveness consistently improves availability of variables by several percentage points within the same DebugTuner configuration, with modest incremental slowdowns on top of the non-extended variant. The impact of the flag seems to be independent of the effects of the disabled transformations.

Og-related Configurations Summary

The tables below present all the metrics involved in these experiments, both as absolute values (above) and as percentage differences over Og (below).

| configuration | availability of variables | line coverage | debug information availability | speedup |
|---|---|---|---|---|
| O1-std | 0.8101 | 0.6355 | 0.5148 | 2.4805 |
| O1-std-ext (Og) | 0.8791 | 0.6470 | 0.5688 | 2.4275 |
| O1-d3 | 0.8306 | 0.6859 | 0.5697 | 2.4330 |
| O1-d3-ext | 0.8900 | 0.6957 | 0.6192 | 2.3782 |
| O1-d5 | 0.8344 | 0.6944 | 0.5794 | 2.3944 |
| O1-d5-ext | 0.8975 | 0.7015 | 0.6297 | 2.3363 |
| O1-d7 | 0.8319 | 0.7464 | 0.6209 | 2.2185 |
| O1-d7-ext | 0.8951 | 0.7506 | 0.6718 | 2.1872 |
| O1-d9 | 0.8432 | 0.7474 | 0.6302 | 2.0652 |
| O1-d9-ext | 0.9082 | 0.7502 | 0.6814 | 2.0059 |

| configuration | availability of variables | line coverage | debug information availability | speedup |
|---|---|---|---|---|
| O1-std | -7.85% | -1.78% | -9.49% | +2.18% |
| O1-std-ext (Og) | +0.00% | +0.00% | +0.00% | +0.00% |
| O1-d3 | -5.52% | +6.01% | +0.16% | +0.23% |
| O1-d3-ext | +1.24% | +7.52% | +8.86% | -2.03% |
| O1-d5 | -5.09% | +7.31% | +1.86% | -1.36% |
| O1-d5-ext | +2.10% | +8.43% | +10.70% | -3.76% |
| O1-d7 | -5.37% | +15.36% | +9.17% | -8.61% |
| O1-d7-ext | +1.82% | +16.01% | +18.11% | -9.90% |
| O1-d9 | -4.08% | +15.51% | +10.79% | -14.93% |
| O1-d9-ext | +3.32% | +15.95% | +19.79% | -17.37% |

Within the Og-related configurations, all DebugTuner variants dominate the current Og (O1-std-ext) in debug information availability, at the cost of some speed. The -ext variants consistently add a further bump in availability of variables (and consequently in debug info availability), with modest additional slowdowns.
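The percentage differences are plain relative changes against the Og row. The short check below recomputes them for O1-d5-ext from the absolute values reported above; the dictionary keys and the percent_delta name are chosen for this example only.

```python
def percent_delta(value, baseline):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


# Absolute values copied from the Og-related table.
OG = {"availability": 0.8791, "line_coverage": 0.6470,
      "product": 0.5688, "speedup": 2.4275}
O1_D5_EXT = {"availability": 0.8975, "line_coverage": 0.7015,
             "product": 0.6297, "speedup": 2.3363}

deltas = {key: percent_delta(O1_D5_EXT[key], OG[key]) for key in OG}
for key, value in deltas.items():
    print(f"{key}: {value:+.2f}%")
```

Because the inputs here are already rounded to four decimals, the recomputed figures can differ from the reported row by about 0.01 percentage points.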

For a near Og point (minimal speed loss vs Og but better debug), O1-d3 and O1-d3-ext look attractive: they give +6–7.5% line coverage and +0.2–8.9% debug information availability vs Og, with speed essentially flat or only ~2% slower. We believe O1-d3-ext to be better as it improves availability of variables too, thanks to the large impact on this metric of the additional flag.

For a balanced but more debug-friendly Og, O1-d5-ext pushes debug information availability by about +11% (specifically, +2.1% on availability of variables and +8.43% on line coverage), for a speed loss of ~3.8% over the current Og.
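To see which of the Og-related configurations are Pareto-optimal on the two aggregate axes, one can run a simple dominance check over the absolute values reported above. The sketch below does that; it only looks at the averages in the table (debug information availability and speedup, both higher is better), so it says nothing about per-benchmark variance.

```python
# (debug information availability, speedup), copied from the Og-related table.
CONFIGS = {
    "O1-std":     (0.5148, 2.4805),
    "O1-std-ext": (0.5688, 2.4275),  # current Og
    "O1-d3":      (0.5697, 2.4330),
    "O1-d3-ext":  (0.6192, 2.3782),
    "O1-d5":      (0.5794, 2.3944),
    "O1-d5-ext":  (0.6297, 2.3363),
    "O1-d7":      (0.6209, 2.2185),
    "O1-d7-ext":  (0.6718, 2.1872),
    "O1-d9":      (0.6302, 2.0652),
    "O1-d9-ext":  (0.6814, 2.0059),
}


def dominates(a, b):
    """True if point a is at least as good as b on both axes and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(points):
    """Names of the non-dominated configurations, fastest first."""
    front = [
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    ]
    return sorted(front, key=lambda name: points[name][1], reverse=True)


front = pareto_front(CONFIGS)
print(front)
```

On these averages, O1-std-ext, O1-d7, and O1-d9 fall off the front, in some cases by small margins: O1-d3 edges out O1-std-ext on both axes, O1-d5-ext dominates O1-d7, and O1-d7-ext dominates O1-d9.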

O2g-related Configurations Summary

| configuration | availability of variables | line coverage | debug information availability | speedup |
|---|---|---|---|---|
| O2-std | 0.7802 | 0.6126 | 0.4780 | 2.9199 |
| O2-std-ext | 0.8534 | 0.6253 | 0.5336 | 2.6912 |
| O2-d3 | 0.8176 | 0.6701 | 0.5479 | 2.8164 |
| O2-d3-ext | 0.8788 | 0.6807 | 0.5981 | 2.6549 |
| O2-d5 | 0.8179 | 0.6734 | 0.5508 | 2.8148 |
| O2-d5-ext | 0.8788 | 0.6793 | 0.5970 | 2.6539 |
| O2-d7 | 0.8195 | 0.6788 | 0.5563 | 2.7896 |
| O2-d7-ext | 0.8839 | 0.6798 | 0.6009 | 2.6344 |
| O2-d9 | 0.8384 | 0.6856 | 0.5748 | 2.5159 |
| O2-d9-ext | 0.9010 | 0.6929 | 0.6243 | 2.3918 |

| configuration | availability of variables | line coverage | debug information availability | speedup |
|---|---|---|---|---|
| O2-std | 0.00% | 0.00% | 0.00% | 0.00% |
| O2-std-ext | 9.38% | 2.06% | 11.63% | -7.83% |
| O2-d3 | 4.79% | 9.38% | 14.62% | -3.55% |
| O2-d3-ext | 12.63% | 11.10% | 25.13% | -9.08% |
| O2-d5 | 4.83% | 9.92% | 15.23% | -3.60% |
| O2-d5-ext | 12.64% | 10.88% | 24.89% | -9.11% |
| O2-d7 | 5.04% | 10.79% | 16.37% | -4.46% |
| O2-d7-ext | 13.29% | 10.97% | 25.72% | -9.78% |
| O2-d9 | 7.46% | 11.91% | 20.25% | -13.84% |
| O2-d9-ext | 15.48% | 13.10% | 30.60% | -18.09% |

For O2g-related configurations, all tuned variants strictly improve debug information metrics over O2-std, with controlled slowdowns; again, -ext configurations amplify the gains in debug information metrics at an additional (and in this case large) cost to performance, pushing all -ext configurations past the 5% speed-loss limit suggested for O2g. Relative to O2-std, availability of variables gains range up to about +15.5%, line coverage to +13.1%, and debug info availability to about +30.6%.

For a near-O2 configuration, O2-d3 and O2-d5 give +14–15% on debug information availability (specifically, +4.8% on availability of variables and +9.4–9.9% on line coverage) at only a ~3.5–3.6% loss in speedup.
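The O2g selection can also be phrased as a constrained choice: among the configurations whose slowdown stays within a budget, take the one with the largest debug information gain. The sketch below applies this to the percentage table above with a 5% budget; the function name and the budget parameter are ours, introduced only for illustration.

```python
# (debug information availability delta %, speedup delta %) relative to O2-std,
# copied from the O2g-related percentage table.
O2_CONFIGS = {
    "O2-std":     (0.00, 0.00),
    "O2-std-ext": (11.63, -7.83),
    "O2-d3":      (14.62, -3.55),
    "O2-d3-ext":  (25.13, -9.08),
    "O2-d5":      (15.23, -3.60),
    "O2-d5-ext":  (24.89, -9.11),
    "O2-d7":      (16.37, -4.46),
    "O2-d7-ext":  (25.72, -9.78),
    "O2-d9":      (20.25, -13.84),
    "O2-d9-ext":  (30.60, -18.09),
}


def best_within_budget(configs, max_loss_pct=5.0):
    """Return (best name, eligible names) for a maximum allowed slowdown."""
    eligible = {
        name: vals for name, vals in configs.items()
        if -vals[1] <= max_loss_pct
    }
    best = max(eligible, key=lambda name: eligible[name][0])
    return best, sorted(eligible)


best, eligible = best_within_budget(O2_CONFIGS)
print(best, eligible)
```

On these averages O2-d7 also fits within the 5% budget (-4.46%) and has a slightly higher gain (+16.37%) than O2-d3 and O2-d5, so it may be worth considering alongside them.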

Additional Evaluation Results

We set up a Google Drive folder with all the results in a Google Sheets document and all the Pareto charts, including the O3 configurations that we tested but omitted here.
