tl;dr We (Sony) propose to enable our currently-in-review -fextend-lifetimes Clang flag by default at Og, and also propose a new flag, O2g, to enable comparable performance to O2 with a better debugging experience.
Some background
The Og flag, according to GCC[0], is intended to offer “a reasonable level of optimization while maintaining fast compilation and a good debugging experience”; in practical terms, it is equal to O1, minus some optimization passes that significantly degrade debug info quality. O1 in GCC is intended to “reduce code size and execution time, without performing any optimizations that take a great deal of compilation time”. Although Clang does not try to match GCC’s definitions exactly, they are a useful starting point since users will expect them to have similar meanings across both compilers.
In Clang, the Og flag is currently an alias for the O1 flag, which is currently defined with almost the same goal as Og in GCC: on the website[1] its description is simply “somewhere between -O0 and -O2”, but more recent work[2] by @echristo redesigned O1 to fit with the objective to “Optimize quickly without destroying debuggability.” Currently, O1 is implemented as O2 minus some passes that are slow to run or antagonistic to debug info quality, and most recent patches that have modified O1 have been for the purpose of reducing compile times or improving debuggability[3].
We have recently opened a review that re-proposes a Clang flag, -fextend-lifetimes, originally developed and put up for review by @wolfy1961. The flag artificially extends the lifetimes of values to keep them available to developers for longer; for full details, see the review [4] (review comments are, of course, welcome). We have performed some internal experiments measuring the debug info quality of programs built with different combinations of passes (discussed previously at EuroLLVM 2023 [5]), and found that the -fextend-lifetimes flag has a significant positive impact on debuggability while being very cost-efficient with respect to performance, as compared to disabling optimization passes. Note that it does not prevent optimizations from being run; it only limits them in cases where they would affect debug info. This is very good for debugging, but does not help speed up compilation.
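To make the debugging problem concrete, here is a minimal sketch of the kind of code where value lifetimes matter. It is not taken from the patch; the function names are hypothetical, and whether a given variable is actually lost depends on the compiler version, target, and optimization level.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper, for illustration only.
int add_bonus(int score) { return score + 7; }

// Hypothetical function showing a local whose last use comes early.
int compute_total(int base) {
    int scaled = base * 3;          // `scaled` is last used on the next line
    int total = add_bonus(scaled);
    // A breakpoint on the printf line is past the last use of `scaled`.
    // In an optimized build the storage holding it may already have been
    // reused, so a debugger may report `scaled` as <optimized out>.
    std::printf("total = %d\n", total);
    return total;
}

// Small driver so the sketch can be exercised.
void run_example() {
    int total = compute_total(5);
    assert(total == 22);            // 5 * 3 + 7
}
```

As the post describes it, -fextend-lifetimes is meant to keep values such as `scaled` available for longer, at some cost in run-time performance. The flag is still in review [4], so a given Clang build may not accept it.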
Our proposed change
There are two concerns that the current O1 addresses compared to O2: better debugging and faster compile times. The -fextend-lifetimes flag is positive for debugging, neutral for compile times, and negative for performance. Therefore, we propose to enable -fextend-lifetimes by default at Og: this will slightly degrade run-time performance in exchange for significantly improved debuggability, with our internal tests showing a 2.4% performance cost (increase in benchmark execution time) for a 22.1% improvement in debuggability (breakpoint_locations * variables, see [5]). This provides a suitable differentiation between O1 and Og: Og is preferred when the best debugging experience with some optimizations is desired, while O1 is preferred if faster compile times are the primary concern and debugging is less important than run-time performance, aligning with GCC’s definitions.
The difference in execution speed between programs compiled at O1 and O2 is significant in our testing (~11%; this and all other performance numbers in this section were measured with a general CPU benchmark suite), and the combination of -fextend-lifetimes with O1 widens that gap. For game developers, O2 provides a poor debugging experience, but compiling and debugging games with O1 incurs an unacceptable performance cost. Feedback from developers indicates that a middle ground would be appreciated, where a 5% performance cost compared to O2 would be an acceptable price for better debug info. In our experiments testing Clang’s performance, using the -fextend-lifetimes flag at O2 incurred a performance cost of ~4.7%, making -O2 -fextend-lifetimes a good choice for “close-to-O2 optimized debugging”. I believe it is worth making this a “first-class” optimization configuration: we propose a new optimization flag, O2g, with the stated goal of “Providing performance comparable to O2 with an improved debugging experience”. This would differ from O1 and Og in that it is not expected to reduce compilation time; instead, it attempts to provide the best debugging experience possible while remaining usable by applications that require performance close to O2 to function correctly. This flag has symmetry with the proposed Og above: -Og = -O1 -fextend-lifetimes, -O2g = -O2 -fextend-lifetimes. As a natural extension of this, we could also define -O1g = -Og.
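As a rough worked example of how the numbers above relate to the 5% budget, the sketch below expresses each configuration's execution time relative to O2. It rests on two assumptions that the figures above do not confirm: that the ~11% O1/O2 gap is an increase in execution time relative to O2, and that the 2.4% cost of -fextend-lifetimes at O1 composes multiplicatively with it. All names in the code are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Execution time relative to O2 (O2 == 1.0). The percentages come from the
// post; treating them as multiplicative time factors is an assumption.
constexpr double kO1VsO2 = 1.11;          // O1 vs O2, ~11%
constexpr double kExtendCostAtO1 = 1.024; // -fextend-lifetimes at O1, 2.4%
constexpr double kExtendCostAtO2 = 1.047; // -fextend-lifetimes at O2, ~4.7%
constexpr double kBudgetVsO2 = 1.05;      // 5% cost developers would accept

// Proposed Og = O1 + -fextend-lifetimes (assumed to compose multiplicatively).
constexpr double og_relative_time() { return kO1VsO2 * kExtendCostAtO1; }

// Proposed O2g = O2 + -fextend-lifetimes.
constexpr double o2g_relative_time() { return kExtendCostAtO2; }

// True if a configuration stays within the 5% budget relative to O2.
constexpr bool within_budget(double relative_time) {
    return relative_time <= kBudgetVsO2;
}

// Prints each configuration and whether it fits the budget.
void print_summary() {
    std::printf("O1:  %.3f within budget: %d\n", kO1VsO2, within_budget(kO1VsO2));
    std::printf("Og:  %.3f within budget: %d\n", og_relative_time(),
                within_budget(og_relative_time()));
    std::printf("O2g: %.3f within budget: %d\n", o2g_relative_time(),
                within_budget(o2g_relative_time()));
}
```

Under these assumptions only the O2g configuration stays inside the 5% budget, which is the motivation for proposing it as a separate flag rather than relying on Og.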
Our debuggability experiments have focused on games since that is our use case, so any input or data regarding your own use cases, and how these flags may affect them positively or negatively, is valuable. Although in this proposal both Og and O2g differ from the existing pipelines only by an added flag, they will also be a home for any future changes that are suited to those flags’ stated purposes. For example, if the performance cost of -fextend-lifetimes is reduced, we would have the capacity to remove some more passes from O2g without straying too far from O2, such as jump-threading, which appears to be one of the worst offenders at O2 for debuggability vs performance. As another example, if at some point we add new debug info features that incur further performance costs with a proportionate debug info improvement, we could consider enabling them at Og but not at O1. Conversely, we could add passes to O1 that are bad for debuggability but give good performance gains for small compile-time costs.
All thoughts, comments, and opinions welcome.
[0] Optimize Options (Using the GNU Compiler Collection (GCC))
[1] clang - the Clang C, C++, and Objective-C compiler — Clang 18.0.0git documentation
[2] Proposal for O1/Og Optimization and Code Generation Pipeline
[3] D138455 [WebAssembly] Disable register coalescing at -O1; D101939 [AMDGPU] Disable the SIFormMemoryClauses pass at -O1; D101414 [AMDGPU] Disable the scalar IR, SDWA and load store vectorizer passes at -O1
[4] https://reviews.llvm.org/D157615
[5] https://youtu.be/f1uHy-ukucc