[RFC] Redefine Og/O1 and add a new level of Og

tl;dr: We (Sony) propose enabling our currently-in-review -fextend-lifetimes Clang flag by default at Og, and also propose a new flag, O2g, intended to provide performance comparable to O2 with a better debugging experience.

Some background

The Og flag, according to GCC[0], is intended to offer “a reasonable level of optimization while maintaining fast compilation and a good debugging experience”; in practical terms, it is equal to O1, minus some optimization passes that significantly degrade debug info quality. O1 in GCC is intended to “reduce code size and execution time, without performing any optimizations that take a great deal of compilation time”. Although Clang does not try to match GCC’s definitions exactly, they are a useful starting point since users will expect them to have similar meanings across both compilers.

In Clang, the Og flag is currently an alias for the O1 flag, which is currently defined with almost the same goal as Og in GCC: on the website[1] its description is simply “somewhere between -O0 and -O2”, but more recent work[2] by @echristo redesigned O1 to fit with the objective to “Optimize quickly without destroying debuggability.” Currently, O1 is implemented as O2 minus some passes that are slow to run or antagonistic to debug info quality, and most recent patches that have modified O1 have been for the purpose of reducing compile times or improving debuggability[3].

We have recently opened a review [4] reproposing a Clang flag, -fextend-lifetimes, originally developed and put up for review by @wolfy1961, which artificially extends the lifetimes of values so that they remain available to developers for longer - for full details, see the review (and of course, review comments are welcome). We have performed some internal experiments measuring the debug info quality of programs built with different combinations of passes (discussed previously at EuroLLVM 2023 [5]), and found that the -fextend-lifetimes flag has a significant positive impact on debuggability while being very cost-efficient with respect to performance, compared to disabling optimization passes. Note that it does not prevent optimizations from being run; it only limits them in cases where they would affect debug info. This is very good for debugging, but does not help speed up compilation.
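
To illustrate the kind of situation the flag targets, here is a small hypothetical example (not taken from the review); the file name, symbol names, and the exact compiler behaviour described in the comments are illustrative assumptions rather than guaranteed results:

```cpp
// lifetimes.cpp - built with e.g. `clang++ -O1 -g lifetimes.cpp`.
// After its last source-level use, `scaled` is a candidate for having its
// register reused, so a debugger stopped at the printf below may report it
// as "optimized out". Building with the proposed -fextend-lifetimes keeps
// the value live to the end of its scope, so it remains inspectable, at the
// cost of slightly less efficient generated code.
#include <cstdio>

int process(int input) {
  int scaled = input * 3;      // last use of `scaled` is on the next line
  int result = scaled + 1;
  std::printf("result = %d\n", result);  // breakpoint here: is `scaled` live?
  return result;
}

int main(int argc, char **) {
  return process(argc);
}
```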

Our proposed change

There are two concerns that current O1 addresses compared to O2: better debugging and faster compile times. The -fextend-lifetimes flag is positive for debugging, neutral for compile times, and negative for performance. Therefore, we propose to enable -fextend-lifetimes by default at Og: this will slightly degrade run-time performance for significantly improved debuggability, with our internal tests showing a 2.4% performance cost (increase in benchmark execution time) in exchange for a 22.1% improvement in debuggability (breakpoint_locations * variables, see [5]). This provides a suitable differentiation between O1 and Og: Og is preferred when the best debugging experience with some optimizations is desired, while O1 is preferred if faster compile times are the primary concern and debugging is less important than run-time performance, aligning with GCC’s definitions.

The difference in execution speed between programs compiled at O1 and O2 is significant in our testing (~11%; this and all other performance numbers in this section were measured with a general CPU benchmark suite), and the combination of -fextend-lifetimes with O1 widens that gap. For game developers, O2 provides a poor debugging experience, but compiling and debugging games with O1 incurs an unacceptable performance cost. Feedback from developers indicates that a middle ground would be appreciated, where a 5% performance cost compared to O2 would be acceptable in exchange for better debug info. In our experiments testing Clang’s performance, using the -fextend-lifetimes flag at O2 incurred a performance cost of ~4.7%, making -O2 -fextend-lifetimes a good choice for “close-to-O2 optimized debugging”. I believe it is worth making this a “first-class” optimization configuration: we propose a new optimization flag, O2g, with the stated goal of “providing performance comparable to O2 with an improved debugging experience”. This would differ from O1 and Og in that it is not expected to reduce compilation time; instead it attempts to provide the best debugging experience possible while remaining usable by applications that require close-to-O2 performance to function correctly. This flag is symmetric with the proposed Og above: -Og = -O1 -fextend-lifetimes, -O2g = -O2 -fextend-lifetimes. As a natural extension of this, we could also define -O1g = -Og.

Our debuggability experiments have focused on games since that is our use case, so any input or data regarding your own use cases, and how the use of these flags may impact them positively or negatively, is valuable. Although in this proposal both Og and O2g differ from the existing pipelines only by an added flag, they will also be a home for any future changes suited to those flags’ stated purposes. For example, if the performance cost of -fextend-lifetimes is reduced, we would have the capacity to remove some more passes from O2g without straying too far from O2, such as jump threading, which appears to be one of the worst offenders at O2 for debuggability vs. performance (see the hypothetical illustration below). As another example, if at some point we add new debug info features that incur further performance costs with a proportionate debug info improvement, we could consider enabling them at Og but not at O1 - and conversely, we could add passes to O1 that are bad for debuggability but give good performance gains for small compile time costs.
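
For readers unfamiliar with the pass, here is a minimal sketch of the kind of control flow jump threading rewrites; the example and its effect on the debugging experience are illustrative assumptions, not measurements from our experiments:

```cpp
// jump_threading.cpp - the outcome of the second `if (flag)` is already
// known on each path leaving the first one, so the pass can duplicate and
// retarget blocks to remove the branch entirely. After that transformation,
// a breakpoint on the second `if` may never be hit (or may be hit in only
// one duplicated copy), and single-stepping can appear to skip the check.
#include <cstdio>

void handle(bool flag, int value) {
  if (flag)
    std::printf("first: %d\n", value);
  std::printf("between: %d\n", value);  // unrelated work between the checks
  if (flag)                             // candidate for jump threading
    std::printf("second: %d\n", value);
}

int main(int argc, char **) {
  handle(argc > 1, argc);
  return 0;
}
```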

All thoughts, comments, and opinions welcome.

[0] Optimize Options (Using the GNU Compiler Collection (GCC))
[1] clang - the Clang C, C++, and Objective-C compiler — Clang 18.0.0git documentation
[2] Proposal for O1/Og Optimization and Code Generation Pipeline
[3] ⚙ D138455 [WebAssembly] Disable register coalescing at -O1, ⚙ D101939 [AMDGPU] Disable the SIFormMemoryClauses pass at -O1, ⚙ D101414 [AMDGPU] Disable the scalar IR, SDWA and load store vectorizer passes at -O1
[4] https://reviews.llvm.org/D157615
[5] https://youtu.be/f1uHy-ukucc

Thanks, this is really cool! I think extending lifetimes makes a lot of sense conceptually, and is a great fit for -Og.

Nice stuff!

Seems unfortunate to make -O1g and -O2g - maybe it’d be more suitable to make some more generalized flag (rather than specifically -fextend-lifetimes) that can be composed with -O1 or -O2, like maybe -fprioritize-debuggability? We can have -Og imply that, since there’s prior art for -Og/we already have that, etc, but maybe we can avoid adding -O2g and instead encourage -O2 -fprioritize-debuggability (spelling to be bikeshedded)

This would also be the first divergence between -Og and -O1 - not sure there’s anything actionable there that we need to check off, but it’s something to be aware of.

Regarding O2g, I think the main case for it right now is discoverability; an optimization mode will be more apparent in the documentation, while a separate -f flag would probably be more obscure. In my own experience, I’ve never tried to learn all of the flags that exist in Clang; I usually either search for, or am directed to, a flag for a specific required behaviour. The -fprioritize-debuggability flag, however, would serve a more general purpose, which has more in common with what people expect from optimization modes, i.e. “What qualities do I want my program to be optimized to have?”

Hopefully there hasn’t been too much code written assuming that Og and O1 must be the same, but I believe there are tests that currently use Og that may need to change.

Is it worth adding a release note?

All user-facing changes should come with a release note.

Do we have other examples of a -O flag which enables a language dialect mode? As I understand the feature, this flag is not a conforming language mode because it changes the behavior of well-defined code by changing the lifetime of locals and parameters (which, in turn, could change when destructors are run for example). Or am I wrong about that and this only changes behavior of invalid code?

As I understand the feature, this flag is not a conforming language mode because it changes the behavior of well-defined code by changing the lifetime of locals and parameters (which, in turn, could change when destructors are run for example).

To clarify, the flag doesn’t change the behaviour of well-defined code - extending lifetimes here just means blocking optimizations that would shorten the lifetimes of source variables. Just as the optimizations themselves should not change the behaviour of the original code (excluding UB), this flag shouldn’t either.
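
To make that concrete, here is a small hypothetical example (not from the patch); the behaviour described in the comments reflects the intent of the flag as described above, not a measured result:

```cpp
// guard.cpp - `g`'s destructor runs exactly where the language requires it
// to (at the closing brace of run()), with or without -fextend-lifetimes.
// The flag only affects how long the values of locals such as `g.id` remain
// available to a debugger, not when destructors execute or any other
// observable behaviour of well-defined code.
#include <cstdio>

struct Guard {
  ~Guard() { std::printf("destructor\n"); }
  int id = 42;
};

void run() {
  Guard g;
  std::printf("working\n");  // with the flag, `g` should stay inspectable here
}                            // ~Guard() runs here in both builds

int main() {
  run();
  return 0;
}
```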

Thank you for that clarification! That eases my concerns. :slight_smile:

Apologies, but unfortunately I need to retract the performance stats in the initial post. I’ve made a second attempt at performance benchmarking with a different benchmark suite (the LLVM test suite), and come up with quite different results, in which the differences between the optimization modes are significantly larger. The second set of benchmarks looks closer to other results that have compared O0 to O2, meaning the original stats are probably not reliable, though I’ll continue to investigate until I’m confident I’ve got accurate numbers. The proposal for -Og still stands as-is, as -fextend-lifetimes gives a much better debugging:performance trade-off than disabling any particular set of passes. The proposal for -O2g may change, however: at least for our own case the goal would be to have performance within 5% of -O2, while the current results indicate a cost of 12.3%. We are still principally interested in creating an -O2g option, even if the underlying implementation ends up being different; a suitable alternative may be adding -fextend-this-ptr and disabling some optimization passes. For reference, here are the results I have at the moment (based on current LLVM) for each mode, as a factor of O0 (i.e. an execution time of 0.5 means the benchmark completed twice as fast as at O0):

| Mode | Execution Time | Debuggability | Compile Time |
|------|----------------|---------------|--------------|
| O0   | 1.0000         | 1.0000        | 1.0000       |
| Og   | 0.3439         | 0.5357        | 1.8630       |
| O1   | 0.3082         | 0.4241        | 1.7880       |
| O2g  | 0.2823         | 0.4845        | 3.0420       |
| O2   | 0.2514         | 0.3908        | 2.9380       |

Do we have other examples of a -O flag which enables a language dialect mode?

-Ofast enables -ffast-math and -O4 used to imply -flto.

Thanks for working on this, I am excited to see some movement towards making Og more meaningful. :slightly_smiling_face:

About flags, I am also a bit worried (similar to @dblaikie) about the idea of “multiplying” this concept into the other optimisation levels via -O1g, -O2g, etc.

At least to start out, I think a flag like -fprioritize-debuggability which focuses on the user’s goal is ideal here, and it sounds like at first that would equate to enabling -fextend-lifetimes, but then it could also grow to do more things as we learn of new ways to prioritise debugging.

Since -Og is already (somewhat) understood via GCC to mean “O1 modified for debugging”, then it seems reasonable indeed to make it become -Og = -O1 -fprioritize-debuggability.

For those who want -O2 with good debuggability, I would say for the moment they should learn to write -O2 -fprioritize-debuggability. If we find later on that it’s crucial to make that shorter or more memorable for some reason, then let’s debate that separately.

As a small aside, I am hoping to have something to contribute to this “prioritise debuggability” topic as well in the future. My ongoing research work should lead to tooling that identifies which passes can be trusted to preserve debugging and which cannot. Anyway, I’ll leave further details for a future thread. :wink:

I understand the desire to keep things simpler w.r.t. optimization levels, and as long as the only thing that -O2g would do is enable one or a few flags in addition to -O2, there’s a good argument for not adding it - though I still feel that, in principle, the definition of the flag (essentially “do some unspecified things to improve the debug info quality of the output code”) fits the role of an optimization level better than that of an -f flag.

Beyond the current proposal though, and setting aside conciseness, memorability, or discoverability concerns: would -fprioritize-debuggability still be preferred over -O2g (or some other name) if we were making non-trivial changes to the optimization pass pipeline as well?

I’ll be looking forward to it - more information on the debug info qualities of passes is always appreciated!

Thanks for the work! I share opinions similar to those of @dblaikie and @jryans.

-Og = -O1 -fprioritize-debuggability is fine. -O2g should probably be held off.

An orthogonal -fprioritize-debuggability is easy to turn on and off, since -fprioritize-debuggability pairs naturally with -fno-prioritize-debuggability.

If we add -O2g, there will be an interesting partial overriding concern: does -fno-prioritize-debuggability -O2g let -fno-prioritize-debuggability win or report a warning?

For LTO, we need to map driver -O levels to LTO optimization levels. We now have two dimensions: optimization level and size level. Adding a debuggability dimension will further complicate that.

I agree the “prioritise debugging” feature does feel like it is “about optimisation”, and thus it is initially tempting to want it to start with an O somehow…

Looking at historical choices GCC has made, I get the impression that there’s meant to be a single O option specified to set the overall level, and many other f options either select passes or control finer details.

We could of course do something different over here, but that history to me supports a view of “prioritise debugging” being a “finer detail” controlled via an f option.

In any case, I would defer to @MaskRay and others who have thought about flag names much more than I have.

So far we haven’t mapped anything other than -O0 (oh, and -Oz or -Os?) to LTO optimization levels (-O1, -O2, -O3 don’t persist from a frontend compile to an LTO Backend optimization, right?).

I guess this one could reasonably be like -O0 (-> optnone in the IR) and -Oz (-> optsize in the IR) - perhaps this could be optdebug. Guess that sounds OK to me.

optdebug seems reasonable to me.

Thanks for working on this!
O1 has been way too heavy on the compile times.
So I was unable to use -Og for everyday development and had to revert to -O0.

Unfortunately this change as currently proposed wouldn’t speed up compile times: O1 will be unchanged, and Og would be ~2.5% slower than O1, likely because fewer instructions will be removed and more debug values will be retained, which leaves more work for the optimization passes to do (even if they make fewer changes overall).

It is possible that O1 could have more optimizations removed from it, and Og could track that - it would be useful to know whether slow build times are pushing many people who would otherwise want some optimizations enabled back to O0. What kinds of programs are you building where this is an issue for you?

Typical production codebases, sometimes template-heavy, where nobody ever cared much about optimizing compile times. I have already set up PCH, including -fpch-instantiate-templates (unfortunately CMake still doesn’t support -fpch-codegen), but the backend parts still make a big difference.
As I remember it, GCC’s -Og was more suitable for everyday use - maybe it’s closer to -O0 than to -O1. But the situation may have changed.

In GCC, I would expect Og to build faster than O1, because it improves debuggability by running a subset of the O1 passes. I think if the problem is that O1 is too slow to build with, the solution is probably to remove more passes from O1. According to my results in the table I posted above, O1 is ~1.79x slower than O0 while O2 is ~2.93x slower, so O1 is closer to O0 than it is to O2, but still significantly slower; the question is, for how many users is the current O1 too slow to use?