RFC: Codifying (but not formalizing) the optimization levels in LLVM and Clang

This has been an idea floating around in my head for a while and after several discussions with others it continues to hold up so I thought I would mail it out. Sorry for cross posting to both lists, but this is an issue that would significantly impact both LLVM and Clang.

Essentially, LLVM provides canned optimization “levels” for frontends to re-use. This is nothing new. However, we don’t have good names for them, we don’t expose them at the IR level, and we have a hard time figuring out which optimizations belong in which levels. I’d like to try addressing that by coming up with names and a description of the basic intended goal of each level. If folks are happy with these ideas, I would like to add these descriptions alongside the attributes to the LangRef. Ideas on other (better?) places to document this would be welcome. Certainly, Clang’s documentation would need to be updated to reflect this.

Hopefully we can minimally debate this until the bikeshed is a tolerable shade. Note that I’m absolutely biased by the behavior of Clang and GCC at these optimization levels, and the relevant history there. However, I’m deviating from the purely historical differences to try to better model the latest developments in LLVM’s optimizer… So here goes:

  1. The easiest: ‘Minimize Size’ or ‘-Oz’
  • Attribute: minsize (we already have it, nothing to do here)

  • Goal: minimize the size of the resulting binary, at (nearly) any cost.

  2. Optimize for size or ‘-Os’
  • Attribute: optsize (we already have it, nothing to do here)
  • Goal: Optimize the execution of the binary without unreasonably[1] increasing the binary size.
    This one is a bit fuzzy, but usually people don’t have a hard time figuring out where the line is. The primary difference between minsize and optsize is that with minsize a pass is free to hurt performance to shrink the size.

[1] The definition of ‘unreasonable’ is of course subjective, but here is at least one strong indicator: any code size growth which is inherently speculative (that is, there isn’t a known, demonstrable performance benefit, but rather it is “often” or “maybe” a benefit) is unlikely to be a good fit in optsize. The canonical example IMO is a vectorizer – while it is reasonable to vectorize a loop, if the vector version might not be executed, and thus the scalar loop remains as well, then it is a poor fit for optsize.

  3. Optimize quickly or ‘-O1’
  • Attribute: quickopt (this would be a new attribute)
  • Goal: Perform basic optimizations to improve both performance and simplicity of the code, but perform them quickly.
    This level is all about compile time, but in a holistic sense. It tries to perform basic optimizations to get reasonably efficient code, and get it very quickly.

  4. Good, well-balanced optimizations, or ‘-O2’
  • Attribute: opt (new attribute)
  • Goal: produce a well optimized binary trading off compile time, space, and runtime efficiency.
    This should be an excellent default for general purpose programs. The idea is to do as much optimization as we can, in as reasonable a time frame, and with as reasonable a code size impact as possible. This level should always produce binaries at least as fast as optsize, but they might be both bigger and faster. It should also always produce binaries at least as fast as quickopt, though they may be slower to compile.

  5. Optimize to the max or ‘-O3’
  • Attribute: maxopt (new attribute)
  • Goal: produce the fastest binary possible.
    This level has historically been almost exclusively about trading off more binary size for speed than ‘-O2’, but I would propose we change it to be more about trading off either binary size or compilation time to achieve a better performing binary. This level should always produce binaries at least as fast as opt, but they might be faster at the cost of being larger and taking more time to compile. This would in some cases be a change for LLVM and is definitely a deviation from GCC, where O3 will in many cases produce slower binaries due to code size increases that are not accompanied by corresponding performance increases.

To go with these LLVM attributes I’d like to add support for attributes in Clang, both GCC-compatible ones and ones with the names above for clarity. The goal is to allow a specific function to have its optimization level overridden relative to the command-line level.

A final note: I would like to remove all other variations on the ‘-O’ flag. That includes the really strange ‘-O4’ behavior. Whether the compilation is LTO should be an orthogonal decision to the particular level of optimization, and we have -flto to achieve this.
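For concreteness, here is a sketch of what this could look like at the IR level. minsize and optsize are real function attributes today; quickopt, opt, and maxopt are only the names proposed above and are hypothetical:

```llvm
; Existing function attributes (valid IR today):
define void @tiny() minsize {   ; -Oz: shrink at (nearly) any cost
  ret void
}
define void @small() optsize {  ; -Os: optimize without unreasonable growth
  ret void
}

; Proposed attributes (hypothetical; names exist only in this RFC):
define void @fast_compile() quickopt { ret void }  ; -O1
define void @balanced() opt { ret void }           ; -O2
define void @fastest() maxopt { ret void }         ; -O3
```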

-Chandler

As a user I like this; it is hard to understand what each level does. I know projects that are using O3 because ‘more must be better’. I don’t know how to explain that it might be slower, and measuring performance is tricky. (Many programs do multiple things, and spend most of their time waiting on the event loop.)

Would it be unreasonable to ask for a new/separate set of optimizations: optimize debug? This would apply aggressive optimizations without “significantly” changing the order of the code.

I don’t know the optimizer, but I know as a user of compilers that minimal optimization is often the difference between painfully slow program execution and okay performance. However, debugging optimized programs can be difficult because the debugger jumps all over, making the problem hard to understand.

I’ll leave it to experts to debate shades.

If I understand the attributes correctly, they would be function-level attributes applied to IR functions, correct? I’m curious what the semantics would be for cross-function optimization. For example, consider a function “foo” defined with maxopt and a function “bar” defined with optsize. If foo() calls bar() and the inliner wants to inline bar() into foo(), is that legal? If so, that may cause odd effects as you may perform expensive optimizations later on the inlined version of bar(), even though the original function is marked optsize.

Also, a nit-pick: can we make the naming consistent? It feels a bit weird to have maxOPT and OPTsize. Perhaps use sizeopt and minsizeopt, or optmax and optquick?

Would it be possible to have a parameterised attribute? opt(min), opt(size), opt(max) ?

Would it be unreasonable to ask for a new/separate set of optimizations:
optimize debug? This would apply aggressive optimizations without
"significantly" changing the order of the code.

This will be interesting, but knowing how LLVM can segfault when commenting
out one or another pass from a sequence, I'd think that it'd take a long
time before we could test *every* combination of these "nice to have"
optimization profiles. Basically, you take all the tests we have today and
duplicate them for each profile.

I don't know the optimizer, but I know as a user of compilers that minimal
optimization is often the difference between painfully slow program
execution and okay performance. However, debugging optimized programs can be
difficult because the debugger jumps all over, making the problem hard to
understand.

I don't think you really need a performing debug image, though. Debuggers
tend to be so slow that the performance of the image is irrelevant.
Normally, the extra time comes from the number of breakpoints or watchpoints
you have set, and not from the image itself... ;-)

--renato

The "always" part cannot really be guaranteed or enforced. I'd state it in terms of intention, i.e. "this level is intended to produce binaries at least as fast as quickopt". Otherwise, the wording may imply that it is a compiler bug if there exists a binary that runs slower at -O2 than at -O1.

-Krzysztof

From: "Chandler Carruth" <chandlerc@gmail.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "clang-dev Developers" <cfe-dev@cs.uiuc.edu>
Sent: Monday, January 14, 2013 3:09:01 AM
Subject: [LLVMdev] RFC: Codifying (but not formalizing) the optimization levels in LLVM and Clang


I agree that -O4 and LTO should be separated. In general, we need an easy way for passes (like the vectorizer, loop unroller, etc.) to know not only the current optimization level (which should be easy if these are all function attributes) but also whether they are executing before or during the "final" optimization phase (because there are some things you don't want to do if you'll be later using LTO).

Nevertheless, I think that we definitely do want optimization levels above -O3. None of these should make code slower, but they would provide a rough indication of how long the user is willing to wait for the compilation process to complete. For example, currently when -vectorize is given, not only is basic-block vectorization enabled, but so are extra invocations of EarlyCSE and InstCombine. These latter two passes run even if the BBVectorizer does nothing (which indicates another needed infrastructure improvement, but that's another story), and sometimes produce a small speedup independent of the vectorizer. Because of the compile-time impact, there might not be sufficient justification for enabling these passes by default at -O3, but it would make sense to do so at a -O4 level: the user wants the compiler to 'try harder'.

There are several parts of the compiler that are essentially solving combinatorial optimization problems. Basic-block vectorization is an obvious example, but instruction scheduling can also fall into this category, as can some forms of reassociation, inlining, etc. We currently use heuristics and cutoffs to deal with this, but given the asymptotic scaling of each algorithm, we could set these cutoffs, and choose the approximation algorithm to use, based on the current optimization level. I'd recommend something like this: -O4 can take twice as long as -O3, -O5 can take twice as long as -O4, etc.

-Hal

Chandler Carruth <chandlerc@gmail.com> writes:

4) Good, well-balanced optimizations, or '-O2'
- Attribute: opt (new attribute)

Since all other levels have a qualifier, I'd suggest we do that here as
well. Perhaps balancedopt?

5) Optimize to the max or '-O3'
- Attribute: maxopt (new attribute)
- Goal: produce the fastest binary possible.

This level should always produce binaries at least as fast as opt, but
they might be faster at the cost of them being larger and taking more
time to compile.

That's almost impossible to achieve if you include things like
vectorization. It is often difficult to know statically whether
vectorization will help or harm. One can do runtime code selection but
that has its own costs and can be slower than O2 in some cases.

I simply don't think it's a useful or achievable guarantee. It's a
good goal, but I would be against making it a requirement, as it would
severely limit what we can do at O3 over O2.

FWIW, the Cray compiler considers O3 to be "try to make it as fast as
possible without unreasonably increasing compile time" but does not
guarantee it absolutely will be faster than -O2. We have -Oaggress for
"pull out all the stops, take days to compile the code and you get what
you get." :-)

A final note: I would like to remove all other variations on the '-O'
flag. That includes the really strange '-O4' behavior. Whether the
compilation is LTO should be an orthogonal decision to the particular
level of optimization, and we have -flto to achieve this.

Agreed.

                                     -David

Renato Golin Linaro <renato.golin@linaro.org> writes:

I don't think you really need a performing debug image, though.

Oh, you'd be surprised. We have hit many cases where debugging a
completely unoptimized program is so slow it's not worth doing. Think
about large parallel codes with millions of threads running on hundreds
of thousands of sockets.

Debugging even minimally optimized code can be a huge productivity win.
Plus optimization tends to expose things like race conditions in
parallel codes.

                             -David

That's what -O1 is for, isn't it?

What I meant is that there is little need for an -O3-no-reorder
optimization level. Maybe in extreme cases, but I wouldn't bet on it.

cheers,
--renato

Renato Golin Linaro <renato.golin@linaro.org> writes:

What I meant is that there is little need for an -O3-no-reorder
optimization level. Maybe in extreme cases, but I wouldn't bet on it.

Ah, I misunderstood. I completely agree with this.

                           -David

This doesn't appear to be documented
<http://llvm.org/docs/LangRef.html#function-attributes>.

-- Sean Silva

FYI

Hi Chandler,

All of the below sounds completely reasonable to me. In particular, I'm in favor of the clarifications/adjustments to O3 and the removal of O4.

-Jim

If I understand the attributes correctly, they would be function-level
attributes applied to IR functions, correct? I'm curious what the
semantics would be for cross-function optimization. For example, consider
a function "foo" defined with maxopt and a function "bar" defined with
optsize. If foo() calls bar() and the inliner wants to inline bar() into
foo(), is that legal? If so, that may cause odd effects as you may perform
expensive optimizations later on the inlined version of bar(), even though
the original function is marked optsize.

This is a great question. My plan would be: inlining doesn't impact the
attributes. The inliner will be free to look at both the caller and the
callee's attributes to choose the best inlining decision.
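The scenario in question, sketched at the IR level (optsize is a real attribute; maxopt is only the name proposed in this RFC):

```llvm
; 'bar' is compiled for size; 'foo' wants maximum speed.
define i32 @bar(i32 %x) optsize {
  %r = add i32 %x, 1
  ret i32 %r
}

define i32 @foo(i32 %x) maxopt {   ; hypothetical attribute
  ; If the inliner folds @bar in here, the inlined body would then be
  ; optimized under @foo's level, even though @bar asked for optsize.
  %r = call i32 @bar(i32 %x)
  ret i32 %r
}
```

Under the plan above, the attributes themselves are unchanged by inlining; the inliner simply consults both when making its decision.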

Also, a nit-pick: can we make the naming consistent? It feels a bit
weird to have maxOPT and OPTsize. Perhaps use sizeopt and minsizeopt, or
optmax and optquick?

Meh. I don't care really. It would require changing existing attributes,
but we can do that. I think the most readable structure is the first:

minsizeopt
sizeopt
quickopt
opt
maxopt

I'd like to hear some support for one or the other of these before deciding.

Yes, but I don't think it really makes any difference.

Chandler Carruth <chandlerc@gmail.com> writes:

minsizeopt
sizeopt
quickopt
opt
maxopt

I prefer being consistent and putting "opt" at the end.

I would still like something other than "opt" for the fourth one. "opt"
seems too generic given the other levels.

                             -David

I agree that having just “opt” seems too generic in comparison. Maybe something like “stdopt” (or even the longer “balancedopt”) since it corresponds to -O2 and is intended to represent a good optimization level for most cases with a balance of compile time, space, and runtime efficiency.

  • Kaelyn

GCC tries to make -O1 compatible with debugging. I find having debugging functional with some level of optimisation very useful, particularly in template-heavy C++ code. I don’t know if there is an optimisation level with similar behaviour in Clang.

The problem is not so much with the inlining decisions, as much as it is with keeping the attributes in the inlined code. If you have a function foo, which is to be compiled at O2, and it's called from bar, whose optimization level is O3, then you can take the O2 code and insert it in the middle of the O3 code. The problem is that in certain cases, compiling at O3 is not acceptable, so we can't just "upgrade" the opt level from O2 to O3.

-Krzysztof