[RFC] Making the pass manager aware of function-level optimization attributes

Background:
This work is a part of Google Summer of Code (GSoC) 2023. More details can be found here.

Long term, we aim to augment the heuristics approach that LLVM currently uses with an ML-based approach for selecting which optimizations to run, and this proposed change is a (small) step in that direction.

As a first step, we plan to mark functions with a desired optimization level (such as O1, O2, O3) via attributes. The ability to control the marking (albeit currently via a CSV file, and only for testing) has been made available with this commit. A similar idea was presented in 2013 (https://llvm.org/devmtg/2013-04/karrenberg-slides.pdf).

Proposed Changes:
As a next step, we plan on modifying the FunctionPassManager. In the llvm/lib/Passes/PassBuilder.cpp file, addPass(…) will be modified to optionally take a set of optimization levels at which the given pass should be run. So, if a pass is not to be run at O3, O3 would be missing from that set.
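To make this concrete, here is a rough, self-contained sketch of the idea; OptLevel, LevelSet, and PassManagerModel are placeholder names for illustration, not the actual LLVM types or the final signature:

#include <bitset>
#include <cstddef>
#include <vector>

enum class OptLevel { O1 = 0, O2 = 1, O3 = 2 };
using LevelSet = std::bitset<3>; // one bit per OptLevel

struct PassManagerModel {
  std::vector<LevelSet> Levels; // Levels[i]: levels at which pass i may run

  // A pass added without an explicit set runs at every level (today's
  // behavior); otherwise it is only scheduled where its bit is set.
  void addPass(/* PassT &&Pass, */ LevelSet RunAt = LevelSet().set()) {
    Levels.push_back(RunAt);
    // ...store the pass itself as it is done today...
  }

  bool shouldRun(std::size_t PassIdx, OptLevel FnLevel) const {
    return Levels[PassIdx].test(static_cast<std::size_t>(FnLevel));
  }
};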

The run(…) in the PassManager will be made aware of this set. In the llvm/include/llvm/IR/PassManager.h file, we check for the “opt-level” attribute added to each function in the IR (see here and here). Based on the value it holds (O1, O2, or O3), we add the appropriate passes to the pipeline that runs for each function.
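A hedged sketch of the attribute lookup itself, using existing Function/Attribute APIs; getFunctionOptLevel is a made-up helper name, and falling back to the user-specified level for unannotated functions is an assumption about the intended default:

#include "llvm/ADT/StringSwitch.h"
#include "llvm/IR/Function.h"
#include "llvm/Passes/OptimizationLevel.h"
using namespace llvm;

// Map the per-function "opt-level" string attribute to an OptimizationLevel,
// falling back to the user-specified level for unannotated functions.
static OptimizationLevel getFunctionOptLevel(const Function &F,
                                             OptimizationLevel Default) {
  if (!F.hasFnAttribute("opt-level"))
    return Default;
  return StringSwitch<OptimizationLevel>(
             F.getFnAttribute("opt-level").getValueAsString())
      .Case("1", OptimizationLevel::O1)
      .Case("2", OptimizationLevel::O2)
      .Case("3", OptimizationLevel::O3)
      .Default(Default);
}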

To maintain compatibility with the current state of the compiler, there will still be a provision for unannotated IR, which defaults to the current behavior (the user-indicated level). For testing, the optimization levels are read from a CSV file, but this will eventually be replaced by an Analysis that uses a pre-trained ML model to determine the optimization level to apply to each function. We will also provide a way to remove the attached “opt-level” attribute once the optimization pipeline has run.

Changing the set of passes we run based on per-function attributes is something people have wanted for a while. For example, I mentioned it here. Doing that in general would be awesome.

Your specific proposal with addPass is different from what I was envisioning. I was envisioning a function adaptor that would choose between various pipelines based on function attributes. For example, we separate out the -O1 function simplification pipeline because it’s easier to visualize and modify the set of passes that run for -O1. -O2/3/s/z share the same buildFunctionSimplificationPipeline() because they’re pretty much identical except for a couple of things here and there. With the addPass change it seems hard to get a high-level view of the pipelines, since in the code you’ll end up with -O1 passes interspersed with the other pipelines.

[WIP] Example of dynamic function simplification pipeline · aeubanks/llvm-project@9398580 · GitHub is sort of what I was thinking (that only dynamically chooses -O1 vs -O3 but you get the idea). Running opt -debug-pass-manager=quiet -O3 -disable-output /tmp/a.ll on

define void @f() "opt-level"="1" {
  ret void
}
define void @g() "opt-level"="3" {
  ret void
}

you’ll see that GVN only runs on @g and not @f.
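For illustration, a minimal hedged sketch of what such an attribute-driven adaptor could look like; OptLevelSwitchPass and the O1-vs-O3 policy are illustrative only, not the actual linked patch, and building the two FunctionPassManagers (e.g. via PassBuilder) is left out:

#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"
using namespace llvm;

// A function "adaptor" pass that owns two pre-built pipelines and dispatches
// per function based on the "opt-level" attribute.
class OptLevelSwitchPass : public PassInfoMixin<OptLevelSwitchPass> {
  FunctionPassManager O1Pipeline;
  FunctionPassManager O3Pipeline;

public:
  OptLevelSwitchPass(FunctionPassManager O1FPM, FunctionPassManager O3FPM)
      : O1Pipeline(std::move(O1FPM)), O3Pipeline(std::move(O3FPM)) {}

  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
    StringRef Level = "3"; // unannotated functions keep the full pipeline
    if (F.hasFnAttribute("opt-level"))
      Level = F.getFnAttribute("opt-level").getValueAsString();
    return Level == "1" ? O1Pipeline.run(F, FAM) : O3Pipeline.run(F, FAM);
  }
};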

Hmm - sorry, shooting from the hip here real quick.

My understanding from previous design discussions (some time ago, admittedly) is that this was undesirable, at least to Chandler/some other folks back in those old discussions.

The idea was that something like -O0 (optnone) or -Oz (optsize) had some semantic behavior the user wanted - and -Og (optdebug - recently introduced) would fit in that idea. But arbitrary optimization levels (O1 vs O2 vs O3, etc.) didn’t meet that kind of requirement, and we didn’t want to open things up to that level of customization at the IR level (especially since a lot of the optimizations are cross-function anyway).

I’ll see if I can find those discussions - though perhaps they’re just out of date/different people who aren’t here anymore.


At a higher level, I’d say that just sticking the numbers 1/2/3 onto individual functions isn’t really expressive in a way that makes it clear what IR optimizations are expected to do, especially when IPO passes become involved. We want the attributes to express goals: how much do you care about codesize? How much do you care about debuggability? How much do you care about compile speed? Then passes and the pass manager can adapt as needed.

If you’re trying to couple ML to the pass manager, your goals might be different: if you’re training a model based on the specific behavior of some specific passes as implemented in some specific version of LLVM, finer-grained customizations probably make sense. Ensuring the attributes remain meaningful across versions of LLVM isn’t relevant because the model is trained on some specific version of LLVM anyway.


In these kinds of extensions, I generally think people are asking for too many conflicting things at the same time.
I don’t believe we can be fine-grained, generic, stable, non-arbitrary, IPO-aware, …; we need to pick.

Let’s recap where we are:
-Og/z/0/1/2/3 is what the user right now has for picking a pipeline via clang.
These options are coarse-grained, mostly but not fully stable, mostly arbitrary, and some of them are IPO-unaware, at least in an LTO setting.

At the same time, we don’t really want to give arbitrary control to the user, as testing and debugging will become a nightmare.

I would suggest sticking with the existing options for an initial prototype, then determining how one would like to generalize this idea beyond this proposal. IIRC, we had initial numbers based just on O1/2/3 and they looked promising wrt compile time, even if the solution is not perfect.

Once we have gained experience, we can ask ourselves whether we want -O3a-z or -O3pointer-chasing in clang.
Or maybe we want the user to specify full or partial pipelines? etc.

I don’t think we should get hung up on the specific 1/2/3 numbers; that’s what we have today, and the RFC uses them as a stepping stone. The mechanism in the RFC, however, allows us to eventually express more nuanced things like you said.

Just to nuance something that may be an inaccurate assumption in the last statement: models trained at a particular point in time have been quite resilient to changes in both the compiler and the underlying projects. With both ML-driven inlining for size and ML-driven regalloc eviction, we’ve seen quite good resilience of the model to LLVM changes. We haven’t needed to retrain the models so far (one model is 2 years old, the other 1.5 years old; that doesn’t mean they wouldn’t benefit from retraining, just that they still perform).

Using “numbers” does not seem too bad; however, I don’t understand why we would tie these to the optimization-level semantics used for building the pipeline. If we intend to express a tradeoff between compile time and performance, any arbitrary scale can be used, and something along the lines of what @efriedma-quic suggested makes much more sense to me:

I’m not saying we should not use a scale, or multiple for that matter. I am very much in favor of exploring options to expand on O1-3.

What I am trying to say is that we should do that after we have the capabilities in place wrt. the current way of selecting pipelines, which is O1-3.
Arguably, the entire scale idea would similarly be useful for clang, so adding it should also involve the trade-offs around having it only in the middle end / ML pipeline vs having it in clang as well.

I’m not sure we’re on the same page here: there is an objection to making the PassManager aware of O1/O2/O3.

I see.

So you are saying teaching the PM about a scale 0-N is better than the existing pipelines, O1/2/3?
If we squint, O1/2/3 is already a scale with 3 entries, no?

Sure: you can use this as a proxy; one point is to decouple these concepts.

That said, I can’t say that just decoupling this would be enough; the whole concept here seems quite intrusive on the PassManager, and it should probably start with a design discussion about the role and responsibility of the pass manager and how the proposed concept (PassesOptimizationLevelsMap, etc.) evolves it.
Right now the proposal is a bit light on this aspect: I understand the desire for the feature, but just from the post here, this seems like “taking the shortest path through the API to achieve it” (modifying addPass()).

I might have missed it; do you have an alternative idea?

A single pass is used in different “situations”, e.g., multiple locations in the pipeline with O3, or with different configurations. So, an “addPass” interface seems not only “the shortest path” but also adequate, as it is the entry point that encodes the pass and the “situation/configuration”.
Our other ML-based “pass skipping” work also looked at sequences of passes in the pass manager. It seems to me that this is the right place to hook up a selection algorithm, assuming we want to “augment” existing pipelines (which is what we try here, as it is less scary than completely new pipelines).

Not sure what you’re expecting here; I haven’t paged in the entire context of what you’re trying to spin here. But fundamentally, if someone wants to fundamentally change the contract of the pass manager, it’s on them to do the legwork on the design exploration and make the relevant proposal, starting, as I hinted, with the current “role and responsibility of the pass manager” as a component (scheduling passes) and how it should evolve (with a new definition of “scheduling” and new dynamic criteria for interactions between the IR units and the pass manager). This seems more fundamental to me than “a (small) step” to support some experimentation, as it is presented in the original post here.

On the O1/O2/O3 vs. a different scale (which is orthogonal to what I’m referring to above): I can’t reconcile right now how a function could be tagged with an O2 attribute while the set of transformations performed would depend on how the pass manager is set up: O2 would mean different things in different contexts, which would be highly confusing.

Hm. It seems this goes down a rabbit hole and I don’t understand why.

What you are saying, I think, is that the pass manager so far just schedules the passes that the pass pipeline has queued up. And while the pass pipelines are O1/2/3 “aware” (or simply different for O1/2/3), the pass manager is not. Making the pass manager O1/2/3 aware is therefore a fundamental change. Did I get this right? I want to ensure we are talking about the same issue(s) before I try to provide alternatives and compare them to the proposed solution.

A function with an O2 tag would be optimized with min(O2, <user-O-level>), so:
-O1 + O2-tag → O1
-O2 + O2-tag → O2
-O3 + O2-tag → O2
Thus, naming it “max-O2-tag” would arguably be clearer.
Now that I think about it, it’s like optnone is a max-O0 tag, basically.
Does that remove some of the confusion?

That is my understanding of the contract provided by the pass manager historically (it’s a bit more nuanced; it supports analyses as well, for example). I’m not saying we absolutely can’t or shouldn’t change it, but this is a non-trivial change, and it deserves some explicit redefinition.

I tried to keep the specific concern about O1/2/3 vs. another way to filter (an arbitrary number or another annotation) separate from the change in the pass manager’s role. I would say the fundamental change is in how we’d like the scheduling to take the IR itself into account (the pass manager introspecting IR units to decide whether to run a pass or not, whether based on an attribute or something else), and how we express this (for example, you propose that this is part of the “addPass” API with an optional “level” provided, but why not a callback provided by the user, for example?).
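To make that alternative concrete, here is a small self-contained sketch; PassFilterCallback, shouldSchedule, and FilteringPassManager are illustrative names only, not existing LLVM API:

#include <functional>
#include <string>

namespace example {
class Function; // stand-in for llvm::Function

// A user-provided predicate consulted before scheduling a pass on a function.
using PassFilterCallback =
    std::function<bool(const std::string &PassName, const Function &F)>;

struct FilteringPassManager {
  PassFilterCallback ShouldRunPass;

  // With no callback installed we keep today's behavior and run everything;
  // otherwise the callback decides per (pass, function) pair.
  bool shouldSchedule(const std::string &PassName, const Function &F) const {
    return !ShouldRunPass || ShouldRunPass(PassName, F);
  }
};
} // namespace example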

A little bit, but I don’t believe you’ll solve it by renaming opt-level to max-opt-level here.
Fundamentally, the concept of O1/O2/O3 does not exist in the pass manager, and different LLVM-based compilers build different pass pipelines without referring to this C-compiler terminology.

These are fair points.

What we propose now is that the pass manager can be made aware of an optimization level.
(I think we agree O1/2/3 is just what we start with, it’s clear that O0-100 can be done the same way.)
The awareness alone does not impact your program and is opt-in, based on the configuration of the pass pipeline. So, the pipeline creator can attach levels to parts of the pipeline to allow users to effectively opt out of those parts for some functions. Scheduling happens as usual if opt-out did not happen. If opt-out happened, the pass is not scheduled for the IR unit.
There is arguably an (opt-in) change made here, but I think it is the best place to do it, assuming we want more fine-grained control over optimization levels on a smaller-than-TU level.
Looking at alternatives, I can see two main ways.
Interestingly, both of them are actually employed already, but both are hard to generalize.

In my mind, the current flow of things is like this:

Pipeline -> PassManager -> runPassOnIRUnit
                /\  
                ||
              IR Unit
  1. We already create different pipelines based on the optimization level. This is our main way of optimizing differently. We have basically made the first part of the flow chart above “level-dependent”, while the rest is not. However, that part is reused across the entire module as of now, and if we want finer control we’d need multiple pipelines for different parts of the TU. Specifying that is reasonably complex, but orchestrating it is the main challenge. There are dependences (of different kinds) between the passes run on different parts of the TU (which would then be dependences between passes run in different pipelines), which make it hard to imagine how one would schedule things. Further, an unaware pass manager would need to reconcile IR units split across multiple TU parts, e.g., if the components of an IR unit are supposed to use different pipelines.
  2. For optnone we moved the burden of checking it into the runPassOnIRUnit part of each pass. So, we duplicate the handling, even if we hide it behind a “shouldRun” method. This works reasonably well if we have two categories of passes, the 95% that should ignore optnone functions and the 5% that should not. However, at that point, i.e., in the pass, we don’t know anymore what the user-chosen level is. All we know is that this pass was picked. Since most passes run at different levels and multiple times, it is not easy anymore to decide whether or not to run on a given IR unit. If we had O1 passes, O2 passes, … (like we have O0 passes that don’t check optnone), this would be much easier and a valid alternative (a rough sketch of this per-pass check follows below).
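A minimal sketch of what such a per-pass check could look like with the proposed attribute; SomeTransformPass and its run-only-at-“3” policy are purely illustrative:

#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"
using namespace llvm;

// Each pass checks the per-function attribute itself and bails out early,
// similar to how optnone is honored today.
class SomeTransformPass : public PassInfoMixin<SomeTransformPass> {
public:
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    // The pass itself cannot know which user-chosen pipeline scheduled it;
    // all it can do is consult the function's annotation.
    if (F.hasFnAttribute("opt-level") &&
        F.getFnAttribute("opt-level").getValueAsString() != "3")
      return PreservedAnalyses::all();
    // ...the actual transformation would go here...
    return PreservedAnalyses::all();
  }
};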

I hope this makes some sense.

Right, the pass manager does not have O1/2/3 awareness right now. We are trying to add that as a first step: awareness of some sort of “optimization level”. As you said, not all compilers have the same pipelines, which is fine though. They can (1) not add any “level” to their passes (via addPass) and keep the behavior they have right now, regardless of the annotation on functions, or (2) add a level based on their understanding, such that the function annotation will be interpreted in their context. Both the annotation on the function and the classification of a pass are opt-in and based on the context of the compilation.

Mostly lurking, but a few things to point out.

First, clang’s notions of O1/2/3 do not exactly equal LLVM’s notions of O1/2/3. Clang builds its own pipelines; it doesn’t use LLVM’s (occasionally a nuisance to remember when trying to use opt to reproduce something). IIRC there was some work recently to kind-of (but I think not completely) divorce the two notions. But if you’re looking at number-based function attributes, remember that the meanings don’t line up exactly.

This can be even more true of other frontends, especially if they’re being ported from other backends. I’ve worked with compilers that had only 0/1/2, and others that had 0-5. When the frontend builds its own pipeline, this is not a problem. When you start wanting to tweak things based on a backend’s notion of optimization “level”, then it will quickly become confusing.

I would therefore have the opinion (if I worked in this area enough to have an opinion that should matter) that number-based function attributes are not a great idea.

Second, some history related to optnone (although it has been a while and I might need correction from others who were there at the time). Originally we proposed that the pass manager would make the decision; a pass would declare itself as “mandatory” (or something like that) if it had to run even at -O0. The PM would run only the mandatory passes on functions marked optnone. Chandler was opposed to that idea, at least partly because he was still designing the NPM and didn’t want to expand the “surface area” of that API. Making it each new pass’s responsibility to pay attention to optnone seemed like not such a great design, but it was the only way to get it in at the time.

Subsequently, of course, that mechanism has been leveraged to support pass bisection, which has been a useful tool. Not saying it couldn’t be redesigned to be something the PM supported directly, just that it’s been handy to have.

Lastly: Something like an ML-driven pipeline seems like it doesn’t neatly fall into a 0/1/2/3 categorization. Maybe I’m just too unlearned about this stuff, but I’d expect it to pick and choose at the individual pass level, and some arbitrary (and inconsistent between clang/LLVM) numbering scheme doesn’t seem like a great fit.


The nearest history of optnone I could find was the RFC thread: https://groups.google.com/g/llvm-dev/c/iMGAAYUVN3k/m/DHieNkbmRNAJ - I guess any discussion that this shouldn’t generalize to O1/2/3 wasn’t in the thread; it might’ve been an in-person discussion, etc. @chandlerc - just on the off chance he’s still got any history of that/wants to impart those thoughts - otherwise I certainly don’t have enough context & will leave it to others like @aeubanks

Arguably, all we have right now are these numbers (and s/g/z/fast). What I’m trying to say is that adding some new “notion” of optimization level seems orthogonal to this proposal. All the downsides of numbers that people mention here are the same downsides the numbers have right now in our existing tools. You cannot assume O2 clang is equal to O2 gcc, or even O2 armclang. O2 is also not stable over time. Different backends interpret O2 differently. O2 clang and O2 opt are potentially different. O3 is not always faster than O2. O0 runs passes. O-level defaults are different. Etc.
To summarize, I believe there is agreement that numbers are not the best thing to solve this, but they are what people are familiar with. If I asked someone to optimize a single function out of a TU more than the rest, I’d assume they would split it into a new file, run O3 on that file, and O1 on the original TU. It’s not like they have many other options.

That all said, I think once we have a new notion for optimization level we could easily use it to skip passes via this mechanism.

Oh, right, that would have been when we figured out that -O0 -flto didn’t preserve the O0-ness at LTO time, and clang started adding optnone to all the functions compiled at O0.
