IPO for linkonce_odr functions

Hi Everyone,

We recently started a discussion about IPO for linkonce_odr functions in ⚙ D83906 [CodeGen] Emit a call instruction instead of an invoke if the called llvm function is marked nounwind . The problem originates from a front-end interprocedural optimization that is incomplete and can generate unstable IR for a small source change (e.g., adjusting function definition order). As a consequence, this can affect profile staleness for AutoFDO. It may also be a problem for caching. We were discussing the idea of introducing a cc1 flag (say -fgenerate-stable-IR) to disable such front-end IPO or refinement, which could be used to produce stable IR. I would like to reach out to a broader audience for brainstorming.

More context:

In D83906#4179512, @wlei wrote:
We recently hit an issue of inconsistent codegen related to this optimization. In one build, the Clang frontend generated different LLVM IR for the same function originating from a single header file. It turned out this optimization gives different results for different function definition orders, which is inherently unstable.
See these two repro programs:
p1.cpp: Compiler Explorer

void foo() {};
void bar() noexcept {foo();};

p2.cpp: Compiler Explorer

void foo();
void bar() noexcept {foo();};
void foo(){};
Note that the codegen for bar differs. In p2.cpp, the callee (foo)'s definition comes after the caller (bar), so foo is not yet known to be nounwind when bar is emitted, and Clang still generates the invoke-based code.

Besides the -fgenerate-stable-IR switch, we have also observed that the LLVM IR optimizer, especially the pre-LTO or non-LTO optimizer, doesn’t handle linkonce_odr functions as effectively as the front-end does. There was a long discussion about that in ⚙ D18634 Don't IPO over functions that can be de-refined , and I believe that’s the reason some IPO peepholes ended up implemented in the front-end. Given the new switch and potentially more work to disable the front-end refinements, we see an opportunity in bringing IPO for linkonce_odr functions back into LLVM. Our initial experiments demonstrated promising performance wins and smaller code sizes, even for ThinLTO, where it could benefit the pre-link IR optimizer. I would like to get feedback about the approach and about potential ways to convey the “non-refined” nature of the IR to the LLVM optimizer, e.g., through a new module flag.

cc @wlei @WenleiHe @dexonsmith @sanjoyd @rjmccall @ahatanak

Thanks,
Hongtao


Could you elaborate on this, please? Is the invoke vs. call change the only frontend transformation, or are there more? If the latter, why would we be able to do those in the frontend but not in the middle end? I mean, the reasoning behind D18634 applies everywhere. The invoke → call transformation should be valid in the middle end too; in fact, we seem to perform it just fine: Compiler Explorer

Is the performance win coming from disabling frontend IPO/refinement for linkonce_odr functions, or from enabling those in LLVM? How much is the performance difference?

So far the invoke vs. call change is the main one we are aware of, but I don’t think it’s the only one. The middle end does not infer attributes for linkonce_odr functions as aggressively as the front end does, because of the de-refinement that the linker can perform. A bit more context about de-refinement:

In D83906#4180435, @dexonsmith wrote:

Oh, de-refining is pretty nifty / evil. This patch has background:
⚙ D18634 Don't IPO over functions that can be de-refined

Since 2016, the optimizer is not allowed to do IPA on functions that can be de-refined (such as linkonce_odr functions).

Here’s a hypothetical problematic scenario for the optimizer:

  • original IR for B has a throw somewhere
  • function A invokes function B
  • in this TU, B is optimized and removes exceptions, and gets marked nounwind
  • function A leverages the nounwind to turn the invoke into a call
  • function B is de-refined at link/load time: linker chooses a different function B which still has a throw
  • “evil magic” happens (see the discussions around the patch linked above)
  • a crash is introduced

At first blush, it sounds like this could only be a problem if the code has UB in it. However, read the above patch (and follow-ups, and related discussion) for a variety of examples of non-UB cases where IPA on de-refineable functions introduces crashes. I don’t know for sure this could be a problem for nounwind specifically, but in general the LLVM optimizer doesn’t look at attributes of de-refineable functions.

(Note that if you’re doing LTO (like AutoFDO), this usually won’t block optimization, since at LTO time there are very few de-refineable functions (most linkonce_odr functions are internalized, and not exported as weak). So if we added a -cc1 flag to prefer “stable IR” over “frontend peepholes”, it would make sense for -flto to imply it.)

On the other hand, the frontend knows the token sequence from the source language. It knows whether function B is inherently nounwind based on its ODR token sequence; in which case, it’s safe to use the attribute for an IPA peephole.

In your example, the function lo_odr_nounwind comes with a nounwind attribute, which could be annotated by the front end in the real world. If it didn’t come with that attribute, the middle end would not be able to transform the invoke into a call because of ⚙ D18634 Don't IPO over functions that can be de-refined. But I see an opportunity here to relax the restriction for linkonce_odr functions, as long as we have a way to tell the middle end that the incoming IR 100% represents the source.

The win comes from both. With one of our internal services, I saw around a 0.5% win from just enabling those optimizations in LLVM. The win bumped up to around 1% when combining that with disabling the frontend IPO. I think the extra win comes from reduced profile staleness. However, no win, or even a regression, was seen with the frontend IPO disablement alone. All of these experiments were done with ThinLTO and CSSPGO.

This is either a misconception or a bug in the frontend. Clang should not be able to infer anything for those functions, as the middle end cannot either. Clang can, however, interpret user annotations or language semantics. That said, it should just annotate them via existing attributes, see below.

If the source comes with annotations, e.g., nounwind, or pure, we should mark the IR accordingly. The middle end will happily use IR annotations also for functions that can be derefined, as shown. So the only issue I can see, if any, is that the frontend might not attach nounwind when it could.

Clang should not be able to infer anything for those functions, as the middle end cannot either.

Why do you think either the front end or the middle end should not infer attributes for linkonce_odrs?

Because of ⚙ D18634 Don't IPO over functions that can be de-refined. As stated there, you cannot derive something from the IR if the function might be replaced by one that has different properties. That said, it is not as clear-cut as we currently force it to be; however, getting it right without introducing bugs is really hard.

To summarize the argument in that review:
Let’s assume there are two functions, same name, both linkonce_odr.
One of them is optimized and we deduce properties based on the optimized code which we then use for IPO.
The other one is not optimized and the properties do not hold.
Now we pick the non-optimized at link time and our IPO was wrong.

This is no different for the frontend or the middle end, at least not in general.
All we can do for certain is take user-provided properties (including language semantics), as those have to be the same (due to the ODR part).

That’s the current status, and I’m proposing a mechanism to allow the IR inferencer to do more for linkonce_odr functions; i.e., it should only deduce properties for those functions based on their unoptimized code.

I think the reason we currently don’t do inference for linkonce_odr functions is mainly that the IR inferencer doesn’t know whether the IR it sees 100% represents the source. If it does, then based on the definition of linkonce_odr, we are guaranteed that the source semantics of all versions of the function are identical. De-refining could still happen in downstream optimizations, but attributes inferred on the original unoptimized IR should be strong enough. WDYT?

Is “stable IR” just trying to make clang generate IR for functions defined in headers that’s less sensitive to header include order and/or which headers are included? That could be reasonable, I guess, but I’m not sure why you’d want a module flag for that; it’s not a semantic change, just a heuristic choice.

Trying to prevent the LLVM IR optimizer from performing any optimization that’s a refinement doesn’t seem viable; almost any pass can do things that technically count as refinement (for example, constant folding is a refinement in many cases). Trying to keep track for every pass seems like a recipe for miscompiles.

ThinLTO could probably be taught to do better at attribute inference. But I don’t know the details of how that works at the moment.

There are some properties you can deduce by walking the AST, even if they aren’t directly specified. For example, a function which doesn’t call any functions that throw exceptions can’t throw an exception. In general, you can’t do an equivalent analysis on LLVM IR because optimizations can refine out an exception.

I’d be reluctant to trust anything working on IR… clang IR generation itself performs optimizations which could be relevant.

I think it is unreasonable to implement this at the IR level for the expected gain. Technically possible, but you cannot use LLVM-core helpers without making sure (now and into the future) they do not start to exploit UB or other properties.

If anything, perform simple checks in clang and annotate the IR appropriately, that works and is covered under what I called “user provided properties / language semantics”. That said, there is a cost associated with this that needs to be balanced carefully.

What kind of UB do you think we may hit if we only do attribute inference on unoptimized IR? Clang already does refinement today based on unoptimized source, and I’m wondering why that cannot be done in LLVM with certain restrictions.

Yes. I see IPO in Clang as a main source of unstable IR, which can cause counter-intuitive IR to be generated for a trivial source change. But I’m trying to avoid achieving stable IR at the cost of code quality, by making the middle end do more to compensate. By introducing a module flag, we can inform the middle end whether the incoming IR is unoptimized. But the flag needs to be maintained during optimizations.

Refinement starts with things that aren’t obviously “optimizations”. For example:

struct A { int x : 3, y : 3; };
struct B { int : 3, y : 3; };
union C { struct A a; struct B b; };
void f() { struct B b = {3}; union C c = {.b = b}; (void)(3 / (c.a.x + 1)); }

In the C++ abstract machine, this has undefined behavior, but the generated LLVM IR can’t crash because we initialize “b” via memcpy. That’s refinement.

If you’re specifically concerned about nounwind, you might be able to ignore certain kinds of refinement; not sure how much that helps. In any case, you’re dealing with fragile invariants that can’t be enforced by a verifier.

You might want to look more closely at the ThinLTO side of things; after we internalize linkonce_odr functions, are we inferring all the attributes we should infer?

I’m thinking about starting with the -fgenerate-stable-IR switch that only covers the nounwind case so far. We can include more unstable cases once identified. Do you have any suggestions?

ThinLTO has two main optimizers: the prelink optimizer (runs in clang) and the postlink optimizer (runs in the linker). The postlink one, as you pointed out, does not have a problem with linkonce_odr, but the prelink one does. In our experiments, enabling the prelink optimizer to infer attributes for linkonce_odr functions gives an extra win.

Off the top of my head, the only examples I can come up with where clang IR generation does refinement involve turning undefined behavior into defined behavior, not non-deterministic behavior into a specific behavior. (I think you might only need the latter for nounwind.) But that isn’t to say there’s no such example. And even if there isn’t any such behavior right now, it would be difficult to continue enforcing in the future; every future patch to clang IR generation would have to honor this obscure rule, and as a code owner I don’t think that’s something I’m willing to deal with.

A couple potential alternatives:

  • clang could generate a call graph based on the AST and use that to infer unwind behavior; that should avoid any issues related to refinement, I think. The implementation would be a bit more complex than working on IR, but hopefully not too bad.
  • We could consider internalizing certain linkonce_odr functions pre-link, then try to merge them back together during LTO. That way, we don’t have the refinement-related restrictions pre-link… but we risk a codesize increase if merging fails.

An example where the C++ program doesn’t have UB (I’m not a C++ language expert, so please double-check) is turning x = a() + b() into r0 = a(); r1 = b(); x = r0 + r1 (i.e., picking an evaluation order).

For instance in the following program:

int a(bool* flag) { if (*flag) throw 0; return 10; }
int b(bool* flag) { *flag = false; return 20; }

int fn() {
  bool flag = true;
  return a(&flag) + b(&flag);
}

I believe that per C++ semantics fn may or may not throw, but lowering the program to LLVM IR necessarily refines the behavior into either always throwing or never throwing, depending on the evaluation order it picks.


The eval order of call operands is what you are looking for, IIRC.
That said, it is unspecified behavior but not UB. (Thanks @jrtc27)

int a(bool* flag) { if (*flag) throw 0; return 10; }
int b(bool* flag) { *flag = false; return 20; }

int eval(int, int);

int fn() {
  bool flag = true;
  return eval(a(&flag), b(&flag));
}

Unspecified, not implementation-defined; no consistency is needed, just that it happens in some order each time.

It is a compiler; over the decades, we have given the compiler the stature to make those decisions (for us). Hiding things like throw(problem) deep inside functions (or libraries) is a good way to let the programmer summon nasal demons unbeknownst to themselves.

If the programmer wants a particular evaluation order, it is up to the programmer to construct the code in such a manner that gives the desired order in all cases.