[RFC] A new AAPass for coroutines or a simple workaround?

The motivation issue is here: [bug] clang miscompiles coroutine awaiter, moving write across a critical section · Issue #56301 · llvm/llvm-project · GitHub. But it is super long and complex to read. So I’ll try to give it a simple introduction on the level of middle end.


Background for coroutines

(This section is about some related background about C++20 coroutines and how LLVM transform them. People who know it can skip it.)

The C++20 Coroutines are stackless coroutines. It allows the programmer to split a function into several pieces. But semantically, these pieces still belongs to the same function so that they need to re-use the local variables. So C++20 coroutines need the compiler to construct a struct to maintain such variables.
The struct is called coroutine frame in Clang/LLVM.

In Clang/LLVM, we decide to do the job in the middle end so that we can control the variables living in coroutine frames in a finer-grained form. Since the variables in the frontend may be splitted/optimized in the middle end and not every local variables need to live in different pieces. Then we can get better performance.

coroutine_type coro_func() {
     big_structure a = produce();
     
     if (some_condition()) {
        big_structure b = produce();
        consume(b);
     }

     co_await something();

     consume(a);
}

In this example, while it is valid to put both a and b to the coroutine frame, it is better to only put a to the frame since b is only used in one piece. So the middle end need to analysis the lifetime of values.

The problems

The coroutine frame is allocated by malloc and the address will be a local variable at the start of transforming. e.g.,

coroutine_type coro_func() {
     void *frame = malloc(size);

     big_structure a = produce();
     
     if (some_condition()) {
        big_structure b = produce();
        consume(b);
     }

     co_await something();

     consume(a);
}

Then the problem is that the compiler can’t know the relationship between local variables and the frame pointer. For the reported issue, the pattern will be:

void *frame = malloc(size);
...
bool b = init_value();
func(frame);

use of b ...

Then the AA feel frame pointer won’t alias the address of b and the optimizer sinks the definition of b.

void *frame = malloc(size);
...
func(frame);

bool b = init_value();
use of b ...

Then problem happens.

Possible solutions

There may be two solutions in my mind.

  1. Mark frame pointer as may_alias to all the local variables in the same coroutine before split.
  2. Extract the analysis part in current CoroSplit transformation pass to a separate analysis pass. Then create a new AA pass based on that information form that analysis pass.

The first solution is simple and quick to implement. But it may hurt the runtime performance. I know this is always almost what we did before when we met the issue with other optimization pass don’t understand coroutines. And many people had expressed they don’t feel it good.

The second solution looks good. And it has the potential to fix the previous workarounds. The downside is that it may require a lot of work. (This may be the reason that I delayed many times.) Also as far as I know, we don’t add a lot of analysis pass recently.

I am pretty hesitating about choosing a path to fix this. So I post here to ask your feelings on the issue.

Thanks,
Chuanqi


Update: Another concern I had for the first solution is whether or not the AA in LLVM is transitive. That said, if both the pointer a and b are may_alias with the frame pointer, does that imply the pointer a would be considered to be may_alias with the pointer b? From my reading to the code, the answer is no. But I want to be sure about it. CC @nikic

Make it correct then make it fast.

If 1 is fast to implement, I think we should do it as a stop-gap measure to avoid producing invalid code, and then it gives you more time to explore 2.
But that really depends how long that first option would take to implement, and how bad the performance regression would be (but again, correctness trumps performance)

3 Likes

See my mail to EWG for the fundamental issue of the miscompilations.

If that issue got fixed, then the issue represented here may only cause sanitizers emit false-positive diagnostic messages.

Why? Since C++ language doesn’t provide methods to access the coroutine handle of the current coroutine except via await_suspend(). Also the C++ language only allows 3 operations with the coroutine handle:
(1) Resume a suspended coroutine.
(2) Destroy a suspended coroutine.
(3) Access the promise type.

There is no other way to access the local variables of the current coroutine. So as long as we don’t allow to resume/destroy the current coroutine in the await_suspend, it wouldn’t matter that the compiler may think the coroutine frame doesn’t alias the local variables.

And this is the reason why the issue occurs rarely, since it is indeed unusual that we resume/destroy the current coroutine in the await_suspend of the current coroutine.

Any reason it’s bool b = init_value() and not frame[b_slot] = init_value() or something? Feels like the issue is that IR doesn’t accurately represent what’s actually happening.

The reason is that we’ll split the coroutines later and try to run as much as possible optimizations in ahead. And the frame are actually built during the corosplit pass. And in the front, when the optimization passes runs on the coroutine, there is only a pointer called coroutine frame pointer. But these optimizations don’t know which which variables will be on that frame.

I wonder if it makes sense if you end up undoing/preventing such optimization anyway via various hacks. I.e. if there are any legal optimizations that you could do on b but can’t do on frame[b_slot].

I.e. if there are any legal optimizations that you could do on b but can’t do on frame[b_slot] .

The optimizations immediately comes to my mind are MemToReg and SROA (for simple aggregates other than b). MemToReg is super important and it can trigger other optimizations.

I wonder if it makes sense if you end up undoing/preventing such optimization anyway via various hacks.

IIUC, this is suggesting to doing less optimizations. But we have gained a lot from the current design to enable more optimizaitons before CoroSplit. As a comparison, GCC implements the semantics of coroutines in the front end. The result is that GCC generally generates worse code with coroutines than clang. An aggressive example is: Compiler Explorer. There can be many other examples. So I think it may be unacceptable regression with coroutines to disable a lot optimizations.

Disclaimer: I am not a LLVM developer, so I might ask for some things here that aren’t possible to do, or incorrect from the standpoint of compiler internals. But I hope that I can still contribute to this discussion productively.

Since this issue should only ever occur in await_suspend() (that’s the only place we can get the address of the frame pointer), could we restrict the scope of the fixes to only inside of await_suspend(), while preserving the ability of await_suspend() to be inlined? That way, it would not negatively affect the performance of the rest of the coroutine, just await_suspend() itself. (note that the user could save the address of the coroutine and use it outside of await_suspend() but since the coroutine wouldn’t be suspended there, this would be UB anyway)

Also, I want to point out one thing, if you go with the approach 1. (Mark frame pointer as may_alias to all the local variables in the same coroutine before split.), this should not be applied to variables that are local only to await_suspend(). Consider this modified example of the original reproducer:

std::coroutine_handle<> await_suspend(const std::coroutine_handle<> h) {
    int idx = SomeAtomic.fetch_add(1); // declared in other translation unit
    const auto to_resume = awaitable.Register(h);
    SomeOtherAtomic.store(idx); // declared in other translation unit

    return to_resume;
  }

I think that if the value idx is considered to be a local coroutine variable, that is then marked as may_alias with the frame pointer here, then it will be spilled and reloaded from the coroutine frame, which has possibly been destroyed after the call to awaitable.Register. So this variable which is local only to await_suspend must not be stored in the coroutine frame, and shouldn’t be considered to alias with the frame pointer. Maybe we already do this optimization under certain conditions, but it becomes critical with this proposed aliasing change.

@tzcnt thanks for your comments. But I’ve already had an idea for how to fix the issue based on your suggestion in https://github.com/llvm/llvm-project/issues/56301.

Edit: Or, rather than treating it as a fence, perhaps instead, when the coroutine promise address escapes from await_suspend, we should act as if the this pointer (of the awaiter) has also escaped. This may be a simpler/more correct way to implement a fix.

The wording inspires me that it is sufficient to make the this pointer of the awaiter escapes for every await_suspend. Then we can have much less impact on performance. The only issue may be should we make it conditional or unconditional. But that is an implementation detail.

Given the middle end doesn’t have C++ semantics, we have a simple trick to escape certain pointers in the frontend: insert a call to a foreign functions. For example, we can generate the following code for co_await awaiter; :

if (!awaiter.ready()) {
     call @llvm.coro.awaiter_addr(&awaiter); // or make it conditional if possible
     awaiter.await_suspend(...);
}
...

Then it won’t affect the inlining of await_suspend nor affect the optimization for other local variables other than the awaiters. This is the most optimal solutions I got now.

I think that if the value idx is considered to be a local coroutine variable, that is then marked as may_alias with the frame pointer here, then it will be spilled and reloaded from the coroutine frame,

While the original solution is deprecated, I just want to note that the CoroSplit pass doesn’t use the AA information to judge which values need to be put on the frame. So it doesn’t matter here.