Language Extension for better, more deterministic HALO for C++ Coroutines

ych · July 9, 2024, 5:21pm

C++ Coroutines by default allocates its activation frame on the heap. C++ developers tend to consider heap allocations to be expensive. CoroElide existed to mitigate this problem from LLVM optimization pipeline. However, there are quite a few reports on github that CoroElide is not as effective as one might expect. One such report is (#94215). The issue contained a small snippet of code that demonstrated CoroElide’s ineffectiveness in dealing with even the most ordinary looking task types.

We looked into the reason why CoroElide in its current form is ineffective for C++ coroutines. For CoroElide to happen, the ramp function of callee must be inlined into the caller. This inlining happens after callee has been split but caller is usually still a presplit coroutine. If callee is indeed a coroutine, the inlined coro.id intrinsics of the callee is visible within the caller. CoroElide then runs an analysis to figure out whether the SSA value of coro.begin() of foo gets destroyed before bar terminates. The real trouble here is that Task types are rarely simple enough for the destroy logic to reference the SSA value from coro.begin() directly. Given the escaping nature of the Task types, it’s almost impossible to prove safety even for the most trivial C++ Task types. Improving CoroElide static analysis turned out to be extremely difficult due to this exact reason.

In order to get the best performance out of coroutines, and the certainty that such allocation cost is minimized, we want a better solution to address the HALO problem for C++ coroutines. We tried an attribute approach for addressing this problem in this pull request (#94693). This patch proposes C++ struct/class attribute [[clang::coro_inplace_task]] (feel free to suggest a better name.) that describes to the compiler that the attributed Task type won’t expose APIs or pointers that allows callee coroutines to continue running after caller is destroyed.

The approach we want to take with this language extension generally originates from the philosophy that library implementations of Task types have the control over the structured concurrency guarantees we demand for elision to happen AND coroutines, more often than not, is co_awaited right away as a prvalue. As a result, with sufficient constraints put on Task type implementations. Lifetime for the callee’s frame is shorter than or equal to that of the caller.

We implemented this idea in this PR and applied the attribute to folly::coro::Task and have seen that this patch delivered wall time wins by 10-20% (differs by frame size). See full benchmark results here. The code used for benchmarking can be found here.

Currently we intend to split the PR into Front End and Middle End patches. The ME part currently is more or less a hack and we have plans to rewrite it with an appropriate approach.

avogelsgesang · July 9, 2024, 10:04pm

Thanks for this proposal and for the Pull Request!

If I understand correctly, for code like

Task<size_t> get_file_size(std::path p);

Task<size_t> getMaxFileSizes() {
   auto [size1, size2] = co_await when_all(get_file_size("file1"), get_file_size("file 2"));
   co_return max(size1, size2);
}

the coroutine frame of when_all would be elided, but not the coroutine frames of the two getFileSize calls. Did I understand this correctly? Do you have any thoughts on how this mechanism could be extended to also work in those cases?

Endill · July 9, 2024, 10:55pm

CC @AaronBallman

ych · July 10, 2024, 5:03am

Looking at this function locally, we cannot know what when_all is doing to the two tasks it got as arguments. It may save them to some storage to await later. The lifetime of the task objects of get_file_size("file1") or get_file_size("file2") is not determined at this level. Hence, we cannot make the optimization in your case. That’s also why in general we cannot depend on CoroElide to always work.

avogelsgesang · July 10, 2024, 6:19pm

Could this be solved by an second attribute? Something like

auto when_all([[coro_awaited]] AsyncTask<int> task1, [[coro_awaited]] AsyncTask<int> task2)

where the parameters to when_all are annotated with some annotation (the name is up for discussion) which let’s the compiler know that the tasks which are passed to it will be co_awaited?

I guess that would solve the use case

   auto [size1, size2] = co_await when_all(get_file_size("file1"), get_file_size("file 2"));

Not sure if this would also already be sufficient for cases like

   auto task1 = get_file_size("file1");
   auto task2 = get_file_size("file2");
   auto [size1, size2] = co_await when_all(move(task1), move(task2));

and if clang would be able to perform the necessary data flow analysis in this case? In the worst case, I guess that the call to get_file_size or the variables task1, task2 could be annotated with the [[coro_awaited]] attribute? Would that be possible?

Sorry if I am side-tracking the discussion. Your original proposal sounds good to me, I just want to make sure that it can be extended to other relevant use cases in the future

ych · July 10, 2024, 7:16pm

What you want here is fanouts, which this proposal does not plan to cover.

The move there would actually be more problematic because now you don’t know if the lifetime of get_file_size("file1") and get_file_size("file 2") ends at the full expression.

Even if we do, the new attribute on the function arguments are quite tricky to get right. It requires when_all(...) to be co_awaited as a prvalue, AND it does the right thing to ensure destruction of task1 and task2 before resuming. (i.e. lifetime of its coroutine object being longer than the lifetime of its arguments, and that of the caller being longer than the coroutine object returned by when_all(...)).

Even though library writer may get it right, you also need to think about the marginal benefit it provides. Fanouts are relatively rare. When you concurrently have multiple tasks, the activation frame all living on the same caller frame is probably not a lot cheaper than heap allocation. (you don’t want your caller frame be megabytes when you have not just two, but 1k fanouts, say by awaiting all from a collection.)

ChuanqiXu · July 11, 2024, 2:55am

Thanks for the written up. @ych and I have talked privately on this topic. I am in favor of this in the high level.

And I actually did the pretty same work in the downstream. The reason why I didn’t upstream it is I don’t feel good with my FE codes. There are too many pattern matches which I don’t like. And I feel the ME part is pretty straight forward and stable. This is exactly the opposite from the poster : )
(In a privarte chat, @ych mentioned an idea to implement a more general coro elide frame in the middle end which don’t depends on inlining. But I assume that one is not inclued here).

Some comments:

This patch proposes C++ struct/class attribute [[clang::coro_inplace_task]] (feel free to suggest a better name.)

I prefer the name as coro_elide_after_awaited. I feel it is much more straight forward. And this may not be limited to so-called Task type.

that describes to the compiler that the attributed Task type won’t expose APIs or pointers that allows callee coroutines to continue running after caller is destroyed.

This description looks not natural and incorrect to me.

I’d like to describe it as:

We call a coroutine which returns the attributed type as attributed coroutine.
In an attributed coroutine A, if an just constructed attributed coroutine B gets **immediately ** co_awaited, the lifetime of coroutine B should be guaranteed to be enclosing in the lifetime
of coroutine A. Otherwise, the behavior is undefined.
Then the compiler is allowed to allocate the frame of B as a local variable in A.

Here the concept of immediately co_awaited may not be very clear and may need improved.

So that it will be clear:

Task foo() { ... }
Task bar() {
    co_return co_await foo();
}

The body of foo() will be elided. But

Task foo() { ... }
Task bar() {
    Task f = foo();
    co_return co_await f;
}

won’t be elided. And

Task foo() { ... }
Task forward(Task &&f) { return f; }
Task bar() {
    co_return co_await forward(foo());
}

won’t be elided.

@avogelsgesang We may need new attribute to do that. Your proposed [[coro_awaited]] attribute in parameters looks good. (Although it might be slightly hard to implement). Another idea can handle this and easier to implement, is something like [[coro_always_eliding]], which will elide all the calls:

[[clang::coro_always_eliding]] auto [size1, size2] = auto [size1, size2] = co_await when_all(get_file_size("file1"), get_file_size("file 2"));

While it is more powerful, it would be more dangerous.

And after all, we can do such things as a new attribute, it is not conflicting with the being proposed one.

ych · July 15, 2024, 10:22pm

The proposal is implemented in these new PRs:

C++ FE: [Clang] Introduce Frontend Attribute [[clang::coro_inplace_task]] by yuxuanchen1997 · Pull Request #98971 · llvm/llvm-project · GitHub
ME: [LLVM] Perform HALO on "coro_must_elide" coroutines by yuxuanchen1997 · Pull Request #98974 · llvm/llvm-project · GitHub

ych · September 12, 2024, 8:48pm

I am starting to think about this as a next step. How is this going to work for parameter packs? I guess similarly?

avogelsgesang · September 13, 2024, 11:07am

Nice! I already stumbled across your draft PR [Clang] Propagate elide safe context through [[clang::coro_must_await]] by yuxuanchen1997 · Pull Request #108474 · llvm/llvm-project · GitHub Thank you for putting in all this work!

How is this going to work for parameter packs? I guess similarly?

As a C++ programmer, I would expect that I can declare a function as

template <typename... T>
AsyncTask<std::tuple<T...>> when_all([[coro_awaited]] AsyncTask<T> tasks...)

i.e. that I can simply apply a the same attribute also to parameter packs. Just as I can also apply the [[maybe_unused]] attribute not only to normal parameters but also to parameter packs.

Besides functions like when_all/collectAll/collectAny which themselves are co_awaited, it would be great if we could also enable HALO for functions like sync_wait (from cppcoro) or folly::coro::blockingWait.

From reading [Clang] Propagate elide safe context through [[clang::coro_must_await]] by yuxuanchen1997 · Pull Request #108474 · llvm/llvm-project · GitHub, it seems the current [[clang::coro_must_await]] attribute only takes effect if the function itself is immediately co-awaited.

I guess, we could enable [[clang::coro_must_await]] also for non-co-awaited calls, if the function returns a non-awaitable type?

(That being said, I guess that sync_wait / blockingWait are rarely a performance bottleneck, so maybe this would be overengineering…)

Besides the templated when_all there also usually is a vector-based when_all, with a signature similar to

template <typename. T>
AsyncTask<std::vector<T>> when_all(vector<T> tasks)

Given that those calls take a variable number of couroutines, I see no way how we could enable HALO also for this pattern. Or am I missing something?

ych · September 13, 2024, 5:54pm

This is out of scope for this language extension for sure. You don’t have a prvalue for the Task so the lifetime is unknown/harder to define. I don’t see a clear way to make them elidable as well.

Topic		Replies	Views
Guidance requested: Fixing missed optimization for coroutines LLVM Dev List Archives	1	115	January 23, 2021
[RFC] LLVM Coroutines LLVM Dev List Archives	9	184	June 11, 2016
[RFC] Lifetime bound check for parameters of coroutines Clang Frontend clang	24	1317	January 22, 2024
RFC: Coroutine Optimization Passes LLVM Dev List Archives	8	138	July 18, 2016
RFC: LLVM Coroutine Representation, Round 2 LLVM Dev List Archives	5	89	July 21, 2016

Language Extension for better, more deterministic HALO for C++ Coroutines

Related topics