C++ coroutines allocate their activation frames on the heap by default, and C++ developers tend to consider heap allocations expensive. CoroElide exists in the LLVM optimization pipeline to mitigate this problem. However, there are quite a few reports on GitHub that CoroElide is not as effective as one might expect. One such report is #94215. The issue contains a small snippet of code that demonstrates CoroElide's ineffectiveness in dealing with even the most ordinary-looking task types.
We looked into why CoroElide in its current form is ineffective for C++ coroutines. For CoroElide to happen, the ramp function of the callee must be inlined into the caller. This inlining happens after the callee has been split, while the caller is usually still a presplit coroutine. If the callee is indeed a coroutine, its inlined coro.id intrinsic is visible within the caller. CoroElide then runs an analysis to determine whether the SSA value of coro.begin() of the callee foo gets destroyed before the caller bar terminates. The real trouble is that Task types are rarely simple enough for the destroy logic to reference the SSA value from coro.begin() directly. Given the escaping nature of Task types, it is almost impossible to prove safety even for the most trivial C++ Task types. Improving CoroElide's static analysis turned out to be extremely difficult for this exact reason.
In order to get the best performance out of coroutines, and to be certain that this allocation cost is minimized, we want a better solution to the HALO (heap allocation elision optimization) problem for C++ coroutines. We tried an attribute-based approach in this pull request (#94693). The patch proposes a C++ struct/class attribute, [[clang::coro_inplace_task]] (feel free to suggest a better name), which tells the compiler that the attributed Task type won't expose APIs or pointers that allow callee coroutines to continue running after the caller is destroyed.
The approach we want to take with this language extension originates from two observations: library implementations of Task types control the structured concurrency guarantees we need for elision to happen, and coroutines are, more often than not, co_awaited right away as a prvalue. As a result, with sufficient constraints placed on Task type implementations, the lifetime of the callee's frame is shorter than or equal to that of the caller.
We implemented this idea in this PR and applied the attribute to folly::coro::Task, and have seen wall-time wins of 10-20% (depending on frame size). See full benchmark results here. The code used for benchmarking can be found here.
Currently we intend to split the PR into Front End and Middle End patches. The ME part is currently more or less a hack, and we plan to rewrite it with a more appropriate approach.