Dear Clang community,
TLDR: I came across a few missed optimizations for coroutines and would be happy to contribute a patch/an improvement. I need guidance, though, as I am new to coroutines in LLVM.
You can find the input C++ program in https://godbolt.org/z/Weod78.
The coroutines in that snippet can finish both synchronously and asynchronously. In case a coroutine finishes synchronously, I want to avoid the allocation of the coroutine frame.
Looking at the produced assembly, this snippet shows the following missed optimizations:
- lines 12-14: the call to “constant12() [clone .destroy]” is not devirtualized
- line 5: the coroutine frame for constant12 is not elided (in a trivial, simple case without resumption points)
- lines 15-16: the call to “.LNoopCoro.ResumeDestroy” is not devirtualized; it should be devirtualized and inlined
- line 5: the coroutine frame for sum is not “conditionally” elided (less trivial)
I already did some digging in the coroutine-related optimization passes, and I think I identified the following root causes and possible solutions:
CoroElide disabled for “own” coroutine frame
CoroElide.cpp, line 271 (see link [1]) explicitly disables the CoroElide pass for the own coroutine frame. The CoroElide pass currently only modifies CoroIds which were inlined from other coroutines and leaves the own CoroId alone. Due to this, the call to “constant12() [clone .destroy]” is not devirtualized, and the coroutine frame cannot be elided. After removing this check and unconditionally applying the CoroElide to all CoroIds, issues (1) and (2) from my example are fixed.
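To make the intended effect of issues (1) and (2) concrete, here is a hand-written sketch of the difference between a heap-allocated frame with an indirect destroy call and an elided frame with a direct one. It is a toy model in plain C++, not compiler output: `Frame`, `destroy_and_free`, `cleanup_only` and both `call_*` functions are hypothetical names chosen for illustration, and the real transformation operates on LLVM IR.

```cpp
#include <cassert>

// Toy stand-in for a coroutine frame: two function-pointer slots plus state.
struct Frame {
    void (*resume)(Frame*);
    void (*destroy)(Frame*);
    int value;
};

static int heap_allocations = 0;  // counts frames placed on the heap

// Destroy function of a heap-allocated frame: cleans up and frees the memory.
void destroy_and_free(Frame* f) { delete f; }

// Destroy function for an elided frame: cleanup only, nothing to free.
void cleanup_only(Frame*) {}

// Without elision: heap frame, destroy reached through the frame's pointer.
int call_without_elision() {
    ++heap_allocations;
    Frame* f = new Frame{nullptr, destroy_and_free, 12};
    int result = f->value;
    f->destroy(f);  // indirect call
    return result;
}

// With elision: the frame lives on the caller's stack, destroy is called directly.
int call_with_elision() {
    Frame f{nullptr, cleanup_only, 12};
    int result = f.value;
    cleanup_only(&f);  // direct call, trivially inlinable
    return result;
}
```

Both functions return the same value; only the first one touches the heap and goes through a function pointer.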
My question: Is this check necessary because the CoroElide pass would otherwise be incorrect? Or is it a performance optimization, i.e. we didn’t expect the CoroElide to be useful when applied to the function’s own CoroId and hence disabled it in this case?
“@llvm.coro.subfn.addr” intrinsic not devirtualized if applied to constants
Issue (3), i.e. the call to “.LNoopCoro.ResumeDestroy” not being devirtualized, seems to be due to the use of the “@llvm.coro.subfn.addr” intrinsic. CoroElide devirtualizes this intrinsic when it is applied to a “coro.begin” intrinsic, but there is no devirtualization for subfn.addr calls on constants. Afaict, the lowering in CoroCleanup happens too late in the pipeline, so the remaining passes won’t remove the load.
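As a rough illustration of why a load from a constant frame should be foldable, the following plain C++ sketch models the frame header and the lowered “coro.subfn.addr”. It assumes a simplified layout with the resume pointer at index 0 and the destroy pointer at index 1; `FrameHeader`, `subfn_addr`, `noop_frame` and `destroy_noop` are hypothetical names for this example, not LLVM APIs.

```cpp
#include <cassert>

// Simplified model: a frame starts with the resume pointer (index 0)
// followed by the destroy pointer (index 1).
using SubFn = void (*)(void*);
struct FrameHeader {
    SubFn resume;
    SubFn destroy;
};

static int noop_calls = 0;
void noop_resume_destroy(void*) { ++noop_calls; }

// Stand-in for the constant noop frame: both slots name the same function.
constexpr FrameHeader noop_frame{noop_resume_destroy, noop_resume_destroy};

// Stand-in for coro.subfn.addr once it has been lowered to a load.
constexpr SubFn subfn_addr(const FrameHeader& frame, int index) {
    return index == 0 ? frame.resume : frame.destroy;
}

// Because the frame is a constant, the load can be evaluated at compile time ...
static_assert(subfn_addr(noop_frame, 1) == noop_resume_destroy,
              "load from a constant frame folds to a known function");

// ... so this indirect call could be rewritten as a direct (inlinable) call.
void destroy_noop() {
    subfn_addr(noop_frame, 1)(nullptr);
}
```

The `static_assert` plays the role of the constant folder here: once the callee is known at compile time, the call through the pointer is equivalent to calling `noop_resume_destroy` directly.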
I see multiple ways to fix this issue:
- In the CoroEarly pass, lower “coro.destroy” and “coro.resume” directly to the corresponding load, instead of lowering to “coro.subfn.addr”. The normal “memory constant folding pass” (mem2reg? not sure which pass does this…) would see the loads and could constant fold them, thereby devirtualizing the call. Downside: CoroElide would now need to do more complicated pattern matching to identify accesses to the resume/destroy function pointers.
- In the CoroEarly pass, lower “coro.destroy/resume” to memory operations, except if they are applied on a “coro.begin”. If they are applied on a “coro.begin”, keep using “coro.subfn.addr”. Benefit: We can still use mem2reg (?) to devirtualize calls on constant coroutine frames. At the same time, CoroElide can keep using “coro.subfn.addr” to devirtualize non-constant coroutine frames.
- In the CoroCleanup pass, special case the lowering of “coro.subfn.addr” when applied to constants. In that case, don’t generate loads, but rather produce the corresponding constant. Downside: probably too late in the pipeline such that the de-virtualized function would not be inlined.
My question: Which of those potential ways would be preferred?
CoroElide does not support deferring the coroutine frame allocation
Issue (4), i.e. that the coroutine frame for the function sum is allocated unconditionally, seems to be the most challenging to fix.
I am still kind of lost how to even approach this…
My questions: Does anyone have an idea how to approach this? Is there maybe even some relevant literature/research on this topic already?
Cheers,
Adrian
[1] llvm-project/CoroElide.cpp at 607bec0bb9f787acca95f53dabe6a5c227f6b6b2 · llvm/llvm-project · GitHub