The biggest problem for coroutines right now is the thread identification problem shown in these two bugs:
Issue #47179 (llvm/llvm-project): [coroutines] Compiler incorrectly caches thread_local address across suspend-points, and Issue #47177 (llvm/llvm-project): [coroutines] Clang incorrectly caches result of pthread_self() across coroutine suspend-points.
There was also a relatively long discussion about the bug earlier: [RFC] Coroutine and pthread_self.
Here is a short summary for those who are interested but don't want to follow that long thread.
Simply put, there are actually two bugs:
(1) Misoptimization of TLS variables in coroutines.
(2) Misoptimization of pthread_self().
The two bugs have the same root cause: the introduction of coroutines breaks the assumption that a function can only run on one thread. So for the following code:
thread_local_variable = a;
co_await something;
use of thread_local_variable
the compiler would transform it into:
thread_local_variable = a;
co_await something;
use of a
which would be incorrect if the coroutine resumes on another thread. The situation is similar for pthread_self(). The reason pthread_self() gets (mis)optimized is that pthread_self() is marked as __attribute__((const)), so the compiler assumes every call to pthread_self() returns the same result.
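To make the two bugs concrete, here is a minimal, self-contained sketch (the names task and resume_on_new_thread are made up for this example; the real reports are issues #47179 and #47177 above). The coroutine touches a thread_local variable and calls pthread_self() on both sides of a suspend point; because the optimizer assumes that neither the TLS address nor the result of an __attribute__((const)) function can change within one function, it may reuse the pre-suspend values even though the coroutine resumes on a different thread:

#include <chrono>
#include <coroutine>
#include <cstdio>
#include <pthread.h>
#include <thread>

thread_local int tls_counter = 0;

struct task {
  struct promise_type {
    task get_return_object() { return {}; }
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

// Awaiter that resumes the coroutine on a freshly created thread.
struct resume_on_new_thread {
  bool await_ready() { return false; }
  void await_suspend(std::coroutine_handle<> h) {
    std::thread([h] { h.resume(); }).detach();
  }
  void await_resume() {}
};

task check_thread_identity() {
  tls_counter = 1;                    // writes the starting thread's TLS slot
  pthread_t before = pthread_self();
  co_await resume_on_new_thread{};
  // Both of these must be re-evaluated after the suspend point: we are now on
  // a different thread with a different TLS slot and a different pthread id.
  pthread_t after = pthread_self();
  std::printf("tls_counter = %d, same thread? %d\n",
              tls_counter, pthread_equal(before, after) != 0);
  // A correct build prints "tls_counter = 0, same thread? 0"; a miscompiled
  // build can print "tls_counter = 1, same thread? 1".
}

int main() {
  check_thread_identity();
  std::this_thread::sleep_for(std::chrono::seconds(1)); // let the resume finish
}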
Personally, I THINK the key problem for pthread_self() is that pthread_self() shouldn't be marked as const: its result depends on the thread it runs on rather than on its parameters. However, that conclusion is not really helpful. Even if we could remove the const attribute from pthread_self in the pthread library, it is hard to require our users to update to the newest version. And we are not sure whether there are other functions that are marked const but depend on the thread context.
So here are the principles I concluded:
(1) Bugs should be fixed.
(2) The problem is better handled in the coroutine modules. It is a burden for developers of other modules to have to consider coroutines when they write and review code. From my point of view, the design of LLVM coroutines tries to keep the coroutine modules self-contained; in other words, problems introduced by coroutines are better solved in coroutine code.
(3) Performance regressions should be avoided. If we didn't care about performance, we could solve the problem easily by moving the CoroSplit pass to a very early point where no optimization has happened yet. But that is not what we want, and it violates the design goals of LLVM coroutines.
So my design is to add optimization blockers in the CoroEarly pass for TLS variables and readnone functions, and to remove the blockers in the CoroCleanup pass. To avoid performance regressions, we need to hoist the position of the CoroCleanup pass and add an EarlyCSE pass after the CoroCleanup pass to regain the optimizations we want.
Here are the patches: D124360, D124361, D124362, D124363, D124364. We could discuss things at a higher level in this thread and discuss details on the corresponding review pages.
CC people who might be interested: @jyknight @nhaehnle @rjmccall
Thanks,
Chuanqi
Update with a summary of the proposed solution:
Although the two bugs (TLS variables and pthread_self()) share one root cause, we can only solve them with two separate solutions.
For TLS variables, the proposed solution is:
(1) In the CoroEarly pass, which runs at the beginning of the pipeline, we convert every use of every TLS variable (let's call it V) in the module to the newly added intrinsic llvm.coro.may_change(V).
(2) In the CoroCleanup pass, we replace llvm.coro.may_change(V) with V.
In this way, we guarantee that accesses to TLS variables won't be merged before the CoroSplit pass; after CoroCleanup, the compiler is free to optimize them.
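To illustrate step (1), here is a rough sketch of the kind of rewriting CoroEarly could do. This is not the actual code in D124360, and Intrinsic::coro_may_change is the intrinsic ID the patches would introduce (it doesn't exist upstream yet); uses through constant expressions are ignored for brevity:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Wrap every instruction-level use of every thread_local global in a call to
// the proposed llvm.coro.may_change intrinsic, so later passes cannot merge
// the accesses across a suspend point. CoroCleanup later replaces each call
// with the global itself.
static void protectTLSUses(Module &M) {
  for (GlobalVariable &GV : M.globals()) {
    if (!GV.isThreadLocal())
      continue;
    // Collect the instruction users first, since we are about to rewrite them.
    SmallVector<Instruction *, 8> Users;
    for (User *U : GV.users())
      if (auto *I = dyn_cast<Instruction>(U))
        Users.push_back(I);
    for (Instruction *I : Users) {
      IRBuilder<> Builder(I);
      Function *MayChange = Intrinsic::getDeclaration(
          &M, Intrinsic::coro_may_change, {GV.getType()});
      // %wrapped = call ptr @llvm.coro.may_change(ptr @tls_var)
      CallInst *Wrapped = Builder.CreateCall(MayChange, {&GV});
      // Route this particular use through the intrinsic call.
      I->replaceUsesOfWith(&GV, Wrapped);
    }
  }
}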
The question might be: why do we need to handle all the uses of every TLS variable? Couldn't we process the uses in coroutines only? The answer is no, because of potential inlining. Consider the following example:
thread_local int a = 0;
void g() {
  a = 43;
}
void h() {
  a = 44;
}
CoroTy Coro() {
  g();
  co_await something();
  h();
  uses of a ...
}
If we processed the uses in coroutines only in the CoroEarly pass, the example above would be transformed incorrectly after inlining.
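Roughly, after inlining the coroutine would look like the sketch below; since the uses inside g() and h() were not wrapped by CoroEarly, the inlined TLS accesses now cross the suspend point and can be cached incorrectly:

CoroTy Coro() {
  a = 43;                // inlined body of g(), not wrapped by CoroEarly
  co_await something();
  a = 44;                // inlined body of h(), not wrapped either
  uses of a ...          // may wrongly reuse the pre-suspend TLS address of a
}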
Maybe we could filter out some cases with a call graph reachability analysis, but I think that is unnecessarily expensive. And with the current solution, TLS variables in normal functions and in lowered coroutines still get optimized correctly in the end.
For the readnone attribute (the pthread_self() problem), the proposed solution is:
(1) In the CoroEarly pass, replace every readnone function attribute with a placeholder attribute coro_readnone.
(2) In the CoroCleanup pass, replace every coro_readnone function attribute with readnone again.
(3) Run an EarlyCSE pass after the CoroCleanup pass to optimize the not-yet-optimized readnone functions.
Note that we also need to process every function and its call sites, again because of inlining. Step (3) is necessary to optimize readnone functions.
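As a rough illustration of the CoroEarly half of steps (1) and (2) (again, not the actual patch code; the coro_readnone string attribute is the hypothetical placeholder from the proposal):

#include "llvm/IR/Attributes.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Demote readnone to a placeholder so nothing merges such calls across
// suspend points; CoroCleanup performs the reverse rewrite.
static void demoteReadNone(Module &M) {
  for (Function &F : M) {
    // Declarations such as pthread_self() carry readnone on the function.
    if (F.hasFnAttribute(Attribute::ReadNone)) {
      F.removeFnAttr(Attribute::ReadNone);
      F.addFnAttr("coro_readnone");
    }
    // Call sites can carry their own readnone attribute as well.
    for (Instruction &I : instructions(F)) {
      if (auto *CB = dyn_cast<CallBase>(&I)) {
        if (CB->hasFnAttr(Attribute::ReadNone)) {
          CB->removeFnAttr(Attribute::ReadNone);
          CB->addFnAttr(Attribute::get(F.getContext(), "coro_readnone"));
        }
      }
    }
  }
}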
Here is an example for the proposal:
auto a = pthread_self();
uses of a...
co_await something();
auto b = pthread_self();
uses of b...
auto c = pthread_self();
uses of c...
Previously, it would be converted to:
auto a = pthread_self();
uses of a...
co_await something();
uses of a...
which is incorrect. And after the proposal, the transformed code would be:
auto a = pthread_self();
uses of a...
co_await something();
auto b = pthread_self();
uses of b...
So readnone calls that cross suspend points won't be merged, while readnone calls that don't cross suspend points will still be optimized correctly, as expected.
The downside of the proposal is that we need to insert a new optimization pass into the pipeline. But if we refused to do so, the proposal would pessimize readnone functions in general. I think it is beneficial to run one extra pass to keep those optimizations enabled.
Note: I claimed no performance regression in the first post, but I found that this can't be proven due to the complexity of phase-ordering problems. That is, if there is an optimization pass that depends on the optimization of readnone functions, and that pass doesn't run again after the CoroCleanup pass, the current proposal would break it.
Here is one hacky example:
auto a = pthread_self();
auto b = pthread_self();
a_big_function(a == b);
Previously, a_big_function might be inlined because its argument is a constant (a constant argument increases the chance that a call gets inlined). But now it might not be inlined, because its argument is no longer a constant.
So the solution can't be proven to cause no performance regression in all cases. But I would still say it won't cause performance regressions in most cases : )
-------------- End of the update of the proposed solution -------------------------------