Address thread identification problems with coroutine

ChuanqiXu · April 25, 2022, 8:14am

The biggest problem for coroutine now should be the thread identification problem showed in the 2 bugs:
[coroutines] Compiler incorrectly caches thread_local address across suspend-points · Issue #47179 · llvm/llvm-project · GitHub and [coroutines] Clang incorrectly caches result of pthread_self() across coroutine suspend-points · Issue #47177 · llvm/llvm-project · GitHub.
There is also a relatively long discussion about the bug before: [RFC] Coroutine and pthread_self.

Here is a simple summary for who are interested and not want to follow that long thread.
Simply, there are two bugs actually:
(1) Misoptimization for TLS variable in coroutine.
(2) Misoptimization for pthread_self().

The 2 bugs have the same reason: the introduction of coroutine break the assumption that a function is only possible to be run in one thread. So for the following code:

thread_local_variable = a;
co_await something;
use of thread_local_variable

the compiler would transform it into:

thread_local_variable = a;
co_await something;
use of a

which would be incorrect if the coroutine resume in another thread. This is similar for pthread_self(). And the reason why pthread_self() would be (mis)optimized is that pthread_self() is marked as __attibute__((const)) so that the compiler think the every call to pthread_self() would return the same result.

Personally, I THINK the key problem for pthread_self() is that pthread_self() shouldn’t be marked as const. It depends on the thread it lives instead of its parameters. However, the conclusion is not really helpful. Even if we could remove the const attribute for pthread_self in pthread library, it is hard to require our users to update to the newest version. And we are not sure if there are other functions which marked as const but depends on the thread context.

Then there are some principles I concluded:

(1) Bugs should be fixed.
(2) The problem is better to be handled in coroutine modules. It is a burden for developers of other modules to consider coroutine when they write and review codes. From my point of view, the design of LLVM coroutine tries to make coroutine modules self-contained. In other words, it is better to solve problems introduced by coroutine in coroutine codes.
(3) Performance regression should be avoided. If we don’t care about the performance, we could solve the problem easily by forwarding CoroSplit pass in the very early place where no optimization happens. But this is not what we want and it violates the design goal of LLVM coroutine.

So my design would be to add optimization blockers in CoroEarly pass for TLS variables and readnone functions and remove the blockers in CoroCleanup passes. To avoid performance regressions, we need hoist the position of CoroCleanup pass and add a EarlyCSE pass after CoroCleanup pass to make the optimization we want.

Here is the patches: D124360, D124361, D124362, D124363, D124364. We could discuss things in higher level in this thread and discuss details in corresponding review page.

CC people who might be interested: @jyknight @nhaehnle @rjmccall

Thanks,
Chuanqi

Updates for the summary of the proposed solution:

Although the two bugs (TLS variable and pthread_self()) share one internal reason, we could only solve them in 2 separate solutions.

For TLS variable,

The proposed solutions is:
(1) In CoroEarly pass, which starts in the beginning of the pipelines, we would convert all the uses for every TLS variable (Let’s mark it as V) in the module with the new added intrinsic llvm.coro.may_change(V).
(2) In CoroCleanup pass, we would replace llvm.coro.may_change(V) with V.

In this way, we could guarantee that TLS variables wouldn’t be merged before CoroSplit pass. And after CoroCleanup, the compiler is free to optimize them.

The question might be why do we need to handle all the uses of every TLS variables? Couldn’t we process the uses in coroutines only? The answer is no since the potential inlining. Consider the following example:

thread_local int a = 0;
void g() {
    a = 43;
}
void h() {
    a = 44;
}
CoroTy Coro() {
    g();
    co_await something();
    h();
    uses of a ...
}

If we process the uses in coroutine only in CoroEarly pass, the example above would be transformed incorrectly after inlining.

Maybe we could filter some cases by call graph reachability analysis. But I think it is unnecessary expensive. And the current solution would optimize TLS variables correctly in normal functions and lowered coroutines finally.

For readnone attribute (pthread_self() problem)

The proposed solutions is:
(1) In CoroEarly pass, replace every function attribute readnone with a placeholder attribute coro_readnone.
(2) In CoroCleanup pass, replace every function attribute coro_readnone with function attribute readnone.
(3) Run EarlyCSE pass after CoroCleanup passes to optimize unoptimized readnone functions.

Note that we also need to process every function and its calls due to inlining.

Step (3) is necessary to optimize readnone functions.

Here is the example for the proposal:

auto a = pthread_self();
uses of a...
co_await something();
auto b = pthread_self();
uses of b...
auto c = pthread_self();
uses of c...

Previously, it would be converted to:

auto a = pthread_self();
uses of a...
co_await something();
uses of a...

which is incorrect. And after the proposal, the transformed code would be:

auto a = pthread_self();
uses of a...
co_await something();
auto b = pthread_self();
uses of b...

So the readnone functions across suspend points wouldn’t be optimized but the readnone functions which don’t across suspend points would be optimized correctly and expectly.

The downside for the proposal is that we need to insert a new optimization pass to the pipelines. But if we refused to do so, the proposal would break the optimization for readnone functions. I think it is benefical to run an extra pass to enable the optimization passes.

Note: I said no performance regression in the first post. But I found it couldn’t be proved due to the complexity of phase ordering problems. I mean if there is a optimization pass which depends on the optimization for readnone function and this optimization pass wouldn’t run after CoroCleanup pass. The current proposal would break it.
Here is one hacky example:

auto a = pthread_self();
auto b = pthread_self();
a_big_function(a == b);

Previously, a_big_function might be inlined due to its parameter is a constant (constant parameter would increase the chance a call get inlined). But now it might not be inlined due to its parameter is not a constant any more.

So the solution couldn’t be proven to be no performance regression in all cases. But I would still love to say it wouldnt’ cause performance regression in most cases : )

-------------- End of the update of the proposed solution -------------------------------

efriedma-quic · April 25, 2022, 6:27pm

The new intrinsic and attribute need LangRef documentation.

Could you go into more detail about the performance regressions? In particular, is the issue the “readnone”, or the thread-local variable handling? I’d like to avoid adding extra optimization passes to the general pass pipeline just for the sake of coroutines.

Instead of rewriting the attributes in CoroEarly, maybe it would make sense to have the frontend add CoroReadNone attributes? That would preserve the meaning of the actual “readnone” attribute, if someone wanted to use it for other reason.

rjmccall · April 25, 2022, 7:46pm

This is a nasty problem. Being able to freely move, combine, and remove calls to pthread_self is really valuable, and it’s only an issue in coroutines, which are highly abnormal code.

I agree that we have two separate properties here that on some level it would be nice to distinguish:

readnone except it observes the identity of the current thread
readnone with zero caveats

And I agree that, in the abstract, __attribute__((const)) probably ought to mean the stronger property, i.e. property 2. However, I don’t think it’s acceptable for us to say that it’s retroactively wrong for people to have been using __attribute__((const)) on functions with property 1 just because it breaks in a language feature that came into existence 15 years after __attribute__((const)) was added. That’s particularly true because there are no global effects here: it’s fine to move calls in functions that are called by coroutines, just not in the coroutines themselves. As a result, I think we need to accept the constraint that __attribute__((const)) can mean property 1, which at a high level means that we need to not arbitrarily move and combine calls to __attribute__((const)) functions in coroutines.

Interpreting that into LLVM, in the short term the implication is that we need to not move and combine calls to readnone functions in unsplit coroutine bodies. (Technically, we can move and combine them as long as we don’t cross a suspension point, but that’s not an easy constraint to satisfy given how that optimization is written.) That seems to be the technical direction you’re proposing, so it looks like we’re in agreement. We should still be able to remove unused calls to readnone functions, at least.

In the longer term, LLVM could introduce some sort of thread_readnone attribute distinguished from readnone. readnone would not be allowed to observe the current thread, so we’d be able to move and combine readnone calls in unsplit coroutine bodies, but not thread_readnone calls. Of course, we’d have to compile __attribute__((const)) as implying thread_readnone. I’d be interesting in knowing whether we have evidence that optimizing calls in unsplit bodies is important enough to justify that additional complexity; I could see it going either way.

efriedma-quic · April 25, 2022, 8:00pm

⚙ D124363 [Coroutines] Don't optimize readnone function before we split coroutine (4/5) introduces coro_readnone. That has basically the same semantics as your proposed thread_readnone.

rjmccall · April 25, 2022, 8:08pm

Ah, thanks, I hadn’t caught up on the proposed patches yet.

It would probably be best if this thread could serve as a true summary of the problem and its proposed solution instead of expecting readers to also read all the patches before they reply.

ChuanqiXu · April 26, 2022, 3:31am

@efriedma-quic

The new intrinsic and attribute need LangRef documentation.

Oh, the new intrinsic and attribute are intended to be used internally. So I don’t update the documentation. I found similar cases for llvm.coro.subfn.addr intrinsics. It is used internally for coroutine lowering and it is not documented. Do you think it is necessary to document such internal intrinsic and attributes?

Could you go into more detail about the performance regressions? In particular, is the issue the “readnone”, or the thread-local variable handling? I’d like to avoid adding extra optimization passes to the general pass pipeline just for the sake of coroutines.

I add an update in the main post which discusses the issue. It is about readnone function. I understand it feels not good to insert a new pass but I don’t find better solutions…

Instead of rewriting the attributes in CoroEarly, maybe it would make sense to have the frontend add CoroReadNone attributes?

There are two reasons to implement it in CoroEarly:
(1) it is easier
(2) it serves more languages. I’m wondering if there are similar problems for swift and MLIR.

That would preserve the meaning of the actual “readnone” attribute, if someone wanted to use it for other reason.

From what I can see, the users would still get the same result for readnone attribute.

@rjmccall

Thanks for your input. I think we’re in agreement. I updated the proposal in the first post. Hope it could be the true summary.

Thanks

rjmccall · April 26, 2022, 3:49am

Thank you for the additional write-up.

I don’t think we have a guarantee that the pass will run early enough for that treatment of thread-local variables to work. The fact that it’s a constant is really problematic, because there won’t be code emitted at the site where you take the address of the TLS, and instead that code will be automatically sunk to its uses. This is basically the IR structurally assuming that it cannot observe a thread change within a function.

Your workaround works when the address is written into an alloca or something, as long we promise not to do mem2reg before this pass. But it doesn’t work if the frontend can avoid going through memory, which sometimes it does. For example, if you take the address of TLS for one argument to a call, and the next argument suspends the coroutine, we’ll probably make IR that passes the GlobalVariable* directly as the call argument, which of course may observe the wrong thread. (That specific example might be fine because of the undefinedness of argument evaluation order, but there should be other examples that are well-defined but broken.)

I think that part, at least, needs frontend involvement. Fortunately, it might be okay to just focus on Clang for this, because Swift doesn’t directly use TLS in the frontend; I don’t know about MLIR.

ChuanqiXu · April 26, 2022, 5:27am

To avoid misunderstanding, I want to be sure that your suggestion is to emit intrinsics for TLS variable and special attribute for __attribute__((const)) functions in the frontend instead of CoroEarly pass?

And there are two statements I don’t understand a lot:

The fact that it’s a constant is really problematic, because there won’t be code emitted at the site where you take the address of the TLS, and instead that code will be automatically sunk to its uses.

Yeah, but I feel like it isn’t problem to insert llvm.coro.intrinsics before the uses of TLS variables.

But it doesn’t work if the frontend can avoid going through memory, which sometimes it does.

The following example might be fine as you said. And I don’t understand the reason why the workaround couldn’t work if it doesn’t go through memory.

I don’t have a strong opinion to emit the intrinsic and attribute in the frontend. I just feel is not better than emitting in CoroEarly and it is simpler to implement in CoroEarly. And I think we could control the position of CoroEarly pass to make it in the early enough places.

Beside the place to emit intrinsic and attribute, I am wondering your thoughts to insert a EarlyCSE pass after CoroCleanup pass. Do you think it is acceptable?

ChuanqiXu · April 26, 2022, 5:39am

Oh, do you refer to the sentence why we don’t solve the problem by forwarding CoroSplit? I send some words at: ⚙ D124364 [Pipelines] Enable EarlyCSE after CoroCleanup to avoid runtime performance losses (5/5)

rjmccall · April 26, 2022, 5:10pm

IIRC, something like

thread_local int x = 0;
void helper(int *x, int y);
my_coro coroutine() {
  helper(&x, co_await something_suspending());
}

will get generated as:

@x = thread_local global i32 0, align 4
define @coroutine() {
  call @llvm.coro.id(...)
  // &x is actually evaluated to @x here, but it implicitly sinks
  call @llvm.coro.suspend(...)
  %arg2 = ...
  call @helper(i32* @x, i32 %arg2)
}

So the problem here is that the evaluation of &x has essentially been sunk below the suspension just because it’s a constant and IR doesn’t anchor it as an instruction. This is coming out of the frontend, so your pass will replace the @x when it’s used as an argument to @helper, which is not when it was formally evaluated; as a result, we’ll take the address of the TLS on the thread we’re on after the suspension, not before.

ChuanqiXu · April 27, 2022, 2:55am

Yeah, this example what you described before. It might be fine due to language don’t specify the order to evaluate parameters as you said. I’ve spent some time to find a well-defined but broken example but I failed. Although we don’t find a concrete example to show the current solution is bad, I agree that the solution to emit intrinsics and attributes would make us feel better.

Will refactor the patch to emit intrinsics and attributes in the frontend.

danilaml · April 27, 2022, 11:09am

Why is the inlining a problem here? It appears possible to me to make inliner “smarter” about inlining into coroutines by transforming certain TLS accesses into the proposed intrinsics (or inserting some sort of “tls mem barrier” after the inlined function body).

ChuanqiXu · April 27, 2022, 11:19am

Oh, good catcha! I think it makes sense to rewrite the access of TLS variable and readnone function when inlining. Will do.

ChuanqiXu · April 28, 2022, 9:20am

@danilaml

I tried to implement your idea for TLS variable: ⚙ D124592 [DRAFT] [Coroutines] Add coro_maychange intrinsic to coroutines only for TLS variables But I feel it is not good since its time complexity looks bad and it wouldn’t make optimization better. It doesn’t matter to add llvm.coro.maychange for TLS variables in normal functions since it is a travial optimization to move and combine constant values.

But your suggestion helps for the pthread_self() problem. It looks like we could add coro_readnone attribute for callsites in unlowered coroutines only and replace readnone with coro_readnone when we inline a function into an unlowered coroutine. Although I meet some problems that the optimizer is still going to optimize the call sites whose callee is marked as readnone…

danilaml · April 28, 2022, 2:25pm

Well, it should be done linearly time-complexity wise, so at worst just some constant factor increase for inlining, and the regular non-coro codepath shouldn’t be affected in any way. I don’t know if the other approaches have this property. Not sure what you mean by “make optimizations better”.

efriedma-quic · April 28, 2022, 5:30pm

As a practical matter, the change to TLS variables seems less complicated, since all the complexity is centered around the new intrinsic; I think we should try to get that landed first.

For both llvm.coro.maychange and coro_readnone functions, have you tried marking them readonly inaccessiblememonly? That should recover a lot of the optimizations we normally do. And we can probably tweak alias analysis to recover almost all of the optimizations we still miss at that point.

Distinguishing between “coroutine” and “non-coroutine” functions seems complicated. In particular, it makes interprocedural optimizations harder: every interprocedural optimization that runs before CoroCleanup would need to be aware of the coroutine-specific constructs, to avoid breaking the invariants. And that seems sort of counterproductive… if we’re going to need to mess with all these optimizations anyway, there’s not much point to introducing the new attribute. It’s not just the inliner; it’s also IPSCCP, and Attributor, and every other pass we run.

rjmccall · April 28, 2022, 11:11pm

Unlowered coroutines have novel invariants and generalize the set of things that can be expressed in IR, so they fundamentally do require support from optimizer passes that are maintaining invariants and making assumptions about IR. I think we can summarize Chuanqi’s proposed approach by saying that he’s strengthening the invariants in unlowered coroutines. That necessarily requires support from interprocedural passes so that they don’t break those invariants when they move code/information into unlowered coroutines. For readnone, that seems like a pragmatic way to isolate intraprocedural passes by avoiding exposing them to information that they can’t soundly use in unlowered coroutines. For TLS, it is mandatory because the basic representation chosen by LLVM IR is designed to automatically do an optimization that’s not sound in unlowered coruotines. So I think this is the way to go unless we think we’re not going to be able to isolate the intraprocedural passes in the end.

ChuanqiXu · April 29, 2022, 2:37am

I mean the current approach wouldn’t affect us to optimize TLS variables in normal functions and lowered coroutines. This is showed in the test of ⚙ D124361 [Coroutines] Add coro_maychange intrinsic to solve TLS problem (2/5). So the idea of involving inlining to not add llvm.coro.may_change to non-coroutines looks complicated but not beneficial to me.

Good idea! Will do.

The last problem we’re discussing should be whether or not to involve other passes to solve TLS problem. I prefer to not involve additional passes since a call to a function without body breaks many optimizations naturally. And the difference of TLS problem with the readnone problem is: the current solution for TLS variable wouldn’t affect the optimization in normal functions and unlowered functions. So it makes sense that we want to make the solution of readnone problem better. But I feel it is not necessary to require to not insert llvm.coro.may_change in normal functions. Unless we want to make the style of two solutions uniform. But I feel it is not necessary.

(BTW, I feel like we need the involvement of other passes for readnone problems if we don’t like the current solution for readnone problems)

I made a try to only insert llvm.coro.may_change to coroutines and handle it in inliner in: ⚙ D124592 [DRAFT] [Coroutines] Add coro_maychange intrinsic to coroutines only for TLS variables. But it looks not good to me due to the complexity problems. Maybe we could have better implementation to eliminate the complexity problem but I feel it might not be too simple. And as Eli said, I am not sure if it is necessary to involve other IPA/IPO. According to my understanding to IPSCCP, I think we don’t need the involvement of IPSCCP. But I don’t know about Attributor. We might need to consider a lot of IPA/IPO to make sure we don’t missed anything. But I feel like the direction is error prone.

jyknight · April 29, 2022, 3:01am

For TLS access: I think we should be attacking this from a more principled standpoint, and actually deprecate/remove the ability to access a TLS variable’s address as a constant expression altogether.

It’s really just NOT a constant expression – I think that was just an incorrect representational choice. I also note that LLVM has some pretty busted behavior right now, independent of coroutines. E.g.

@x = thread_local global i32 0, align 4
@y = global i32* @x, align 8

That IR compiles, but it’s nonsense. There’s nothing that @y can actually be initialized to here, which is a direct consequence of i32* @x being a fake constant.

rjmccall · April 29, 2022, 3:08am

I mean, if we’re actually willing to change the general representation of TLS accesses, that’s fine with me; I agree the representation is not good. You will need to make a specific RFC about that, though, and do a bunch of implementation work to e.g. upgrade existing bitcode and avoid performance regressions in non-coroutine code, and of course whatever optimizations you do will have to be weakened in coroutine bodies.

Topic		Replies	Views
[RFC] Coroutine and pthread_self LLVM Dev List Archives	19	508	December 14, 2020
[RFC] LLVM Coroutines LLVM Dev List Archives	9	212	June 11, 2016
RFC: Coroutine Optimization Passes LLVM Dev List Archives	8	143	July 18, 2016
[RFC] LLVM Coroutines LLVM Dev List Archives	19	165	June 16, 2016
RFC: LLVM Coroutine Representation, Round 2 LLVM Dev List Archives	5	110	July 21, 2016

Address thread identification problems with coroutine

For TLS variable,

For readnone attribute (pthread_self() problem)

Related topics