Address thread identification problems with coroutine

jyknight · April 29, 2022, 3:30am

If we implement it using an intrinsic, (e.g. marked nounwind readnone willreturn), then no special behavior will need to be implemented for coroutines. Fixing readnone handling in coroutines then automatically fixes TLS access as well.

rjmccall · April 29, 2022, 4:20am

If you make optimizing TLS accesses unnecessarily weaker in non-coroutines, yes. Those attributes will forbid reordering materializations of the address of a thread-local variable with an unknown call.

ChuanqiXu · April 29, 2022, 6:07am

To make sure that I don’t misunderstand, let me try to repeat your suggestion:
(1) Replace access of TLS variable with an intrinsic marked as nounwind, readnone, willreturn.
(2) Try to handle readnone problems in coroutines.

It looks like we could solve the thread-identity problems in coroutines in this approach. But I feel like it wouldn’t help with the general TLS access problem you said:

@x = thread_local global i32 0, align 4
@y = global i32* @x, align 8

Do I misunderstand anything?

In case I don’t misunderstood you, the approach combines two problems together. However, the current problem I saw is that we didn’t find a good enough solution for readnone problems. And in my proposed solution, it looks possible that we could solve TLS problem at least. But if we adopt the proposal you said, I feel like we must solve the two problems together, which seems like a little bit worse to me.

jyknight · April 29, 2022, 3:18pm

I’m not sure what you mean. Reordering and coalescing is certainly allowed with those attrs? I wrote an example just to make sure I wasn’t missing something obvious here…

declare i32* @get_tl() nounwind readnone willreturn
declare void @unknown()

define i32 @foo() {
  %1 = tail call i32* () @get_tl() nounwind readnone willreturn
  call void @unknown()
  %2 = tail call i32* () @get_tl() nounwind readnone willreturn
  %3 = load i32, i32* %1
  %4 = load i32, i32* %2
  %5 = add i32 %3, %4
  ret i32 %5
}

piped to opt -O2 results in only a single call to @get_tl, as desired.

; ModuleID = '<stdin>'
source_filename = "<stdin>"

; Function Attrs: mustprogress nofree nosync nounwind readnone willreturn
declare i32* @get_tl() local_unnamed_addr #0

declare void @unknown() local_unnamed_addr

define i32 @foo() local_unnamed_addr {
  %1 = tail call i32* @get_tl() #1
  tail call void @unknown()
  %2 = load i32, i32* %1, align 4
  %3 = shl i32 %2, 1
  ret i32 %3
}

attributes #0 = { mustprogress nofree nosync nounwind readnone willreturn }
attributes #1 = { nounwind readnone willreturn }

jyknight · April 29, 2022, 3:19pm

Replace address computation of TLS variables, not access, but yes.

It does solve it, by making that IR not possible to write. In my not-really-fleshed-out idea here, @x should no longer be usable as a pointer, so i32* @x is impossible to write in IR – which is as it should be. You need an Instruction to retrieve the address for @x, and thus it’s clearly impossible to use in a global value initializer. Perhaps @x has become a token type of some sort, which can be passed to the intrinsic but isn’t otherwise usable?

We must solve the readnone problem, so this line of objection isn’t very persuasive to me. It’s not just a nice-to-have that we can ignore… And, additionally, I think we do have good options for solving it.

rjmccall · April 29, 2022, 3:28pm

You’re right, sorry; Eli had been suggesting readonly, and I didn’t notice you’d changed it toreadnone. But if you use readnone, then surely you’ll be able to do re-ordering with coro.suspends in coroutine bodies, where it is illegal. Unless you’re suggesting this only as a way of reducing the TLS problem to the more general readnone problem?

danilaml · April 29, 2022, 3:29pm

If the “address of tls is a constant” issue is solved in IR (which appears to be already potentially unsound as demonstrated), I guess then the coroutine problem just reduces to a kind of issues GC has that LLVM already handles with suspend intrinsics acting as a statepoint.

efriedma-quic · April 29, 2022, 10:22pm

The purest solution here is to introduce @llvm.threadlocal.address to compute the address of thread-local variables as discussed. And then to handle the “readnone” problem, we introduce attributes readnone_except_thread_id to indicate a function is pure except that it computes the thread id/address of thread-local variables, and attribute may_suspend_thread to indicate that a function contains a suspend point where the current hardware thread can change. And then use all of this globally everywhere. This works not only for coroutines, but also any other form of cooperative green threads.

The primary problem is the “globally everywhere” part; the changes aren’t isolated to programs that use coroutines, so we’d have work carefully to avoid regressions.

nhaehnle · May 2, 2022, 7:28pm

+1 to this approach, since everything else seems to subvert the very meaning of SSA form.

Though may I propose as slightly cleaner and more forward looking attributes and attribute names instead:

noread_thread_id
maywrite_thread_id

Without either of these attributes, a function may read the thread ID but it may not change it. This is a reasonable default for most programming languages.

(The exception would be a language with pervasive green threads – such a language would have to conservatively mark all functions as maywrite_thread_id in the frontend, similar to how GPU languages must conservatively mark all functions as convergent in the frontend.)

readonly / readnone only apply to memory in this design. A readnone function may still read from the thread_id. And a readnone function may still be maywrite_thread_id.

ChuanqiXu · May 5, 2022, 3:01am

Thanks for the summary. I think it makes sense. I would try to work on this direction.

I feel like it is a little bit complicated. IIUC, your proposal would mark a coroutine with maywrite_thread_id . But it is not true, the coroutine self wouldn’t write the thread id. It would only resume in another hardware thread.

jyknight · May 5, 2022, 1:48pm

I’ll note this is basically what was already discussed a couple years ago – “Proposal 2” in my message in 2020. [RFC] Coroutine and pthread_self - #17 by jyknight

Still seems like the right answer to me.

Yet, while it’s useful to discuss what new attributes we might want, conceptually, I think we don’t actually need to introduce them yet. We can simply assume that all functions can read the thread-id, and that the only thing which could possibly ‘write’ it is a direct call to the intrinsic @llvm.coro.suspend.

Also note that for coroutines, especially, the “changes thread id” property cannot transitively propagate up to callers of a coroutine function. We don’t need to worry that a function we call might, itself, cause a thread-switch from the point-of-view of the caller function body – that’s not possible.

efriedma-quic · May 5, 2022, 6:46pm

I’m afraid if we don’t actually have the attributes, the implementation gets a lot more confusing.

A common predicate for call instructions is going to be something like “if the callee doesn’t read the thread id, or the caller doesn’t write the thread id”. i.e. something we don’t need to model as a memory access. Without the attributes, this actually boils down to “are we in a coroutine”: only coroutines can write the thread id, and we have to assume any call can read the thread id.

So the implementation comes down to scattering “isInCoroutine()” checks all over the place, and nobody will understand why these checks exist. And optimization inside coroutines probably gets significantly worse because we’re forced to assume any call reads the thread id.

Maybe we can mitigate this with carefully named helper methods.

jyknight · May 5, 2022, 8:56pm

What you need to know is: “does the body of the function I’m looking at optimizing contain any coroutine suspend calls?” If it doesn’t, you can apply the current readnone optimizations, but if it does, you can’t – unless/until you rewrite such optimizations to prohibit movement across coroutine suspend points.

But that property isn’t represented by those attributes. A “maywrite_thread_id” function attribute shouldn’t be set on a function that contains a call to llvm.coro.suspend, only on the llvm.coro.suspend itself.

This is kinda similar to the way we power down certain optimizations in functions that call setjmp, using Function::callsFunctionThatReturnsTwice(). So, sure, we could introduce a Function::callsFunctionThatMaywriteThreadId(), which checks for any call to something with the maywrite_thread_id attribute. And, yea, perhaps we should implement that now, even in absence of an explicit attribute, simply delegating to Function::isPresplitCoroutine().

As for optimizations getting significantly worse without the addition of noread_thread_id: it’s possible, but I don’t think it’s likely to be a major issue. The optimization passes also run post-CoroSplit, at which point all current optimizations will apply normally. I suspect there is some benefit to noread_thread_id, allowing some readnone optimizations to apply before CoroSplit. I just doubt it’s significant.

ChuanqiXu · May 6, 2022, 2:59am

We don’t need to know if the current function calls (stackless) coroutines. We only need to know if the current function is a coroutine. Since a normal function would only run in a single thread no matter if it calls stackless coroutines. Here I said stackless coroutine since I found there are notes about green thread (stackful coroutine). The problems in green thread is a little bit more complicated. I think we could fix them later (I met the problem about TLS in stackful coroutine indeed!)

I found the current discussing topic is about implementation. Here are my thoughts about the implementation:
(1) Replace the access of TLS variable with @llvm.threadlocal.address() intrinsic. The intrinsic is readnone, willreturn and nothrow. (* But it might not be able to fit the expection. See the explanantion in the end)
(2) In CoroEarly, we could replace readnone attribute with coro_readnone attribute in coroutines.
(3) In inlining, if we are inlining a function to a coroutine, we would replace the inlined calls of readnone to coro_readnone.
(4) In EarlyCSE and other passes, we need to disable the optimization in case the call is coro_readnone but the callee is readnone.
(5) In CoroCleanup, we would replace every coro_readnone with readnone attribute.
(6) (Optional) Run EarlyCSE after CoroCleanup to enable the optimization for readnone in the lowered coroutines.

Let me talk about (4). Simply, the optimizer would optimize the following code:

void foo() {
    call void @bar() coro_readnone
    co_await something();
    call void @bar() coro_readnone
}

declare void @bar() readnone

We could find this in https://github.com/llvm/llvm-project/blob/3b762b3ab8d205cd6a7d42c96d39d5f4f701f2ab/llvm/include/llvm/IR/InstrTypes.h#L2315-L2325 and https://github.com/llvm/llvm-project/blob/3b762b3ab8d205cd6a7d42c96d39d5f4f701f2ab/llvm/lib/Analysis/BasicAliasAnalysis.cpp#L793-L796 at least.

The pros and cons of the solution is:

Pros:
We don’t need to worry about the optimization in normal functions. It would remain the same by design.

Cons:
As we discussed above, we don’t know if there are other IPA/IPO which need to be handled.
The implementation is much complex. At least we need to get involved in alias analysis.

Another option is to replace every readnone attribute with coro_readnone in the module in CoroEarly and replace them back in CoroCleanup. And we need to run EarlyCSE after CoroCleanup.

The pros and cons is:
Pros:
The implementation is much cleaner.

Cons:
The optimization in many normal functions may be changed. We couldn’t be sure how much optimization chances would be remained by the EarlyCSE after CoroCleanup.

Personally, I prefer the first solution if it is implementable since the optimization is important. @efriedma-quic Would you like to pay the risk that there are other IPA/IPO to get involved?

(About @llvm.threadlocal.address())

Previously, we decide to insert intrinsic for TLS variable in the frontend. But if we insert the intrinsic in the frontend, we couldn’t block the motivation example given by James:

@x = thread_local global i32 0, align 4
@y = global i32* @x, align 8

BTW, the C++ standard has specification for initialize globals with TLS variable [stmt.dcl]p3:

Dynamic initialization of a block variable with static storage duration or thread storage duration is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization.

So the following code should be legal in C++:

thread_local int i;
int j = i;

nhaehnle · May 6, 2022, 6:31pm

Such cases don’t lead to global variable initializers in LLVM IR though, do they?

nhaehnle · May 6, 2022, 6:55pm

This sounds reasonable to me.

Right. Whether a maywrite_thread_id attribute propagates to callers depends on the context, and for coroutines the answer is “No”.

I am aware of two situations in which the answer would be “Yes”, none of which has proper support in LLVM yet (though I’m interested in one of them):

A system with green threads / some form of stackfull coroutines.
Raytracing shaders in graphics languages, where a TraceRay() builtin plays a similar role to green thread suspend/yield.

That said, just keying of @llvm.coro.suspend or a coroutine attribute on the function seems fine for now.

ChuanqiXu · May 9, 2022, 6:08am

Yes, the case wouldn’t generate code like the example that James gives.

jyknight · May 9, 2022, 5:46pm

We only need to know if the current function is a coroutine.

Yes, that’s what I was trying to say. (“is a coroutine” being effectively equivalent to “has a coroutine suspend call”).

Option 1: replace readnone attribute with coro_readnone attribute in coroutines
Option 2: replace every readnone attribute with coro_readnone in the module

I don’t like either of those two options: I don’t think it’s a good idea to have a pass convert readnone into coro_readnone to hide it from other optimizations, and then convert back again later to unhide. That is a workaround, not a solution.

Instead, optimization passes should each be checking for their required properties directly. So in code where previously [Call is readnone] by itself was sufficient, now it ought to be doing something like:
[Call is readnone] AND NOT [Call’s Parent Fn is a pre-split cororoutine].

Or if we later add a noread_thread_id attribute:
[Call is readnone] AND ([Call is noread_thread_id] OR NOT [Call’s Parent Fn is a pre-split cororoutine])

(And instead of “is a pre-split coroutine”, you could write “contains calls that may write thread-id” or “contains coroutine suspend calls”. All the same thing.)

ChuanqiXu · May 10, 2022, 7:25am

Personally, I don’t feel it is bad to insert checks (Call is readnone and the current Fn is presplit coroutine) in passes. However, from what I see, almost the changes which want to fix coroutine problems in other passes get blocked. Some examples: ⚙ D104007 [BasicAA] Properly mark that coroutine suspension may modify parameters, ⚙ D101510 Do not merge memcpy if the first source is a parameter of coroutine function, ⚙ D89711 [Coroutine] Prevent value reusing across coroutine suspensions in EarlyCSE and GVN. (I didn’t say the decisions made is not proper or correct). I feel people prefer to solve coroutines problem in coroutines codes.

The true problem here is that the misuse of readnone attribute for functions which need to do thread identification. So I think the true solution should introduce new an attribute readnone_except_thread_id and mark pthread_self() and all other similar misuses. And all other solutions are workarounds to me. But everyone of us agree that it is meaningless to fix the problem in library side.

Then I think the current solution is the best we could do here. Compare to your suggestion, I think the current solution bother other passes less (It touched inliner after all, I know) and makes coroutines passes depends other passes less, which fits the design intention of coroutines.

ChuanqiXu · May 10, 2022, 8:48am

I sent: https://reviews.llvm.org/D125291, ⚙ D125292 [Coroutines] Introduce "coro_readnone" operand bundles (2/3) and ⚙ D125293 [Coroutines] Run EarlyCSE for changed functions in CoroCleanup (3/3) to address the opinion I’ve received.

To summarize the current solution:
(1) Generate @llvm.threadlocal.address() to access TLS variables in the frontend. @llvm.threadlocal.address() is marked as willreturn ,readnone and nothrow.
(2) In CoroEarly Pass, remove the readnone attributes for the calls in coroutines and add “coro_readnone” operand bundles, readonly and inaccessiblememonly attributes for these calls.
(3) In inliner, when a readnone call get inlined to a unlowered coroutine, do the similar works as above.
(4) In CoroCleanup pass, remove “coro_readnone” operand bundles, readonly and inaccessiblememonly attributes and add readnone attributes for the calls with “coro_readnone” operand bundles.
(5) In CoroCleanup pass, run EarlyCSE pass for changed functions.

Pros:
We could be sure that the solution wouldn’t affect the optimization for normal functions.

Possible Question:
Why introduce operand bundles?

Given we don’t want to remove the attributes for the function declaration, it doesn’t work to remove the attributes at callsite simply. We could find the logic at https://github.com/llvm/llvm-project/blob/3b762b3ab8d205cd6a7d42c96d39d5f4f701f2ab/llvm/include/llvm/IR/InstrTypes.h#L2315-L2325 . We could find that it would try to lookup in the callee in case it failed to search at the callsites.
Then I feel operand bundles is better than attributes for the question. Since operand bundles mean semantics on the call actually. But the attributes is for function and arguments.

Other notes:

To James: I don’t think the current solution fits your original expectation, which tries to fix the problem that we treat the address of TLS variable as constant. I feel it is beyond the scope of the current topic (thread identification problem in coroutines). If you don’t like it, I would like to emit them in coroutines only. Now I emit them unconditionally since I feel it might be the direction you described.

To Eli and John: Cons: As Eli and I said before, we might need to touch other IPA/IPO if we don’t want to affect the optimizations for many normal functions. Currently, we only touched inliner. And I know IPSCCP and FunctionSpecialization wouldn’t get affected. But I really can’t be sure about other IPA/IPO.
Now I feel it is more horrible that the optimization for many other normal functions would be changed.
So I think we should take the current solution and take a “find then fix” manner for other IPA/IPO.

Is there any comments didn’t get addressed in this or in previous reply?

CC: @efriedma-quic @rjmccall @jyknight @nhaehnle @danilaml

Topic		Replies	Views
[RFC] Coroutine and pthread_self LLVM Dev List Archives	19	397	December 14, 2020
Misoptimization of TLS and __attibute__((const)) in stackful coroutines (-ffunction-thread-migration) IR & Optimizations coroutines	9	341	July 25, 2024
Is there a way to indicate a function is a coroutine suspension point? IR & Optimizations clang	12	362	March 26, 2024
Bug Report -- Possible optimizer bug with thread_local variables LLVM Dev List Archives	4	91	November 10, 2012
How to force re-evaluate thread_local address? LLVM Dev List Archives	2	95	February 1, 2016

Address thread identification problems with coroutine

Related topics