Misoptimization of TLS and __attribute__((const)) in stackful coroutines (-ffunction-thread-migration)

The problem highlighted in “Address thread identification problems with coroutine” is actually broader.

With stackful coroutines, the assumption that a function can only ever run on a single thread does not hold. So for the following code:

thread_local_variable = a;
something(); // `co_await something;` with stackless coroutines
use of thread_local_variable

the compiler would transform it into:

thread_local_variable = a;
something(); // `co_await something;` with stackless coroutines
use of a

However, something(); may suspend the current coroutine and resume it on another thread. The same applies to pthread_self(). The reason pthread_self() gets (mis)optimized is that it is marked __attribute__((const)), so the compiler thinks every call to pthread_self() returns the same result.
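
For illustration, here is a minimal sketch of the pthread_self() case (yield_to_scheduler() is a hypothetical suspension point that may resume the fiber on another thread):

#include <pthread.h>

void yield_to_scheduler(); // hypothetical: suspends the fiber, may resume it elsewhere

void check_thread() {
  pthread_t before = pthread_self();
  yield_to_scheduler();             // the fiber may migrate to another thread here
  pthread_t after = pthread_self(); // may be CSE'd to `before` because pthread_self()
                                    // is __attribute__((const)), even though the
                                    // actual thread has changed
  (void)before; (void)after;
}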

AFAIK the above problems are fixed for stackless coroutines in the Coro* passes. However, with stackful coroutines the problems persist, since any function could perform the coroutine migration and there is no co_* keyword to notify the compiler about it.

Reproducer: Compiler Explorer, where clang-16 hoists the lea rbx, [rax + sample()::i@TPOFF] out of the inner loop, even though the call to engine::Yeild() migrates the coroutine.
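
A sketch approximating that reproducer (names taken from the assembly above; the loop bounds are placeholders):

struct engine {
  static void Yeild(); // suspends the current fiber; it may be resumed on another thread
};

void sample() {
  static thread_local int i = 0;
  for (int outer = 0; outer != 4; ++outer) {
    for (int inner = 0; inner != 1024; ++inner) {
      ++i;             // clang-16 hoists the TLS address computation
      engine::Yeild(); // (lea ... i@TPOFF) out of the loops, but the
    }                  // fiber may migrate at every Yeild() call
  }
}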

Would it be possible to add a command-line option like -ffunction-thread-migration to adjust the existing Coro* passes with the assumption that every (non-__attribute__((const))) function call could resume the function on another thread?

Stackful coroutines are quite popular and used in multiple projects. Off the top of my head:

  • https://wg21.link/P0876 - Stackful coroutines proposal for inclusion into C++ Standard
  • Boost.Context/Boost.Coroutine/Boost.Coroutine2/Boost.Fibers
  • userver
  • Bloomberg’s quantum
  • YDB
  • Facebook’s folly::fibers
  • QEMU

IIRC, the proposed WG21 paper P0876 doesn’t allow stackful coroutines to switch between threads.

AFAIK the above problems are fixed for stackless coroutines in the Coro* passes.

That’s not the case: the fixes live in several separate optimization passes, not in the Coro* passes.

Would it be possible to add a command-line option like -ffunction-thread-migration to adjust the existing Coro* passes with the assumption that every (non-__attribute__((const))) function call could resume the function on another thread?

So if we want to do this, given that stackful coroutines are transparent to the compiler, I think we can only forbid all such optimizations across the whole compilation.

Would it work with LTO if all the translation units are compiled with -ffunction-thread-migration, or would LTO misoptimize TLS/__attribute__((const)) accesses?

If we actually disabled all the optimizations for those cases, we shouldn’t miscompile it. But if we’re asking whether we can keep some optimizations: that may be technically possible, but hard on the one hand, since interprocedural analysis is always hard. On the other hand, even if we can do that, I think we will still have to give up in most cases.

FYI, I’ve created “Addresses of TLS variables are kept alive across fiber/stack-full coroutine context switches which may result in a crash” (llvm/llvm-project issue #98479) to track issues like this.

I feel like it’s sort of an alias-analysis problem.
@llvm.threadlocal.address.p0 is marked memory(none), while in actuality it reads memory(tls)/memory(args), which can be clobbered by any call (not specifically marked as “context-switch safe”), if I understood the problem correctly.
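
In source terms, that is roughly the following (a sketch; opaque_call() stands for any external function that might switch fibers):

thread_local int t;
void opaque_call(); // may suspend the fiber and resume it on another thread

void f() {
  int* p = &t;   // lowered through @llvm.threadlocal.address
  opaque_call(); // nothing marks this call as clobbering the TLS address,
  int* q = &t;   // so `q` is CSE'd to `p`, which may now point into the
                 // wrong thread's TLS block
  (void)p; (void)q;
}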

This certainly isn’t a bug or a miscompile. You might call it a missing feature. But my inclination is that we should call this use-case invalid, and should not add support for it to the compiler.

Indeed, the standards proposal for “stackful coroutines” forbids resuming on a different thread than the one it was suspended on, so this won’t be a problem for that proposal.

Other uses of stack-switching coroutines I’m aware of have also been careful to keep this property. For example, I believe Folly only supports scheduling fibers within a single thread.

FYI, I think this problem was also discussed in “Is there a way to indicate a function is a coroutine suspension point?”.

I believe invalidating such a use-case would also invalidate the most efficient implementations of fibers and scheduling algorithms:

  1. To reduce resource (re)allocation, fiber pools are implemented by suspending a fiber on thread-1 and putting it into a queue, from which an arbitrary thread-2 can later extract it and switch to it, so that the fiber gets reused (see the sketch after this list). Enforcing a one-thread-per-function rule would require these pools to be thread-local and would thus cause memory overconsumption (e.g., thread-1 runs a lot of fibers during the interval [0, 10] and then does nothing, while thread-2 does nothing during [0, 10] and then wants to run a lot of fibers: it could reuse the fibers left by thread-1 instead of allocating new ones).

  2. A work-stealing thread pool would not be able to run code in fibers properly: even if the “task” of resuming a fiber is scheduled onto the same thread-1, another thread-2 might “steal” it during load balancing, causing the fiber to switch threads. Hardcoding some separation between normal tasks and context-switching tasks would noticeably degrade the quality of load balancing and is therefore undesirable.
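
To make point 1 concrete, here is a minimal sketch (Fiber, current_fiber(), and suspend_to_scheduler() are hypothetical) of a pool in which a fiber parked by one thread is resumed by another:

#include <deque>
#include <mutex>

struct Fiber;                // hypothetical fiber handle
Fiber* current_fiber();      // hypothetical: the fiber running on this thread
void suspend_to_scheduler(); // hypothetical: switch back to the scheduler

std::mutex pool_mutex;
std::deque<Fiber*> fiber_pool;

void park_current_fiber() { // runs on, say, thread-1
  {
    std::lock_guard<std::mutex> lock(pool_mutex);
    fiber_pool.push_back(current_fiber());
  }
  suspend_to_scheduler(); // resumes once some thread extracts us and switches to us
}

Fiber* take_parked_fiber() { // may run on thread-2
  std::lock_guard<std::mutex> lock(pool_mutex);
  if (fiber_pool.empty())
    return nullptr;
  Fiber* f = fiber_pool.front();
  fiber_pool.pop_front();
  return f; // switching to `f` on thread-2 migrates the fiber
}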

Generally, in an attempt to circumvent compiler-optimization issues and other caveats caused by thread switching, we represent thread_local variables in the form of a reference-returning function:

__attribute__((__noinline__)) int& Variable()
{
  static thread_local int data = {};
  asm volatile(""); // empty volatile asm: an optimization barrier intended to
                    // keep this function opaque to the optimizer
  return data;
}

Unfortunately, with newer clang versions and LTO enabled, we observe calls to this function being optimized under one-thread-per-function assumptions, which obviously causes data races when the assumptions don’t hold.
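
Roughly what we observe (a sketch; yield() stands for our context switch):

int& Variable();     // the function above
extern void yield(); // suspends the fiber; it may resume on another thread

void update() {
  int& a = Variable(); // resolved on thread-1
  yield();             // the fiber migrates to, say, thread-2
  Variable() = 1;      // with LTO this call can be collapsed into `a`,
                       // writing into thread-1's `data` from thread-2
  (void)a;
}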

I don’t really know how hard it would be to implement, but I think an attribute that asks the compiler and linker not to do anything at all with the marked function would be a satisfactory solution (even if it leads to some silly scenarios, such as 10 calls to Variable right next to each other not being collapsed into a single one). We would be able to enjoy the benefits of enabling LTO again, and clang developers wouldn’t have to give up the one-thread-per-function assumption in their optimization logic.

Without endorsing that hack, I think it ought to fix it if you use __attribute__((noinline,optnone)) instead of only noinline.
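
That is, a sketch of the suggested variant:

__attribute__((noinline, optnone)) int& Variable()
{
  // optnone is intended to keep the optimizer from analyzing or rewriting
  // the body, so interprocedural passes can no longer fold call sites
  static thread_local int data = {};
  asm volatile("");
  return data;
}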