Misoptimization of TLS and __attribute__((const)) in stackful coroutines (-ffunction-thread-migration)

The problem highlighted in “Address thread identification problems with coroutine” is actually broader.

With stackful coroutines, the assumption that a function can only ever run on a single thread does not hold. So for the following code:

thread_local_variable = a;
something(); // `co_await something;` with stackless coroutines
use of thread_local_variable

the compiler would transform it into:

thread_local_variable = a;
something(); // `co_await something;` with stackless coroutines
use of a

However, something(); may suspend the current coroutine and resume it on another thread. The same applies to pthread_self(). The reason pthread_self() gets (mis)optimized is that it is marked __attribute__((const)), so the compiler thinks every call to pthread_self() returns the same result.
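
For illustration, here is a minimal sketch of the pthread_self() case (yield_to_scheduler() is a hypothetical suspension point that may resume the fiber on another thread):

#include <pthread.h>

void yield_to_scheduler(); // hypothetical: suspends the fiber, may resume it elsewhere

void check_thread() {
  pthread_t before = pthread_self();
  yield_to_scheduler();             // the fiber may migrate to another thread here
  pthread_t after = pthread_self(); // may be CSE'd to `before` because pthread_self()
                                    // is __attribute__((const)), even though the
                                    // actual thread has changed
  (void)before; (void)after;
}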

AFAIK the above problems are fixed for stackless coroutines in the Coro* passes. However, with stackful coroutines the problems persist, since any function could perform the coroutine migration and there is no co_* keyword to notify the compiler about it.

Reproducer: Compiler Explorer, where clang-16 hoists the lea rbx, [rax + sample()::i@TPOFF] out of the inner loop, even though the call to engine::Yeild() migrates the coroutine.
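
A sketch approximating that reproducer (names taken from the assembly above; the loop bounds are placeholders):

struct engine {
  static void Yeild(); // suspends the current fiber; it may be resumed on another thread
};

void sample() {
  static thread_local int i = 0;
  for (int outer = 0; outer != 4; ++outer) {
    for (int inner = 0; inner != 1024; ++inner) {
      ++i;             // clang-16 hoists the TLS address computation
      engine::Yeild(); // (lea ... i@TPOFF) out of the loops, but the
    }                  // fiber may migrate at every Yeild() call
  }
}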

Would it be possible to add a command-line option like -ffunction-thread-migration to adjust the existing Coro* passes with the assumption that every (non-__attribute__((const))) function call could resume the function on another thread?

Stackful coroutines are quite popular and used in multiple projects. Off the top of my head:

  • https://wg21.link/P0876 - Stackful coroutines proposal for inclusion into C++ Standard
  • Boost.Context/Boost.Coroutine/Boost.Coroutine2/Boost.Fibers
  • userver
  • Bloomberg’s quantum
  • YDB
  • Facebook’s folly::fibers
  • QEMU

IIRC, the proposed WG21 paper P0876 doesn’t allow stackful coroutines to switch between threads.

AFAIK the above problems are fixed for stackless coroutines in the Coro* passes.

That’s not the case: the fixes live in several separate optimization passes, not in the Coro* passes.

Would it be possible to add a command-line option like -ffunction-thread-migration to adjust the existing Coro* passes with the assumption that every (non-__attribute__((const))) function call could resume the function on another thread?

So if we want to do this, given that stackful coroutines are transparent to the compiler, I think we can only forbid all such optimizations across the whole compilation.

Would it work with LTO if all the translation units are compiled with -ffunction-thread-migration, or would LTO misoptimize TLS/__attribute__((const)) accesses?

If we actually disabled all the optimizations for those cases, we shouldn’t miscompile it. But if we’re asking whether we can keep some optimizations: that may be technically possible, but hard on the one hand, since interprocedural analysis is always hard. On the other hand, even if we can do that, I think we will still have to give up in most cases.

FYI, I’ve created “Addresses of TLS variables are kept alive across fiber/stack-full coroutine context switches which may result in a crash” (llvm/llvm-project issue #98479) to track issues like this.

I feel like it’s sort of an alias-analysis problem.
@llvm.threadlocal.address.p0 is marked memory(none), while in actuality it reads memory(tls)/memory(args), which can be clobbered by any call (not specifically marked as “context-switch safe”), if I understood the problem correctly.
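
In source terms, that is roughly the following (a sketch; opaque_call() stands for any external function that might switch fibers):

thread_local int t;
void opaque_call(); // may suspend the fiber and resume it on another thread

void f() {
  int* p = &t;   // lowered through @llvm.threadlocal.address
  opaque_call(); // nothing marks this call as clobbering the TLS address,
  int* q = &t;   // so `q` is CSE'd to `p`, which may now point into the
                 // wrong thread's TLS block
  (void)p; (void)q;
}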

This certainly isn’t a bug or a miscompile. You might call it a missing feature. But my inclination is that we should call this use-case invalid, and should not add support for it to the compiler.

Indeed, the standards proposal for “stackful coroutines” forbids resuming on a different thread than the one it was suspended on, so this won’t be a problem for that proposal.

Other uses of stack-switching coroutines I’m aware of have also been careful to keep this property. For example, I believe Folly only supports scheduling fibers within a single thread.

FYI, I think this problem was also discussed in “Is there a way to indicate a function is a coroutine suspension point?”.

I believe invalidating such a use-case would also invalidate the most efficient implementations of fibers and scheduling algorithms:

  1. To reduce resource (re)allocation, fiber pools are implemented by suspending a fiber on thread-1 and putting it into a queue, from which an arbitrary thread-2 can later extract it and switch to it, so that the fiber gets reused (see the sketch after this list). Enforcing a one-thread-per-function rule would require these pools to be thread-local and would thus cause memory overconsumption (e.g., thread-1 runs a lot of fibers during the interval [0, 10] and then does nothing, while thread-2 does nothing during [0, 10] and then wants to run a lot of fibers: it could reuse the fibers left by thread-1 instead of allocating new ones).

  2. A work-stealing thread pool would not be able to run code in fibers properly: even if the “task” of resuming a fiber is scheduled onto the same thread-1, another thread-2 might “steal” it during load balancing, causing the fiber to switch threads. Hardcoding some separation between normal tasks and context-switching tasks would noticeably degrade the quality of load balancing and is therefore undesirable.
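
To make point 1 concrete, here is a minimal sketch (Fiber, current_fiber(), and suspend_to_scheduler() are hypothetical) of a pool in which a fiber parked by one thread is resumed by another:

#include <deque>
#include <mutex>

struct Fiber;                // hypothetical fiber handle
Fiber* current_fiber();      // hypothetical: the fiber running on this thread
void suspend_to_scheduler(); // hypothetical: switch back to the scheduler

std::mutex pool_mutex;
std::deque<Fiber*> fiber_pool;

void park_current_fiber() { // runs on, say, thread-1
  {
    std::lock_guard<std::mutex> lock(pool_mutex);
    fiber_pool.push_back(current_fiber());
  }
  suspend_to_scheduler(); // resumes once some thread extracts us and switches to us
}

Fiber* take_parked_fiber() { // may run on thread-2
  std::lock_guard<std::mutex> lock(pool_mutex);
  if (fiber_pool.empty())
    return nullptr;
  Fiber* f = fiber_pool.front();
  fiber_pool.pop_front();
  return f; // switching to `f` on thread-2 migrates the fiber
}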

Generally, in an attempt to circumvent compiler-optimization issues and other caveats caused by thread switching, we represent thread_local variables in the form of a reference-returning function:

__attribute__((__noinline__)) int& Variable()
{
  static thread_local int data = {};
  asm volatile(""); // empty volatile asm: an optimization barrier intended to
                    // keep this function opaque to the optimizer
  return data;
}

Unfortunately, with newer clang versions and LTO enabled, we observe calls to this function being optimized under one-thread-per-function assumptions, which obviously causes data races when the assumptions don’t hold.
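
Roughly what we observe (a sketch; yield() stands for our context switch):

int& Variable();     // the function above
extern void yield(); // suspends the fiber; it may resume on another thread

void update() {
  int& a = Variable(); // resolved on thread-1
  yield();             // the fiber migrates to, say, thread-2
  Variable() = 1;      // with LTO this call can be collapsed into `a`,
                       // writing into thread-1's `data` from thread-2
  (void)a;
}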

I don’t really know how hard it would be to implement, but I think an attribute that asks the compiler and linker not to do anything at all with the marked function would be a satisfactory solution (even if it leads to some silly scenarios, such as 10 calls to Variable right next to each other not being collapsed into a single one). We would be able to enjoy the benefits of enabling LTO again, and clang developers wouldn’t have to give up the one-thread-per-function assumption in their optimization logic.

Without endorsing that hack, I think it ought to fix it if you use __attribute__((noinline,optnone)) instead of only noinline.
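
That is, a sketch of the suggested variant:

__attribute__((noinline, optnone)) int& Variable()
{
  // optnone is intended to keep the optimizer from analyzing or rewriting
  // the body, so interprocedural passes can no longer fold call sites
  static thread_local int data = {};
  asm volatile("");
  return data;
}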