Use Case
At Meta, we heavily use C++ coroutines, and we’ve hit multiple production incidents where a service deadlocks because a coroutine suspended while holding a lock. Example scenario:
- A thread in some executor is executing a coroutine, and acquires a mutex.
- Before the mutex is released, the coroutine suspends (e.g. by co_await-ing another coroutine).
- This can result in deadlocks where all the threads in the executor get blocked waiting on the mutex, and the suspended coroutine can never release the mutex because there are no threads in the executor to resume the coroutine.
I think it would be helpful to add a new check to ThreadSanitizer to help catch these kinds of scenarios.
Proof of Concept
To demonstrate what a check might look like, I implemented a proof-of-concept of this check using the folly coroutines library, and some local changes to ThreadSanitizer. The proof-of-concept requires these local changes:
- Changes to ThreadSanitizer to expose a function to check if locks are held by the current thread
- Changes to the folly coro library to call the above function when suspending a coroutine
This is an example program demonstrating the bad behavior using the folly coroutines library (coroutine suspending while holding a lock):
#include <mutex>
#include <folly/experimental/coro/BlockingWait.h>
#include <folly/experimental/coro/Task.h>
#include <folly/init/Init.h>
#include <glog/logging.h>
std::mutex m;
folly::coro::Task<void> a2() {
LOG(INFO) << __FUNCTION__;
co_return;
}
folly::coro::Task<void> a1() {
LOG(INFO) << __FUNCTION__;
std::unique_lock<std::mutex> g(m);
co_await a2(); // Potential deadlock! TSAN crashes here!
}
int main(int argc, char** argv) {
folly::init(&argc, &argv);
folly::coro::blockingWait(a1());
}
This is an example of the program correctly failing as a result of this check:
FATAL: ThreadSanitizer CHECK failed: llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:1060 "((0)) == ((thr->mset.Size()))" (0x0, 0x1)
#0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (main+0x...)
#1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (main+0x...)
#2 __tsan::ThreadStateCheckNoLocks(__tsan::ThreadState*) <null> (main+0x...)
#3 folly::coro::ensureNoLocks() folly/folly/experimental/coro/Task.h:73 (main+0x...)
#4 auto folly::coro::Task<void>::Awaiter::await_suspend<folly::coro::detail::TaskPromise<void> >(std::__n4861::coroutine_handle<folly::coro::detail::TaskPromise<void> >) fbcode/folly/experimental/coro/Task.h:805 (main+0x...)
#5 a1() (.resume) main.cpp:21 (main+0x...)
#6 std::__n4861::coroutine_handle<void>::resume() const libgcc/include/c++/trunk/coroutine:127 (libfolly_tracing_async_stack.so+0x4222)
#7 folly::resumeCoroutineWithNewAsyncStackRoot(std::__n4861::coroutine_handle<void>, folly::AsyncStackFrame&) fbcode/folly/tracing/AsyncStack.cpp:179 (libfolly_tracing_async_stack.so+0x...)
...
Proposed ThreadSanitizer Changes
For turning the proof-of-concept into real changes, I’m thinking about the following:
- We can add an optional new ThreadSanitizer check, something like
coroutine-suspended-while-holding-lock (potential deadlock)
- We can add new annotations that can be invoked from application code to mark when coroutines are suspended/resumed/destroyed. Something like:
void AnnotateCoroResumed(char *f, int l, uptr coro);
void AnnotateCoroSuspended(char *f, int l, uptr coro);
void AnnotateCoroDestroyed(char *f, int l, uptr coro);
- Applications or libraries (e.g. folly coro) can annotate their coroutine implementations with these new annotations to allow ThreadSanitizer to correctly detect these bugs.
- In the ThreadSanitizer runtime: whenever a coroutine is suspended, the runtime will check that no locks are held. If there are locks held, then it will report the stack trace where the coroutine was suspended, as well as where the mutexes were acquired.
- Separately, I see that there is also fiber support in ThreadSanitizer, but I’m not sure how much of that is reusable for this kind of check.
What do folks think? If folks agree this is a good idea, I can help implement this check. I have not contributed to llvm open source before, and would appreciate help and guidance on implementing this change!