Add ThreadSanitizer check to prevent coroutine suspending while holding a lock (potential deadlock)?

Use Case

At Meta, we heavily use C++ coroutines, and we’ve hit multiple production incidents where a service deadlocks because a coroutine suspends while holding a lock. An example scenario:

  • A thread in some executor is executing a coroutine, and acquires a mutex.
  • Before the mutex is released, the coroutine suspends (e.g. by co_await-ing another coroutine).
  • This can result in a deadlock: all the threads in the executor end up blocked waiting on the mutex, and the suspended coroutine can never release it, because no executor thread is free to resume the coroutine (a sketch follows below).
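
To make the scenario concrete, here is a minimal sketch of the failure mode. It assumes folly’s CPUThreadPoolExecutor and the co_reschedule_on_current_executor awaitable; the function names are mine, and whether it deadlocks on a given run depends on scheduling, so treat it as an illustration rather than a deterministic reproducer:

#include <mutex>

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/experimental/coro/BlockingWait.h>
#include <folly/experimental/coro/Collect.h>
#include <folly/experimental/coro/CurrentExecutor.h>
#include <folly/experimental/coro/Task.h>

std::mutex m;

folly::coro::Task<void> holdsLockAcrossSuspend() {
  std::unique_lock<std::mutex> g(m);
  // Suspends while holding m; the continuation is queued back onto the
  // executor and needs a free executor thread to run.
  co_await folly::coro::co_reschedule_on_current_executor;
}

folly::coro::Task<void> blocksOnLock() {
  // If this runs while m is still held, it blocks the executor's only
  // thread, so the continuation above can never run and release m.
  std::unique_lock<std::mutex> g(m);
  co_return;
}

int main() {
  folly::CPUThreadPoolExecutor executor{1};  // single worker thread
  folly::coro::blockingWait(
      folly::coro::collectAll(holdsLockAcrossSuspend(), blocksOnLock())
          .scheduleOn(&executor));
}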

I think it would be helpful to add a new check to ThreadSanitizer to help catch these kinds of scenarios.

Proof of Concept

To demonstrate what a check might look like, I implemented a proof of concept of this check using the folly coroutines library and some local changes to ThreadSanitizer. The proof of concept requires these local changes:

This is an example program demonstrating the bad behavior (a coroutine suspending while holding a lock) using the folly coroutines library:

#include <mutex>

#include <folly/experimental/coro/BlockingWait.h>
#include <folly/experimental/coro/Task.h>
#include <folly/init/Init.h>
#include <glog/logging.h>

std::mutex m;

folly::coro::Task<void> a2() {
  LOG(INFO) << __FUNCTION__;
  co_return;
}

folly::coro::Task<void> a1() {
  LOG(INFO) << __FUNCTION__;
  std::unique_lock<std::mutex> g(m);
  co_await a2();  // Potential deadlock! TSAN crashes here!
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::coro::blockingWait(a1());
}

This is an example of the program correctly failing as a result of this check:

FATAL: ThreadSanitizer CHECK failed: llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:1060 "((0)) == ((thr->mset.Size()))" (0x0, 0x1)
    #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (main+0x...)
    #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (main+0x...)
    #2 __tsan::ThreadStateCheckNoLocks(__tsan::ThreadState*) <null> (main+0x...)
    #3 folly::coro::ensureNoLocks() folly/folly/experimental/coro/Task.h:73 (main+0x...)
    #4 auto folly::coro::Task<void>::Awaiter::await_suspend<folly::coro::detail::TaskPromise<void> >(std::__n4861::coroutine_handle<folly::coro::detail::TaskPromise<void> >) fbcode/folly/experimental/coro/Task.h:805 (main+0x...)
    #5 a1() (.resume) main.cpp:21 (main+0x...)
    #6 std::__n4861::coroutine_handle<void>::resume() const libgcc/include/c++/trunk/coroutine:127 (libfolly_tracing_async_stack.so+0x4222)
    #7 folly::resumeCoroutineWithNewAsyncStackRoot(std::__n4861::coroutine_handle<void>, folly::AsyncStackFrame&) fbcode/folly/tracing/AsyncStack.cpp:179 (libfolly_tracing_async_stack.so+0x...)
    ...
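
For reference, the proof-of-concept check behind this failure is roughly the following, reconstructed from the CHECK message and frame #2 above (thr->mset is tsan’s per-thread set of held mutexes):

void ThreadStateCheckNoLocks(ThreadState *thr) {
  // Proof of concept only: abort if the current thread holds any mutexes.
  // A real implementation should print a regular tsan report instead.
  CHECK_EQ(0, thr->mset.Size());
}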

Proposed ThreadSanitizer Changes

To turn the proof of concept into real changes, I’m thinking about the following:

  • We can add an optional new ThreadSanitizer check, something like coroutine-suspended-while-holding-lock (potential deadlock)
  • We can add new annotations that can be invoked from application code to mark when coroutines are suspended/resumed/destroyed. Something like:
void AnnotateCoroResumed(char *f, int l, uptr coro);
void AnnotateCoroSuspended(char *f, int l, uptr coro);
void AnnotateCoroDestroyed(char *f, int l, uptr coro);
  • Applications or libraries (e.g. folly coro) can annotate their coroutine implementations with these new annotations to allow ThreadSanitizer to correctly detect these bugs (see the sketch after this list).
  • In the ThreadSanitizer runtime: whenever a coroutine is suspended, the runtime will check that no locks are held. If there are locks held, then it will report the stack trace where the coroutine was suspended, as well as where the mutexes were acquired.
  • Separately, I see that there is also fiber support in ThreadSanitizer, but I’m not sure how much of that is reusable for this kind of check.
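
As a sketch of the library side, an awaiter could invoke the proposed suspend annotation just before handing off the continuation. The Awaiter below is a simplified stand-in for folly’s real Task awaiter, and the annotation is the one proposed above, not an existing TSan API (const is added to the first parameter so the call with __FILE__ compiles):

#include <coroutine>
#include <cstdint>

// Proposed annotation; under this proposal its definition would live in the
// tsan runtime.
extern "C" void AnnotateCoroSuspended(const char *f, int l,
                                      std::uintptr_t coro);

struct Awaiter {
  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<> h) noexcept {
    // Tell tsan this coroutine is about to suspend, so the runtime can
    // verify that the current thread holds no mutexes at this point.
    AnnotateCoroSuspended(__FILE__, __LINE__,
                          reinterpret_cast<std::uintptr_t>(h.address()));
    // ... enqueue the continuation on the executor ...
  }
  void await_resume() noexcept {}
};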

What do folks think? If there’s agreement that this is a good idea, I can help implement this check. I have not contributed to LLVM open source before, and I would appreciate help and guidance on implementing this change!

This motivation is pretty interesting from the user’s perspective! (Though I don’t have experience implementing sanitizers either.)

One concern about the design is that it looks too coarse-grained. IIUC, an error will be reported whenever a coroutine suspends while holding a lock. I wonder if that is too aggressive… But maybe this comes down to TSan’s design principles. My impression is that TSan is more tolerant of false-positive diagnostic messages (although this is the reason some people don’t enable TSan).

And out of curiosity: there are already plenty of studies on deadlock detection for goroutines in Go. What do you think about reusing or mimicking their strategies? I know goroutines are stackful coroutines and C++20 coroutines are stackless, but from the user’s point of view there are similar points.

Hi Kenny,

In the ThreadSanitizer runtime: whenever a coroutine is suspended, the runtime will check that no locks are held.

What will AnnotateCoroResumed/AnnotateCoroDestroyed do? Also report if any mutexes are held?

Strictly speaking, this is not a bug, right? I mean a coroutine can have mutexes locked when switching. It’s just up to the code to arrange that no bad things happen, right?

If all annotations do the same thing, and it’s not a bug in all cases, I would prefer a single __tsan_check_no_mutexes_held function.
Such a function can be reused in other contexts as well, e.g. at thread exit, or in performance-critical parts of the code.

The function needs to be added to include/sanitizer/tsan_interface.h and also needs a test in test/tsan/
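
Something like the following, as a sketch; the exact declaration style should follow the existing annotations in tsan_interface.h:

// In include/sanitizer/tsan_interface.h:
#ifdef __cplusplus
extern "C" {
#endif
// Checks that the current thread does not hold any mutexes,
// reporting an error otherwise.
void __tsan_check_no_mutexes_held(void);
#ifdef __cplusplus
}
#endif

And a test in test/tsan/ roughly along these lines (RUN/CHECK lines omitted; they would follow the conventions of the existing tests in that directory):

#include <pthread.h>
#include <sanitizer/tsan_interface.h>

int main() {
  pthread_mutex_t m;
  pthread_mutex_init(&m, 0);
  pthread_mutex_lock(&m);
  __tsan_check_no_mutexes_held();  // expected to produce a report here
  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  return 0;
}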

The (char *f, int l, uptr coro) arguments are not needed; tsan can unwind the current stack.

If you agree with these suggestions, please send a patch.

TSan has zero tolerance for false-positive race reports.
There are some false-positive deadlock reports, but most of those are bugs/unimplemented logic. It’s not supposed to be that way; it’s just that nobody has had time to finish the deadlock detector.

Strictly speaking, this is not a bug, right? I mean a coroutine can have mutexes locked when switching. It’s just up to the code to arrange that no bad things happen, right?

Yeah, this is my understanding too. A coroutine holding a mutex across a suspension point is not strictly a bug.

TSan has zero tolerance for false-positive race reports.

And this is where I get confused. If it is not a bug, and TSan emits an error message, isn’t that a false-positive diagnostic?

If we add just __tsan_check_no_mutexes_held, then it’s up to the user; TSan itself will never produce such warnings on its own.

Consider: asan also has an annotation to mark a memory range as poisoned. If a user uses it incorrectly and poisons something that is going to be used, then asan will also produce a “false positive”. But I wouldn’t call it a false positive.
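
For reference, that annotation is ASAN_POISON_MEMORY_REGION from <sanitizer/asan_interface.h>; a deliberately wrong use would look like the sketch below, and the resulting report is correct by the annotation’s contract:

#include <cstdlib>
#include <sanitizer/asan_interface.h>

int main() {
  char *buf = static_cast<char *>(malloc(64));
  // Mark the region as off-limits for the program.
  ASAN_POISON_MEMORY_REGION(buf, 64);
  // buf[0] = 1;  // would now be reported as a use-after-poison
  ASAN_UNPOISON_MEMORY_REGION(buf, 64);
  free(buf);
  return 0;
}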

What will AnnotateCoroResumed/AnnotateCoroDestroyed do? Also report if any mutexes are held?

I was thinking that these would do nothing for now, but they could be reused in the future if we wanted more coroutine support. The minimum needed for my proposal is only AnnotateCoroSuspended. I think your suggestion is simpler, and I can work on that instead.

Strictly speaking, this is not a bug, right? I mean a coroutine can have mutexes locked when switching. It’s just up to the code to arrange that no bad things happen, right?

Yes, that’s true: it’s not a bug in all cases, but often it is. This is similar to the lock-order-inversion check: it’s not a bug in all cases, but it frequently indicates a real potential deadlock.

If all annotations do the same thing, and it’s not a bug in all cases, I would prefer a single __tsan_check_no_mutexes_held function.

I think this approach can work. Thanks for the suggestion! I’ll work on the change.

The (char *f, int l, uptr coro) arguments are not needed; tsan can unwind the current stack.

I’m not sure I understand this part: what do you mean by “tsan can unwind the current stack”? In my proof-of-concept changes to tsan, I used a simple CHECK that aborts the program if any mutexes are held, but that might not be what we want in the real change. What do you recommend instead of a CHECK?

Print a bug report, like tsan does for all other bug types (see e.g. ReportRace).

But the (char *f, int l, uptr coro) annotation arguments won’t be needed for that.
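
For illustration, a report-based version might look roughly like the following, modeled on existing report paths such as ReportMutexMisuse; the report type name, the function name, and the exact runtime plumbing here are hypothetical:

// Sketch only: emit a regular tsan report instead of aborting via CHECK.
void ReportMutexesHeldOnSuspend(ThreadState *thr, uptr pc) {
  ThreadRegistryLock l(&ctx->thread_registry);
  ScopedReport rep(ReportTypeMutexHeldWrongContext);  // hypothetical new type
  VarSizeStackTrace trace;
  ObtainCurrentStack(thr, pc, &trace);
  rep.AddStack(trace, true);
  // Also attach each mutex in thr->mset so the report shows where the
  // held locks were acquired.
  OutputReport(thr, rep);
}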

Thanks for the guidance! I created a pull request here: