Add ThreadSanitizer check to prevent coroutine suspending while holding a lock (potential deadlock)?

Use Case

At Meta, we heavily use C++ coroutines, and we’ve hit multiple production incidents where a service deadlocks because a coroutine suspends while holding a lock. An example scenario:

  • A thread in some executor is executing a coroutine, and acquires a mutex.
  • Before the mutex is released, the coroutine suspends (e.g. by co_await-ing another coroutine).
  • This can result in a deadlock: all the threads in the executor end up blocked waiting on the mutex, and the suspended coroutine can never release it, because no executor thread is free to resume the coroutine (a sketch follows below).
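
To make the scenario concrete, here is a minimal sketch of the failure mode. It assumes folly’s CPUThreadPoolExecutor and the co_reschedule_on_current_executor awaitable; the function names are mine, and whether it deadlocks on a given run depends on scheduling, so treat it as an illustration rather than a deterministic reproducer:

#include <mutex>

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/experimental/coro/BlockingWait.h>
#include <folly/experimental/coro/Collect.h>
#include <folly/experimental/coro/CurrentExecutor.h>
#include <folly/experimental/coro/Task.h>

std::mutex m;

folly::coro::Task<void> holdsLockAcrossSuspend() {
  std::unique_lock<std::mutex> g(m);
  // Suspends while holding m; the continuation is queued back onto the
  // executor and needs a free executor thread to run.
  co_await folly::coro::co_reschedule_on_current_executor;
}

folly::coro::Task<void> blocksOnLock() {
  // If this runs while m is still held, it blocks the executor's only
  // thread, so the continuation above can never run and release m.
  std::unique_lock<std::mutex> g(m);
  co_return;
}

int main() {
  folly::CPUThreadPoolExecutor executor{1};  // single worker thread
  folly::coro::blockingWait(
      folly::coro::collectAll(holdsLockAcrossSuspend(), blocksOnLock())
          .scheduleOn(&executor));
}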

I think it would be helpful to add a new check to ThreadSanitizer to help catch these kinds of scenarios.

Proof of Concept

To demonstrate what a check might look like, I implemented a proof of concept of this check using the folly coroutines library and some local changes to ThreadSanitizer. The proof of concept requires these local changes:

This is an example program demonstrating the bad behavior (a coroutine suspending while holding a lock) using the folly coroutines library:

#include <mutex>

#include <folly/experimental/coro/BlockingWait.h>
#include <folly/experimental/coro/Task.h>
#include <folly/init/Init.h>
#include <glog/logging.h>

std::mutex m;

folly::coro::Task<void> a2() {
  LOG(INFO) << __FUNCTION__;
  co_return;
}

folly::coro::Task<void> a1() {
  LOG(INFO) << __FUNCTION__;
  std::unique_lock<std::mutex> g(m);
  co_await a2();  // Potential deadlock! TSAN crashes here!
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::coro::blockingWait(a1());
}

This is an example of the program correctly failing as a result of this check:

FATAL: ThreadSanitizer CHECK failed: llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:1060 "((0)) == ((thr->mset.Size()))" (0x0, 0x1)
    #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (main+0x...)
    #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (main+0x...)
    #2 __tsan::ThreadStateCheckNoLocks(__tsan::ThreadState*) <null> (main+0x...)
    #3 folly::coro::ensureNoLocks() folly/folly/experimental/coro/Task.h:73 (main+0x...)
    #4 auto folly::coro::Task<void>::Awaiter::await_suspend<folly::coro::detail::TaskPromise<void> >(std::__n4861::coroutine_handle<folly::coro::detail::TaskPromise<void> >) fbcode/folly/experimental/coro/Task.h:805 (main+0x...)
    #5 a1() (.resume) main.cpp:21 (main+0x...)
    #6 std::__n4861::coroutine_handle<void>::resume() const libgcc/include/c++/trunk/coroutine:127 (libfolly_tracing_async_stack.so+0x4222)
    #7 folly::resumeCoroutineWithNewAsyncStackRoot(std::__n4861::coroutine_handle<void>, folly::AsyncStackFrame&) fbcode/folly/tracing/AsyncStack.cpp:179 (libfolly_tracing_async_stack.so+0x...)
    ...
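
For reference, the proof-of-concept check behind this failure is roughly the following, reconstructed from the CHECK message and frame #2 above (thr->mset is tsan’s per-thread set of held mutexes):

void ThreadStateCheckNoLocks(ThreadState *thr) {
  // Proof of concept only: abort if the current thread holds any mutexes.
  // A real implementation should print a regular tsan report instead.
  CHECK_EQ(0, thr->mset.Size());
}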

Proposed ThreadSanitizer Changes

To turn the proof of concept into real changes, I’m thinking about the following:

  • We can add an optional new ThreadSanitizer check, something like coroutine-suspended-while-holding-lock (potential deadlock)
  • We can add new annotations that can be invoked from application code to mark when coroutines are suspended/resumed/destroyed. Something like:
void AnnotateCoroResumed(char *f, int l, uptr coro);
void AnnotateCoroSuspended(char *f, int l, uptr coro);
void AnnotateCoroDestroyed(char *f, int l, uptr coro);
  • Applications or libraries (e.g. folly coro) can annotate their coroutine implementations with these new annotations to allow ThreadSanitizer to correctly detect these bugs (see the sketch after this list).
  • In the ThreadSanitizer runtime: whenever a coroutine is suspended, the runtime will check that no locks are held. If there are locks held, then it will report the stack trace where the coroutine was suspended, as well as where the mutexes were acquired.
  • Separately, I see that there is also fiber support in ThreadSanitizer, but I’m not sure how much of that is reusable for this kind of check.
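
As a sketch of the library side, an awaiter could invoke the proposed suspend annotation just before handing off the continuation. The Awaiter below is a simplified stand-in for folly’s real Task awaiter, and the annotation is the one proposed above, not an existing TSan API (const is added to the first parameter so the call with __FILE__ compiles):

#include <coroutine>
#include <cstdint>

// Proposed annotation; under this proposal its definition would live in the
// tsan runtime.
extern "C" void AnnotateCoroSuspended(const char *f, int l,
                                      std::uintptr_t coro);

struct Awaiter {
  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<> h) noexcept {
    // Tell tsan this coroutine is about to suspend, so the runtime can
    // verify that the current thread holds no mutexes at this point.
    AnnotateCoroSuspended(__FILE__, __LINE__,
                          reinterpret_cast<std::uintptr_t>(h.address()));
    // ... enqueue the continuation on the executor ...
  }
  void await_resume() noexcept {}
};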

What do folks think? If there’s agreement that this is a good idea, I can help implement this check. I have not contributed to LLVM open source before, and I would appreciate help and guidance on implementing this change!

This motivation is pretty interesting from the user’s perspective! (Though I don’t have experience implementing sanitizers either.)

One concern about the design is that it looks too coarse-grained. IIUC, an error will be reported whenever a coroutine suspends while holding a lock. I wonder if that is too aggressive… But maybe this comes down to TSan’s design principles. My impression is that TSan is more tolerant of false-positive diagnostic messages (although this is the reason some people don’t enable TSan).

And out of curiosity: there are already plenty of studies on deadlock detection for goroutines in Go. What do you think about reusing or mimicking their strategies? I know goroutines are stackful coroutines and C++20 coroutines are stackless, but from the user’s point of view there are similar points.

Hi Kenny,

In the ThreadSanitizer runtime: whenever a coroutine is suspended, the runtime will check that no locks are held.

What will AnnotateCoroResumed/AnnotateCoroDestroyed do? Also report if any mutexes are held?

Strictly speaking, this is not a bug, right? I mean a coroutine can have mutexes locked when switching. It’s just up to the code to arrange that no bad things happen, right?

If all annotations do the same thing, and it’s not a bug in all cases, I would prefer a single __tsan_check_no_mutexes_held function.
Such a function can be reused in other contexts as well, e.g. at thread exit, or in performance-critical parts of the code.

The function needs to be added to include/sanitizer/tsan_interface.h and also needs a test in test/tsan/
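
Something like the following, as a sketch; the exact declaration style should follow the existing annotations in tsan_interface.h:

// In include/sanitizer/tsan_interface.h:
#ifdef __cplusplus
extern "C" {
#endif
// Checks that the current thread does not hold any mutexes,
// reporting an error otherwise.
void __tsan_check_no_mutexes_held(void);
#ifdef __cplusplus
}
#endif

And a test in test/tsan/ roughly along these lines (RUN/CHECK lines omitted; they would follow the conventions of the existing tests in that directory):

#include <pthread.h>
#include <sanitizer/tsan_interface.h>

int main() {
  pthread_mutex_t m;
  pthread_mutex_init(&m, 0);
  pthread_mutex_lock(&m);
  __tsan_check_no_mutexes_held();  // expected to produce a report here
  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  return 0;
}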

The (char *f, int l, uptr coro) arguments are not needed; tsan can unwind the current stack.

If you agree with these suggestions, please send a patch.

TSan has zero tolerance for false-positive race reports.
There are some false-positive deadlock reports, but most of those are bugs/unimplemented logic. It’s not supposed to be that way; it’s just that nobody has had time to finish the deadlock detector.

Strictly speaking, this is not a bug, right? I mean a coroutine can have mutexes locked when switching. It’s just up to the code to arrange that no bad things happen, right?

Yeah, this is my understanding too. A coroutine holding a mutex across a suspension point is not strictly a bug.

TSan has zero tolerance for false-positive race reports.

And this is where I get confused. If it is not a bug, and TSan emits an error message, isn’t that a false-positive diagnostic?

If we add just __tsan_check_no_mutexes_held, then it’s up to the user; TSan itself will never produce such warnings on its own.

Consider: asan also has an annotation to mark a memory range as poisoned. If a user uses it incorrectly and poisons something that is going to be used, then asan will also produce a “false positive”. But I wouldn’t call it a false positive.
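
For reference, that annotation is ASAN_POISON_MEMORY_REGION from <sanitizer/asan_interface.h>; a deliberately wrong use would look like the sketch below, and the resulting report is correct by the annotation’s contract:

#include <cstdlib>
#include <sanitizer/asan_interface.h>

int main() {
  char *buf = static_cast<char *>(malloc(64));
  // Mark the region as off-limits for the program.
  ASAN_POISON_MEMORY_REGION(buf, 64);
  // buf[0] = 1;  // would now be reported as a use-after-poison
  ASAN_UNPOISON_MEMORY_REGION(buf, 64);
  free(buf);
  return 0;
}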

What will AnnotateCoroResumed/AnnotateCoroDestroyed do? Also report if any mutexes are held?

I was thinking that these would do nothing for now, but they could be reused in the future if we wanted more coroutine support. The minimum needed for my proposal is only AnnotateCoroSuspended. I think your suggestion is simpler, and I can work on that instead.

Strictly speaking, this is not a bug, right? I mean a coroutine can have mutexes locked when switching. It’s just up to the code to arrange that no bad things happen, right?

Yes, that’s true: it’s not a bug in all cases, but often it is. This is similar to the lock-order-inversion check: it’s not a bug in all cases, but it frequently indicates a real potential deadlock.

If all annotations do the same thing, and it’s not a bug in all cases, I would prefer a single __tsan_check_no_mutexes_held function.

I think this approach can work. Thanks for the suggestion! I’ll work on the change.

The (char *f, int l, uptr coro) arguments are not needed; tsan can unwind the current stack.

I’m not sure I understand this part: what do you mean by “tsan can unwind the current stack”? In my proof-of-concept changes to tsan, I used a simple CHECK that aborts the program if any mutexes are held, but that might not be what we want in the real change. What do you recommend instead of a CHECK?

Print a bug report, like tsan does for all other bug types (see e.g. ReportRace).

But the (char *f, int l, uptr coro) annotation arguments won’t be needed for that.
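
For illustration, a report-based version might look roughly like the following, modeled on existing report paths such as ReportMutexMisuse; the report type name, the function name, and the exact runtime plumbing here are hypothetical:

// Sketch only: emit a regular tsan report instead of aborting via CHECK.
void ReportMutexesHeldOnSuspend(ThreadState *thr, uptr pc) {
  ThreadRegistryLock l(&ctx->thread_registry);
  ScopedReport rep(ReportTypeMutexHeldWrongContext);  // hypothetical new type
  VarSizeStackTrace trace;
  ObtainCurrentStack(thr, pc, &trace);
  rep.AddStack(trace, true);
  // Also attach each mutex in thr->mset so the report shows where the
  // held locks were acquired.
  OutputReport(thr, rep);
}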

Thanks for the guidance! I created a pull request here: