Sampling-based performance measurement of LLVM OpenMP runtime leads to deadlock!

While using a sampling-based profiler (HPCToolkit) to measure the performance of an application using a dynamically-linked version of the LLVM OpenMP runtime, I encountered a deadlock on x86_64. Although I haven’t considered other architectures in detail, I believe that they may be similarly affected.

Here’s what I believe I have observed: there is a subtle race condition between TLS setup for an OpenMP runtime and a profiler that inspects it through the OMPT interface.

A thread executing code in __kmp_launch_worker in the context of the LLVM OpenMP runtime library acquired the lock controlling access to TLS state (__tls_get_addr calls tls_get_addr_tail calls pthread_mutex_lock) to set up the TLS needed for access to its thread-local variable __kmp_gtid in frame 24 of the call stack shown below. Immediately after acquiring the TLS lock by setting its __lock field with a CMPXCHG, but before recording the lock owner or finishing TLS setup, the thread was interrupted by our profiler. As a normal part of recording a sample, our profiler uses the OMPT tools API to check whether the thread is an OpenMP thread by inspecting the thread id maintained by the OpenMP runtime. A call to a runtime entry point through the OMPT API led to an access to __kmp_gtid in frame 5 of the call stack. Because TLS had still not been set up for the OpenMP runtime shared library on this thread, the access to __kmp_gtid went through the same protocol as before (__tls_get_addr calls tls_get_addr_tail calls pthread_mutex_lock). However, the lock had already been acquired in frame 21, so it was unavailable for acquisition in frames 0-2, causing deadlock. The TLS lock is implemented as a recursive lock, but the profiler interrupted the lock acquisition in libpthread before the owner field of the recursive lock was set, so the inner call to pthread_mutex_lock can’t succeed.
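
To make the window concrete, here is a simplified, single-threaded model of the sequence described above, written in C. It is only a sketch: the type model_rlock, its field names, and the three functions are made up for illustration and are not glibc's actual mutex layout or code. The model assumes only what the report describes, namely that the lock word is claimed first and the owner is recorded afterwards.

```c
#include <stdbool.h>

/* Illustrative model of a recursive lock. Field names are hypothetical
   and do not mirror glibc's pthread_mutex_t layout. */
typedef struct {
  int locked;  /* lock word, claimed first */
  long owner;  /* owning thread id, recorded second; 0 means none */
  int depth;   /* recursion depth */
} model_rlock;

/* Step 1 of acquisition: claim the lock word (stands in for the CMPXCHG). */
static bool model_claim(model_rlock *m) {
  if (m->locked) return false;
  m->locked = 1;
  return true;
}

/* Step 2 of acquisition: record which thread owns the lock. */
static void model_record_owner(model_rlock *m, long self) {
  m->owner = self;
  m->depth = 1;
}

/* A non-blocking acquisition attempt, as the interrupted thread's signal
   handler would make it. Returns false where a real lock would block. */
static bool model_try_acquire(model_rlock *m, long self) {
  if (m->locked && m->owner == self) {  /* recursive re-entry by the owner */
    m->depth++;
    return true;
  }
  if (!model_claim(m)) return false;    /* held, and not known to be ours */
  model_record_owner(m, self);
  return true;
}
```

If model_claim has run but model_record_owner has not, model_try_acquire from the same thread returns false: the lock looks held by someone else. In the real runtime that attempt blocks instead of returning, and since the only thread that could release the lock is the one now waiting in its signal handler, it never makes progress.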

It is a serious problem if a profiler using the OMPT interface can cause a deadlock.

We need a design of the OMPT interface and OpenMP runtime implementations that makes this impossible.

After thinking about this for a while, I think that a profiler can arrange to receive the ompt_callback_thread_begin callback and then set a thread-local flag in its own TLS variables to note that a thread is an OpenMP thread. A profiler must not invoke any OMPT runtime entry point on a thread that has not announced itself as an OpenMP thread by previously invoking the ompt_callback_thread_begin callback. An OpenMP runtime should ensure that its TLS is allocated before invoking the ompt_callback_thread_begin callback. Similarly, a profiler shouldn’t invoke an OMPT runtime entry point on a thread after receiving ompt_callback_thread_end.

If a profiler doesn’t use the OMPT interface to inspect a thread that hasn’t announced itself as an OpenMP thread, it won’t access any TLS state that the OpenMP library may maintain.
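
The profiler side of this proposal could look roughly like the following C sketch. All names in it (profiler_is_openmp_thread, profiler_note_thread_begin, profiler_note_thread_end, profiler_ompt_query_fn, profiler_query_runtime_if_safe, profiler_count_query) are hypothetical and not taken from HPCToolkit or from the OMPT specification. To stay self-contained it does not include the OMPT header; the query function pointer stands in for whatever OMPT runtime entry point the profiler would call while recording a sample.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flag kept in the profiler's own TLS. sig_atomic_t lets the
   sample handler read it without tearing. */
static _Thread_local volatile sig_atomic_t profiler_is_openmp_thread = 0;

/* Intended to be called from the profiler's handler for
   ompt_callback_thread_begin. */
static void profiler_note_thread_begin(void) {
  profiler_is_openmp_thread = 1;
}

/* Intended to be called from the profiler's handler for
   ompt_callback_thread_end. */
static void profiler_note_thread_end(void) {
  profiler_is_openmp_thread = 0;
}

/* Hypothetical wrapper type around an OMPT runtime entry point query. */
typedef void (*profiler_ompt_query_fn)(void *sample);

/* Called while recording a sample. Touches the OpenMP runtime only if this
   thread has announced itself. Returns true if the runtime was queried. */
static bool profiler_query_runtime_if_safe(profiler_ompt_query_fn query,
                                           void *sample) {
  if (!profiler_is_openmp_thread) return false;  /* never enter the runtime */
  query(sample);
  return true;
}

/* Stand-in query used only to exercise the guard: counts invocations. */
static void profiler_count_query(void *sample) {
  ++*(int *)sample;
}
```

With this guard, a sample that arrives before thread begin or after thread end is recorded without calling into the OpenMP runtime at all. The sketch assumes that the profiler's own TLS flag is already allocated when a sample arrives; otherwise the profiler's own TLS access could run into the same lock.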

Does anyone care to comment or offer a vision of a different solution?

Below my signature block are some details of the thread state that I observed, in case you want to validate my assessment of the situation.