Multi-threading and mutexes in LLVM

+chandlerc, aaronballman, in case there are additional carryovers and/or issues from the review thread which I’ve left out.

I have a patch up for review[1, 2] that attempts to replace LLVM’s mutex implementation with std::mutex and std::recursive_mutex. While the patch seems to work, there are questions surrounding whether or not the approach used is correct.

I’ll try to summarize the issues as best I can, in hopes of getting some feedback from a broader audience, to make sure this is done correctly:

  1. Should support for multi-threading be a compile-time or runtime parameter in LLVM?

Currently it is both. It is compile-time through the use of the define LLVM_ENABLE_THREADS, and it is runtime through the use of functions llvm_start_multithreaded, llvm_is_multithreaded, etc. I and some others feel like runtime support for multi-threading could be removed, and it should be compile-time only. However, I am not aware of all the ways in which this is being used, so this is where I would like some feedback. The issues I have with runtime multithreading support are the following:

  • It leads to confusing code. At any given point, is multi-threading enabled or disabled? You never know without calling llvm_is_multithreaded, but even calling that is inherently racy, because someone else could disable it after it returns (see the sketch after this list).

  • It leads to subtle bugs. clang_createIndex, the first time it’s called, enables multi-threading. What happens if someone else disables it later? Things like this shouldn’t even be possible.

  • Not all platforms even support threading to begin with. This works now because llvm_start_multithreaded(), if the compile-time flag is set to disable threads, simply does nothing. But this decision should be made by someone else. Nobody really checks the return value from llvm_start_multithreaded anyway, so there are probably already bugs where someone tries to start multi-threading and it fails.

  • What does it actually mean to turn multi-threading support on and off?

Anybody who tries to do this is almost certainly broken due to some edge cases about when it's on and when it's off. So this goes back to the first two points about confusing code and subtle bugs.
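
To make the first bullet concrete, here is the check-then-act race spelled out against the current API (lockAndProcess/processUnlocked are hypothetical helpers, not real LLVM functions):

    // Hypothetical caller code showing the check-then-act race: the
    // answer from llvm_is_multithreaded() can be stale by the time it
    // is acted on.
    #include "llvm/Support/Threading.h"

    void lockAndProcess();    // imagine: takes a lock, then does the work
    void processUnlocked();   // imagine: does the work with no locking

    void doWork() {
      if (llvm::llvm_is_multithreaded()) { // observes "true"...
        // ...but another thread may call llvm_stop_multithreaded() here.
        lockAndProcess();
      } else {
        // Symmetrically, threading may be *enabled* after this check,
        // and shared state gets mutated with no lock at all.
        processUnlocked();
      }
    }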

  2. What should happen when you try to acquire a mutex in an app with threading disabled?

If this is a compile-time parameter, the solution is simple: make an empty mutex class that satisfies the Lockable concept, and have its methods do nothing. Then typedef something like llvm::mutex to be either std::mutex or llvm::null_mutex accordingly. If this is a runtime parameter, it's more complicated, and we should ask ourselves whether or not it's necessary to avoid the overhead of acquiring an uncontended mutex in single-threaded apps. For what it's worth, all reasonable STL implementations will use a lightweight mutex implementation, which generally requires less than 100 nanoseconds to acquire uncontended, so I think we don't need to care, but again some feedback would be nice.
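
To make the compile-time variant concrete, here is a minimal sketch of the null mutex idea (the names llvm::null_mutex and llvm::mutex are illustrative, not final):

    #include <mutex>

    namespace llvm {

    // Satisfies the Lockable concept but does nothing, so code like
    // std::lock_guard<llvm::mutex> compiles in both configurations.
    class null_mutex {
    public:
      void lock() {}
      void unlock() {}
      bool try_lock() { return true; }
    };

    #if LLVM_ENABLE_THREADS
    typedef std::mutex mutex;
    typedef std::recursive_mutex recursive_mutex;
    #else
    typedef null_mutex mutex;
    typedef null_mutex recursive_mutex;
    #endif

    } // namespace llvm

    // Usage is identical either way:
    //   llvm::mutex M;
    //   std::lock_guard<llvm::mutex> Guard(M);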

  3. We have some debug code in our mutex implementation that is intended to try to help catch deadlocks and/or race conditions. Do we need this code?

I think we can get by without it, but again some feedback about how, if at all, people are using this would be nice. For example, if you want to detect deadlocks and race conditions you can use TSan.
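
As an example, TSan readily flags even a toy race like the following (a minimal example; build with clang++ -fsanitize=thread -g race.cpp):

    // race.cpp -- a deliberate data race for ThreadSanitizer to report.
    #include <thread>

    int Counter = 0;  // unsynchronized shared state

    int main() {
      std::thread T([] { ++Counter; });  // racing write...
      ++Counter;                         // ...against this one
      T.join();
      return 0;
    }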

Thanks!
Zach

[1] - http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140602/220051.html
[2] - http://reviews.llvm.org/D4033

Zach, thanks for bringing the discussion to the full mailing list. While
some of this was hashed out in the review thread, I'll drop my personal
feelings here. Maybe it will encourage others to chime in...

> 1) Should support for multi-threading be a compile-time or runtime parameter
> in LLVM?
>
> Currently it is both. It is compile-time through the use of the
> define LLVM_ENABLE_THREADS, and it is runtime through the use of functions
> llvm_start_multithreaded, llvm_is_multithreaded, etc. I and some others
> feel like runtime support for multi-threading could be removed, and it
> should be compile-time only. However, I am not aware of all the ways in
> which this is being used, so this is where I would like some feedback. The
> issues I have with runtime multithreading support are the following:
>
> * It leads to confusing code. At any given point, is multi-threading
> enabled or disabled? You never know without calling llvm_is_multithreaded,
> but even calling that is inherently racy, because someone else could
> disable it after it returns.
>
> * It leads to subtle bugs. clang_createIndex, the first time it's called,
> enables multi-threading. What happens if someone else disables it later?
> Things like this shouldn't even be possible.
>
> * Not all platforms even support threading to begin with. This works now
> because llvm_start_multithreaded(), if the compile-time flag is set to
> disable threads, simply does nothing. But this decision should be made by
> someone else. Nobody really checks the return value from
> llvm_start_multithreaded anyway, so there are probably already bugs where
> someone tries to start multi-threading and it fails.
>
> * What does it actually mean to turn multi-threading support on and off?
>
> Anybody who tries to do this is almost certainly broken due to some edge
> cases about when it's on and when it's off. So this goes back to the first
> two points about confusing code and subtle bugs.

I share your concerns here. I think that it is a complete mistake to
support this at runtime.

Note, I'm not saying that we should penalize single threaded users of LLVM.
There are many legitimate uses of LLVM which reserve exactly one thread for
it and it seems reasonable for LLVM to try to make sure that use case
continues to work. I don't think removing the runtime configuration of this
would impact those users in any measurable way, or if it does I think those
are all bugs in how LLVM works and should be fixed irrespective of this
decision.

> 2) What should happen when you try to acquire a mutex in an app with
> threading disabled?
>
> If this is a compile-time parameter, the solution is simple: make an empty
> mutex class that satisfies the Lockable concept, and have its methods do
> nothing. Then typedef something like llvm::mutex to be either std::mutex
> or llvm::null_mutex accordingly. If this is a runtime parameter, it's more
> complicated, and we should ask ourselves whether or not it's necessary to
> avoid the overhead of acquiring an uncontended mutex in single-threaded
> apps. For what it's worth, all reasonable STL implementations will use a
> lightweight mutex implementation, which generally requires less than 100
> nanoseconds to acquire uncontended, so I think we don't need to care, but
> again some feedback would be nice.

I would really like to move LLVM to a point where it never has a mutex
acquisition in such a critical path that removing it for single-threaded
users matters. Any such use of a mutex is, IMO, a performance bug for
*both* the single-threaded users and the multithreaded users! We shouldn't
hack around this by making a synchronization construct that isn't really a
synchronization construct. That makes LLVM increasingly hard to develop for
very marginal gains.

Note that I think your 100 nanosecond estimate is really conservative.
There are many mutex implementations even faster. I use the numbers also
cited by Jeff Dean and Peter Norvig which estimate it at 25 nanoseconds:
http://norvig.com/21-days.html#answers
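
If anyone wants to check the number on their own machine, a rough (and admittedly naive) harness is enough to get the order of magnitude:

    // Naive micro-benchmark of an uncontended lock/unlock pair. A sketch
    // only: careful measurement would pin the thread, warm up first, and
    // guard against the compiler eliding work.
    #include <chrono>
    #include <cstdio>
    #include <mutex>

    int main() {
      std::mutex M;
      const int N = 10000000;
      auto Start = std::chrono::steady_clock::now();
      for (int I = 0; I < N; ++I) {
        M.lock();
        M.unlock();
      }
      std::chrono::duration<double, std::nano> Elapsed =
          std::chrono::steady_clock::now() - Start;
      std::printf("%.1f ns per lock/unlock pair\n", Elapsed.count() / N);
      return 0;
    }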

> 3) We have some debug code in our mutex implementation that is intended to
> try to help catch deadlocks and/or race conditions. Do we need this code?
>
> I think we can get by without it, but again some feedback about how, if at
> all, people are using this would be nice. For example, if you want to
> detect deadlocks and race conditions you can use TSan.

I think that we should add such debug code to the std::mutex implementation
in libc++, and then most developers should have easy access to it. I'll
talk to Kostya and the folks that work on TSan to see if they would be up
for contributing it.

> 1) Should support for multi-threading be a compile-time or runtime parameter
> in LLVM?
>
> Currently it is both. It is compile-time through the use of the
> define LLVM_ENABLE_THREADS, and it is runtime through the use of functions
> llvm_start_multithreaded, llvm_is_multithreaded, etc. I and some others
> feel like runtime support for multi-threading could be removed, and it
> should be compile-time only. However, I am not aware of all the ways in
> which this is being used, so this is where I would like some feedback. The
> issues I have with runtime multithreading support are the following:
>
> * It leads to confusing code. At any given point, is multi-threading
> enabled or disabled? You never know without calling llvm_is_multithreaded,
> but even calling that is inherently racy, because someone else could
> disable it after it returns.
>
> * It leads to subtle bugs. clang_createIndex, the first time it's called,
> enables multi-threading. What happens if someone else disables it later?
> Things like this shouldn't even be possible.
>
> * Not all platforms even support threading to begin with. This works now
> because llvm_start_multithreaded(), if the compile-time flag is set to
> disable threads, simply does nothing. But this decision should be made by
> someone else. Nobody really checks the return value from
> llvm_start_multithreaded anyway, so there are probably already bugs where
> someone tries to start multi-threading and it fails.
>
> * What does it actually mean to turn multi-threading support on and off?
>
> Anybody who tries to do this is almost certainly broken due to some edge
> cases about when it's on and when it's off. So this goes back to the first
> two points about confusing code and subtle bugs.

+1
I am in favor of removing the run-time check. We've seen lots of code like
this "if (have_threads) mu->lock();", and it was very often broken.
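
Spelled out, the usual failure mode of that pattern (illustrative code, not from LLVM):

    // Illustrative only: how "if (have_threads) lock" breaks. If
    // have_threads flips between the guarded lock() and the guarded
    // unlock(), we either mutate state unlocked or unlock a mutex we
    // never locked (undefined behavior).
    #include <mutex>

    bool have_threads = false;        // toggled at runtime elsewhere
    std::mutex mu;
    int shared_state = 0;

    void update() {
      if (have_threads) mu.lock();    // reads false: no lock taken...
      ++shared_state;                 // ...while another thread enables
                                      // threading and races on this.
      if (have_threads) mu.unlock();  // now true: unlocking an unheld mutex
    }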

> 3) We have some debug code in our mutex implementation that is intended to
> try to help catch deadlocks and/or race conditions. Do we need this code?
>
> I think we can get by without it, but again some feedback about how, if at
> all, people are using this would be nice. For example, if you want to
> detect deadlocks and race conditions you can use TSan.

The problem with race detection is that it slows down the process, but I
doubt that debug code in a mutex can catch (many) races.

As for deadlocks, it is indeed possible to add deadlock detection
directly to the std::mutex and std::spinlock code.
It may even end up being more efficient than a standalone deadlock detector.

I don’t see any reason not to reserve a word in the mutex so that in (an ABI-compatible) debug build the mutex can support deadlock detection. Some people are super concerned about having an extra word in a mutex, but I’m not at all.

For libc++, it would probably need to be behind an ABI-breaking macro on Mac and FreeBSD, but we haven’t committed to any ABI stability on Linux, so we could probably enable it by default there, and get into build bots.

Maybe bring this up on the cfe-dev list to discuss with Marshall and other folks interested in libc++?

> I don't see any reason not to reserve a word in the mutex so that in (an ABI-compatible) debug build the mutex can support deadlock detection. Some people are super concerned about having an extra word in a mutex, but I'm not at all.

Making a mutex span multiple cache lines can have very serious and unpredictable performance impacts in the contended case (if you're not contended, a mutex is probably the wrong synchronisation primitive for you) on modern CPUs.
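
A build-time guard makes the concern concrete (a sketch; the 64-byte line size and the DebugMutex layout are assumptions, not a real design):

    // Sketch: refuse to build if a debug-instrumented mutex outgrows one
    // (assumed 64-byte) cache line. DebugMutex is a hypothetical layout.
    #include <cstdint>
    #include <mutex>

    struct DebugMutex {
      std::mutex Impl;
      std::uintptr_t Owner;  // the extra word for deadlock checking
    };

    static_assert(sizeof(DebugMutex) <= 64,
                  "debug mutex spans more than one cache line");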

> For libc++, it would probably need to be behind an ABI-breaking macro on Mac and FreeBSD, but we haven't committed to any ABI stability on Linux, so we could probably enable it by default there, and get into build bots.

> Maybe bring this up on the cfe-dev list to discuss with Marshall and other folks interested in libc++?

On FreeBSD and OS X, the underlying pthread_mutex can already do deadlock detection, so I don't see why you'd need to add another word. The PTHREAD_MUTEX_ERRORCHECK attribute has been part of POSIX since 1997, so I'd expect it to be supported everywhere.
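
For anyone who hasn't used it, the error-checking type is a one-attribute change (standard pthreads API; relocking from the owning thread returns EDEADLK instead of hanging):

    // Standard POSIX error-checking mutex: self-deadlock is reported,
    // not silently entered.
    #include <cerrno>
    #include <cstdio>
    #include <pthread.h>

    int main() {
      pthread_mutexattr_t Attr;
      pthread_mutexattr_init(&Attr);
      pthread_mutexattr_settype(&Attr, PTHREAD_MUTEX_ERRORCHECK);

      pthread_mutex_t M;
      pthread_mutex_init(&M, &Attr);
      pthread_mutexattr_destroy(&Attr);

      int First = pthread_mutex_lock(&M);  // 0: acquired
      int Again = pthread_mutex_lock(&M);  // EDEADLK, not a hang
      std::printf("lock=%d relock%s\n", First,
                  Again == EDEADLK ? "=EDEADLK" : " unexpected");
      pthread_mutex_unlock(&M);
      pthread_mutex_destroy(&M);
      return 0;
    }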

For FreeBSD, we also had a GSoC student a couple of years ago who ported the WITNESS lock order reversal checking framework from the kernel to userspace, for more detailed checking.

David

> 1) Should support for multi-threading be a compile-time or runtime parameter
> in LLVM?
>
> Currently it is both. It is compile-time through the use of the
> define LLVM_ENABLE_THREADS, and it is runtime through the use of functions
> llvm_start_multithreaded, llvm_is_multithreaded, etc. I and some others
> feel like runtime support for multi-threading could be removed, and it
> should be compile-time only. However, I am not aware of all the ways in
> which this is being used, so this is where I would like some feedback. The
> issues I have with runtime multithreading support are the following:
>
> * It leads to confusing code. At any given point, is multi-threading
> enabled or disabled? You never know without calling llvm_is_multithreaded,
> but even calling that is inherently racy, because someone else could
> disable it after it returns.
>
> * It leads to subtle bugs. clang_createIndex, the first time it's called,
> enables multi-threading. What happens if someone else disables it later?
> Things like this shouldn't even be possible.
>
> * Not all platforms even support threading to begin with. This works now
> because llvm_start_multithreaded(), if the compile-time flag is set to
> disable threads, simply does nothing. But this decision should be made by
> someone else. Nobody really checks the return value from
> llvm_start_multithreaded anyway, so there are probably already bugs where
> someone tries to start multi-threading and it fails.
>
> * What does it actually mean to turn multi-threading support on and off?
>
> Anybody who tries to do this is almost certainly broken due to some edge
> cases about when it's on and when it's off. So this goes back to the first
> two points about confusing code and subtle bugs.

I agree with your assessment, and my personal feeling is that this
should be a compile-time switch. Code which attempts to do this at
runtime often has subtle bugs.

> 2) What should happen when you try to acquire a mutex in an app with
> threading disabled?
>
> If this is a compile-time parameter, the solution is simple: make an empty
> mutex class that satisfies the Lockable concept, and have its methods do
> nothing. Then typedef something like llvm::mutex to be either std::mutex
> or llvm::null_mutex accordingly. If this is a runtime parameter, it's more
> complicated, and we should ask ourselves whether or not it's necessary to
> avoid the overhead of acquiring an uncontended mutex in single-threaded
> apps. For what it's worth, all reasonable STL implementations will use a
> lightweight mutex implementation, which generally requires less than 100
> nanoseconds to acquire uncontended, so I think we don't need to care, but
> again some feedback would be nice.

I agree with the compile-time approach.

> 3) We have some debug code in our mutex implementation that is intended to
> try to help catch deadlocks and/or race conditions. Do we need this code?
>
> I think we can get by without it, but again some feedback about how, if at
> all, people are using this would be nice. For example, if you want to
> detect deadlocks and race conditions you can use TSan.

I don't think the code is needed (I believe TSan suffices), but is it
possible that there are platforms for which TSan is unsupported and
this code may prove useful? I'm not certain. As Chandler points out, I
think supporting this at the libc++ level may actually be more
profitable.

~Aaron

> > I don't see any reason not to reserve a word in the mutex so that in (an ABI-compatible) debug build the mutex can support deadlock detection. Some people are super concerned about having an extra word in a mutex, but I'm not at all.

> Making a mutex span multiple cache lines can have very serious and unpredictable performance impacts in the contended case (if you're not contended, a mutex is probably the wrong synchronisation primitive for you) on modern CPUs.

Are we currently bumping up against cache line size limitations with
the libc++ mutex implementation?

> > For libc++, it would probably need to be behind an ABI-breaking macro on Mac and FreeBSD, but we haven't committed to any ABI stability on Linux, so we could probably enable it by default there, and get into build bots.

> > Maybe bring this up on the cfe-dev list to discuss with Marshall and other folks interested in libc++?

> On FreeBSD and OS X, the underlying pthread_mutex can already do deadlock detection, so I don't see why you'd need to add another word. The PTHREAD_MUTEX_ERRORCHECK attribute has been part of POSIX since 1997, so I'd expect it to be supported everywhere.

Not every platform uses pthreads (such as Win32), so presuming that an
underlying library will always support such functionality may not be
the appropriate design decision.

~Aaron

Yes. I still have a debt with my vector annotations in libc++; let me deal
with that first and then send a proposal about the deadlock detector.

> On FreeBSD and OS X, the underlying pthread_mutex can already do deadlock
> detection, so I don't see why you'd need to add another word. The
> PTHREAD_MUTEX_ERRORCHECK attribute has been part of POSIX since 1997, so
> I'd expect it to be supported everywhere.

PTHREAD_MUTEX_ERRORCHECK detects a deadlock that has already happened.
tsan's deadlock detector (as well as helgrind and many other similar tools)
detects lock order inversion, i.e. a situation which may potentially lead
to a deadlock.

--kcc
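
For concreteness, the classic shape of a lock-order inversion (no deadlock occurs on most runs, which is exactly why a detector that flags the order is useful):

    // Lock-order inversion: thread 1 takes A then B, thread 2 takes B
    // then A. Most runs finish fine, but an order-based detector (TSan's
    // deadlock detector, helgrind, WITNESS) reports the potential deadlock.
    #include <mutex>
    #include <thread>

    std::mutex A, B;

    int main() {
      std::thread T1([] {
        std::lock_guard<std::mutex> LA(A);
        std::lock_guard<std::mutex> LB(B);  // order: A -> B
      });
      std::thread T2([] {
        std::lock_guard<std::mutex> LB(B);
        std::lock_guard<std::mutex> LA(A);  // order: B -> A (inverted)
      });
      T1.join();
      T2.join();
      return 0;
    }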

Yes, that's what WITNESS does in the FreeBSD kernel. The line after the one you quoted mentioned the port of this to userspace pthreads.

David

> > tsan's deadlock detector (as well as helgrind and many other similar
> > tools) detects lock order inversion, i.e. a situation which may potentially
> > lead to a deadlock.
>
> Yes, that's what WITNESS does in the FreeBSD kernel. The line after the
> one you quoted mentioned the port of this to userspace pthreads.

Did it extend the pthread_mutex_t data structure?

No, it stored the state in a look-aside structure. We've found that requiring users

David
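
A sketch of the look-aside shape (hypothetical, not the actual WITNESS port): the per-mutex debug state is keyed by the mutex's address, so the mutex type itself keeps its size and ABI:

    // Hypothetical look-aside table: debug state lives in a side map
    // keyed by mutex address; interposed lock/unlock wrappers update it.
    #include <mutex>
    #include <unordered_map>

    struct LockDebugInfo {
      unsigned AcquireCount = 0;
      // ...owner thread, lock-order edges, etc. would go here...
    };

    class LookAsideTable {
      std::mutex TableLock;  // guards the map itself
      std::unordered_map<const void *, LockDebugInfo> Info;

    public:
      void noteLock(const void *Mutex) {
        std::lock_guard<std::mutex> G(TableLock);
        ++Info[Mutex].AcquireCount;
      }
    };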

> > > > tsan's deadlock detector (as well as helgrind and many other similar
> > > > tools) detects lock order inversion, i.e. a situation which may potentially
> > > > lead to a deadlock.
> > >
> > > Yes, that's what WITNESS does in the FreeBSD kernel. The line after the
> > > one you quoted mentioned the port of this to userspace pthreads.
> >
> > Did it extend the pthread_mutex_t data structure?
>
> No, it stored the state in a look-aside structure. We've found that requiring
> users to recompile all of their code, including shared libraries, for a new
> ABI is a non-starter.

Yeah, for libpthread having a separate debug variant is hard.
But using a look-aside structure for deadlock detection inside the pthread
library is no better than having an external deadlock detector that
intercepts pthread_mutex_* (that's what tsan and others do).

Hi,

> 1) Should support for multi-threading be a compile-time or runtime parameter in
> LLVM?
>
> Currently it is both. It is compile-time through the use of the define
> LLVM_ENABLE_THREADS, and it is runtime through the use of functions
> llvm_start_multithreaded, llvm_is_multithreaded, etc. I and some others
> feel like runtime support for multi-threading could be removed, and it
> should be compile-time only. However, I am not aware of all the ways in
> which this is being used, so this is where I would like some feedback. The
> issues I have with runtime multithreading support are the following:
>
> * It leads to confusing code. At any given point, is multi-threading
> enabled or disabled? You never know without calling llvm_is_multithreaded,
> but even calling that is inherently racy, because someone else could disable
> it after it returns.
>
> * It leads to subtle bugs. clang_createIndex, the first time it's called,
> enables multi-threading. What happens if someone else disables it later?
> Things like this shouldn't even be possible.
>
> * Not all platforms even support threading to begin with. This works now
> because llvm_start_multithreaded(), if the compile-time flag is set to
> disable threads, simply does nothing. But this decision should be made by
> someone else. Nobody really checks the return value from
> llvm_start_multithreaded anyway, so there are probably already bugs where
> someone tries to start multi-threading and it fails.
>
> * What does it actually mean to turn multi-threading support on and off?
>
> Anybody who tries to do this is almost certainly broken due to some edge
> cases about when it's on and when it's off. So this goes back to the first
> two points about confusing code and subtle bugs.

> I agree with your assessment, and my personal feeling is that this
> should be a compile-time switch. Code which attempts to do this at
> runtime often has subtle bugs.

Just to make you aware of a use-case scenario that has concerned me for some
time with the open-source OpenGL driver stack, which makes use of LLVM.

There you have driver modules which can get loaded at any point during
application run time. Potentially this can happen from any thread (even if in
reality it is probably serialized by the dynamic loader), and it can thus race
against the main application itself calling llvm_start_multithreaded in
another thread. Also, if I remember the documentation correctly,
llvm_start_multithreaded must not be called concurrently with any other LLVM
API calls that may access any of the LLVM singletons. That in turn is hard to
guarantee if you get dynamically loaded at run time and have to expect that
the main application is already using LLVM.

Given that, I would wish that you could just use LLVM concurrently and that
this were not an option at all, but I can easily anticipate the usual
resistance to that.

Nevertheless, if multithreading is a compile-time option, it would be great
if a user of LLVM could see whether the LLVM libraries have been compiled with
multithreading enabled or not. This would probably best be queryable at
runtime, so that a potential user can bail out in a nice way instead of just
crashing later...
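
Something as simple as the following would already be enough for that (a sketch of a constant-returning query; the name follows the existing API):

    // Sketch: a runtime-callable query that just reports the compile-time
    // configuration, so a dynamically loaded consumer can bail out
    // gracefully instead of crashing later.
    bool llvm_is_multithreaded() {
    #if LLVM_ENABLE_THREADS
      return true;
    #else
      return false;
    #endif
    }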

So it would be good if you keep this use case in mind when making changes in this corner!

Thanks

Mathias

I’m reviving this thread because I have a new patch ready to go that attempts to push this through again.

http://reviews.llvm.org/D4223

The last time, there were a couple of major issues that came up during review, outlined in the original post of this thread. All of those have been addressed in already-submitted patches.

One additional concern was expressed offline, which is simply to make sure that this change does not cause a significant performance regression in the single-threaded case. I have run wall-clock comparisons on Windows and Mac OS X of compiling a very large TU, which showed no noticeable difference, so I believe that concern has been addressed as well.

Feel free to discuss here or offer comments on the review, but I wanted to make sure the community is aware that I’m attempting to push this through again, in case anyone has concerns that have yet to be addressed.

Zach