I was looking at the profile for a tool I’m working on, and noticed that it is spending 10% of its time doing locking-related stuff. The structure of the tool is that it reads in a ton of stuff (e.g. one moderate example I’m working with is 40M of input) into MLIR, then uses its multithreaded pass manager to do transformations.
As it happens, the structure of this is that the parsing pass is single threaded, because it is parsing through a linear file (the parser is simple and fast, so this is bound by IR construction). This means that none of the locking during IR construction is useful.
Historically, LLVM had a design where you could dynamically enable and disable multithreading support in a tool, which would be perfect for this use case, but it was removed by this patch: https://reviews.llvm.org/D4216. The rationale in the patch doesn’t make sense to me - this mode had nothing to do with the old LLVM global lock; it controlled whether llvm::llvm_is_multithreaded() returned true or false … which all the locking stuff is guarded on.
Would it make sense to re-enable this, or am I missing something?
I was looking at the profile for a tool I’m working on, and noticed that it is spending 10% of its time doing locking-related stuff. The structure of the tool is that it reads in a ton of stuff (e.g. one moderate example I’m working with is 40M of input) into MLIR, then uses its multithreaded pass manager to do transformations.
As it happens, the structure of this is that the parsing pass is single threaded, because it is parsing through a linear file (the parser is simple and fast, so this is bound by IR construction). This means that none of the locking during IR construction is useful.
I’m curious which are the places that show up on the profile? Do you have a few stacktraces to share?
Historically, LLVM had a design where you could dynamically enable and disable multithreading support in a tool, which would be perfect for this use case, but it was removed by this patch: https://reviews.llvm.org/D4216. The rationale in the patch doesn’t make sense to me - this mode had nothing to do with the old LLVM global lock; it controlled whether llvm::llvm_is_multithreaded() returned true or false … which all the locking stuff is guarded on.
It seems that, at the time, the assumption was that this flag existed only to alleviate the cost of the global lock, and that removing the lock removed the motivation for the feature? Looks like you proved this wrong.
+Zach, David, and Reid to make sure they don’t miss this.
Would it make sense to re-enable this, or am I missing something?
Finding a way to re-enable it seems interesting. I wonder how much it’ll interact with the places inside the compiler that are threaded now, maybe it isn’t much more than tracking and auditing the uses of LLVM_ENABLE_THREADS (like lib/Support/ThreadPool.cpp for example). Have you already looked into it?
I was looking at the profile for a tool I’m working on, and noticed that it is spending 10% of its time doing locking-related stuff. The structure of the tool is that it reads in a ton of stuff (e.g. one moderate example I’m working with is 40M of input) into MLIR, then uses its multithreaded pass manager to do transformations.
As it happens, the structure of this is that the parsing pass is single threaded, because it is parsing through a linear file (the parser is simple and fast, so this is bound by IR construction). This means that none of the locking during IR construction is useful.
I’m curious which are the places that show up on the profile? Do you have a few stacktraces to share?
In my case, it is all the MLIR attribute/type uniquing stuff, which is guarded by an RWMutex.
Historically, LLVM had a design where you could dynamically enable and disable multithreading support in a tool, which would be perfect for this use case, but it was removed by this patch: https://reviews.llvm.org/D4216. The rationale in the patch doesn’t make sense to me - this mode had nothing to do with the old LLVM global lock; it controlled whether llvm::llvm_is_multithreaded() returned true or false … which all the locking stuff is guarded on.
It seems that, at the time, the assumption was that this flag existed only to alleviate the cost of the global lock, and that removing the lock removed the motivation for the feature? Looks like you proved this wrong.
+Zach, David, and Reid to make sure they don’t miss this.
Yeah, it was about not paying the cost for synchronization when it wasn’t worthwhile.
Would it make sense to re-enable this, or am I missing something?
Finding a way to re-enable it seems interesting. I wonder how much it’ll interact with the places inside the compiler that are threaded now, maybe it isn’t much more than tracking and auditing the uses of LLVM_ENABLE_THREADS (like lib/Support/ThreadPool.cpp for example). Have you already looked into it?
It is super-easy to re-enable, because the entire codebase still calls llvm::llvm_is_multithreaded(). We just need to add the global back, along with the methods to set and clear it, and change llvm::llvm_is_multithreaded() to something like:
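(The snippet itself didn’t survive in this copy of the thread; a hedged reconstruction of the shape being proposed follows. The function names mirror the pre-D4216 API, but the exact signatures and the global’s name are illustrative, not the actual patch.)

```cpp
#include <atomic>

// In a real build, CMake defines LLVM_ENABLE_THREADS; define it here so the
// sketch is self-contained.
#ifndef LLVM_ENABLE_THREADS
#define LLVM_ENABLE_THREADS 1
#endif

// Sketch only: the dynamic process-wide flag being proposed.
static std::atomic<bool> MultithreadedMode{true};

bool llvm_start_multithreaded() {
  MultithreadedMode.store(true, std::memory_order_relaxed);
  return LLVM_ENABLE_THREADS != 0; // whether threading is compiled in at all
}

void llvm_stop_multithreaded() {
  MultithreadedMode.store(false, std::memory_order_relaxed);
}

bool llvm_is_multithreaded() {
#if LLVM_ENABLE_THREADS != 0
  // Only consult the dynamic flag when threading support is compiled in.
  return MultithreadedMode.load(std::memory_order_relaxed);
#else
  return false;
#endif
}
```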
This is the part I am not sure about: the ThreadPool I mentioned above, for example, is not checking llvm_is_multithreaded() I believe, and I doubt that the clients of the ThreadPool are. So you can have multiple things in flight in the ThreadPool that rely on llvm::Mutex and similar things to operate properly.
Seems like we could end up in a situation where llvm_is_multithreaded() returns false, effectively disabling all the mutexes and other safety mechanisms, while some code is still using the ThreadPool.
Yes, the llvm::Smart* family of locks still exists. But very few places are using them outside of MLIR; it’s more common to just use plain std::mutex.
That said, I don’t think it’s really a good idea to use them, even if they were fixed to work as designed. It’s not composable: the boolean “enabled” bit is process-wide, not local to whatever data structure you’re trying to build. So your single-threaded tool gets some benefit, but the benefit goes away as soon as the process starts using multiple threads, even if there is still only one thread using the MLIR context in question.
Yes, the llvm::Smart* family of locks still exists. But very few places are using them outside of MLIR; it’s more common to just use plain std::mutex.
That said, I don’t think it’s really a good idea to use them, even if they were fixed to work as designed. It’s not composable: the boolean “enabled” bit is process-wide, not local to whatever data structure you’re trying to build. So your single-threaded tool gets some benefit, but the benefit goes away as soon as the process starts using multiple threads, even if there is still only one thread using the MLIR context in question.
Yes, I agree, this is similar to Mehdi’s point. I think it is clear that “enable” and “disable” multithreaded mode should only be called from applications, not libraries. Calling one of them from a library breaks composability.
So probably I’d recommend two things:
If locking uncontended locks is showing up on profiles as a performance bottleneck, it’s probably worth looking into ways to reduce that overhead in both single-threaded and multi-threaded contexts. (Reducing the number of locks taken in frequently called code, or using a better lock implementation).
If you want some mechanism to disable MLIR locking, it should probably be a boolean attached to the MLIR context in question, not a global variable.
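A minimal sketch of what a context-attached boolean could look like (hypothetical names, not MLIR’s actual API): the “disable locking” bit lives on the context object rather than in a process-wide global, so two contexts in the same process can make independent choices.

```cpp
#include <mutex>

// Hypothetical sketch: per-context threading toggle instead of a global.
class Context {
  std::mutex UniquerMutex;
  bool ThreadingEnabled = true;

public:
  void disableMultithreading() { ThreadingEnabled = false; }
  bool isMultithreadingEnabled() const { return ThreadingEnabled; }

  // Run Body under the lock only when this context may be shared
  // across threads.
  template <typename Fn> void withUniquerLock(Fn &&Body) {
    if (!ThreadingEnabled) {
      Body(); // single-threaded fast path: no lock taken
      return;
    }
    std::lock_guard<std::mutex> Lock(UniquerMutex);
    Body();
  }
};
```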
Ok, but let me argue the other way. We currently have a cmake flag that sets LLVM_ENABLE_THREADS, and that flag enables an across-the-board speedup. That cmake flag is the *worst* possible thing for library composability. :-). Are you suggesting that we remove it?
If locking uncontended locks is showing up on profiles as a performance bottleneck, it’s probably worth looking into ways to reduce that overhead in both single-threaded and multi-threaded contexts. (Reducing the number of locks taken in frequently called code, or using a better lock implementation).
If you want some mechanism to disable MLIR locking, it should probably be a boolean attached to the MLIR context in question, not a global variable.
Ok, but let me argue the other way. We currently have a cmake flag that sets LLVM_ENABLE_THREADS, and that flag enables an across-the-board speedup. That cmake flag is the worst possible thing for library composability. :-). Are you suggesting that we remove it?
Yes, I would like to remove LLVM_ENABLE_THREADS. Assuming you’re not building for some exotic target that doesn’t have threads, there isn’t any reason to randomly shut off all thread-related functionality in the LLVM support libraries. There isn’t any significant performance or codesize gain to be had outside of MLIR, as far as I know, and it increases the number of configurations we have to worry about. I have no idea if turning it off even works on master; I don’t know of any buildbots or users using that configuration.
If you want to support some sort of lockless mode in MLIR, I think that burden should be carried as part of MLIR, instead of infecting the entire LLVM codebase.
I agree here. It is going to be much easier, and likely saner, to have the MLIR bits controlled by a flag in the MLIR context. Trying to stop/start threading bits in utilities doesn’t seem reliable given all of the different factors, e.g. ThreadPool/llvm::parallel_for/etc. functionality. It also provides a much more controlled interface/contract.
If locking uncontended locks is showing up on profiles as a performance bottleneck, it’s probably worth looking into ways to reduce that overhead in both single-threaded and multi-threaded contexts. (Reducing the number of locks taken in frequently called code, or using a better lock implementation).
If you want some mechanism to disable MLIR locking, it should probably be a boolean attached to the MLIR context in question, not a global variable.
Ok, but let me argue the other way. We currently have a cmake flag that sets LLVM_ENABLE_THREADS, and that flag enables an across-the-board speedup. That cmake flag is the worst possible thing for library composability. :-). Are you suggesting that we remove it?
Yes, I would like to remove LLVM_ENABLE_THREADS. Assuming you’re not building for some exotic target that doesn’t have threads, there isn’t any reason to randomly shut off all thread-related functionality in the LLVM support libraries.
There isn’t any significant performance or codesize gain to be had outside of MLIR, as far as I know, and it increases the number of configurations we have to worry about.
Well: if one can measure a performance / binary-size difference, I think it is a valid configuration to have. Being able to build and embed the compiler without paying the price for what you don’t need seems valuable to me.
Of course if we’re confident that there is no use (no way to use LLVM as a library and measure a difference there), removing it seems OK to me.
I have no idea if turning it off even works on master; I don’t know of any buildbots or users using that configuration.
If you want to support some sort of lockless mode in MLIR, I think that burden should be carried as part of MLIR, instead of infecting the entire LLVM codebase.
I agree that we should look into improving the situation on the MLIRContext itself to reduce the cost when used outside of a multi-threaded context.
It isn’t great in general to maintain a modal behavior in order to dynamically manage the thread-safety aspect of an object, so this will likely require quite some thought.
I was looking at the profile for a tool I’m working on, and noticed that it is spending 10% of its time doing locking-related stuff. The structure of the tool is that it reads in a ton of stuff (e.g. one moderate example I’m working with is 40M of input) into MLIR, then uses its multithreaded pass manager to do transformations.
As it happens, the structure of this is that the parsing pass is single threaded, because it is parsing through a linear file (the parser is simple and fast, so this is bound by IR construction). This means that none of the locking during IR construction is useful.
I'm curious which are the places that show up on the profile? Do you have a few stacktraces to share?
In my case, it is all the MLIR attribute/type uniquing stuff, which is guarded by an RWMutex.
If this can become a performance problem, is there a way to tackle the problem head-on to reduce that cost even in scenarios that really are multithreaded? E.g., an inlined initial atomic lock that falls back to the “real” lock implementation on failure?
As long as it’s only a small number of hotspots (and attribute/type uniquing seem like plausible candidates), it’d seem justified to do such things.
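One way to read the suggestion above is a lock whose uncontended acquire is a single inlined compare-and-swap. A sketch under that reading (with std::this_thread::yield standing in for the “real” blocking fallback, which a production version would use instead):

```cpp
#include <atomic>
#include <thread>

// Sketch of an "inlined fast path" lock: the uncontended case is one CAS
// that a compiler can inline at the call site; the slow path here merely
// spins and yields, whereas a real implementation would fall back to a
// proper blocking lock.
class FastPathLock {
  std::atomic<bool> Held{false};

public:
  void lock() {
    bool Expected = false;
    // Fast path: a single compare-and-swap when uncontended.
    if (Held.compare_exchange_strong(Expected, true,
                                     std::memory_order_acquire))
      return;
    // Slow path: retry under contention.
    do {
      std::this_thread::yield();
      Expected = false;
    } while (!Held.compare_exchange_weak(Expected, true,
                                         std::memory_order_acquire));
  }

  void unlock() { Held.store(false, std::memory_order_release); }
};
```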
Historically, LLVM had a design where you could dynamically enable and disable multithreading support in a tool, which would be perfect for this use case, but it was removed by this patch: https://reviews.llvm.org/D4216. The rationale in the patch doesn’t make sense to me - this mode had nothing to do with the old LLVM global lock; it controlled whether llvm::llvm_is_multithreaded() returned true or false … which all the locking stuff is guarded on.
Would it make sense to re-enable this, or am I missing something?
This was probably one of my first patches ever to LLVM, but it sounds like the thought process was something like:
Stop using llvm_start_multithreaded in two places where it wasn’t needed.
Realize that it’s never called anywhere anymore.
Delete the dead code.
Given that it’s been 6 years with nobody else needing this, I’m skeptical that it’s of broad enough utility to re-introduce. On the other hand, I’m basically inactive at this point so I have no horse in the race.
My only input on this is that, if we do add this back, threading should be enabled by default, and then apps can disable it if they know it is safe to do so.
As Eli mentioned, C++11 threading primitives have proliferated in LLVM, and I would hesitate to add wrappers for them all. Either way I don’t feel super strongly about it.
I think that, per the discussion up-thread, we should keep things as they are - downstream clients that don’t want the overhead should conditionalize themselves using local techniques.
If we keep LLVM_ENABLE_THREADS, then I think it would make sense to move that inline into the Threading.h header file, so everything can be constant propagated without LTO. Right now we have this in the .cpp file:
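(The quoted snippet was dropped in this copy of the thread; reproduced from memory, so treat the details as approximate. The out-of-line definition is roughly the following, and since the result depends only on a compile-time macro, an inline definition in Threading.h would let callers constant-fold the check without LTO.)

```cpp
// In a real build, CMake defines LLVM_ENABLE_THREADS; define it here so
// the sketch compiles standalone.
#ifndef LLVM_ENABLE_THREADS
#define LLVM_ENABLE_THREADS 1
#endif

namespace llvm {
// Roughly the current out-of-line definition in lib/Support/Threading.cpp.
bool llvm_is_multithreaded() {
#if LLVM_ENABLE_THREADS != 0
  return true;
#else
  return false;
#endif
}
} // namespace llvm
```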