Parallelizing loading of shared libraries

After dealing with a bunch of microoptimizations, I’m back to parallelizing loading of shared modules. My naive approach was to just create a new thread per shared library. I have a feeling some users may not like that; I think I read an email from someone who has thousands of shared libraries. That’s a lot of threads :)

The problem is that loading a shared library can cause downstream parallelization through TaskPool. I can’t then also have the loading of a shared library itself go through TaskPool, as that could cause a deadlock: if all the worker threads are waiting on work that TaskPool needs to run on a worker thread, then nothing will happen.
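To make the hazard concrete, here is a minimal sketch of the deadlock. llvm::ThreadPool is used purely as a stand-in for lldb's TaskPool (I'm assuming a constructor that takes a thread count); a pool of one worker just makes the hang obvious, but any fixed-size pool whose workers are all occupied by outer tasks behaves the same way:

```cpp
// Sketch only -- this hangs by design, so don't expect it to return.
#include "llvm/Support/ThreadPool.h"

void DeadlockDemo() {
  llvm::ThreadPool Pool(1); // a single worker makes the hang obvious

  auto Outer = Pool.async([&Pool] {
    // Nested parallelism, analogous to SymbolFileDWARF::Index running
    // underneath DynamicLoader::LoadModuleAtAddress.
    auto Inner = Pool.async([] { /* e.g. extract DIEs for one CU */ });
    Inner.wait(); // blocks the only worker, so Inner is never scheduled
  });

  Outer.wait(); // never returns
}
```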

Three possible solutions:

  1. Remove the notion of a single global TaskPool, and instead have a static pool at each callsite that wants one. That way multiple paths into the same code would share the same pool, but different places in the code would have their own pools. (See the sketch after this list.)

  2. Change the wait code for TaskRunner to note whether it is already on a TaskPool thread, and if so, spawn another one. However, I don’t think that fully solves the issue of having too many threads loading shared libraries, as there is no guarantee the new worker would work on the “deepest” work. I suppose each task could be annotated with a depth, and the work could be sorted in TaskPool, though…

  3. Leave a separate thread per shared library.
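Here is what option 1 could look like in practice. The function names are made up for illustration, and llvm::ThreadPool again stands in for whatever pool type we'd actually use:

```cpp
// Option 1 sketch: each callsite that wants parallelism owns its own static
// pool, so nested work submitted from one callsite never waits behind (or
// deadlocks against) work submitted from another.
#include "llvm/Support/ThreadPool.h"
#include <future>
#include <string>
#include <vector>

void IndexCompileUnits(std::vector<int> &CUs) {
  static llvm::ThreadPool Pool; // pool private to this callsite
  std::vector<std::shared_future<void>> Futures;
  for (int &CU : CUs)
    Futures.push_back(Pool.async([&CU] { (void)CU; /* extract DIEs */ }));
  for (auto &F : Futures)
    F.wait();
}

void DemangleSymbols(std::vector<std::string> &Names) {
  static llvm::ThreadPool Pool; // a different pool for this callsite
  std::vector<std::shared_future<void>> Futures;
  for (std::string &N : Names)
    Futures.push_back(Pool.async([&N] { (void)N; /* demangle N */ }));
  for (auto &F : Futures)
    F.wait();
}
```

Note this only avoids the deadlock as long as no code path re-enters the same callsite from one of its own workers.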

Thoughts?

Under what conditions would a worker thread spawn additional work to be run in parallel and then wait for it, as opposed to just doing it serially? Is it feasible to just require tasks to be non-blocking?

A worker thread would call DynamicLoader::LoadModuleAtAddress. This in
turn eventually calls SymbolFileDWARF::Index, which uses TaskRunners to:
1. extract DIEs for each DWARF compile unit in a separate thread
2. parse/unmangle/etc. all the symbols

The code distance from DynamicLoader to SymbolFileDWARF is large enough that
disallowing LoadModuleAtAddress from blocking seems to be a nonstarter.

We started out with the philosophy that lldb wouldn't touch any more information in a shared library than we actually needed. So when a library gets loaded we might need to read in and resolve its section list, but we won't read in any symbols if we don't need to look at them. The idea was that if you did "load a binary, and run it" until the binary stops for some reason, we haven't done any unnecessary work. Similarly, if all the breakpoints the user sets are scoped to a shared library, then there's no need for us to read any symbols for any other shared libraries. I think that is a good goal; it allows the debugger to be used in special-purpose analysis tools w/o forcing it to pay costs that a more general-purpose debug session might require.

I think it would be hard to convert all the usages of modules from "do something with a shared library" mode to "tell me you are interested in a shared library and give me a callback" mode, so that the module reading could be parallelized on demand. But at the very least we need to allow a mode where symbol reading is done lazily.

The other concern is that lldb keeps the modules it reads in a global cache, shared by all debuggers & targets. It is very possible that you could have two targets or two debuggers each with one target that are reading in shared libraries simultaneously, and adding them to the global cache. In some of the uses that lldb has under Xcode this is actually very common. So the task pool will have to be built up as things are added to the global shared module cache, not at the level of individual targets noticing the read-in of a shared library.

Jim

So as it turns out, at least on my platform (Ubuntu 14.04), the symbols are loaded regardless. I changed my test so that:

  1. main() just returns right away

  2. cmdline is: lldb -b -o run /path/to/my/binary

and it takes the same amount of time as setting a breakpoint.

Somebody is probably setting an internal breakpoint for some purpose w/o scoping it to the shared library it's to be found in. Either that or somebody has broken lazy loading altogether. But that's not intended behavior.

Jim

It's the gdb jit interface breakpoint. I don't think there is a good
way to scope that to a library, as that symbol can be anywhere...

Interesting. Do you have to catch this information as the JIT modules get loaded, or can you recover the data after-the-fact? For most uses, I don't think you need to track JIT modules as they are loaded, but it would be good enough to refresh the list on stop.

Jim

Hmm, turns out I was wrong about delayed symbol loading not working under Linux. I’ve added timings to the review.

Well, you need to know that the library got loaded so you can decide
whether you want to set a breakpoint there before it gets a chance to
execute. It's the same thing as the dynamic linker rendezvous
breakpoint, only for jitted code.

I suppose we could delay the lookup until the first breakpoint gets set by
the user, as that is the first time we will need that information, but
I guess we don't have the ability to express that now.

FWIW, the jit interface can be disabled by a setting
(plugin.jit-loader.gdb.enable-jit-breakpoint).

pl

I looked at this option in the past and it was my preferred solution. My
suggestion would be to have two task pools: one for low-level parallelism,
which spawns std::thread::hardware_concurrency() threads, and another one
for higher-level tasks, which can only spawn a smaller number of threads
(the algorithm for the exact number TBD). The high-level threads can
access the low-level ones, but not the other way around, which
guarantees progress.
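If it helps, here is a rough sketch of that wiring, with llvm::ThreadPool standing in for the pool implementation (again assuming a constructor that takes a thread count) and made-up function names; the size of the high-level pool is the TBD part:

```cpp
// Two pools: the low-level one gets the full hardware concurrency, the
// high-level one stays small. High-level tasks may enqueue into and wait on
// the low-level pool; low-level tasks never wait on other tasks, so the
// system as a whole always makes progress.
#include "llvm/Support/ThreadPool.h"
#include <future>
#include <thread>
#include <vector>

struct ModulePlaceholder {}; // stand-in for lldb_private::Module

static llvm::ThreadPool &GetLowLevelPool() {
  static llvm::ThreadPool Pool(std::thread::hardware_concurrency());
  return Pool;
}

static llvm::ThreadPool &GetHighLevelPool() {
  static llvm::ThreadPool Pool(2); // exact sizing TBD, per the above
  return Pool;
}

void LoadAllModules(std::vector<ModulePlaceholder *> &Modules) {
  std::vector<std::shared_future<void>> Loads;
  for (ModulePlaceholder *M : Modules) {
    // High-level task: load one shared library.
    Loads.push_back(GetHighLevelPool().async([M] {
      // Low-level parallelism (e.g. per-CU DWARF indexing) goes to the
      // low-level pool; waiting on it here is safe because low-level tasks
      // never block on further tasks.
      auto Index = GetLowLevelPool().async([M] { (void)M; /* index M */ });
      Index.wait();
    }));
  }
  for (auto &L : Loads)
    L.wait();
}
```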

I propose to hardcode 2 pools, as I don't want to make it easy for
people to create additional ones -- I think we should be having this
discussion every time someone tries to add one, and have a very good
justification for it (FWIW, I think your justification is good in this
case, and I am grateful that you are pursuing this).

pl

Hmmm, ok, I don’t like hard-coding pools. Your idea about limiting the number of high-level threads gave me an idea:

  1. System has one high level TaskPool.

  2. TaskPools have up to one child and one parent (the parent of the high-level TaskPool is nullptr).

  3. When a worker starts up for a given TaskPool, it ensures a single child exists.

  4. There is a thread-local variable that indicates which TaskPool that thread enqueues into (via AddTask). If that variable is nullptr, then it is the high-level TaskPool; threads that are not workers enqueue into this TaskPool. If the thread is a worker thread, then the variable points to the worker’s child.

  5. When creating a thread in a TaskPool, its thread count AND the thread counts of the parent, grandparent, etc. are incremented.

  6. In the main worker loop, if there is no more work to do, OR the thread count is too high, the worker “promotes” itself. Promotion means:
    a. decrement the thread count for the current task pool
    b. if there is no parent, exit; otherwise, become a worker for the parent task pool (and update the thread local TaskPool enqueue pointer).

The main points are:

  1. We don’t hard-code the number of task pools; the code automatically uses the smallest number of task pools needed, regardless of the number of places in the code that want task pools.

  2. When the child task pools are busy, parent task pools reduce their number of workers over time to reduce oversubscription.

You can fiddle with the number of allowed threads per level; for example, if you take into account the height of the pool and the number of child threads, then you could give each level half as many threads as the level below it, unless the level below wasn’t using all of its threads; then the steady state would be 2 * cores, rather than height * cores. I think that is probably overkill, though. (A rough sketch of the scheme follows.)
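To make that concrete, here is a rough sketch of points 1 through 4 (the thread counting from point 5 and the promotion from point 6 are omitted to keep it short). All names are hypothetical, and llvm::ThreadPool, assuming a constructor that takes a thread count, stands in for the per-level worker pool:

```cpp
#include "llvm/Support/ThreadPool.h"
#include <functional>
#include <memory>
#include <mutex>
#include <thread>

class PoolLevel {
public:
  explicit PoolLevel(unsigned Threads, PoolLevel *Parent = nullptr)
      : Parent(Parent), Threads(Threads), Pool(Threads) {}

  // Point 2: up to one parent (nullptr for the high-level pool).
  PoolLevel *GetParent() const { return Parent; }

  // Point 3: a worker's pool lazily ensures a single child level exists.
  PoolLevel &GetChild() {
    std::lock_guard<std::mutex> Lock(ChildMutex);
    if (!Child)
      Child.reset(new PoolLevel(Threads, this));
    return *Child;
  }

  // Point 4: while a task runs, the thread-local enqueue target points at
  // this level's child, so nested AddTask calls land one level down.
  void Run(std::function<void()> Task) {
    Pool.async([this, Task] {
      PoolLevel *Saved = CurrentEnqueueTarget;
      CurrentEnqueueTarget = &GetChild();
      Task();
      CurrentEnqueueTarget = Saved;
    });
  }

  void Wait() { Pool.wait(); } // waits for this level only

  // Point 1: a single high-level pool; threads that are not workers
  // (thread-local target == nullptr) enqueue into it.
  static void AddTask(std::function<void()> Task) {
    PoolLevel *Target =
        CurrentEnqueueTarget ? CurrentEnqueueTarget : &GetHighLevelPool();
    Target->Run(std::move(Task));
  }

  static PoolLevel &GetHighLevelPool() {
    static PoolLevel HighLevel(std::thread::hardware_concurrency());
    return HighLevel;
  }

private:
  PoolLevel *Parent;
  unsigned Threads; // per-level sizing policy is the "fiddle" knob above
  llvm::ThreadPool Pool;
  std::mutex ChildMutex;
  std::unique_ptr<PoolLevel> Child;

  static thread_local PoolLevel *CurrentEnqueueTarget;
};

thread_local PoolLevel *PoolLevel::CurrentEnqueueTarget = nullptr;
```

The promotion from point 6 would have to live inside the worker loop itself, which is exactly the part a stock pool doesn't expose, so this sketch can only approximate the full scheme.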

Have we examined llvm::ThreadPool to see if it can work for our needs? And if not, what kind of changes would be needed to llvm::ThreadPool to make it suitable?

The overall concept is similar; it comes down to implementation details like

  1. llvm doesn’t have a global pool, it’s probably instantiated on demand

  2. llvm keeps threads around until the pool is destroyed, rather than letting the threads exit when they have nothing to do

  3. llvm starts up all the threads immediately, rather than on demand.

Overall I like the current lldb version better than the llvm version, but I haven’t examined any of the use cases of the llvm version to know whether it could be dropped in without issue. However, neither does what I want, so I’ll move forward prototyping what I think it should do, and then see how applicable it is to llvm.

#1 is no big deal, we could just allocate one in a global class somewhere.

#2 actually seems quite desirable, is there any reason you don’t want that?

#3 seems like a win for performance, since no locks have to be acquired to manage the collection of threads.

The algorithm sounds reasonable to me. I'm just not sold on the
"automatic" part. My feeling is that if you cannot tell statically
what "depth" in the pool your code runs in, there is something wrong
with your code and you should fix that first.

Besides, hardcoding the nesting logic into "add" is kinda wrong.
Adding a task is not the problematic operation, waiting for the result
of one is. Granted, generally these happen on the same thread, but
they don't have to be -- you can write a continuation-style
computation, where you do a bit of work, and then enqueue a task to do
the rest. This would create an infinite pool depth here.

Btw, are we sure it's not possible to solve this with just one thread
pool? What would happen if we changed the implementation of "wait" so
that if the target task is not scheduled yet, we just go ahead and
compute it on our thread? I haven't thought through all the details,
but it sounds like this could actually give better performance in some
scenarios...
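FWIW, here is a rough standalone sketch (a toy pool, not lldb's TaskPool) of what that modified wait could look like: if the future we're waiting for isn't ready, the waiting thread pulls pending tasks off the queue and runs them itself instead of blocking.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

class HelpfulPool {
public:
  explicit HelpfulPool(unsigned N) {
    for (unsigned I = 0; I < N; ++I)
      Workers.emplace_back([this] { WorkerLoop(); });
  }

  ~HelpfulPool() {
    {
      std::lock_guard<std::mutex> Lock(M);
      Done = true;
    }
    CV.notify_all();
    for (std::thread &W : Workers)
      W.join();
  }

  std::future<void> Async(std::function<void()> F) {
    std::packaged_task<void()> Task(std::move(F));
    std::future<void> Fut = Task.get_future();
    {
      std::lock_guard<std::mutex> Lock(M);
      Queue.push_back(std::move(Task));
    }
    CV.notify_one();
    return Fut;
  }

  // The interesting part: instead of blocking outright, keep executing
  // queued tasks on this thread until the future we care about is ready.
  void WaitFor(std::future<void> &Fut) {
    while (Fut.wait_for(std::chrono::seconds(0)) !=
           std::future_status::ready) {
      std::packaged_task<void()> Task;
      {
        std::lock_guard<std::mutex> Lock(M);
        if (!Queue.empty()) {
          Task = std::move(Queue.front());
          Queue.pop_front();
        }
      }
      if (Task.valid())
        Task(); // help out: run someone's pending task right here
      else
        Fut.wait(); // queue is drained, so the task is already running
    }
  }

private:
  void WorkerLoop() {
    for (;;) {
      std::packaged_task<void()> Task;
      {
        std::unique_lock<std::mutex> Lock(M);
        CV.wait(Lock, [this] { return Done || !Queue.empty(); });
        if (Done && Queue.empty())
          return;
        Task = std::move(Queue.front());
        Queue.pop_front();
      }
      Task();
    }
  }

  std::mutex M;
  std::condition_variable CV;
  std::deque<std::packaged_task<void()>> Queue;
  std::vector<std::thread> Workers;
  bool Done = false;
};
```

The main caveat is the one raised below: a task run inline this way executes with whatever locks the waiting caller already holds.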

Besides, hardcoding the nesting logic into "add" is kinda wrong.
Adding a task is not the problematic operation, waiting for the result
of one is. Granted, generally these happen on the same thread, but
they don't have to be -- you can write a continuation-style
computation, where you do a bit of work, and then enqueue a task to do
the rest. This would create an infinite pool depth here.

True, but that doesn't seem to be the style of code here. If it were, you
wouldn't need multiple pools, since you'd just wait for the callback telling
you that your work was done.

Btw, are we sure it's not possible to solve this with just one thread
pool? What would happen if we changed the implementation of "wait" so
that if the target task is not scheduled yet, we just go ahead and
compute it on our thread? I haven't thought through all the details,
but it sounds like this could actually give better performance in some
scenarios...

My initial reaction was "that wouldn't work, what if you ran another posix
dl load?" But then I suppose *it* would run more work, and eventually
you'd run a leaf task and finish something.

You'd have to make sure your work could be run regardless of what mutexes
the caller already held (since you may be running work for another
subsystem), but that's probably not too onerous, especially given how many
recursive mutexes lldb uses...

I would still very much prefer we see if there is a way we can adapt LLVM’s ThreadPool class to be suitable for our needs. Unless some fundamental aspect of its design results in unacceptable performance for our needs, I think we should just use it and not re-invent another one. If there are improvements to be made, let’s make them there instead of in LLDB so that other LLVM users can benefit.

IMO we should start with proving out a better version in the lldb codebase, and then work on pushing it upstream. I have found much more resistance getting changes into llvm than lldb, and for good reason - more projects depend on llvm than lldb.

I think that’s all the more reason we should work on getting something into LLVM first. Anything we already have in LLDB, or any modifications we make, will likely not be pushed up to LLVM, especially since LLVM already has a ThreadPool, so any changes you make to LLDB’s thread pool will likely have to be re-written when trying to get it into LLVM. And since, as you said, more projects depend on LLVM than LLDB, there’s a good chance that the baseline you’d be starting from when making improvements is more easily adaptable to what you want to do. LLDB has a long history of being shy about making changes in LLVM where appropriate, and myself and others have started pushing back on that more and more, because it accumulates long-term technical debt.

In my experience, “let’s just get it into LLDB first and then work on getting it up to LLVM later” ends up being “well, it’s in LLDB now, so since my immediate problem is solved I may or may not have time to revisit this in the future” (even if the original intent is sincere).

If there is some resistance getting changes into LLVM, feel free to add me as a reviewer, and I can find the right people to move it along. I’d still like to at least hear a strong argument for why the existing implementation in LLVM is unacceptable for what we need. I’m ok with “non optimal”. Unless it’s “unsuitable”, we should start there and make incremental improvements.