Parallelize loading of shared libraries

The POSIX dynamic loader processes one module at a time. If you have a lot of shared libraries, each with a lot of symbols, this creates unneeded serialization (despite the use of TaskRunners during symbol loading, there is still quite a bit of serialization when loading a library).

To parallelize this, I had to make two changes. Neither one makes a difference on its own; only the combination improves performance (I left them as separate patches for clarity):

  1. Change the POSIX dynamic loader to load each module on its own thread. I didn’t use TaskRunner because some of the called functions use TaskRunner, and it isn’t recursion safe. The final modules are added to the list in the original order, regardless of the order in which the threads finish.

  2. Change Module::AppendImpl to fire off some expensive work on a separate thread.
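As a rough illustration of step 1, here is a minimal sketch of launching one thread per module and still appending the results in the original order. It is not the code from the patch: `LoadedModule`, `LoadOne`, and `LoadAllInOrder` are hypothetical names, and `std::async` stands in for whatever threading primitive the patch actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded module; not an lldb type.
struct LoadedModule {
  std::string path;
  std::size_t symbol_count;
};

// Hypothetical stand-in for the expensive per-module work (parsing, indexing).
// Here it just derives a fake symbol count from the path length.
LoadedModule LoadOne(const std::string &path) {
  return LoadedModule{path, path.size()};
}

// Start one asynchronous task per module, then collect the results in the
// order the paths were given, regardless of which task finishes first.
std::vector<LoadedModule> LoadAllInOrder(const std::vector<std::string> &paths) {
  std::vector<std::future<LoadedModule>> pending;
  pending.reserve(paths.size());
  for (const std::string &path : paths)
    pending.push_back(std::async(std::launch::async, LoadOne, path));

  std::vector<LoadedModule> modules;
  modules.reserve(paths.size());
  for (std::future<LoadedModule> &f : pending)
    modules.push_back(f.get()); // blocks until this particular module is ready

  assert(modules.size() == paths.size());
  return modules;
}
```

Waiting on the futures in submission order is what keeps the module list deterministic; the loads themselves still overlap.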

These two changes bring startup time down from 36 seconds (assuming the previously mentioned crc changes) to 11. This doesn’t improve efficiency; it just increases parallelism.

dyn_load_thread.patch (1.7 KB)

prime_caches.patch (6.1 KB)

I looked at parallelizing the module loading code some time ago, albeit with a slightly different use case in mind. I eventually abandoned it (at least temporarily) because I could not get it to work correctly for all use cases.

I do think that doing this is a good idea, but I think it will have to be done with a very steady hand. E.g., if I patch your changes in right now I get about 10 random tests failing on every test suite run, so it’s clear that you are introducing a race somewhere.

We will also need to have a discussion about what kind of work can be done eagerly, as I believe we are trying to do a lot of things very lazily (which unfortunately makes efficient parallelization more complicated).

Ok. I tried doing something similar to gdb but was unable to make any headway because they have so many global variables. It looked more promising with lldb since there were already some locks.

I assume you’re talking about check-lldb?
https://lldb.llvm.org/test.html

I’ll work on getting those to pass reliably.

As for eager vs not, I was just running code that already runs as part of:

b main

run

That said, I’m sure all the symbol loading is due to setting a breakpoint on a function name. Is there really that much value in deferring that? What if loading the symbols was done in parallel without delaying execution of the debugged program if you didn’t have a breakpoint? Then the impact would be (nearly) invisible to the end user.

Currently, if you say:

(lldb) break set -n main -s MyBinary

then lldb will only read the symbol table for MyBinary. If you read in all symbols when you first see the libraries, you remove that user-controllable optimization.

If the parallelization gets startup to the point where the extra work doesn't matter much, then I'm not 100% against the trade-off. But that's the trade-off.
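To make the trade-off concrete, here is a minimal sketch of lazy per-module symbol loading, where restricting a lookup to one module means the other modules' symbol tables are never parsed. This is not lldb's implementation: `LazyModule` and `CountMatches` are hypothetical names, and the symbol list is a hard-coded placeholder.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical module type for illustration; not lldb's Module class.
class LazyModule {
public:
  explicit LazyModule(std::string name) : m_name(std::move(name)) {}

  const std::string &Name() const { return m_name; }
  bool SymbolsLoaded() const { return m_loaded.load(); }

  // Parse the symbol table on first use only; later calls reuse the result.
  const std::vector<std::string> &Symbols() {
    std::call_once(m_once, [this] {
      m_symbols = {"main", "helper"}; // placeholder for real parsing work
      m_loaded.store(true);
    });
    return m_symbols;
  }

private:
  std::string m_name;
  std::once_flag m_once;
  std::atomic<bool> m_loaded{false};
  std::vector<std::string> m_symbols;
};

// Count symbols called `name`. If `only_module` is non-empty, only that
// module is searched, so the other modules' symbols stay unparsed.
std::size_t CountMatches(std::deque<LazyModule> &modules,
                         const std::string &name,
                         const std::string &only_module) {
  std::size_t matches = 0;
  for (LazyModule &module : modules) {
    if (!only_module.empty() && module.Name() != only_module)
      continue; // skipped modules never pay the parsing cost
    for (const std::string &symbol : module.Symbols())
      if (symbol == name)
        ++matches;
  }
  return matches;
}
```

Eagerly calling `Symbols()` on every module at load time would make the module filter purely cosmetic, which is the optimization being given up.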

Jim