Dlopen extremely slow while LLDB is attached

Hey guys,

I’m trying to track down an issue where dlopen takes significantly longer to execute when LLDB is attached than when GDB is (I’ve attached a simple program that reproduces the issue).
I was wondering if anybody had any idea what might be contributing to the additional execution time?

Running without any debugger attached:
$ ./lldb-load-sample
Handle: 0x555555768c80

Done loading. 848.27ms

$ ./lldb-load-sample
Handle: 0x555555768c80

Done loading. 19.6047ms

I noticed that the first run was significantly slower than any subsequent runs, most likely due to some caching in Linux.

For LLDB:

(lldb) file lldb-load-sample
Current executable set to ‘lldb-load-sample’ (x86_64).
(lldb) run
Process 82804 launched: ‘/lldb-load-sample’ (x86_64)
Handle: 0x555555768c80

Done loading. 5742.78ms
Process 82804 exited with status = 0 (0x00000000)
(lldb) run

Process 83454 launched: ‘/lldb-load-sample’ (x86_64)
Handle: 0x555555768c80

Done loading. 19.4184ms

Process 83454 exited with status = 0 (0x00000000)

I noticed that subsequent runs were much faster (most likely due to some caching in Linux / LLDB), but that isn’t relevant in my situation. Exiting LLDB and starting a new LLDB process still has an extremely long first run (in this case ~5.5s). There are other real-world cases (initializing Vulkan, which does a bunch of dlopens) where this can add tens of seconds, really slowing down iteration time.

For GDB:
(gdb) file lldb-load-sample
Reading symbols from a.out…done.

(gdb) run
Starting program: /lldb-load-sample

Handle: 0x555555768c80

Done loading. 79.7276ms

[Inferior 1 (process 85063) exited normally]

(gdb) run

Starting program: /lldb-load-sample

Handle: 0x555555768c80

Done loading. 80.325ms

[Inferior 1 (process 85063) exited normally]

As you can see, the first run is slightly slower than running without a debugger attached, but not enough to be noticeable.

Thanks,
Scott

lldb-load-sample.cc (761 Bytes)

Was liblldb.so built with debug information? You're probably looking at LLDB scanning the DWARF to make up its symbol table. That would be re-used on subsequent re-runs, so you're only seeing the cost that first time through. GDB may be using the standard DWARF accelerator tables, or it may be delaying the cost of the scan until you try to do something like setting a breakpoint by name.

J

No, liblldb.so didn’t have any debug information; I made sure to strip it. With debug information it was 1.7GB, which took way too long to load for a practical reproduction case.

I would venture to say LLDB is indexing the debug info during the shared library load breakpoint for some reason. GDB might not have any breakpoints or symbols to look for in the shared library, so it might not end up parsing anything. So my guess is that LLDB is looking for a symbol in each shared library that gets loaded, and the load causes LLDB to do more work. All of LLDB’s breakpoints are always looking for new locations to resolve (file-and-line breakpoints, breakpoints by name, and other plug-ins might be looking for things). You might try enabling logging with:

(lldb) log enable --timestamp --file /tmp/log.txt dwarf info
(lldb) file lldb-load-sample
(lldb) run
(lldb) quit

then see if you can see any delays inside of LLDB.

Greg

Thanks Greg and Jason for pointing me in the right direction!

The results from my original tests were due to a timeout being triggered:

1524681484.092825651 (x86_64) /tmp/liblldb.so.7.0.0: Reading EH frame info
1524681489.783293724 error: timed out, status = timed out
1524681489.783379793 this = 0x00005555573DDC80, timeout = 5000000 us
1524681489.840902615 Breakpoint::ModulesChanged: num_modules: 1 load: 1 delete_locations: 0

I ran the exact same setup on a few other machines, and none of them hit the timeout (so I'm not sure what happened with my first test; more investigation is required). Without the timeout, loading times of a stripped shared library with LLDB attached were much more reasonable.

I then ran another test on a shared library that had DWARF sections. I tracked down the majority of the extra time to be taking place inside SymbolFileDWARF::Index(), which confirms your theory of the additional time being due to indexing the DWARF sections.

I found this message (http://lists.llvm.org/pipermail/llvm-dev/2018-January/120509.html) which mentioned adding DWARF5 accelerator table support. Is it reasonable to assume once that is implemented the time it takes to index DWARF sections should be drastically reduced?

Thanks Greg and Jason for pointing me in the right direction!

The results from my original tests were due to a timeout being triggered:

1524681484.092825651 (x86_64) /tmp/liblldb.so.7.0.0: Reading EH frame info
1524681489.783293724 error: timed out, status = timed out
1524681489.783379793 this = 0x00005555573DDC80, timeout = 5000000 us
1524681489.840902615 Breakpoint::ModulesChanged: num_modules: 1 load: 1

delete_locations: 0

Hmm... that is strange. Could you also enable the "gdb-remote packets" log
so we can see what is timing out.

I ran the exact same setup on a few other machines, and none of them hit the timeout (so I'm not sure what happened with my first test; more investigation is required). Without the timeout, loading times of a stripped shared library with LLDB attached were much more reasonable.

I then ran another test on a shared library that had DWARF sections. I tracked down the majority of the extra time to be taking place inside SymbolFileDWARF::Index(), which confirms your theory of the additional time being due to indexing the DWARF sections.

I found this message (http://lists.llvm.org/pipermail/llvm-dev/2018-January/120509.html) which mentioned adding DWARF5 accelerator table support. Is it reasonable to assume once that is implemented the time it takes to index DWARF sections should be drastically reduced?

Yes, that is the idea. I got a bit side-tracked the last two weeks, but I should be back on that project soon. The current state is that the compiler support is mostly in place, and now I need to add debugger support for this. (Maybe it's obvious, but you will only get the speed-up if you use a compiler which supports this feature, which will be clang-7 at the earliest, and you will probably need to pass an extra flag to enable it.)

I messed up my test setup and incorrectly attributed the additional time. Parsing the DWARF sections does add additional time, but only increases it by a small percentage.

By pure chance I noticed that the latency between the client and server had a huge impact. I did some tests against lldb-server running on a machine with an RTT of 18ms to the client. The load time in my real-world example (initializing Vulkan) was around 16x slower (~3.5s vs ~55s). I did some more digging, and it looks like certain operations perform a number of synchronous requests to the server (DYLDRendezvous::TakeSnapshot, which updates SO entries, and ThreadList::WillResume, which grabs the SP and FP for every thread). Since all the requests are synchronous, they are extremely sensitive to increased latency.

Is this a limitation of the gdb-server (can’t handle parallel requests)?

Or is this not a common use case, and is not a known issue?

Hmm… that is strange. Could you also enable the “gdb-remote packets” log
so we can see what is timing out.

I enabled “gdb-remote all” logging and searched for all instances of ‘0x00005652ACF3F120’ (the address of the connection object that is reporting the timeout?). There seems to be a pretty good correlation between the timeouts and ‘Communication::SyncronizeWithReadThread’; unfortunately I haven’t had time to investigate further.

1525110345.916504145 0x5652acefcbb0 Broadcaster(“lldb.process”)::RestoreBroadcaster (about to pop listener(“lldb.PlatformLinux.DebugProcess.hijack”)=0x5652acf079a0)
1525110345.916502953 this = 0x00005652ACF3F120, timeout = 5000000 us

I messed up my test setup and incorrectly attributed the additional time. Parsing the DWARF sections does add additional time, but only increases it by a small percentage.

By pure chance I noticed that the latency between the client and server had a huge impact. I did some tests against lldb-server running on a machine with an RTT of 18ms to the client. The load time in my real-world example (initializing Vulkan) was around 16x slower (~3.5s vs ~55s). I did some more digging, and it looks like certain operations perform a number of synchronous requests to the server (DYLDRendezvous::TakeSnapshot, which updates SO entries, and ThreadList::WillResume, which grabs the SP and FP for every thread). Since all the requests are synchronous, they are extremely sensitive to increased latency.

Is this a limitation of the gdb-server (can't handle parallel requests)?
Or is this not a common use case, and is not a known issue?

This is a known issue, though I did not expect it to have that much of an
impact. In fact, I have trouble reconciling this fact with your earlier
statement that second and subsequent runs are much faster. The SO entry
loading is something that has to happen on every run, so I don't see why
the second run would be faster. This would be more consistent with the
debug-info parsing case, as there we only index the dwarf once (if it
hasn't changed). So, I think we are missing something here.

In any case, this is not a fundamental limitation, and there are ways to remove it. The most obvious one is to move the rendezvous structure parsing to the server -- there is even a gdb packet for that; I don't know its name off-hand. Currently we have support for that in the client (for communicating with stubs that support it), but not in lldb-server.

For the register reading part, we usually make sure we send the "important" registers in a batch, so that the client does not have to request each one separately. At the moment it's only the PC, because that used to be enough at some point. I don't know if anything changed in the client to make it ask for more info, but this is something that needs to be looked at more closely.

I enabled "gdb-remote all" logging and searched for all instances of '0x00005652ACF3F120' (the address of the connection object that is reporting the timeout?). There seems to be a pretty good correlation between the timeouts and 'Communication::SyncronizeWithReadThread'; unfortunately I haven't had time to investigate further.

1525110345.916504145 0x5652acefcbb0 Broadcaster("lldb.process")::RestoreBroadcaster (about to pop listener("lldb.PlatformLinux.DebugProcess.hijack")=0x5652acf079a0)
1525110345.916502953 this = 0x00005652ACF3F120, timeout = 5000000 us
--
1525110345.919557333 0x7f0868000940 'Communication::SyncronizeWithReadThread' Listener::FindNextEventInternal(broadcaster=(nil), broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event 0x7f086c0008c0
1525110345.919566154 this = 0x00005652ACF3F120, timeout = 5000000 us
--
1525110346.123922110 0x7f0868000d10 'Communication::SyncronizeWithReadThread' Listener::FindNextEventInternal(broadcaster=(nil), broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event 0x7f086c0008c0
1525110346.123931408 this = 0x00005652ACF3F120, timeout = 5000000 us
--
1525110346.152676821 0x7f0868006710 'Communication::SyncronizeWithReadThread' Listener::FindNextEventInternal(broadcaster=(nil), broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event 0x7f086c0008c0
1525110346.152685642 this = 0x00005652ACF3F120, timeout = 5000000 us
--
1525110346.167683363 0x7f08682b2fe0 'Communication::SyncronizeWithReadThread' Listener::FindNextEventInternal(broadcaster=(nil), broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event 0x7f086c0008c0
1525110346.167692184 this = 0x00005652ACF3F120, timeout = 5000000 us

I think the timeout question is a red herring, tbh. These two are correlated because they are both things that we need to do when the process stops (flush its STDIO after it stops, and start the read thread after we resume it). They run fast, and there are no timeouts involved.

--
1525110351.172777176 error: timed out, status = timed out, uri =
1525110351.172847271 this = 0x00005652ACF3F120, timeout = 5000000 us
1525110356.173308611 error: timed out, status = timed out, uri =
1525110356.173368216 this = 0x00005652ACF3F120, timeout = 5000000 us
1525110361.175591230 error: timed out, status = timed out, uri =
1525110361.175647497 this = 0x00005652ACF3F120, timeout = 5000000 us
1525110366.180710316 error: timed out, status = timed out, uri =
1525110366.180769205 this = 0x00005652ACF3F120, timeout = 5000000 us

And I bet these happen while the process is running, so they do not impact
the latency at all. It's just us waiting for the process to stop in a loop.

This is a known issue, though I did not expect it to have that much of an
impact. In fact, I have trouble reconciling this fact with your earlier
statement that second and subsequent runs are much faster. The SO entry
loading is something that has to happen on every run, so I don’t see why
the second run would be faster. This would be more consistent with the
debug-info parsing case, as there we only index the dwarf once (if it
hasn’t changed). So, I think we are missing something here.

My statement that subsequent runs are much faster was for my simplified example (only running locally) that was meant to reproduce the problem I was seeing when initializing Vulkan. When I went back and tested initializing Vulkan (remotely) with SymbolFileDWARF::Index() commented out it was still very slow, and I realized I was on the wrong path.

My latency to the server is normally only a few ms, which is why I initially ignored that difference between the two test environments. But it turns out a few ms is enough to make it twice as slow as running locally, which is what initially kicked off my investigation. I didn’t catch this earlier because I was previously using GDB locally and only recently switched to using LLDB remotely, so I had never compared LLDB local vs. LLDB remote. My latest test against a server with an RTT of 18ms was to confirm that latency was the contributing factor.