lldb stops on every call to dlopen

Hello lldb developers, I am running into a problem with lldb on Linux. I am currently running llvm 6.0.0.

I have an executable that dynamically loads a large number of shared libraries at runtime. These are explicitly loaded via dlopen (they are specified in a configuration file), and after loading a few (typically a dozen or so, but the number varies) lldb will halt during dlopen. If I continue, it will load a few more then halt again, which makes debugging from startup impractical since there are so many libraries to be loaded (more than a hundred of them).

When I build and debug this same C++ on macOS, the debugger works fine. I have verified that target.process.stop-on-sharedlibrary-events is false. I turned on dyld logging and I see lots of log messages about RendezvousBreakpoint being hit, but I don’t see anything that sheds light on why some libraries load without stopping but others don’t.

I have tried to recreate this in a trivial program that calls dlopen in a loop, but haven’t been able to reproduce.

Can your offer any suggestions for further debugging this? More supporting evidence follows.

Here is the message when the debugger stops:

Process 120004 stopped

  • thread #1, name = ‘xxxxxxxx’, stop reason = trace
    frame #0: 0x00002aaaacfca6a0 libc.so.6__restore_rt libc.so.6__restore_rt:
    → 0x2aaaacfca6a0 <+0>: movq $0xf, %rax
    0x2aaaacfca6a7 <+7>: syscall
    0x2aaaacfca6a9 <+9>: nopl (%rax)

libc.so.6`__libc_sigaction:
0x2aaaacfca6b0 <+0>: subq $0xd0, %rsp

I do not have the stop on shared library events setting enabled:

(lldb) settings show target.process.stop-on-sharedlibrary-events
target.process.stop-on-sharedlibrary-events (boolean) = false

The backtrace goes back to dlopen:

(lldb) bt

  • thread #1, name = ‘xxxxx’, stop reason = trace
  • frame #0: 0x00002aaaacfca6a0 libc.so.6__restore_rt frame #1: 0x00002aaaaaab9eb0 ld-linux-x86-64.so.2 frame #2: 0x00002aaaaaabdc53 ld-linux-x86-64.so.2dl_open_worker + 499
    frame #3: 0x00002aaaaaab9286 ld-linux-x86-64.so.2_dl_catch_error + 102 frame #4: 0x00002aaaaaabd63a ld-linux-x86-64.so.2_dl_open + 186
    frame #5: 0x00002aaaac39df66 libdl.so.2dlopen_doit + 102 frame #6: 0x00002aaaaaab9286 ld-linux-x86-64.so.2_dl_catch_error + 102
    frame #7: 0x00002aaaac39e29c libdl.so.2_dlerror_run + 124 frame #8: 0x00002aaaac39dee1 libdl.so.2__dlopen_check + 49

the dyld debug log has a lot of this:

209 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit pid 153501 stop_when_images_change=false
210 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit called for pid 153501
211 intern-state DYLDRendezvous::Resolve address size: 8, padding 4
212 intern-state DYLDRendezvous::Resolve cursor = 0x2aaaaaccc160
213 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit pid 153501 stop_when_images_change=false
214 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit called for pid 153501
215 intern-state DYLDRendezvous::Resolve address size: 8, padding 4
216 intern-state DYLDRendezvous::Resolve cursor = 0x2aaaaaccc160

thanks,
–steve

In the woods too, a man casts off his years, as the snake his slough, and at what period soever of life, is always a child.

[+lldb-dev]

Hello Steve,

thanks for the report.

The fact that you see the rendezvous breakpoint being hit many times is not
surprising. We get those every time the library is loaded (we need that to
load relevant debug info and set potential breakpoints). However, they
should generally not be surfaced to the user (unless you have the
stop-on-sharedlibrary-events setting set, which you don't).

The part that is suspicious to me is that __restore_rt shows up on the top
of the backtrace. This is a trampoline used to return from signal handlers,
and it would seem to indicate that you got some sort of a signal while
loading the libraries. I don't know why this would happen, but it could be
that this is confusing lldb's auto-resume logic.

The interesting part to see here is what lldb thinks are the stop reasons
for individual threads in the process (is the process multi-threaded?) for
the last couple of stops. The "lldb step" and "gdb-remote packets" log
categories are the most interesting to observe here. If you are able to
send me the log traces, I can help you interpret them.

regards,
pavel

Hello lldb developers, I am running into a problem with lldb on Linux. I

am currently running llvm 6.0.0.

I have an executable that dynamically loads a large number of shared

libraries at runtime. These are explicitly loaded via dlopen (they are
specified in a configuration file), and after loading a few (typically a
dozen or so, but the number varies) lldb will halt during dlopen. If I
continue, it will load a few more then halt again, which makes debugging
from startup impractical since there are so many libraries to be loaded
(more than a hundred of them).

When I build and debug this same C++ on macOS, the debugger works fine.

I have verified that target.process.stop-on-sharedlibrary-events is false.
I turned on dyld logging and I see lots of log messages about
RendezvousBreakpoint being hit, but I don’t see anything that sheds light
on why some libraries load without stopping but others don’t.

I have tried to recreate this in a trivial program that calls dlopen in a

loop, but haven’t been able to reproduce.

Can your offer any suggestions for further debugging this? More

supporting evidence follows.

Here is the message when the debugger stops:

Process 120004 stopped
* thread #1, name = ‘xxxxxxxx', stop reason = trace
     frame #0: 0x00002aaaacfca6a0 libc.so.6`__restore_rt
libc.so.6`__restore_rt:
-> 0x2aaaacfca6a0 <+0>: movq $0xf, %rax
     0x2aaaacfca6a7 <+7>: syscall
     0x2aaaacfca6a9 <+9>: nopl (%rax)

libc.so.6`__libc_sigaction:
     0x2aaaacfca6b0 <+0>: subq $0xd0, %rsp

I do not have the stop on shared library events setting enabled:

(lldb) settings show target.process.stop-on-sharedlibrary-events
target.process.stop-on-sharedlibrary-events (boolean) = false

The backtrace goes back to dlopen:

(lldb) bt
* thread #1, name = ‘xxxxx', stop reason = trace
   * frame #0: 0x00002aaaacfca6a0 libc.so.6`__restore_rt
     frame #1: 0x00002aaaaaab9eb0 ld-linux-x86-64.so.2
     frame #2: 0x00002aaaaaabdc53 ld-linux-x86-64.so.2`dl_open_worker + 499
     frame #3: 0x00002aaaaaab9286 ld-linux-x86-64.so.2`_dl_catch_error +

102

     frame #4: 0x00002aaaaaabd63a ld-linux-x86-64.so.2`_dl_open + 186
     frame #5: 0x00002aaaac39df66 libdl.so.2`dlopen_doit + 102
     frame #6: 0x00002aaaaaab9286 ld-linux-x86-64.so.2`_dl_catch_error +

102

     frame #7: 0x00002aaaac39e29c libdl.so.2`_dlerror_run + 124
     frame #8: 0x00002aaaac39dee1 libdl.so.2`__dlopen_check + 49

the dyld debug log has a lot of this:
209 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit pid

153501 stop_when_images_change=false

210 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit

called for pid 153501

211 intern-state DYLDRendezvous::Resolve address size: 8, padding 4
212 intern-state DYLDRendezvous::Resolve cursor = 0x2aaaaaccc160
213 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit pid

153501 stop_when_images_change=false

214 intern-state DynamicLoaderPOSIXDYLD::RendezvousBreakpointHit

called for pid 153501

215 intern-state DYLDRendezvous::Resolve address size: 8, padding 4
216 intern-state DYLDRendezvous::Resolve cursor = 0x2aaaaaccc160

thanks,
--steve

 In the woods too, a man casts off his years, as the snake his slough,

and at what period soever of life, is always a child.

It is interesting that the stop reason on the thread that stopped is "trace". That's what you would expect returning from the single-step to step over the breakpoint. But it looks like we got a signal while single-stepping, but the stop reason was misreported by somebody.

Jim

Pavel asked for a dump of gdb-remote commands. I got that and ran it through the gdbremote decoder, and trimmed to include what looks like the last successful continue after breakpoint and then the halt on dlopen. Both cases stop on signal 5.

After the stop message the debugger issues two binary reads and then apparently makes the decision that it should stop rather than continue. The stopping case is missing the equivalent of "Element 1: Single stepping past breakpoint site 2 at 0x2aaaaaab9eb0” which is in the continuing case. I’ve attached the file here:

out (23.5 KB)

It's a bit of a wild guess, but is it possible that you (or one of the
libraries you use) are doing anything with signals (SIGALRM or such?). I
think I remember looking at the code handling the server-side ignored
signal handling and thinking that it could go wrong if you get a signal
while doing a instruction-step. I am not sure it fully applies here as the
last command that lldb client did was a "continue", but i think it has to
have something to do with signals, as you end up stopped in a signal
handler.

Could you try the following sequence of commands?
(lldb) process launch --stop-at-entry-point
(lldb) process handle --notify true --stop true #Stop on all signals
(lldb) continue

and let us know if you see any extra stops due to signals. If that doesn't
find anything then I think we'll have start pulling logs from the
lldb-server side, as there doesn't seem to be anything wrong with the
client. The easiest way to achieve that is to do a
export LLDB_SERVER_LOG_CHANNELS="posix all:gdb-remote packets"
export LLDB_DEBUGSERVER_LOG_FILE=/tmp/server.log
before launching lldb.

Pavel asked for a dump of gdb-remote commands. I got that and ran it

through the gdbremote decoder, and trimmed to include what looks like the
last successful continue after breakpoint and then the halt on dlopen.
Both cases stop on signal 5.

After the stop message the debugger issues two binary reads and then

apparently makes the decision that it should stop rather than continue.
The stopping case is missing the equivalent of "Element 1: Single stepping
past breakpoint site 2 at 0x2aaaaaab9eb0” which is in the continuing case.
I’ve attached the file here:

thanks,
--steve

 From the ashes of disaster grow the roses of success.

It is interesting that the stop reason on the thread that stopped is

"trace". That's what you would expect returning from the single-step to
step over the breakpoint. But it looks like we got a signal while
single-stepping, but the stop reason was misreported by somebody.

Jim

[+lldb-dev]

Hello Steve,

thanks for the report.

The fact that you see the rendezvous breakpoint being hit many times is

not

surprising. We get those every time the library is loaded (we need that to
load relevant debug info and set potential breakpoints). However, they
should generally not be surfaced to the user (unless you have the
stop-on-sharedlibrary-events setting set, which you don't).

The part that is suspicious to me is that __restore_rt shows up on the top
of the backtrace. This is a trampoline used to return from signal

handlers,

Ding! That’s it. This program does use SIGALRM. I wouldn’t have thought it would be enabled at this point, but apparently it is because I got lots of sigalarm halts during the .so loading.

Further, if I run in lldb with

process handle -n false -p false -s false SIGALRM

then the debugger seems to run fine without stopping during dlopen(). The SIGALRM itself isn’t important to the operation of the program.

With this knowledge I can probably create a simple testcase. Would this be considered an lldb bug? If so should I submit a testcase in some way?

thanks!
–steve

The lover of nature is he whose inward and outward senses are still truly adjusted to each other; who has retained the spirit of infancy even into the era of manhood.

+Eugene, who wrote the signal-skipping code

Cool, I'm glad we got to the bottom of this.

At this point, I *think* we should be able to reproduce the issue
ourselves, but a self-contained reproducer would definitely make it easier
for us and would be appreciated, particularly as there are still a few
things I don't understand here (e.g., I think this should only matter only
after an instruction-step, but your last log seems to show it happens after
a "continue").

In any case, I do have a couple of follow-up questions:
- How are you sending the SIGALRM and with what frequency?
- Are you using SIGALRM on macOS as well (implying we handle it correctly
there)?

Also, you may know this already, but I feel obligated to point out that
setting "--pass false" will change the behavior of your program.

cheers,
pavel

Ding! That’s it. This program does use SIGALRM. I wouldn’t have

thought it would be enabled at this point, but apparently it is because I
got lots of sigalarm halts during the .so loading.

Further, if I run in lldb with

process handle -n false -p false -s false SIGALRM

then the debugger seems to run fine without stopping during dlopen().

The SIGALRM itself isn’t important to the operation of the program.

With this knowledge I can probably create a simple testcase. Would this

be considered an lldb bug? If so should I submit a testcase in some way?

thanks!
--steve

 The lover of nature is he whose inward and outward senses are still

truly adjusted to each other; who has retained the spirit of infancy even
into the era of manhood.

Hi Pavel, Eugene, here is a testcase that demonstrates the problem. The code that sets up sigalrm in main() is the same as the program I have been debugging except that I’ve reduced the interval in this testcase. In the original, as per the comment, the interval is .8 seconds.

When I run this under macOS lldb it runs to the end and exits normally without stopping. If I run it under Linux it stops 5 or 6 times with "stop reason = trace”

I compile with

clang test.c -ldl

on both platforms.

cheers,
--steve

#include <dlfcn.h>
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static int alarms=0;

void statistics_sigalarm_handler (int x)
{
  alarms++;
}

void loadso(char *path)
{
  void *handlep;
  handlep = dlopen(path, RTLD_LAZY | RTLD_LOCAL);
  if (handlep == NULL) {
    printf("dlopen failed %s\n", dlerror());
  }
}

int main()
{
  signal(SIGALRM, statistics_sigalarm_handler);

  struct itimerval itimerval;
  struct timeval timeval;
  timeval.tv_sec = 0;
  timeval.tv_usec = 10000; // 8 tenths of a second

  itimerval.it_interval = timeval;
  itimerval.it_value = timeval;
  int r = setitimer(ITIMER_REAL, &itimerval, 0);
  if (r) perror("setitimer");

  for (int i=0; i<=100; i++) {
#ifdef __APPLE__
    loadso(“some_macos_library.dylib");
    loadso(“some_other_macos_library.dylib");
#else
    loadso(“some_linux_library.so");
    loadso(“some_other_linux_library.so");
#endif
  }
  signal(SIGALRM, SIG_DFL);
  printf("Got %d signals\n", alarms);
}