Unreliable process attach on Linux

Consider this example program:

#include <err.h>
#include <sys/wait.h>
#include <unistd.h>

#include <lldb/API/SBDebugger.h>
#include <lldb/API/SBProcess.h>
#include <lldb/API/SBTarget.h>

int
main(void)
{
  // Target process for the debugger.
  pid_t pid = fork();
  if (pid < 0)
    err(1, "fork");
  if (pid == 0)
    while (true)
      pause();

  lldb::SBDebugger::Initialize();
  {
    auto debugger(lldb::SBDebugger::Create());
    if (!debugger.IsValid())
      errx(1, "SBDebugger::Create failed");

    auto target(debugger.CreateTarget(nullptr));
    if (!target.IsValid())
      errx(1, "SBDebugger::CreateTarget failed");

    lldb::SBAttachInfo attachinfo(pid);
    lldb::SBError error;
    auto process(target.Attach(attachinfo, error));
    if (!process.IsValid())
      errx(1, "SBTarget::Attach failed: %s", error.GetCString());
    error = process.Detach();
    if (error.Fail())
      errx(1, "SBProcess::Detach failed: %s", error.GetCString());
  }
  lldb::SBDebugger::Terminate();

  if (kill(pid, SIGKILL) != 0)
    err(1, "kill");
  if (waitpid(pid, NULL, 0) < 0)
    err(1, "waitpid");

  return 0;
}

Run it in a loop like this:

$ while ./test-attach ; do date; done

On Linux x86-64 (Fedora 29), with LLDB 7 (lldb-7.0.0-1.fc29.x86_64) and
kernel 4.19.12 (kernel-4.19.12-301.fc29.x86_64), after 100 iterations or
so, attaching to the newly created process fails:

test-attach: SBTarget::Attach failed: lost connection

This also reproduces occasionally with LLDB itself (with “lldb -p PID”).

Any suggestions how to get more information about the cause of this
error?

Thanks,
Florian

FYI after 30000 runs it still runs fine with your reproducer both with system
lldb-devel-7.0.0-1.fc29.x86_64 and COPR
lldb-experimental-devel-8.0.0-0.20190102snap0.fc29.x86_64 (=trunk), part
running without /usr/lib/debug and part with.

Fedora 29 x86_64 + kernel-4.19.10-300.fc29.x86_64

(I haven't investigated the code why it could fail this way.)

Jan

* Jan Kratochvil:

Run it in a loop like this:

$ while ./test-attach ; do date; done

On Linux x86-64 (Fedora 29), with LLDB 7 (lldb-7.0.0-1.fc29.x86_64) and
kernel 4.19.12 (kernel-4.19.12-301.fc29.x86_64), after 100 iterations or
so, attaching to the newly created process fails:

test-attach: SBTarget::Attach failed: lost connection

FYI after 30000 runs it still runs fine with your reproducer both with system
lldb-devel-7.0.0-1.fc29.x86_64 and COPR
lldb-experimental-devel-8.0.0-0.20190102snap0.fc29.x86_64 (=trunk), part
running without /usr/lib/debug and part with.

Well, that's odd. Shall I try to reproduce this on a lab machine?

Fedora 29 x86_64 + kernel-4.19.10-300.fc29.x86_64

(I haven't investigated the code why it could fail this way.)

First, I want to get more logging data out of LLDB. Maybe this will
tell us where things go wrong.

Thanks,
Florian

I’d be curious to see if the PID of the process that is failed to attach to is the same as one of the PIDs of a process that was previously attached to (and if so, if it is the first such case where a PID is recycled).

* Zachary Turner:

I'd be curious to see if the PID of the process that is failed to
attach to is the same as one of the PIDs of a process that was
previously attached to (and if so, if it is the first such case where
a PID is recycled).

I added logging of the PID, and got this (the failure happened rather
quickly this time):

$ while ./test-attach ; do date ; done
PID: 16658
Sat Jan 5 21:28:28 CET 2019
PID: 16831
Sat Jan 5 21:28:29 CET 2019
PID: 17006
Sat Jan 5 21:28:30 CET 2019
PID: 17176
Sat Jan 5 21:28:30 CET 2019
PID: 17351
Sat Jan 5 21:28:31 CET 2019
PID: 17526
Sat Jan 5 21:28:32 CET 2019
PID: 17700
test-attach: SBTarget::Attach failed: lost connection

So there isn't any evidence of PID reuse. My lldb -p test case also
triggers this with a long-running process, so there isn't any PID reuse
there either.

Thanks,
Florian

I would recommend enabling gdb-remote logging (so something like: debugger.HandleCommand("log enable gdb-remote packets")) to see at which stage do we actually lose the gdb-server connection.

My best bet would be that on your machine/build the server is slower than usual in responding to one of the client packets and that causes the connection to be dropped.

cheers,
pavel

* Pavel Labath:

Consider this example program:

#include <err.h>
#include <sys/wait.h>
#include <unistd.h>

#include <lldb/API/SBDebugger.h>
#include <lldb/API/SBProcess.h>
#include <lldb/API/SBTarget.h>

int
main(void)
{
   // Target process for the debugger.
   pid_t pid = fork();
   if (pid < 0)
     err(1, "fork");
   if (pid == 0)
     while (true)
       pause();

   lldb::SBDebugger::Initialize();
   {
     auto debugger(lldb::SBDebugger::Create());
     if (!debugger.IsValid())
       errx(1, "SBDebugger::Create failed");

     auto target(debugger.CreateTarget(nullptr));
     if (!target.IsValid())
       errx(1, "SBDebugger::CreateTarget failed");

     lldb::SBAttachInfo attachinfo(pid);
     lldb::SBError error;
     auto process(target.Attach(attachinfo, error));
     if (!process.IsValid())
       errx(1, "SBTarget::Attach failed: %s", error.GetCString());
     error = process.Detach();
     if (error.Fail())
       errx(1, "SBProcess::Detach failed: %s", error.GetCString());
   }
   lldb::SBDebugger::Terminate();

   if (kill(pid, SIGKILL) != 0)
     err(1, "kill");
   if (waitpid(pid, NULL, 0) < 0)
     err(1, "waitpid");

   return 0;
}

Run it in a loop like this:

$ while ./test-attach ; do date; done

On Linux x86-64 (Fedora 29), with LLDB 7 (lldb-7.0.0-1.fc29.x86_64) and
kernel 4.19.12 (kernel-4.19.12-301.fc29.x86_64), after 100 iterations or
so, attaching to the newly created process fails:

test-attach: SBTarget::Attach failed: lost connection

This also reproduces occasionally with LLDB itself (with “lldb -p PID”).

Any suggestions how to get more information about the cause of this
error?

I would recommend enabling gdb-remote logging (so something like:
debugger.HandleCommand("log enable gdb-remote packets")) to see at
which stage do we actually lose the gdb-server connection.

Thanks. I enabled logging like this:

    auto debugger(lldb::SBDebugger::Create());
    if (!debugger.IsValid())
      errx(1, "SBDebugger::Create failed");

    debugger.HandleCommand("log enable gdb-remote packets");

    auto target(debugger.CreateTarget(nullptr));
    if (!target.IsValid())
      errx(1, "SBDebugger::CreateTarget failed");

And here's the output I get:

test-attach < 1> send packet: +
test-attach history[1] tid=0x1cab < 1> send packet: +
test-attach < 19> send packet: $QStartNoAckMode#b0
test-attach < 1> read packet: +
test-attach < 6> read packet: $OK#9a
test-attach < 1> send packet: +
test-attach < 41> send packet: $qSupported:xmlRegisters=i386,arm,mips#12
test-attach < 124> read packet: $PacketSize=20000;QStartNoAckMode+;QThreadSuffixSupported+;QListThreadsInStopReply+;qEcho+;QPassSignals+;qXfer:auxv:read+#be
test-attach < 26> send packet: $QThreadSuffixSupported#e4
test-attach < 6> read packet: $OK#9a
test-attach < 27> send packet: $QListThreadsInStopReply#21
test-attach < 6> read packet: $OK#9a
test-attach < 13> send packet: $qHostInfo#9b
test-attach < 11> send packet: $qEcho:1#5b
test-attach: SBTarget::Attach failed: lost connection

Florian

Thanks. I think this is what I suspected. The server is extremely slow in responding to the qHostInfo packet. This timeout for this was recently increased to 10 seconds, but it looks like 7.0 still has the default (1 second) timeout.

If you don't want to recompile or update, you should be able to work around this by increasing the default timeout with the following command "settings set plugin.process.gdb-remote.packet-timeout 10".

regards,
pavel

* Pavel Labath:

Thanks. I think this is what I suspected. The server is extremely slow
in responding to the qHostInfo packet. This timeout for this was
recently increased to 10 seconds, but it looks like 7.0 still has the
default (1 second) timeout.

If you don't want to recompile or update, you should be able to work
around this by increasing the default timeout with the following
command "settings set plugin.process.gdb-remote.packet-timeout 10".

I see, that helps.

There's a host name in the qHostInfo response? Where's the code that
determines the host name? On the other end? I wonder if it performs a
DNS lookup. That could explain the delay.

Thanks,
Florian

Yes, there's a dns lookup being done on the other end. TBH, I'm not really sure what's it being used for. Maybe we should try deleting the hostname field from the qHostInfo response (or just put an IP address there).

* Pavel Labath:

Yes, there's a dns lookup being done on the other end. TBH, I'm not
really sure what's it being used for. Maybe we should try deleting the
hostname field from the qHostInfo response (or just put an IP address
there).

Or use the system host name without resorting to DNS (using uname or
gethostname on GNU/Linux). The DNS lookup is really surprising.

Thanks,
Florian