remote server crash -> lldb waits forever in accept()

Hi Chris, Pavel,

I've got a problem launching the hexagon simulator from lldb. It's launched
the same way that debugserver/lldb-server is launched, with reverse connect
on a TCP socket. The user can modify the simulator command line (using
target.run-args), and can easily give a set of options that the simulator
can't handle. In this case the simulator quits before connecting to lldb,
and lldb will wait forever for the connection, since TCPSocket::Accept has
no timeout.

Currently I have a select call before the accept, to make sure there is
activity on the socket. This doesn't feel right with the new changes using
MainLoop, so I wanted to see what the list thinks. I believe it will happen
any time that debugserver/lldb-server quits or crashes before connecting.
That should be easy to test.

This is what I use, right before the call to accept_loop.Run in
TCPSocket::accept:

    NativeSocket accept_socket = -1;
    fd_set readset;
    FD_ZERO(&readset);
    for (auto socket : m_listen_sockets) {
      auto fd = socket.first;
      FD_SET(fd, &readset);
      if (fd > accept_socket)
        accept_socket = fd;
    }
    struct timeval timeout = {10, 0};
    int result = ::select(accept_socket + 1, &readset, NULL, NULL,
&timeout);
    if (result == 0) {
      printf("error: timeout waiting for remote server to connect!\n");
      error.SetErrorString("timeout waiting for remote server to connect");
      return error;
    } else if (result < 0) {
      printf("error: remote server does not exist!\n");
      SetLastError(error);
      return error;
    }

Any thoughts this issue?

Ted

Seems to me a better UI would be to make ^C interrupt this wait. That seems to me better than putting in some arbitrary timeout.

Jim

Ultimately I think the solution here is two changes

(1) We should add a timeout to MainLoop so that it doesn't wait forever unless we really want to wait forever.
(2) MainLoop can exit on sigint for any platform that has ppoll, pselect, or kevent, so we should probably set that up too.

-Chris

(1) is okay, except pretty much every time so far I've ever said "operation X can't possibly take more than N seconds" somebody finds a case in which that assumption was wrong. So you end up having to make the timeouts so long they look like stalls anyway... An explicit cancellation gesture is better if supportable.

Jim

I think in the common case of listening for a remote connection infinite (or very very long) timeout with signal interrupt is the preferred approach. There are other situations where we use SelectHelper with smaller timeouts, and I think ultimately we should replace SelectHelper with MainLoop because they do basically the same thing.

-Chris

MainLoop was meant to be a general event multiplexer. One of those
events can certainly be "a certain amount of time expiring".
So, you could write something like:
loop.RegisterAlarm(seconds(N), [&] { loop.RequestTermination() });
which would translate to an appropriate timeout argument to ppoll.

(2) MainLoop can exit on sigint for any platform that has ppoll, pselect, or kevent, so we should probably set that up too.

Listening for signals is very complicated in multithreaded programs. I
am not sure about the situation with kevent, but I am sure the ppoll
based version will not work reliably for this.
Then there is also the problem of installing a signal handler in a
shared library (liblldb), which can conflict with whatever is the host
program doing.

I would love to have this functionality, as I think it's badly needed
in a lot of places, but I think it needs to be done the long way round
and have a sort of an Interrupt function on the SBAPI level (maybe the
existing SBDebugger.DispatchInputInterrupt would be enough, although I
am not convinced of the signal-safety of that function), which would
then trigger an exit from the main loop where necessary. lldb and
other host programs could then install their own sigint handlers (lldb
already does btw), and call this function when appropriate.

Let me throw another option out there:
if lldb-server fails to start because of a borked command line you
should be able to detect this (waitpid). So then again you are in the
business of multiplexing two events (accept() and waitpid()) and you
don't need timeouts. (It will be tricky to multiplex waitpid properly
without installing a SIGCHLD handler though).

MainLoop was meant to be a general event multiplexer. One of those
events can certainly be "a certain amount of time expiring".
So, you could write something like:
loop.RegisterAlarm(seconds(N), [&] { loop.RequestTermination() });
which would translate to an appropriate timeout argument to ppoll.

(2) MainLoop can exit on sigint for any platform that has ppoll, pselect, or kevent, so we should probably set that up too.

Listening for signals is very complicated in multithreaded programs. I
am not sure about the situation with kevent, but I am sure the ppoll
based version will not work reliably for this.
Then there is also the problem of installing a signal handler in a
shared library (liblldb), which can conflict with whatever is the host
program doing.

I would love to have this functionality, as I think it's badly needed
in a lot of places, but I think it needs to be done the long way round
and have a sort of an Interrupt function on the SBAPI level (maybe the
existing SBDebugger.DispatchInputInterrupt would be enough, although I
am not convinced of the signal-safety of that function), which would
then trigger an exit from the main loop where necessary. lldb and
other host programs could then install their own sigint handlers (lldb
already does btw), and call this function when appropriate.

Yes, this is certainly the right way to do it. That's how the lldb driver does things. If we run into troubles with this doing more work than it should in the signal handler, we can make something cleverer. But the set of things that are in theory signal safe is a small subset of the set of things that are in practice are signal safe. When I last worked on gdbtk (that was way back in the day), it used to do all the updates of the GUI in a SIGALRM handler. That should never have worked in theory yet in practice it worked just fine.

Let me throw another option out there:
if lldb-server fails to start because of a borked command line you
should be able to detect this (waitpid). So then again you are in the
business of multiplexing two events (accept() and waitpid()) and you
don't need timeouts. (It will be tricky to multiplex waitpid properly
without installing a SIGCHLD handler though).

You can do this with kqueues on systems that support them.

Jim