race condition using gdb-remote over ssh port forwarding

Greetings, I’ve been using liblldb to remotely debug to a linux server with port forwarding. To do this, I start lldb-server to with --listen specifying a localhost port, as well as with ----min-gdbserver-port and --max-gdbserver-port to specify a specific port for use by ‘gdb remote’. Both ports are forwarded to the remote PC, where liblldb connects to localhost.

This generally works fine, but there is a race condition. When the client tells lldb-server to start gdb-remote, the port is returned to the client which may try to connect before the gdb-remote process is actually listening. Without port-forwarding, this is okay because the client has retry logic:

ProcessGDBRemote::ConnectToDebugserver

retry_count++;
if (retry_count >= max_retry_count)
break;
usleep(100000);

But with port-forwarding, the initial connection is always accepted by the port-forwarder, and only then does it try to establish a connection to the remote port. It has no way to not accept the incoming local connection until it tries the remote end.

lldb has some logic to detect this further in the function, by using a handshake to ensure the connection is actually made:

// We always seem to be able to open a connection to a local port
// so we need to make sure we can then send data to it. If we can’t
// then we aren’t actually connected to anything, so try and do the
// handshake with the remote GDB server and make sure that goes
// alright.
if (!m_gdb_comm.HandshakeWithServer(&error)) {
m_gdb_comm.Disconnect();
if (error.Success())
error.SetErrorString(“not connected to remote gdb server”);
return error;
}

But the problem here is that no retry is performed on failure. The caller to the ‘attach’ API also can’t retry because the gdb server is terminated on the error.

I would like to submit a patch, but first check to see if this solution would be acceptable:

  • Include the handshake within the connection retry loop.
  • This means fully disconnecting the re-establishing the connection in the loop if the handshake fails.
  • Changing the timeout check to be based on a total absolute time instead of 50 iterations with a 100ms sleep.

Thoughts?

Alternatives could be:

  • Have lldb-server delay responding to the ‘start gdb server’ request until it could tell (somehow) that the process is listening.
  • A sleep of some kind on the client side after starting the server but before trying to connect.

Thanks,
Chris

When I wrote the code I was assuming that when the platform lldb-server responds with the port that the gdb-remote would be ready to receive packets right away, so I can see how and why this is happening. Seems like we have the retry stuff under control when we don't get a connection right away, but we should fix this such that when we hand the port back to LLDB from platform lldb-server, it should be listening and ready to accept a connection right away with no delays needed, though this can wait until later since it currently works.

What kind of port forwarding are you using? The main issue is I would assume that when someone tries to connect to a port on the port forwarder that it would fail to connect if it isn't able to connect on the other side. So this really comes down to a question of what a standard port forwarder's contract should really be.

If anyone has extensive experience in port forward tech, please chime in.

Short answer: not sure what the right solution is as it depends on what proper port forwarding etiquette is.

Greg

Not sure if it is applicable in this case but GDB has this feature:
  Server (Debugging with GDB)
  (gdb) target remote | ssh -T hostname gdbserver - hello

Jan

I have been using both a custom port-forwarding program as well as OpenSSH. The custom solution is a bit slower because it goes through a proxy, so the problem doesn’t manifest. I’ve been trying to switch to OpenSSH which is fast enough to consistent hit the race condition.

I would expect other solutions to be similar to OpenSSH because I don’t think there’s a way to have the client connection fail based on the result of a remote connection. Our own forwarding solution was built using this method because of this.

I agree that fixing this on the server seems like a better solution, since the client only gets a port when it has one to connect to. But I don’t know the best way to make this happen from lldb-server after launching the gdb instance.

I have been using both a custom port-forwarding program as well as OpenSSH. The custom solution is a bit slower because it goes through a proxy, so the problem doesn’t manifest. I’ve been trying to switch to OpenSSH which is fast enough to consistent hit the race condition.

I would expect other solutions to be similar to OpenSSH because I don’t think there’s a way to have the client connection fail based on the result of a remote connection. Our own forwarding solution was built using this method because of this.

I agree that fixing this on the server seems like a better solution, since the client only gets a port when it has one to connect to. But I don’t know the best way to make this happen from lldb-server after launching the gdb instance.

You might need to connect to it from the platform lldb-server and then disconnect and have the gdb-server listen for another connection. Maybe if you pass an extra argument for the double connect?

Greetings, I've been using liblldb to remotely debug to a linux server with
port forwarding. To do this, I start lldb-server to with --listen
specifying a localhost port, as well as with ----min-gdbserver-port and
--max-gdbserver-port to specify a specific port for use by 'gdb remote'.
Both ports are forwarded to the remote PC, where liblldb connects to
localhost.

This generally works fine, but there is a race condition. When the client
tells lldb-server to start gdb-remote, the port is returned to the client
which may try to connect before the gdb-remote process is actually
listening.

Are you sure this is what's happening? Looking at lldb-server source
code, I see that it starts listening on lldb-gdbserver.cpp:320
      error = acceptor_up->Listen(1);
      if (error.Fail()) {
        fprintf(stderr, "failed to listen: %s\n", error.AsCString());
        exit(1);
      }
and the port number is written out only later, on line 334:
      error = acceptor_up->Listen(1);
      if (error.Fail()) {
        fprintf(stderr, "failed to listen: %s\n", error.AsCString());
        exit(1);
      }

After listen returns the connection by your port forwarder should not
be rejected, and it can only connect once it knows the port to connect
to.

That's not to say that connecting through port forwarders is easy. For
android debugging, we also have to setup port forwarding with adb, and
we need to do some quite elaborate dances to make sure that the
connection is reliable and race free (see
PlatformAndroidRemoteGDBServer::MakeConnectURL) -- however this is to
resolve a different issue (allocating the port on the local side from
which to do the forwarding), which does not sound like the problem you
describe.

I would like to submit a patch, but first check to see if this solution
would be acceptable:
- Include the handshake within the connection retry loop.
- This means fully disconnecting the re-establishing the connection in the
loop if the handshake fails.
- Changing the timeout check to be based on a total absolute time instead of
50 iterations with a 100ms sleep.

Thoughts?

Alternatives could be:
- Have lldb-server delay responding to the 'start gdb server' request until
it could tell (somehow) that the process is listening.
- A sleep of some kind on the client side after starting the server but
before trying to connect.

The most port-forwarder-friendly solution (and one that I've been
wanting to implement for a while, but never got around to it) I can
think of is to not require a second port for the llgs connection. It
could work approximately like this:
- we add a new packet type to the lldb-server platform instance: (e.g.
qExecGDBServer)
- upon receiving this packet, the platform instance would exec the
gdb-server instance, and pass it the existing socket file descriptor
(via a command line argument or whatever)
- the gdb-server instance would start up with the connection primed
and ready, and would not need to do any listening and writing back the
port number, etc.
- the lldb client would end up with it's "platform" connection being
connected to llgs, and would have to create a new platform connection
(or it could create a scratch platform solely for the purpose of
exec()ing the llgs). Either way, the second connection would be to the
exact same address and port as the first one, so there should not be
any extra forwarding setup necessary.

The details of this will need to be thought through (e.g., who should
send the first packet after the exec()? how to detect an error? should
the connection start in no-ack mode?), but I don't think it should be
too complicated overall. And all port-forwarding users would benefit
from that (we could delete the android-specific code I mentioned, and
you could stop mucking with --max-gdbserver-port).

What do you think about that?

pavel

Greetings, I've been using liblldb to remotely debug to a linux server with
port forwarding. To do this, I start lldb-server to with --listen
specifying a localhost port, as well as with ----min-gdbserver-port and
--max-gdbserver-port to specify a specific port for use by 'gdb remote'.
Both ports are forwarded to the remote PC, where liblldb connects to
localhost.

This generally works fine, but there is a race condition. When the client
tells lldb-server to start gdb-remote, the port is returned to the client
which may try to connect before the gdb-remote process is actually
listening.

Are you sure this is what's happening? Looking at lldb-server source
code, I see that it starts listening on lldb-gdbserver.cpp:320
     error = acceptor_up->Listen(1);
     if (error.Fail()) {
       fprintf(stderr, "failed to listen: %s\n", error.AsCString());
       exit(1);
     }
and the port number is written out only later, on line 334:
     error = acceptor_up->Listen(1);
     if (error.Fail()) {
       fprintf(stderr, "failed to listen: %s\n", error.AsCString());
       exit(1);
     }

After listen returns the connection by your port forwarder should not
be rejected, and it can only connect once it knows the port to connect
to.

That's not to say that connecting through port forwarders is easy. For
android debugging, we also have to setup port forwarding with adb, and
we need to do some quite elaborate dances to make sure that the
connection is reliable and race free (see
PlatformAndroidRemoteGDBServer::MakeConnectURL) -- however this is to
resolve a different issue (allocating the port on the local side from
which to do the forwarding), which does not sound like the problem you
describe.

I would like to submit a patch, but first check to see if this solution
would be acceptable:
- Include the handshake within the connection retry loop.
- This means fully disconnecting the re-establishing the connection in the
loop if the handshake fails.
- Changing the timeout check to be based on a total absolute time instead of
50 iterations with a 100ms sleep.

Thoughts?

Alternatives could be:
- Have lldb-server delay responding to the 'start gdb server' request until
it could tell (somehow) that the process is listening.
- A sleep of some kind on the client side after starting the server but
before trying to connect.

The most port-forwarder-friendly solution (and one that I've been
wanting to implement for a while, but never got around to it) I can
think of is to not require a second port for the llgs connection. It
could work approximately like this:
- we add a new packet type to the lldb-server platform instance: (e.g.
qExecGDBServer)
- upon receiving this packet, the platform instance would exec the
gdb-server instance, and pass it the existing socket file descriptor
(via a command line argument or whatever)
- the gdb-server instance would start up with the connection primed
and ready, and would not need to do any listening and writing back the
port number, etc.
- the lldb client would end up with it's "platform" connection being
connected to llgs, and would have to create a new platform connection
(or it could create a scratch platform solely for the purpose of
exec()ing the llgs). Either way, the second connection would be to the
exact same address and port as the first one, so there should not be
any extra forwarding setup necessary.

I am not a fan of losing the lldb-server platform connection by turning it into the GDB server connection. It means you need to connect and launch each time instead of connect once to the platform and then launch N times.

The details of this will need to be thought through (e.g., who should
send the first packet after the exec()? how to detect an error? should
the connection start in no-ack mode?), but I don't think it should be
too complicated overall. And all port-forwarding users would benefit
from that (we could delete the android-specific code I mentioned, and
you could stop mucking with --max-gdbserver-port).

What do you think about that?

Since the port forwarding and the lldb-server need to be in cahoots, maybe an extra argument is added to the "lldb-server platform" invocation that tell us we are using port forwarding and we do something special in the "lldb-server platform" binary to bullet proof the connection?

Greg

Hi Pavel, I think you are on the right track in that it already does seem to wait for the gdbserver to be up using a pipe.

In StartDebugserverProcess it seems like there is a pipe created to tell when the server is running. Here is a comment:

// socket_pipe is used by debug server to communicate back either
// TCP port or domain socket name which it listens on.
// The second purpose of the pipe to serve as a synchronization point -
// once data is written to the pipe, debug server is up and running.

However, it looks like specifying --min-gdbserver-port and --max-gdbserver-port cause the pipe to not be used for some reason.

Here is the gdbserver invocation when I don’t specify the gdb min and max port:

lldb-server gdbserver tcp://:0 --native-regs --pipe 6

And here is the gdbserver invocation when I do specify the gdb port:

lldb-server gdbserver tcp://: --native-regs

Its not obvious to me from looking at the code why this is being skipped. But I will debug further why this argument is not used in this case.

Thanks,
Chris

Aha, interesting. That would explain why there is a race. We should
add the pipe synchronization even in case we know what is the port
going to be (though I still like the "exec" idea).

The most port-forwarder-friendly solution (and one that I've been
wanting to implement for a while, but never got around to it) I can
think of is to not require a second port for the llgs connection. It
could work approximately like this:
- we add a new packet type to the lldb-server platform instance: (e.g.
qExecGDBServer)
- upon receiving this packet, the platform instance would exec the
gdb-server instance, and pass it the existing socket file descriptor
(via a command line argument or whatever)
- the gdb-server instance would start up with the connection primed
and ready, and would not need to do any listening and writing back the
port number, etc.
- the lldb client would end up with it's "platform" connection being
connected to llgs, and would have to create a new platform connection
(or it could create a scratch platform solely for the purpose of
exec()ing the llgs). Either way, the second connection would be to the
exact same address and port as the first one, so there should not be
any extra forwarding setup necessary.

I am not a fan of losing the lldb-server platform connection by turning it into the GDB server connection. It means you need to connect and launch each time instead of connect once to the platform and then launch N times.

The overall number of connections stays the same. The difference is
that instead of 1 connection to the platform and N gdb-server
connections, you will have N+1 platform connections (of those, N will
be later transformed into an llgs connection).

The only thing that changes from the client side is the gdb-server
startup sequence. Instead of:
- send qSpawnGDBServer
- read port
- construct url from port // This is where things break down. With
NAT, we have no way do know whether the port we should use is the same
as the other side uses.
- connect
- talk to gdb-server

we do this:
- connect to platform (using the already-known url)
- send qExecGdbServer
- talk to gdb-server

The details of this will need to be thought through (e.g., who should
send the first packet after the exec()? how to detect an error? should
the connection start in no-ack mode?), but I don't think it should be
too complicated overall. And all port-forwarding users would benefit
from that (we could delete the android-specific code I mentioned, and
you could stop mucking with --max-gdbserver-port).

What do you think about that?

Since the port forwarding and the lldb-server need to be in cahoots, maybe an extra argument is added to the "lldb-server platform" invocation that tell us we are using port forwarding and we do something special in the "lldb-server platform" binary to bullet proof the connection?

I don't think there is any single thing that can be done to "bullet
proof" the connection, as the exact course of action will depend on
the type of forwarding used. And it certainly can't be done on the
server, as the server generally knows nothing about the port
forwarding. The only one who can possibly know about the port
forwarding is the client, but even this is not always the case (it
could be your home router doing a NAT). The only bullet proof way to
get rid of these kinds of issues is to avoid the FTP-like two
connection setup (the separate data and control connections is why
every NAT needs to handle FTP protocol specially, and every protocol
since then has avoided this kind of setup, as it was causing too many
issues).

I made a mistake and was not looking at the latest source. Looks like this exact issue was already discovered and fixed:
https://reviews.llvm.org/D30255

I’m using 4.0.1, so it appears I need to upgrade.

Chris