Failure connecting to lldb-server running in local virtual machine.

Hi,

I’ve been hitting issues connecting to lldb-server. I’ve been trying from a Mac to Linux running in a VirtualBox VM on the same machine. Once I’ve created a target and issued the “run” command, lldb immediately disconnects with “error: connect remote failed (failed to get reply to handshake packet)”. The full output from a failed attempt to debug a simple “Hello World!” program is:

(lldb) platform select remote-linux
Platform: remote-linux
Connected: no
(lldb) platform connect connect://127.0.0.1:1234
Platform: remote-linux
Triple: x86_64-pc-linux-gnu
OS Version: 4.8.0 (4.8.0-22-generic)
Kernel: #24-Ubuntu SMP Sat Oct 8 09:15:00 UTC 2016
Hostname: hhellyer-VirtualBox
Connected: yes
WorkingDir: /home/hhellyer
(lldb) target create hello.out
Current executable set to 'hello.out' (x86_64).
(lldb) run
error: connect remote failed (failed to get reply to handshake packet)
error: process launch failed: failed to get reply to handshake packet

I’m running the server (on Linux) with:

lldb-server platform --listen *:1234 -P 2345
(I need to specify -P as only a few ports are forwarded from the VirtualBox VM.)

With logging enabled the logs showed the failure happened when the lldb-server received the “QStartNoAckMode” packet.

This is just the first packet we send after sending the “ack” down to the remote server. When we send the “ack” we don’t need a response; then we send the “QStartNoAckMode” packet and actually wait for a response. If we don’t get one, then we bail. So this is just the first packet that is sent that expects a response.
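
To make that concrete, here is a minimal sketch of the exchange at the wire level, assuming a plain TCP connection to a gdb-remote port (the address is hypothetical, and this is just the packet framing, not the LLDB implementation):

    import socket

    def frame(payload: bytes) -> bytes:
        # gdb-remote framing: $<payload>#<checksum>, where the checksum is the
        # modulo-256 sum of the payload bytes, written as two hex digits.
        return b"$" + payload + b"#" + b"%02x" % (sum(payload) % 256)

    # Hypothetical address; in the real flow this is the port the platform
    # instance hands back for the freshly launched debug server.
    sock = socket.create_connection(("127.0.0.1", 11001), timeout=5)
    sock.sendall(b"+")                       # the initial "ack"; no reply expected
    sock.sendall(frame(b"QStartNoAckMode"))  # first packet that does expect a reply
    print(sock.recv(4096))                   # a healthy server answers with something like b"+$OK#9a"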

I initially thought this was a timing issue on the connection between the client and the server. After doing some investigation I ended up adding code to dump a backtrace when the connection was disconnected (in Communication::Disconnect) and suddenly running the target started working. I replaced the backtrace with a sleep(1) call and it continued working. After that I set up another remote virtual Linux box (actually some distance away on the network) and found that lldb worked fine connecting to the remote lldb-server there, presumably because the connection was much slower.

We seem to have a race condition here. Any help figuring out what that might be would be great.

At this point I was assuming it was a timing issue; however, another configuration that worked was lldb-server and lldb both running on the Linux VM inside VirtualBox, which I would have assumed would also be very quick. I’m also wondering whether lldb does anything special when the connection goes to 127.0.0.1 that using a local VM might confuse.

So a little background on what happens:

  • You launch lldb-server in platform mode on the remote host (VM in your case)
  • We attach to it from LLDB on your host machine like you did above
  • When we want to debug something on the other side (“run” command above), we send a packet to the lldb-server on the remote host and ask it to launch a GDB server for us. The lldb-server does launch one and sends back a hostname (<hostname>) and a port (<port>) in the response. The host side LLDB then tries to connect to this IP address and port using the equivalent of:

(lldb) process connect connect://<hostname>:<port>
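
For example, with purely hypothetical values (a host-only VM address and a gdbserver port), that would end up looking like:

(lldb) process connect connect://192.168.56.101:1001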

So the question is: does your VM have a unique IP address? If so, no port mapping will need to be done. If it responds with 127.0.0.1, then we have port collision issues and will probably need to use a port offset. Typically with VMs there is some port remapping that has to happen. If you and your VM are both listening for a connection on port 1234, what happens when you do:

(lldb) platform connect connect://127.0.0.1:1234

Would it connect to your host machine, or to the VM? Typically there are some port remapping settings you configure, like “add 10000” to any forwarded port for the VM, so you would then just do:

(lldb) platform connect connect://127.0.0.1:11234

Notice the extra 10000 added to the original 1234.
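
With VirtualBox NAT networking, that kind of remapping is normally a port-forwarding rule on the VM. As a rough example (the VM name “ubuntu-vm” and the rule name are made up; the --natpf1 fields are name,protocol,host-ip,host-port,guest-ip,guest-port):

% VBoxManage modifyvm "ubuntu-vm" --natpf1 "lldb-platform,tcp,,11234,,1234"

That forwards host port 11234 to guest port 1234, i.e. the “add 10000” convention above.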

When starting lldb-server in platform mode, you can give it a min and max port range to use for connections, so that you can edit your VM settings to make those specific ports available:

% lldb-server platform --listen '*:1000' --min-gdbserver-port 1001 --max-gdbserver-port 1010 --port-offset=10000

This would listen on port 1000, and reserve 1001 - 1010 for GDB remote connections when the server is asked to spawn an lldb-server in debug mode. Since you specified a port offset of 10000, when a packet is sent to your “lldb-server platform” asking it to start an lldb-server for debugging, it will use a port between 1001 and 1010 and add the port offset when it sends the info back to the client; so it will send back “11001” as the first port to connect to, and your port remapping will know to route that to your VM.
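
Continuing the hypothetical VirtualBox setup from above, the host side would then need forwarding rules for the platform port and for the reserved GDB server range, and the client connects through the offset platform port. For example (VM and rule names are made up):

% VBoxManage modifyvm "ubuntu-vm" --natpf1 "lldb-platform,tcp,,11000,,1000"
% VBoxManage modifyvm "ubuntu-vm" --natpf1 "lldb-gdb-1001,tcp,,11001,,1001"
  (... and similar rules for 1002 - 1010 ...)

(lldb) platform connect connect://127.0.0.1:11000

The subsequent “run” then reconnects automatically on one of 11001 - 11010, which the forwarding rules route back to 1001 - 1010 on the VM.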

I was testing from Mac to VirtualBox just because it was simpler than testing against a remote system, which was the actual goal and does seem to work, so this isn’t totally blocking me. It does seem like a problem, though, and I’m not sure whether other users connecting to remote Linux machines will hit it.

Sounds like a race condition we need to solve.

Are there any known issues around this type of connection already? Or does anyone have any useful pointers? I couldn’t see anything quite the same on http://bugs.llvm.org/ so asking here seemed like a logical next step.

This sounds like a very common VM issue. If you can elaborate on what happens with ports on your VM we might be able to deduce some of the problems. Filing a bug would be nice, but since you are running into this issue, it seems like you will need to debug it and make it work. We are happy to assist.

Greg

> Sounds like a race condition we need to solve.

It seemed likely, but I thought I’d confirm that before I investigated too deeply, just in case there was anything obvious I’d missed.

> This sounds like a very common VM issue. If you can elaborate on
> what happens with ports on your VM we might be able to deduce some
> of the problems. Filing a bug will be nice, but since you are
> running into this issue, seems like you will need to debug it and
> make it work. We are happy to assist.

Initially I was mapping ports with an offset, but as part of investigating I simplified things and now my setup just routes ports 1234 and 2345 on my machine to 1234 and 2345 on the VM. I might retry with an offset just to double-check whether that changes anything, but I don’t think it did.

It does seem to be fairly common: it happens with my VM and for another colleague using Docker. It works with a remote system for me but fails with a remote system for him.

I’ve had some other bits to look at today but I’ll try to take a deeper look next week. I’ll try to update this thread with what I find, as any useful suggestions other people have are likely to save a bit of time.

Thanks,

Howard Hellyer
IBM Runtime Technologies, IBM Systems

> this is just the first packet we send after sending the "ack" down
> to the remote server. When we send the "ack", we don't need a
> response, then we send the "QStartNoAckMode" packet and actually
> wait for a response. If we don't get one, then we bail. So this is
> just the first packet that is sent that expects a response.

I've been doing some more investigation and this seems to be true but it looks like the actual "ack" might be part of the problem.

In GDBRemoteCommunicationClient::HandshakeWithServer lldb sends an Ack() packet (a '+'), then calls ReadPacket until it no longer succeeds in reading packets, to make sure everything outstanding has been read.
Then it does QueryNoAckModeSupported. What seems to be happening in the passing case is that ReadPacket returns with a result of ErrorReplyTimeout.
In the failing case ReadPacket returns ErrorDisconnected. This seems to happen because in the failing case (without the delay) the underlying "recv" call returns 0, indicating end-of-file, and then GDBRemoteCommunication::WaitForPacketNoLock calls Disconnect and returns ErrorDisconnected to HandshakeWithServer.
HandshakeWithServer seems to rely on ReadPacket failing to read successfully but with a type of error that means the connection is still valid. In the bad case it fails with an error that means lldb has disconnected.
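
For reference, the difference between those two results is just normal socket behaviour. A small standalone sketch (hypothetical address, not LLDB code) of how the two cases look from the client side:

    import socket

    # Distinguish "nothing to read yet" (ErrorReplyTimeout) from "the remote
    # end closed the connection" (ErrorDisconnected), as described above.
    sock = socket.create_connection(("127.0.0.1", 1234))  # hypothetical server
    sock.settimeout(0.5)
    try:
        data = sock.recv(4096)
        if data == b"":
            print("recv returned 0 bytes: peer closed -> ErrorDisconnected")
        else:
            print("got data:", data)
    except socket.timeout:
        print("timed out with the connection still open -> ErrorReplyTimeout")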

As yet I don't understand exactly why it reads an empty response in the failing case. I'm not sure if there's something queued or if it's managing to start talking to the remote process before it's finished starting. Moving the sleep(1) just before GDBRemoteCommunicationClient::HandshakeWithServer seems to work (with the sleep inside Disconnect() removed) so I think this is on the right path.

I'll continue investigating and see if I can determine why. It might be the problem is on the server side and lldb-server is passing back the connection details before the new process is ready for the client.

Howard.

> We seem to have a race condition here. Any help figuring out what
> that might be would be great.

I think I've now identified the problem.

Normally in lldb-server, when the child lldb-server process is launched inside GDBRemoteCommunication::StartDebugserverProcess, the parent lldb-server waits for it to start up and write back the port it has selected to listen on via a pipe.
When lldb-server specifies a port for the child to listen on (when we start lldb-server with -P <port> or -m <minport> and -M <maxport>), it doesn't bother setting up that pipe as it already knows what the port will be. However, skipping that pipe means the parent can reply to the lldb client with the port number before the debug server process is actually up and listening on that port.

Simply making sure the pipe is always set up and used to read the port number from (even when lldb-server knows what it will be) ensures the child process is ready, and allows me to connect reliably with an unmodified client lldb.
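
To show the race in isolation, here is a minimal standalone sketch (plain Python on POSIX, not the lldb-server code) of why reading the port back over a pipe is the right synchronisation point: the parent cannot report the port until the child is actually bound and listening.

    import os, socket

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: bind and listen first, then report the port back over the pipe.
        srv = socket.socket()
        srv.bind(("127.0.0.1", 0))  # let the kernel pick a free port
        srv.listen(1)
        os.write(write_fd, str(srv.getsockname()[1]).encode())
        srv.accept()                # by now it is safe for anyone to connect
        os._exit(0)
    else:
        # Parent: blocking on the pipe guarantees the child is already listening,
        # so handing this port out to a client can no longer race with the
        # child's startup.
        port = int(os.read(read_fd, 16))
        print("child listening on port", port)
        socket.create_connection(("127.0.0.1", port)).close()
        os.waitpid(pid, 0)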

The one-second delay in the client was fixing things because inside PlatformRemoteGDBServer::DebugProcess there is a retry of ConnectRemote, and the delay gave the child process time to start before the retry. This also explains why more people weren't seeing the problem, as it only applies if you use the -P or -m/-M options. I think I only found those by searching the code; they don't appear in the help when you run lldb-server and I didn't manage to find any docs.

Unless anyone points out a major flaw in the logic above, I'll tidy up my fix and put a patch up on https://reviews.llvm.org/ in the next few days. The code for listening on a pipe from the child process is slightly different when __APPLE__ is defined, so I need to verify the change on both Mac and Linux. I'll probably just always read back the port, and if it comes back as an unexpected value that may need to be a new error case.

Would you mind if I also raised a patch to update the documentation to include the -P and -m/-M options? (Both in the help when you run lldb-server and on www/remote.html.) I originally came to this problem because another team couldn't connect to a remote system with only certain ports open, and I found them the options they needed by reading the source. I think use cases like connecting to processes in a VM or Docker container, where these options are needed, are becoming more common.

Thanks,

Howard.
