Lldb-server spins forever in ptrace with 100% CPU on Linux Ubuntu 16.04

Hi,

I am running Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-51-generic x86_64) in a virtual machine.

When I try to write memory in my process, sometimes it works, but sometimes lldb-server goes unresponsive. It spins with 100% CPU utilization and does not respond to any signal except SIGKILL.

I tried to debug it, but under a debugger everything works perfectly :). Here are a couple of observations:

  1. I believe that lldb-server spins inside ptrace. I put a breakpoint on the highlighted line, and it does not hit. If I put a breakpoint on the line before, it hits, but lldb-server hangs.

Error
NativeProcessLinux::PtraceWrapper(int req, lldb::pid_t pid, void *addr, void *data, size_t data_size, long *result)
{
    Error error;
    long int ret;

    Log *log (ProcessPOSIXLog::GetLogIfAllCategoriesSet (POSIX_LOG_PTRACE));

    PtraceDisplayBytes(req, data, data_size);

    errno = 0;
    if (req == PTRACE_GETREGSET || req == PTRACE_SETREGSET)
        ret = ptrace(static_cast<__ptrace_request>(req), static_cast< ::pid_t>(pid), *(unsigned int *)addr, data);
    else
        ret = ptrace(static_cast<__ptrace_request>(req), static_cast< ::pid_t>(pid), addr, data);

    if (ret == -1)
        error.SetErrorToErrno();

  2. It seems that the hang is caused by the client trying to read the response too fast. I mean, if I step through the client code it works - i.e. there is a significant delay between the client writing into the pipe and issuing ::select to wait for the response (see the rough sketch below).
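To make the timing concrete, the client-side sequence I mean is roughly the following (a sketch with made-up names, not the actual LLDB client code):

#include <sys/select.h>
#include <unistd.h>

// Send a request packet, then immediately wait for the reply to become
// readable. When I single-step through the real client, the extra delay
// between the write and the select seems to be what makes the hang go away.
bool send_then_wait(int fd, const char *packet, size_t len, long timeout_usec)
{
    if (write(fd, packet, len) != (ssize_t)len)
        return false;

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    timeval timeout = { 0, timeout_usec };
    return select(fd + 1, &readfds, nullptr, nullptr, &timeout) > 0;
}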

Any advice on how to deal with this situation, other than putting random sleeps in random places?

Thanks,

Eugene

Hello Eugene,

This sounds troubling, and I'd like to get to the bottom of it. If
you can get me a bit more information, I believe we can figure it
out:

- could you get a backtrace of lldb-server when it is in the "stuck"
state (just attach with lldb/gdb after it hangs and do "bt")? I want
to see where it is spinning, as I don't see any obvious infinite
loop there.

- are you able to still reproduce the bug with logging enabled? If so,
I'd like to see the log file to understand this better. (You can
enable logging by starting lldb-server with: --log-file XXX.log
--log-channels "lldb all:linux all". If you're starting it via lldb
client you can set the LLDB_DEBUGSERVER_LOG_FILE and
LLDB_SERVER_LOG_CHANNELS environment vars to achieve this)

- If you can get me reasonably detailed repro steps, I can try to
investigate (I am fine with the first step being "install ubuntu 16.04
in virtualbox")

Hi,
> 1. I believe that lldb-server spins inside ptrace. I put a breakpoint on the
> highlighted line, and it does not hit. If I put a breakpoint on the line before,
> it hits, but lldb-server hangs.

Do you mean actually inside the ptrace(2) syscall? Your description
would certainly fit that, but that sounds scary, as it would mean a
kernel bug. If that's the case, then we have to start looking in the
kernel. I have some experience with that, but if we can boil this down
to a simple use case, we can also ask the kernel ptrace folks for
help.

> 2. It seems that the hang is caused by the client trying to read the response too
> fast. I mean, if I step through the client code it works - i.e. there is a
> significant delay between the client writing into the pipe and issuing ::select to
> wait for the response.

I am not sure how this fits in with the item above. I find it hard to
believe that the presence of select(2) in one process would affect the
outcome of ptrace() in another. Unless we are actually encountering a
kernel scheduler bug, which I find unlikely. Hopefully we can get more
insight here with additional logging information.

> Any advice on how to deal with this situation, other than putting random sleeps in
> random places?

Inserting sleeps in various places is a valid (albeit very slow)
strategy for debugging races. If you can push the sleep down, you will
eventually reach the place where it will be obvious what is racing
(or, at least, which component is to blame). Hopefully we can do
something smarter though.

If you suspect a kernel bug, I'd recommend recreating it in a
simple standalone application (fork, trace the child, write its
memory), as that makes it easy to ask the kernel folks for help.
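Something along these lines, perhaps (an untested sketch; the 32 MB size,
the madvise(MADV_HUGEPAGE) hint, and the choice of target address are just
my guesses at what it takes to hit the same code path):

#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>

int main()
{
    const size_t size = 32 * 1024 * 1024;

    // Large anonymous r-x mapping, similar in spirit to the VMA from the
    // /proc/<pid>/maps output.
    void *region = mmap(nullptr, size, PROT_READ | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    madvise(region, size, MADV_HUGEPAGE); // encourage huge page backing

    pid_t child = fork();
    if (child == 0)
    {
        // Child: ask to be traced and stop so the parent can poke us.
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
        raise(SIGSTOP);
        _exit(0);
    }

    int status = 0;
    waitpid(child, &status, 0); // wait for the child to stop

    // Poke one word into the middle of the child's copy of the mapping.
    // The address is valid in the child because fork() preserves the layout.
    void *addr = static_cast<char *>(region) + size / 2;
    errno = 0;
    if (ptrace(PTRACE_POKEDATA, child, addr, (void *)0x1122334455667788) == -1)
        perror("PTRACE_POKEDATA");
    else
        printf("poke succeeded\n");

    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return 0;
}

If something like that hangs the same way on one of the affected machines,
we'll have a test case we can hand straight to the kernel ptrace people.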

pl

> could you get a backtrace of lldb-server when it is in the “stuck”
> state (just attach with lldb/gdb after it hangs and do “bt”)?

You wish ☹ The lldb-server does not react to any signals, including SIGSTOP, so gdb just hangs forever.

> If you can get me reasonably detailed repro steps, I can try to investigate

Unfortunately, I do not have a repro myself. It happens semi-randomly on some machines, and I need to borrow a machine with the problem. Here are some details from my records:

  • It is a pretty big piece of RX memory; /proc/<pid>/maps shows this:
    409701000-40b49c000 r-xp 00000000 00:00 0
  • Writing into some locations within that VMA works.
  • When it repros, it is pretty consistent, but changes in the target may shift it - i.e. make it not repro or fail at a different address.

> are you able to still reproduce the bug with logging enabled?

Yes. Here are the last few lines from the log:

1481139040.768961000 0x7fff253c9780 Communication::Write (src = 0x7fff253c8f48, src_len = 7) connection = 0x24a6bd0
1481139040.768963000 0x24a6bd0 ConnectionFileDescriptor::Write (src = 0x7fff253c8f48, src_len = 7)
1481139040.768973000 0x24a6cc0 Socket::Write() (socket = 6, src = 0x7fff253c8f48, src_len = 7, flags = 0) => 7 (error = (null))
1481139040.768976000 0x24a6bd0 ConnectionFileDescriptor::Write(fd = 6, src = 0x7fff253c8f48, src_len = 7) => 7 (error = (null))
1481139040.768979000 0x7fff253c9780 Communication::Read (dst = 0x7fff253c7140, dst_len = 8192, timeout = 0 usec) connection = 0x24a6bd0
1481139040.768982000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable (timeout_usec = 0)
1481139040.768984000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80)…
1481139040.768986000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80) => 0, error = (null)
1481139090.788317000 0x7fff253c9780 Communication::Read (dst = 0x7fff253c7140, dst_len = 8192, timeout = 0 usec) connection = 0x24a6bd0
1481139090.788356000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable (timeout_usec = 0)
1481139090.788364000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80)…
1481139090.788368000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80) => 1, error = (null)
1481139090.788378000 0x24a6cc0 Socket::Read() (socket = 6, src = 0x7fff253c7140, src_len = 25, flags = 0) => 25 (error = (null))
1481139090.788382000 0x24a6bd0 ConnectionFileDescriptor::Read() fd = 6, dst = 0x7fff253c7140, dst_len = 8192) => 25, error = (null)
1481139090.788395000 NativeProcessLinux::WriteMemory(0x409d5a7d0, 0x25271d0, 4)
1481139090.788409000 NativeProcessLinux::ReadMemory using process_vm_readv to read 8 bytes from inferior address 0x409d5a7d0: Success
1481139090.788414000 PTRACE_POKEDATA [1][0][0][0][57][41][54][41]
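
For context, the last three lines are the usual read-modify-write sequence for a sub-word write - roughly the pattern below (an illustration with made-up names, not the actual lldb-server code) - and the PTRACE_POKEDATA at the end is the call that never returns:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <cerrno>
#include <cstdint>
#include <cstring>

// Write 'len' bytes (len <= 8) at 'addr' in the traced process: read the
// containing 8-byte word with process_vm_readv, patch in the new bytes,
// and write the whole word back with PTRACE_POKEDATA.
bool write_small(pid_t pid, uint64_t addr, const void *src, size_t len)
{
    uint64_t word = 0;

    iovec local  = { &word, sizeof(word) };
    iovec remote = { reinterpret_cast<void *>(addr), sizeof(word) };
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) !=
        static_cast<ssize_t>(sizeof(word)))
        return false;

    std::memcpy(&word, src, len < sizeof(word) ? len : sizeof(word));

    errno = 0;
    return ptrace(PTRACE_POKEDATA, pid, reinterpret_cast<void *>(addr),
                  reinterpret_cast<void *>(word)) != -1;
}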

Thanks,

Eugene

Hi,

It seems the problem is related to transparent huge pages. Disabling them helps.

lldb-server spins in this kernel code path:

[11428.634820] Call Trace:
[11428.634821] [] do_huge_pmd_wp_page+0x62e/0xb80
[11428.634822] [] handle_mm_fault+0x705/0xfe0
[11428.634823] [] ? follow_page_mask+0x37d/0x7a0
[11428.634824] [] __get_user_pages+0x19b/0x660
[11428.634825] [] get_user_pages+0x52/0x60
[11428.634825] [] __access_remote_vm+0xcb/0x1f0
[11428.634826] [] access_process_vm+0x50/0x70
[11428.634827] [] ptrace_request+0x2da/0x620
[11428.634828] [] ? wait_task_inactive+0x16c/0x1f0
[11428.634829] [] arch_ptrace+0x29b/0x300
[11428.634830] [] SyS_ptrace+0xbe/0x110
[11428.634831] [] system_call_fastpath+0x16/0x1b
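
For anyone else who runs into this: the setting lives in the standard sysfs location, and a quick check of what a machine is currently using can be as simple as the sketch below (the bracketed word in the output is the active mode). Writing "never" (or "madvise") into that same file as root is the usual way to disable THP system-wide.

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Prints something like "always madvise [never]"; the bracketed entry
    // is the mode currently in effect.
    std::ifstream thp("/sys/kernel/mm/transparent_hugepage/enabled");
    std::string mode;
    if (std::getline(thp, mode))
        std::cout << mode << '\n';
    else
        std::cout << "transparent_hugepage setting not available\n";
    return 0;
}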

Eugene


Nice catch. I think this backtrace should be enough to bring the problem to the attention of the kernel developers. Have you tried contacting them already? I can help you with that if you want.


Yes, we have contacted RHEL people. It repros there too.
