A nasty failure in TestConcurrentEvents on linux

After a considerate amount of effort, I have managed to get a log trace of a rare failure in TestConcurrentEvents (test_two_watchpoints_one_signal_dwarf). Unfortunately, I do not see a clear way how to avoid it, so I’m writing here in case anyone has an idea.

Here is my interpretation of the events, backed by log snippets (full log in the attachment, I also can provide the corresponding core file if anyone wants to dig around).

The inferior starts out by creating three threads. Of those, two are set up to hit a watchpoint and one signals itself.

411 1426700114.595431000 [7295/729c]: MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status, options = 1073741824) => pid = 29346, status = 0x00000a7f (STOPPED), signal = 10, exit_state = 0

On line 411 we see llgs catching the USR1(10) signal.

412 1426700114.595450000 [7295/729c]: NativeProcessLinux::MonitorSignal() received signal SIGUSR1 (10) with code SI_TKILL, (siginfo pid = 29339 (not from llgs), waitpid pid = 29346)

Here, we see that the ptrace(GETSIGINFO) call was successful, we get info about the signal and continue on our way, stop all other threads (which hit a watchpoint by the way), report this to lldb, etc. After some back and forth, lldb tells us to resume the inferior and let it process the SIGUSR1 signal.

548 1426700116.644772000 [7295/729a]: NativeProcessLinux::Resume() resuming thread = 29346 with signal SIGUSR1

However, here things get strange. Immediately after that we get a second waitpid notification,

555 1426700116.644896000 [7295/729c]: MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status, options = 1073741824) => pid = 29346, status = 0x00000a7f (STOPPED), signal = 10, exit_state = 0

This shouldn’t happen as we have just reinjected the signal for the process to handle, and no other signal was sent to the thread in the meantime. But we don’t know that so we try to process the signal as usual, by calling ptrace(GETSIGINFO).

558 1426700116.644932000 [7295/729c]: ThreadStateCoordinator::EnqueueEvent enqueued event: EventThreadDeath (tid=29346)

561 1426700116.644961000 [7295/729c]: NativeProcessLinux::MonitorCallback GetSignalInfo failed: No such process, tid = 29346, signal = 10, status = 0 (thread/process killed, is not main thread, thread metadata removed)

These lines tell us that we are now in the branch where ptrace(GETSIGINFO) failed with ESRCH. This can happen for two reasons:

  • the thread in question has died without us noticing
  • the thread is running
    Since we have just received a waitpid notification, we know that the thread should not be running and we assume that the thread has died between waitpid and getsiginfo calls. We clean up our internal structures and forget about the thread.

However, the thread is very much alive and kicking and we get another notification from the thread on line 567. Now we try to find the thread in our data structures and assert because it does not exist.

These events would seem to indicate that the USR1 notification on line 555 was bogus - there should be no additional signal, and even if there was, it should have stopped the thread, but it didn’t do that.

So it seems that we should just ignore bogus signals (= assume they don’t stop the thread and they don’t kill it). However, it is not easy to differentiate between the three scenarios:

  • real signal - waitpid returns, getsiginfo suceeds - thread is stopped
  • bogus signal - waitpid returns, getsiginfo fails - thread is running
  • thread death - waitpid returns, getsiginfo fails - thread is dead

Does anyone have an idea how to address this? I have a few, but I don’t like any of them very much, so I’d rather keep them to myself for now. Besides, this mail has gotten too long already. If you’re still reading this, give yourself a cookie. :slight_smile:

pl

core-lldb.log (70.7 KB)

I think you should look at git blame on relevant lines in the kernel and ask for help from those people and/or loop them onto this thread.