[PATCH] Unexpected group-stop with threading under Linux

This patch addresses a bug where in a multi-threaded program a new signal from the inferior may be received before all group-stop messages from an earlier signal have been handled. The attached .cpp file exposes the issue:

  • clang++ -o thread thread.cpp -lpthread -g -O0

  • lldb ./thread

  • r

  • hit Ctrl-C

  • br set -f thread.cpp -l 11

  • c

A breakpoint is usually hit before all the group-stop messages involved in restarting the process are handled and the internal state then gets out of sync.

The attached patch solves the issue and all unit tests pass, I’m not a ptrace expert though so there may be a more appropriate fix. The group-stop handling was taken from the existing ProcessMonitor::MonitorCallback.

Thanks,

Andrew

thread.cpp (443 Bytes)

extra-group-stop.patch (1.53 KB)

I’m giving this a shot now locally. The change looks reasonable to me - any ptrace experts disagree?

The one bit I’m not entirely sure about is: if assertions are disabled, and for some reason the WSTOPSIG(status) is not SIGSTOP (since we’re not explicitly checking before the Resume(wait_pid, SIGSTOP) call), how badly can that gunk things up? An if check before the Resume call would at least stop the wrong signal from getting sent to the inferior if I’m reading it right.

Verified this fixes it. I’ll go ahead and check it in.

Hey Todd, great! I think if we end up in there and it’s not a SIGSTOP then we have no idea what’s going on and it’s not clear what the correct behaviour would be. I’m happy to wrap the whole thing in an if statement checking for SIGSTOP though if you think that’s best, though it’s not clear whether the thing to do is to continue or return at that point.

Thanks!

Sending source/Plugins/Process/Linux/ProcessMonitor.cpp
Transmitting file data .
Committed revision 200226.

I think if we end up in there and it’s not a SIGSTOP then we have no idea what’s going on and it’s not clear what the correct behaviour would be.

After looking at the whole path, it really isn’t clear to me what the right thing would be to do in that case as you said, so I was fine leaving it as is. I think we’d be hosed either way at that point. I’ll be working deeper in this code soon anyway, so if other reasonable causes for the error surface, I’ll revisit the code at that time.