Process attach and pgid of forked process under Linux

Currently under Linux if you attach to a process whose process group id is not equal to its process id (such as the child process of a fork() call) the calls to waitpid() that pass -1*pid will return ECHILD since the pid argument refers to a process group that doesn’t exist. These calls occur in Host::MonitorChildProcessThreadFunction() and the Linux ProcessMonitor.

Changing -1pid to simply -1 or to -1getpgid(pid) resolves the issue but it’s not clear if this is the right fix as I’m unsure how other OSes deal with this scenario.

Any thoughts?

Thanks,
Andrew

Why are you waiting for process groups? That's not something we have to do on Mac OS X.

Jim

I’d suggest trying the change, running the tests, and see if anything new fails.

And if this fixes currently broken behavior, add a test to make sure we don’t regress it. I’m pretty sure if it breaks something we’ll know pretty quickly (either via bugs or our own usage). At which point - we should add a test so we don’t regress in the future and perhaps add a few comments as to the reason.

Ok great, I’m attaching a patch here that just uses getpgid() to determine the pgid to use with waitpid(). I get the same unit test results before and after the patch (there are a few tests that fail consistently for me). I tried using a simple -1 with waitpid() but this results in a hang in the unit tests. This is probably something worth investigating but for now this patch does resolve the issues around waitpid() when attaching.

I mentioned a few issues I ran into with the unit tests in the email I just pinged with a Linux detach patch, basically there appears to be a bug or limitation somewhere in that doing a “continue” in asynchronous mode in a unit test breaks things and this is needed for most attach-related debug testing. I can probably look into this (along with the couple of failing tests I have) but likely won’t be able to this week. I haven’t filed a bug for the consistently-failing tests on my end since it doesn’t appear to be happening for others and I haven’t had a chance to investigate here, let me know if I should file a bug anyway.

waitpid-getpgid.patch (1.21 KB)

Ok great, I'm attaching a patch here that just uses getpgid() to determine the pgid to use with waitpid(). I get the same unit test results before and after the patch (there are a few tests that fail consistently for me). I tried using a simple -1 with waitpid() but this results in a hang in the unit tests. This is probably something worth investigating but for now this patch does resolve the issues around waitpid() when attaching.

We should never use -1 with waitpid() as you could end up reaping a process that someone else is already trying to reap. You must only try to reap the process that you launched using some pid for you process. Many things launch processes within LLDB like ProcessGDBRemote launching "debugserver" binaries. The ProcessGDBRemote must be notified when and if "debugserver" crashes, so if anyone else does waitpid(-1, ...) ProcessGDBRemote might never find out if debugserver crashes for some reason.

Ok that makes sense. I’ve updated the patch I sent yesterday with a unit test that exposes the problem as well as a couple of other process attach issues that were fixed in the last few days. It was in fact this getpgid() issue that was causing problems when trying to create a unit test earlier as the problem scenario occurs when the unit test script forks a subprocess with popen().

wait-getpgid+unittest.patch (5.67 KB)

Ah nice. I’ll definitely run this again soon. Thanks for getting a test around it - we can’t have enough of that around important expectations.

Thanks Todd, and no problem on the unit test, I’ll try to include one for anything I come across in the future as well.

Having a look at this now.

I built clean and ran tests error free on r205315.

After testing with the patch applied, I’m getting the following failures (3 times in a row):

FAIL: LLDB (suite) :: TestHelloWorld.py (Linux tfiala2.mtv.corp.google.com 3.2.5-gg1336 #1 SMP Thu Aug 29 02:37:18 PDT 2013 x86_64 x86_64)
FAIL: LLDB (suite) :: TestAttachResume.py (Linux tfiala2.mtv.corp.google.com 3.2.5-gg1336 #1 SMP Thu Aug 29 02:37:18 PDT 2013 x86_64 x86_64)

Looking deeper, it looks like the main.c code in test/functionalities/attach_resume/main.c is assuming c99 mode which isn’t the default when using gcc. I’m clearing that locally (pulling for loop variable declarations out) and will see if that fixes everything else.

Back in a bit…

That change did successfully fix the TestAttachResume.py failure.

This one is a bit different and looks like we may have perturbed something related to shutdown:

TestHelloWorld.py:

Executing tearDown hook: def cleanupSubprocesses(self):

Ensure any subprocesses are cleaned up

for p in self.subprocesses:
if p.poll() == None:
p.terminate()
del p
del self.subprocesses[:]

Ensure any forked processes are cleaned up

for pid in self.forkedProcessPids:
if os.path.exists("/proc/" + str(pid)):
os.kill(pid, signal.SIGTERM)

python: /mnt/ssd/work/git/gen/llvm/tools/lldb/source/Plugins/Process/POSIX/ProcessPOSIX.cpp:427: virtual void ProcessPOSIX::SendMessage(const ProcessMessage&): Assertion `thread’ failed.
Aborted (core dumped)

I’m having a look at that core now.

Hey Andrew,

I think the core I was getting on the assert is during exiting and doesn’t really seem to indicate a serious issue. I fixed it up with a null check and added some logging.

See the attached patch. This is a cumulative patch with my fixes for those two test failures plus all of your original changes.

As this patch stands, LGTM. If you want to check it in, I’m fine with it from a Linux standpoint. In the common Host.cpp, you now have a getpgid() call. I’m not sure that’s going to work on a Windows build? Might need to #ifdef that or create a macro for it, or something else entirely. Any Windows folks able to comment on that?

waitpid_amacpherson_mod.diff (6.81 KB)

Hey Todd,

Sounds great. Sorry about the C99 issue, I was using clang which I’m guessing behaves differently there. For the waitpid + getpgid in Host.cpp, that section is #ifdef’d out on Windows so should be safe there too, I’ve committed the patch now. Thanks!

Andrew

Great! Saw it go in, glad you were able to track that down.

Thanks, Andrew.