Attaching to a stopped (cored) process hangs lldb-server

A different approach is to have your tool write all of STDIN to a file (the core file comes into the tool as bytes on STDIN), then hand LLDB that core file and do any needed backtracing and data gathering from it instead of actually attaching to the process. All executables and shared libraries (ELF files) referenced by the core file are still on disk, so you can get symbols and use the debug info, and LLDB should be able to load all frames and symbolicate the crash location. It should be just as good as having the process around, without any bad side effects. Core files are less useful if they must be archived and symbolicated later, because the executable files might not be around anymore: test suites, for example, might produce binaries for testing and remove them after the test has run or crashed.

What do you think about this approach?

Greg Clayton

I can't do that: the servers have no hard drives (they use a RAM disk and net boot), and the tool is trying to avoid a core storm that takes down the network file share. I found out what is causing the hang: there is a call to waitpid in NativeProcessLinux.cpp that waits forever. As the process is already stopped, I disabled that call and it looks to be working.

Mark Chandler
Battle.Net Engineering Systems | Blizzard Entertainment
(P) 949-955-1380 x15353
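For readers unfamiliar with why a plain waitpid can hang: without WNOHANG it blocks until the target reports a state change, and if no change is ever reported it never returns. The sketch below shows one generic way to bound such a wait by polling. It is not LLDB's code and not a proposed fix; the names WaitOutcome and WaitWithTimeout and the 10 ms polling interval are assumptions made for this example, and a real ptrace-based attach on Linux may need different waitpid flags.

```cpp
#include <cerrno>
#include <chrono>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>

// Hypothetical result type for the bounded wait below.
enum class WaitOutcome { StateChanged, TimedOut, Failed };

// Poll waitpid() with WNOHANG until `pid` reports a state change or
// `timeout` elapses. On StateChanged, `*status` holds the wait status.
WaitOutcome WaitWithTimeout(pid_t pid, std::chrono::milliseconds timeout,
                            int *status) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (;;) {
    // WNOHANG makes waitpid return 0 immediately when nothing has changed;
    // WUNTRACED also reports children that have stopped on a signal.
    const pid_t result = ::waitpid(pid, status, WNOHANG | WUNTRACED);
    if (result == pid)
      return WaitOutcome::StateChanged;
    if (result < 0 && errno != EINTR)
      return WaitOutcome::Failed; // e.g. ECHILD: nothing to wait for
    if (std::chrono::steady_clock::now() >= deadline)
      return WaitOutcome::TimedOut;
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```

With a wait like this, a caller can at least report a timeout instead of hanging indefinitely, though it does not explain why the state change never arrives.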

So you would save the file, create a target:

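#include "lldb/API/LLDB.h"

const char *core_path = ...; // Save STDIN to a file and put its path in "core_path"
const bool source_init_files = false;
lldb::SBDebugger::Initialize(); // Call once before using any other SB API
lldb::SBDebugger debugger = lldb::SBDebugger::Create(source_init_files);
// Create an empty target (no executable path is given here)
lldb::SBTarget target = debugger.CreateTarget(nullptr);
lldb::SBProcess process = target.LoadCore(core_path);
if (process.IsValid())
{
    // Do any symbolication you need on your process core file;
    // it behaves just like a real process, except that you can't run it
}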

Makes sense about not writing the core file to disk.

Is there a way you can detect this "core" mode where we don't have to waitpid? Seems like that www.sourceware.org message had ideas on how to detect this case?

Greg

The biggest tell is that the process state is already 'S', or stopped. I don't know lldb well enough to make a change that fixes this, though.

Mark Chandler

Can someone with Linux experience chime in here? It shouldn't be too hard to figure out where the 'S' flag is stored. On MacOS we can get a process info structure from a pid, and that will have bits set that indicate 'S'...
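On Linux the scheduler state of a process is exposed as a single character in /proc/<pid>/stat, right after the parenthesised command name. Per proc(5), 'R' is running, 'S' is interruptible sleep, 'D' is uninterruptible sleep, 'T' is stopped on a signal, and 'Z' is zombie. The sketch below only shows how to read that field; which state a process reports while its core is being piped to a handler is not established here. The function names are hypothetical.

```cpp
#include <fstream>
#include <string>
#include <sys/types.h>

// Extract the state character from one line of /proc/<pid>/stat.
// The line looks like "pid (comm) state ppid ...". Because comm may itself
// contain spaces and parentheses, search for the LAST ')' in the line.
// Returns '\0' if the line does not have the expected shape.
char ParseProcStatState(const std::string &stat_line) {
  const std::string::size_type close = stat_line.rfind(')');
  if (close == std::string::npos || close + 2 >= stat_line.size())
    return '\0';
  return stat_line[close + 2]; // Skip the ')' and the following space
}

// Read the state character for `pid`, or '\0' if it cannot be read
// (for example, the process does not exist or /proc is unavailable).
char ReadProcessState(pid_t pid) {
  std::ifstream stat_file("/proc/" + std::to_string(pid) + "/stat");
  std::string line;
  if (!std::getline(stat_file, line))
    return '\0';
  return ParseProcStatState(line);
}
```

Note that the trailing '+' shown by ps (as in "S+") marks a foreground process group and is added by ps itself; it is not part of the stat field.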

If you want to check this tool into the LLDB source tree at trunk/tools/core_tool, then we can get more people to work on it and improve it. It would be nice to have this available for all Linux users. I would love to see a JSON output mode that is parseable by automated tools, instead of people saving text formats that must be text scraped.

If you can get this into a tool, others can help get this working. Any interest in this?

Greg

I'll have to talk to the powers that be, but for the most part I could check it in. There is an internal JSON format that the tool saves its output in, and a bit of it is geared towards the Blizzard Entertainment server setup, but for the most part it's pretty simple (under 1000 lines).

Mark Chandler


I'm following this discussion, but I don't yet understand what is
going on here completely. What I am sure is that the problem here is
not the S+ state, as that just means "interruptible sleep, foreground
process", and a lot of processes have that state and we attach to them
just fine. I would need to investigate what the exact properties
of this cored state are. I'll try to take a look when I get some spare
cycles, but that might not happen very soon.

Mark, have you investigated what is the next thing to fail after you
remove the waitpid call?

pl

Setting the ptrace options per thread ID also fails, so I removed that as well. At the moment lldb-server is segfaulting in ThreadAttach, and I'm trying to work out why.

Mark Chandler

Hey Pavel,

I think Mark is also on a RHEL 5-era system, so this is going way back in kernel space. It is entirely possible he is seeing different behavior because of that. We only recently started working on RHEL 7 and (I've heard reports of) 6. So this could just be a legitimate behavioral difference that we won't see on much newer Ubuntu kernels, and/or a configuration difference between RHEL and Debian-based kernels.

Although doing any kind of waitpid() in the case of a core file doesn’t make sense.

The process is still around. The process is being handed the core file via STDIN, but the process is still around and this tool is attaching to that process and ignoring the core file data. I would vote to use the core file data if the tool is checked in, or at least provide an option to either attach to the process or use the core file data...

Greg

The problem arises when the core data on stdin is gigabytes in size and there is little to no disk space or memory (as the process is still around) to store or process the data.

Mark Chandler

So the entire core file is in memory somehow, and it will then be freed as it is read from STDIN? That seems like a really lame way to pass the core file around, as it requires up to 2x the size of the core in memory. We could add a new version of SBTarget::LoadCore() like:

SBProcess
SBTarget::LoadCoreFromData(const void *data, uint64_t data_len);

But this will be a memory hog, depending on whether the memory from STDIN containing the core file gets freed immediately after it is consumed or the data stays around.

I'm not sure, but I assume that the kernel writes the core out as the process reads it. I will need to dig into the kernel code to confirm.

Mark Chandler

It's going to be quite difficult for lldb to do anything reasonable with the core file if we can't seek around in it. So for practical purposes it is going to have to get stored somehow, either in a file or in some memory that lldb can do random access on. So practically whoever is getting this stream will need to store it somewhere and hand the results to lldb. So from lldb, adding an API that takes the core file from a chunk of memory seems fine. But I'd leave it up to the clients to figure out how to cons up such a thing as they know their situation better (and most times just writing it to disk first is going to seem reasonable.)

BTW, if your platform has a network connection maybe you could stream the core out to a core server that has disk space. That would lessen the need for heroics on the lldb running on this limited resource platform?

Jim

We currently do stream it out to network attached storage, except that doesn't scale when multiple servers core at once. The whole point of this is to try to reduce the amount of data that needs to be saved, and to not save the same core twice.

Mark Chandler