LLDB hang loading Linux core files from live processes (Bug 26322)

I’ve been hitting a hang when lldb loads some core dumps created on Linux, generally those created via gcore.

I found an open bug for this here:
https://llvm.org/bugs/show_bug.cgi?id=26322
and the fix that was suggested there still works. (The patch needs some tidying up due to the code formatting changes.)

I’d quite like to take that change and submit an updated patch via Phabricator. Since no-one else has done that so far, I was wondering whether there was a problem with the approach it took or whether it was just a question of time. The patch adds a flag to say that the process was loaded from a core file and uses that as a simple check to decide whether lldb should wait for the process to resume. Doing that avoids changing the logic for working out the thread states. I’m not sure whether that’s bad, because it avoids fixing the thread state logic, or good, because it allows the core to load without changing the thread states from the state they were in when the core file was created.

If no-one objects I’ll grab the bug and submit a patch, otherwise please let me know and I’ll look at fixing it another way.

Thanks,

Howard Hellyer
IBM Runtime Technologies, IBM Systems
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

I think that approach is kind of a bandaid.

Core files can't resume, so it would be better to figure out why telling a core file which can't resume to resume caused us to go into a tail spin. That should just fall out of WillResume returning false or some other better general signal. Special-casing core files seems a bit of a hack.

That being said, if nobody has time to make a better solution, a bandaid is better than bleeding...

Jim

Hi Jim

I was afraid someone would say that, but I’ve done some digging and found a difference between the core files generated by gcore and those generated by a crash or abort.

Most of the core files have one SIGINFO structure in the core; I think it belongs to the preceding thread (the one that caught the signal).
In the core files generated by gcore all of the threads have a SIGINFO structure following their PRSTATUS structure. In the non-gcore files the value of info.si_signo in the PRSTATUS structure is a signal number. In the gcore file this is actually 0 but the SIGINFO structure following PRSTATUS has an si_signo value of 19.

Looking at it with eu-readelf shows:

  CORE 336 PRSTATUS
    info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0
    sigpend: <>
    sighold: <>
    … lots of registers …
  CORE 128 SIGINFO
    si_signo: 19, si_errno: 0, si_code: 0
    sender PID: 0, sender UID: 0

I think gcore is being clever. It’s including the “real” signal number the running thread had received at the time the core was taken (info.si_signo is 0) but also the signal it had used to interrupt the thread and gather its state. The value in PRSTATUS info.si_signo is the signal number that ends up in m_signo in ThreadElfCore and is ultimately looked up in the set of signals lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in that set since there isn’t a signal 0. I think gcore is doing all this so that it preserves the real signal state the process had before gcore attached to it, I guess in case you are trying to debug something to do with signals and need to see that state. (That’s a bit of a guess, mind you.)
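The lookup failure described above can be illustrated with a small sketch. This is not LLDB's UnixSignals code: SignalTable, AddSignal and MakeExampleTable are hypothetical names made up for the example, and the signal numbers are assumed to be the x86-64 Linux values.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for a table of known signals. This is NOT LLDB's
// UnixSignals class, only an illustration of the lookup behaviour.
class SignalTable {
public:
  void AddSignal(int signo, bool should_stop) {
    assert(signo > 0 && "real signals are numbered from 1");
    m_should_stop[signo] = should_stop;
  }

  // Unknown signal numbers, including 0, are reported as "do not stop".
  bool GetShouldStop(int signo) const {
    auto it = m_should_stop.find(signo);
    return it != m_should_stop.end() && it->second;
  }

private:
  std::map<int, bool> m_should_stop;
};

// Builds a small table with two signals (x86-64 Linux numbers assumed).
inline SignalTable MakeExampleTable() {
  SignalTable table;
  table.AddSignal(6, true);   // SIGABRT
  table.AddSignal(19, true);  // SIGSTOP
  return table;
}
```

With a table like this, a thread whose PRSTATUS reports signal 0 never matches a "should stop" entry, whereas the 19 from the SIGINFO note would.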

I can think of three solutions:

  • Read the signal information from the SIGINFO block for a thread if it’s present. Core files generated by abort or a crash only seem to have a SIGINFO for one thread, which looks like it’s the one that received/triggered the signal in the first place. This means adding something to parse that block out of the ELF core as well as PRSTATUS, and overriding the state from PRSTATUS if we see it. SIGINFO always seems to come after PRSTATUS, and probably has to, as PRSTATUS contains the pid and identifies that there is a new thread in the core, so if SIGINFO is found that signal number will just replace the first one.

  • Never allow a thread’s signal number to be 0 when it comes from an ELF core dump. (This is probably as much of a band aid as the first solution.)

  • Stick with the first solution of saying that we can never resume a core file. The only thing in this solution’s favour is that it means the “real” thread state that gcore tried to preserve is known to lldb. Once the core file is loaded, typing continue does result in an error message telling you that you can’t resume from a core file.

I’ll have a go at prototyping the solution to read the SIGINFO structure but I’d appreciate any thoughts on which is the “correct” fix.
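A minimal sketch of the SIGINFO override idea, under stated assumptions: the notes are taken to be already decoded into a simplified record, and DecodedNote, ThreadInfo and CollectThreads are hypothetical names, not LLDB classes. The two note type values are the ones defined in the Linux <elf.h> header.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Note types as defined in <elf.h> on Linux.
constexpr uint32_t kNtPrstatus = 1;
constexpr uint32_t kNtSiginfo = 0x53494749;  // "SIGI"

// Hypothetical, already-decoded view of one core note. A real reader would
// have to extract these fields from the raw note payload.
struct DecodedNote {
  uint32_t type;
  int pid;    // only meaningful for PRSTATUS
  int signo;  // info.si_signo for PRSTATUS, si_signo for SIGINFO
};

struct ThreadInfo {
  int tid;
  int signo;
};

// Each PRSTATUS starts a new thread. A SIGINFO that follows it replaces the
// signal number recorded for that thread.
inline std::vector<ThreadInfo> CollectThreads(
    const std::vector<DecodedNote> &notes) {
  std::vector<ThreadInfo> threads;
  for (const DecodedNote &note : notes) {
    if (note.type == kNtPrstatus) {
      threads.push_back({note.pid, note.signo});
    } else if (note.type == kNtSiginfo && !threads.empty()) {
      threads.back().signo = note.signo;
    }
  }
  return threads;
}

// Example: a gcore-style layout where PRSTATUS reports 0 and the SIGINFO
// that follows reports 19. The second thread has no SIGINFO note.
inline void ExampleGcoreLayout() {
  std::vector<DecodedNote> notes = {
      {kNtPrstatus, 100, 0}, {kNtSiginfo, 0, 19}, {kNtPrstatus, 101, 0}};
  std::vector<ThreadInfo> threads = CollectThreads(notes);
  assert(threads.size() == 2);
  assert(threads[0].signo == 19);
  assert(threads[1].signo == 0);
}
```

The sketch relies on the ordering observation made above (SIGINFO following the PRSTATUS of the thread it belongs to); a SIGINFO seen before any PRSTATUS is simply ignored.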

Thanks,

Howard Hellyer
IBM Runtime Technologies, IBM Systems

I think both are valid fixes. Threads in core files can have a non-zero signal. See comments below.

I can think of three solutions:

- Read the signal information from the SIGINFO block for a thread if it's present. Core files generated by abort or a crash only seem to have a SIGINFO for one thread, which looks like it's the one that received/triggered the signal in the first place. This means adding something to parse that block out of the ELF core as well as PRSTATUS, and overriding the state from PRSTATUS if we see it. SIGINFO always seems to come after PRSTATUS, and probably has to, as PRSTATUS contains the pid and identifies that there is a new thread in the core, so if SIGINFO is found that signal number will just replace the first one.

You want to figure out which one is the accurate signal and use that. It doesn't matter how you do this, but this will be up to the ProcessElfCore or ThreadElfCore classes.

- Never allow a thread's signal number to be 0 when it comes from an ELF core dump. (This is probably as much of a band aid as the first solution.)

Threads should be able to have no signal. If you have 10 threads and thread 6 crashes with SIGABRT, but all other threads were just running, I would expect all threads except for thread 6 to have 0 signal values, or no stop reason. If you end up with 10 threads and all have no signal information, I would say that you can just give the first thread a SIGSTOP to be safe.

- Stick with the first solution of saying that we can never resume a core file. The only thing in this solution's favour is that it means the "real" thread state that gcore tried to preserve is known to lldb. Once the core file is loaded, typing continue does result in an error message telling you that you can't resume from a core file.

The suggested fix can be done in a cleaner way: have ProcessElfCore and ProcessMachCore override "Error Process::WillResume()" to just return an error:

Error ProcessElfCore::WillResume()
{
    return Error("can't resume a process in a core file");
}

So I think the correct fix is all three of the above.

Greg

You want to figure out which one is the accurate signal and use that.
It doesn’t matter how you do this, but this will be up to the
ProcessElfCore or ThreadElfCore classes.

I’m going to do a little more research (books and Google) to see if I can get an answer on this one. I’m actually having trouble finding core files (at least in my own collection) where threads have different signals in info.si_signo in PRSTATUS. For the ones I’ve checked that crashed or received a signal, all the threads have the same value in info.si_signo. Typically just one thread (the thread that triggered or received the signal) has a SIGINFO note. (My collection of cores is a bit random, so that’s not a comprehensive survey by any means.)

I’m getting the impression that the value in PRSTATUS may be for the whole process, with any thread that actually received a signal having a SIGINFO note containing that information, but I’m not totally sure either way. I haven’t found anything that documents that behaviour yet. (If anyone knows of a good reference please let me know!) It would explain why all the threads in a core created by gcore have a SIGINFO note, as each one will be stopped in turn. It would also mean that for the non-gcore cores I’ve got (from crashes and kills) only one thread would have a non-zero signal, which sounds correct. Currently, for those core files, running “thread list” shows all threads as having stopped on the same signal, with only one thread in a position where that signal makes sense. Switching to not use info.si_signo is a slightly bigger change though!

  • Never allow a thread’s signal number to be 0 when it comes from
    an ELF core dump. (This is probably as much of a band aid as the
    first solution.)

Threads should be able to have no signal. If you have 10 threads and
thread 6 crashes with SIGABRT, but all other threads were just
running, I would expect all threads except for thread 6 to have 0
signal values, or no stop reason. If you end up with 10 threads and
all have no signal information, I would say that you can just give
the first thread a SIGSTOP to be safe.

I checked this with one of the gcore files by just setting the first thread’s signal and leaving the others to pick up 0 as they used to. That works.

Putting in a check that makes sure at least one thread has some kind of signal seems reasonable. I’ll add that as a fallback sanity check.
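A rough sketch of what such a fallback check could look like. CoreThread and EnsureSomeThreadHasSignal are hypothetical names for illustration only, and 19 is assumed to be SIGSTOP as on x86-64 Linux; this is not the code from the eventual patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kSigStop = 19;  // SIGSTOP, x86-64 Linux value assumed

// Hypothetical per-thread record; not an LLDB type.
struct CoreThread {
  int tid;
  int signo;  // 0 means "no signal recorded"
};

// If no thread carries a signal, give the first thread SIGSTOP so that at
// least one thread has a stop reason. Returns true if the fallback was used.
inline bool EnsureSomeThreadHasSignal(std::vector<CoreThread> &threads) {
  if (threads.empty())
    return false;
  const bool any_signal =
      std::any_of(threads.begin(), threads.end(),
                  [](const CoreThread &t) { return t.signo != 0; });
  if (any_signal)
    return false;
  threads.front().signo = kSigStop;
  return true;
}

// Example: a gcore-style core where every thread reports signal 0.
inline void ExampleAllZero() {
  std::vector<CoreThread> threads = {{100, 0}, {101, 0}, {102, 0}};
  const bool applied = EnsureSomeThreadHasSignal(threads);
  assert(applied);
  assert(threads[0].signo == kSigStop);
  assert(threads[1].signo == 0 && threads[2].signo == 0);
}
```

Threads that already carry a signal are left alone, so a core from a crash where one thread reports SIGABRT is not modified.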

The suggested fix can be done in a cleaner way: have ProcessElfCore and
ProcessMachCore override “Error Process::WillResume()” to just return an error:

Error ProcessElfCore::WillResume()
{
    return Error("can't resume a process in a core file");
}

I think that’s called too late. It’s not called until the decision has been made to resume the process. Also the base implementation already returns an error and I don’t think either ProcessElfCore or ProcessMachCore override it.

So I think the correct fix is all three of the above.

I think it’s close and discussing the problem is actually helping a lot, thanks for the help. I’ll grab the bug and put up a patch - hopefully tomorrow.

Thanks,

Howard Hellyer
IBM Runtime Technologies, IBM Systems