Question (bug?) about thread tids when lldb loads a core dump.

When I load a core dump with lldb 3.6, the tids displayed in “thread list” and, more importantly, returned by lldb::SBThread::GetThreadID() (used in an lldb plugin I wrote) don’t match the tids from the live session the core was generated from. gdb displays the correct tids on the same core dump (see below). My plugin depends on that API returning the same tid each thread had during the live session.

Can anybody help?

lldb-3.6

(lldb) target create --core /tmp/core-corerun-6-1000-1000-20870-1438119958 /tmp/corerun
Core file '/tmp/core-corerun-6-1000-1000-20870-1438119958' (x86_64) was loaded.
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007f01fe64d267 libc.so.6`gsignal + 55, name = 'corerun', stop reason = signal SIGABRT
    frame #0: 0x00007f01fe64d267 libc.so.6`gsignal + 55
-> 0x7f01fe64d267: addb %al, (%rax)
   0x7f01fe64d269: addb %al, (%rax)
   0x7f01fe64d26b: addb %al, (%rax)
   0x7f01fe64d26d: addb %al, (%rax)
  thread #2: tid = 0x0001, 0x00007f01fe7138dd libc.so.6`__poll + 45, stop reason = signal SIGABRT
    frame #0: 0x00007f01fe7138dd libc.so.6`__poll + 45
-> 0x7f01fe7138dd: addb %al, (%rax)
   0x7f01fe7138df: addb %al, (%rax)
   0x7f01fe7138e1: addb %al, (%rax)
   0x7f01fe7138e3: addb %al, (%rax)
  thread #3: tid = 0x0002, 0x00007f01fd27dda0 libpthread.so.0`__pthread_cond_wait + 192, stop reason = signal SIGABRT
    frame #0: 0x00007f01fd27dda0 libpthread.so.0`__pthread_cond_wait + 192
-> 0x7f01fd27dda0: addb %al, (%rax)
   0x7f01fd27dda2: addb %al, (%rax)
   0x7f01fd27dda4: addb %al, (%rax)
   0x7f01fd27dda6: addb %al, (%rax)
  thread #4: tid = 0x0003, 0x00007f01fd27e149 libpthread.so.0`__pthread_cond_timedwait + 297, stop reason = signal SIGABRT
    frame #0: 0x00007f01fd27e149 libpthread.so.0`__pthread_cond_timedwait + 297
-> 0x7f01fd27e149: addb %al, (%rax)
   0x7f01fd27e14b: addb %al, (%rax)
   0x7f01fd27e14d: addb %al, (%rax)
   0x7f01fd27e14f: addb %al, (%rax)
  thread #5: tid = 0x0004, 0x00007f01fe70f28d libc.so.6`__open64 + 45, stop reason = signal SIGABRT
    frame #0: 0x00007f01fe70f28d libc.so.6`__open64 + 45
-> 0x7f01fe70f28d: addb %al, (%rax)
   0x7f01fe70f28f: addb %al, (%rax)
   0x7f01fe70f291: addb %al, (%rax)
   0x7f01fe70f293: addb %al, (%rax)
  thread #6: tid = 0x0005, 0x00007f01fe70f49d libc.so.6`__read + 45, stop reason = signal SIGABRT
    frame #0: 0x00007f01fe70f49d libc.so.6`__read + 45
-> 0x7f01fe70f49d: addb %al, (%rax)
   0x7f01fe70f49f: addb %al, (%rax)
   0x7f01fe70f4a1: addb %al, (%rax)
   0x7f01fe70f4a3: addb %al, (%rax)

(lldb) thread list
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007f01fe64d267 libc.so.6`gsignal + 55, name = 'corerun', stop reason = signal SIGABRT
  thread #2: tid = 0x0001, 0x00007f01fe7138dd libc.so.6`__poll + 45, stop reason = signal SIGABRT
  thread #3: tid = 0x0002, 0x00007f01fd27dda0 libpthread.so.0`__pthread_cond_wait + 192, stop reason = signal SIGABRT
  thread #4: tid = 0x0003, 0x00007f01fd27e149 libpthread.so.0`__pthread_cond_timedwait + 297, stop reason = signal SIGABRT
  thread #5: tid = 0x0004, 0x00007f01fe70f28d libc.so.6`__open64 + 45, stop reason = signal SIGABRT
  thread #6: tid = 0x0005, 0x00007f01fe70f49d libc.so.6`__read + 45, stop reason = signal SIGABRT

I noticed that while studying the code in order to determine how to do the same thing for Windows mini dumps. Note that the loop index is treated as the thread ID in ProcessElfCore::UpdateThreadList:

    // Note: "tid" here is only the index into m_thread_data, not a kernel thread ID.
    for (lldb::tid_t tid = 0; tid < num_threads; ++tid)
    {
        const ThreadData &td = m_thread_data[tid];
        lldb::ThreadSP thread_sp(new ThreadElfCore (*this, tid, td));
        new_thread_list.AddThread (thread_sp);
    }

I wondered if this was intentional, to avoid confusion between the dead threads and any live threads that might happen to be using a recycled thread ID.

I suspect it's not intentional, and that it just wasn't apparent to
the original author how to obtain the tid. For FreeBSD the tid is
(somewhat unintuitively) found in the pr_pid field of the NT_PRSTATUS
note. I've put a change in review (http://reviews.llvm.org/D11652)
that fixes this for FreeBSD:

(lldb) thread list
Process 0 stopped
* thread #1: tid = 102802, 0x00000008008fa4fa
libc.so.7`__sys_nanosleep + 10 at _nanosleep.S:3, name = 'sleep', stop
reason = signal SIGABRT

If someone can tell me where to obtain the Linux tid I'll update the patch.
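
To make the pr_pid idea concrete, here is a minimal, self-contained sketch of walking the notes in a PT_NOTE segment and pulling a 32-bit pr_pid out of every NT_PRSTATUS note. It is not the code from D11652. The names CollectThreadIds and AppendNote are hypothetical, the byte offset of pr_pid inside the descriptor is supplied by the caller because it differs per OS and ABI, and the sketch assumes the core file and the host share a byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// ELF note type for per-thread process status (NT_PRSTATUS is 1 in <elf.h>).
constexpr std::uint32_t kNtPrstatus = 1;

// Note name and descriptor are each padded to a 4-byte boundary.
constexpr std::size_t AlignUp4(std::size_t n) {
    return (n + 3) & ~static_cast<std::size_t>(3);
}

// Collect pr_pid from every NT_PRSTATUS note in a raw PT_NOTE buffer.
// pr_pid_offset is an ASSUMPTION supplied by the caller, not read from the file.
std::vector<std::int32_t> CollectThreadIds(const std::vector<std::uint8_t> &notes,
                                           std::size_t pr_pid_offset) {
    std::vector<std::int32_t> tids;
    std::size_t pos = 0;
    while (pos + 12 <= notes.size()) {
        std::uint32_t namesz, descsz, type;
        std::memcpy(&namesz, &notes[pos], 4);
        std::memcpy(&descsz, &notes[pos + 4], 4);
        std::memcpy(&type, &notes[pos + 8], 4);
        const std::size_t desc_pos = pos + 12 + AlignUp4(namesz);
        // Stop on a truncated or malformed note rather than reading past the end.
        if (desc_pos > notes.size() || descsz > notes.size() - desc_pos)
            break;
        if (type == kNtPrstatus && pr_pid_offset + sizeof(std::int32_t) <= descsz) {
            std::int32_t tid;
            std::memcpy(&tid, &notes[desc_pos + pr_pid_offset], sizeof(tid));
            tids.push_back(tid);
        }
        pos = desc_pos + AlignUp4(descsz);
    }
    return tids;
}

// Hypothetical helper that builds one note, used only to exercise the parser.
void AppendNote(std::vector<std::uint8_t> &out, const std::string &name,
                std::uint32_t type, const std::vector<std::uint8_t> &desc) {
    assert(desc.size() <= 0xffffffffu); // descsz must fit in 32 bits
    const std::uint32_t namesz = static_cast<std::uint32_t>(name.size() + 1); // counts the NUL
    const std::uint32_t descsz = static_cast<std::uint32_t>(desc.size());
    const std::uint32_t header[3] = {namesz, descsz, type};
    const std::uint8_t *h = reinterpret_cast<const std::uint8_t *>(header);
    out.insert(out.end(), h, h + sizeof(header));
    out.insert(out.end(), name.begin(), name.end());
    out.resize(out.size() + (AlignUp4(namesz) - name.size()), 0); // NUL plus padding
    out.insert(out.end(), desc.begin(), desc.end());
    out.resize(out.size() + (AlignUp4(descsz) - descsz), 0);
}
```

For an x86_64 Linux core the call would probably look like CollectThreadIds(notes, 32), on the assumption that pr_pid sits 32 bytes into elf_prstatus there and holds the per-thread id; that offset should be checked against the platform headers before relying on it.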

Would be great if we had a test that verified this. I think we could do this by making a small program that gets its own main thread id at runtime and stores it in a local variable. Generate a core dump while stopped at a breakpoint right after the variable is initialized. Then have the test verify that whatever command reports the current thread has the same value as the variable.
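
A rough sketch of the inferior side of such a test, for Linux only, might look like the following. The function names are hypothetical, and reading the tid with the raw SYS_gettid syscall is an assumption about how the test program would obtain it; other systems need a different call.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/syscall.h>
#include <unistd.h>

// Linux-only: kernel thread ID of the calling thread.
inline std::int64_t CurrentKernelTid() {
    return static_cast<std::int64_t>(::syscall(SYS_gettid));
}

// Body of a hypothetical test inferior. The tid is kept in a volatile local so
// it is not optimized away and can be read from the frame in the core file.
inline std::int64_t RecordMainThreadId() {
    volatile std::int64_t main_tid = CurrentKernelTid();
    assert(main_tid > 0);
    // A real inferior would stop here (breakpoint or abort()) so that the core
    // is written while main_tid is live; the test then compares it with the
    // tid the debugger reports for the selected thread.
    return main_tid;
}
```

The test would then load the core, read main_tid from frame #0 of the stopped thread, and compare it with the value returned by SBThread::GetThreadID().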

Yes, we definitely need tests. We've discussed core file tests a bit
in the past, but haven't come to a resolution as I recall.

I'm not a fan of generating cores on the fly in the tests; we should
be able to test core loading for all supported targets, and I'd rather
not spam system logs with crash reports in order to run a test. I
think we could instead just commit a set of test executables and
associated core files to the repository. Some effort is probably
necessary to reduce the size of core files on certain operating
systems -- on FreeBSD we end up with core files of at least ~4MB, due
to malloc defaults.

I started working on collecting userland core files from various
operating systems a while back:
https://github.com/emaste/userland-cores
I can take another look with a goal of producing a representative
sample that could be used for LLDB tests.

User-level core files on Mac OS X are huge (500MB for a nothing app). This is because most of the system libraries are munged into a unified "shared cache" and that gets written out in toto in the core. Just a heads up...

Jim

I know why this is happening: the ELF core file plugin was copied from the Mach-O core file plugin. In Mach-O core files, load commands describe each thread's registers, but nothing in the load command carries a thread ID that says "this is thread 0x7f01fc42a700".

If ELF core files actually do have the thread IDs in the core files, then this is easy to fix and is an LLDB bug.

Greg
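
The two cases described above (a format that records thread IDs and one that does not) can be sketched with a stand-in for the thread-list loop. ParsedThread and AssignThreadIds are hypothetical names and this is not LLDB code; it only shows the recorded tid being used when present, with the loop index kept as a fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using tid_t = std::uint64_t;

// Hypothetical stand-in for the per-thread data parsed from a core file.
struct ParsedThread {
    bool has_kernel_tid; // false when the format records no thread ID
    tid_t kernel_tid;    // only meaningful when has_kernel_tid is true
};

// Use the recorded tid when there is one; otherwise fall back to the index,
// which is what the loop quoted earlier does for every thread.
// Caveat: a fallback index could collide with a real tid if the two are mixed.
std::vector<tid_t> AssignThreadIds(const std::vector<ParsedThread> &threads) {
    std::vector<tid_t> ids;
    ids.reserve(threads.size());
    for (std::size_t i = 0; i < threads.size(); ++i) {
        const ParsedThread &t = threads[i];
        ids.push_back(t.has_kernel_tid ? t.kernel_tid : static_cast<tid_t>(i));
    }
    assert(ids.size() == threads.size());
    return ids;
}
```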

Hmm, so the way this works on Windows is that inside the core file are the names of libraries along with a corresponding hash of the library. When the core file is read, the debugger searches a pre-defined set of locations for symbol information for the libraries (a symbol file which claims to match the name and hash of the library written in the core).

Has any thought ever been given to doing something like this on Mac? It seems a little wasteful to have every core file duplicate the same information about system libraries.

Even easier, what about a flag when generating a core that just says “don’t write system libraries into the core”? Sure, it might degrade the debugging experience, but I think there are still some basic things you can do without that information.

For Windows core file tests I had planned to just make these lightweight core files (in the windows universe these are called “minidumps”), which are usually less than 100KB.
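
The name-plus-hash lookup described above can be sketched as a search over an ordered list of symbol stores. Everything here is hypothetical (ModuleRecord, SymbolStore, FindSymbolFile, and the in-memory map standing in for a directory or server); it illustrates the matching rule only, not how any real debugger implements it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// What the dump records for each loaded library.
struct ModuleRecord {
    std::string name;
    std::string hash;
};

// Hypothetical symbol store: (library name, hash) -> symbol file path.
using SymbolStore = std::map<std::pair<std::string, std::string>, std::string>;

// Search the stores in order and return the first symbol file whose name and
// hash both match; an empty string means no matching symbols were found.
std::string FindSymbolFile(const ModuleRecord &module,
                           const std::vector<SymbolStore> &search_order) {
    assert(!module.name.empty());
    for (const SymbolStore &store : search_order) {
        const auto it = store.find(std::make_pair(module.name, module.hash));
        if (it != store.end())
            return it->second;
    }
    return std::string();
}
```

Because only the name and hash travel in the dump, the library contents themselves never need to be written out, which is what keeps such dumps small.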

Hmm, OK.

I can get FreeBSD cores from my 3-thread test app to about 1MB with
some malloc tuning, but I think committing to the main repo won't fly
if we have some operating systems with huge cores. Do you think it's
reasonable to stash large test binaries and core files in some
auxiliary repo or storage location, and have users optionally fetch
them? The test could be skipped if the file doesn't exist or doesn't
match an expected hash.
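
The skip rule could be sketched roughly as below. The names are hypothetical, and FNV-1a is used only as a dependency-free stand-in for whatever digest a real test suite would pick (most likely a cryptographic hash).

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <istream>
#include <string>

// 64-bit FNV-1a over a stream; a stand-in for a real digest (assumption).
std::uint64_t Fnv1a64(std::istream &in) {
    std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
    char c;
    while (in.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL; // FNV prime
    }
    return h;
}

enum class CoreFixture { Ok, Missing, HashMismatch };

// Decide whether an optional core-file fixture is usable. A test would run
// only on Ok and report itself as skipped in the other two cases.
CoreFixture CheckCoreFixture(const std::string &path, std::uint64_t expected_hash) {
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return CoreFixture::Missing;
    return Fnv1a64(in) == expected_hash ? CoreFixture::Ok : CoreFixture::HashMismatch;
}
```

Keeping the expected hash in the main repository next to the test means a stale or partially downloaded fixture leads to a skip rather than a confusing failure.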

http://reviews.llvm.org/D11652 has the fix for FreeBSD. I just now
uploaded a new diff to continue assigning tids 0, 1, 2... for Linux
until someone can update it.

OS X crash reports are vended as a text crash log, which does contain hashes & load locations of all the libraries involved, the backtraces of all the threads, and the register state of the crashing thread. In my experience this is usually good enough, and if it isn't, the core files seldom add that crucial missing detail, since that crucial missing detail was something that was true 30 seconds ago...

So I don't think the Core OS folks are likely to spend a lot of time worrying about the size of user space core files.

Jim

But can a user generate one of these from inside the debugger, transfer it to another machine, load it in the debugger and use it to diagnose a problem? The easy portability of core dumps between machines has always been one of the best things about them on Windows. I can just email someone a core file when their program crashes and they can load it up and take a look. Not to mention that it makes testing easier, since we can check in core files all day long and not have a noticeable impact on the repo.

In any case, just an idea. We’ll probably check in dumps for testing the windows core file process plugin, so wherever possible we can try to come up with tests that are platform agnostic, but it would still be nice long term if it were possible to get better coverage on other platforms as well.

Yes, this all works, including preserving the UUID's of the loaded libraries so you can symbolicate the crashes after the fact. Provided of course that you don't mind being mailed a 500MB file.

Jim

Oh, I must have misunderstood your earlier post. I thought you were saying that the OS X crash reports (text based) were much smaller than 500MB, so I was asking if the debugger could produce and/or load those.

Ah, I think I've seen python scripts out there to dump a crash log format list of the current threads, and the lldb.macosx.crashlog command knows how to parse and create a debug session from those files. But that code is pretty much unrelated to the core file reading code.

So while it would be interesting to test text-file crash log reading as a separate issue, it wouldn't test the core file reading code, which was what we were talking about in this thread.

Jim

I think it's going to be too late for 3.7 since we don't have the fix
in place yet. I'm not aware of the Linux ELF core info myself and
don't have a Linux machine handy to test at the moment.

Some time ago I planned to collect sample core files, and if someone
can clone https://github.com/emaste/userland-cores, run it, and add
the resulting core from Linux I'll see if it's straightforward to
update the ELF core plugin to handle it.