regarding [Bug 15671] New: backtrace truncated after assertion failure in inferior

I see what's going on here.

/lib/x86_64-linux-gnu/libc.so.6 was built with -fomit-frame-pointer, and it includes eh_frame instructions on how to unwind the frames. But when lldb gets to

#2 0x00007ffff7a4a0ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6

it doesn't have any eh_frame instructions. lldb can figure out the stack pointer value (from frame 1), which tells us the "bottom" of this stack frame, but it can't find the "top" without either eh_frame unwind instructions or knowing which function it is in, so that it can scan the assembly instructions to understand how the stack frame was set up. lldb tries to get a saved frame pointer (rbp), which would give us the "top" of the stack frame, but the saved rbp value it gets (0x40067e0) is obviously invalid.

It might be interesting to see the output of

image show-unwind -n abort

to see exactly what the eh_frame instructions say. This is lldb's interpretation of the eh_frame instructions, of course; it might also be useful to include the output of readelf -wf libc.so.6 or readelf -wF libc.so.6 for the abort() function (going by a readelf man page I found on the web). The log output included this:

th1/fr0 supplying caller's saved reg 16's location, cached
th1/fr1 requested caller's saved PC but this UnwindPlan uses a RA reg; getting reg 16 instead
th1/fr1 supplying caller's saved reg 16's location using eh_frame CFI UnwindPlan
th1/fr1 supplying caller's register 16 from the stack, saved at CFA plus offset
  th1/fr2 pc = 0x00007f216e4850ee

That bit about "this UnwindPlan uses a RA reg" is novel for x86 code; normally you see it in ARM code, where the caller's saved pc value is in the link register on a function call. But as you'd guess from the name abort(), this function may save the caller's register context in an unusual way, so this may be fine.

I'm surprised gdb can unwind this successfully.

As I alluded to above, lldb can profile the assembly language instructions of a function to understand the prologue setup (where registers are saved, how the stack is set up, etc.) -- but to do this, it needs to know the start address of the function. This "#2 0x00007ffff7a4a0ee in ?? ()" frame clearly doesn't have any symbolic information with its address range so lldb can't do its assembly scan. And it doesn't have eh_frame instructions to help either.

On Mac OS X we're often working with binaries that have had most of their symbols stripped. Because accurate function ranges are so valuable to lldb, we supplement the symbol table from two sources: the LC_FUNCTION_STARTS load command and, failing that (this is new), the eh_frame section. LC_FUNCTION_STARTS is an array of LEB128-encoded offsets of the start addresses of all the functions in the file. The first function is at offset 0, etc. It's really compact, typically a few bytes per function. The eh_frame section is another great source of function bounds information, but it tends to be larger and slower to parse. lldb adds fake symbol names for the function ranges it adds, e.g. a fake symbol added to the program Dock might be "___lldb_unnamed_function3491$$Dock".

Of course, given that lldb couldn't find eh_frame instructions for "#2 0x00007ffff7a4a0ee in ?? ()", maybe even that wouldn't have helped.

The only solution I can think of here is if abort()'s eh_frame does provide a saved location for rbp but lldb failed to read it correctly. Else, I have no idea how gdb managed to unwind out of this one.

having done lots of asm debugging with gdb, I can offer a guess. gdb seems to be able to unwind frameless leaf functions with no unwind info, so perhaps as a final fallback it pops the top entry on the stack and treats it as the return pc. if it can unwind the caller using that pc, then it is good.

just a guess...

-Luddy

Yeah, lldb uses similar tricks. If you have eh_frame instructions, unwinding from -fomit-frame-pointer code is easy. And if you have accurate function bounds for all the frames, lldb can usually manage to unwind an -fomit-frame-pointer stack without eh_frame (because it inspects the actual assembly instructions in the prologue to understand the stack setup). But in this particular backtrace we've got -fomit-frame-pointer frames using eh_frame, then one function that doesn't have any symbol name or eh_frame entry, and I honestly have no idea how gdb found its way out of that one. The only reasonable approach here would be to assume that this frame used a frame pointer (rbp), grab the saved rbp value and try to find the caller's pc based on that -- but that failed.

Well, maybe the additional information from Ben (the eh_frame instructions for abort() most importantly) will provide a hint. The only thing I can think is that maybe lldb misinterpreted that function's eh_frame instructions.

J

hi, just to clarify, I regularly write asm with no eh frames, no function bounds, and no .cfi directives. gdb unwinds my leaf functions fine. it is my impression that gdb will, in the absence of frame info, assume that the topmost item on the stack at a trap is a return pc (even though the trapped pc cannot be identified and rbp is invalid, so disassembling the leaf itself is not possible).

put differently, if one can't figure out the leaf, one can grope for the return pc on the stack and try again at the caller. if the return pc points just after a plausible-looking call insn then you're good. hope that makes sense...
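
To make the guess above concrete, here is a minimal Python sketch of that kind of fallback. It is not gdb's actual code: the function names, the flat `code`/`stack` byte buffers and the choice to recognize only two x86-64 call encodings (`call rel32` and `call reg`) are all assumptions made for illustration.

```python
import struct

def looks_like_return_address(code: bytes, code_base: int, addr: int) -> bool:
    """True if addr immediately follows a plausible x86-64 call instruction.

    Only two encodings are checked (an assumption of this sketch):
      E8 xx xx xx xx   call rel32   (5 bytes)
      FF /2, mod == 3  call reg     (2 bytes, e.g. FF D0 = call rax)
    """
    off = addr - code_base
    if 5 <= off <= len(code) and code[off - 5] == 0xE8:
        return True
    if 2 <= off <= len(code) and code[off - 2] == 0xFF:
        modrm = code[off - 1]
        # mod field == 0b11 (register operand), reg field == 2 (CALL)
        return (modrm >> 6) == 3 and ((modrm >> 3) & 7) == 2
    return False

def guess_caller_pc(stack: bytes, code: bytes, code_base: int):
    """Treat the 8-byte little-endian word at the top of the stack as a
    candidate return pc; accept it only if it follows a call instruction."""
    if len(stack) < 8:
        return None
    (candidate,) = struct.unpack_from("<Q", stack, 0)
    if looks_like_return_address(code, code_base, candidate):
        return candidate
    return None
```

A real unwinder would then retry the normal unwind at the returned pc and discard the guess if the caller cannot be unwound either.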

I've updated bugzilla with the output of image show-unwind -n abort. I couldn't attach the output of readelf -wf libc.so.6 (too big) - is there a way to only show info about the abort function? The name 'abort' doesn't appear in the output.

Ben

Hi Jason,

So, this thread is still relevant and reproducible using functionalities/inferior-assert on platforms where libc.so is compiled with -fomit-frame-pointer.

> The only solution I can think of here is if abort()'s eh_frame does provide a saved location for rbp but lldb failed to read it correctly. Else, I have no idea how gdb managed to unwind out of this one.

FYI, the routine RegisterContextLLDB::InitializeNonZerothFrame calls ReadGPRValue for active_row->GetCFARegister(), which allows m_cfa to be set for frame 1 'abort'. When this routine runs for the mystery frame 2, m_sym_ctx.GetAddressRange comes up empty-handed (consistent with gdb's backtrace), so addr_range.GetBaseAddress() is not valid. As a result, m_current_offset is -1, and the routine returns before m_cfa is read, resulting in an invalid frame.

> But in this particular backtrace we've got -fomit-frame-pointer frames using eh_frame, then one function that doesn't have any symbol name or eh_frame entry, and I honestly have no idea how gdb found its way out of that one.

Even if the function for frame 2 doesn't have a symbol name, is it possible that it has an eh_frame entry that we can use?

> The only reasonable approach here would be to assume that this frame used a frame pointer (rbp), grab the saved rbp value and try to find the caller's pc based on that -- but that failed.

So, I see the code that handles the case where a function ends with a call instruction by backing up the PC by one byte. However, ResolveSymbolContextForAddress fails, and SymbolContext::GetAddressRange comes up empty-handed because the function member is null, so addr_range is not set by this code.
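
The reason for backing up by one byte can be shown with a small Python sketch. It assumes a sorted list of `(start, end, name)` symbol ranges; `find_symbol`, `symbol_for_return_address` and the addresses in the usage note are hypothetical and are not lldb's API.

```python
import bisect

def find_symbol(symbols, addr):
    """symbols: list of (start, end, name) sorted by start; end is exclusive.
    Returns the entry whose range contains addr, or None."""
    starts = [start for start, _end, _name in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and symbols[i][0] <= addr < symbols[i][1]:
        return symbols[i]
    return None

def symbol_for_return_address(symbols, return_pc):
    """For a caller frame, look up return_pc - 1 rather than return_pc.

    If the caller's last instruction is a call to a function that never
    returns, the return address is the first byte past the caller, which
    may already belong to the next function."""
    return find_symbol(symbols, return_pc - 1)
```

With `symbols = [(0x1000, 0x1020, "caller"), (0x1020, 0x1040, "next_fn")]` and a return address of 0x1020, a plain lookup lands in `next_fn`, while the pc - 1 lookup finds `caller`. When there is no symbol at either address, as with frame 2 here, both lookups return None and some other source of function bounds is needed.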

Without a function symbol, is there a way to set m_current_offset so that ReadGPRValue can read the saved rbp for frame 2? Thanks,

- Ashok

FYI, gdb can identify the frame addresses for/relative to mystery frame 2 while at the assert site:

(gdb) f 2
#2 0x00007ffff7a4a0ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) info frame
Stack level 2, frame at 0x7fffffffdee0:
rip = 0x7ffff7a4a0ee; saved rip 0x7ffff7a4a192
called by frame at 0x7fffffffdf10, caller of frame at 0x7fffffffde80
Arglist at 0x7fffffffde78, args:
Locals at 0x7fffffffde78, Previous frame's sp is 0x7fffffffdee0
Saved registers:
  rbx at 0x7fffffffdec0, rbp at 0x7fffffffdec8, r12 at 0x7fffffffded0,
  rip at 0x7fffffffded8
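
As a quick check of the numbers gdb reports, the saved-register addresses can be expressed as offsets from the frame address. This short Python sketch assumes that gdb's "frame at" value is the CFA for this frame; the addresses are copied from the output above.

```python
# Values copied from the `info frame` output for frame 2.
cfa = 0x7FFFFFFFDEE0  # "frame at", assumed here to be the CFA
saved = {
    "rbx": 0x7FFFFFFFDEC0,
    "rbp": 0x7FFFFFFFDEC8,
    "r12": 0x7FFFFFFFDED0,
    "rip": 0x7FFFFFFFDED8,
}

# Offset of each saved register's slot relative to the CFA.
offsets = {reg: addr - cfa for reg, addr in saved.items()}

for reg, off in offsets.items():
    print(f"{reg} saved at CFA{off:+d}")
```

The return address sits at CFA-8 and rbp at CFA-24, which is the kind of "saved at CFA plus offset" rule an eh_frame entry would describe, so gdb appears to have unwind information for this frame even without a symbol name.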

- Ashok

Hi Jason,

  Frame 2 did not get a valid CFA for this frame, stopping stack walk

So, the attached patch allows the unwinder to get past frame 2 using eh_frame information that is dug up based on the pc rather than the start address of the function (i.e. to handle the case where the function symbol is unavailable).
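
The idea of finding the unwind entry from the pc alone can be sketched in Python as a search over address ranges. This only illustrates the lookup and is not the patch itself: the `FDE` tuple, `find_fde` and the addresses in the usage note are hypothetical.

```python
import bisect
from typing import List, NamedTuple, Optional

class FDE(NamedTuple):
    start: int   # first address covered by this entry
    size: int    # number of bytes covered
    label: str   # hypothetical stand-in for the parsed unwind rows

def find_fde(fdes_sorted: List[FDE], pc: int) -> Optional[FDE]:
    """Return the entry whose [start, start + size) range contains pc.

    fdes_sorted must be sorted by start address. No symbol or function
    start address is needed, only the pc itself."""
    starts = [fde.start for fde in fdes_sorted]
    i = bisect.bisect_right(starts, pc) - 1
    if i >= 0:
        fde = fdes_sorted[i]
        if fde.start <= pc < fde.start + fde.size:
            return fde
    return None
```

For example, with entries covering 0x2f0a0..0x2f100 and 0x2f150..0x2f1a0 (made-up file addresses), `find_fde(fdes, 0x2f0ee)` returns the first entry even though no symbol names that range.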

This fix is coupled with GetFullUnwindPlanForFrame rather than lowered to UnwindTable and FuncUnwinders. Alternately, I could add or modify routines like GetFuncUnwindersContainingAddress to avoid the requirement for a SymbolContext. Similarly, I could add or modify routines like GetUnwindPlanAtCallSite to allow the caller to specify a pc.

The attached patch also slides m_current_pc in the case where a Symbol is found at pc - 1. Note that the log while adding frame 2 indicates a bogus fp:
  th1/fr2 supplying caller's register 6 from the stack, saved at CFA plus offset
   th1/fr3 fp = 0x00000000004006db

The slide keeps me out of the weeds while adding frame 3 (see the attached log). The combined result is a healthy stack:

(lldb) bt
* thread #1: tid = 0x2987, 0x00007ffba7b23425 libc.so.6`raise + 53, stop reason = signal SIGABRT
     frame #0: 0x00007ffba7b23425 libc.so.6`raise + 53
     frame #1: 0x00007ffba7b26b8b libc.so.6`abort + 379
     frame #2: 0x00007ffba7b1c0ee libc.so.6
     frame #3: 0x00007ffba7b1c192 libc.so.6`__assert_fail + 66
     frame #4: 0x00000000004005c0 a.out`main(argc=1, argv=0x00007fff1ccfbd68) + 112 at main.c:18
     frame #5: 0x00007ffba7b0e76d libc.so.6`__libc_start_main + 237
     frame #6: 0x0000000000400489 a.out`_start + 41

Perhaps it would be helpful to provide a slightly different entry for frame #2 like:
     frame #2: 0x00007ffba7b1c0ee libc.so.6`??? + offset

For now, I set eSkipFrame which is documented as a frame state that indicates that the unwinder found issues and is hoping to recover. Perhaps a new value would better document the fact that the frame goes with a function with no known symbol.

I'll commit this patch by next Monday since this is an important use case for lldb 3.3 (and I assume that WWDC is all-encompassing for a bit), but do fire away with any feedback. Cheers,

- Ashok

pr15671.patch (5.73 KB)

unwind-full.txt (8.36 KB)

Hi Dmitri,

I noticed that the lldb clang buildbot has been consistently failing with a timeout since build #3975:
  http://lab.llvm.org:8011/builders/lldb-x86_64-debian-clang/builds/3975

I'm unable to reproduce the timeout locally using revisions at or later than r35424. Also, the changes in build #3975 and even #3976 don't immediately stand out as candidates for a hang. Finally, the gcc buildbot is running just dandy.

Any chance that the clang buildbot has its ears caught? Thanks,

- Ashok

Or, did the clang version on that machine change recently?

(resending because I forgot to cc lldb-dev)

Hi Ashok, thanks for working on this -- I know the unwinder code can be hard to modify; RegisterContextLLDB.cpp is a little complex in places.

On Mac OS X we have a section in the binary (LC_FUNCTION_STARTS) which has the start address of every function in the binary, even if it has been stripped. It's represented as an array of uleb128's giving the offsets between the functions - it ends up being something like 1.5 bytes per function on typical code, pretty compact.
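
To show why this encoding is so compact, here is a minimal Python sketch of decoding such a list. It assumes the payload is simply a run of ULEB128 deltas, each added to the previous address; handling of any terminator or padding bytes is left out, and the function names and sample bytes are made up for illustration.

```python
def decode_uleb128(data: bytes, pos: int):
    """Decode one unsigned LEB128 value starting at pos.
    Returns (value, position of the next unread byte)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if (byte & 0x80) == 0:             # high bit clear: last byte
            return result, pos
        shift += 7

def function_starts(data: bytes, base: int):
    """Turn a run of ULEB128 deltas into absolute start addresses.
    Each delta is the distance from the previous function start
    (the first one is relative to base)."""
    addrs = []
    addr = base
    pos = 0
    while pos < len(data):
        delta, pos = decode_uleb128(data, pos)
        addr += delta
        addrs.append(addr)
    return addrs
```

The four bytes `10 85 01 03` decode to the deltas 16, 133 and 3, so with a base of 0x1000 the starts are 0x1010, 0x1095 and 0x1098: three functions described in four bytes.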

When ObjectFileMachO is reading the symbols out of a binary, it also reads the LC_FUNCTION_STARTS and creates dummy names for any functions it didn't have a symbol for. e.g. for an app called Dock we might have a name like ___lldb_unnamed_function2951$$Dock.

A recent change to ObjectFileMachO is that it also gets the function start addresses from the eh_frame information if LC_FUNCTION_STARTS doesn't exist:

           // If m_type is eTypeDebugInfo, then this is a dSYM - it will have the load command claiming an eh_frame
           // but it doesn't actually have the eh_frame content. And if we have a dSYM, we don't need to do any
           // of this fill-in-the-missing-symbols works anyway - the debug info should give us all the functions in
           // the module.
           if (text_section_sp.get() && eh_frame_section_sp.get() && m_type != eTypeDebugInfo)
           {
               DWARFCallFrameInfo eh_frame(*this, eh_frame_section_sp, eRegisterKindGCC, true);
               DWARFCallFrameInfo::FunctionAddressAndSizeVector functions;
               eh_frame.GetFunctionAddressAndSizeVector (functions);
               addr_t text_base_addr = text_section_sp->GetFileAddress();
               size_t count = functions.GetSize();
               for (size_t i = 0; i < count; ++i)
               {
                   const DWARFCallFrameInfo::FunctionAddressAndSizeVector::Entry *func = functions.GetEntryAtIndex (i);
                   if (func)
                   {
                       FunctionStarts::Entry function_start_entry;
                       function_start_entry.addr = func->base - text_base_addr;
                       function_starts.Append(function_start_entry);
                   }
               }
           }

(it munges the eh_frame information so that it looks the same as the LC_FUNCTION_STARTS data so the same code path can be used to add them to the symbols).

Wouldn't this be another possible way of handling this on Linux?

There are additional benefits to having accurate function start addresses -- if you're stepping through one of these no-symbol functions, lldb will ignore the eh_frame unwind info (which may not be accurate at every instruction) and instead try to disassemble the instructions and spot register saves / stack changes. To do this, it needs to know the correct start address of the function.

Let me know what you think.

I have just updated Clang to r183602 and now the tests started failing
in a different way:

http://lab.llvm.org:8011/builders/lldb-x86_64-debian-clang/builds/3999

Dmitri

Thanks for the report Dmitri; we are seeing the same thing blocking the
Debian packages, so I'm not sure it's related to the clang version change
(although the GCC buildbot seems unaffected.)

I'm looking at the issue now, but still unsure of the root cause. Will let
you know when it's fixed or more info is available.

Cheers,
Dan

> Hi Ashok, thanks for working on this -- I know the unwinder code can be hard to modify; RegisterContextLLDB.cpp is a little complex in places.

For sure, Jason, thanks for the sophisticated unwinder.

> A recent change to ObjectFileMachO is that it also gets the function start addresses from the eh_frame information if LC_FUNCTION_STARTS doesn't exist:

Nice, I see how that's an advantage in spite of the performance hit. I'll certainly look at reworking ObjectFileELF to add symbols for stripped functions from the eh_frame information.

> Let me know what you think.

Perhaps the best approach is to do both. Having my suggested new code path in the unwinder isn't fundamentally wrong or a performance concern. On the contrary, it unblocks Linux core file support and fixes a high-profile bug for a common use case. I think it also widens the applicability of the unwinder while we look for improvements in other object-file formats (e.g. ObjectFilePECOFF).

If you like the idea, I'm happy to commit & improve,

- Ashok

Ping!

FYI Jason, I verified that the original patch (attached again) continues to apply cleanly and resolve the failure in functionalities/inferior-assert with SVN trunk.

- Ashok

pr15671.patch (5.73 KB)

Hi Ashok, I apologize for taking so long to get back to you on this radar. There are a lot of corner cases handled in RegisterContextLLDB and I wanted to look it over carefully before I said anything.

This patch is fine.

+ if (eh_frame->GetUnwindPlan (m_current_pc, *unwind_plan_sp))
+ {
+ m_frame_type = eSkipFrame; // no symbol context, but we can use eh_frame to get back on track.
+ return unwind_plan_sp;
+ }

I wouldn't use eSkipFrame - wouldn't eNormalFrame work? eSkipFrame was intended to indicate a frame that is known to be invalid, an artifact of following the frame-unwind chain via the architectural default unwind plans. In your case, you have a function with fixed bounds and full unwind information -- you only lack a function name.

A more ambitious solution here would be to have the ObjectFile/SymbolFile ingest the function address ranges from eh_frame and supplement the symbol table with those additional functions, making up names.

I understand why doing this (at initial ObjectFile creation time) is a performance hit on ELF systems. On Mac OS X we have a section in Mach-O with a compact, fast-to-parse list of function start addresses (our LC_FUNCTION_STARTS load command), so doing this unconditionally at ObjectFile creation time makes sense.

The best solution on an ELF system would have lldb get to this point where it's unwinding through a dylib, can't find a symbol for a pc value, CAN find an eh_frame entry for it -- and asks the ObjectFile to supplement its symbol table with eh_frame entries and then uses those.

But I'm not going to ask you to make a change that big - your change is fine and I don't see any problems happening because of it. I'd recommend trying to use eNormalFrame, that's the only change I'd suggest.

Thanks for the careful review, Jason. I committed the patch including your review feedback in r186585 and the Linux buildbots show no regressions.

> The best solution on an ELF system would have lldb get to this point where it's unwinding through a dylib, can't find a symbol for a pc value, CAN find an eh_frame entry for it -- and asks the ObjectFile to supplement its symbol table with eh_frame entries and then uses those.

Thanks for the suggestion, that sounds like an optimal mix of performance and functionality. Perhaps this use-based search would be more efficient if we can use the current PC to limit the parsing to a single FDE.

Looking at the code in ObjectFileMachO, I was thinking it would be good to have a central routine (perhaps Symtab::AddSyntheticSymbol) to construct the symbol name consistently across object-file formats.
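
A rough Python sketch of what such a central routine could look like follows. Symtab::AddSyntheticSymbol is only a suggestion in this thread, so everything here is hypothetical: the dictionary-based symbol table, the function names and the numbering scheme (position in the sorted list of function ranges). The name pattern follows the ___lldb_unnamed_function example quoted earlier for Mac OS X.

```python
def synthetic_name(index: int, module: str) -> str:
    """Build a made-up symbol name for an unnamed function."""
    return f"___lldb_unnamed_function{index}$${module}"

def add_synthetic_symbols(symtab: dict, function_ranges, module: str) -> dict:
    """symtab maps start address -> (size, name).
    function_ranges is an iterable of (start, size) pairs, for example
    taken from eh_frame. Ranges that already have a symbol are left alone;
    the rest get a synthesized name."""
    for index, (start, size) in enumerate(sorted(function_ranges), 1):
        if start not in symtab:
            symtab[start] = (size, synthetic_name(index, module))
    return symtab
```

Keeping the naming in one place would let ELF, Mach-O and other object-file plugins produce the same style of name for the same kind of gap.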

Say, is it possible to run into this situation outside of the unwinder as well (i.e. disassembly or step-in)? Here's what I get for a test case that used to show the partial backtrace:
(lldb) bt
* thread #1: tid = 0x3030, 0x00007f9ca7d17425 libc.so.6`raise + 53, name = 'a.out, stop reason = signal SIGABRT
    frame #0: 0x00007f9ca7d17425 libc.so.6`raise + 53
    frame #1: 0x00007f9ca7d1ab8b libc.so.6`abort + 379
    frame #2: 0x00007f9ca7d100ee libc.so.6
    frame #3: 0x00007f9ca7d10192 libc.so.6`__assert_fail + 66
    frame #4: 0x00000000004005c0 a.out`main(argc=1, argv=0x00007fffa7b6c108) + 112 at main.c:18
    frame #5: 0x00007f9ca7d0276d libc.so.6`__libc_start_main + 237
    frame #6: 0x0000000000400489 a.out`_start + 41
(lldb) disassemble -a 0x00007f9ca7d100ee
error: Could not find function bounds for address 0x7f9ca7d100ee

I tried "log enable --verbose lldb default", but there were no clues as to the reason for the failure. I suspect that the use-based search would need to be implemented in more than one part of lldb. Cheers,

- Ashok

> Thanks for the suggestion, that sounds like an optimal mix of performance and functionality. Perhaps this use-based search would be more efficient if we can use the current PC to limit the parsing to a single FDE.

I believe the most expensive part of reading the eh_frame section is scanning the entire section to build up an index of CIE and FDE entries.
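
A simplified Python sketch of that indexing pass shows where the cost comes from. It assumes a little-endian section and only classifies entries by their header fields; it does not parse augmentation data or address ranges, and `index_eh_frame` is a made-up name rather than an lldb routine.

```python
import struct

def index_eh_frame(section: bytes):
    """Walk every length-prefixed entry in an .eh_frame section.

    Returns a list of (offset, kind) where kind is "CIE" or "FDE".
    Each entry starts with a 4-byte length (0xFFFFFFFF means an 8-byte
    extended length follows; 0 is the terminator). The next 4 bytes are
    0 for a CIE, or a non-zero pointer back to the owning CIE for an FDE."""
    entries = []
    pos = 0
    while pos + 4 <= len(section):
        (length,) = struct.unpack_from("<I", section, pos)
        if length == 0:
            break                      # zero-length terminator
        header = 4
        if length == 0xFFFFFFFF:
            (length,) = struct.unpack_from("<Q", section, pos + 4)
            header = 12
        body = pos + header
        (cie_id,) = struct.unpack_from("<I", section, body)
        entries.append((pos, "CIE" if cie_id == 0 else "FDE"))
        pos = body + length            # length counts the bytes after itself
    return entries
```

Because entries are variable-length and chained only by their length fields, finding the FDE for one pc this way still means walking the entries that precede it, which is why limiting the parse to a single FDE does not save much on its own.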

> Say, is it possible to run into this situation outside of the unwinder as well (i.e. disassembly or step-in)? Here's what I get for a test case that used to show the partial backtrace:
> (lldb) bt
> * thread #1: tid = 0x3030, 0x00007f9ca7d17425 libc.so.6`raise + 53, name = 'a.out, stop reason = signal SIGABRT
>    frame #0: 0x00007f9ca7d17425 libc.so.6`raise + 53
>    frame #1: 0x00007f9ca7d1ab8b libc.so.6`abort + 379
>    frame #2: 0x00007f9ca7d100ee libc.so.6
>    frame #3: 0x00007f9ca7d10192 libc.so.6`__assert_fail + 66
>    frame #4: 0x00000000004005c0 a.out`main(argc=1, argv=0x00007fffa7b6c108) + 112 at main.c:18
>    frame #5: 0x00007f9ca7d0276d libc.so.6`__libc_start_main + 237
>    frame #6: 0x0000000000400489 a.out`_start + 41
> (lldb) disassemble -a 0x00007f9ca7d100ee
> error: Could not find function bounds for address 0x7f9ca7d100ee
>
> I tried "log enable --verbose lldb default", but there were no clues as to the reason for the failure. I suspect that the use-based search would need to be implemented in more than one part of lldb. Cheers,

In this case the unwinder got the function's details for frame 2 (0x00007f9ca7d100ee) from the eh_frame information -- but the rest of lldb doesn't know anything about that function. This is where the ObjectFile needs to add the start/end address and a synthesized name to the list of symbols. (e.g. ___lldb_unnamed_function3087$$libc would be a typical synthesized function name for a stripped binary on Mac OS X).

Say Dmitri,

Did you notice that the lldb clang buildslave is offline?

- Ashok

Say Dmitri,

Did you notice that the lldb clang buildslave is offline again today?

- Ashok