Signal stack unwinding and __kernel_rt_sigreturn

Hi,

I’m trying to use lldb in a project where I need to report stack traces from signal handlers on aarch64 Linux. Even for “synchronous” signal handling, I’m hitting a couple of issues that prevent this from working “out of the box”:

1 - The platform signal handler trampoline function on the stack is __kernel_rt_sigreturn, not sigtramp. I’m wondering whether it’s OK to just add __kernel_rt_sigreturn to the list of trap handler symbol names in PlatformLinux, or whether I need to include it in “UserSpecifiedTrapHandlerFunctionNames”, or do something else.
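To make the two options in point 1 concrete, here is a minimal sketch of a name check that consults both a platform-provided list and a user-specified list. The helper name IsTrapHandlerSymbol and its signature are hypothetical and are not LLDB's actual API; only the symbol names come from this post.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not LLDB's real API: decides whether a symbol name
// should be treated as a signal trampoline ("trap handler").
bool IsTrapHandlerSymbol(const std::string &symbol,
                         const std::vector<std::string> &platform_names,
                         const std::vector<std::string> &user_names) {
  auto contains = [&symbol](const std::vector<std::string> &names) {
    return std::find(names.begin(), names.end(), symbol) != names.end();
  };
  // A match in either list is enough.
  return contains(platform_names) || contains(user_names);
}

// Usage: with __kernel_rt_sigreturn in the platform list, no user-specified
// name is needed. The list contents here are illustrative.
void ExampleTrapHandlerUsage() {
  const std::vector<std::string> platform = {"sigtramp",
                                             "__kernel_rt_sigreturn"};
  assert(IsTrapHandlerSymbol("__kernel_rt_sigreturn", platform, {}));
  assert(!IsTrapHandlerSymbol("main", platform, {}));
}
```

Under this sketch, the practical difference between the two options is only who supplies the name: the platform by default, or each user through configuration.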

2 - When the user handler is invoked, its return address is set to the very first byte of __kernel_rt_sigreturn, which throws off unwinding because we assume that frame must really be at a call in the preceding function. I asked about this on IRC, where Jan Kratochvil mentioned that the decrement shouldn’t happen for frames with S in the eh_frame’s augmentation. I’ve verified that __kernel_rt_sigreturn indeed has the S. I’m not sure where I’d find official documentation about that, but the DWARF Standards Committee’s wiki[1] does link to Ian Lance Taylor’s blog[2] which says “The character ‘S’ in the augmentation string means that this CIE represents a stack frame for the invocation of a signal handler. When unwinding the stack, signal stack frames are handled slightly differently: the instruction pointer is assumed to be before the next instruction to execute rather than after it.” So I’m interested in encoding that knowledge in LLDB, but not sure architecturally whether it would be more appropriate to dig into the eh_frame record earlier, or to just have this be a property of symbols flagged as trap handlers, or something else.
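For illustration, the check for the S flag could be sketched as below. This is a simplification that assumes the augmentation string has already been extracted from the CIE as a plain string; the function name is hypothetical, the example strings are illustrative, and a real parser would walk the augmentation characters in order alongside their associated data.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true if a CIE augmentation string carries the
// 'S' character, which marks the frame as a signal handler invocation frame.
// Simplified: it only looks for the character and parses nothing else.
bool IsSignalFrameAugmentation(const std::string &augmentation) {
  return augmentation.find('S') != std::string::npos;
}

// Usage with illustrative augmentation strings.
void ExampleAugmentationUsage() {
  assert(IsSignalFrameAugmentation("zRS"));  // signal frame
  assert(!IsSignalFrameAugmentation("zR"));  // ordinary frame
}
```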

I’d very much appreciate any feedback on this. I’ve put up a patch[3] on Phab with a testcase that demonstrates the issue (on aarch64 linux) and an implementation of the low-churn “communicate this in the trap handler symbol list” approach.

Thanks,

-Joseph

[1] - Exception Handling - wiki.dwarfstd.org

[2] - Airs – Ian Lance Taylor » .eh_frame

[3] - https://reviews.llvm.org/D63667

Hi,

I'm trying to use lldb in a project where I need to report stack traces from signal handlers, and need to do so on aarch64 Linux. Even for "synchronous" signal handling, I'm hitting a couple issues preventing this from working "out of the box":
1 - The platform signal handler trampoline function on the stack is __kernel_rt_sigreturn, not sigtramp. I'm wondering if it's ok to just add __kernel_rt_sigreturn to the list of trap handler symbol names in PlatformLinux, or if I need to include it in "UserSpecifiedTrapHandlerFunctionNames", or something else

If this is just a Linux kernel matter where the trampoline function can be "sigtramp", "__kernel_rt_sigreturn", or both, and everything else behaves correctly, then this is fine.

2 - When the user handler is invoked, its return address is set to the very first byte of __kernel_rt_sigreturn, which throws off unwinding because we assume that frame must really be at a call in the preceding function. I asked about this on IRC, where Jan Kratochvil mentioned that the decrement shouldn't happen for frames with S in the eh_frame's augmentation. I've verified that __kernel_rt_sigreturn indeed has the S. I'm not sure where I'd find official documentation about that, but the DWARF Standards Committee's wiki[1] does link to Ian Lance Taylor's blog[2] which says "The character ‘S’ in the augmentation string means that this CIE represents a stack frame for the invocation of a signal handler. When unwinding the stack, signal stack frames are handled slightly differently: the instruction pointer is assumed to be before the next instruction to execute rather than after it." So I'm interested in encoding that knowledge in LLDB, but not sure architecturally whether it would be more appropriate to dig into the eh_frame record earlier, or to just have this be a property of symbols flagged as trap handlers, or something else.

If we have hints that unwinding should not back up the PC, then this is fine to use. We need the ability to indicate that an lldb_private::StackFrame behaves like frame zero even when it is in the middle of the stack. I believe the code for sigtramp already does this somehow. I CC'ed Jason Molenda so he can chime in.
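A rough sketch of the "behaves like frame zero" idea follows. The FrameInfo struct and LookupAddress function are hypothetical names for illustration and do not reflect how lldb_private::StackFrame is actually implemented. The assumption is that a normal caller frame holds a return address, so the lookup address is backed up by one byte to land inside the call instruction, while frame zero and signal trampoline frames use the PC unchanged.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-frame record, for illustration only.
struct FrameInfo {
  uint64_t pc;                     // PC (or return address) for this frame
  bool behaves_like_zeroth_frame;  // true for frame 0 and signal frames
};

// Address to use when looking up the function or unwind rule for a frame.
uint64_t LookupAddress(const FrameInfo &frame) {
  if (frame.behaves_like_zeroth_frame || frame.pc == 0)
    return frame.pc;    // PC already points at the instruction of interest
  return frame.pc - 1;  // return address: step back into the call instruction
}

// Usage with made-up addresses: a trampoline frame whose PC is the first
// byte of the function must not be backed up into the preceding function.
void ExampleLookupUsage() {
  const uint64_t trampoline_start = 0x1000;
  assert(LookupAddress({trampoline_start, true}) == trampoline_start);
  assert(LookupAddress({0x2004, false}) == 0x2003);
}
```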

Sorry for the delay in replying. Yes, the discussion over on https://reviews.llvm.org/D63667 is also related. We should record the S flag in the UnwindPlan, but because of the order of operations, always fetching the eh_frame UnwindPlan to see whether this is a signal handler would be expensive: we try to delay fetching the eh_frame as long as possible because we pay a one-time cost per binary to scan the section.
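The cost tradeoff can be sketched as follows. All names here (LazyEhFrameIndex, IsTrapHandlerByName, SymbolSet) are hypothetical and this is not how LLDB structures its unwind code; the sketch only assumes that the eh_frame scan is a one-time cost per binary that is deferred until first use, and that a symbol-name check can answer without triggering it.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

using SymbolSet = std::unordered_set<std::string>;

// Hypothetical per-binary index of functions whose CIE carries the S flag.
// The scan callback stands in for the expensive eh_frame section scan.
class LazyEhFrameIndex {
public:
  explicit LazyEhFrameIndex(std::function<SymbolSet()> scan)
      : scan_(std::move(scan)) {}

  // Pays the scan cost on the first call only, then answers from the cache.
  bool HasSignalFlag(const std::string &symbol) {
    if (!scanned_) {
      signal_symbols_ = scan_();
      scanned_ = true;
      ++scan_count_;
    }
    return signal_symbols_.count(symbol) != 0;
  }

  int scan_count() const { return scan_count_; }

private:
  std::function<SymbolSet()> scan_;
  SymbolSet signal_symbols_;
  bool scanned_ = false;
  int scan_count_ = 0;
};

// Cheap check against a list of known names; never touches eh_frame.
bool IsTrapHandlerByName(const std::string &symbol,
                         const SymbolSet &trap_names) {
  return trap_names.count(symbol) != 0;
}

// Usage: the name check leaves the scan untouched; repeated eh_frame
// queries scan exactly once.
void ExampleLazyScanUsage() {
  LazyEhFrameIndex index([] { return SymbolSet{"__kernel_rt_sigreturn"}; });
  const SymbolSet trap_names = {"__kernel_rt_sigreturn"};
  assert(IsTrapHandlerByName("__kernel_rt_sigreturn", trap_names));
  assert(index.scan_count() == 0);
  assert(index.HasSignalFlag("__kernel_rt_sigreturn"));
  assert(!index.HasSignalFlag("main"));
  assert(index.scan_count() == 1);
}
```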