[RFC] Improving compact x86-64 compact unwind descriptors

Here is our proposal to extend/enhance the x86-64 compact unwind
descriptors to fully describe the prologue/epilogue for asynchronous
unwinding. I believe there are missing/lacking CFI directives as well,
but I'll save that for another thread.

Asynchronous Compact Unwind Descriptors
Ron Brender, VMS Software, Inc.
Revised January 25, 2018

1 Introduction
This document proposes means to extend so-called compact unwind
descriptors to support fully
asynchronous exception handling. This will make extended compact unwind
descriptors an
alternative to DWARF CFI (call frame information) for achieving
asynchronous exception
support.

Compact unwind descriptors can and have been used in both 64- and 32-bit
environments.
However, this proposal addresses only 64-bit environments. While the
ideas presented here can
be readily adapted for use in a 32-bit environment, for simplicity this
document makes no
attempt to do so.

There are generally three kinds of information that together form the
heart of modern software
exception handling systems:

1. Information that is used to divide the remaining unwind information
into groups that are
specific to particular regions of memory (often, but not
necessarily, associated with a
single function) as well as provide a way to efficiently search for
and identify the
grouping that is associated with a particular address in memory.

2. Information that can be used to virtually or actually unwind from
the call frame of an
executing function to the call frame of its caller (at the point of
the call).

3. Identification of an associated personality routine that is invoked
by the general exception
handling mechanization to guide how processing should proceed for a
given function, as
well as additional “language specific data” needed for the
personality routine to do its
job. Note that the personality routine and its related data are
specified as an adjunct to
compiled code and totally opaque to the general mechanism (other
than the specified
interface).

Note in particular that C++ exception handling is built on top of a
personality routine and
language specific data area ABI that itself can be implemented using
either DWARE CFI or
extended compact unwind information as described here. The choice
between the two is
transparent to C++.

2 Compact Unwind Overview
This section provides a brief overview of key features of the LLVM
compact unwind design. It
does not attempt a comprehensive re-statement of all aspects of the
design except to the extent
necessary to motivate and understand later proposed changes and
enhancements.

2.1 Compact Unwind Group Description
A compact unwind group consist of five fields, as follows:

63 32 31 0 |
-------------------------------------------------------------------|
STARTING-ADDRESS |
LENGTH | COMPACT-UNWIND-DESCRIPTION |
PERSONALITY-FUNCTION-POINTER |
LANGUAGE-SPECIFIC-DATA-ADDRESS |
-------------------------------------------------------------------|

STARTING-ADDRESS (64-bits) is the lowest address of a region of memory
occupied by some
code, typically the entry point of a function.

LENGTH (32-bits) is the number of bytes included in this group,
typically including all and only
the code of a function.

COMPACT-UNWIND-DESCRIPTION (32-bits) is a description of the fully
formed frame of a
function and how to unwind it. This is described further following.

PERSONALITY-FUNCTION-POINTER (64-bit) is a pointer to the personality
routine.

LANGUAGE-SPECIFIC-DATA-ADDRESS (64-bits, sometimes abbreviated LSDA) is a
pointer to some data to be passed to the personality routine when it is
called.

A key observation is that the starting address plus length way of
describing a group means that
the set of groups for a compilation unit need not describe all of the
code in that unit. In
particular, it appears to be expected that no unwind information need be
generated for leaf
functions.

On the other hand, it is reasonable to expect that the groups that are
emitted are ordered by the
starting address. This means that a simple and fast binary search can be
used to map an address
to the group that applies to that address, if any.

It is useful to note that the run-time representation of unwind
information can vary from little
more than a simple concatenation of the compile-time information to a
substantial rewriting of
unwind information by the linker. The proposal favors simple
concatenation while maintaining
the same ordering of groups as their associated code.

2.2 Compact Unwind Frame Description
A compact unwind frame description describes a frame in sufficient
detail to be able to unwind
that frame to the frame of its caller.

31 28 27 24 23 0 |
-------------------------------------------------------------------|
FLAGS | MODE | |
-------------------------------------------------------------------|

At the top most level, there are four bits that are not of further
interest here. Interpretation of
these bits is neither used nor changed.

Also at the top-level is a 4-bit mode field. This is the tag of a
discriminated (tagged, variant)
union that selects the interpretation of the remaining 24 bits.

Of the 16 possible modes, only 5 are defined :

Code Meaning Description
0 Old “Old” is presumed to refer to some historical
design that is no longer of interest.
It is treated here as Reserved.
1 RBP-based The frame uses the RBP register as a frame pointer.
The size of the frame can
frame vary during execution.
2 RSP-based The frame uses RSP as the frame pointer. The size
of the frame is fixed (at
frame compilation time).
3 Large RSP- The frame uses RSP as the frame pointer, The size
of the frame is fixed (at
based frame compilation time); however, that size is too large
to express within this 32-bit
descriptor encoding.
4 DWARF The frame, for whatever reason, cannot be
adequately described using the
escape compact unwind frame description. The remaining
24-bits are an index into
what the DWARF standard calls the .debug_frame
section (__eh_frame in
LLVM).

2.2.1 RBP-based Frame (MODE=1)
For a RBP-based frame, the remaining 24 bits are encoded as follows:

23 16 | 15 | 14 0 |
-------------------------------------------------------------------|
OFFSET | 0 | REGS |
-------------------------------------------------------------------|

In a RBP-based frame the RBP register is pushed on the stack immediately
after the return
address, then RSP is moved to RBP. To unwind, RSP is restored with the
current RPB value,
then RBP is restored by popping off the stack, and the return is done by
popping the stack once
more into the instruction pointer.

All preserved registers are saved in a small range in the stack that
starts at RBP-8 to RBP-2040.
The offset/8 relative to RBP is encoded in the 8-bit OFFSET field. The
registers saved are
encoded in the 15-bit REGS field as five 3-bit entries.

2.2.2 RSP-Based Frame (MODE=2)
For a RSP-based frame, the remaining 24 bits are encoded as follows:

23 16 | 15 13 | 12 10 | 9 0 |
-------------------------------------------------------------------|
SIZE | | CNT | REG_PERM |
-------------------------------------------------------------------|

In a RSP-based frame the stack pointer serves directly as the frame
pointer and RBP is available
for use as a general register. Upon entry, the stack pointer is
decremented by 8*SIZE bytes (the
maximum stack allocation is thus 2040 bytes). To unwind, the stack size
is added to the stack
pointer, and completed by popping the stack once more into the
instruction pointer.

All preserved registers are saved on the stack immediately after the
return address. The number
of registers saved (up to 6) is encoded in the 3-bit CNT field. The
11-bit REG_PERM field
encodes which registers were saved and in what order.

2.2.3 Large RSP-Based Frame (MODE=3)
For a large RSP-based frame, the remaining 24 bits are encoded as follows:

23 16 | 15 13 | 12 10 | 9 0 |
-------------------------------------------------------------------|
SIZE | ADJ | CNT | REG_PERM |
-------------------------------------------------------------------|

This case is like the previous, except the stack size is too large to
encode in the compact unwind
encoding. Instead, the function must include a "subq $nnnnnnnn, RSP"
instruction in its
prologue to allocate the stack. The offset from the entry point of the
function to the nnnnnnnn
value in the function is given in the SIZE field.

Depending on the exact instructions used to save registers (PUSH versus
MOV), the nnnnnnnn
value in the instruction stream may not be quite the full stack size.
ADJ * 8 is the additional
adjustment needed to get the actual size.

2.2.4 DWARF Escape (MODE=4)
The frame, for whatever reason, cannot be adequately described using a
compact unwind frame
description. The remaining 24-bits are an index into what the DWARF
standard calls the
.debug_frame section (called __eh_frame in LLVM).

3 Asynchronous Changes and Enhancements
It is immediately obvious that omission of unwind information for leaf
functions (with any kind
of frame) precludes handling an exception that might occur during its
execution. It follows that
unwind information must cover all of the code of a module (with one
exception discussed
below). But if successive unwind groups are ordered (as previously
assumed) and also leave no
gaps, then the LENGTH field is redundant and can be omitted. The
beginning address of a
following group is always one byte past the end of the predecessor
group. There remains only the
question of how to identify the last group of a set.

It should also be clear that the unwind representation described in the
prior section is not
sufficient to unwind from an asynchronous exception that might occur in
either the prologue or
epilogue of a function. To see this consider what would happen if an
exception occurred at either
the entry point or the return instruction of either a RBP- or RSP-frame
function. To be able to
handle asynchronous exceptions at any point during function execution,
it is necessary to add
additional information to each unwind group.

These two considerations can be combined. The result is simply to
repurpose the LENGTH field
to encode prologue and epilogue information.

3.1 Extended MODEs
To preserve backward compatibility and to allow intermixing of
traditional and extended
compact unwind groups, new MODEs are defined as follows:

Code Meaning Description

Hi John & Ron, I read through the proposal and had a couple of quick observations.

1. The proposed encoding assumes that the epilogue instructions always come at the end of the function -- or rather, just before the next function. If there is a stack protector __stack_chk_fail sequence, or there is NOP padding between functions, then the epilogue cannot be expressed. The proposed encoding allows for instructions scheduled before the prologue, as long as they're guaranteed to never throw an exception or spill registers, but there's no similar affordance for after the epilogue.

2. In section 3.1, "A null frame (MODE = 8) is the simplest possible frame, with no allocated stack of either kind (hence no saved registers). A null frame might be considered just a special case of a RSP-based frame with a stack size of zero. But unlike other frames, its frame pointer is usually not 16-byte aligned." Did you meant stack pointer here? The x86_64 SysV abi requires 16-byte alignment of the stack pointer before a CALL instruction, but I don't know of any alignment requirements on the frame pointer.

3. In the same section, you propose an alternative "default null frame group" be defined,

"If the first attempt to lookup an unwind group for an exception address fails, then it is (tentatively) assumed to have occurred within a null frame function or in a part of a function that is adequately described by a null frame. The presumed return address is (virtually or actually) popped from the top of stack and looked up. This second attempted lookup must succeed, in which case processing continues normally. A failure is a fatal error."

I suspect I missed a part of the proposal covering this, but if entries are described by only start address, is there a problem of the previous non-null function's range extending & picking up these functions if they don't have entries?

4. For frameless functions -- small-stack RSP functions -- you're assuming that the compiler saves and restores spilled registers with MOV instructions instead of PUSH/POP instructions. I think this is the normal behavior of the compiler, something like

  subq $0x98, %rsp
  movq %rsi, 0x10(%rsp)
  movq %rdi, 0x18(%rsp)

or whatever, but the proposed encoding can only allow for one change in stack size. The current encoding requires this for large-stack RSP functions (we get the offset to the RSP instruction where we read the amount being decremented in the compact unwind encoding), but for small stack frameless functions the compiler can today express in compact unwind a combination of sub and push instructions as long as it's a fixed amount within the range that the small-stack RSP encoding can express.

John and Ron,

I developed the original compact unwind implementation for macOS 10.6 back in 2009. I tried to leave space in the design to support finer grain exception handling such as for asynchronous or for the shrink wrap optimization. The idea I had at the time was instead of having just one 32-bit compact unwind info per function, there could be an array of them each covering a different range. All but the first would have the UNWIND_IS_NOT_FUNCTION_START bit set.

On macOS, the linker takes the 32-byte “compact unwind group description” records and folds them down into a complete different (read only) runtime table which basically maps an offset in the DSO into a 32-bit compact info for that address.

It seems like from your proposal that your linker keeps the “compact unwind group description” records as is and just concatenates them, and the runtime would need to understand the prolog/epilog encoding. I would think that just widening the group record (as you suggest in 5.1) would be much simpler (like SHT_REL was widened to SHT_RELA).

-Nick

>Hi John & Ron, I read through the proposal and had a couple of quick
observations.

    Hi Jason. Thanks for the interesting observations and questions.

>1. The proposed encoding assumes that the epilogue instructions always
come at the end of the  function -- or rather, just before the next
function.  If there is a stack protector __stack_chk_fail sequence, or
there is NOP padding between functions, then the epilogue cannot be
expressed.  The proposed encoding allows for instructions scheduled
before the prologue, as long as they're guaranteed to never throw an
exception or spill registers, but there's no similar affordance for
after the epilogue.

    There are at least two alternatives for this situation:
    a.  The main group can be ended at the RET instruction and
        a new group started to cover the following bytes.
        Interestingly, both NOPs and a __stack_chk_fail call
        sequence are not really unwindable instructions.
        Perhaps a new NOUNWIND mode should be added for such
        cases; this would protect against an asynchronous
        exception attempting what is likely a dangerous stack
        walk.
    b.  Starting a new group is admittedly a rather large hammer
        that is perhaps OK if rare but not if it is common. If
        common, an alternative might be to widen the E (epilogue)
        field to allow more idiomatic alternatives. 4 bits might
        be enough: 8 codes for a RET followed by 0 to 7 NOPs +
        1 code for RET + __stack_chk_fail call + 1 code for
        just a __stack_chk_fail call + 1 code for no RET and
        nothing following. For anything else, fall back to a.
    Personally I rather like the combination of a. + b.

>2. In section 3.1, "A null frame (MODE = 8) is the simplest possible
frame, with no allocated stack of either kind (hence no saved
registers). A null frame might be considered just a special case of a
RSP-based frame with a stack size of zero. But unlike other frames,
its frame pointer is usually not 16-byte aligned." Did you meant stack
pointer here?  The x86_64 SysV abi requires 16-byte alignment of the
stack pointer before a CALL instruction, but I don't know of any
alignment requirements on the frame pointer.

    Yes, it would be clearer to say the stack pointer is not
    16-byte aligned.

    Aside: in my vocabulary, the frame pointer for a RSP-based
    or null frame function is the stack pointer. (And I find it
    an abuse of the English language to refer to a RSP-based
    frame as "frameless".) So what I wrote strictly is correct,
    but it does tend to confuse. Sorry...

3. In the same section, you propose an alternative "default null frame
group" be defined,

>"If the first attempt to lookup an unwind group for an exception
address fails, then it is (tentatively) assumed to have occurred
within a null frame function or in a part of a function that is
adequately described by a null frame. The presumed return address is
(virtually or actually) popped from the top of stack and looked up.
This second attempted lookup must succeed, in which case processing
continues normally. A failure is a fatal error."

I suspect I missed a part of the proposal covering this, but if
entries are described by only start address, is there a problem of the
previous non-null function's range extending & picking up these
functions if they don't have entries?

    For normal entries, I admit I sorta hand waved in Section 3.4
    over how to determine the end address of the last group. There
    are lots of alternatives. Most involve some kind of dummy
    extended parameter group (MODE =12) where the STARTING-ADDRESS
    is one more than the last address of the last "real" group.
    There might also be linker assistance of some kind. In any
    case I think it a problem that need not be solved upfront but
    can wait for broader issues to settle.

    The default unwind group has no starting address, of course.
    It is defined by what is left over after all other normal
    groups have been considered.

>4. For frameless functions -- small-stack RSP functions -- you're
assuming that the compiler saves and restores spilled registers with
MOV instructions instead of PUSH/POP instructions.  I think this is
the normal behavior of the compiler, something like

  subq    $0x98, %rsp
  movq    %rsi, 0x10(%rsp)
  movq    %rdi, 0x18(%rsp)

or whatever, but the proposed encoding can only allow for one change
in stack size.  The current encoding requires this for large-stack RSP
functions (we get the offset to the RSP instruction where we read the
amount being decremented in the compact unwind encoding), but for
small stack frameless functions the compiler can today express in
compact unwind a combination of sub and push instructions as long as
it's a fixed amount within the range that the small-stack RSP encoding
can express.

    You are quite right, with the understanding that "the range
    that the small-stack RSP encoding can express" is in fact the
    body of the function and excludes the prologue and epilogue.
    Our proposal, of course, is trying to describe the prologue and
    epilogue as well.

    My understanding of modern x86 processors is that there is
    no meaningful performance difference between using pushes/pops
    versus moves. That is, using only moves in prologues and
    epilogues is a negligible price to pay in our quest for
    a simple and compact unwind representation.
    But I do take the point that this should be mentioned in
    Section 3.5 regarding code generation implications.

Thanks again,
Ron


Hi Nick,

It is a pleasure to be in contact with the creator of the compact unwind approach!

I can see how an array of 32-bit unwind blocks could be used to describe each distinct point within a function (within a prolog in particular). But then you end up with six or seven or more such blocks for a large percentage of functions, don’t you? Seems like a lot of additional space for something that is usually simple, compact (sic) and idiomatic and dilutes the original benefit. We are seeking a summary description in a small fixed size to supplement the base unwind block.

Since you mention it, why have the UNWIND_IS_NOT_FUNCTION_START flag? Seems like you don’t need it for unwinding purposes. Especially in a function that has a code layout where the entry point is not the first/lowest addressed instruction of the function (which I have seen in some GEM produced code for Alpha). An aside, but just curious.

Not being a MAC guy the macOS two-level lookup scheme has always been rather a mystery. What little I know is gleaned from the compact_unwind_encoding.h file found around the net. But I have never seen the .c/.cxx/.cpp that goes with it to better understand how the mapping works. Always figured the was considered proprietary but maybe I just don’t know where/how to look. Is there public code and/or design note information to look at?

You are correct that this paper tends toward trying to create a scheme with little or no linker involvement. There are good reasons to ask more of the linker for better space economy, so this is still under discussion.

You are also correct that the unwinder needs to interpret the additional prolog/epilog information. But the rest of your comment has me confused: the extra 2-bytes is just a widening as you suggest, right? And there is no Section 5.1–did you mean 3.1?

Thanks again,

Ron

For what it's worth, there are two implementations of consuming linked compact unwind in lldb http://lldb.llvm.org . In tools/compact-unwind there is a standalone reader that I wrote in 2014 to make sure I understood the format from the compact_unwind_encoding.h header when I first read it. I should really go back and make sure that it is correct all the way through or remove it, I have a vague memory that it may not interpret something correctly in how it decodes the unwind state information.

I can more strongly recommend the reader in source/Symbol/CompactUnwindInfo.cpp -- lldb has been using this information for unwinds for a few years now. This is written as a part of lldb so it won't be quite as simple to understand, but it's all there. lldb has been using this for for a few years now. I don't read/use the LSDA or personality information yet.

Hi Nick,

It is a pleasure to be in contact with the creator of the compact unwind approach!

I can see how an array of 32-bit unwind blocks could be used to describe each distinct point within a function (within a prolog in particular). But then you end up with six or seven or more such blocks for a large percentage of functions, don’t you?

You would normally just need one region for the prolog and one for the epilog. In the compact unwind info you can have mode that describes which kind of prolog/epilog is used for all the common cases. Only if instructions are scheduled into the middle of a prolog would not need more ranges.

Seems like a lot of additional space for something that is usually simple, compact (sic) and idiomatic and dilutes the original benefit. We are seeking a summary description in a small fixed size to supplement the base unwind block.

Since you mention it, why have the UNWIND_IS_NOT_FUNCTION_START flag? Seems like you don’t need it for unwinding purposes.

The LSDA info for C++ encodes the landing pads (e.g. catch clause inside a function) as offsets from the start of the function. If each compact info range was independent, the runtime could not the function start to pass to the personality function, which the personality function adds to the offset and computes the landing pad. By marking the non-start compact infos this way, the runtime can walk backwards until to it finds the first range without that bit and know it is the actual start of the function.

Especially in a function that has a code layout where the entry point is not the first/lowest addressed instruction of the function (which I have seen in some GEM produced code for Alpha). An aside, but just curious.

Not being a MAC guy the macOS two-level lookup scheme has always been rather a mystery. What little I know is gleaned from the compact_unwind_encoding.h file found around the net. But I have never seen the .c/.cxx/.cpp that goes with it to better understand how the mapping works.

You can see the runtime unwinder that parses the compact unwind section in UnwindCursor<>::getInfoFromCompactEncodingSection() in
https://llvm.org/svn/llvm-project/libunwind/trunk/src/UnwindCursor.hpp

The goal is to make the section read-only (no load time fixups), small, and fast

Always figured the was considered proprietary but maybe I just don’t know where/how to look. Is there public code and/or design note information to look at?

You are correct that this paper tends toward trying to create a scheme with little or no linker involvement. There are good reasons to ask more of the linker for better space economy, so this is still under discussion.

If there is no linker optimization, does that mean at runtime the array of group entries in unsorted and you have to do a linear search?

You are also correct that the unwinder needs to interpret the additional prolog/epilog information. But the rest of your comment has me confused: the extra 2-bytes is just a widening as you suggest, right? And there is no Section 5.1–did you mean 3.1?

In the PDF I see section 5 “Other OpenVMS Refinements”. I read that as maybe changing the layout of the group record (widening).

-Nick

Note: the first message is truncated by the lists.llvm.org to Discourse transition.

Original message: https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html


@JohnReagan Thanks for sharing this RFC. I’m quite interested in this work. A few questions:

  • Size comparison: Do you have any measurements on the size difference between the extended compact unwind descriptors versus regular .eh_frame sections? I’m curious about the space savings (or cost) in typical binaries.
  • Upstream plans: Is any of this work planned to be upstreamed to LLVM?
  • Implementation details: If upstreaming isn’t feasible or planned, would it be possible to share an outline of the required changes (assembler, linker, runtime library, debugger)? This would help other ELF platforms (like Linux).

The ability to handle asynchronous unwinding with support for personality/LSDA is quite valuable (as .eh_frame can then be mostly or entirely eliminated), especially for size-constrained environments like mobile devices.
While there is some hype about SFrame, I think its viability as the userspace profiling solution on ELF platforms (like Linux) remains an open question due to its larger size.
OpenVMS x86-64 will be helpful in providing real-world data on compact unwind effectiveness on ELF.

Aside: The format is compatible with APX since R16-R31 are caller-saved registers.


A table in the original message does not render properly due to wrapping. Here is the fixed table

Issue/Problem                                        Possible Resolution
-------------------------------------------------------------------------------------------------
The first part of the prologue is too large to be    Use two unwind groups:
described by the 8-bit RxP-FRAME-SIZE1 field.        1) a null frame group of any appropriate size,
                                                     2) a suitable RxP-based frame.

The second part of the prologue is too large to be   Use three unwind groups:
described by the 8-bit RxP-FRAME-SIZE2 field.        1) a RSP-frame group to describe the first part of
                                                     the prologue only (PROLOGUE-SIZE1 = the
                                                     length of the entire group and PROLOGUE-SIZE2 = 0,
                                                     2) a RSP-frame group to describe the second part
                                                     of the prologue (PROLOGUE-SIZE1 & 2 both = 0
                                                     and no registers saved),
                                                     3) a suitable RxP-based frame group (with
                                                     PROLOGUE-SIZE1 & 2 both = 0).

There is more than one return location (without a    Generally use one unwind group for each sequence
shared epilogue sequence).                           of code ending with a RET. A group following a
                                                     RET will have no prologue (PROLOGUE-SIZE1
                                                     & 2 both = 0).

There is more than one entry point to a function     Generally use one unwind group for each sequence
(for example, a FORTRAN multiple entry point         of code beginning at an entry point. Each group
subroutine or analogous assembly language            may have a normal prologue and all but the last
equivalents)                                         might have NO epilogue (E flag clear).

edit (Nov 4 and 8). My notes on the original compact unwinding information

  • 32-bit descriptors: Group description records (compiler output) are compressed further to 32-bit compact unwind descriptors.
  • Standard prologue/epilogue encoding: Most functions have regular prologues and epilogues. With some code generation restrictions (e.g., register saving order, no unrelated instruction in the prologue/epilogue) they can be encoded with a 32-bit descriptor, while the rare cases use DWARF CFI fallback.
  • Two-level page table structure: This compresses 32-bit descriptors further and enables efficient lookup.
    • The section header references an arbitrary number of pages. Each page is 4096 bytes and describes several hundred entries.
    • The page header consists of a common first level header and a regular or compressed second level header.
    • First level headers are stored consecutively after personality routines. The first level describes: the first address mapped by this page (enabling smaller relative addresses in entries), the offset to the second level, and the base offset into the LSDA array.
    • Second level headers are stored within the respective 4096-byte pages.
    • The compressed second level enables local palettes for common descriptors.
  • Compressed page entries: In a page using the compressed second level header, an entry encodes a 24-bit relative instruction address and an 8-bit descriptor index.
  • Separate LSDA storage: LSDA entries are stored separately as 8-byte (function_offset, lsda_offset) pairs. This optimizes for the no-LSDA case but is expensive when more than 50% entries need LSDA. A quick estimate in a following comment from aengelke shows that the current approach is beneficial.

Notes on the OpenVMS extension

This approach makes reasonable assumptions:

  • Any use of preserved registers must be delayed until all preserved registers have been saved.
  • In a function with an RBP-based frame, the prologue must have adjacent push rbp; mov rbp, rsp instructions, and the epilogue must have adjacent mov rsp, rbp; pop rbp instructions.
  • For shrink wrap optimization with a personality routine, instructions that might cause exceptions (floating point exceptions and access violations) cannot be moved into the prologue.

It then repurposes the length field:

  • A single unwind group describes a (prologue_part1, prologue_part2, body, epilogue) tuple.
  • The prologue is conceptually split into two parts: the first part extends up to and including the instruction that decreases RSP; the second part extends to a point after the last preserved register is saved but before any preserved register is modified (this location is not unique, providing flexibility).
    • When unwinding in the prologue, the RSP register value can be inferred from the PC and the set of saved registers.
  • Since register restoration is idempotent (restoring preserved registers multiple times during unwinding causes no harm), there is no need to describe pop $reg sequences. The unwind group just need one bit to describe whether the 1-byte ret instruction is present.
  • The length field in the compact unwind group descriptor is repurposed to describe the prologue’s two parts.
  • By composing multiple unwind groups, potentially with zero-sized prologues or omitting ret instructions in epilogues, it can describe functions with shrink wrapping or tail duplication optimization.
  • Null frame group (with no prologue or epilogue) is the default and can be used to describe trampolines and PLT stubs.
2 Likes

I’m wondering whether it would be feasible to implement this inside .eh_frame using a new version number for CIEs? I.e., CIE v2 would be length/id=0/version=2/personality_ptr (or null); FDEs referring to v2 CIEs could be length/id=CIE-ref/func_start/func_size/LSDA ptr (if personality_ptr is not null)/64-bit unwind descriptor/(length/next unwind descriptor/… if one is insufficient).

Realistically, .eh_frame is not going away entirely and will still be needed for cases that cannot be adequately expressed. The infrastructure for .eh_frame (+.eh_frame_hdr) is already there in linkers, unwinders, and debuggers. I think extending .eh_frame instead of adding a new section could reduce implementation effort. (It would also reduce the size of object files, as not every COMDAT group would needs its own unwind section).

1 Like

I think you’d end up losing out on a significant amount of the size savings this way. I’ll have to dig around for the data again, but I’ve found for Meta’s Android apps that the actual unwinding encodings were less than 50% of eh_frame size; the rest was all the other overhead from the CIE and FDE structures. Even a function that requires no unwinding information at all (e.g. a leaf function with no stack usage) has 28 bytes of overhead (20 bytes for the FDE and 8 bytes for eh_frame_hdr), and an internal option to eliminate those reduced overall eh_frame size by 25%.

2 Likes

tl;dr: I think we can get to 16B/function and I’m fairly certain that we will need format changes.


I think we can get down to ~16 bytes per function in the common case by increasing the size of a .eh_frame_hdr entry to 16 bytes and omitting FDEs.

eh_frame_hdr entry: 4B address, 4B LSDA, 8B unwind descriptor.

The unwind descriptor from the OpenVMS proposal needs to be modified to add as follows: bits 46..0 unmodified; bit 47 (formerly epilogue indicator) gets expanded to >=7 bits “epilogue size”; some of the reserved bits (e.g. 4 bits) get repurposed into a “personality function ID”.

  • Epilogue size: we can roughly infer the size of the function from the next eh_frame_hdr entry, but due to padding between functions (and shrink wrapping), we can’t do so not exactly. I’d expect being able to encode 127 bytes of padding should cover most cases (leaving 127=no epilogue).

  • Personality function ID: most binaries have 1 or 2 (rarely a few more) different personality functions (including the absence, typically one per language). Therefore, add a table of personality function references as separate array to .eh_frame_hdr and use a few bits to index this table.

There can be multiple .eh_frame_hdr entries for a function, if one unwind descriptor is insufficient (this is needed for multiple epilogues). For entries with a compact unwinding descriptor in .eh_frame_hdr, the linker can drop the FDE. This would require minor additional effort in the linker implementation. As an additional benefit, unwinding requires no indirection.


The proposed (extended) compact unwinding format (link to mail here, the message in Discourse is truncated) will likely need some changes, in particular to handle the extremely common case of an epilogue in the middle of the function (I’d rather not change the generated code here to avoid performance losses). (Tail duplication is also costly and every tail would require a .eh_frame_hdr entry, we might want to do that more conservatively.) The prologue/epilogue description for RSP-based descriptors is insufficient to accurately describe the stack adjustments. It should describe exactly that it must be a push*-add (prologue) or sub-pop* sequence; the size of the pop instructions is derivable from the register encoding. The register permutation is unspecified. With the variable epilogue offset, the RET needs no longer to be specified as being part of the epilogue.

(I wanted to gather data on the applicability of the format to current binaries and started to write a script that encodes DWARF CFI into compact unwind descriptors and ran into these issues. The code is very WIP, so no data yet, but I already noted that many functions required more than one unwind descriptor with the current format.)

APX instructions PUSHP/POPP/PUSH2P/POP2P need to be supported through flags and the instruction sequence needs to be specified, otherwise it is impossible to have correct async unwind info.


That’s fascinating and rather unexpected – this means that >25% of your functions are leaf functions? Or did you also eliminated the frame info for functions that don’t need it for unwinding? For stack-less leaf functions, I see how frame info could be omitted, but in other cases, it would break async unwinding.

1 Like

Ah, I think the big difference is that we only require synchronous unwind information (for exception handling), so we only enable that (-funwind-tables -fno-asynchronous-unwind-tables). I can see the unwind information portion being much larger for asynchronous vs. synchronous unwind information.

Note that the linker optimizes compact unwind descriptors into a paged structure (as described in The Apple Compact Unwinding Format: Documented and Explained - Faultlore). With that, you usually end up with 4 bytes per function, plus the amortized overhead of the first-level pages (which should be pretty small across functions).

Anyone looking at different approaches for reducing the size of unwind might also want to look at Windows Arm64 unwind (ARM64 exception handling | Microsoft Learn). In the best case, it’s 8 bytes per function, and it supports asynchronous unwind. It achieves this by imposing expectations on the compiler’s codegen: you get the best results if your function prologue and epilogue follow a very specific format.

1 Like

@MaskRay has previously expressed admiration for pdata/xdata in Stack unwinding | MaskRay as well. I assume the main attraction of compact unwind here is that it’s already supported in libunwind, so it’s potentially easier to adopt.

Note that Apple’s arm64 ABI also imposes various constraints around the use of frame pointers and the order in which registers must be saved, which the compact unwind encoding takes advantage of. That’s also something we’d have to adapt for ELF (either by following the same rules in the compiler or by tweaking the unwind encoding to be more flexible).

Yes and both gcc and llvm use a different order of register saving/restoring on aarch64. I think only the ppc abis define order and location of the callee saved registers really. So a compact unwinding for Linux has to support both gcc and llvm and more since there is no defined way in the abi. Apple got away with it since they decided llvm is the abi even without fully documenting it.

I guess we should clarify the design goals. So far, I assumed we want:

  • Support for existing code as produced by GCC and LLVM, permitting only minor changes to prologue/epilogue sequences (e.g., restricting prologue instructions, register ordering, no interleaving of prologue/epilogue instructions with others, implies no use of callee-saved regs before end of prologue). Changes that might hurt performance (e.g., no shrink wrapping, single epilogue at end of function) or require substantial changes (different stack frame layout) are probably unacceptable?
  • Asynchronous unwinding for reliable stack traces (e.g., profiling, crashes).
  • Primary focus on programs where many functions have a LSDA (e.g., C++ with exceptions, Rust always seems to set a personality function).
  • Most code should be described only by compact descriptors, requiring a non-compact format (DWARF CFI or other) only rarely. (We might want to compare the tracing performance against frame-pointer and sframe tracing to see whether this is actually worth it.)

The key differences seem to be:

  • LSDA pointers are stored in a separate array. This heavily optimizes for the case of no LSDA and adds 8 bytes per function that has a LSDA.
  • Only synchronous unwind information.
  • At most 4 personality functions (seems fine).
  • Incompatible with GCC’s frame layout (x29 points to the bottom of the stack frame) (also pointed out by @pinskia).
  • Functions that need exception handlers need an xdata record.
  • While I’d expect an xdata-like to provide size benefits over DWARF CFI (primarily as the epilogue doesn’t need to be encoded), I’d intuitively expect most xdata records to be not particularly small (pdata=eh_frame_hdr size; xdata is 8B+4B exception handler+opcodes (uneducated guess: 12 bytes on average?); FDE is 17B+4B LSDA+opcodes (maybe 30 bytes on average for single epilogue for AArch64?). x86-64 FDEs tend to be smaller due to the red zone.

Do we have (or can we acquire) real-world data on the importance of LSDA pointers? That would be helpful information for further experiments (in addition to comments on the design goals).

2 Likes

First of all, it’s really fun to think about this in detail, so thanks for all the back and forth here :slight_smile:

For compact unwinding, Apple is able to cleverly co-design the platform ABI (mandating the use of frame pointers and a particular register saving order) and format to achieve great results, but I agree that it’s likely too restrictive for Linux in general. It could probably be made to work for more constrained environments (e.g. Clang is also the only officially supported compiler for Android), but we should aim for something more general. I like the general principle of having a more efficient encoding for standard formats but still supporting other encodings.

I agree with all of your design goals except for the emphasis on LSDA. Even for C++ with exceptions, you should only need a personality if your function catches exceptions or has an object with a non-trivial destructor live across a call which may throw. I’m barely familiar with Rust, but I believe they support panics either aborting or unwinding, and you shouldn’t need the personality if panics abort. Compiler Explorer shows Rust also using personalities only when needed, at least for this very basic example.

Meta’s Android applications are a mix of C, C++ (with exceptions enabled), and Rust (which I believe is built with -Cpanic=abort). I checked a few applications, and the numbers of FDEs with LSDA is between 16.9% and 19.8% (note that this is without the optimization to omit no-op FDEs that I mentioned earlier). Our server binaries are primarily C++ with exceptions enabled, and the number of FDEs with LSDA on a couple that I measured is between 37.3% and 39.7%. It’d be great to get more numbers from elsewhere, but I’d be surprised if they were significantly higher. I’m using the following to get the numbers, for reference:

llvm-dwarfdump --eh-frame binary_or_sos |
  awk '{ if ($0 ~ / FDE /) fde += 1; else if ($0 ~ /LSDA Address:/) lsda += 1; } END { print lsda, fde, lsda / fde * 100 }'

Looking at pdata/xdata in more detail, the main inefficiency I see is that each xdata record encodes the exception handler RVA (which I presume is the personality pointer), which will likely be mostly redundant. On the other hand, the exception handler data is stored inline, which saves an offset. Speaking of that data, I’m taking it as a given that we’re sticking to the __gxx_personality_v0 format. IIRC ARM EHABI defines its own format, but Clang uses the GCC format anyway. Microsoft also overhauled theirs a few years ago, but I haven’t examined it closely enough to see if there’s improvements that’d be applicable to the GCC format.

The other part that confused me was the function fragments discussion, which says:

For each separated secondary fragment that has its own prolog, it’s expected that no stack adjustment is done in its prolog. All stack space required by a secondary region must be pre-allocated by its parent region (or called host region). This preallocation keeps stack pointer manipulation strictly in the function’s original prolog.

A naïve reading would imply that shrink-wrapping must therefore still allocate a stack frame in all cases, but that’s not actually the case: Compiler Explorer (or Compiler Explorer for a more complex example). I guess the initial part of the function doesn’t need pdata at all, per another part of the spec (which incidentally matches my optimization to omit no-op FDEs):

The phrase “stack-manipulating” is significant: leaf functions that don’t require any local storage, and don’t need to save/restore non-volatile registers, don’t require a .pdata record. These records should be explicitly omitted to save space. An unwind from one of these functions can get the return address directly from lr to move up to the caller.

I still need to fully grok how the notion of fragments works (and why e.g. the more complex example needs two separate fragments for the $LN4 and $LN5 regions instead of a single fragment with two epilogs), but overall the scheme seems pretty complete.

xdata is 8B+4B exception handler+opcodes

It could be 4 bytes if you have a single epilog and set E = 1, right?

2 Likes

The 2018 proposal isn’t clear how to encode a function in -ffunction-sections builds. Using two unwind groups for a single function would be too expensive, or require non-trivial linke adjustment of the next descriptor.

Your suggestion to replace the 1-bit RET presence indicator with a byte count after the epilogue shall address the ambiguity.
One idea for the 16-byte descriptor:

uint32_t start_address;
// If we want to optimize for -fno-exceptions or rustc -cpanic=abort, then move this to separate storage, leading to 12-byte structure.
uint32_t lsda;

uint32_t prologue_offset : 8; // part1
uint32_t prologue_len : 8; // part2
uint32_t padding_after_epilogue : 8; // number of bytes to the next start_address
uint32_t personality_index : 3;
uint32_t is_not_function_start : 1;
uint32_t mode : 4;

// Extended from 24 bits to 32 bits
uint32_t mode_specific_encoding;

The Windows unwinding model supports retrieving saved registers within a partially executed prologue or epilogue.
However, this stronger asynchronous unwinding support isn’t needed for profiling purposes.

The optional page table (essentially a two-level lookup table) can extend start_address and lsda to support code segments larger than 2GiB as well as deduplicate repeated descriptors.

Regarding personality index, 2 bits is probably too restricted as only 3 entries are reserved for personality routines, with C++ and ObjC taking 2 of them. Since https://github.com/llvm/llvm-project/commit/e60b30d5e3878e7d91f8872ec4c4dca00d4a2dfc, LLVM uses DWARF fallback by default for non-canonical personality routines.

Regarding is_not_function_start: In .gcc_except_table, call site offsets and landing pad offsets are relative to the function start address.
When a function has multiple compact unwind descriptors, and the Itanium C++ ABI Level 1 unwinder finds a descriptor that is not the first, it needs to scan backwards to find the function start address.

Extended encoding bits should help with AArch64 pauth_key and X86 APX.

2 Likes

Even when building with exceptions, the highest percentage of functions I’ve seen internally with LSDA is 39%. It’d be helpful to gather more data, but given that number and also the large number of projects that build with -fno-exceptions, my inclination would be having shorter descriptors and storing LSDA on the side where needed. We can do so exclusively in the linker representation though (and see below for more on that).

The other thing to figure out would be if we need to support asynchronous exceptions or just asynchronous stack tracing, since we would potentially need to fully unwind prologs/epilogs for the former (like the Windows unwinding model). I can’t think of a case where you’d need to fully unwind a function in the middle of a prolog/epilog, but there must be a reason Windows supports that (some searching suggests that you can use SEH to catch stack overflows, though that seems pretty niche). @smithp35 specifically mentioned asynchronous exception support in Was any form of compact unwind information considered for AArch64? · Issue #344 · ARM-software/abi-aa · GitHub; maybe he had something in mind?

Alternatively, the original proposal requires code generation restrictions so that you don’t need to restore any registers inside a prolog and can restore all registers inside an epilog, and so that no potentially-faulting instructions can be placed in the prolog when performing shrink wrapping. Should we place the same restrictions? I can try to gather some data on how many places in our existing binaries violate those restrictions, and I believe @aengelke was doing the same.

The original proposal is centered around the compile-time representation of the proposal. I believe the transformed linker representation should also be defined up-front even if we want to keep it optional, because it offers some important benefits:

  • Deduplicating descriptors is important for size.
  • Binary searching the two-level structure should be much more efficient; minimizing page faults is important for performance, especially on mobile.
2 Likes

Thank you for collecting the data, this is valuable.

A quick estimate shows that storing the LSDA reference inline is only beneficial if >50% of the functions need one. (If we assume that an out-of-line table takes 8B/fn with LSDA (start+LSDA) and storing the LSDA ptr inline takes 4B/every fn.; this number will be higher in practice for functions with multiple descriptors.)

I collected some numbers from the distribution binaries and libraries installed on one of our Ubuntu and Fedora machines. In total, ~9% FDEs referred to a LSDA. Files with >50% (and >100 FDEs) are rare (primarily libz3 (18k/64%), libgrpc (6k/60%) and some Python packages). The Rust binaries I looked at (e.g. cargo, librustc_driver.so, fish) tend to average at ~30%.

I now agree that storing the LSDA reference in a separate table (similar to what Apple is doing) is beneficial.

I don’t think so? E=1 seems to imply one epilogue, but it still needs to encode the offset? It is unclear to me, when the compiler sets E=1, though.

I think that non-trivial linker logic is unavoidable for a good format (e.g. only the linker knows the start of the next descriptor and hence must adjust padding_after_epilogue, indexing personality functions, for addresses and constructing the search table).

W.r.t. -ffunction-sections: IMO, it’s easiest for the compiler to just dump the compact unwind information as a v2 FDE/CIE into the existing .eh_frame (see above). That avoids the costs of one extra section per comdat group and the logic for special-casing this is already in place everywhere. Ideally, linkers that have no support for compact unwinding will continue to handle the format just fine (although the result will be less efficient). A linker with compact unwinding support will transform this into a more efficient representation. The unwinder needs to handle both formats (compact search table + v2 FDE), but the latter should be very simple to implement.

If we store the LSDA references in a separate search table, that would not be needed.

Some people do unwind (e.g., throw an exception) from signal handles on Linux.

Besides, on x86, there isn’t much of a difference when rbp is not used: registers are saved/restored with push/pop, and the location of each of these instructions needs to be known to compute the CFA at every instruction. I’d say that shrink wrapping is primarily used to hoist/sink instructions before the prologue/after the epilogue and that restricting the prologue/epilogue instruction sequences is reasonable. I know that GCC sometimes mixes instructions into the prologue, which would be really hard to support. If they consider this as highly important for performance (I wouldn’t expect that it matters much, but haven’t measured), they can fall back to DWARF.

I’ll hopefully have time later this weekend to gather data on:

  • Number of functions where compact unwinding should be applicable.
  • Number of descriptors per function.
  • Number of unique descriptors for deduplication (I’d guess that due to different stack frame, prologue, and padding sizes only parts could be deduplicated).
3 Likes