[RFC] Improving compact x86-64 compact unwind descriptors

Here is our proposal to extend/enhance the x86-64 compact unwind
descriptors to fully describe the prologue/epilogue for asynchronous
unwinding. I believe there are missing/lacking CFI directives as well,
but I'll save that for another thread.

Asynchronous Compact Unwind Descriptors
Ron Brender, VMS Software, Inc.
Revised January 25, 2018

1 Introduction
This document proposes means to extend so-called compact unwind
descriptors to support fully
asynchronous exception handling. This will make extended compact unwind
descriptors an
alternative to DWARF CFI (call frame information) for achieving
asynchronous exception
support.

Compact unwind descriptors can and have been used in both 64- and 32-bit
environments.
However, this proposal addresses only 64-bit environments. While the
ideas presented here can
be readily adapted for use in a 32-bit environment, for simplicity this
document makes no
attempt to do so.

There are generally three kinds of information that together form the
heart of modern software
exception handling systems:

1. Information that is used to divide the remaining unwind information
into groups that are
specific to particular regions of memory (often, but not
necessarily, associated with a
single function) as well as provide a way to efficiently search for
and identify the
grouping that is associated with a particular address in memory.

2. Information that can be used to virtually or actually unwind from
the call frame of an
executing function to the call frame of its caller (at the point of
the call).

3. Identification of an associated personality routine that is invoked
by the general exception
handling mechanization to guide how processing should proceed for a
given function, as
well as additional “language specific data” needed for the
personality routine to do its
job. Note that the personality routine and its related data are
specified as an adjunct to
compiled code and totally opaque to the general mechanism (other
than the specified
interface).

Note in particular that C++ exception handling is built on top of a
personality routine and
language specific data area ABI that itself can be implemented using
either DWARE CFI or
extended compact unwind information as described here. The choice
between the two is
transparent to C++.

2 Compact Unwind Overview
This section provides a brief overview of key features of the LLVM
compact unwind design. It
does not attempt a comprehensive re-statement of all aspects of the
design except to the extent
necessary to motivate and understand later proposed changes and
enhancements.

2.1 Compact Unwind Group Description
A compact unwind group consist of five fields, as follows:

63 32 31 0 |
-------------------------------------------------------------------|
STARTING-ADDRESS |
LENGTH | COMPACT-UNWIND-DESCRIPTION |
PERSONALITY-FUNCTION-POINTER |
LANGUAGE-SPECIFIC-DATA-ADDRESS |
-------------------------------------------------------------------|

STARTING-ADDRESS (64-bits) is the lowest address of a region of memory
occupied by some
code, typically the entry point of a function.

LENGTH (32-bits) is the number of bytes included in this group,
typically including all and only
the code of a function.

COMPACT-UNWIND-DESCRIPTION (32-bits) is a description of the fully
formed frame of a
function and how to unwind it. This is described further following.

PERSONALITY-FUNCTION-POINTER (64-bit) is a pointer to the personality
routine.

LANGUAGE-SPECIFIC-DATA-ADDRESS (64-bits, sometimes abbreviated LSDA) is a
pointer to some data to be passed to the personality routine when it is
called.

A key observation is that the starting address plus length way of
describing a group means that
the set of groups for a compilation unit need not describe all of the
code in that unit. In
particular, it appears to be expected that no unwind information need be
generated for leaf
functions.

On the other hand, it is reasonable to expect that the groups that are
emitted are ordered by the
starting address. This means that a simple and fast binary search can be
used to map an address
to the group that applies to that address, if any.

It is useful to note that the run-time representation of unwind
information can vary from little
more than a simple concatenation of the compile-time information to a
substantial rewriting of
unwind information by the linker. The proposal favors simple
concatenation while maintaining
the same ordering of groups as their associated code.

2.2 Compact Unwind Frame Description
A compact unwind frame description describes a frame in sufficient
detail to be able to unwind
that frame to the frame of its caller.

31 28 27 24 23 0 |
-------------------------------------------------------------------|
FLAGS | MODE | |
-------------------------------------------------------------------|

At the top most level, there are four bits that are not of further
interest here. Interpretation of
these bits is neither used nor changed.

Also at the top-level is a 4-bit mode field. This is the tag of a
discriminated (tagged, variant)
union that selects the interpretation of the remaining 24 bits.

Of the 16 possible modes, only 5 are defined :

Code Meaning Description
0 Old “Old” is presumed to refer to some historical
design that is no longer of interest.
It is treated here as Reserved.
1 RBP-based The frame uses the RBP register as a frame pointer.
The size of the frame can
frame vary during execution.
2 RSP-based The frame uses RSP as the frame pointer. The size
of the frame is fixed (at
frame compilation time).
3 Large RSP- The frame uses RSP as the frame pointer, The size
of the frame is fixed (at
based frame compilation time); however, that size is too large
to express within this 32-bit
descriptor encoding.
4 DWARF The frame, for whatever reason, cannot be
adequately described using the
escape compact unwind frame description. The remaining
24-bits are an index into
what the DWARF standard calls the .debug_frame
section (__eh_frame in
LLVM).

2.2.1 RBP-based Frame (MODE=1)
For a RBP-based frame, the remaining 24 bits are encoded as follows:

23 16 | 15 | 14 0 |
-------------------------------------------------------------------|
OFFSET | 0 | REGS |
-------------------------------------------------------------------|

In a RBP-based frame the RBP register is pushed on the stack immediately
after the return
address, then RSP is moved to RBP. To unwind, RSP is restored with the
current RPB value,
then RBP is restored by popping off the stack, and the return is done by
popping the stack once
more into the instruction pointer.

All preserved registers are saved in a small range in the stack that
starts at RBP-8 to RBP-2040.
The offset/8 relative to RBP is encoded in the 8-bit OFFSET field. The
registers saved are
encoded in the 15-bit REGS field as five 3-bit entries.

2.2.2 RSP-Based Frame (MODE=2)
For a RSP-based frame, the remaining 24 bits are encoded as follows:

23 16 | 15 13 | 12 10 | 9 0 |
-------------------------------------------------------------------|
SIZE | | CNT | REG_PERM |
-------------------------------------------------------------------|

In a RSP-based frame the stack pointer serves directly as the frame
pointer and RBP is available
for use as a general register. Upon entry, the stack pointer is
decremented by 8*SIZE bytes (the
maximum stack allocation is thus 2040 bytes). To unwind, the stack size
is added to the stack
pointer, and completed by popping the stack once more into the
instruction pointer.

All preserved registers are saved on the stack immediately after the
return address. The number
of registers saved (up to 6) is encoded in the 3-bit CNT field. The
11-bit REG_PERM field
encodes which registers were saved and in what order.

2.2.3 Large RSP-Based Frame (MODE=3)
For a large RSP-based frame, the remaining 24 bits are encoded as follows:

23 16 | 15 13 | 12 10 | 9 0 |
-------------------------------------------------------------------|
SIZE | ADJ | CNT | REG_PERM |
-------------------------------------------------------------------|

This case is like the previous, except the stack size is too large to
encode in the compact unwind
encoding. Instead, the function must include a "subq $nnnnnnnn, RSP"
instruction in its
prologue to allocate the stack. The offset from the entry point of the
function to the nnnnnnnn
value in the function is given in the SIZE field.

Depending on the exact instructions used to save registers (PUSH versus
MOV), the nnnnnnnn
value in the instruction stream may not be quite the full stack size.
ADJ * 8 is the additional
adjustment needed to get the actual size.

2.2.4 DWARF Escape (MODE=4)
The frame, for whatever reason, cannot be adequately described using a
compact unwind frame
description. The remaining 24-bits are an index into what the DWARF
standard calls the
.debug_frame section (called __eh_frame in LLVM).

3 Asynchronous Changes and Enhancements
It is immediately obvious that omission of unwind information for leaf
functions (with any kind
of frame) precludes handling an exception that might occur during its
execution. It follows that
unwind information must cover all of the code of a module (with one
exception discussed
below). But if successive unwind groups are ordered (as previously
assumed) and also leave no
gaps, then the LENGTH field is redundant and can be omitted. The
beginning address of a
following group is always one byte past the end of the predecessor
group. There remains only the
question of how to identify the last group of a set.

It should also be clear that the unwind representation described in the
prior section is not
sufficient to unwind from an asynchronous exception that might occur in
either the prologue or
epilogue of a function. To see this consider what would happen if an
exception occurred at either
the entry point or the return instruction of either a RBP- or RSP-frame
function. To be able to
handle asynchronous exceptions at any point during function execution,
it is necessary to add
additional information to each unwind group.

These two considerations can be combined. The result is simply to
repurpose the LENGTH field
to encode prologue and epilogue information.

3.1 Extended MODEs
To preserve backward compatibility and to allow intermixing of
traditional and extended
compact unwind groups, new MODEs are defined as follows:

Code Meaning Description

Hi John & Ron, I read through the proposal and had a couple of quick observations.

1. The proposed encoding assumes that the epilogue instructions always come at the end of the function -- or rather, just before the next function. If there is a stack protector __stack_chk_fail sequence, or there is NOP padding between functions, then the epilogue cannot be expressed. The proposed encoding allows for instructions scheduled before the prologue, as long as they're guaranteed to never throw an exception or spill registers, but there's no similar affordance for after the epilogue.

2. In section 3.1, "A null frame (MODE = 8) is the simplest possible frame, with no allocated stack of either kind (hence no saved registers). A null frame might be considered just a special case of a RSP-based frame with a stack size of zero. But unlike other frames, its frame pointer is usually not 16-byte aligned." Did you meant stack pointer here? The x86_64 SysV abi requires 16-byte alignment of the stack pointer before a CALL instruction, but I don't know of any alignment requirements on the frame pointer.

3. In the same section, you propose an alternative "default null frame group" be defined,

"If the first attempt to lookup an unwind group for an exception address fails, then it is (tentatively) assumed to have occurred within a null frame function or in a part of a function that is adequately described by a null frame. The presumed return address is (virtually or actually) popped from the top of stack and looked up. This second attempted lookup must succeed, in which case processing continues normally. A failure is a fatal error."

I suspect I missed a part of the proposal covering this, but if entries are described by only start address, is there a problem of the previous non-null function's range extending & picking up these functions if they don't have entries?

4. For frameless functions -- small-stack RSP functions -- you're assuming that the compiler saves and restores spilled registers with MOV instructions instead of PUSH/POP instructions. I think this is the normal behavior of the compiler, something like

  subq $0x98, %rsp
  movq %rsi, 0x10(%rsp)
  movq %rdi, 0x18(%rsp)

or whatever, but the proposed encoding can only allow for one change in stack size. The current encoding requires this for large-stack RSP functions (we get the offset to the RSP instruction where we read the amount being decremented in the compact unwind encoding), but for small stack frameless functions the compiler can today express in compact unwind a combination of sub and push instructions as long as it's a fixed amount within the range that the small-stack RSP encoding can express.

John and Ron,

I developed the original compact unwind implementation for macOS 10.6 back in 2009. I tried to leave space in the design to support finer grain exception handling such as for asynchronous or for the shrink wrap optimization. The idea I had at the time was instead of having just one 32-bit compact unwind info per function, there could be an array of them each covering a different range. All but the first would have the UNWIND_IS_NOT_FUNCTION_START bit set.

On macOS, the linker takes the 32-byte “compact unwind group description” records and folds them down into a complete different (read only) runtime table which basically maps an offset in the DSO into a 32-bit compact info for that address.

It seems like from your proposal that your linker keeps the “compact unwind group description” records as is and just concatenates them, and the runtime would need to understand the prolog/epilog encoding. I would think that just widening the group record (as you suggest in 5.1) would be much simpler (like SHT_REL was widened to SHT_RELA).

-Nick

>Hi John & Ron, I read through the proposal and had a couple of quick
observations.

    Hi Jason. Thanks for the interesting observations and questions.

>1. The proposed encoding assumes that the epilogue instructions always
come at the end of the  function -- or rather, just before the next
function.  If there is a stack protector __stack_chk_fail sequence, or
there is NOP padding between functions, then the epilogue cannot be
expressed.  The proposed encoding allows for instructions scheduled
before the prologue, as long as they're guaranteed to never throw an
exception or spill registers, but there's no similar affordance for
after the epilogue.

    There are at least two alternatives for this situation:
    a.  The main group can be ended at the RET instruction and
        a new group started to cover the following bytes.
        Interestingly, both NOPs and a __stack_chk_fail call
        sequence are not really unwindable instructions.
        Perhaps a new NOUNWIND mode should be added for such
        cases; this would protect against an asynchronous
        exception attempting what is likely a dangerous stack
        walk.
    b.  Starting a new group is admittedly a rather large hammer
        that is perhaps OK if rare but not if it is common. If
        common, an alternative might be to widen the E (epilogue)
        field to allow more idiomatic alternatives. 4 bits might
        be enough: 8 codes for a RET followed by 0 to 7 NOPs +
        1 code for RET + __stack_chk_fail call + 1 code for
        just a __stack_chk_fail call + 1 code for no RET and
        nothing following. For anything else, fall back to a.
    Personally I rather like the combination of a. + b.

>2. In section 3.1, "A null frame (MODE = 8) is the simplest possible
frame, with no allocated stack of either kind (hence no saved
registers). A null frame might be considered just a special case of a
RSP-based frame with a stack size of zero. But unlike other frames,
its frame pointer is usually not 16-byte aligned." Did you meant stack
pointer here?  The x86_64 SysV abi requires 16-byte alignment of the
stack pointer before a CALL instruction, but I don't know of any
alignment requirements on the frame pointer.

    Yes, it would be clearer to say the stack pointer is not
    16-byte aligned.

    Aside: in my vocabulary, the frame pointer for a RSP-based
    or null frame function is the stack pointer. (And I find it
    an abuse of the English language to refer to a RSP-based
    frame as "frameless".) So what I wrote strictly is correct,
    but it does tend to confuse. Sorry...

3. In the same section, you propose an alternative "default null frame
group" be defined,

>"If the first attempt to lookup an unwind group for an exception
address fails, then it is (tentatively) assumed to have occurred
within a null frame function or in a part of a function that is
adequately described by a null frame. The presumed return address is
(virtually or actually) popped from the top of stack and looked up.
This second attempted lookup must succeed, in which case processing
continues normally. A failure is a fatal error."

I suspect I missed a part of the proposal covering this, but if
entries are described by only start address, is there a problem of the
previous non-null function's range extending & picking up these
functions if they don't have entries?

    For normal entries, I admit I sorta hand waved in Section 3.4
    over how to determine the end address of the last group. There
    are lots of alternatives. Most involve some kind of dummy
    extended parameter group (MODE =12) where the STARTING-ADDRESS
    is one more than the last address of the last "real" group.
    There might also be linker assistance of some kind. In any
    case I think it a problem that need not be solved upfront but
    can wait for broader issues to settle.

    The default unwind group has no starting address, of course.
    It is defined by what is left over after all other normal
    groups have been considered.

>4. For frameless functions -- small-stack RSP functions -- you're
assuming that the compiler saves and restores spilled registers with
MOV instructions instead of PUSH/POP instructions.  I think this is
the normal behavior of the compiler, something like

  subq    $0x98, %rsp
  movq    %rsi, 0x10(%rsp)
  movq    %rdi, 0x18(%rsp)

or whatever, but the proposed encoding can only allow for one change
in stack size.  The current encoding requires this for large-stack RSP
functions (we get the offset to the RSP instruction where we read the
amount being decremented in the compact unwind encoding), but for
small stack frameless functions the compiler can today express in
compact unwind a combination of sub and push instructions as long as
it's a fixed amount within the range that the small-stack RSP encoding
can express.

    You are quite right, with the understanding that "the range
    that the small-stack RSP encoding can express" is in fact the
    body of the function and excludes the prologue and epilogue.
    Our proposal, of course, is trying to describe the prologue and
    epilogue as well.

    My understanding of modern x86 processors is that there is
    no meaningful performance difference between using pushes/pops
    versus moves. That is, using only moves in prologues and
    epilogues is a negligible price to pay in our quest for
    a simple and compact unwind representation.
    But I do take the point that this should be mentioned in
    Section 3.5 regarding code generation implications.

Thanks again,
Ron


Hi Nick,

It is a pleasure to be in contact with the creator of the compact unwind approach!

I can see how an array of 32-bit unwind blocks could be used to describe each distinct point within a function (within a prolog in particular). But then you end up with six or seven or more such blocks for a large percentage of functions, don’t you? Seems like a lot of additional space for something that is usually simple, compact (sic) and idiomatic and dilutes the original benefit. We are seeking a summary description in a small fixed size to supplement the base unwind block.

Since you mention it, why have the UNWIND_IS_NOT_FUNCTION_START flag? Seems like you don’t need it for unwinding purposes. Especially in a function that has a code layout where the entry point is not the first/lowest addressed instruction of the function (which I have seen in some GEM produced code for Alpha). An aside, but just curious.

Not being a MAC guy the macOS two-level lookup scheme has always been rather a mystery. What little I know is gleaned from the compact_unwind_encoding.h file found around the net. But I have never seen the .c/.cxx/.cpp that goes with it to better understand how the mapping works. Always figured the was considered proprietary but maybe I just don’t know where/how to look. Is there public code and/or design note information to look at?

You are correct that this paper tends toward trying to create a scheme with little or no linker involvement. There are good reasons to ask more of the linker for better space economy, so this is still under discussion.

You are also correct that the unwinder needs to interpret the additional prolog/epilog information. But the rest of your comment has me confused: the extra 2-bytes is just a widening as you suggest, right? And there is no Section 5.1–did you mean 3.1?

Thanks again,

Ron

For what it's worth, there are two implementations of consuming linked compact unwind in lldb http://lldb.llvm.org . In tools/compact-unwind there is a standalone reader that I wrote in 2014 to make sure I understood the format from the compact_unwind_encoding.h header when I first read it. I should really go back and make sure that it is correct all the way through or remove it, I have a vague memory that it may not interpret something correctly in how it decodes the unwind state information.

I can more strongly recommend the reader in source/Symbol/CompactUnwindInfo.cpp -- lldb has been using this information for unwinds for a few years now. This is written as a part of lldb so it won't be quite as simple to understand, but it's all there. lldb has been using this for for a few years now. I don't read/use the LSDA or personality information yet.

Hi Nick,

It is a pleasure to be in contact with the creator of the compact unwind approach!

I can see how an array of 32-bit unwind blocks could be used to describe each distinct point within a function (within a prolog in particular). But then you end up with six or seven or more such blocks for a large percentage of functions, don’t you?

You would normally just need one region for the prolog and one for the epilog. In the compact unwind info you can have mode that describes which kind of prolog/epilog is used for all the common cases. Only if instructions are scheduled into the middle of a prolog would not need more ranges.

Seems like a lot of additional space for something that is usually simple, compact (sic) and idiomatic and dilutes the original benefit. We are seeking a summary description in a small fixed size to supplement the base unwind block.

Since you mention it, why have the UNWIND_IS_NOT_FUNCTION_START flag? Seems like you don’t need it for unwinding purposes.

The LSDA info for C++ encodes the landing pads (e.g. catch clause inside a function) as offsets from the start of the function. If each compact info range was independent, the runtime could not the function start to pass to the personality function, which the personality function adds to the offset and computes the landing pad. By marking the non-start compact infos this way, the runtime can walk backwards until to it finds the first range without that bit and know it is the actual start of the function.

Especially in a function that has a code layout where the entry point is not the first/lowest addressed instruction of the function (which I have seen in some GEM produced code for Alpha). An aside, but just curious.

Not being a MAC guy the macOS two-level lookup scheme has always been rather a mystery. What little I know is gleaned from the compact_unwind_encoding.h file found around the net. But I have never seen the .c/.cxx/.cpp that goes with it to better understand how the mapping works.

You can see the runtime unwinder that parses the compact unwind section in UnwindCursor<>::getInfoFromCompactEncodingSection() in
https://llvm.org/svn/llvm-project/libunwind/trunk/src/UnwindCursor.hpp

The goal is to make the section read-only (no load time fixups), small, and fast

Always figured the was considered proprietary but maybe I just don’t know where/how to look. Is there public code and/or design note information to look at?

You are correct that this paper tends toward trying to create a scheme with little or no linker involvement. There are good reasons to ask more of the linker for better space economy, so this is still under discussion.

If there is no linker optimization, does that mean at runtime the array of group entries in unsorted and you have to do a linear search?

You are also correct that the unwinder needs to interpret the additional prolog/epilog information. But the rest of your comment has me confused: the extra 2-bytes is just a widening as you suggest, right? And there is no Section 5.1–did you mean 3.1?

In the PDF I see section 5 “Other OpenVMS Refinements”. I read that as maybe changing the layout of the group record (widening).

-Nick