[RFC] Improving x86-64 compact unwind descriptors

I have now written a somewhat working script that converts x86-64 DWARF CFI into compact unwind descriptors. It is not 100% accurate (DWARF CFI permits overapproximating saved registers), but it should be good enough for evaluation purposes. I built LLVM with GCC 15 and Clang 21 (-O3), each with and without frame pointers. There are no LSDAs here, as we compile with -fno-exceptions.

Problems

Before getting into details, I’m putting the problems I noticed up front so that there’s a higher chance they get read. :)

  • -fstack-clash-protection: There is more than one sub rsp. Linux distributions enable this hardening feature and stack frames >4 kiB are not uncommon, so I think we need to support this.
  • DW_CFA_GNU_args_size: this x86-only opcode prevents leakage of stack memory from stack-passed arguments when an exception is caught. I see no way to encode this; ARM doesn’t have this. I don’t think we need to support this; if it is important, it can be added to the LSDA.
  • GCC really likes mixing other instructions into prologues; this would need to stop.

Data

  • Number of functions that, right now, can use compact unwind descriptors: Clang-FP 99.99%, Clang-NoFP 97.7%, GCC-FP 78.3%, GCC-NoFP 64.7%. Most common reasons for failures:
    • push-<sth else>-push sequences (GCC)
    • push rbp-<sth else>-mov rbp,rsp sequences (GCC)
    • <prologue>-<code>-push-push-call sequences (Clang, GCC)
    • These sequences are easily fixable.
  • Distribution of number of descriptors per function (total, 1, 2, 3, more):
    • libLLVM-clang-fp.so 82987 53714 26600 2166 507
    • libLLVM-clang-nofp.so 81041 51707 25835 2572 927
    • libLLVM-gcc-fp.so 66352 35511 25379 3974 1488
    • libLLVM-gcc-nofp.so 75530 57449 14976 2145 960
    • Essentially, one descriptor per duplicated tail. There are outliers: 222 functions (across all builds) would need 10 or more; the extreme case is llvm::X86GenSubtargetInfo::resolveSchedClass in GCC-FP with 361 tails.
    • Adjacent null descriptors could be combined, occasionally reducing the number of required descriptors to 0.
  • Number of different unwind descriptors (everything, just mode incl. fp prologue-size):
    • libLLVM-clang-fp.so 1773 50
    • libLLVM-clang-nofp.so 3671 653
    • libLLVM-gcc-fp.so 2806 222
    • libLLVM-gcc-nofp.so 4921 752
    • Primary reasons are different sizes from shrink wrapping, different prologue sizes in FP, and different stack frame sizes.
  • Callee-saved register ordering:
    • Overwhelmingly rbp,r15,r14,r13,r12,rbx (with possible omissions); GCC sometimes also uses r15,r14,r13,r12,rbp,rbx (with omissions). I didn’t compile with LTO/PGO/IPRA.

Analysis

Table format: Given that the number of different unwind descriptors including frame sizes is manageable, we could consider an Apple-style table with 23 bits of address and 9 bits of descriptor ID. This would cover 8 MiB of code per second-level page; if we add alignment assumptions (e.g., 16 bytes per function), 128 MiB (but this won’t be flexible enough for multiple tails inside a function). For libLLVM-clang-nofp.so, we could end up with roughly 117099*4B (addrtable) + 3671*8B (descriptors) = ~490 kiB. (Currently: 648 kiB eh_frame_hdr + 4177 kiB eh_frame; this would be a size reduction of roughly 10x at no significant loss of functionality.)
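To make the arithmetic concrete, here is a minimal sketch of such a 32-bit table entry and the size estimate. The bit placement (offset in the low 23 bits, descriptor ID in the high 9 bits) and all names are my assumptions for illustration; the proposal only fixes the 23/9 split.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (not fixed by the proposal): the low 23 bits hold the
 * byte offset from the second-level page base, the high 9 bits hold the
 * descriptor ID. */
enum { ADDR_BITS = 23, ID_BITS = 9 };

/* Packs an address offset and a descriptor ID into one 32-bit entry. */
static inline uint32_t pack_entry(uint32_t offset, uint32_t desc_id) {
    assert(offset < (1u << ADDR_BITS));
    assert(desc_id < (1u << ID_BITS));
    return (desc_id << ADDR_BITS) | offset;
}

static inline uint32_t entry_offset(uint32_t entry) {
    return entry & ((1u << ADDR_BITS) - 1);
}

static inline uint32_t entry_desc_id(uint32_t entry) {
    return entry >> ADDR_BITS;
}

/* Estimated table size in bytes: 4 B per address entry plus 8 B per
 * distinct 64-bit descriptor. */
static inline size_t table_size_bytes(size_t num_entries,
                                      size_t num_descriptors) {
    return num_entries * sizeof(uint32_t) + num_descriptors * sizeof(uint64_t);
}
```

With the numbers above, table_size_bytes(117099, 3671) comes out at 497764 bytes, i.e. about 486 kiB.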

Descriptors: While implementing, I found some things to not be required. The prologue size is gone in favor of strictly specified sequences (which are needed for async tracing). There’s no large RSP mode in favor of more tightly specified CSR sets and orders.

Unfortunately, I don’t think we can reasonably compress a descriptor into 32 bits, but 64 bits should give enough flexibility for future extensions. If we wanted to, however, we could still reuse the Apple format with a little hack: a new kind of second-level page (with more bits for compact descriptors), where the number of global opcodes is multiplied by two for format compatibility. I would hope that permits some code reuse in libunwind. (We could still put this into eh_frame_hdr with a version 2 prefix.)

struct CompactUnwindDescriptor {
    uint64_t reserved : 11; // Padding

    /// Offset of prologue start into function; -1 implies no prologue.
    /// The linker can fold NULL descriptors into non-(-1) prologue_start.
    ///
    /// 7 bits, >128 bytes is rare and can use a NULL descriptor.
    uint64_t prologue_start : 7;
    /// Number of bytes after the end of the register-restore sequence
    /// before the beginning of the next descriptor. These instructions
    /// are frame-less. If zero, there is no epilogue (otherwise, at
    /// least one instruction like ret or a tail call jmp would follow.)
    /// The linker can fold NULL descriptors into non-zero epilogue_end.
    /// The linker must adjust the value to account for any padding to
    /// the next descriptor. (The linker can get the exact function end
    /// from the address_range field of the FDE emitted by the compiler.)
    ///
    /// Example:
    ///   add rsp, 24
    ///   pop r12
    ///   pop rbx
    ///   jmp otherfn # 5 byte tail call
    ///   nopw        # 2 byte nop
    ///   <next function with new descriptor>
    /// => epilogue_end = 7.
    /// Note that there's nothing special about ret, no information is
    /// required about function returns.
    ///
    /// 6 bits, >64 bytes is rare and can use a NULL descriptor.
    uint64_t epilogue_end : 6;
    /// Index into personality function table. A non-zero value implies
    /// presence of an entry in the LSDA table. The linker must adjust
    /// the value; the compiler sets it to zero.
    uint64_t personality_fn : 4;
    /// Descriptor mode. Values (values other than 0 are arch-specific):
    /// - 0 = NULL.
    /// - 1 = DWARF escape (remaining 32 bits are FDE offset)
    /// - 2 = RBP-based frame (x86-64).
    /// - 3 = RSP-based frame (x86-64).
    uint64_t mode : 4;

    /// Remaining fields are architecture-specific.

    ///-------- Only for RBP-based frame.
    /// The prologue must begin with push rbp; mov rbp, rsp. Afterwards,
    /// other CSRs must be pushed in the specified order, but pushes can
    /// be interleaved with other instructions that don't modify CSRs.
    /// The epilogue must not clobber saved CSRs; the last instruction must
    /// be pop rbp.

    /// Size of the prologue, marks the point where all CSRs are saved.
    uint64_t prologue_size : 8;
    /// Saved registers. Details to be discussed. A simple format would be
    /// one bit per register in the sequence rbp,r15,r14,r13,r12,rbx,
    /// indicating whether it is saved (that'd require just 6 bits).
    uint64_t saved_regs : 24;

    ///-------- Only for RSP-based frame.
    /// Very rigid frame layout and prologue/epilogue sequences. Every
    /// single rsp adjustment (incl. push/pop) must be identifiable.
    /// The prologue must look as follows:
    ///   push <reg1>
    ///   ...
    ///   push <regN>
    ///   push <anyreg> (only for 8B rsp adjustment)
    ///   sub rsp, <bimm> (only for <128B rsp adjustment)
    ///   sub rsp, <dwimm> (only for >=128B rsp adjustment)
    ///
    /// The epilogue must look as follows:
    ///   <some instr to adjust rsp to the last saved CSR> (optional)
    ///   pop <regN>
    ///   ...
    ///   pop <reg1>

    /// Size of the stack frame in units of 16 bytes (ABI requires 16B alignment).
    /// There is no need for a large frame mode, this currently covers
    /// 16 MiB, which should be enough. (Compilers can use the RBP mode
    /// or DWARF if they require larger stack frames.)
    uint64_t frame_size : 20;
    /// TBD: Specification of alternative push/pop sequence for APX.
    uint64_t is_apx : 1;
    /// TBD: Specification of instruction sequence for -fstack-clash-protection
    uint64_t with_scp : 1;
    /// TBD: I'm sure I forgot things we might need to handle.
    uint64_t reserved : 2;
    /// One bit per register in the sequence rbp,r15,r14,r13,r12,rbx, indicating
    /// whether it is saved (that'd require just 6 bits). I.e., the first saved reg
    /// of this list is at [CFA-16], the second at [CFA-24], etc.
    /// Maybe we need more options for IPRA, I'm unsure, so I added two
    /// unused bits for now.
    uint64_t saved_regs : 8;
};
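As a rough illustration of how an unwinder could interpret the RSP-mode fields, the sketch below derives the save slot of each CSR from the saved_regs bitmask and converts frame_size into bytes. The bit-to-register mapping (bit 0 = rbp up to bit 5 = rbx) and the function names are assumptions on my side; the struct only fixes the register order and that the first saved register lives at [CFA-16].

```c
#include <assert.h>
#include <stdint.h>

/* Register order from the descriptor comment: rbp,r15,r14,r13,r12,rbx.
 * Assumption: bit i of saved_regs corresponds to the i-th register here. */
enum { CSR_RBP, CSR_R15, CSR_R14, CSR_R13, CSR_R12, CSR_RBX, NUM_CSRS };

/* Fills cfa_offset[i] with the (negative) offset of register i relative
 * to the CFA, or 0 if the register is not saved. Returns the number of
 * saved registers. */
static inline int csr_slots(uint8_t saved_regs, int cfa_offset[NUM_CSRS]) {
    int next = -16; /* [CFA-8] holds the return address on x86-64 */
    int count = 0;
    for (int i = 0; i < NUM_CSRS; i++) {
        if (saved_regs & (1u << i)) {
            cfa_offset[i] = next;
            next -= 8;
            count++;
        } else {
            cfa_offset[i] = 0;
        }
    }
    return count;
}

/* frame_size is a 20-bit field counting 16-byte units. */
static inline uint64_t frame_size_bytes(uint32_t frame_size_field) {
    assert(frame_size_field < (1u << 20));
    return (uint64_t)frame_size_field * 16;
}
```

For a function that saves only r12 and rbx, this puts r12 at [CFA-16] and rbx at [CFA-24]; the largest encodable frame is 16 MiB minus 16 bytes.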

Integration

The compiler would emit the compact descriptors into the existing eh_frame section. A supporting linker would encode a v2 eh_frame_hdr with the compact unwind table, dropping the unneeded FDEs/CIEs. An unwinder would need to support DWARF and compact unwinding (either via eh_frame_hdr or when compact descriptors are encoded in eh_frame FDEs).
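For the unwinder side, the lookup in such a table could be a plain binary search for the last entry starting at or before the PC. This is only a sketch under the same assumed entry layout as in the Analysis section (23-bit offset in the low bits, 9-bit descriptor ID in the high bits); make_entry and find_entry are hypothetical names, not part of any existing libunwind API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed entry layout: low 23 bits = start offset within the page,
 * high 9 bits = descriptor ID. */
#define OFFSET_MASK ((1u << 23) - 1)

static inline uint32_t make_entry(uint32_t offset, uint32_t desc_id) {
    assert(offset <= OFFSET_MASK && desc_id < (1u << 9));
    return (desc_id << 23) | offset;
}

/* Returns the index of the last entry whose start offset is <= pc_offset,
 * or -1 if pc_offset lies before the first entry. The entries must be
 * sorted by ascending start offset. */
static inline ptrdiff_t find_entry(const uint32_t *entries, size_t n,
                                   uint32_t pc_offset) {
    size_t lo = 0, hi = n;
    /* Invariant: entries[0..lo) start at or before pc_offset,
     * entries[hi..n) start after it. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if ((entries[mid] & OFFSET_MASK) <= pc_offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (ptrdiff_t)lo - 1;
}
```

The descriptor ID of the entry found would then index the descriptor list; a function with several tails simply contributes several consecutive entries.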

Using a new augmentation character is probably more backwards compatible and it would permit emitting both compact and DWARF info in the compiler (not sure if there’s a use case for that?). I have no strong opinion here. For personality/LSDA, I’d stay roughly with the current way?

Agreed. There is more low-hanging fruit there (relocations, section headers).

On x86-64, there’s no difference, but there might be on other architectures. This would need investigation.
