[RFC] Improving compact x86-64 compact unwind descriptors

Tatsuyuki Ishi is also interested at the topic.

(“More compact SFrames through deduplication”)


I’ve edited https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471/8 to include
my notes about the original compact unwinding and the extension. I hope this makes it easier for others to follow.

(I also keep updating “Compact unwind information” in my “Stack unwinding” block post as personal notes.)

I think reusing .eh_frame for the linking view format (compiler output, linker input) is a great choice.
We can either:

  • Add a new augmentation character for the new compact unwind descriptor; or
  • Bump the version and encode the descriptor in line.

Are you suggesting we reserve a dedicated personality field in CIE and an associated LSDA field in FDE?

In the Mach-O implementation, the __LD,__compact_unwind section (compiler output) complements __eh_frame.
On arm64-apple-* platforms, __eh_frame can be omitted from compiler output (MOFI->getSupportsCompactUnwindWithoutEHFrame()).
For our purposes, we don’t need to aggressively optimize for ELF relocatable file sizes.

  • Either the compiler or linker creates a null frame to provide the end address; or
  • The linker decreases the start of the next unwind descriptor to cover the gap. However, this would require adjusting call site and landing pad offsets within .gcc_except_table.

Therefore, I think reserving a field to encode the padding after the epilogue is worthwhile. This field serves a dual purpose: it encodes both the padding size and whether an epilogue (mirroring prologue instructions plus RET) is present.

Agreed. Let’s remove the lsda field. Then a function needing multiple compact unwind descriptors also need to split LSDA pieces, similar to the basic block functions feature.

Then are we aiming at a 12-byte format?!

// An optional page table (two-level lookup table) can lift the 32-bit address limitation.
uint32_t start_address;

uint32_t prologue_offset : 8;
uint32_t prologue_len : 8;
uint32_t padding_after_epilogue : 8; // number of bytes to the next start_address
uint32_t personality_index : 4;
uint32_t mode : 4;

// Extended from 24 bits to 32 bits
uint32_t mode_specific_encoding;

I’d also like to know more about this. This might be related to what restrictions we need on prologue/epilogue instruction sequences.

There might be fewer restrictions if we only support synchronous unwinding for C++ exceptions and asynchronous walking for profiling/backtrace(). Unwinding through a signal handler is probably not very demanding.

TODO notes about pthread_cancel

Thanks! You likely get the information from llvm-dwarfdump --eh-frame, but in case the following tips are useful:

Here is a lld hack to dump the FDE count. You can also parse

--- i/lld/ELF/SyntheticSections.cpp
+++ w/lld/ELF/SyntheticSections.cpp
@@ -524,2 +524,3 @@ void EhFrameSection::finalizeContents() {

+  size_t fdec = 0;
   size_t off = 0;
@@ -533,3 +534,5 @@ void EhFrameSection::finalizeContents() {
     }
+    fdec += rec->fdes.size();
   }
+  Msg(ctx) << "+++ fde count: " << fdec << " size: " << off;

Collect a reproduce rm bin/clang-21; LLD_REPRODUCE=/tmp/rep.tar ninja clang, facilitating writing a custom program for analysis.

clang -c -mllvm -debug-only=mc-dump -Wa,-L can be used to understand alignments and CFI advance_loc in .eh_frame

2 Likes

Fangrui, thanks for linking my post. A TL;DR is that over 99% of the CFI rows are duplicates, which lead to the hypothesis that chunks of CFI sequences (corresponding to prolog and epilog) can be deduplicated at the same rate. My proposal works similar to the opcode palette in Mach-O, but the palette is global and each opcode describes a fraction of the function (instead of the entire function), corresponding the prolog, body or epilog. And I’m proposing to use a much less restrictive opcode encoding (basically what SFrame has, and in the future we can put DWARF CFI there for callee-saved regs support too).

However, while this works well with async unwind tables I noticed that LSDA is much less compressible. Alexis’ suggestion of keeping LSDA separate from unwind table seems to make a lot of sense.

Finally with my proposal the size of sorted address table (32b / addr) becomes significant enough. The theoretical bound (c.f. Elias-Fano) is much lower at ~12b / addr. Something like Mach-O’s two level address table would help here.

2 Likes

I now wrote a somewhat working script that converts x86-64 DWARF CFI into compact unwind descriptors. It is not 100% accurate (DWARF CFI permits overappoximating saved registers), but it should be good enough for evaluation purposes. I built LLVM with GCC 15/Clang 21 (-O3) with and without frame pointer. There’re no LSDAs here, as we compile with -fno-exceptions.

Problems

Before getting into details, I’m putting the problems I noticed up front so that there’s a higher chance they get read. :slight_smile:

  • -fstack-clash-protection: There is more than one sub rsp. Linux distributions enable this hardening feature and stack frames >4kiB are not uncommon, so I think we need to support this.
  • DW_CFA_GNU_args_size: this x86-only opcode prevent leakage of stack memory from stack-passed arguments when an exception is caught. I see no way to encode this; ARM doesn’t have this. I don’t think we need to support this; if this is important, it can be added to the LSDA.
  • GCC really likes mixing instructions into prologues, this would need to stop.

Data

  • Number of functions that, right now, can use compact unwind descriptors: Clang-FP 99.99%, Clang-NoFP 97.7%, GCC-FP 78.3%, GCC-NoFP: 64.7%. Most common reasons for failures:
    • push-<sth else>-push sequences (GCC)
    • push rbp-<sth else>-mov rbp,rsp sequences (GCC)
    • <prologue>-<code>-push-push-call sequences (Clang, GCC)
    • These sequences are easily fixable.
  • Distribution of number of descriptors per function (total, 1, 2, 3, more):
    • libLLVM-clang-fp.so 82987 53714 26600 2166 507
    • libLLVM-clang-nofp.so 81041 51707 25835 2572 927
    • libLLVM-gcc-fp.so 66352 35511 25379 3974 1488
    • libLLVM-gcc-nofp.so 75530 57449 14976 2145 960
    • Essentially, one descriptor per duplicated tail. There are outliers: 222 functions (across all builds) would need 10 or more, the extreme case is llvm::X86GenSubtargetInfo::resolveSchedClass in GCC-FP with 361 tails.
    • Adjacent null descriptors could be combined, occasionally reducing the number of required descriptors to 0.
  • Number of different unwind descriptors (everything, just mode incl. fp prologue-size):
    • libLLVM-clang-fp.so 1773 50
    • libLLVM-clang-nofp.so 3671 653
    • libLLVM-gcc-fp.so 2806 222
    • libLLVM-gcc-nofp.so 4921 752
    • Primary reasons are different sizes from shrink wrapping, different prologue sizes in FP, and different stack frame sizes.
  • Callee-saved register ordering:
    • Overwhelmingly rbp,r15,r14,r13,r12,rbx (with possible omissions); GCC sometimes also uses r15,r14,r13,r12,rbp,rbx (with omissions). I didn’t compile with LTO/PGO/IPRA.

Analysis

Table format: Given that the number of different unwind descriptors including frame sizes is manageable, we could consider an Apple-style table with 23 bits address and 9 bits of descriptor ID. This would cover 8 MiB code per second-level page; if we add alignment assumptions (e.g., 16 byte per function), 128 MiB (won’t be flexible enough for multiple tails inside a function). For libLLVM-clang-nofp.so, we could end up with roughly 117099*4B (addrtable)+3671*8B (descriptors) = ~490 kiB. (Currently: 648 kiB eh_frame_hdr + 4177 kiB eh_frame – this would be a size reduction in the region of 8–9x at no significant loss of functionality.)

Descriptors: While implementing, I found some things to not be required. The prologue size is gone in favor of strictly specified sequences (which are needed for async tracing). There’s no large RSP mode in favor of more tightly specified CSR sets and orders.

Unfortunately, I don’t think we can reasonably compress a descriptor into 32 bit, but this should give enough flexibility for future extensions. If we wanted to, however, we can still reuse the Apple format with a little hack: new kind of second level page (with more bits for compact descriptors), the number of global opcodes is multiplied by two for format compatibility. I would hope that permits some code reuse in libunwind. (We could still put this into eh_frame_hdr with a version 2 prefix.)

struct CompactUnwindDescriptor {
    uint64_t reserved : 11; // Padding

    /// Offset of prologue start into function; -1 implies no prologue.
    /// The linker can fold NULL descriptors into non-(-1) prologue_start.
    ///
    /// 7 bits, >128 bytes is rare and can use a NULL descriptor.
    uint64_t prologue_start : 7;
    /// Number of bytes after the end of the register-restore sequence
    /// before the beginning of the next descriptor. These instructions
    /// are frame-less. If zero, there is no epilogue (otherwise, at
    /// least one instruction like ret or a tail call jmp would follow.)
    /// The linker can fold NULL descriptors into non-zero epilogue_end.
    /// The linker must adjust the value to account for any padding to
    /// the next descriptor. (The linker can get the exact function end
    /// from the address_range field of the FDE emitted by the compiler.)
    ///
    /// Example:
    ///   add rsp, 24
    ///   pop r12
    ///   pop rbx
    ///   jmp otherfn # 5 byte tail call
    ///   nopw        # 2 byte nop
    ///   <next function with new descriptor>
    /// => epilogue_end = 7.
    /// Note that there's nothing special about ret, no information is
    /// required about function returns.
    ///
    /// 6 bits, >64 bytes is rare and can use a NULL descriptor.
    uint64_t epilogue_end : 6;
    /// Index into personality function table. Non-zero value implies
    /// presence of an entry in the LSDA table. The linker must adjust
    /// the value; compiler set it to zero.
    uint64_t personality_fn : 4;
    /// Descriptor mode. Values (values other than 0 are arch-specific):
    /// - 0 = NULL.
    /// - 1 = DWARF escape (remaining 32 bits are FDE offset)
    /// - 2 = RBP-based frame (x86-64).
    /// - 3 = RSP-based frame (x86-64).
    uint64_t mode : 4;

    /// Remaining fields are architecture-specific.

    ///-------- Only for RBP-based frame.
    /// The prologue must begin with push rbp; mov rbp, rsp. Afterwards,
    /// other CSRs must be pushed in the specified order, but pushes can
    /// be interleaved with other instructions that don't modify CSRs.
    /// The epilogue must not clobber saved CSRs; the last instruction must
    /// be pop rbp.

    /// Size of the prologue, marks the point where all CSRs are saved.
    uint64_t prologue_size : 8;
    /// Saved registers. Details to be discussed. A simple format would be
    /// one bit in the sequence rbp,r15,r14,r13,r12,rbx, indicating whether
    /// it is saved (that'd require just 6 bits).
    uint64_t saved_regs : 24;

    ///-------- Only for RSP-based frame.
    /// Very rigid frame layout and prologue/epilogue sequences. Every
    /// single rsp adjument (incl. push/pop) must be identifiable.
    /// The prologue must look as follows:
    ///   push <reg1>
    ///   ...
    ///   push <regN>
    ///   push <anyreg> (only for 8B rsp adjustment)
    ///   sub rsp, <bimm> (fonly or <128B rsp adjustment)
    ///   sub rsp, <dwimm> (fonly or >=128B rsp adjustment)
    ///
    /// The epilogue must look as follows
    ///   <some instr to adjust rsp to the last saved CSR> (optional)
    ///   pop <regN>
    ///   ...
    ///   pop <reg1>

    /// Size of the stack frame * 16 (ABI requires 16B alignment).
    /// There is no need for a large frame mode, this currently covers
    /// 16 MiB, which should be enough. (Compilers can use the RBP mode
    /// or DWARF if they require larger stack frames.)
    uint64_t frame_size : 20;
    /// TBD: Specification of alternative push/pop sequence for APX.
    uint64_t is_apx : 1;
    /// TBD: Specification of instruction sequence for -fstack-clash-protection
    uint64_t with_scp : 1;
    /// TBD: I'm sure I forgot things we might need to handle.
    uint64_t reserved : 2;
    /// One bit in the sequence rbp,r15,r14,r13,r12,rbx, indicating whether
    /// it is saved (that'd require just 6 bits). I.e., the first saved reg
    /// of this list is at [CFA-16], the second at [CFA-24], etc.
    /// Maybe we need more options for IPRA, I'm unsure, so I added two
    /// unused bits for now.
    uint64_t saved_regs : 8;
};

Integration

The compiler would emit the compact descriptors into the existing eh_frame section. A supporting linker would encode a v2 eh_frame_hdr with the compact unwind table, dropping the unneeded FDEs/CIEs. An unwinder would need to support DWARF and compact unwinding (either via eh_frame_hdr or when compact descriptors are encoded in eh_frame FDEs).

Using a new augmentation character is probably more backwards compatible and it would permit emitting both compact and DWARF info in the compiler (not sure if there’s a use case for that?). I have no strong opinion here. For personality/LSDA, I’d stay roughly with the current way?

Agreed. There’re more low hanging fruits there (relocations, section headers).

On x86-64, there’s no difference, but there might be on other architectures. This would need investigation.

2 Likes

Not going to happen. This happens because of the after ra scheduler. There is no true schedule barrier for prologues and depending on one is in my mind a bad design to begin with.

IMHO, this would be a tradeoff between hardware efficiency and metadata overhead. On x86 and high-end ARM I would expect the hardware to be heavily optimized for push/pops and that scheduling yields minimal benefit, in which case it matters more to save space on unwind metadata.

A big takeaway is that Clang’s prologue and eplogue instructions are more regular.


GCC’s -fschedule-insns2 optimization may insert unrelated instructions between push rbp and mov rbp, rsp . (https://gcc.gnu.org/PR55667) A 2023 complaint on HN https://news.ycombinator.com/item?id=34803759 (where you asked “perf’s decision to write whole stacks directly to the disk and unwinding them as a post-process is a really bad design.” This is a good question. We should ask irogers :slight_smile: )

Agreed that GCC should do less rescheduling that makes unwind information hard. Standard prologue/epilogue instructions are more likely to receive hardware optimization.

This reminds me, llvm still does not implement the X86_64 abi correctly for argument passing. While what gcc is doing here is valid and is oked by the abi.

While considering the implementation of stack tracing, I concluded that two changes simplify the implementation for the common case: adjust prologue/epilogue points to describe the frequently accessed range with a completely set up stack frame.

struct CompactUnwindDescriptor {
    uint64_t reserved : 11; // Padding

    /// Offset of prologue end into function; 0 implies no prologue.
    /// The linker can fold NULL descriptors into non-zero prologue_start.
    /// The prologue size is defined by the mode. Values shorter than
    /// the actual prologue size are erroneous.
    uint64_t prologue_end : 7;
    /// Number of bytes after a mode-defined epilogue point before the
    /// beginning of the next descriptor. These instructions are epilogue
    /// or frame-less. If zero, there is no epilogue (otherwise, at
    /// least one instruction like ret or a tail call jmp would follow.)
    /// The linker can fold NULL descriptors into non-zero epilogue_start.
    /// The linker must adjust non-zero values to account for any padding to
    /// the next descriptor. (The linker can get the exact function end
    /// from the address_range field of the FDE emitted by the compiler.)
    ///
    /// For RSP-based frames, this points to the first pop instruction.
    /// The optional add rsp,XXX must immediately precede the point.
    /// For RBP-based frames, this points after the pop rbp instruction.
    /// Note: as the epilogue is a single instruction, NULL frames can be
    /// folded into an epilogue-less (epilogue_start=0) RBP descriptor.
    ///
    /// Example for RSP-based:
    ///   add rsp, 24
    ///   # epilogue_start points here.
    ///   pop r12     # 2 bytes
    ///   pop rbx     # 1 byte
    ///   jmp otherfn # 5 byte tail call
    ///   nopw        # 2 byte nop
    ///   <next function with new descriptor>
    /// => epilogue_start = 10.
    /// Note that there's nothing special about ret, no information is
    /// required about function returns.
    uint64_t epilogue_start : 6;
    /// ... (unmodified)
}

A stack tracing routine could look like this. The hot parts should be pretty efficient; note that the fixed location for when rbp is spilled also helps.

CompactUnwindDescriptor findCUD(uintptr_t pc, uintptr_t *start, uintptr_t *end);
struct TraceEntry { uintptr_t pc, rsp; };
void backtrace(uintptr_t pc, uintptr_t rsp, uintptr_t rbp, TraceEntry *trace) {
  *trace++ = {pc, rsp};
  uintptr_t cud_start, cud_end;
  bool restore_rbp;
  CompactUnwindDescriptor cud = findCUD(pc, &cud_start, &cud_end);
  if (cud_start + cud.prologue_end <= pc && pc < cud_end - cud.epilogue_start) {
    // Common case: we're not inside a prologue/epilogue/shrink-wrapped region.
    switch (cud.mode) {
    case 0: rsp += 8; restore_rbp = false; break;
    case 2: rsp = rbp + 16; restore_rbp = true; break;
    case 3: rsp += (cud.data >> 16) & -16u; restore_rbp = cud.data & 0x20; break;
    default: return; // DWARF/unsupported
    }
  } else [[unlikely]] {
    switch (cud.mode) {
    case 2:
      // XXX: this is a bit ugly, as we need to compute the exact prologue.
      // Fortunately, this is cold code.
    case 3: {
      uintptr_t prologue_start = cud_start + cud.prologue_end - (cud.data >> 24);
      if (pc >= cud_end - cud.epilogue_start) // After pop rbp
        rsp += 8, restore_rbp = false;
      else if (pc >= prologue_start + 4) // After mov rbp, rsp (not all CSRs saved)
        rsp = rbp + 16, restore_rbp = true;
      else // Either before or after push rbp, rbp is not yet modified
        rsp += pc > prologue_start ? 16 : 8, restore_rbp = false;
      break;
    }
    default: return; // DWARF/unsupported
    }
  }
  while (true) {
    pc = ((uintptr_t *)rsp)[-1];
    rbp = restore_rbp ? ((uintptr_t *)rsp)[-2] : rbp;
    *trace++ = {pc, rsp};
    cud = findCUD(pc, &cud_start, &cud_end);
    assert(cud_start + cud.prologue_end <= pc && pc < cud_end - cud.epilogue_start &&
           "call must be made from fully-setup stack frame");
    switch (cud.mode) {
    case 2: rsp = rbp + 16; restore_rbp = true; break;
    case 3: rsp += (cud.data >> 16) & -16u; restore_rbp = cud.data & 0x20; break;
    default: return; // DWARF/unsupported
    }
  }
}

Could you share data showing the performance benefits of doing so on somewhat recent (let’s say, <10 years) hardware? For RBP-based frames, we have spare bits and could support this (although I’d personally be reluctant to do that and consider it only if there’s a demonstrated significant benefit from supporting this). For RSP-based frames, there’s no real way to support this, but there’s always the DWARF fallback for such cases if GCC considers them to be particularly important.

1 Like

I don’t think I can give you much more than you already know. The original 32-bit Arm ABI only specified a compact format with support for synchronous C++ style exceptions only. There was come communication about lack of support for _Unwind_ForcedUnwind of which thread cancellation was given as motivation, as well as other non C++ languages (ADA and Java were mentioned back in 2005, but Haskell might be a more modern example). At the time, when embedded systems and code-size were the primary concern, it was thought that this was too niche and poorly specified a use case to put in the ABI. Lack of support for asynchronous exceptions did prove to be limiting on Arm 32-bit linux; one area of regret is missing the chance to use .ehframe when the hard float “eabihf” distro rolled out.

The main difference here from 32-bit Arm, is that the compact form was the only option, whereas here there is the chance to fall back to .ehframe.

I’d like to highlight a comment on the ABI thread from my colleague that works on GCC. Was any form of compact unwind information considered for AArch64? · Issue #344 · ARM-software/abi-aa · GitHub

Any format needs to handle shrinkwrapping and prolog scheduling where callee-saves are moved across the whole function, so that gives you non-call exceptions for free.

If that is the case, then it is likely that the design decision is whether you require strict separation of prologue, body and epilogue. I don’t have the expertise to back the statement up though.

My personal perspective, as someone that needs to support teams that need to build with both GCC and LLVM; I would trade-off smallest possible encoding for ease of adoption across GCC and LLVM. However I understand that others could come to a different conclusion.

I don’t know how practical it would be to have variants of a compact unwinding descriptor. For example the most compact encoding for code-generators that are willing to follow some restrictions, and a less compact encoding with fewer restrictions?

2 Likes

I really like the general direction here, just responding to a few specific bits.

The fragment length is part of the xdata record, and each unwind code corresponds to one instruction, so you can use that to determine the epilog offset in the E = 1 case instead of needing to encode it explicitly. I agree that it’s weirdly hard to get MSVC to generate that; Compiler Explorer is an example (though I’m not sure why only the main function body uses E = 1 and the catch handler funclet still uses E = 0).

I believe you’d still need to record the function start in some way, since that’s presumably how you’d look things up in the LSDA side table, unless that side table does a range search instead of an exact search. We should also make sure our mechanism works if catch or exception cleanup blocks are split out into a separate cold code region (assuming the LSDA is adjusted accordingly).

Making a palette of DWARF descriptors is an interesting idea. @clayborg had proposed something similar internally in the past. For one of the larger libraries in our Android apps (built with only synchronous unwind tables), IIRC deduplicating the rows would reduce their size by 100x, but the row data was only about 1/3rd of the total eh_frame size, hence my interest in reducing metadata overhead.

This is essentially what the Microsoft pdata/xdata format does, and the DWARF fallback being proposed here would serve the same purpose. One of my colleagues implemented an arm64 pass in Clang to make prolog/epilog sequences more homogeneous, which should also be beneficial for generating code that gets the most efficient unwind representation possible.

EDIT:

I like this idea in general, because we could support multiple formats this way. E.g. if you only care about synchronous unwinding the 32-bit descriptor is sufficient, so you could pick between those and the extended 12-byte descriptors (or even mix and match across pages). The kind field at the start of a second level page is a 4-byte integer, so we can easily encode multiple variants.

I’ve been doing some investigation into AArch64. A few observations:

  • Aarch64 Stack Frames Again is a good overview of GCC vs. Clang stack frame layouts, for reference. As we’ve discussed earlier, GCC puts the frame record (the (fp, lr) pair) at the top of the stack, and Clang puts it at the top of the general purpose register save area for non-Darwin targets.
  • Clang always saves registers in order. I haven’t gone through the entirety of the frame lowering code, but I believe this is a guaranteed consequence of how the register sets are ordered in AArch64CallingConvention.td. GCC’s order appears to be much less regular, though maybe there’s a pattern I’m missing.
  • As the comment in that file alludes to, Darwin puts the frame record at the bottom of the GPR save area instead of the top, so that the compact unwinder can always do fp + 16 to recover the stack pointer. Having it at the top of the GPR save area is preferred for SVE access, and we should still be able to count the number of callee-saved registers to determine the offset from fp to the previous stack pointer.
  • The one complication in Clang’s register saving logic is that when registers aren’t saved in pairs, it might place small stack objects in the middle of the general purpose or floating point register save area instead, in a way that’d be hard to reliably reverse-engineer for an unwinder (example of intermixing in the FP and GPR areas). Darwin avoids this by mandating that registers are saved in pairs, and we’d probably want a similar mode for compact unwind purposes. (The homogenous prolog config I mentioned earlier ensures register pairs, but it also has other potentially undesirable effects like multiple stack adjustments and breaking async unwind info, so we might want a different config.)
  • Both GCC and Clang shrink-wrap, including moving potentially faulting instructions before the stack frame is set up. I’ve seen some pretty extreme cases in our apps (that are built with Clang), e.g. the stack frame being set up more than 1000 instructions after the start of the function. Spot-checking those cases though, it seems like there’s still only ever one path in the function which sets up a stack frame (it just might appear pretty late), whereas I suspect GCC might be more aggressive.
  • GCC intermixes instructions into prologs aggressively, and in particular it can overwrite some callee-saved registers before all of them are saved (e.g. Compiler Explorer). I haven’t observed Clang doing this yet.
  • VLAs are complicated. Compiler Explorer is an example, where for full async tracing support, we’d need to know both the instruction which does the initial stack adjustment (which is always the first prolog instruction in all cases I’ve seen for both GCC and Clang) and the instruction which updates fp (which seems to always be at the end of the prolog for Clang but can be interspersed with other instructions for GCC). Any thoughts on how to encode that? For Clang you could probably reverse engineer the prolog size based on the number of registers saved (with an extra bit to record whether the stack pointer adjustment was a pre-index store or an explicit sub), but I’m not sure if that’s fully reliable in all cases.
3 Likes

For shrink-wrapping, you can split the unwind descriptors, as if it was multiple different functions. The Microsoft Arm64 documentation describes one way to do this. This can be cheap for simple case where there’s only one prologue: you don’t need a descriptor for operations before the prologue.

For instructions interspersed in the prologue, Microsoft Arm64 documentation has “nop” opcodes to indicate instructions which aren’t significant to the unwinder. Not sure what kind of design would make having such operations a problem… I mean, dealing with such operations obviously takes more space than assuming they don’t exist, but it doesn’t seem like a hard problem in general.

From the perspective of precise/async unwinding, everything in a function from the first instruction which adjust the stack and the instruction which sets fp has to be part of the “prologue”. If you can’t recover the sp from fp, you need to be able to precisely track it. Once fp is set up, you can ignore later stack adjustments.

If you haven’t yet, you probably want to consider prologues with -fstack-clash-protection enabled.


More generally, I think it makes sense to have a mode where we constrain prologues to reduce the size of unwind info. Don’t know exactly what compiler flags would trigger it. The runtime cost is probably not zero, but it should be small enough that the size savings is worth it, at least for some uses.

2 Likes

I have some ideas in mind to support 32-bit descriptors for sync-only unwinding. But I think a proper evaluation will need an implementation. (I do have some ideas on how to implement this in CodeGen and MC, but will need more time to flesh it out – and I won’t have really have time this week.)

Thanks, this is helpful. I think you discovered most of the relevant cases/problems. Shrink wrapping is not a problem with the recent design we have in this thread. Potentially faulting instructions are also fine.

There needs to be an accurate description for every modification of sp until fp is set. For Clang, this is easy in the case you described. For GCC, apart from the mixing of CSR-saving and CSR-using instructions, FP is set almost immediately, so only the CSR-saving instructions are problematic.

The code that generates the compact unwind info can always verify that the instructions generated match the expectations.

But this only works well in a variable-length format (like xdata). For a fixed-length format, this would either require a substantially larger encoding space or many more entries per function (in addition to a more complex encoding and therefore wasted space for the case where this is not needed). I think the goal should be to handle the common case (let’s say, 95% for Clang?) with the compact, fixed-length format and use the existing variable-length format (DWARF) for the more rare cases.

2 Likes

I think this makes sense, but it does preclude e.g. being a potential replacement for stack tracing scenarios which want to avoid any DWARF bytecode interpretation completely (which I believe is the main motivation behind SFrame). @MaskRay might have some opinions there.

I agree that we should aim to avoid variable-length formats.
The 95% coverage for Clang-generated code could reach 99% in specific build configurations.
If feasible, not-yet-implemented is better than outright unsupported.
Any gain could make the format more favorable to Linux perf maintainers.

That said, SFrame also has many representation limitations (note the numerous as_warn (_("no SFrame FDE emitted; ... warnings in binutils-gdb/gas).

Using gcc -fno-schedule-insns2 avoids interleaved instructions in the prologue. We should make distribution maintainers aware of this option and ideally convince GCC developers to disable prologue interleaving by default (possibly introducing a separate option for the other -fschedule-insns2 effects)


Aarch64 Stack Frames Again is a good overview of GCC vs. Clang stack frame layouts, for reference. As we’ve discussed earlier, GCC puts the frame record (the (fp, lr) pair) at the top of the stack, and Clang puts it at the top of the general purpose register save area for non-Darwin targets.

I haven’t thought this through completely (and unfamiliar with SVE), but GCC’s approach seems superior (with or without -fstack-protector).
The frame pointer is set much earlier, avoiding the need to track callee-saved registers.
Consider filing a feature request?

We may need to diverge from Darwin’s convention.

std::string merge( std::string a, std::string b, std::string c ) {
    std::string d = a + b;
    std::string e = a + d + b;
    return d + e;
}

gcc -O2 -fno-omit-frame-pointer:
        sub     sp, sp, #208
        stp     x29, x30, [sp, 128]
        add     x29, sp, 128

clang -O2 -fno-omit-frame-pointer:
        sub     sp, sp, #176
        stp     x29, x30, [sp, #96]
        stp     x26, x25, [sp, #112]
        stp     x24, x23, [sp, #128]
        stp     x22, x21, [sp, #144]
        stp     x20, x19, [sp, #160]
        add     x29, sp, #96

GCC’s order appears to be much less regular, though maybe there’s a pattern I’m missing.

More generally, I think it makes sense to have a mode where we constrain prologues to reduce the size of unwind info. Don’t know exactly what compiler flags would trigger it. The runtime cost is probably not zero, but it should be small enough that the size savings is worth it, at least for some uses.

Using -fno-schedule-insns2 can help avoid interleaved prologue instructions, simplifying unwind info design.

In the VLA example (-O3 -fomit-frame-pointer), neither GCC nor Clang encodes SP. This is a gap for DWARF as well.

I really appreciate your efforts on this project. Unfortunately my personal availability is limited than is used to be (one year ago). That said, I view this as a high priority for my free time and am certainly willing to offer assistance wherever it’s needed.

2 Likes

Again that is bad design.

People still don’t get shrinkwrapping… There is no such thing as prolog/epilog/body anymore. It’s all mixed and there is not even a fixed set of callee-saves for each call. So turning off scheduling isn’t going to help. In any case, any reordering (whether inside a prolog or not) cannot have any effect on unwinding.

The frame pointer is set much earlier, avoiding the need to track callee-saved registers.
Consider filing a feature request?

Don’t ever rely on the frame pointer either, whether it exists at all or where it happens to be stored. And I’m not sure how it helps avoiding “tracking of callee-saves” - in order to unwind, you need to know the offset of the specific callee-saves at each point you can unwind from.

As we’ve discussed earlier, GCC puts the frame record (the (fp, lr) pair) at the top of the stack,

That’s not correct either. Basically there is no fixed layout, we generate whatever works best for a particular function based on the stack size and other properties. Callee-saves are merged into the function so each call only saves the callee-saves that it requires, not the worst-case across the whole function.

1 Like

Do you mean that different calls can have different CSRs or that the same call can have different CSRs? If the latter, how is this represented in DWARF CFI right now?

… but it does have an effect on the size and complexity of the unwind info.

Well… we’re not trying to infer unwind info from assembly code. We’re trying to design a compact unwind info format that compilers can, optionally and if applicable, emit in addition to or instead of DWARF CFI. The compiler knows when a frame pointer is used and where it is stored and can encode this in the unwind info.

Sure, but based on observations of compiled code, some patterns seem to occur much more often than others. The goal would be to capture these common cases and provide file size and unwind/stack tracing performance benefits for these cases.

It would be helpful if GCC developers could provide information on:

  • Measured performance benefits on applications of mixing prologue/epilogue with other instructions on somewhat recent hardware (x86-64, AArch64).
    • If there’s a substantial (e.g., 5%) benefit of doing so, it would, IMHO, be worth to accommodate for this (and also to consider doing something similar in LLVM).
    • If there’s a very low (e.g., 0.2% or none at all) improvement, the reduced file size and faster unwinding/stack tracing might be a more beneficial trade-off for users.
    • Given that GCC is, to the best of my knowledge, the only back-end that does this, GCC is the most suitable candidate for obtaining such comparative data.
  • A collection of typically/often occurring stack frame setups that GCC emits (i.e., excluding edge/more rare cases).
1 Like

Yes and yes. Seperate shrink wrapping is name.

See https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=view&target=sws.pdf which has results from 2017 on powerpc. It helps SPEC FP more than SPEC Int. though it helped PHP up to 35%.
For aarch64 it was similar IIRC. x86_64 backend support patch was just included (http://gcc.gnu.org/r16-1551-g2c30f828e45078) this year even with more recent results:

povray was the biggest winner there.

I think that is NOT true. I think XLC and its powerpc and s390 backends both implement this for a long time now. Then again for PPC, the ABI is defined enough that backtrace is cheap and does not need special code to begin with.

Separate shrink wrapping in itself is unproblematic (each shrink-wrapped region would get its own unwind descriptor and different regions with different CSR sets are fine). The only difficulty is that GCC mixes unrelated instructions into the frame-setup and register-save sequences. I am specifically interested in the performance benefits of this scheduling, not of having multiple shrink wrapped regions (I know that this is beneficial and intend a compact unwind design to support this).

AFAIK DWARF CFI permits each code location to have a single row. If you say that the same call (i.e., the same code location) can have multiple different CSR sets – how is this encoded?

I don’t think I can try XLC, but anyway, I should’ve been more specific: GCC is, to my knowledge, the only back-end for x86-64/AArch64 which mixes non-prologue instructions into the prologue and/or uses CSRs before all CSRs have been saved.

From what I understand (I never worked on PPC codegen), the ABI also mandates a very rigid stack frame layout. We obviously can’t impose such requirements on e.g. x86-64/AArch64 now, which is why we have this discussion. Nonetheless, I’m not sure whether the PPC ABI permits async unwinding/tracing (I don’t think so, but I don’t know enough).