Thank you for collecting the data; this is valuable.
A quick estimate shows that storing the LSDA reference inline is only beneficial if more than 50% of the functions need one. (Assume an out-of-line table costs 8 B per function with an LSDA (start + LSDA pointer), while storing the LSDA pointer inline costs 4 B in every function; in practice the break-even point is even higher, because functions with multiple descriptors pay the inline cost once per descriptor.)
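Spelled out, with N functions of which a fraction f needs an LSDA (and ignoring multiple descriptors per function):

$$
\text{inline: } 4N\ \mathrm{B}, \qquad \text{out-of-line: } 8fN\ \mathrm{B}, \qquad 4N < 8fN \iff f > \tfrac{1}{2}.
$$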
I collected some numbers from the distribution binaries and libraries installed on one of our Ubuntu and Fedora machines. In total, ~9% of FDEs referred to an LSDA. Files with >50% LSDA coverage (and >100 FDEs) are rare (primarily libz3 (18k/64%), libgrpc (6k/60%), and some Python packages). The Rust binaries I looked at (e.g. cargo, librustc_driver.so, fish) tend to average around ~30%.
I now agree that storing the LSDA reference in a separate table (similar to what Apple is doing) is beneficial.
I don’t think so? E=1 seems to imply one epilogue, but it still needs to encode the offset? It is unclear to me when the compiler would set E=1, though.
I think that non-trivial linker logic is unavoidable for a good format (e.g., only the linker knows the start of the next descriptor and hence must adjust padding_after_epilogue; it also has to index the personality functions, resolve addresses, and construct the search table).
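For illustration, here is a purely hypothetical shape of such a linker-built search table; the names and field widths are made up and only meant in the spirit of .eh_frame_hdr / Apple's compact unwind index, not the actual proposed encoding:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical entry layout (illustrative only).
struct SearchEntry {
  uint32_t func_start;   // image-relative start address of the function
  uint32_t desc_offset;  // offset of the function's first unwind descriptor
};

// Return the entry covering `pc`: the last entry whose func_start <= pc.
// Only the linker can emit this table (and fix up fields such as
// padding_after_epilogue), because only it knows the final addresses and
// where the next descriptor begins.
const SearchEntry *lookup(const SearchEntry *table, std::size_t n,
                          uint32_t pc) {
  auto it = std::upper_bound(table, table + n, pc,
                             [](uint32_t p, const SearchEntry &e) {
                               return p < e.func_start;
                             });
  return it == table ? nullptr : it - 1;
}
```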
W.r.t. -ffunction-sections: IMO, it’s easiest for the compiler to just dump the compact unwind information as a v2 FDE/CIE into the existing .eh_frame (see above). That avoids the cost of one extra section per comdat group, and the logic for special-casing .eh_frame is already in place everywhere. Ideally, linkers without support for compact unwinding will continue to handle the format just fine (although the result will be less efficient), while a linker with compact unwinding support will transform it into a more efficient representation. The unwinder needs to handle both formats (compact search table + v2 FDE), but the latter should be very simple to implement.
If we store the LSDA references in a separate search table, that would not be needed.
Some people do unwind (e.g., throw an exception) from signal handlers on Linux.
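For reference, a minimal (and admittedly frowned-upon) sketch of what that looks like; whether it unwinds cleanly depends on the libc providing CFI for the signal trampoline and on every frame on the stack having unwind tables, and the standard gives no guarantees here:

```cpp
#include <csignal>
#include <cstdio>
#include <stdexcept>

// Throwing here has to unwind through the signal trampoline and the
// interrupted frame (raise), both of which need usable unwind info.
void handler(int) { throw std::runtime_error("signal"); }

int main() {
  std::signal(SIGUSR1, handler);
  try {
    std::raise(SIGUSR1);
  } catch (const std::exception &e) {
    std::printf("caught: %s\n", e.what());
  }
}
```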
Besides, on x86, there isn’t much of a difference when rbp is not used: registers are saved/restored with push/pop, and the location of each of these instructions needs to be known to compute the CFA at every instruction (see the formula below). I’d say that shrink wrapping is primarily used to hoist/sink instructions before the prologue/after the epilogue, and that restricting the prologue/epilogue instruction sequences is reasonable. I know that GCC sometimes mixes other instructions into the prologue, which would be really hard to support. If they consider this highly important for performance (I wouldn’t expect that it matters much, but I haven’t measured), they can fall back to DWARF.
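To make the CFA point concrete (standard SysV x86-64 entry state, illustrative only): at function entry the CFA is rsp + 8 (the pushed return address), and inside a frame-pointer-less prologue

$$
\mathrm{CFA} = \mathtt{rsp} + 8 + 8k + s,
$$

where k is the number of callee-saved pushes already executed at the current PC and s is the stack adjustment (e.g. sub rsp, N) performed so far, so the unwinder has to know the PC of every push/sub to recover the CFA mid-prologue.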
I’ll hopefully have time later this weekend to gather data on:
- Number of functions where compact unwinding should be applicable.
- Number of descriptors per function.
- Number of unique descriptors for deduplication (I’d guess that, due to differing stack-frame, prologue, and padding sizes, only a part of them can be deduplicated).