[RFC] Adding SFrame support to llvm

I’m not saying whether or not this feature counts as undue maintenance burden. I just aimed to point out the fallacy of the argument that “you’re not implementing / extending it so you don’t need to care about it”, because that isn’t true. The community at large still needs to view the feature as sufficiently valuable for the additional maintenance cost it will incur, you can’t just ignore that side of things entirely.

Currently, the lack of field data demonstrating the advantage over traditional unwinding approaches remains a significant impediment to advancing this proposal.

While GNU Assembler and Linker have implemented some early .sframe generation and merge functionality, the entire scheme is at a very nascent stage: it is not supported in the current mainline kernel (the deferred unwind infrastructure is, however, a general feature) and has not yet gained consensus as a promising direction.

Therefore, the argument that “GNU Toolchain has added support” is insufficient justification to upstream this feature, especially given its clear potential to become a maintenance liability without real-world data to back its value. Foundational design issues also need resolution before considering upstreaming. I’ve raised them in two binutils messages; here is a summary:

https://sourceware.org/pipermail/binutils/2025-October/144904.html

  • Version mix-and-match problem requires a concrete solution
  • No other metadata format has ever demanded this much from linkers.
  • Why would supporting multiple concatenated elements within a single .sframe section be problematic?
  • .sframe is far from an established technology.
  • We need a solution that works within existing ELF conventions, not one that requires fundamental changes to them.
  • Let the format mature and prove its value before committing to complex linker behavior.
  • In the x86-64 sqlite3 amalgamation, .sframe is actually larger than .eh_frame

https://sourceware.org/pipermail/binutils/2025-October/144952.html

  • Alternative design that generates .sframe from .eh_frame, making assembler support unnecessary
  • Fine-grained knowledge of the format may expose the linker to more frequent updates, a serious risk given that the linker’s foundational role in the build process demands exceptional stability and robustness.
  • Confirm whether .sframe truly requires the SHF_ALLOC flag. perf supports .debug_frame (tools/perf/util/unwind-libunwind-local.c), which does not have the SHF_ALLOC flag.
  • Existing Linux distribution post-processing tools can be modified to append the .sframe section to executable and shared object files.
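
For anyone who wants to check the SHF_ALLOC point on their own binaries, here is a minimal sketch in Python that reads sh_flags for every section of a little-endian ELF64 file. The helper name section_flags is made up for illustration; the field offsets follow the ELF64 gABI header layout, and the sketch deliberately ignores extended section numbering.

```python
import struct

SHF_ALLOC = 0x2  # gABI: the section occupies memory during process execution


def section_flags(elf: bytes) -> dict:
    """Map section name -> sh_flags for a little-endian ELF64 image.

    Illustrative helper (not part of any tool); if two sections share a
    name, the later one wins.
    """
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    (e_shoff,) = struct.unpack_from("<Q", elf, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", elf, 0x3A)

    def read_shdr(index):
        base = e_shoff + index * e_shentsize
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size
        name, _type, flags, _addr, offset, size = struct.unpack_from(
            "<IIQQQQ", elf, base
        )
        return name, flags, offset, size

    _, _, str_off, str_size = read_shdr(e_shstrndx)
    strtab = elf[str_off : str_off + str_size]

    result = {}
    for i in range(e_shnum):
        name_off, flags, _, _ = read_shdr(i)
        end = strtab.index(b"\0", name_off)
        result[strtab[name_off:end].decode()] = flags
    return result
```

A typical use would be `section_flags(open(path, "rb").read())[".sframe"] & SHF_ALLOC`, where `path` is whatever object or executable you are inspecting.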

Regarding the syscall handling and BPF, the deferred stacktrace infrastructure has been accepted upstream. This is needed for SFrames, as the SFrame tables exist in user space. Profiling happens via an interrupt or NMI. In that context, user space tables are unsafe to read. The deferred infrastructure is a way to delay the user space stack reading until the task goes back to user space. In that context, the user space tables are safe to read.
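
To make the deferral idea concrete, here is a toy model in Python. It is only a sketch of the control flow described above, not the kernel API: the class and method names (Task, on_nmi, on_return_to_user) are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy task: the user stack is only read on the way back to user space."""

    user_stack: list
    pending: list = field(default_factory=list)   # sample ids awaiting a user stack
    samples: dict = field(default_factory=dict)   # sample id -> recorded user stack

    def on_nmi(self, sample_id):
        # Interrupt/NMI context: user-space unwind tables may not be resident,
        # so only note that a user stack is wanted for this sample.
        self.pending.append(sample_id)

    def on_return_to_user(self):
        # Task context just before returning to user space: reading (and
        # faulting in) user memory is allowed, so the unwind happens here.
        stack = list(self.user_stack)
        for sample_id in self.pending:
            self.samples[sample_id] = stack
        self.pending.clear()


task = Task(user_stack=["main", "worker", "read"])
task.on_nmi(1)
task.on_nmi(2)            # second sample during the same kernel entry
task.on_return_to_user()  # both samples are resolved with one unwind
```

In this model a sample has no user stack until the task heads back to user space, which is the property the BPF concern quoted in this thread is about.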

Deferred stacktrace infrastructure has been accepted upstream.

Some perf contributors are still dubious: “A big issue with the kernel side of things is that deferring user stack traces to syscall boundaries (needed so you can page in the debug sections) means a BPF program can no longer just stack trace user code. This breaks stack trace deduplication a commonly used BPF primitive.”

Another major use of the feature is for better performance: omit-fp can be enabled with low profiling (with stack trace) overhead.

Many folks would love to see the data, but more recent analyses like “The Return of the Frame Pointers” show very low value in omitting frame pointers.
Therefore, I have stated the following in my “Remarks on SFrame”: This benefit appears most significant on x86-64, which has limited general-purpose registers (without APX). Performance analyses show mixed results: some studies claim frame pointers degrade performance by less than 1%, while others suggest 1-2%. However, this argument overlooks a critical tradeoff—SFrame unwinding itself performs worse than frame pointer unwinding, potentially negating any performance gains from register availability.

Furthermore, we must consider whether the advantages are ultimately offset by a large memory footprint for .sframe.

“multiple people have already reported issues with lld on Ubuntu 25.10, which enabled sframe emission in GCC by default.”

The maintainer misunderstood what was wanted (testing, not deploying to users). It was enabled in a pre-release and then disabled in 15.2.0-4ubuntu2 (gcc-15 package, Ubuntu).


Kernel ftrace, while significantly simpler than full stack unwinding, has historically led to a sprawling set of command-line options, many driven by specific kernel requirements.
It took many years of effort to establish a nearly ubiquitous solution in the Linux kernel—the adoption of -fpatchable-function-entry. However, the older, less elegant options cannot be easily retired.

-pg
-mfentry
-mnop-mcount
-mprofile-kernel (powerpc)
-mrecord-mcount
-mhotpatch=pre-halfwords,post-halfwords
-fpatchable-function-entry=N[,M]

Unlike these ftrace options, the unwind mechanism represents a much larger maintenance liability because it requires touching and coordinating numerous areas across the entire toolchain: Clang driver, code generation, the assembler, binary utilities, linker, debugger, optional libunwind.
Given this wide dependency footprint, it is imperative to remain highly cautious in its design.


I want to thank MaskRay for the excellent writeup on his blog (and here on Discourse). That was good data to collect in a single location.

Thanks for the kind words!

I generally agree with this stance. I think where I disagree is that I’m not convinced that sframe is “good”: it’s a very limited format, with a backwards-incompatible change on the horizon (v3), which in turn is unlikely to be the last breaking change.

I’d be more comfortable if I had the feeling that there’s active work on addressing the concerns and designing an extendable and forwards-compatible format. Instead, SFrame feels more like an MVP with already several (incompatible) ad-hoc fixes stacked on top (v2, v3) in less than 3 years, requesting special treatment from the tools involved (including suggestions to change ELF behavior instead of fixing the format).

I’m all for trying things out (off-by-default), but I’d expect we’d do so only if there’s (1) reasonable consensus on the technical design, (2) a clear and reasonably well-thought-out path forward, and (3) acceptable maintenance costs. I think there’s nothing stopping us from, e.g., implementing CREL in LLVM+LLD (although I’m not in favor of EV_LIGHT, a subsections-via-symbols-like mechanism could achieve similar benefits in a more compatible way – but that’s off-topic here).

I’m personally missing 1, 2, and 3 for SFrames.

Formats need time to mature and stabilize, but there’s already the GNU implementation that should be sufficient for this purpose?

+1. MaskRay has made huge contributions to the LLD project. This is a blessing but also comes with risks. Without growing new LLD maintainers, the project will naturally tend to avoid new changes, as any single person is bandwidth-limited no matter how productive they are.

This does not contradict the fact that it is accepted upstream? 100% consensus does not exist in reality.

We care about BPF (off-CPU) profiling a lot (more than the average user), so SFrame’s usability with it is a critical requirement. We have examined it carefully and believe it is fine. One of the issues is that if a long-latency system call is invoked at the end of a profiling period, we run the risk of dropping user stacks. This turns out to be a moot issue because the chance of dropping a stack is very low (and thus does not affect the statistical distribution of the profile). There is a solution to the problem too, but we believe it is not even needed. There are other scenarios, such as sync traps (for example, page faults) and async ones such as IPIs and timer interrupts, which should all be handled. I will stop here as they are quite off topic.

We have performance data from production (with millions of machines). omit-fp is enabled for the x86-64 Linux kernel and shows the expected improvement fleetwide – projecting to > 1% if applied in user space. The ORC unwinder is part of the kernel (similar in performance to the SFrame unwinder).

From fleetwide profile data, we also notice that the spill cost is higher on Arm than on x86 despite the fact that it has more general-purpose registers.

No one overlooks the tradeoff if they care about both profiling and performance with omit-fp. SFrame is the enabler for that (as illustrated in the kernel with the ORC unwinder).

Some memory overhead is expected with SFrame, but it is a tradeoff users can make themselves.

For Linux perf, FP-based unwinding can be hit or miss if the sample is in the prologue or epilogue. With -momit-leaf-frame-pointer on, the stacks may be missing or incomplete depending on where the sample lands in the leaf function.
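
A small Python simulation may help show why a sample in the prologue loses a frame. This is a toy model under the usual x86-64 frame-pointer convention ([rbp] holds the caller's rbp, [rbp+8] the return address); the addresses and the symbolic return addresses are made up, and it is not how perf itself is implemented.

```python
def fp_unwind(pc, rbp, memory):
    """Walk the frame-pointer chain starting from the sampled pc and rbp."""
    frames = [pc]
    while rbp != 0:
        frames.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]               # caller's saved frame pointer
    return frames


# Toy stack for main -> parent -> leaf; return addresses are symbolic names.
memory = {
    0x7000: 0,      0x7008: "__libc_start_main",  # main's frame
    0x6FE0: 0x7000, 0x6FE8: "main",               # parent's frame
    0x6FC0: 0x6FE0, 0x6FC8: "parent",             # leaf's frame, once set up
}

# Sample taken after leaf ran `push rbp; mov rbp, rsp`.
after_prologue = fp_unwind("leaf", 0x6FC0, memory)

# Sample taken in leaf's prologue: rbp still points at parent's frame,
# so the walk jumps straight to parent's return address and skips parent.
in_prologue = fp_unwind("leaf", 0x6FE0, memory)

print(after_prologue)
print(in_prologue)
```

The second trace attributes leaf directly to main, which is the kind of misleading result near the top of the stack that this thread describes.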

Of course, that is what format versioning is all about, but the fact that the format is evolving is not a basis for blocking the feature. Things learned from the field with an early version are better than a ‘perfect’ future version designed on paper. Implement and iterate is the right way to go. I will give you a few examples. The original SampleFDO implementation in LLVM was premature, lacking inline context information, but without that first step, we would not have evolved it into today’s powerful AutoFDO framework with support for inline contextual profiles, full CSSPGO, pseudo-probes, flow-sensitive AFDO, etc. The AFDO format has also evolved many times, adding support for the extended format, a known symbol list, etc. Early upstreaming also enabled collaboration among multiple industry partners, creating a healthy environment (instead of being bottlenecked on one maintainer).

Another example is instrumentation PGO. If you trace back the history, PGO today has a very different format from the original version, with support for program summary, value profiling (icalls, mem-ops), name compression, begin-end sections, platform-dependent support, vtable profiling, memory profiling, etc. added along the way. An IR-based PGO was also developed later. Without the original version (even though it was not usable for us at all) and its foundation, we would not have reached the current state.

Not really – there are constraints on what toolchain users can use.

David

It’s not just the prologue and epilogue: even with -mno-omit-leaf-frame-pointer, setting up the frame can be delayed for quite a bit. For example, the happy path in glibc’s malloc doesn’t set up a frame even if built with the options. This happens even before the GCC 16 change to enable separate shrink wrapping. A lot of workloads spend quite some time in the glibc string functions implemented in hand-written assembler, and we do not have variants with frame pointers of those.

I think the explanation here is that profiling needs vary: People pushing for frame pointers either are interested in what happens further down the stack (so that they can get meaningful flamegraphs), or it’s not really profiling for them, but getting stack traces from the kernel for behavioral analysis of userspace (as part of some intrusion detection system). In contrast, those who mainly work on system libraries and low-level components need accurate stack traces at the top, and frame pointers can give misleading results there.

Taking a step back:

We have a significant technical disagreement, and don’t seem to be making progress. Here is my best summary of the two positions and a proposed way forward.

[This is my best understanding, and it may be wrong. Please correct as you see fit.]:

LLD should not support sframe version 2 and should wait to add any support (Maskray)

The format isn’t mature, is complex to generate, and has several limitations and bugs. The benefits of the new format are unclear at best. As presently defined, linkers must be aware of the details of the format in ways that add undue complexity, slow things down, and are not linker or ELF friendly. Implementing various versions as it matures adds instability to one of the foundational tools of a distribution.

This extra complexity and instability is bad for linkers in general and LLD in particular. There isn’t enough experience with the format yet. SFrame Version 3 is currently being defined, and we should wait and see if it is adequate, and only then support it. V3 may or may not fix the issues, but at the very least we should wait so that we don’t have to support multiple versions or to review code for both versions. Even worse, supporting two versions will add just that much more complexity and review burden. Adding support is a burden on reviewers and damages the health of LLD’s source code.

LLD should support sframe version 2 (Sterling-Augustine)

The gnu toolchain supports sframes version 2 today, and interoperability is extremely useful. Ubuntu 25.10, among other distributions, has already enabled sframes by default, and users have reported bugs in LLD. Other distributions would like to enable it by default, but cannot without linker support (whether LLD or some other). The linux kernel recently added a feature that depends on it for hot-patching, and expects it to replace the ORC unwinder. Without this support, advanced kernel development will be limited to gnu-binutils only. That other projects have adopted it is sufficient evidence of benefit. Also we have done additional analysis and believe it will be useful for our internal unwinding situation.

We shouldn’t wait for V3. It is not yet finalized, and won’t propagate to distributions for quite some time after that, perhaps a year. It is unlikely that most, or even some, of the concerns will be addressed in V3 in ways that would be acceptable. (See continuing upstream discussion here.) Further, it would be good to have a well-tested and well-understood implementation of V2 in place before V3. In fact, one of the proposals includes automatically upgrading V2 object files to V3, so at least reading V2 may be required.

The compiler is also a fundamental tool of all distributions, and they traditionally accept many new features that are gaining traction. Clang itself is very open to new features and tools, and something with the same amount of traction in the compiler world would be accepted. Not every experimental feature will work out, but preventing them from going in makes the project stagnant.

Although there is some additional complexity to be added, it is not especially significant, and is comparable to the other unwind formats supported by LLD.

I am cognizant of the additional review burden this will require, and would be happy to spread the burden among any delegates designated by the project. Reducing the extra review burden on the LLD maintainers is a worthy goal. I am also happy to take any bugs it may introduce.

Going Forward

I don’t think we are making progress on the issues between ourselves, so I propose we follow the LLVM Project guidelines for escalating a decision. It starts by taking the disagreement to the Area Team (https://github.com/llvm/llvm-www/blob/HEAD/proposals/LP0004-project-governance.md#project-council) for a given sub-project. But as far as I know, there is no LLD Area Team, in which case it goes directly to the Project Council (https://github.com/llvm/llvm-www/blob/HEAD/proposals/LP0004-project-governance.md#project-council)

Maskray (or anyone else): Do you think this would be a good way forward?

IMO LLD has fallen through the cracks of the policy: I would imagine the Project Council fallback exists for cases where an area is not large enough nor actively developed enough to warrant an Area Team, but LLD doesn’t appear to me (as an LLD outsider) to fall in those categories.

Thought experiment: if an LLD Area Team were established, who are likely to serve? Would an escalation to the hypothetical LLD Area Team lead to a different decision than an escalation to the Project Council?

rnk: Maintenance burden is an important consideration, but how can we grow new LLD maintainers if the project has such a conservative stance that new contributors can’t add features to the linker? The way open source normally works is someone proposes a feature, stakes out some space in the repository, implements it behind a flag, iterates on it, refactors it, integrates it into the system, and as they iterate on that, they take on maintenance responsibilities. The bet on inclusivity may not pay off in all cases, and I’m sure we can point to our favorite poorly-thought-out and unmaintained feature, but I firmly believe that this is how you build sustainable open source projects.
davidxl: +1. MaskRay has made huge contributions to the LLD project. This is a blessing but also comes with risks. Without growing new LLD maintainers, the project will naturally tend to avoid new changes, as any single person is bandwidth-limited no matter how productive they are.

I appreciate the recognition, but let me be clear: my concerns here are strictly technical, not about gatekeeping or aversion to new features.
I’ve accepted and integrated numerous features into LLD over the years.
This particular proposal has unresolved foundational design issues that need addressing before upstreaming, regardless of who is reviewing it.

We have performance data from production (with millions of machines). omit-fp is enabled for the x86-64 Linux kernel and shows the expected improvement fleetwide – projecting to > 1% if applied in user space. The ORC unwinder is part of the kernel (similar in performance to the SFrame unwinder).

I would very much like to see this data: “Currently, the lack of field data demonstrating the advantage over traditional unwinding approaches remains a significant impediment to advancing this proposal.”
This is really unusual compared with other upstreamed performance work I’ve seen from Google.

Specifically:

  • Is this measurement from kernel space (which already uses ORC)? Are you comparing FP, ORC, and SFrame?
  • Kernel vs user-space unwinding is very different. How do you translate the x86-64 kernel saving to userspace?
    The ORC-related thread mentioned
    “enabling framepointer introduced overhead of around the 5-10% mark.”
    The number is much larger than the userspace overhead people have been observing.
  • What is the performance overhead and memory footprint impact of SFrame sections?

Of course, that is what format versioning is all about, but the fact that the format is evolving is not a basis for blocking the feature. Things learned from the field with an early version are better than a ‘perfect’ future version designed on paper. Implement and iterate is the right way to go.

I respectfully disagree that “implement and iterate” applies universally. The PGO examples you cite are fundamentally different:

  • PGO format changes don’t impact linker stability or require coordinating changes across the entire toolchain
  • PGO demonstrated clear value from day one; it evolved features but wasn’t fundamentally unproven
  • Most critically: PGO didn’t have unresolved design issues at the time of initial upstreaming

SFrame has specific, concrete design problems that I’ve raised in my binutils messages. “Implement and iterate” is not a substitute for resolving these issues first. These are not hypothetical concerns—they are real problems that will affect every user if we upstream prematurely.

These aren’t minor implementation details to iterate on later—they’re foundational design questions that affect whether this approach is sound.

This does not contradict the fact that it is accepted upstream? 100% consensus does not exist in reality.

Acceptance upstream doesn’t mean the concerns are invalid or resolved.
The code is not yet in the current mainline kernel. There is no SHF_GNU_SFRAME or PT_GNU_SFRAME.
In addition, having code, which might mean replacing the kernel-specific ORC with SFrame, is also different from having userspace adoption.

We care about BPF (off CPU) profiling a lot (more than any average user), so SFrame’s usability with it is a critical requirement, so it is carefully examined and we believe it is fine.

Thanks for the context. I hope it is fine as well.

but there’s already the GNU implementation that should be sufficient for this purpose?

Not really – there are constraints on what toolchain users can use.

Toolchain choice is a valid consideration (understanding that Google is locked to LLVM), but it doesn’t override the need to resolve design issues before upstreaming.
Users are better served by a delayed but correct implementation than a premature one that introduces maintenance burden and potential incompatibilities.

I spent a long time on the Google production toolchain team, so I understand well the strong cultural aversion to local patches and downstream changes.
However, this attitude—while valuable for maintaining stability and bisection ease—may not be the right fit when upstreaming experimental and highly exploratory work.
Experimental features benefit from maturation and validation before they become everyone’s problem to maintain.
The aversion to local changes shouldn’t pressure us into premature upstreaming.


To be clear: I’m not opposed to SFrame as a concept. I’m opposed to upstreaming an implementation with unresolved foundational issues and insufficient field validation. If the design questions are addressed and real-world data (not projections) demonstrates value, that changes the calculus considerably.
The burden here is to demonstrate that this feature is ready, not to argue that readiness shouldn’t be required.


LLD should support sframe version 2 (Sterling-Augustine)

The gnu toolchain supports sframes version 2 today, and interoperability is extremely useful. Ubuntu 25.10, among other distributions, has already enabled sframes by default, and users have reported bugs in LLD. Other distributions would like to enable it by default, but cannot without linker support (whether LLD or some other). The linux kernel recently added a feature that depends on it for hot-patching, and expects it to replace the ORC unwinder. Without this support, advanced kernel development will be limited to gnu-binutils only. That other projects have adopted it is sufficient evidence of benefit. Also we have done additional analysis and believe it will be useful for our internal unwinding situation.

As I pointed out earlier, Ubuntu 25.10 hasn’t enabled SFrame by default.
The maintainer misunderstood what was wanted—testing, not deploying to users. A pre-release version enabled it by accident, but the next pre-release (iiuc 15.2.0-4ubuntu2) disabled it.

While there are discussions about using SFrame for aarch64 livepatch, I haven’t found any upstreamed code implementing the proposal.
The kernel unwinding process is orthogonal to userspace unwinding. Technically, objtool could be modified to generate .sframe from .eh_frame instead of relying on the assembler/linker.

Maskray:

I think we are going in circles here, which is why I propose we escalate to the project council. Do you agree that this is a good way forward?

This does not contradict the observation that the bandwidth of a single person can be limited?

Sorry, which part is unusual? The data I provided is from production – the optimization is already deployed, saving machine resources and power.

I am afraid we are being dragged into a rabbit hole now. Since you are interested, I can give a little more detail. The performance is measured using the applications’ productivity metrics. It is ORC + omitfp compared against FP + no-omitfp.

See above about what matters for performance.

It is very common for a benchmark setup (same for user space).

Performance overhead is low (it depends on profiling frequency, which is generally low). Memory overhead is very low (0.2%), but again it is the user’s trade-off (i.e. RAM is over-provisioned and not a bottleneck).

The improved stack trace quality can also help improve performance indirectly (via AFDO etc).

We are in a rabbit hole again – the PGO example is used to demonstrate how the rinse-and-iterate approach helps. PGO also cares about format backward compatibility and versioning control.

Clearly not for autoFDO.

This was clearly not the case (see my original reply). There were also many unknown knowns at the time (especially for AFDO).

People provided you with data and information, but you keep challenging them or nitpicking details.

(A side note: you claim that linker stability is more important than anything else in the toolchain. How can we improve the situation (including code structure, testing coverage, etc.) so that LLD is more robust and resistant to destabilization? This can reduce the risks of adding new linker features in the future.)

First and foremost, I think we should try to use the processes we’ve created to resolve disagreement. And, if folks have any critical feedback about those processes, it is welcome. We have a panel at the dev meeting next week on the lessons from the first year of LLVM area teams.


Area teams can be established by finding 3 potential members and asking the project council to create a new area team (doc link):

Process for New Area Teams

Any project area that has at least three members interested in forming an area team can request the project council form one. The project council will then consider the needs of the project and determine whether to form a new team or not.

When the project council forms a new area team, the project council will nominate members for the team to serve until the next elections.

The project council has the ability to overrule area teams with a 2/3 majority vote (doc link), but the ideal outcome is that folks reach a mutually agreeable compromise:

… If agreement cannot be reached, the area team may act as the final decision maker. In that capacity decisions of an area team are considered final, but can be overruled by a 2/3 majority vote of the project council or the area team itself revisiting the issue. If an area team cannot reach consensus, it may request the project council to resolve the disagreement.

It’s true these are the internal incentives at Google and they are not a good reason to upstream bad designs that won’t be adopted by the community.

However, would it change your view if there were additional stakeholders? Yes, Google can go maintain a soft fork / branch of LLD, but would that really be the best outcome for the community? Presumably there are other LLD users out there who would like to experiment with SFrame (Meta), and usually the most effective way for us to collaborate is to add features upstream under experimental feature flags until things stabilize.

Is there some way we can fence off this complexity, and commit to remove and replace it when the v3 format comes along? Any promise should be discounted, of course, since removing features is harder than adding them, but we’ve done things like this before in clang (-fexperimental-*). Can we add documentation disclaiming support and stability for sframes in some way?

The removal of features is generally much more difficult than adding them. Before accepting new features, we should provide more public data and address technical issues.
Many contributors, after submitting new features, don’t have the time to maintain them. Maintainers end up spending a lot of time handling the post-submission maintenance work. Perhaps we can clarify the technical challenges in advance to reduce maintenance issues later on.

I think an experimental flag that’s required to use sframe support, with a big warning sign that this can be removed at any time, could be a great way forward.

One could even make the warning part of the flag, e.g. --experimental-sframe-may-break-at-any-time, but ultimately we should IMO be treating users as consenting adults, who can handle the consequences of opting into something experimental.

Regarding the technical review questions: Asking for concrete data on performance overhead, memory footprint, and how kernel measurements translate to userspace is a standard part of the review process. These foundational questions help ensure the proposal is well-supported by empirical evidence.

Regarding the performance data discussion: The kernel-space measurements with ORC are valuable, but there’s a critical gap in our understanding. The kernel environment (ORC+omitfp vs FP+no-omitfp) differs fundamentally from userspace, where this RFC proposes SFrame adoption. Direct userspace performance measurements would strengthen the case by avoiding the need to extrapolate kernel savings to a different environment.

On LLD robustness: We should improve testing and maintain high standards for what we upstream.

AutoFDO uses LBR, which has a limited depth (32 on Skylake). Do you use FP for other stack trace requests—such as backtrace() and non-SamplePGO profiling?


I believe there may be some misalignment between expectations and what SFrame realistically offers. A few observations:

  • linux-perf is waiting for V3
  • Some developers explored enabling arm64 livepatch with SFrame, but Song Liu has since implemented a frame-pointer-based alternative
  • No Linux distribution has adopted SFrame.
  • I am not the only one questioning the object file format design. As more linker-aware folks become aware of this format, similar concerns are being raised: GNU Tools Cauldron SFrame talk notes https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8
  • If SFrame is exclusively a kernel-space feature, it could be implemented entirely within objtool – similar to how objtool --link --orc generates ORC info for vmlinux.o. This approach would eliminate the need for any modifications to assemblers and linkers, while allowing SFrame to evolve in any incompatible way.

While SFrame may still have potential to replace the ORC unwinder in the kernel (ORC being a simple format that is considerably larger than .eh_frame), its viability as a stack walking mechanism for userspace programs remains an open question.

From https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs

% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe/bin/clang
    FILE SIZE        VM SIZE
 --------------  --------------
  63.9%  88.0Mi  73.9%  88.0Mi    .text
  11.1%  15.2Mi   0.0%       0    .strtab
   7.2%  9.96Mi   8.4%  9.96Mi    .rodata
   6.4%  8.87Mi   7.5%  8.87Mi    .sframe
   5.1%  7.07Mi   5.9%  7.07Mi    .eh_frame
   2.9%  3.96Mi   0.0%       0    .symtab
   1.4%  1.98Mi   1.7%  1.98Mi    .data.rel.ro
   0.9%  1.23Mi   1.0%  1.23Mi    [LOAD #4 [R]]
   0.7%   999Ki   0.8%   999Ki    .eh_frame_hdr
   0.0%       0   0.5%   614Ki    .bss
   0.2%   294Ki   0.2%   294Ki    .data
   0.0%  23.1Ki   0.0%  23.1Ki    .rela.dyn
   0.0%  8.99Ki   0.0%  8.99Ki    .dynstr
   0.0%  8.77Ki   0.0%  8.77Ki    .dynsym
   0.0%  7.24Ki   0.0%  7.24Ki    .rela.plt
   0.0%  6.73Ki   0.0%       0    [Unmapped]
   0.0%  6.29Ki   0.0%  3.84Ki    [21 Others]
   0.0%  4.84Ki   0.0%  4.84Ki    .plt
   0.0%  3.36Ki   0.0%  3.30Ki    .init_array
   0.0%  2.50Ki   0.0%  2.50Ki    .hash
   0.0%  2.44Ki   0.0%  2.44Ki    .got.plt
 100.0%   137Mi 100.0%   119Mi    TOTAL
% ~/Dev/unwind-info-size-analyzer/eh_size.rb /tmp/out/custom-sframe/bin/clang
clang: sframe=9303875 eh_frame=7408976 eh_frame_hdr=1023004 eh=8431980 sframe/eh_frame=1.2558 sframe/eh=1.1034

The results show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (7.07 MiB + 999 KiB ≈ 8.04 MiB).
Since .eh_frame cannot be eliminated (doing so would lead to loss of restoring callee-saved registers, LSDA, and personality information), this size overhead raises significant concerns about the practical viability of this approach.
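
For anyone who wants to re-derive the ratios, here is a minimal sketch that recomputes them from the byte counts printed by eh_size.rb above. The variable and function names are mine, chosen for illustration; only the three byte counts come from the measurement.

```python
# Section sizes in bytes, copied from the eh_size.rb output for clang.
SFRAME = 9_303_875
EH_FRAME = 7_408_976
EH_FRAME_HDR = 1_023_004


def ratio(numerator: int, denominator: int) -> float:
    """Size ratio rounded to four decimals, matching the tool's output."""
    return round(numerator / denominator, 4)


# .eh_frame_hdr is the lookup table for .eh_frame, so count both together.
eh_total = EH_FRAME + EH_FRAME_HDR

sframe_vs_eh_frame = ratio(SFRAME, EH_FRAME)  # .sframe vs .eh_frame alone
sframe_vs_eh = ratio(SFRAME, eh_total)        # .sframe vs .eh_frame + .eh_frame_hdr

# .eh_frame has to stay, so .sframe is added on top of it rather than
# replacing it. This is the resulting growth of unwind metadata.
growth_with_sframe = ratio(SFRAME + eh_total, eh_total)

print(f"sframe/eh_frame={sframe_vs_eh_frame} sframe/eh={sframe_vs_eh}")
print(f"unwind metadata grows by a factor of {growth_with_sframe}")
```

The last figure is the one that matters for the argument here: as long as both sections are kept, unwind metadata roughly doubles.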

It’s worth noting that there are existing, battle-tested implementations of a compact unwind format in LLVM, lld/MachO, and libunwind that work with C++ exception handling (in production since 2015 or earlier).
macOS was an early adopter, and OpenVMS appears to use a variant (see the “VSI OpenVMS Calling Standard” and the earlier “[RFC] Improving compact x86-64 compact unwind descriptors”).
“The Apple Compact Unwinding Format: Documented and Explained” (Faultlore) documents how this works.

This approach allows frames that cannot be described compactly to fall back to DWARF unwinding, which means most DWARF CFI entries can be removed while still maintaining full functionality.
In contrast, today’s SFrame implementation in GNU Assembler would emit many warnings when building llvm-project.
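
To make the DWARF fallback concrete, here is a rough sketch of how an unwinder might classify a single x86-64 compact unwind encoding. The constants are the ones defined in LLVM's compact_unwind_encoding.h. The `classify` function and its return labels are hypothetical and only illustrate the decision, not how libunwind is actually structured.

```python
from typing import Optional, Tuple

# x86-64 constants from LLVM's compact_unwind_encoding.h.
UNWIND_X86_64_MODE_MASK = 0x0F000000
UNWIND_X86_64_MODE_RBP_FRAME = 0x01000000
UNWIND_X86_64_MODE_STACK_IMMD = 0x02000000
UNWIND_X86_64_MODE_STACK_IND = 0x03000000
UNWIND_X86_64_MODE_DWARF = 0x04000000
UNWIND_X86_64_DWARF_SECTION_OFFSET = 0x00FFFFFF

COMPACT_MODES = (
    UNWIND_X86_64_MODE_RBP_FRAME,
    UNWIND_X86_64_MODE_STACK_IMMD,
    UNWIND_X86_64_MODE_STACK_IND,
)


def classify(encoding: int) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: say how a function's frame is described.

    Returns ("compact", None) when the 32-bit encoding is self-sufficient,
    ("dwarf", offset) when the unwinder has to read the FDE at `offset`
    in __eh_frame, and ("none", None) when no mode is set.
    """
    mode = encoding & UNWIND_X86_64_MODE_MASK
    if mode == UNWIND_X86_64_MODE_DWARF:
        return ("dwarf", encoding & UNWIND_X86_64_DWARF_SECTION_OFFSET)
    if mode in COMPACT_MODES:
        return ("compact", None)
    return ("none", None)


# An RBP-based frame needs no DWARF; a DWARF-mode entry points into __eh_frame.
print(classify(UNWIND_X86_64_MODE_RBP_FRAME))
print(classify(UNWIND_X86_64_MODE_DWARF | 0x58))
```

Only functions whose encoding is in DWARF mode need an FDE, which is why the linker can drop most of __eh_frame.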

As a concrete example, in a clang executable on macOS (objdump --arch=x86_64 -h), the __text section is 0x4a55470 bytes, while __unwind_info is just 0x79060 bytes and __eh_frame only 0x58 bytes. This demonstrates the efficiency of the approach, even though it only covers synchronous unwinding.
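
As a quick sanity check on those Mach-O numbers, the sketch below converts the hex section sizes and expresses the unwind metadata as a percentage of __text. The variable names are illustrative; the three sizes are the ones quoted above.

```python
# Section sizes from `objdump --arch=x86_64 -h` on a macOS clang binary.
TEXT = 0x4A55470        # __text
UNWIND_INFO = 0x79060   # __unwind_info (compact unwind)
EH_FRAME = 0x58         # __eh_frame (DWARF fallback entries only)

unwind_total = UNWIND_INFO + EH_FRAME
percent_of_text = 100 * unwind_total / TEXT

print(f"__text        = {TEXT} bytes")
print(f"unwind total  = {unwind_total} bytes")
print(f"unwind/__text = {percent_of_text:.2f}%")
```

This comes out well under 1% of __text. Keep in mind that it is a different binary from the ELF build measured earlier, so the two are not a strict apples-to-apples comparison.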

Regarding AArch64: It would be valuable to gather Arm’s perspective on compact unwind for ELF (“Was any form of compact unwind information considered for AArch64?”, issue #344 in ARM-software/abi-aa on GitHub). I’ll ask them. In the meantime, the Mach-O implementation provides a proven baseline for this architecture.


I will go ahead and put this issue on the Project Council’s agenda.

In case there is a misunderstanding: the optimization is enabled for the kernel, but the performance metric is measured from user space (qps/cpu). In addition, our past experience with kernel performance work (e.g. PGO) indicates that extrapolating from a kernel performance improvement to an overall application-level improvement works reasonably well (according to the kernel/user space cycle distribution).

More importantly, the data comes from production jobs, which makes it more reliable than data collected in a benchmarking environment. We can’t have production data for user space omitfp+SFrame yet for obvious reasons, but I believe the data we have is strong enough.

As mentioned in https://sourceware.org/pipermail/binutils/2025-October/145028.html:
V3’s original feature list was an exhaustive collection of all review comments, shortcomings seen so far with V2 (collected over a span of about a yr). Since then, we have taken time to evaluate each one and even discard some requests (make SFrame FRE accesses aligned), precisely because the return for additional complexity was not worth the gains. We understand the value of keeping the complexity low.

Will the next version bump need a similarly varied feature set? I don’t think so.

I didn’t find any mention of recording callee-saved register information in the linked document. Is that planned at all? IMO it would be a shame to go through all the trouble of supporting a new unwind information format if it’s not able to supplant eh_frame. I know the kernel obviously doesn’t care about exceptions, and I believe Google builds without them as well, but Meta uses exceptions frequently, for example. There’s lots of prior art for formats that efficiently encode exception unwinding information (e.g. Apple’s compact unwind, ARM EHABI’s exidx/extab, and Microsoft’s pdata/xdata), and not supporting that in a new unwind info format seems like a huge missed opportunity.


SFrame is a stack trace format, not a stack unwind format.

Summary from https://sourceware.org/pipermail/binutils/2025-October/145027.html:
SFrame requires SHF_ALLOC. This ensures that SFrame is available in the program’s memory. Non-SHF_ALLOC SFrame will hurt tracers and run-time stack tracing use cases, because the cost of mmap is not amortized over the runtime of the application. Cost is only one aspect; in some critical scenarios, bringing in a section at the moment a backtrace is needed is just not viable.

Sure, perf supports .debug_frame which is not SHF_ALLOC. But there is a difference between status quo and whether the status quo is the best we can do.
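
For readers less familiar with the flag being discussed: SHF_ALLOC is a bit in an ELF section header's sh_flags that says whether the section occupies memory in the running process. Below is a minimal sketch of the check. The flag values are from the ELF specification, while the helper name and the example flag words are only illustrative.

```python
# Section header flag bits from the ELF specification.
SHF_WRITE = 0x1
SHF_ALLOC = 0x2
SHF_EXECINSTR = 0x4


def is_loaded_at_runtime(sh_flags: int) -> bool:
    """Hypothetical helper: True if the section is mapped into the process."""
    return bool(sh_flags & SHF_ALLOC)


# Illustrative flag words: an allocated read-only section such as .eh_frame
# carries SHF_ALLOC, while a debug section such as .debug_frame carries none.
alloc_section_flags = SHF_ALLOC
debug_section_flags = 0

print(is_loaded_at_runtime(alloc_section_flags))
print(is_loaded_at_runtime(debug_section_flags))
```

A tracer that depends on a non-SHF_ALLOC section has to locate and map the file contents itself, which is the mmap cost referred to above.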

It is just not feasible to use a post-processing tool across the wide variety of userspace packages in a distro.