[RFC] Please consider enabling -fstack-clash-protection / probe-stack by default

[I originally posted this as Github issue [clang] Please consider enabling -fstack-clash-protection / probe-stack by default · Issue #184428 · llvm/llvm-project · GitHub , but was asked there to post it as an RFC instead. I am also not sure if this belongs in the frontend or the backend category - the frontend is where the flag for this is currently exposed to the user, but the backend is where it is implemented…]

Proposed change

Please consider enabling -fstack-clash-protection (called probe-stack in the IR) by default for supported targets, so that unbounded recursion on a non-main thread in combination with a large stackframe (containing uninitialized buffers) at the bottom of the recursion does not turn into an exploitable security bug.

Reasoning

In my opinion, stack overflows are security issues that can only be mitigated by the compiler, because there is no good way for a programmer to protect against them other than by manually keeping track of recursion depth and available stack memory.

Also, if a programmer wants to guard against exploitable stack overflows in their own recursive code, it is not necessarily enough for them to build their own code with -fstack-clash-protection, because the function at the bottom of the stack that moves the stack pointer across the guard page might be part of a library dependency.

MSVC / Apple defaults

MSVC documentation says:

By default, the compiler generates code that initiates a stack probe when a function requires more than one page of stack space.

Apple’s version of clang also emits __chkstk_darwin calls for functions with large stack buffers.

Complications / Downsides

probe-stack is currently only supported by some backends (X86, PowerPC, SystemZ, AArch64, RISCV, I think); other backends would remain unprotected.

Luckily stack probing doesn’t require any runtime helpers, so this should not depend on having a recent libc version or such.

This will create extra code, and runtime cost, in functions with large stack frames; however, in my experience, codebases typically only have a tiny number of such functions.

The worst case in terms of runtime overhead would probably be functions that have on-stack buffers that are many pages big, and either don’t use these buffers at all, or only use a small part of such buffers. (However, that overhead is likely dwarfed by the overhead of automatic stack buffer initialization, if that is enabled and triggers on the buffer.)

Motivating example

For a motivating example, see https://project-zero.issues.chromium.org/issues/465827985, an Android issue where the combination of the following factors makes it possible to escalate privileges from shell context to the more privileged system_server context (though my proof of concept only manages to do this at a low success rate):

  • Android compiles code without -fstack-clash-protection.

  • Android has an IPC mechanism that supports synchronous calls with synchronous callbacks, and it is possible to infinitely nest such IPC calls to cause unbounded recursion by design.

  • Non-main thread stacks on Android are placed in the same virtual address region as heap allocations and shared memory mappings.

  • Non-main thread stacks on Android that can run Java code effectively have a 8 KiB or 16 KiB guard region at the bottom, depending on how you count.

  • Another function that can be called over IPC, including from a nested IPC call context, contains a 128 KiB stack buffer, and performs a function call to another non-leaf function before initializing this buffer.

This makes it possible for a saved instruction pointer value to be spilled into, and loaded back from, a shared memory segment located below the guard page.

1 Like

In terms of generated code:

  1. You mention impact on performance and code size. Do you have numbers? Bringing Stack Clash Protection to Clang / X86 — the Open Source Way - The LLVM Project Blog previously mentioned that Firefox did not see any major regressions, but a single application is not necessarily representative. I also don’t see any code size numbers there.

In terms of ecosystem compatibility:

  1. What does gcc do here?

For the motivating example, I’m also not sure why it wouldn’t be sufficient to just compile Android with -fstack-clash-protection instead of trying to make it the default in clang? You can probably find such motivating examples for any security hardening feature, but that doesn’t mean we enable any of them by default.

1 Like

That type of default is normally the choice of the platform, like Android or a specific Linux Distro.

There is some precedent for configure time defaults that something like a Linux Distro could opt-in to. For example CLANG_DEFAULT_PIE_ON_LINUX llvm-project/clang/CMakeLists.txt at main · llvm/llvm-project · GitHub

I guess that could be added for the platforms that wanted it, but I’d be wary of changing the default for everyone that could potentially support it. For example it may not be appropriate for bare-metal targets.

1 Like

Talks about one distro defaults and this option for gcc.

I have to double check but I think it is enabled with -fhardening too.

1 Like

You mention impact on performance and code size. Do you have numbers?

I don’t have performance numbers; but for example, on a Debian Trixie machine, I see 112 functions in glibc that perform stack pointer decrements >=0x1000 bytes or with a register operand:

$ objdump -d /usr/lib/x86_64-linux-gnu/libc.so.6 | egrep 'sub *[^ $].*,%rsp|sub *\$0x.{4,},%rsp' | wc -l
112

Grepping through Chrome for Linux at version 145.0.7632.159, I see 43 stack pointer decrements by non-constant values:

$ objdump -d /opt/google/chrome/chrome > /tmp/chrome.disas
$ grep 'sub *[^ $].*,%rsp' /tmp/chrome.disas | wc -l
43

and 797 stack pointer decrements by values >=0x1000:

$ egrep 'sub *\$0x.{4,},%rsp' /tmp/chrome.disas | wc -l
797

What does gcc do here?

My understanding is that GCC, like clang, supports -fstack-clash-protection but doesn’t enable it by default.

For the motivating example, I’m also not sure why it wouldn’t be sufficient to just compile Android with -fstack-clash-protection instead of trying to make it the default in clang?

Yes, that will fix it for Android, but not for the rest of the Linux ecosystem.

You can probably find such motivating examples for any security hardening feature, but that doesn’t mean we enable any of them by default.

In my view, this is not just a security hardening feature.

Normal security hardening features catch things that are clearly bugs created by a programmer - stuff like calling sprintf() on a buffer too small to store the result. In my view, stack overflows are different because you can have exploitable security bugs without a clear programmer mistake in a single spot - I don’t think it is reasonable to expect programmers to mentally keep track of how much stack memory they have to work with at any given point, across library boundaries, so well that the system’s security can depend on it.

Ah, nice. I just learned that Debian has it enabled too: Making sure you're not a bot!

Though somehow that doesn’t seem to apply to Debian’s glibc build: I Challenge Thee

These numbers are great, but it’s still unclear how they translate into performance/code size changes. I don’t think this change is sufficiently motivated by looking for some assembly patterns in one application and one particular distro’s libc.

Does the entirety of the rest of the Linux ecosystem want this to be fixed? There are plenty of users that would not be willing to trade off a 0.x% performance regression for slightly increased security. In cases like HPC where often the data and the code can both be trusted, users want maximum performance at the expense of security and disable features like PIE for that reason.

I don’t see why everyone doesn’t get what they want by this being opt-in. Large projects that care about this already can and do opt-in to this behavior.

1 Like

I guess that’s a difference in perspective - I see this as being closer to a correctness property than a hardening feature like PIE. It sounds like folks here disagree on that.

I think I’m used to a different way of setting defaults, where the defaults try to ensure secure and correct system behavior as long as there are no bugs, and opt-in hardening features are for ensuring security in the face of security bugs.

But if the idea is that every distribution that uses clang, and every direct user of upstream clang, is always expected to come up with their own set of security configuration flags, and nobody uses the defaults, then I guess that doesn’t matter.

To confirm, yes -fstack-clash-protection is enabled by -fhardenedin GCC if the target supports the slp flag.

For distros and platforms that want this today, clang’s configuration file mechanism already provides a clean path without requiring any upstream changes or a custom clang build. So nobody is blocked on this RFC to ship the protection they want.

The remaining question is whether upstream should flip the default for hosted targets. I believe the objection is strong enough that this won’t happen.

I’d rather not add a CMake configure-time option for this (à la CLANG_DEFAULT_PIE_ON_LINUX). This is much much more niche than PIE or not. Every such option adds maintenance surface and build matrix complexity, and clang’s config file mechanism already solves this problem more cleanly.

I wonder if implicitly enabling -fstack-clash-protection would break the Linux kernel builds? :thinking:

Perhaps other projects similarly might break; and not at compile time either I would imagine.

2 Likes

How would this break Linux kernel builds? My understanding is that, at
least for Linux targets, it only causes extra memory accesses to
addresses that are below the current stack pointer, but are part of
the stack frame that is being set up for the current function; is that
wrong?