[I originally posted this as GitHub issue #184428 in llvm/llvm-project ("[clang] Please consider enabling -fstack-clash-protection / probe-stack by default"), but was asked there to post it as an RFC instead. I am also not sure whether this belongs in the frontend or the backend category: the frontend is where the flag is currently exposed to the user, but the backend is where it is implemented.]
Proposed change
Please consider enabling -fstack-clash-protection (called probe-stack in the IR) by default for supported targets, so that unbounded recursion on a non-main thread, combined with a large stack frame (containing uninitialized buffers) at the bottom of the recursion, does not turn into an exploitable security bug.
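As a rough illustration of the pattern described above, here is a minimal C sketch. The names helper, big_frame, recurse and MAX_DEPTH are hypothetical and not taken from any real codebase, and the recursion is deliberately bounded so that the sketch is safe to run; in the problematic case the depth would be attacker-controlled and unbounded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bound, present only so this sketch cannot overflow the stack. */
#define MAX_DEPTH 1000

/* Hypothetical callee; stands in for any function called before the
 * large buffer is touched. */
__attribute__((noinline)) static size_t helper(size_t n) { return n + 1; }

/* A function with a large frame: the 128 KiB buffer is not written
 * before the call to helper(). Without stack probing, the prologue
 * typically moves the stack pointer down by the whole frame in one
 * step, so the first stack write made on behalf of this frame (by the
 * call below) lands near the new, much lower stack pointer. */
__attribute__((noinline)) static size_t big_frame(size_t n) {
    volatile char buf[128 * 1024];
    size_t r = helper(n);   /* first stack access happens far below the old SP */
    buf[0] = (char)r;       /* the buffer itself is only touched afterwards */
    return r + (size_t)buf[0];
}

/* Recursion that ends in the large-frame function. */
size_t recurse(size_t depth) {
    assert(depth <= MAX_DEPTH);
    if (depth == 0)
        return big_frame(0);
    return recurse(depth - 1) + 1;
}
```

Building this with and without -fstack-clash-protection and comparing the generated prologue of big_frame should show whether the target backend emits probes for the large frame.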
Reasoning
In my opinion, stack overflows are security issues that can only be mitigated by the compiler, because there is no good way for a programmer to protect against them other than by manually keeping track of recursion depth and available stack memory.
Also, if a programmer wants to guard against exploitable stack overflows in their own recursive code, it is not necessarily enough for them to build their own code with -fstack-clash-protection, because the function at the bottom of the stack that moves the stack pointer across the guard page might be part of a library dependency.
MSVC / Apple defaults
MSVC documentation says:
By default, the compiler generates code that initiates a stack probe when a function requires more than one page of stack space.
Apple’s version of clang also emits __chkstk_darwin calls for functions with large stack buffers.
Complications / Downsides
probe-stack is currently only supported by some backends (X86, PowerPC, SystemZ, AArch64, and RISC-V, I think); other backends would remain unprotected.
Luckily stack probing doesn’t require any runtime helpers, so this should not depend on having a recent libc version or such.
This will create extra code, and runtime cost, in functions with large stack frames; however, in my experience, codebases typically only have a tiny number of such functions.
The worst case in terms of runtime overhead would probably be functions that have on-stack buffers that are many pages big, and either don’t use these buffers at all, or only use a small part of such buffers. (However, that overhead is likely dwarfed by the overhead of automatic stack buffer initialization, if that is enabled and triggers on the buffer.)
Motivating example
For a motivating example, see https://project-zero.issues.chromium.org/issues/465827985, an Android issue where the combination of the following factors makes it possible to escalate privileges from shell context to the more privileged system_server context (though my proof of concept only manages to do this at a low success rate):
- Android compiles code without -fstack-clash-protection.
- Android has an IPC mechanism that supports synchronous calls with synchronous callbacks, and it is possible to infinitely nest such IPC calls to cause unbounded recursion by design.
- Non-main thread stacks on Android are placed in the same virtual address region as heap allocations and shared memory mappings.
- Non-main thread stacks on Android that can run Java code effectively have an 8 KiB or 16 KiB guard region at the bottom, depending on how you count.
- Another function that can be called over IPC, including from a nested IPC call context, contains a 128 KiB stack buffer, and performs a function call to another non-leaf function before initializing this buffer.
This makes it possible for a saved instruction pointer value to be spilled into, and loaded back from, a shared memory segment located below the guard page.
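The arithmetic behind the guard-region and buffer-size points can be modeled with a small C sketch. This is a simplified model, not a description of what any particular backend emits: it assumes a downward-growing stack, a single guard region directly below it, and that the first access after entering a frame happens at the new stack pointer. The names classify_landing and classify_probed, and the probe interval parameter, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum landing { LANDS_IN_STACK, LANDS_IN_GUARD, LANDS_BELOW_GUARD };

/* room:  bytes of usable stack left below the current stack pointer
 * guard: size of the guard region directly below the stack
 * frame: bytes by which the stack pointer is lowered in one step
 * Returns where the new stack pointer (and thus the first access) lands. */
enum landing classify_landing(size_t room, size_t guard, size_t frame) {
    if (frame <= room)
        return LANDS_IN_STACK;
    size_t past = frame - room;  /* distance below the top of the guard */
    return past <= guard ? LANDS_IN_GUARD : LANDS_BELOW_GUARD;
}

/* Same question when the frame is entered in steps of at most
 * `interval` bytes with a touch after every step, which is the effect
 * stack probing aims for. The first touch outside the usable stack
 * decides the outcome. */
enum landing classify_probed(size_t room, size_t guard, size_t frame,
                             size_t interval) {
    assert(interval > 0);
    size_t done = 0;
    while (done < frame) {
        size_t left = frame - done;
        done += left < interval ? left : interval;
        enum landing l = classify_landing(room, guard, done);
        if (l != LANDS_IN_STACK)
            return l;
    }
    return LANDS_IN_STACK;
}
```

With a hypothetical 4 KiB of stack left, a 16 KiB guard region and a 128 KiB frame, classify_landing reports LANDS_BELOW_GUARD, while classify_probed with a 4 KiB interval reports LANDS_IN_GUARD, meaning the access faults instead of silently reaching whatever is mapped below the guard. In this model the guard is always hit as long as the probe interval is no larger than the guard region.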