Adding a stack probe function attribute

Hi, I want to add a stack probe function attribute which would insert stack probes on all platforms, not just Windows. This will be useful for Rust since it must guarantee that the stack can’t overflow, which it currently abuses the segmented stack support for. I’m not sure which kind of attribute is appropriate here. It must be added to the caller when inlined and clients of LLVM should be able to tell if code generation supports it. I would like some tips on how to implement this.

Here’s an implementation of this: http://reviews.llvm.org/D4717
There’s no way to tell if stack probes actually are being generated though…

Giving a bit of background and motivation would be good here. What are you trying to accomplish and why?

Philip

The point of this is to cheaply detect all stack overflows using a guard page. For a guard page to actually detect all stack overflows, we need to ensure that the code touches each page of the stack in the right order, otherwise it could skip the guard page and write outside the stack. That is very bad for languages such as Rust which provides memory safety, so it currently does an explicit comparison against the end of the stack for each function, which is again bad for performance. This would correspond to GCC’s -fstack-check (if that worked).

Thanks for the explanation. I’m used to hearing the term “stack banging” used for this mechanism, but I understand your objective.

I believe having a general mechanism here would be valuable, but only if the implementation doesn’t make assumptions about runtime environment. For example, let’s say my runtime uses a three page guard region and considers anything in that region to be a stack expansion. This should work with your attribute.

There’s also different ways of implementing this. Depending on your runtime, you might want to a) call a function, b) emit some special loads. There’s also numerous optimizations which apply for the later implementation choice.

Worth noting is that stack banging only when the stack size is larger than a page size is NOT sufficient unless you can prove that every smaller frame actually reads or writes to the frame before calling a subroutine.

As an example, consider the following recursive function:
int test(int i) { char buff[50]; if( i == 0 ) return 0; else return test(i-1); }

The arguments will be passed in registers. The buffer won’t be initialized (assuming it’s not compiled away), and you’ll push a stack frame without touching the stack memory.

It would be nice if your attribute could also represent an explicit conditional check implementation as well.

Also, are you expecting the runtime to be able to throw an exception at the site of the check? If so, there’s a bunch of other issues which need handled.

Straw man ideas:

  • Start with an string attribute, work out the semantics and implementation, then propose a “real attribute” once we’ve settled on a workable implementation.
  • Pick a more generic name. Possibly “StackOverflowGuard”?
  • Use two parameters. First, “minimum guard region size” (non-negative integer number of bytes). Second, “test mechanism” (enum (FuncCall, Load, Store, ConditionalCheck)). For the conditional check version, you’d need a way to specify a failure handler. For the func call version, you need a way to set the routine.

Now, I realize this is well beyond what you originally wanted to implement. If you wanted to make the minimum change you could to support forcing the enable of the existing stack probe mechanism, we can discuss specifically that. I’d lean away from a general attribute for that purpose, but am open to being convinced otherwise. :slight_smile:

Philip

Thanks for the explanation. I'm used to hearing the term "stack banging"
used for this mechanism, but I understand your objective.

I believe having a general mechanism here would be valuable, but only if
the implementation doesn't make assumptions about runtime environment. For
example, let's say my runtime uses a three page guard region and considers
anything in that region to be a stack expansion. This should work with
your attribute.

There's also different ways of implementing this. Depending on your
runtime, you might want to a) call a function, b) emit some special loads.
There's also numerous optimizations which apply for the later
implementation choice.

Worth noting is that stack banging only when the stack size is larger than
a page size is NOT sufficient unless you can *prove* that every smaller
frame actually reads or writes to the frame before calling a subroutine.

As an example, consider the following recursive function:
int test(int i) { char buff[50]; if( i == 0 ) return 0; else return
test(i-1); }

The arguments will be passed in registers. The buffer won't be
initialized (assuming it's not compiled away), and you'll push a stack
frame without touching the stack memory.

I don't think that will happen unless there's a separate call stack. Either
the call instruction itself will write to the stack or it will be tail
recursive call and the stack will be reused.

It would be nice if your attribute could also represent an explicit
conditional check implementation as well.

The "split-stack" attribute can be used for that, and that's what Rust
currently does.

Also, are you expecting the runtime to be able to throw an exception at
the site of the check? If so, there's a bunch of other issues which need
handled.

Straw man ideas:
- Start with an string attribute, work out the semantics and
implementation, then propose a "real attribute" once we've settled on a
workable implementation.

I seemed to have read that string attributes were for platform specific
things, would that be incorrect?

- Pick a more generic name. Possibly "StackOverflowGuard"?
- Use two parameters. First, "minimum guard region size" (non-negative
integer number of bytes). Second, "test mechanism" (enum (FuncCall, Load,
Store, ConditionalCheck)). For the conditional check version, you'd need a
way to specify a failure handler. For the func call version, you need a
way to set the routine.

Now, I realize this is well beyond what you originally wanted to
implement. If you wanted to make the minimum change you could to support
forcing the enable of the existing stack probe mechanism, we can discuss
specifically that. I'd lean away from a general attribute for that
purpose, but am open to being convinced otherwise. :slight_smile:

I think this should be limited to only lightweight stack probing, which
either emits probe instructions or calls a function (provided by
libgcc/compiler-rt) which does exactly the same. One extension I'm open for
is to always force a probe which ensures the stack overflow happens in a
well defined point in the prologue. That could be used for languages which
can recover from stack overflows.

Er, yeah. Please ignore my momentary stupidity. You’re entirely correct. Interesting. I did not know this. Using it for that purpose honestly feels like a bit of a hack. I’d rather factor the stack bounds checking and the handling separately. Looking around, it looks like we also have an Erlang specific implementation which does essentially the same thing. Yuck. Not saying we need to actually do so, just that I’d prefer that design in an ideal world. :slight_smile: Generally, they’re either a) platform specific, and b) prototyping something. t.m.k. there is nothing that says a string attribute can’t be used long term, but generally, they don’t seem to be. Given what you said about split-stack, I’m okay with this. The general cleanup is a worthwhile, but separate task. This would need to be carefully specified. Going back to your original proposal. I’ve come around to accepting the idea of adding a stack-probe attribute, but I’ll request you improve the documentation some. In particular, mention the split-stack case, and mention that the called function is platform specific. I still think a better factored design is possibly, but you didn’t create the mess we have, so I can’t really expect you to fix it. :slight_smile: I’ll add a couple of more detailed code comments on the review thread as well.

Would the __probestack functions be a suitable addition to compiler-rt? Does it already have __chkstk or is that provided by something else on Windows? I noticed that libgcc implemented them in cygwin.S.

It seems reasonable to put them in compiler-rt/lib/builtins.

I don't think anyone is currently using compiler-rt's builtins on Windows,
so it doesn't have __chkstk. Typically that is provided by the Cygwin,
MinGW, or MSVC C runtime.

I updated http://reviews.llvm.org/D4717 and also wrote an __probestack implementation: https://github.com/Zoxc/compiler-rt/compare/llvm-mirror:master…stprobe

Which instruction would be the preferable one to probe with? I used OR since that’s what GCC/libgcc does, but I don’t see why that would be better than a write.

Since David Majnemer doesn't seem overly eager to merge my patches,
let's see if we can't figure things out here.

I noticed a string function attribute appeared in LangRef.rst, would
that be the correct place to document this?

For reference:
http://reviews.llvm.org/D9653
http://reviews.llvm.org/D9654
http://reviews.llvm.org/D9858

Yeah, the function attributes section of LangRef is a reasonable place to put stuff like this:
http://llvm.org/docs/LangRef.html#function-attributes

I think we should add this. I also know that LLILAC needs something like this as well. I propose the following:

  • Add a string attribute called “stack-probe-symbol”=“foo”.
  • The presence of this attribute indicates that stack probes should be emitted, even on non-Windows OSs.
  • (future work) For LLILAC, if this attribute is present but the string is empty, this can be a signal that the check must be emitted inline, either as a sequence of stores or a loop.

This also addresses David’s concern with the hardcoded __probestack symbol name.

For what it’s worth, I need stack probes as well. My language usecase has two relevant properties:

  1. The language guarantees total memory safety/sandboxing. Malicious programs are unable to negatively affect the rest of the system beyond consuming resources.

  2. The user of the language is allowed significant expressive power, letting him manipulate the stack.

Stack probes prevent users of a language from skipping over the guard page; it’s the last memory hole to be plugged once the rest of the system is secure. I think in general, any language that has these two properties will require either stack probes or a much more complicated system (like geordi, segmented stacks, or obsessive loading of allocated memory). A great number of languages fall into this situation.

Yeah, the function attributes section of LangRef is a reasonable place to
put stuff like this:
http://llvm.org/docs/LangRef.html#function-attributes

I'll see if I can't sneak something in there.

I think we should add this. I also know that LLILAC needs something like
this as well. I propose the following:
- Add a string attribute called "stack-probe-symbol"="foo".
- The presence of this attribute indicates that stack probes should be
emitted, even on non-Windows OSs.
- (future work) For LLILAC, if this attribute is present but the string is
empty, this can be a signal that the check must be emitted inline, either as
a sequence of stores or a loop.

This also addresses David's concern with the hardcoded __probestack symbol
name.

First of all, LLVM should be free to choose how it does stack probes,
it could call ___chkstk_ms, ___chkstk_ms, __chkstk, _alloca, _chkstk,
__probestack or any other stack probe function it knows about, it
could unroll and inline it for smaller allocation amounts, it could
inline the function entirely or it could do nothing, for platforms
which does stack overflow checks in hardware.

I don't see why hardcoding __probestack is different from every other
hardcoded thing in LLVM. Furthermore since calls to it can be elided
it is not useful for clients to specify their own function, so they
would just point it to whatever the platform stack probing function
would be (replicating the ugly logic in
X86FrameLowering::emitStackProbeCall). If LLVM in the future always
inlined the call, the stack probe function would never be called and
the attribute argument is useless.

The difference between __probestack and __chkstk etc is that we are happy
to call into existing interfaces that are somehow guaranteed by the
environment. Sometimes we do invent our own in compiler-rt for obscure
cases like i128 division, but it's rare. After years of adapting to fit
pre-existing interfaces, we are naturally very cautious to define our own.
Since not everyone uses compiler-rt, I worry about a situation where people
fight over the definition of __probestack, or where users want to override
__probestack to call into their runtime, rather than dealing with signals.

> Yeah, the function attributes section of LangRef is a reasonable place
> to
> put stuff like this:
> http://llvm.org/docs/LangRef.html#function-attributes
I'll see if I can't sneak something in there.

>
> I think we should add this. I also know that LLILAC needs something like
> this as well. I propose the following:
> - Add a string attribute called "stack-probe-symbol"="foo".
> - The presence of this attribute indicates that stack probes should be
> emitted, even on non-Windows OSs.
> - (future work) For LLILAC, if this attribute is present but the string
> is
> empty, this can be a signal that the check must be emitted inline,
> either as
> a sequence of stores or a loop.
>
> This also addresses David's concern with the hardcoded __probestack
> symbol
> name.
First of all, LLVM should be free to choose how it does stack probes,
it could call ___chkstk_ms, ___chkstk_ms, __chkstk, _alloca, _chkstk,
__probestack or any other stack probe function it knows about, it
could unroll and inline it for smaller allocation amounts, it could
inline the function entirely or it could do nothing, for platforms
which does stack overflow checks in hardware.

I don't see why hardcoding __probestack is different from every other
hardcoded thing in LLVM. Furthermore since calls to it can be elided
it is not useful for clients to specify their own function, so they
would just point it to whatever the platform stack probing function
would be (replicating the ugly logic in
X86FrameLowering::emitStackProbeCall). If LLVM in the future always
inlined the call, the stack probe function would never be called and
the attribute argument is useless.

The difference between __probestack and __chkstk etc is that we are happy to
call into existing interfaces that are somehow guaranteed by the
environment. Sometimes we do invent our own in compiler-rt for obscure cases
like i128 division, but it's rare. After years of adapting to fit
pre-existing interfaces, we are naturally very cautious to define our own.

The code does need to go somewhere though.

Since not everyone uses compiler-rt, I worry about a situation where people
fight over the definition of __probestack

Wouldn't this be resolved by defining what __probestack does?

, or where users want to override
__probestack to call into their runtime, rather than dealing with signals.

As I said before, calls to __probestack are not guaranteed to be
emitted, so clients can't rely on it doing anything other than probing
the stack. Also clients must always deal with guard page faults. Those
will usually happen outside of __probestack, since functions with
large stack frames are rare.

Rust is going ahead and removing it's abuse of segmented stack support
for stack overflow checking
(https://github.com/rust-lang/rust/pull/27338). This leaves it without
foolproof stack overflow checking. So it would be nice if this could
land soonish.

I suggest the following:

Add a "probe-stack" attribute (which can become a proper attribute
named probestack later), leaving the implementation of stack probing
up to LLVM.

Implement stack probing for x86 on non-Windows platforms with a call
to __probestack, with __probestack being added to compiler-rt.

Implement stack probing for ARM using the same code as Windows only
with __chkstk renamed to __probestack.

Other backends should use __probestack as the name for the stack
probing function if needed.

If there turns out to be an issue with the stack probe function name,
add a "stack-probe-function" attribute later (which goes along with
"stack-probe-size") allowing users to use their own name for the stack
probe function.

I started to implement inlining of the stack probe function based on
Microsoft's inlined stack probes in
https://github.com/Microsoft/llvm/tree/MS.

Do we know why the stack pointer cannot be updated in a loop (which
results in ideal code)? I noticed that was commented in Microsoft's
code.
I suspect this is due to debug or unwinding information, since it is
allowed on Windows x86-32.

I ran into two issues while implementing this.

The epilog was inserted into the wrong basic block. This is because
the basic blocks to insert epilogs into is calculated before
emitPrologue is called.
I fixed this by adding the following code after the emitPrologue call
in PEI::insertPrologEpilogCode:

  RestoreBlocks.clear();
  calculateSets(Fn);

This doesn't seem like a very nice solution. It's also unclear to me
how Microsoft's branch handles this.

The other issue is with code generation. All the tests passed but the
one were segmented stacks are used and stack probes are required.
I don't see what is actually wrong here, and would like some help.
Here is the output:

# After Prologue/Epilogue Insertion & Frame Finalization
# Machine code for function test_large: Post SSA
Frame Objects:
  fi#0: size=40000, align=4, at location [SP-40000]

BB#4:
        %R11<def> = LEA64r %RSP, 1, %noreg, -40040, %noreg
        CMP64rm %R11, %noreg, 1, %noreg, 40, %GS, %EFLAGS<imp-def>
        JA_1 <BB#0>, %EFLAGS<imp-use>
    Successors according to CFG: BB#3 BB#0

BB#3:
    Predecessors according to CFG: BB#4
        %R10<def> = MOV64ri 40040
        %R11<def> = MOV64ri 32
        CALL64pcrel32 <es:__morestack>, %RSP<imp-use>
        MORESTACK_RET
    Successors according to CFG: BB#0

BB#0: derived from LLVM BB %0
    Predecessors according to CFG: BB#3 BB#4
        %EAX<def> = MOV32ri 40040; flags: FrameSetup
        %RDX<def> = MOV64rr %RAX; flags: FrameSetup
        %RCX<def> = MOV64rr %RSP; flags: FrameSetup
    Successors according to CFG: BB#1

BB#1: derived from LLVM BB %0
    Predecessors according to CFG: BB#0 BB#1
        OR64mi8 %RCX, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>;
flags: FrameSetup
        %RCX<def,tied1> = SUB64ri32 %RCX<tied0>, 4096,
%EFLAGS<imp-def>; flags: FrameSetup
        %RDX<def,tied1> = SUB64ri32 %RDX<tied0>, 4096,
%EFLAGS<imp-def>; flags: FrameSetup
        JAE_1 <BB#1>, %EFLAGS<imp-use>; flags: FrameSetup
    Successors according to CFG: BB#2 BB#1

BB#2: derived from LLVM BB %0
    Predecessors according to CFG: BB#1
        %RSP<def,tied1> = SUB64rr %RSP<tied0>, %RAX, %EFLAGS<imp-def>;
flags: FrameSetup
        SEH_StackAlloc 40040; flags: FrameSetup
        SEH_EndPrologue; flags: FrameSetup
        %RCX<def> = LEA64r %RSP, 1, %noreg, 40, %noreg
        %EDX<def> = MOV32r0 %EFLAGS<imp-def,dead>
        CALL64pcrel32 <ga:@dummy_use>, <regmask>, %RSP<imp-use>,
%RCX<imp-use>, %EDX<imp-use,kill>, %RSP<imp-def>
        SEH_Epilogue
        %RSP<def,tied1> = ADD64ri32 %RSP<tied0>, 40040, %EFLAGS<imp-def,dead>
        RETQ

# End machine code for function test_large.

*** Bad machine code: Using an undefined physical register ***
- function: test_large
- basic block: BB#1 (0x40b1d70)
- instruction: OR64mi8- operand 0: %RCX

*** Bad machine code: Using an undefined physical register ***
- function: test_large
- basic block: BB#1 (0x40b1d70)
- instruction: %RCX<def,tied1> = SUB64ri32- operand 1: %RCX<tied0>

*** Bad machine code: Using an undefined physical register ***
- function: test_large
- basic block: BB#1 (0x40b1d70)
- instruction: %RDX<def,tied1> = SUB64ri32- operand 1: %RDX<tied0>

*** Bad machine code: Using an undefined physical register ***
- function: test_large
- basic block: BB#2 (0x40b1e20)
- instruction: %RSP<def,tied1> = SUB64rr- operand 2: %RAX
LLVM ERROR: Found 4 machine code errors.

The stack pointer can't ever safely point beyond the valid range. I don't 100% know the exact reason but suspect the OS may verify this on context save/restore, which can happen unpredictably. This can be tricky to guarantee if you're modifying the stack pointer in flight without some extra math here and there, at which point you might as well use another register and defer the update like we do.

I'm also not sure what you mean by updating the stack pointer being "allowed" for x86-32 -- if you look at the implementation of __chkstk for x86-32, which we provide in sources with VS, it does not modify ESP in the loop -- it uses EAX to probe the stack.

[from] c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\crt\src\intel\chkstk.asm

        mov eax, esp ; current TOS
        and eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary

cs10:
        cmp ecx, eax ; Is new TOS
        jb short cs20 ; in probed page?
        mov eax, ecx ; yes.
        pop ecx
        xchg esp, eax ; update esp
        mov eax, dword ptr [eax] ; get return address
        mov dword ptr [esp], eax ; and put it at new TOS
        ret

; Find next lower page and probe
cs20:
        sub eax, _PAGESIZE_ ; decrease by PAGESIZE
        test dword ptr [eax],eax ; probe page.
        jmp short cs10

Another common pattern which we haven't optimized for yet is to probe and zero the newly grown stack region.

We have the same issue with epilogs that you saw. We've worked around it for now by generating epilogs first, but this seems a bit hacky. I'd like to see us fix this by delaying the inline expansion of a stack probe until after prolog/epilog generation. That also removes the need for us to update the prolog block in ::emitProlog, which is a bit clunky. But we haven't gotten around to this yet.

I've submited my patch to inline stack probes here:
http://reviews.llvm.org/D12483

The stack pointer can't ever safely point beyond the valid range. I don't 100% know the exact reason but suspect the OS may verify this on context save/restore, which can happen unpredictably. This can be tricky to guarantee if you're modifying the stack pointer in flight without some extra math here and there, at which point you might as well use another register and defer the update like we do.

I'm also not sure what you mean by updating the stack pointer being "allowed" for x86-32 -- if you look at the implementation of __chkstk for x86-32, which we provide in sources with VS, it does not modify ESP in the loop -- it uses EAX to probe the stack.

I was thinking of the fact that __chkstk modified SP on x86-32, but
not on x86-64.

We have the same issue with epilogs that you saw. We've worked around it for now by generating epilogs first, but this seems a bit hacky. I'd like to see us fix this by delaying the inline expansion of a stack probe until after prolog/epilog generation. That also removes the need for us to update the prolog block in ::emitProlog, which is a bit clunky. But we haven't gotten around to this yet.

Hm.. I did try just generating epilogs first, but I ran into test failures.