[RFC] Using Intel MPX to harden SafeStack

Hi,

I previously posted about using 32-bit X86 segmentation to harden SafeStack: http://lists.llvm.org/pipermail/llvm-dev/2016-May/100346.html That approach lowers the limits of the DS and ES segments, which are used for ordinary data accesses, while leaving the limit for SS, the stack segment, at its maximum value. The safe stacks are clustered above the limits of DS and ES. Thus, by directing individual memory operands to either DS/ES or SS, stray pointer writes that could otherwise corrupt a safe stack are blocked by the segmentation checks. My proposed compiler modifications inspect memory operands to determine whether the compiler (or, more specifically, the SafeStack pass) intends for them to be allowed to access the safe stack, and insert segment override prefixes and related instructions as necessary.

I submitted patches today to implement an analogous idea in 64-bit mode using Intel MPX. MPX can be used to enforce both fine-grained per-object bounds and coarse-grained bounds. My patches use it for the latter purpose, so they make no use of the table-related instructions in MPX. The runtime library [1] simply initializes one bounds register, BND0, to have an upper bound that is set below all safe stacks and above all ordinary data. A pre-isel patch instruments stores that are not authorized to access the safe stack by preceding each such instruction with a BNDCU instruction, which checks that the following store accesses memory entirely below the upper bound in BND0 [2]. Loads are not instrumented, since the purpose of the checks is only to help prevent corruption of the safe stacks. Authorized safe stack accesses are not instrumented, since the SafeStack pass is responsible for verifying that such accesses do not corrupt the safe stack. The default handler is used when a bound check fails, which results in the program being terminated on the systems where I have performed tests.
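
For concreteness, here is a sketch of the pattern the instrumentation produces around an unauthorized store. This is only an illustration in C++ inline asm (the patches actually operate on MachineInstrs, and the wrapper function is hypothetical); it requires an assembler that knows the MPX mnemonics:

// BNDCU raises #BR if the effective address of its memory operand is
// above the upper bound in BND0; it does not itself access memory.
static inline void guarded_store(int *p, int v) {
  asm volatile("bndcu (%0), %%bnd0\n\t"  // check p against BND0.UB
               "movl %1, (%0)"           // the store being guarded
               :
               : "r"(p), "r"(v)
               : "memory");
}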

To reduce the performance and size overhead from instrumenting the code, both the pre-isel patch and a pre-emit patch elide various checks [2, 3]. The pre-isel patch uses techniques derived from the BoundsChecking pass to statically verify that some stores are safe, so that the checks for those stores can be elided. The pre-emit patch compares the bound checks in each basic block and combines those that are redundant. The contents of BND0 are static, so a successful check of a higher address implies that any check of a lower address will also succeed. Thus, if a check of a higher address precedes a check of a lower address in a basic block, the latter check can be erased. Conversely, if a check of a lower address precedes a check of a higher address in a basic block, the latter check can still be erased, but the higher address must then be used in the remaining check. However, my pass can only compare certain addresses statically, which limits the checks that can be combined. For example, if two addresses use the same base register, index register, and scale, and differ only in a simple displacement, then my pass may be able to compare them. However, if either the base or the index register is redefined by an instruction between the two checks, then my pass is currently unable to compare the two addresses. Incidentally, the pre-emit pass uses the getAddressFromInstr routine, which needs to be patched to properly handle certain global variable addresses [7]. The pre-emit pass also erases checks for addresses that specify neither a base nor an index register, as well as those that specify a RIP-relative offset with no index register. I think that the source code would need to be quite malformed to corrupt safe stacks using such address types. Additional optimizations may be possible in the future, such as hoisting checks out of loops or otherwise performing inter-basic-block analysis to identify additional redundant checks.
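
As a rough model of the comparison the combiner performs, consider the following simplified sketch (the struct and function are hypothetical; the real pass works on MachineOperands and also tracks register redefinitions between the two checks):

#include <cstdint>
#include <optional>

// Two x86 addresses are statically comparable here only when they share
// base register, index register, and scale, and differ solely in their
// constant displacement.
struct X86Address {
  unsigned Base;   // base register (0 if none)
  unsigned Index;  // index register (0 if none)
  unsigned Scale;  // 1, 2, 4, or 8
  int64_t Disp;    // constant displacement
};

// Returns the address whose check subsumes the other (the higher one),
// or std::nullopt if the two cannot be compared statically.
std::optional<X86Address> combineChecks(const X86Address &A,
                                        const X86Address &B) {
  if (A.Base != B.Base || A.Index != B.Index || A.Scale != B.Scale)
    return std::nullopt;
  return A.Disp >= B.Disp ? A : B;  // keep only the higher-address check
}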

The pre-emit pass also erases bound checks for accesses relative to a non-default segment, such as thread-local accesses relative to FS. Linear addresses for thread-local accesses are computed with a non-zero segment base address, so it would be necessary to check thread-local effective addresses against a bounds register whose upper bound is adjusted down to account for that, rather than against the bounds register used for checking other accesses. Furthermore, negative offsets are sometimes used for thread-local accesses, and those are treated as very large unsigned effective addresses; checking them would require first adding them to the base of the thread-local storage segment.
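
The arithmetic problem can be seen with a small worked example (the FS base value here is made up):

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t fs_base = 0x7f0000000000;  // assumed TLS segment base
  int64_t tls_off = -0x10;            // e.g. an access to %fs:-0x10
  uint64_t ea = (uint64_t)tls_off;    // 0xfffffffffffffff0
  uint64_t linear = fs_base + ea;     // the address actually accessed
  // BNDCU would see only `ea`, which exceeds any plausible bound even
  // though `linear` may be perfectly ordinary data; hence these checks
  // are erased rather than emitted against the wrong quantity.
  printf("ea = %#llx, linear = %#llx\n",
         (unsigned long long)ea, (unsigned long long)linear);
  return 0;
}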

Developers can use the -mseparate-stack-seg flag to enable instrumentation of functions that have the SafeStack attribute [4, 6]. That flag also causes the runtime library to be linked [5].
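
For example, assuming the driver wiring from [4] and [6], an invocation might look like this (hypothetical command line):

  clang -fsanitize=safe-stack -mseparate-stack-seg -o prog prog.c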

Because BND0 is treated as per-thread state, the runtime library picks a single initial BND0 upper bound when the program starts; it is arbitrarily set 256MiB below the base of the initial (safe) stack. If and when that 256MiB space becomes overfilled by safe stacks, the program will crash due to a failing CHECK_GE statement in the runtime library. Without this check, an adversary might be able to modify the runtime-library variable that records the address of the most recently allocated safe stack and thereby cause safe stacks to be allocated at vulnerable locations. An alternative approach that avoids this limitation could be to store that variable above the bound checked by the instrumented code. This could help prevent adversaries from forcing safe stacks to be allocated at vulnerable locations while still allowing the program to keep running even when its safe stacks protrude below the bound. Of course, the protruding portions of the safe stacks would be vulnerable. Another alternative could be to treat the MPX bounds registers as per-program state rather than per-thread state; BND0 could then be adjusted downwards as necessary when new safe stacks are allocated.
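
That policy can be condensed into the following sketch (identifier names are mine; the real code is in [1]):

#include <cassert>
#include <cstdint>

constexpr uint64_t kGap = 256ULL << 20;  // 256MiB below the initial stack

struct Layout {
  uint64_t bnd0_upper;  // instrumented stores must stay below this
  uint64_t next_top;    // top of the next safe stack to hand out
};

Layout init(uint64_t initial_stack_base) {
  return {initial_stack_base - kGap, initial_stack_base};
}

uint64_t alloc_safe_stack(Layout &l, uint64_t size) {
  uint64_t base = l.next_top - size;
  // Analogue of the runtime's CHECK_GE: refuse to let a safe stack
  // protrude below the bound, where instrumented stores could reach it.
  assert(base >= l.bnd0_upper);
  l.next_top = base;
  return base;
}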

The runtime library currently ignores all attribute settings passed to pthread_create and instead allocates a safe stack itself at a high address. Furthermore, the runtime library currently does not support expansion of the safe stacks, nor does it free the safe stacks that it allocates.

If the BND0 upper bound happens to fall below some ordinary data, then attempted accesses to that data by instrumented stores will violate the associated bound checks.

The runtime library checks for MPX support when it initializes, and it falls back to the default (ASLR-based) safe stack protections if MPX is unavailable. However, (inactive) bound check instructions in the program may still impose size and performance overheads.
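
For reference, such a probe can be sketched as follows (my sketch, not necessarily the runtime's exact logic; assumes a GCC/Clang-style cpuid.h, and _xgetbv may require -mxsave):

#include <cstdint>
#include <cpuid.h>      // __get_cpuid, __get_cpuid_count
#include <immintrin.h>  // _xgetbv

// MPX is advertised in CPUID.(EAX=7,ECX=0):EBX bit 14, and the OS must
// have enabled the bound-register and bound-config state in XCR0
// (bits 3 and 4), which is only readable when OSXSAVE is set.
bool mpx_usable() {
  unsigned a, b, c, d;
  if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
    return false;  // no OSXSAVE, so XCR0 cannot be queried
  if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(b & (1u << 14)))
    return false;  // no MPX
  uint64_t xcr0 = _xgetbv(0);
  return (xcr0 & 0x18) == 0x18;  // BNDREGS and BNDCSR enabled
}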

There could conceivably be a situation in which an instrumented program passes a function pointer for an instrumented callback to an uninstrumented library. If that library allocates an object on the (safe) stack and passes its pointer to the callback, then a bound check violation could result. This is due to an assumption in the pass that instruments the code: it assumes that all pointer arguments, except those with the byval or readnone attributes, point to the unsafe stack. I think this is a valid assumption when all stack frames correspond to instrumented functions, but it does not hold in the scenario described above. Such bound check violations can be avoided by not instrumenting such callbacks, as well as any functions to which they pass pointers to allocations on the safe stack.
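
To make the scenario concrete, here is a hypothetical pair of translation units (all names invented) that could trip a check:

// library.cpp -- built WITHOUT the instrumentation
void lib_fill(void (*cb)(char *, unsigned)) {
  char tmp[64];  // on the library's stack frame, i.e. on the safe stack
  cb(tmp, sizeof tmp);
}

// app.cpp -- built WITH the instrumentation
void fill_cb(char *buf, unsigned len) {
  for (unsigned i = 0; i < len; ++i)
    buf[i] = 0;  // instrumented store: the pass assumes buf points to the
                 // unsafe stack, but it actually points into a safe stack,
                 // so the preceding BNDCU faults
}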

Comments appreciated.

Thanks,
Michael

[1] [safestack] Add runtime support for MPX-based hardening: https://reviews.llvm.org/D29657
[2] [X86] Add X86SafeStackBoundsChecking pass: https://reviews.llvm.org/D29649
[3] [X86] Add X86SafeStackBoundsCheckingCombiner pass: https://reviews.llvm.org/D29652
[4] [X86] Add -mseparate-stack-seg: https://reviews.llvm.org/D17092
[5] [X86] Link safestacksepseg runtime: https://reviews.llvm.org/D29655
[6] [X86] Add separate-stack-seg feature: https://reviews.llvm.org/D29646
[7] [x86] Fix getAddressFromInstr: https://reviews.llvm.org/D27169

(explicitly CC-ing more folks, just in case)

> I submitted patches today to implement an analogous idea in 64-bit mode
> using Intel MPX. [...] My patches use it for the latter purpose, so they
> make no use of the table-related instructions in MPX.

That's a relief :)

> The runtime library [1] simply initializes one bounds register, BND0, to
> have an upper bound that is set below all safe stacks and above all
> ordinary data.

So you enforce that safe stacks and other data are not intermixed, as you
explain below. What are the downsides? Performance? Compatibility?

> A pre-isel patch instruments stores that are not authorized to access the
> safe stack by preceding each such instruction with a BNDCU instruction.

My understanding is that BNDCU is about the cheapest possible instruction,
just like XOR or ADD, so the overhead should be relatively small. Still, my
guesstimate would be >= 5%, since stores are very numerous, and such
overhead will be on top of whatever overhead SafeStack already has. Do you
have any measurements to share?

> To reduce the performance and size overhead from instrumenting the code,
> both the pre-isel patch and a pre-emit patch elide various checks [2, 3].
> [...] However, my pass can only compare certain addresses statically,
> which limits the checks that can be combined.

The usual question in such a situation: how do we verify that the
optimizations are not too optimistic? If we remove a check that is not in
fact redundant, we will never know, until clever folks use it for an
exploit (and maybe not even then).

Thanks!
--kcc

> What are the downsides? Performance? Compatibility?

I think the main downside is that only a limited number of threads can be created before the safe stacks would protrude below the bound. Extending the proposed runtime library to deallocate safe stacks when they are no longer needed may help with this. The safe stacks are also prevented from expanding, since they are allocated contiguously at high addresses.

> Do you have any measurements to share?

I'm working on getting approval to release some benchmark results.

> The usual question in such a situation: how do we verify that the
> optimizations are not too optimistic?

The pre-emit pass is able to verify that some checks are redundant by inspecting the operands used to specify an address. For example, consider the following test for the pre-emit pass:

0: %rax = MOVSX64rr32 killed %edi
1: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x + 4, _
; CHECK: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x + 8, _
2: MOV32mi _, 8, %rax, @x, _, 0
3: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x + 8, _
; CHECK-NOT: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x + 8, _
4: MOV32mi _, 8, killed %rax, @x + 4, _, 0

The pass verifies that the only difference between the memory operands in instructions 1 and 3 is that they use a different offset from the global variable, so they can be combined. The pass also tracks register definitions, so it would know not to combine the checks in this example if there had been an instruction that redefined %rax between instructions 1 and 3.

On the other hand, some of the other optimizations described in my original message, such as erasing checks for addresses that specify neither a base nor an index register, and for accesses relative to a non-default segment, may indeed be optimistic, so I especially welcome feedback on them.

Thanks,
Michael

Here are estimated SPECint_base2006 component runtimes for some relevant test configurations:

Runtime in seconds: [table missing from the archived message]

FTR: https://peerj.com/preprints/2863/ seems to be related/similar to this patch

The correct link is of course http://dl.acm.org/citation.cfm?id=2991089