TL;DR
We lack good tools to verify correct code generation for security hardening
features. Most security hardening features are covered by only a small number of
regression tests. Simply running large amounts of code compiled with security
hardening features doesn't test them well: it checks whether the program still
produces the expected output for a given input, but not whether the binary has
actually become harder to exploit maliciously.
This RFC proposes building a static binary analyzer that can scan binaries to
verify that a given hardening feature has been applied correctly across the
whole binary. I've built a prototype on top of BOLT and propose to improve it
enough to be able to upstream it. I will need help to do so successfully.
The rising importance of security hardening in toolchains
Over the past 10-15 years, many security hardening features that alter the
sequences of generated assembly code were added to compilers. The first such
hardening feature was probably stack canaries (enabled by `-fstack-protector`).
Since then, many other security hardening features have been introduced and
implemented.
A small sample are the ones that are recommended by OpenSSF for most C and C++
builds:

- `-U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3`
- `-fstrict-flex-arrays=3`
- `-fstack-clash-protection`
- `-fstack-protector-strong`
- `-mbranch-protection=standard` (enabling pac-ret and bti)
- `-fcf-protection=full` (enabling shstk and ibt)
The world at large relies ever more on these and other security hardening
features in compilers.
Testing the implementation of security hardening features
Our testing and verification of these features is, however, somewhat limited.
For most of these features, testing is limited to a few regression or unit
tests, which check whether the expected assembly sequences are generated for a
handful of test cases. It should therefore be no surprise that gaps are found
from time to time in the implementations of security hardening features.
Most of these security hardening features change the generated binary code in a
way that makes a specific kind of attack harder to mount. For example, stack
canaries (`-fstack-protector`) aim to make stack buffer overflow attacks that
overwrite a return address on the stack harder. They do that by placing a
"canary" variable between the local variables and the return address on the
stack. If an attacker overflows a buffer or array on the stack far enough to
also overwrite the return address, there is a high likelihood the attack will be
detected by observing a changed value in the canary.
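Schematically, on AArch64 a function hardened with `-fstack-protector-strong` loads and checks the canary roughly as follows. This is a sketch; the exact instruction sequence varies per compiler and version:

```asm
f:
        stp     x29, x30, [sp, #-48]!   // save frame pointer and return address
        mov     x29, sp
        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]
        str     x1, [sp, #40]           // canary placed between buffers and saved x30
        // ... function body, possibly overflowing a local buffer ...
        ldr     x1, [sp, #40]           // reload the canary from the stack
        ldr     x2, [x0, #:lo12:__stack_chk_guard]
        cmp     x1, x2
        b.ne    .Lfail                  // canary changed: overflow detected
        ldp     x29, x30, [sp], #48
        ret
.Lfail:
        bl      __stack_chk_fail        // abort the program
```

An overflow that reaches the saved return address must first overwrite the canary slot, which the epilogue comparison then detects.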
The attacks that hardening methods aim to make harder mostly operate at the
binary level. Therefore, to analyze or measure the effectiveness of a hardening
method, the analysis has to happen at the binary level.
There does not seem to be a tool available that can analyze whether a binary
indeed has the properties one would expect after enabling a specific codegen
security hardening method.
Instead, we currently use regression tests to point-check correct hardened code
generation on a small number of tiny code snippets. Sometimes, implementers of
new hardening methods perform a custom, one-off analysis, for example as
described in this blog.
In this RFC, I propose a new BOLT-based tool that can verify that binaries have
the properties implied by specific security hardening features.
What benefits would such a binary analysis tool give?
With such a tool, we could:

- Check correct implementation of hardening features in the compiler.
- Check correct application of hardening features across an entire
  distribution.
- For some mitigations, there are a few specific contexts where they cannot be
  applied. Often this is only known to a handful of implementers working in
  this area. A scanner enables exhaustively searching through a binary
  distribution and enumerating where the gaps are, which in turn helps with
  properly documenting known gaps.
- Add the scanner to compiler CI loops, such as LLVM post-commit CI, to detect
  whether anyone accidentally regressed hardening.
- Integrate the scanner in a compiler fuzzing setup to check that hardening
  keeps being applied correctly when non-default compiler options are enabled
  in different combinations.
- Add the scanner to a distribution build process to verify that there are no
  regressions in applying distro-wide security mitigations between releases.
Why a binary analyzer based on BOLT?
I chose to implement this using BOLT, as it seemed to be the one option that
combines the following features:

- BOLT is used to achieve significant performance improvements in production
  workloads, which suggests it will keep being maintained reasonably well for a
  long time to come.
- BOLT reads binaries and creates control flow graphs for them.
- BOLT works on the MCInst representation of instructions, i.e. it maintains an
  exact 1:1 relationship between the IR it works on and the instructions
  present in the binary.
- BOLT uses a framework that is familiar to compiler writers and to the
  implementers of security mitigations in LLVM: the same developer can
  implement both a mitigation and the associated scanner without switching to a
  completely different framework.
- BOLT can handle most binaries, irrespective of which code generator or
  compiler produced them.
Of course, a BOLT-based scanner can also be used to check the output of other
compilers, not just LLVM-based ones.
Prototype
To explore whether it would be possible to build such a tool, I started building
a prototype to verify the correctness of code generation under two hardening
schemes: pac-ret and stack clash protection. The idea behind building such a
prototype is to evaluate whether it is possible to create a useful binary
analyzer with low enough false positive and false negative rates.
For now, I've called this tool `llvm-bolt-gadget-scanner`, based on the idea
that an attacker typically builds an exploit by combining "gadgets", i.e. pieces
of binary code with interesting properties from an exploitability point of view.
The "gadgets" the llvm-bolt-gadget-scanner looks for are places in binaries that
are not protected as expected after applying a security hardening feature.
Example 1: scanner for pac-ret
The threat model used in the pac-ret hardening scheme is that an attacker has
used a memory vulnerability that enables them to overwrite memory locations. The
hardening scheme makes it harder for such an attacker to build a ROP
attack[1][2] by not storing return addresses to memory as-is. Instead, it uses
Arm pointer authentication to sign return addresses before storing them to
memory, and to authenticate return address values after reading them back from
memory.
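On AArch64, a pac-ret-hardened function typically brackets the save and restore of the return address like this. This is a representative sequence; compilers may emit variants such as combined authenticate-and-return instructions:

```asm
f:
        paciasp                      // sign x30, using SP as the modifier
        stp     x29, x30, [sp, #-16]!
        // ... function body ...
        ldp     x29, x30, [sp], #16  // reload (possibly attacker-overwritten) x30
        autiasp                      // authenticate x30; a forged value will fault
        ret
```

An attacker who overwrites the saved return address in memory cannot produce a correctly signed pointer, so the authentication before `ret` defeats the overwrite.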
Example 2: scanner for stack-clash hardening
In a stack-clash attack, the attacker makes use of a large stack pointer change
to “jump over” stack guard pages and write to non-stack memory regions such as
the heap.
The security hardening feature `-fstack-clash-protection` changes code
generation so that no single stack pointer change is ever more than one page in
size, and there is at least one access to every newly allocated page. It then
relies on the OS always maintaining at least one guard page beyond the extent of
the current stack. As a result, all growth of the stack is visible to the OS,
and it is not possible to jump beyond the end of the stack into heap memory
without the OS detecting it.
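For example, instead of moving the stack pointer down by a large amount in one step, a hardened allocation of 16 KiB might be emitted as page-sized steps, each followed by a probe. This is a sketch assuming 4 KiB pages; actual probe sequences differ per compiler:

```asm
        // allocate 16 KiB, one page at a time
        sub     sp, sp, #4096
        str     xzr, [sp]        // probe: touches the guard page if the stack ran out
        sub     sp, sp, #4096
        str     xzr, [sp]
        sub     sp, sp, #4096
        str     xzr, [sp]
        sub     sp, sp, #4096
        str     xzr, [sp]
```

Because every page is touched in order, growth cannot silently skip over the guard page into adjacent memory.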
Experimental results so far
I implemented prototypes for the above two examples (scanning for gaps in
pac-ret and stack-clash hardening). The main test corpus I used was the set of
libraries under `/usr/lib64` on a Fedora 39 AArch64-linux distribution with
about 3000 packages installed. The total test set contains about 2000 `.a` and
`.so` libraries, totalling about 260 million instructions in about 2 million
functions.
The Fedora 39 distribution enables both pac-ret and stack-clash hardening by
default distro-wide.
Note that BOLT can better analyze binaries when relocations are retained in
libraries. In this experiment, I scanned libraries as shipped by Fedora, i.e.
without the relocations present.
Some insights on generic BOLT AArch64 handling
- For 23% of functions, BOLT could not reconstruct the CFG. Maybe this is
  because, in this experiment, the relocation info was not present in the
  binaries (the default)?
- I had to work around two crashes: one on not handling a specific CFI opcode,
  and one on not handling a particular jump table pattern.
Experience implementing a scanner for pac-ret hardening
The prototype scans for the following property: for every return instruction,
the register containing the return address (typically `x30`)

- either is not written to at all in the function before returning, or
- was last written to by an authenticating instruction (such as `AUTIASP`).
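For example, an epilogue like the following would be reported, because the last write to `x30` before the `ret` is a load from (attacker-corruptible) memory rather than an authenticating instruction:

```asm
        ldp     x29, x30, [sp], #16  // last write to x30 is a load, not AUTIASP
        ret                          // reported as an unprotected return
```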
The analysis uses the BOLT dataflow framework.

- The binary analysis runs more than fast enough from a practical point of
  view: it takes about 10 minutes on a single core to scan all 2000 libraries.
- Implementing this analysis does not require much complexity: about 700 lines
  of code for pac-ret gadget scanning.
- To reduce the false positive rate, BOLT needs to recognize which functions
  are "no-return". It turns out that compilers do not seem to apply pac-ret
  hardening to return instructions that they can see control flow never
  reaches, e.g. after a call to a no-return function.
- Overall, across the roughly 2000 libraries, there are 2.5 million return
  instructions. Of those, 46 thousand are reported as not being properly
  protected by pac-ret hardening. After looking through those reports, the
  major reasons seem to be:
  - Some libraries are written in languages for which the compilers do not
    (yet?) implement pac-ret hardening, such as Rust, Haskell, Go, ...
  - A few C/C++ libraries have build systems that don't propagate the distro
    default build flags, so pac-ret hardening doesn't get enabled in the build.
  - A few reports are on code written in assembly where the implementers chose
    not to implement pac-ret hardening.
- There are still a few false positives. Those seem to be caused by BOLT not
  yet recognizing that `brk` instructions stop regular control flow,
  potentially making paths "no-return".
Based on the above experience, the output of the pac-ret scanner seems useful
and actionable, enabling actions such as:

- Prioritizing, based on the collected data, which toolchains and languages to
  implement pac-ret support for.
- Fixing the build systems of packages not respecting distro-wide defaults.
- Documenting accepted gaps in hardening, so that knowledge becomes more
  accessible.
Experience implementing a scanner for stack clash hardening
The analysis uses the BOLT dataflow framework. It checks, for every time the
stack grows, that:

- it grows by no more than one page, and
- there was at least one access to the top-most page of the stack before it
  grows again.
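For example, the scanner would flag a single unprobed large allocation such as this hypothetical snippet:

```asm
        sub     sp, sp, #65536   // reported: grows the stack by more than one page
        str     x0, [sp, #16]    // first access may land beyond the guard page
```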
The hardest parts of implementing this include:

- Recognizing whether the stack grows by no more than one page. Examples
  include:
  - The stack-pointer-changing instruction uses a register rather than an
    immediate, but the register can be deduced to have a value less than or
    equal to the page size:

    ```asm
    and x1, x1, 65535
    sub sp, sp, x1
    ```

    ```asm
    mov x12, #40000
    sub sp, sp, x12
    ```

  - Recognizing code that aligns the stack pointer:

    ```asm
    sub x9, sp, #0x1d0
    and sp, x9, #0xffffffffffffff80
    ```

- Recognizing stack accesses. Examples include:
  - The stack pointer gets copied to another register and that register is used
    to access the stack:

    ```asm
    mov x0, sp
    str x0, [x29, #8]
    ```

- There are over 1000 AArch64 MCInst instructions that access memory.
  Understanding at which offset and with which size they do so requires a
  change in TableGen.
When testing the prototype implementation, I found that:

- The analysis speed is similar to what is seen for pac-ret scanning: more than
  fast enough.
- The analysis crashes on a few libraries because it uses too much memory. A
  relatively small change to the parallel processing framework in BOLT enables
  removing the biggest memory leak.
- Not all AArch64 load/store instructions are fully supported yet, leading to
  assertion failures on a small number of libraries.
The current version of the prototype reports 39 stack-clash gadgets. Presumably most remaining ones are false positives.
(This last paragraph was edited after originally posting as there was a mistake in the original post on how many stack-clash gadgets were reported by the prototype)
The LLVM stack-clash hardening implementation, which landed recently, introduces
yet another stack manipulation pattern that is not yet recognized by the
prototype implementation.
To check whether the prototype-quality implementation can find true positives, I
built the LLVM test-suite using gcc with stack-clash hardening disabled. The
scanner detects 101 stack clash gadgets. With stack-clash hardening enabled, it
detects 1 gadget, which seems to be a false positive of a similar style to the
3 false positives seen in the Fedora 39 `/usr/lib64` scans.
In summary, the prototype implementation for stack-clash gadget scanning shows
that a production-quality implementation is very likely feasible, and that a
near-zero false positive rate seems achievable.
Conclusions from building and experimenting with a prototype
I believe that the experiment of building a prototype showed that:

- A useful binary scanner can be built.
- The specification of most security hardening schemes should be made more
  rigorous, so that users (and binary scanners) can know more precisely what
  kind of hardening they get.
- It's time to share the idea more widely and consider upstreaming this to
  BOLT, which I'm doing with this RFC.
- We should consider starting to use this scanner in both compiler and
  distro-build continuous integration loops.
The prototype implementation is published at GitHub - kbeyls/llvm-project at bolt-gadget-scanner-prototype.
A webui showing the commits on that branch is available here.
Next steps
This RFC is already pretty long, so I’ll try to keep it short here.
I think the above shows that it is worthwhile to start implementing a good-quality binary analysis tool integrated into upstream BOLT.
I am looking for consensus on doing so, so if you have an opinion, please do
speak up.
If consensus is reached, I’ll also be looking for help with review and
implementation of this tool, so if you’d be interested to help with that, then
please do let me know.
This work will also be presented at EuroLLVM as the keynote on April 10th, and
there will be a round table too.