TL;DR
We lack good tools to verify correct code generation for security hardening
features. Most security hardening features are covered by only a small number of
regression tests. Simply running large amounts of code compiled with security
hardening features doesn't test them well: it checks whether the program still
produces the expected output for a given input, but not whether the binary has
actually become harder to exploit maliciously.
This RFC proposes building a static binary analyzer that can scan binaries to
verify that a given hardening feature has been applied correctly across the
whole binary. I've built a prototype on top of BOLT and propose to improve it
enough to be able to upstream it. I will need help to do so successfully.
The rising importance of security hardening in toolchains
Over the past 10-15 years, many security hardening features that alter the
sequences of generated assembly code were added to compilers. The first such
hardening feature was probably stack canaries (enabled by `-fstack-protector`).
Since then, many other security hardening features have been introduced and
implemented.
A small sample are the ones that are recommended by OpenSSF for most C and C++
builds:

- `-U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3`
- `-fstrict-flex-arrays=3`
- `-fstack-clash-protection`
- `-fstack-protector-strong`
- `-mbranch-protection=standard` (enabling pac-ret and bti)
- `-fcf-protection=full` (enabling shstk and ibt)
The world at large relies ever more on these and other security hardening
features in compilers.
Testing the implementation of security hardening features
Our testing and verification of these features is, however, somewhat limited.
For most of these features, testing is limited to a few regression or unit
tests, which check whether the expected assembly sequences are generated for a
handful of test cases. It should therefore be no surprise that gaps are found
from time to time in the implementations of security hardening features.
Most of these security hardening features change the generated binary code in a
way that makes a specific kind of attack harder to mount. For example, stack
canaries (`-fstack-protector`) aim to make stack buffer overflow attacks that
overwrite a return address on the stack harder. They do that by placing a
"canary" variable between the local variables and the return address on the
stack. If an attacker overflows a buffer or array on the stack far enough to
also overwrite the return address, there is a high likelihood the attack will be
detected by observing a changed value in the canary.
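Schematically, on AArch64 a function hardened with `-fstack-protector-strong` loads and checks the canary roughly as follows. This is a sketch; the exact instruction sequence varies per compiler and version:

```asm
f:
        stp     x29, x30, [sp, #-48]!   // save frame pointer and return address
        mov     x29, sp
        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]
        str     x1, [sp, #40]           // canary placed between buffers and saved x30
        // ... function body, possibly overflowing a local buffer ...
        ldr     x1, [sp, #40]           // reload the canary from the stack
        ldr     x2, [x0, #:lo12:__stack_chk_guard]
        cmp     x1, x2
        b.ne    .Lfail                  // canary changed: overflow detected
        ldp     x29, x30, [sp], #48
        ret
.Lfail:
        bl      __stack_chk_fail        // abort the program
```

An overflow that reaches the saved return address must first overwrite the canary slot, which the epilogue comparison then detects.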
The attacks that hardening methods aim to make harder mostly operate at the
binary level. Therefore, to analyze or measure the effectiveness of a hardening
method, the analysis has to happen at the binary level.
There does not seem to be a tool available that can analyze whether a binary
indeed has the properties one would expect after enabling a specific codegen
security hardening method.
Instead, we currently use regression tests to point-check correct hardened code
generation on a small number of tiny code snippets. Sometimes, implementers of
new hardening methods perform a custom, one-off analysis, for example as
described in this blog.
In this RFC, I propose a new BOLT-based tool that can verify that binaries have
the properties implied by specific security hardening features.
What benefits would such a binary analysis tool give?
With such a tool, we could:

- Check correct implementation of hardening features in the compiler.
- Check correct application of hardening features across an entire
  distribution.
- For some mitigations, there are a few specific contexts where they cannot be
  applied. Often this is only known to a handful of implementers working in
  this area. A scanner enables exhaustively searching through a binary
  distribution and enumerating where the gaps are, which in turn helps with
  properly documenting known gaps.
- Add the scanner to compiler CI loops, such as LLVM post-commit CI, to detect
  whether anyone accidentally regressed hardening.
- Integrate the scanner in a compiler fuzzing setup to check that hardening
  keeps being applied correctly when non-default compiler options are enabled
  in different combinations.
- Add the scanner to a distribution build process to verify that there are no
  regressions in applying distro-wide security mitigations between releases.
Why a binary analyzer based on BOLT?
I chose to implement this using BOLT, as it seemed to be the one option that
combines the following features:

- BOLT is used to achieve significant performance improvements in production
  workloads, which suggests it will keep being maintained reasonably well for a
  long time to come.
- BOLT reads binaries and creates control flow graphs for them.
- BOLT works on the MCInst representation of instructions, i.e. it maintains an
  exact 1:1 relationship between the IR it works on and the instructions
  present in the binary.
- BOLT uses a framework that is familiar to compiler writers and to the
  implementers of security mitigations in LLVM: the same developer can
  implement both a mitigation and the associated scanner without switching to a
  completely different framework.
- BOLT can handle most binaries, irrespective of which code generator or
  compiler produced them.
Of course, a BOLT-based scanner can also be used to check the output of other
compilers, not just LLVM-based ones.
Prototype
To explore whether it would be possible to build such a tool, I started building
a prototype to verify the correctness of code generation under two hardening
schemes: pac-ret and stack clash protection. The idea behind building such a
prototype is to evaluate whether it is possible to create a useful binary
analyzer with low enough false positive and false negative rates.
For now, I've called this tool `llvm-bolt-gadget-scanner`, based on the idea
that an attacker typically builds an exploit by combining "gadgets", i.e. pieces
of binary code with interesting properties from an exploitability point of view.
The "gadgets" the llvm-bolt-gadget-scanner looks for are places in binaries that
are not protected as expected after applying a security hardening feature.
Example 1: scanner for pac-ret
The threat model used in the pac-ret hardening scheme is that an attacker has
used a memory vulnerability that enables them to overwrite memory locations. The
hardening scheme makes it harder for such an attacker to build a ROP
attack[1][2] by not storing return addresses to memory as-is. Instead, it uses
Arm pointer authentication to sign return addresses before storing them to
memory, and to authenticate return address values after reading them back from
memory.
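On AArch64, a pac-ret-hardened function typically brackets the save and restore of the return address like this. This is a representative sequence; compilers may emit variants such as combined authenticate-and-return instructions:

```asm
f:
        paciasp                      // sign x30, using SP as the modifier
        stp     x29, x30, [sp, #-16]!
        // ... function body ...
        ldp     x29, x30, [sp], #16  // reload (possibly attacker-overwritten) x30
        autiasp                      // authenticate x30; a forged value will fault
        ret
```

An attacker who overwrites the saved return address in memory cannot produce a correctly signed pointer, so the authentication before `ret` defeats the overwrite.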
Example 2: scanner for stack-clash hardening
In a stack-clash attack, the attacker makes use of a large stack pointer change
to “jump over” stack guard pages and write to non-stack memory regions such as
the heap.
The security hardening feature `-fstack-clash-protection` changes code
generation so that no single stack pointer change is ever more than one page in
size, and there is at least one access to every newly allocated page. It then
relies on the OS always maintaining at least one guard page beyond the extent of
the current stack. As a result, all growth of the stack is visible to the OS,
and it is not possible to jump beyond the end of the stack into heap memory
without the OS detecting it.
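For example, instead of moving the stack pointer down by a large amount in one step, a hardened allocation of 16 KiB might be emitted as page-sized steps, each followed by a probe. This is a sketch assuming 4 KiB pages; actual probe sequences differ per compiler:

```asm
        // allocate 16 KiB, one page at a time
        sub     sp, sp, #4096
        str     xzr, [sp]        // probe: touches the guard page if the stack ran out
        sub     sp, sp, #4096
        str     xzr, [sp]
        sub     sp, sp, #4096
        str     xzr, [sp]
        sub     sp, sp, #4096
        str     xzr, [sp]
```

Because every page is touched in order, growth cannot silently skip over the guard page into adjacent memory.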
Experimental results so far
I implemented prototypes for the above two examples (scanning for gaps in
pac-ret and stack-clash hardening). The main test corpus I used was the set of
libraries under `/usr/lib64` on a Fedora 39 AArch64-linux distribution with
about 3000 packages installed. The total test set contains about 2000 `.a` and
`.so` libraries, totalling about 260 million instructions in about 2 million
functions.
The Fedora 39 distribution enables both pac-ret and stack-clash hardening by
default distro-wide.
Note that BOLT can better analyze binaries when relocations are retained in
libraries. In this experiment, I scanned libraries as shipped by Fedora, i.e.
without the relocations present.
Some insights on generic BOLT AArch64 handling
- For 23% of functions, BOLT could not reconstruct the CFG. Maybe this is
  because, in this experiment, the relocation info was not present in the
  binaries (the default)?
- I had to work around two crashes: one on not handling a specific CFI opcode,
  and one on not handling a particular jump table pattern.
Experience implementing a scanner for pac-ret hardening
The prototype scans for the following property: for every return instruction,
the register containing the return address (typically `x30`)

- either is not written to at all in the function before returning, or
- was last written to by an authenticating instruction (such as `AUTIASP`).
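For example, an epilogue like the following would be reported, because the last write to `x30` before the `ret` is a load from (attacker-corruptible) memory rather than an authenticating instruction:

```asm
        ldp     x29, x30, [sp], #16  // last write to x30 is a load, not AUTIASP
        ret                          // reported as an unprotected return
```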
The analysis uses the BOLT dataflow framework.

- The binary analysis runs more than fast enough from a practical point of
  view: it takes about 10 minutes on a single core to scan all 2000 libraries.
- Implementing this analysis does not require much complexity: about 700 lines
  of code for pac-ret gadget scanning.
- To reduce the false positive rate, BOLT needs to recognize which functions
  are "no-return". It turns out that compilers do not seem to apply pac-ret
  hardening to return instructions that they can see control flow never
  reaches, e.g. after a call to a no-return function.
- Overall, across the roughly 2000 libraries, there are 2.5 million return
  instructions. Of those, 46 thousand are reported as not being properly
  protected by pac-ret hardening. After looking through those reports, the
  major reasons seem to be:
  - Some libraries are written in languages for which the compilers do not
    (yet?) implement pac-ret hardening, such as Rust, Haskell, Go, ...
  - A few C/C++ libraries have build systems that don't propagate the distro
    default build flags, so pac-ret hardening doesn't get enabled in the build.
  - A few reports are on code written in assembly where the implementers chose
    not to implement pac-ret hardening.
- There are still a few false positives. Those seem to be caused by BOLT not
  yet recognizing that `brk` instructions stop regular control flow,
  potentially making paths "no-return".
Based on the above experience, the output of the pac-ret scanner seems useful
and actionable, enabling actions such as:

- Prioritizing, based on the collected data, which toolchains and languages to
  implement pac-ret support for.
- Fixing the build systems of packages not respecting distro-wide defaults.
- Documenting accepted gaps in hardening, so that knowledge becomes more
  accessible.
Experience implementing a scanner for stack clash hardening
The analysis uses the BOLT dataflow framework. It checks, for every time the
stack grows, that:

- it grows by no more than one page, and
- there was at least one access to the top-most page of the stack before it
  grows again.
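For example, the scanner would flag a single unprobed large allocation such as this hypothetical snippet:

```asm
        sub     sp, sp, #65536   // reported: grows the stack by more than one page
        str     x0, [sp, #16]    // first access may land beyond the guard page
```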
The hardest parts of implementing this include:

- Recognizing whether the stack grows by no more than one page. Examples
  include:
  - The stack-pointer-changing instruction uses a register rather than an
    immediate, but the register can be deduced to have a value less than or
    equal to the page size:

    ```asm
    and x1, x1, 65535
    sub sp, sp, x1
    ```

    ```asm
    mov x12, #40000
    sub sp, sp, x12
    ```

  - Recognizing code that aligns the stack pointer:

    ```asm
    sub x9, sp, #0x1d0
    and sp, x9, #0xffffffffffffff80
    ```

- Recognizing stack accesses. Examples include:
  - The stack pointer gets copied to another register and that register is used
    to access the stack:

    ```asm
    mov x0, sp
    str x0, [x29, #8]
    ```

- There are over 1000 AArch64 MCInst instructions that access memory.
  Understanding at which offset and with which size they do so requires a
  change in TableGen.
When testing the prototype implementation, I found that:

- The analysis speed is similar to what is seen for pac-ret scanning: more than
  fast enough.
- The analysis crashes on a few libraries because it uses too much memory. A
  relatively small change to the parallel processing framework in BOLT enables
  removing the biggest memory leak.
- Not all AArch64 load/store instructions are fully supported yet, leading to
  assertion failures on a small number of libraries.
The current version of the prototype reports 39 stack-clash gadgets. Presumably most remaining ones are false positives.
(This last paragraph was edited after originally posting as there was a mistake in the original post on how many stack-clash gadgets were reported by the prototype)
The LLVM stack-clash hardening implementation, which landed recently, introduces
yet another stack manipulation pattern that is not yet recognized by the
prototype implementation.
To check whether the prototype-quality implementation can find true positives, I
built the LLVM test-suite using gcc with stack-clash hardening disabled. The
scanner detects 101 stack clash gadgets. With stack-clash hardening enabled, it
detects 1 gadget, which seems to be a false positive of a similar style to the
3 false positives seen in the Fedora 39 `/usr/lib64` scans.
In summary, the prototype implementation for stack-clash gadget scanning shows
that a production-quality implementation is very likely feasible, and that a
near-zero false positive rate seems achievable.
Conclusions from building and experimenting with a prototype
I believe that the experiment of building a prototype showed that:

- A useful binary scanner can be built.
- The specification of most security hardening schemes should be made more
  rigorous, so that users (and binary scanners) can know more precisely what
  kind of hardening they get.
- It's time to share the idea more widely and consider upstreaming this to
  BOLT, which I'm doing with this RFC.
- We should consider starting to use this scanner in both compiler and
  distro-build continuous integration loops.
The prototype implementation is published at GitHub - kbeyls/llvm-project at bolt-gadget-scanner-prototype.
A webui showing the commits on that branch is available here.
Next steps
This RFC is already pretty long, so I’ll try to keep it short here.
I think the above shows that it is worthwhile to start implementing a good-quality binary analysis tool integrated into upstream BOLT.
I am looking for consensus on doing so, so if you have an opinion, please do
speak up.
If consensus is reached, I’ll also be looking for help with review and
implementation of this tool, so if you’d be interested to help with that, then
please do let me know.
This work will also be presented at EuroLLVM as the keynote on April 10th, and
there will be a round table too.