[RFC] Simple control-flow integrity

Hi everyone,

I've been working on control-flow integrity (CFI) techniques on top of
LLVM, and I'd like to get feedback on these techniques and their
potential usefulness as part of LLVM. I'd like to submit some
patches for this; I've implemented a version of it, and I've applied
it to large, real-world programs like Chromium to see how well it
holds up in practice.

TL;DR: my CFI pass builds jump tables and adds a fast check to
indirect calls; values that fail the check are passed to a function
defined at compile time. I have added an analysis pass and
source-level annotations to help deal with the problem of external
function pointers.

Details:

My current implementation works as a pass over a single module
consisting of all the code that the compiler has at LTO time. At the
IR level, this pass:

    1. creates a power-of-two sized InlineAsm jump table (or multiple
jump tables) filled with jump instructions to each address-taken
function.

    2. replaces each such address-taken function with a pointer to the
corresponding location in the appropriate table. Note that these will
be valid function pointers for the purposes of external code.

    3. adds a fast check for pointer safety at each indirect call site:

         a. It forces the pointer into the appropriate table (based on
type information), and checks to see if the pointer changed. Pointers
that were already in the right table will not change, and all other
pointers will. We rewrite the pointer either by masking and adding to
a base pointer (if we can guarantee sufficient table alignment), or
otherwise by subtracting from a base, then masking, then adding back
to the base.

         b. If the pointer fails the check, it's passed to a CFI
failure function defined at compile time to handle it. By default, we
define a function written in IR; this function prints out the name of
the function in which the CFI violation happens.

The biggest challenge for such an implementation is functions that are
neither declared nor defined at LTO time. Calls through pointers to
these functions are false positives for the CFI check. They can occur
in at least three ways:

    - JIT code, like in the v8 javascript engine, can allocate and
call functions that were not defined at compile time. These functions
are not even external: they just didn’t exist at LTO time.

    - External functions can return pointers to external functions
that were not exposed at LTO time. The canonical example in this class
is dlsym, which is used extensively by many projects. Other common
cases are signal/sigaction (which return the old signal handler),
XSetErrorHandler from X, and std::set_new_handler from the C++
standard library. But this happens with any dynamically linked library
that has a function that returns function pointers.

    - Internal functions that take function-pointer arguments can be
passed to external code, which may then pass external function
pointers back to them as arguments. This pattern is used extensively
by graphics libraries, e.g., GTK.

I have some techniques that help handle these false positives:

    - Since CFI violations are passed to an arbitrary function, the
policy for these violations can be set at compile time. For example,
you could run the rewritten code for a while to build up a set of
known false positives, then switch to a CFI failure function that
stopped when it saw something not allowed by the policy. This is
similar to the approach taken by, e.g., AppArmor.

    - my current CFI pass looks for special annotations added to the
source code: these are of the form
__attribute__((annotate("cfi-maybe-external"))) and
__attribute__((annotate("cfi-no-rewrite")))

         - cfi-maybe-external can be applied to pointers and variables
(llvm.ptr.annotation and llvm.var.annotation) and means that this
value sometimes stores external function pointers.

         - cfi-no-rewrite is applied to functions and means that there
are indirect calls in this function that can happen with external
function pointers. The current implementation skips rewriting for
these functions, but it could instead be used to prepopulate a list of
known potential false positives.

    - I have a separate analysis pass called ExternalFunctionAnalysis
that does a fairly naive interprocedural dataflow analysis starting
from cfi-maybe-external annotations and from all places where it can
find external function pointers coming in to the module:

          - if an external function pointer flows into a store that
doesn't flow from an annotated location, then the pass prints a
warning

          - all indirect call sites that flow from annotated
pointers/variables are not rewritten (but this could instead be used
to prepopulate a whitelist of known false positives).

As I mentioned, I've used my current implementation to build a version
of Chromium protected with this form of CFI; in the process, I added
sufficient annotations to the Chromium code base to catch all false
positives (or at least: I haven't seen any in my testing so far). I've
also tried it out with other, less immense, projects, like the SPEC
CPU2006 benchmark suite.

Please let me know what you think.

Thanks,

Tom

Hi Tom,

The PNaCl team is very interested in your work. We see three
applications for our purposes:
- Augment the safety of our trusted code base, which includes OS
shims as well as NaCl validators.
- Apply extra CFI to the PNaCl translator. The PNaCl translator
already runs inside a NaCl SFI sandbox (which includes CFI), but this
sandboxing only ensures that the code doesn't escape the sandbox; it
doesn't guarantee that the translator is doing the right thing. Adding
your CFI on top is nice because it captures some semantic information
from the original LLVM source, and makes it that much harder to
exploit a bug in LLVM to generate malicious code (which still needs to
pass NaCl validation, so it's an extra defense-in-depth layer).
- Optionally apply your CFI to user applications that opt in.

Some of the concerns you express won't affect our platform, since we
link everything statically and avoid passing raw pointers when
possible.

A few questions:
- Have you tried running your CFI on LLVM itself? Did you need to add
any annotations?
- What is the performance and size hit on different applications?

JF

One thing to note is that this does not protect return instructions the way traditional control-flow integrity does. One consequence is that applying this transformation imposes only a very modest run-time overhead on the protected application, at the cost of not defending against, for example, buffer overflows on the stack that modify the saved instruction pointer.

Steve

Hi Tom,

A few questions:
- Have you tried running your CFI on LLVM itself? Did you need to add
any annotations?

Actually, I haven't. That will depend on me being able to compile LLVM
under LTO. I'll give it a try.

- What is the performance and size hit on different applications?

The overhead varies a bit, but performance has generally been within a
few percent of a version compiled with LLVM LTO. For example, a
version of Chromium M31 had about a 4% performance overhead running
the dromaeo.com benchmark. The size hit mostly depends on the number
of functions and call sites; e.g., on x86-64, each function entry in
the table takes up 8 bytes, and each rewritten indirect call
instruction takes up 35 extra bytes for the pointer rewriting and the
branch and call in the case of a violation.

    1. creates a power-of-two sized InlineAsm jump table (or multiple
jump tables) filled with jump instructions to each address-taken
function.

Why inline asm? There's probably a better way to do this via lowering
your jump table in the backend etc.

-eric

Tom, this sounds awesome. I’m imagining a wonderful world of CFI-hardened browsers.

I'd have to look more at what he's doing, but wouldn't a simple switch
statement in IR suffice? Efficiency would be up to the various
lowering mechanisms, but it wouldn't require inline asm.

-eric

Another option might be to create an array of function pointers in the LLVM IR, i.e., generate code that looks like:

void (*jumptable[])() = {
  &a,
  &b
};

void f(int index) {
  (*jumptable[index])();
}

*nod* That's the sort of thing I was thinking about too.

-eric

Why not use a bloom filter for valid target addresses instead?

Joerg

>> IIRC this came up before, and I don't think we expose anything like a
jump
>> table at the IR level. As an IR-to-IR transform, I think asm is the
only
>> way to do it.
>
> I'd have to look more at what he's doing, but wouldn't a simple switch
> statement in IR suffice? Efficiency would be up to the various
> lowering mechanisms, but it wouldn't require inline asm.
>
> -eric
Another option might be to create an array of function pointers in the
LLVM IR, i.e generate code that looks like:

void (*jumptable[])() = {
  &a,
  &b
};

void f(int index) {
  (*jumptable[index])();
}

This isn't ABI compatible. Now function pointers point to data (or are
switch table indices) and not code.

I can imagine abusing indirectbr across separate functions, but rolling
your own jump table in asm seems better.

>> IIRC this came up before, and I don't think we expose anything like a
>> jump
>> table at the IR level. As an IR-to-IR transform, I think asm is the
>> only
>> way to do it.
>
> I'd have to look more at what he's doing, but wouldn't a simple switch
> statement in IR suffice? Efficiency would be up to the various
> lowering mechanisms, but it wouldn't require inline asm.

I'm not sure I follow how this would work. Could you expand on this, please?

>
> -eric
Another option might be to create an array of function pointers in the
LLVM IR, i.e generate code that looks like:

void (*jumptable[])() = {
  &a,
  &b
};

void f(int index) {
  (*jumptable[index])();
}

This isn't ABI compatible. Now function pointers point to data (or are
switch table indices) and not code.

My first attempt at a solution did it just like this, with data
instead of pointers, but that opens several large cans of worms. As
mentioned, the ABI is a major problem, and to maintain compatibility,
you have to provide a complete escape analysis for all address-taken
functions, since they have to be transformed before they are passed to
external code. And, e.g., the IR translation of vtable calls actually
uses a more complicated representation of function pointers, IIRC.

My version based on data instead of pointers worked well for simple
code and toy examples but it became extremely complicated and fragile
once I tried to apply it to real code. I finally realized that any
transformation that depends for its soundness on a complete solution
to an alias analysis problem is a non-starter for large, complicated
programs; conservative solutions weaken the protections provided by
the transformation.

This is why I switched to using inline asm; I add a declaration of a
fake external function for each entry in the table and use it as the
function pointer. Since my inline asm defines these symbols, the
linker is satisfied. There doesn't seem to be a way to create a true
jump table in LLVM, so this is the best we can do.

I can imagine abusing indirectbr across separate functions, but rolling your
own jump table in asm seems better.

I tried using indirect branches for a bit, but I didn't see how to
make that work. Aren't they restricted to branching to labels inside
the current function? From the LLVM Language Reference entry on the
semantics of indirectbr:

Control transfers to the block specified in the address argument. All
possible destination blocks must be listed in the label list,
otherwise this instruction has undefined behavior. This implies that
jumps to labels defined in other functions have undefined behavior as
well.

> 3. adds a fast check for pointer safety at each indirect call site:

Why not use a bloom filter for valid target addresses instead?

Can a bloom filter be as fast as a simple bounds check? I'm thinking lea
base, sub, cmp, jl, and cold call.

Exactly. The jump table turns into a very small amount of code; note
that a normal bounds check has to check both bounds (so two subs and
cmps). With the base and mask, in asm pseudo-code, the check does:

sub base, addr
and mask, addr
add base, addr
cmp addr, orig
je normal_call
<load info for warning call>
call warning
normal_call:
call orig

And if you can get sufficient power-of-two alignment for the table,
you can do even better, since then the base is a prefix of all valid
addrs in its table. Unfortunately, Linux only gives you alignment up
to 2^12 under PIE/ASLR, even though the ELF binary claims to have
larger alignment.

and mask, addr
add base, addr
cmp addr, orig
<etc...same>

My code supports this as an option, but the default is the subtraction version.

Depends. The potential issue with the "jump table" approach is code
size -- as written, at least 64 bits for every potential target. If you
want to include virtual functions in that list, it will grow very
large. A decently working bloom filter would need on the order of 1 or
2 bits per potential target, making the chance of fitting into cache
quite a bit larger. A basic candidate hash function would be "and size"
for the first filter and "mul with constant; and size" for the second,
followed by a bit test for each. On modern CPUs the mul is quite cheap,
so the trade-off is more or less one memory access vs. two.

Joerg

>> IIRC this came up before, and I don't think we expose anything like a
>> jump
>> table at the IR level. As an IR-to-IR transform, I think asm is the
>> only
>> way to do it.
>
> I'd have to look more at what he's doing, but wouldn't a simple switch
> statement in IR suffice? Efficiency would be up to the various
> lowering mechanisms, but it wouldn't require inline asm.

I'm not sure I follow how this would work. Could you expand on this, please?

I think you've already rebutted it below, so how about another idea? :)

What about creating a pseudo-plt in the back end that will create this
jump table for you at object emission time?

Throwing out ideas in an attempt to avoid passes creating inline
assembly - especially since we're looking at an IR-level pass.

-eric

I'm definitely interested in removing the inline asm bits. I'm not
sure what you mean by a pseudo-plt, though; do you mean hooking into
the code that generates the Procedure Linkage Table? I really don't
know much about the LLVM back end, so I'd have to learn how that all
works in LLVM first.

Either that or using similar functionality. It's definitely more
complicated at the start than using the inline assembly.

-eric

I've been digging around in the back end for a bit now, trying to figure
out the best way to implement the jump-instruction tables; there's already
support for jump tables there, but it's a different kind of jump table than
the one I want.

One direction that looks promising to me is to add an intrinsic, something
like:

declare void @llvm.jump.instr.table.entry(i8*, i8*)

The first argument would be the original function (the one that gets jumped
to in the table), and the second argument is the _JT function declared but
not defined by the pass.

Then I can add a custom lowering in each supported architecture that would
turn this into a labeled jump instruction, something like:

  .globl func_JT
func_JT:
  jmp func@PLT

The CFI pass would add a special function that would consist only of these
intrinsics, one for each jump statement needed by the table, and padded to
a power of two using another special intrinsic (something like
llvm.undefined.instr) that would lower to an undefined instruction (like
ud2 in x86).

I'd appreciate feedback anyone might have about this proposal.

Thanks,

Tom

Creating these intrinsic calls could potentially introduce pessimizations,
as the midend would consider the functions to be used. Though maybe this
doesn't matter if the CFI pass is run late.

It also seems sort of distasteful to have a function whose sole purpose is
to hold metadata about other functions.

An alternative proposal: introduce a new function attribute named, say,
'jumptable', which would cause the backend to emit a jump table entry and
redirect function references other than calls through the jump table. (Direct
function calls could call the original function directly, as an optimization.)

The CFI pass could then consist of marking every address-taken function with
'jumptable' and introducing the pointer safety checks at call sites.

Thanks,

>
> > >
> > > I'm definitely interested in removing the inline asm bits. I'm not
> > > sure what you mean by a pseudo-plt, though; do you mean hooking into
> > > the code that generates the Procedure Linkage Table? I really don't
> > > know much about the LLVM back end, so I'd have to learn how that all
> > > works in LLVM first.
> >
> > Either that or using similar functionality. It's definitely more
> > complicated on the start than using the inline assembly.
> >
> > -eric
> >
>
> I've been digging around in the back end for a bit now, trying to figure
> out the best way to implement the jump-instruction tables; there's already
> support for jump tables there, but it's a different kind of jump table than
> the one I want.
>
> One direction that looks promising to me is to add an intrinsic, something
> like:
>
> declare void @llvm.jump.instr.table.entry(i8*, i8*)
>
> The first argument would be the original function (the one that gets jumped
> to in the table), and the second argument is the _JT function declared but
> not defined by the pass.
>
> Then I can add a custom lowering in each supported architecture that would
> turn this into a labeled jump instruction something like
>
> .globl func_JT
> func_JT:
> jmp func@PLT
>
>
> The CFI pass would add a special function that would consist only of these
> intrinsics, one for each jump statement needed by the table, and padded to
> a power of two using another special intrinsic (something like
> llvm.undefined.instr) that would lower to an undefined instruction (like
> ud2 in x86).
>
> I'd appreciate feedback anyone might have about this proposal.

Creating these intrinsic calls could potentially introduce pessimizations,
as the midend would consider the functions to be used. Though maybe this
doesn't matter if the CFI pass is run late.

It also seems sort of distasteful to have a function whose sole purpose is
to hold metadata about other functions.

The way I've implemented it (see the patch I sent to llvm-commits
yesterday), it's not just metadata: the intrinsic lowers to the
jumptable entry code given above. The CFI pass then generates a
function for each jump table; the function consists solely of these
intrinsic calls.

An alternative proposal: introduce a new function attribute named, say,
'jumptable', which would cause the backend to emit a jump table entry and
redirect function references other than calls through the jump table. (Direct
function calls could call the original function directly, as an optimization.)

The CFI pass could then consist of marking every address-taken function with
'jumptable' and introducing the pointer safety checks at call sites.

That's an interesting suggestion. I'll look into it.

However, adding a new function attribute would require changing the
bitcode format, right? I thought that was to be avoided in general.
I'm not sure it makes sense to change the bitcode format to support
CFI.