IR extension proposal: bitset constants

Hi all,

I would like to propose a mechanism that allows IR modules to co-operatively
build a pointer set corresponding to addresses within a given set of
globals. The specific use case I have in mind is to provide a mechanism
for a C++ program to efficiently verify (at each call site) that a vtable
pointer is in the set of valid vtable pointers for the class or its derived
classes. One way of doing this is for a toolchain component to build, for each
class, a bit set that maps to the memory region allocated for the vtables,
such that each 1 bit in the bit set maps to a valid vtable for that class,
and lay out the vtables next to each other, to minimize the total size of
the bit sets. Call sites would perform a range check followed by a load of
the appropriate bit from the bit set.

To give a concrete example, suppose we have the following classes:

struct A { virtual void f(); };
struct B : A { virtual void f(), g(); };
struct C : A { virtual void f(), h(); };

with the following vtables:

_ZTV1A = { &A::rtti, &A::f };
_ZTV1B = { &B::rtti, &B::f, &B::g };
_ZTV1C = { &C::rtti, &C::f, &C::h };

The set of valid vtables for static class A is {&_ZTV1A[1], &_ZTV1B[1],
&_ZTV1C[1]}, for B is {&_ZTV1B[1]} and for C is {&_ZTV1C[1]}. The toolchain
would lay out _ZTV1A, _ZTV1B and _ZTV1C consecutively, and generate bit sets
like this:

A = {0,1,0,1,0,0,1,0}
B = {0,0,0,1,0,0,0,0}
C = {0,0,0,0,0,0,1,0}

A call site (say the static type of the call is A) would check whether the
object's vtable falls in the range [_ZTV1A, _ZTV1C] and is sizeof(void
*)-aligned. It would then load the (vtable - _ZTV1A)/sizeof(void *)'th entry
in A's bit set, which will be 1 if the vtable is valid for A.
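
To make the arithmetic concrete (assuming 8-byte pointers, so that each
vtable slot occupies one word): if some_object is actually a B, its vtable
pointer is &_ZTV1B[1]. With the layout above, _ZTV1A occupies words 0-1 and
_ZTV1B words 2-4, so &_ZTV1B[1] is word 3, i.e. (vtable - _ZTV1A)/sizeof(void *)
== 3, and entry 3 of A's bit set is 1, so the check passes. A corrupted
pointer to, say, &_ZTV1B[2] (word 4) would hit a 0 entry and fail the check.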

We are now faced with a number of implementation questions: how do we represent
these (potentially partial) bit sets in LLVM, and how do we arrange to lay
out the globals and build the complete bit sets? Remember that classes B
and C may be in different translation units, so we do not necessarily have
a complete picture of A's bit set at A's compile time.

What I propose is that the parts of the bit sets be represented as arrays
of pointers, which will be resolved through some mechanism (either at
LTO time, or at regular (non-LTO) link time) to bit sets corresponding to
those addresses. This mechanism would also be responsible for laying out
the referenced globals (i.e. the vtables) as close together as possible. We
introduce a new flag on globals, the so-called "bitset" flag, which causes
the global to be transformed using this mechanism. Such a global cannot be
accessed in the normal way. It may only be accessed using an intrinsic that
tests whether a given pointer is in the set.

To return to our concrete example, a translation unit defining the vtable
for A may also define the following global:

@_ZBS1A = appending bitset constant [1 x i8*] [gep(@_ZTV1A, 0, 1)]

A translation unit defining B would define:

@_ZBS1A = appending bitset constant [1 x i8*] [gep(@_ZTV1B, 0, 1)]
@_ZBS1B = appending bitset constant [1 x i8*] [gep(@_ZTV1B, 0, 1)]

And C:

@_ZBS1A = appending bitset constant [1 x i8*] [gep(@_ZTV1C, 0, 1)]
@_ZBS1C = appending bitset constant [1 x i8*] [gep(@_ZTV1C, 0, 1)]

A later mechanism would combine the bitset globals, lay out the referenced
globals and build the bit sets. In the LTO case, the IR linker can already
do the combination step if the globals are given appending linkage. A
later compiler pass (which we will call the "bitset lowering pass") would lay out
the globals (by combining them into a single large global, and RAUWing the
original globals with aliases that gep into the large global) and build the
bit sets. In the non-LTO case, we could also invent an object-file-specific
mechanism for the backend to tell the linker to combine and build the bit
sets, for example by placing the bitset globals in a specifically named
section and placing the referenced globals in individual sections so that
the linker can rearrange them.
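
For the A/B/C example above, the output of the bitset lowering pass might
look roughly like the following sketch (using the same informal gep()
shorthand as above; the names @_ZTV.combined and @_ZBS1A.bits, and the
byte-packed representation of the bit set, are illustrative assumptions
rather than something the proposal requires):

@_ZTV.combined = private constant { [2 x i8*], [3 x i8*], [3 x i8*] }
  { <body of _ZTV1A>, <body of _ZTV1B>, <body of _ZTV1C> }

@_ZTV1A = alias gep(@_ZTV.combined, 0, 0)
@_ZTV1B = alias gep(@_ZTV.combined, 0, 1)
@_ZTV1C = alias gep(@_ZTV.combined, 0, 2)

; A's bit set {0,1,0,1,0,0,1,0} packed into a single byte,
; with bit i of the byte holding entry i of the bit set
@_ZBS1A.bits = private constant i8 74   ; 0b01001010

and similarly for @_ZBS1B.bits and @_ZBS1C.bits; the original appending
@_ZBS* globals would then be dropped once every llvm.bitset.test call has
been rewritten to use the new data.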

As previously mentioned, an intrinsic would be used to test entries in the
bit set at call sites. For example, the relevant IR for the following C++ code:

A *some_object = ...;
some_object->f();

might look like this:

%vtable = load i8** %some_object
%p = call i1 @llvm.bitset.test(i8* %vtable, i8* @_ZBS1A) readnone
br i1 %p, label %continue, label %trap

In the LTO case, such intrinsic calls would be lowered by the bitset lowering pass
into the appropriate range/bit set check code, which might look like this:
(pseudocode for 64-bit platform)

if %vtable & 7 != 0 || %vtable < global_start_addr(_ZBS1A) || %vtable > global_end_addr(_ZBS1A) {
  %p = 0
} else {
  %p = bitset(_ZBS1A)[(%vtable - global_start_addr(_ZBS1A)) >> 3]
}
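
For the A/B/C example, the lowered IR might look roughly like this (a sketch
only: the block structure, the byte-packed bit set and the names
@_ZTV.combined and @_ZBS1A.bits are assumptions carried over from the earlier
sketch, and the actual pass output may differ, e.g. by folding the alignment
and range checks together):

entry:
  %vtable.int = ptrtoint i8* %vtable to i64
  %offset     = sub i64 %vtable.int, <link-time address of @_ZTV.combined>
  %misaligned = and i64 %offset, 7         ; sizeof(void *) - 1
  %index      = lshr i64 %offset, 3        ; word index == bit index
  %inrange    = icmp ult i64 %index, 8     ; A's bit set has 8 entries
  %aligned    = icmp eq i64 %misaligned, 0
  %ok         = and i1 %inrange, %aligned
  br i1 %ok, label %test, label %cont

test:
  %bits  = load i8* @_ZBS1A.bits           ; the packed bit set byte
  %shift = trunc i64 %index to i8
  %sel   = lshr i8 %bits, %shift
  %bit   = and i8 %sel, 1
  %inset = icmp ne i8 %bit, 0
  br label %cont

cont:
  %p = phi i1 [ false, %entry ], [ %inset, %test ]
  ; %p feeds the existing branch to %continue / %trap as before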

In the non-LTO case, we would generate similar code, but with appropriate
relocations so that the required constants (global start/end address, bitset
address, mask, shift amount) would receive appropriate linker-determined
values.

The first step I propose to take in this direction is a patch that adds
support for the bitset flag on globals and introduces the bitset lowering
pass. A later step would involve building backend support, and linker support
in lld.

Comments appreciated.

Thanks,

Is there any way to accomplish this that doesn’t require changes in every part of the toolchain?

What you describe here sounds really efficient, but for a change this pervasive throughout the toolchain, I would at least like to see an attempt that does not require changing the toolchain. Then we can circle back to changing the toolchain if that proves inadequate, with concrete experience informing what really does require a toolchain change. (Maybe you’ve already done this? Please share.)

– Sean Silva

I would start by using module-level metadata.
An IR extension might be a good idea once we show that

  • the proposed indirect call protection mechanism is efficient and useful, and
  • there are other use cases for the IR extension.

–kcc

Hi Sean,

I agree that we should first try to accomplish this without changing every
part of the toolchain. This is why I proposed to first introduce only the IR
support and bitset lowering pass. This would provide a working implementation
that is confined to the compiler, but does require LTO.

In the long run, however, I would like to drop the dependency on LTO, in
order to permit fast incremental builds.

Once the initial approach has proven to be effective, we can think about
dropping the LTO dependency by implementing linker support. I only included
the linker implementation details to show that the design can work without
requiring LTO.

Peter

Introducing a new module-level metadata format is just another kind of IR
extension, with the disadvantage of being untyped. We should be cutting down
on these types of things, not introducing a new one [1].

I feel that a global is the right design for this, as it represents data
that is conceptually in the "object file".

Peter

[1] https://groups.google.com/forum/#!topic/llvm-dev/u4hZUWAXQuY

Introducing a new module-level metadata format is just another kind of IR
extension, with the disadvantage of being untyped. We should be cutting down
on these types of things, not introducing a new one [1].

Hard to argue with that.
Module-level metadata is a simple (though not the only) way to make a
prototype and prove that the approach works.

So, bitset would be a property that means : globals with the same name will append on a string of bits in the order in which they appear, and it’s the job of the front end to make sure the correct order is followed in every unit, so that linking does the right job in every corner case?

Could that be used for efficient global boolean flags, like LLVM’s options? Even if you don’t need the range check, you get the bit mask check for free.

I’m trying to find other uses, to help you make the case for this being a first-class citizen rather than metadata.

Cheers,
Renato

So, bitset would be a property that means : globals with the same name will
append on a string of bits in the order in which they appear, and it's the
job of the front end to make sure the correct order is followed in every
unit, so that linking does the right job in every corner case?

Not quite. Each entry in a bitset global is an address, and (after
deduplication) each such address corresponds to a 1 bit in a bit set covering
a particular region of the object file. The entries may appear in any order,
and it is the job of LTO (or the linker) to build the actual bit set.

Could that be used for efficient global boolean flags, like LLVM's options?
Even if you don't need the range check, you get the bit mask check for
free.

I don't understand exactly what you mean.

Peter

So, bitset would be a property that means : globals with the same name
will append on a string of bits in the order in which they appear, and it's
the job of the front end to make sure the correct order is followed in
every unit, so that linking does the right job in every corner case?

Could that be used for efficient global boolean flags, like LLVM's
options? Even if you don't need the range check, you get the bit mask check
for free.

Maybe during LTO... in general they would need to have distinct addresses.

Actually, Peter, would it be possible to prototype this with an appending
i8 array that we already have in the IR? Some space overhead compared to
appending bits, but could allow for a quick prototype.

-- Sean Silva

This would work, and you could do the bit packing in your lowering pass, no?

Cheers,
Renato

I don't think it could be made to work in all cases. Consider this class
hierarchy, with each class defined in a separate translation unit:

class A { virtual void f(); };
class B { virtual void g(); };
class C { virtual void h(); };
class D : A, B { virtual void f(), g(); };
class E : B, C { virtual void g(), h(); };

vtables:

A = { &A::rtti, &A::f };
B = { &B::rtti, &B::g };
C = { &C::rtti, &C::h };
D = { &D::rtti, &D::f, &D::rtti, &D::g };
E = { &E::rtti, &E::g, &E::rtti, &E::h };

bitsets (assuming layout A,B,C,D,E):

A = { 0,1,0,0,0,0,0,1,0,0,0,0,0,0 }
B = { 0,0,0,1,0,0,0,0,0,1,0,1,0,0 }
C = { 0,0,0,0,0,1,0,0,0,0,0,0,0,1 }
D = { 0,0,0,0,0,0,0,1,0,0,0,0,0,0 }
E = { 0,0,0,0,0,0,0,0,0,0,0,1,0,0 }

Note that because of the existence of class C, we must leave a gap in
the bitset for A between A and D. Because none of the translation units
in C's hierarchy are necessarily aware of A, they cannot know to leave a
gap. Attempting to fix this in the bitset lowering pass would just introduce
more complexity, IMO.

So I don't think it is possible (or simple, at the least) without references
to other globals in the thing we use to build the bitset.

I've been working on a patch that implements the bitset attribute and bitset
lowering pass. I'll see if I can send it out later today.

Thanks,

http://reviews.llvm.org/D7288

Thanks,

Hi Peter,

Please forgive if this is an obvious question, but how reasonable is it for this approach to work when not all translation units are available to be rebuilt with the new information?

I'm very interested in what you're proposing here.

Jim

This technique works best if we can lay out the vtables in a specific order,
but this is not necessarily a requirement. I can imagine an extension of the
design where if the pointer is out of range we can fall back to a slow path
that compares the vtable pointer against each element of a list of valid
addresses that belong to external code.

The hard requirements (I believe) are that we have headers for all external
classes that may interact with our code, and that the external objects export
the needed vtable symbols. With those, we should be able to use Clang to
build the list of valid addresses. I believe this should work with some
implementations of the standard library, for example.

If the external library does not have all headers available, or (say) if
it exposes classes defined in an anonymous namespace, we may not be able
to identify all the valid addresses. To handle those cases we may need to
develop some kind of blacklisting mechanism.

Thanks,

Hi Peter,

I don’t think that adding an IR construct for this is the way to go. You’re making IR complicated for everyone to serve a very narrow use case (one that admittedly I don’t really understand). The fact that this affects semantics and will only work with LTO and not native linkers is also really weird to me. Is there other precedent for that? The only cases I know that affect LTO add information that is safe to drop (e.g. TBAA etc).

Also, your patch is incomplete. Presumably you have to scatter checks for “if (!GV->isBitSet())” throughout the optimizer, codegen and other things that touch globals.

-Chris

I have a bad feeling about this... :-)

--renato

Hi Peter,

Please forgive if this is an obvious question, but how reasonable is it for this approach to work when not all translation units are available to be rebuilt with the new information?

I'm very interested in what you're proposing here.

This technique works best if we can lay out the vtables in a specific order,
but this is not necessarily a requirement. I can imagine an extension of the
design where if the pointer is out of range we can fall back to a slow path
that compares the vtable pointer against each element of a list of valid
addresses that belong to external code.

The hard requirements (I believe) are that we have headers for all external
classes that may interact with our code, and that the external objects export
the needed vtable symbols. With those, we should be able to use Clang to
build the list of valid addresses. I believe this should work with some
implementations of the standard library, for example.

Oof. That’s a deal breaker for any of the uses I was hoping for. Nuts. :-(

Hi Chris,

I wanted to start by giving an explanation of what I am trying to achieve
and how I am trying to achieve it.

I am working towards introducing into LLVM a security mechanism, Forward
Control Flow Integrity (CFI), that is designed to mitigate against
vulnerabilities that allow attacks based on corrupting vtable or function
pointers in memory in order to subvert a program's control flow.

More details are in a paper that I co-authored that was presented at USENIX
Security last year [1]. As mentioned in the paper, attackers are increasingly
relying on such techniques to subvert control flow. This is why I feel that it
is particularly important that compilers contain practical defenses against
such attacks.

One particular variant of the defense I am proposing, vtable verification,
was implemented in GCC and is described in section 3 of the paper; however,
it comes with a significant performance overhead: more than 8% on certain
Chrome browser-based benchmarks and up to around 20% on SPEC 2006 benchmarks
(see Figure 2). This is likely because it searches lists of vtables to
determine whether a given vtable is valid, which is a direct consequence of
its avoidance of techniques that depend on changing how the program is linked.

The implementation I am proposing to contribute to LLVM focuses on
performance. Building the needed data structures at link time allows us
to reduce the cost of checking the validity of a vtable pointer to a few
machine instructions and a single load from memory.

I don’t think that adding an IR construct for this is the way to go. You’re making IR complicated for everyone to serve a very narrow use case (one that admittedly I don’t really understand).
Also, your patch is incomplete. Presumably you have to scatter checks for “if (!GV->isBitSet())” throughout the optimizer, codegen and other things that touch globals.

I'll address these points simultaneously because I would like to explain
why extensive support for the new construct is not required, and why the
maintenance burden is not as large as it might seem (and why my patch does
not need to make such extensive changes to the optimizers).

The first point I'd like to make is that from the point of view of optimizer
passes other than bitset lowering, the llvm.bitset.test intrinsic has opaque
semantics with regard to the content of the globals it references, so they
cannot legally modify the contents of the bitset global.

The second is that any use of a bitset global other than as an argument to
the llvm.bitset.test intrinsic has undefined semantics. (This is something
that can be documented better.) This means that any optimizer pass that
looks through global initializers does not require any changes: any
transformation it may perform on IR that treats such globals as regular
globals (for example by taking their addresses or loading from them) is
semantics preserving, as the semantics of such IR would have been undefined
anyway.
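
For example, suppose some pass somehow introduced

  %x = load i8** getelementptr ([1 x i8*]* @_ZBS1A, i32 0, i32 0)

(purely to illustrate the rule; no frontend would emit this). That load
already has undefined semantics under this proposal, so the transformation
has not changed the program's behaviour, and the pass does not need to
special-case the flag.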

With these points in mind, I'm reasonably confident that very little code
needs to care about the new flag.

(I should also point out that I know that the patch most likely works without
any other optimizer changes, because I have a work-in-progress patch that
implements the Clang side of this, and have successfully applied it to a
large C++ codebase and found real bugs in that codebase.)

Regarding codegen: I haven't implemented backend support for bitsets yet;
the intrinsic is handled entirely by the pass. I can't imagine the changes
being very intrusive, though. For the moment, we can easily add a check that
no bitset constructs make it through to codegen.

The fact that this affects semantics and will only work with LTO and not native linkers is also really weird to me.

I agree, which is why I plan to add support to lld and perhaps other linkers,
but we do have to start somewhere. Adding the functionality to the compiler
only seems like a reasonable first step, even if we depend on LTO to begin
with.

Is there other precedent for that? The only cases I know that affect LTO add information that is safe to drop (e.g. TBAA etc).

There is the jumptable attribute, which has been used to implement a variant
of CFI for indirect function calls (see section 4 of the USENIX paper), and
that only works effectively with an LTO'd module. (We might end up adding
native linker support for that or something similar as well.)

Thanks,

Trying to summarize all opinions expressed here: Peter is proposing an
initial implementation that would only work with LTO. Folks seem put off by
this implementation affecting IR without having proven itself, and having
shortcomings (as Jim pointed out). Kostya proposed going through metadata
(and Chris kind of did too by mentioning tbaa), but Peter points out that
this will make the implementation trickier.

It sounds like going through metadata for the LTO-only proof-of-concept
would be preferable, even if more complex. Peter, how much more complex would
that be?

As an interested user of this I'm happy with LTO, and I can help with the
reviews of the code.

I was just thinking about how the metadata-based design for this might work.

We could have a named metadata global containing the list of bitset entries.
Each entry would be a pair consisting of an identifier for the bitset (in
this case, the mangled name of the class) and a pointer into a global
(in this case, a valid vtable pointer).

For example, this class hierarchy:

class A { virtual void f(); };
class B : A { virtual void f(); };
class C : A { virtual void f(); };

would have these bitsets:

!llvm.bitsets = !{!0, !1, !2, !3, !4}

!0 = !{!"1A", i8* getelementptr inbounds ([3 x i8*]* @_ZTV1A, i32 0, i32 2)}
!1 = !{!"1A", i8* getelementptr inbounds ([3 x i8*]* @_ZTV1B, i32 0, i32 2)}
!2 = !{!"1A", i8* getelementptr inbounds ([3 x i8*]* @_ZTV1C, i32 0, i32 2)}
!3 = !{!"1B", i8* getelementptr inbounds ([3 x i8*]* @_ZTV1B, i32 0, i32 2)}
!4 = !{!"1C", i8* getelementptr inbounds ([3 x i8*]* @_ZTV1C, i32 0, i32 2)}

The LLVM linker can already merge the contents of named metadata globals.

The intrinsic for reading from bitsets would have this definition:

declare i1 @llvm.bitset.test(i8*, metadata)

and would be called like this:

  %x = call i1 @llvm.bitset.test(i8* %p, metadata !{!"1A"})

The disadvantage of this design is that metadata strings always have global
scope, so identically named classes in anonymous namespaces in different
translation units would receive the same bitset. (We can't just use the
vtable globals as bitset identifiers, because they may be erased by globaldce.
There are probably various things we could do to force the globals to stick
around, if we wanted to go that way.)

Does that seem more reasonable to people?

Thanks,