In this RFC I introduce “structure protection”, a family of lightweight techniques for mitigating use-after-free bugs by storing data in unused bits of a struct. It is a variant of the “lock and key” approach to memory bug mitigation that has been implemented in various forms, such as HWASan and the ARMv8.5 Memory Tagging Extension (MTE). The basic idea is similar: we want to store data in the pointer that acts as a key, together with data in memory that acts as a lock. With MTE, we have 4 bits of “lock” per 16 bytes (128 bits) of data, and a matching 4-bit “key” in the top bits of the pointer. On memory access, the hardware checks that the lock matches (is equal to) the key. HWASan is similar, but it has an 8-bit lock and an 8-bit key, and the checks are implemented in software using shadow memory. The general way that “lock and key” mitigations protect against UAF and type confusion is that the key acts as a sort of object identity: if the identity changes (e.g. via deallocation and reallocation), a different key is used, and subsequent accesses using the old key lead to a program crash.
With structure protection, the lock is not stored in a separate memory region but rather in the struct’s memory itself. The first structure protection technique that will be introduced as part of this work is known as pointer field protection (PFP). With pointer field protection, we observe that pointer bits above the maximum width of a virtual address are effectively unused, so we can use them to store the lock. Each pointer field in a struct with PFP has its own lock. Here is an example of a value that we may store to a struct’s pointer field, assuming a VA width of 48:
| 63 .. 48 | 47 .. 0 |
|   lock   | pointer |
Note that this is only the in-memory representation of the pointer field. When we load from the pointer field, the value that the program sees will end up looking like a conventional pointer:
| 63 .. 48 | 47 .. 0 |
| 0 ... 0  | pointer |
This is achieved by teaching the compiler to add code to remove the lock at load time, and insert the lock at store time. Let’s call the functions to insert and remove the lock InsertLock and RemoveLock respectively. Each struct, or even each field, may have its own pair of InsertLock and RemoveLock functions. To start with, let’s say that they take a single argument: InsertLock takes the pointer that we want to store and RemoveLock takes the pointer value that we loaded. Each implementation of these functions must have the following properties:
- RemoveLock(InsertLock(ptr)) = ptr
- Using a pointer derived from RemoveLock must cause a fault with a high probability if there is a mismatch between the InsertLock function and the RemoveLock function (implying a type confusion).
Let’s suppose that we use the functions:
- InsertLock(ptr) = ptr XOR (hash(T, F) << 48)
- RemoveLock(ptr) = ptr XOR (hash(T, F) << 48)
where hash(T, F) is a 16-bit hash computed using the name T of the struct containing the pointer field and the name F of the field. (These are just examples of functions that could be used. The functions used in the actual implementation will be introduced later.) Assuming that the hardware checks all 16 upper bits of the pointer, we can see that accesses with an incorrect type (type confusion) will fail with a false negative probability of 1 in 65536. Since the RemoveLock function needs to be called every time we load a pointer from a struct that is subject to PFP, a “check” is effectively placed after every load of a pointer from such a struct. (RemoveLock is technically just the first part of the check; the second part is performed by the hardware, which verifies that the pointer is canonical when it is used for a load or store.)
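To make this concrete, here is a minimal C++ sketch of these example functions; the constant 0xBEEF is an arbitrary stand-in for hash(T, F):

#include <cstdint>

constexpr std::uint64_t kHashTF = 0xBEEF; // arbitrary stand-in for hash(T, F)

std::uint64_t InsertLock(std::uint64_t ptr) { return ptr ^ (kHashTF << 48); }
std::uint64_t RemoveLock(std::uint64_t val) { return val ^ (kHashTF << 48); }

// RemoveLock(InsertLock(p)) == p. Removing a lock that was inserted with a
// different hash(T', F') leaves nonzero bits in 63:48, so the resulting
// pointer is non-canonical and faults when dereferenced.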
So far I have only presented a basic version of the idea. On some architectures we can also use architecture-specific features to increase the difficulty of creating a false negative.
The first such feature exists, in some form, on multiple architectures: it allows certain pointer bits to be ignored during loads and stores. AArch64 calls it Top Byte Ignore (TBI), Intel calls it Linear Address Masking, AMD calls it Upper Address Ignore and RISC-V calls it Pointer Masking. With each of these features, we can store the key in the masked bits of the pointer to the struct, so that normal loads and stores are unaffected by the presence of the key. The key acts as an identity for the struct, so it allows detection of use-after-free bugs that do not involve a type confusion (this is also how HWASan and MTE work). So, assuming AArch64 TBI (bits 63:56 ignored), a pointer now looks like this in memory:
| 63 .. 56 | 55 .. 48 | 47 .. 0 |
|   key    |   lock   | pointer |
and this in a register:
| 63 .. 56 | 55 .. 48 | 47 .. 0 |
|   key    | 0 ... 0  | pointer |
A heap allocator must generate a random key for each allocation and insert it into the top bits of the pointer. It is not necessary for the allocator to do anything else with the key, such as storing it to shadow memory. Note that, as with the basic version of the idea, pointers returned by an allocator (or by any other function) do not include the lock, because the pointer is returned in a register, and the lock is only inserted when the pointer is stored to memory.
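To illustrate, here is a minimal sketch of the allocator side under TBI; tagged_malloc is a hypothetical wrapper, not the actual allocator change:

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <random>

void *tagged_malloc(std::size_t size) {
  void *p = std::malloc(size);
  if (p == nullptr) return nullptr;
  static thread_local std::mt19937_64 rng{std::random_device{}()};
  std::uint64_t key = rng() & 0xff; // fresh random key per allocation
  // Place the key in bits 63:56; under TBI, loads and stores ignore them.
  return reinterpret_cast<void *>(reinterpret_cast<std::uintptr_t>(p) |
                                  (key << 56));
}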
Our InsertLock and RemoveLock functions must now take an additional argument, namely the address of the struct that was used to load the pointer, and we must ensure that RemoveLock(InsertLock(ptr, structptr), structptr) = ptr. Here are example implementations:
- InsertLock(ptr, structptr) = ptr XOR ((structptr >> 56) << 48)
- RemoveLock(ptr, structptr) = ptr XOR ((structptr >> 56) << 48)
With these functions, we effectively create a linked list of keys in memory. Assume a struct defined like so:
struct S {
  S *ptr;
};
A linked list of structs of type S will look like this:
| 63 .. 56 | 55 .. 48 | 47 .. 0  |
|   key1   |   lock   | pointer1 |
                           |
          +----------------+
          V
| 63 .. 56 | 55 .. 48 | 47 .. 0  |
|   key2   |   key1   | pointer2 |
                           |
          +----------------+
          V
| 63 .. 56 | 55 .. 48 | 47 .. 0  |
|   key3   |   key2   | pointer3 |
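To illustrate how this detects a use-after-free, here is a sketch using the example functions above, with made-up keys and an arbitrary address:

#include <cstdint>

std::uint64_t InsertLock(std::uint64_t ptr, std::uint64_t structptr) {
  return ptr ^ ((structptr >> 56) << 48);
}
std::uint64_t RemoveLock(std::uint64_t val, std::uint64_t structptr) {
  return val ^ ((structptr >> 56) << 48);
}

std::uint64_t uaf_demo(std::uint64_t field_value) {
  std::uint64_t addr = 0x7fffdeadb000; // the same memory, allocated twice
  std::uint64_t new_owner = addr | (0x5cULL << 56); // fresh key after realloc
  std::uint64_t dangling = addr | (0xa3ULL << 56);  // stale key held by a UAF
  std::uint64_t in_memory = InsertLock(field_value, new_owner);
  // The keys do not cancel: bits 55:48 of the result are 0x5c ^ 0xa3, so the
  // loaded pointer is non-canonical and faults when dereferenced.
  return RemoveLock(in_memory, dangling);
}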
The second feature that we can use is the AArch64-specific Pointer Authentication feature. With it, we can also mix a global secret (the pointer authentication key) and the pointer value itself into the lock, use as many pointer bits as are available, and take advantage of a shorter instruction sequence at each pointer load/store compared to doing the computations in software. Now our functions look approximately like this, again assuming a 48-bit VA range:
- InsertLock(ptr, structptr) = ptr[63:56] : PAHash(ptr, structptr, GlobalKey) : ptr[47:0]
- RemoveLock(ptr, structptr) = ptr[63:56] : (if ptr[55:48] == PAHash(ptr, structptr, GlobalKey) then 0 else 1) : ptr[47:0]
or these functions if the allocator does not support pointer tagging:
- InsertLock(ptr) = ptr[63:56] : PAHash(ptr, hash(T, F), GlobalKey) : ptr[47:0]
- RemoveLock(ptr) = ptr[63:56] : (if ptr[55:48] == PAHash(ptr, hash(T, F), GlobalKey) then 0 else 1) : ptr[47:0]
and our in-memory pointer looks like this:
| 63 .. 56 | 55 .. 48 | 47 .. 0 |
|   key    |  PAHash  | pointer |
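For illustration, on a toolchain with Clang’s pointer authentication intrinsics (<ptrauth.h>, e.g. arm64e), the PA-based pair can be approximated as follows; this is a sketch, not the code the compiler actually emits:

#include <cstdint>
#include <ptrauth.h>

long *InsertLockPA(long *ptr, void *structptr) {
  // Sign with the DA key, using the struct address as the modifier;
  // compiles to roughly a single pacda instruction.
  return ptrauth_sign_unauthenticated(ptr, ptrauth_key_asda,
                                      reinterpret_cast<std::uintptr_t>(structptr));
}

long *RemoveLockPA(long *stored, void *structptr) {
  // Authenticate and strip the PAC; compiles to roughly a single autda. On
  // a mismatch, the result is non-canonical (or the instruction faults
  // directly with FEAT_FPAC).
  return ptrauth_auth_data(stored, ptrauth_key_asda,
                           reinterpret_cast<std::uintptr_t>(structptr));
}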
Effectiveness
The following factors impact the effectiveness of pointer field protection:
- The number of ignored address bits (e.g. 8 on AArch64) – this bounds the effectiveness of the UAF defense.
- The number of non-canonical address bits – this bounds the effectiveness of forged pointer detection.
- The fraction of exploit chains prevented by an inability to exploit a UAF by means of a pointer field.
The first two factors imply a low probability of a false negative: at most 1 in 256, and lower in some circumstances (e.g. with a smaller VA size, or with an exploit chain that requires access to multiple freed pointers). We plan to conduct a full internal evaluation of the last factor once the initial version of PFP is upstreamed.
Here is a breakdown of the threat model in various cases:
- With the generic implementation and an allocator without support for pointer tagging, UAFs involving a type confusion between two fields of pointer type are caught deterministically, assuming no type hash collisions (collision probability 1 in 65536) and no active adversary. The caveat about an active adversary is needed because, for example, if a structure with a pointer field is freed and the memory is reallocated for a structure with an attacker-controlled non-pointer field at the same address, pointer forging is possible.
- With the generic implementation and an allocator with support for pointer tagging, all UAFs are detected with high probability, but the tradeoff is that it relies on secrecy of the locks, so it assumes the attacker doesn’t have a read primitive.
- With the pointer authentication based implementation, a pointer cannot be forged except by chance because of the global secret, and a userspace read primitive cannot be used to disclose the global secret because it is stored only in CPU system registers and/or kernel memory.
Practicalities regarding the choice of InsertLock and RemoveLock functions
Going back to the original implementations of InsertLock and RemoveLock, which are needed for the generic implementation of PFP:
- InsertLock(ptr) = ptr XOR (hash(T, F) << 48)
- RemoveLock(ptr) = ptr XOR (hash(T, F) << 48)
These functions are suboptimal because they require relatively long instruction sequences, and possibly an additional register, to materialize the shifted hash constant and then apply the XOR. Instead, we can take advantage of the relatively common architectural support for an add or subtract instruction that takes an 8-bit immediate. We can combine it with a rotate instruction to move the required-to-be-zero high bits of the pointer into the low bits of the stored value. This gives us the following alternative implementations, each of which costs two short instructions with constants in immediate operands on common architectures; these are the functions that PFP actually uses in the generic implementation:
- InsertLock(ptr) = ROL(ptr, 16) + hash(T, F)
- RemoveLock(ptr) = ROR(ptr - hash(T, F), 16)
The rotation count of 16 (64-48) was chosen because 48 is a relatively common size for the VA range and it does not conflict with 6-bit LAM on x86_64. The maximum possible size of the VA range on x86_64 was recently increased to 57 bits; detection is still likely if an application uses more than 48 bits but the likelihood decreases as the application uses more VA space.
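A minimal C++20 sketch of this pair, with 0x45 standing in as an arbitrary value for hash(T, F):

#include <bit>
#include <cstdint>

constexpr std::uint64_t kHash = 0x45; // arbitrary stand-in for hash(T, F)

std::uint64_t InsertLock(std::uint64_t ptr) {
  // Rotate the 16 must-be-zero high bits down to the bottom, then add the
  // hash (one rotate plus one add-with-immediate on common architectures).
  return std::rotl(ptr, 16) + kHash;
}

std::uint64_t RemoveLock(std::uint64_t val) {
  // Undo the add, then rotate back. Subtracting a mismatched hash leaves a
  // nonzero residue that rotates back into bits 63:48, making the
  // resulting pointer non-canonical.
  return std::rotr(val - kHash, 16);
}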
It is possible to mix bits of the original pointer into the lock in such a way that pointer corruptions would be detected, but this comes at the cost of more instructions. We plan to implement an alternative mode of PFP that mixes the original pointer, after investigating the possible instruction sequences.
On standard layout types
PFP may not be applied to fields of structs that are standard-layout (this includes structs that may be shared with C). This is for two main reasons:
- It facilitates compatibility with precompiled C libraries and other tools that may assume that structs have a standard representation. (PFP implies a change to the C++ ABI, but it does not need to imply a change to the C ABI.)
- Even though we cannot apply PFP to all standard layout types, one option that we considered was to allow PFP just for non-trivially destructible standard layout types (informally, “non-POD types”). However, even given this restriction, the rules of the language make it very difficult to distinguish a pointer to a field of pointer type from any other pointer to a pointer, or to know which fields must have PFP disabled when encountering particular constructs. To give one example of the difficulties, the code snippet below would need to disable PFP for the field B::c, but f is allowed to be compiled without a complete type for A or B.
struct B {
  int *c;
  ~B();
};
struct A {
  B b;
};
int **f(A *a) {
  return (int **)a;
}
Because the current implementations of various standard library types, including std::unique_ptr, are standard layout, their pointers would not be protected by PFP under this rule. For this reason, we also propose to make various common standard library types non-standard layout when PFP is enabled. Although this is a C++ ABI change, it should be acceptable because PFP already changes the C++ ABI.
This could be done by, for example, declaring a class with two base class subobjects of the same type (one direct and one indirect) and having the standard library types inherit from it, taking advantage of the C++ rule in [class.prop] that a standard-layout class “has at most one base class subobject of any given type”, such as the following:
class __force_nonstandard_layout_base1 {};
class __force_nonstandard_layout_base2 : __force_nonstandard_layout_base1 {};
class __force_nonstandard_layout : __force_nonstandard_layout_base1, __force_nonstandard_layout_base2 {};
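As a sketch, any class that inherits from it loses standard layout; Widget is a hypothetical example, reusing the definitions above:

#include <type_traits>

class Widget : __force_nonstandard_layout {
  int *ptr_; // this field is now eligible for PFP
};

// Widget has two base class subobjects of type
// __force_nonstandard_layout_base1, so it is not standard-layout.
static_assert(!std::is_standard_layout_v<Widget>);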
For the time being, we propose to add the above base class to certain libc++ types (only if PFP is enabled). However, we also intend to introduce an attribute that has the same effect. This would not only be more concise and self-documenting, especially for user code, but would also allow C structs to opt into PFP.
On trivially destructible and trivially copyable types
If the data structure is trivially destructible, which implies that it may be trivially copyable, we cannot use a variant of InsertLock and RemoveLock that takes a structptr argument, because the struct pointer may change outside of our control as a result of the structure being copied via memcpy.
I referred to trivially destructible (formally: “has a trivial, non-deleted destructor”) structs above. One might think that this could just read “trivially copyable”. However, to my surprise, I discovered that this does not work. This is because, by the rules of the language, it is possible for a struct A to be not trivially copyable, while another struct B which has A as a base or member is trivially copyable. This means that memcpying B, as allowed by the language, will invalidate A’s members.
Let’s recap the rule for whether a class is trivially copyable:
A trivially copyable class is a class:
(6.1) — where each copy constructor, move constructor, copy assignment operator, and move assignment operator (15.8, 16.5.3) is either deleted or trivial,
(6.2) — that has at least one non-deleted copy constructor, move constructor, copy assignment operator, or move assignment operator, and
(6.3) — that has a trivial, non-deleted destructor (15.4).
(This is from C++17. C++20 changed the wording to account for concepts but I don’t think the change was material for this issue.)
Suppose that A has a trivial copy constructor (CC), move constructor (MC) and copy assignment operator (CAO), and a non-trivial move assignment operator (MAO). The non-trivial MAO causes it to be non-trivially-copyable. Then B, which has A as a member, introduces a deleted MAO. This hides A’s non-trivial MAO and causes B to be trivially copyable. Here is an example case extracted from the protobuf test suite (assuming the std::tuple from libc++):
#include <tuple>

class AnythingMatcher {};

class PairMatcher {
 public:
  PairMatcher(int) : first_matcher_(0) {}

 private:
  const char *const first_matcher_;
  const AnythingMatcher second_matcher_;
};

PairMatcher s(0);              // not trivially-copyable
std::tuple<PairMatcher> st(s); // trivially copyable!
To make this work, we need a property X where “not X” for a base class/field type implies both “not X” and “not trivially copyable” for derived classes and classes with a field of that type. “Trivially destructible” was identified as a suitable property X because of clause 6.3 above combined with the fact that a class cannot have a trivial destructor if its bases or fields have non-trivial destructors.
Example
The following example program is used to show how loads and stores are compiled:
struct Cls {
  virtual ~Cls();
  long *ptr;
};

long *load(Cls *c) {
  return c->ptr;
}

void store(Cls *c, long *l) {
  c->ptr = l;
}
When targeting x86_64, the functions are compiled like this:
0000000000000000 <_Z4loadP3Cls>:
0: 48 8b 47 08 mov 0x8(%rdi),%rax
4: 48 83 c0 45 add $0x45,%rax
// hash(T, F) = -0x45
8: 48 c1 c0 30 rol $0x30,%rax
c: c3 ret
0000000000000000 <_Z5storeP3ClsPl>:
0: 48 c1 c6 10 rol $0x10,%rsi
4: 48 83 c6 bb add $0xffffffffffffffbb,%rsi
// 0xffffffffffffffbb = hash(T, F) = -0x45
8: 48 89 77 08 mov %rsi,0x8(%rdi)
c: c3 ret
When targeting AArch64, they are compiled like this:
0000000000000000 <_Z4loadP3Cls>:
0: f9400408 ldr x8, [x0, #8]
4: dac11808 autda x8, x0
8: aa0803e0 mov x0, x8
c: d65f03c0 ret
0000000000000000 <_Z5storeP3ClsPl>:
0: dac10801 pacda x1, x0
4: f9000401 str x1, [x0, #8]
8: d65f03c0 ret
Structure protection vs HWASan and MTE
Here is a comparison of structure protection against HWASan and MTE:
- Structure protection is intended only as a UAF mitigation. It can also catch some buffer overflows (e.g. an overflow that overwrites a protected pointer field with a value lacking the correct lock), but only in a context-dependent way.
- Unlike HWASan and MTE, it stores the “lock” in existing bits of storage that are already being fetched, so there is no additional pressure on the memory system. HWASan and MTE’s use of separate storage for locks has certain costs:
- With HWASan, the shadow memory region is unassociated with the main memory at the memory system level; it has its own set of cache lines and TLB entries. This increases pressure on the cache and TLB and, together with explicit checks, typically makes HWASan unsuitable for production deployment due to high overhead.
- With MTE, the tags are typically associated at the hardware level with the cache line and possibly the physical memory itself, so the TLB costs and most of the cache costs are avoided, but the system still needs to fetch tags to fill cache lines and (depending on the implementation) may need to reserve 3% of all memory for tag storage.
- Structure protection scales linearly with the number of checks that are required, in terms of both runtime and memory overhead. With HWASan and MTE, the allocator must retag every memory region on deallocation and sometimes on allocation, and with MTE every access to a tagged region is usually checked (checks can be disabled with specific instructions, but these have their own costs), so enabling checks has a fixed cost even if we only want to check a few fields per struct. With structure protection, on the other hand, the per-allocation cost is very low (we just need to generate a random number per allocation), and we can use the compiler to insert only those checks which provide high ROI.
- Compared to MTE, it reduces the likelihood of a false negative. MTE has 4 key bits, while structure protection has 8, plus the mixing of a global secret on AArch64. This implies a false negative probability of at most 1 in 256.
- On AArch64 it utilizes pointer authentication, which is currently available in more microarchitectures than MTE.
- Structure protection only works for certain C++ struct types (i.e. non-standard layout types) as a consequence of avoiding modifications to the C ABI. HWASan and MTE are compatible with all object types.
- Structure protection has a basic mode of operation which is architecture-independent.
Frequently Anticipated Questions
How do you handle code that takes the address of a struct field?
We must make sure that loads/stores via that pointer are equivalent to loads/stores via the struct field. The basic strategy for handling this situation is to disable PFP for fields whose address is taken (more precisely: fields whose address leaks outside of the translation unit). Fortunately, it is relatively uncommon for fields to have their address escape (in one large internal Google binary, there were 66875 PFP-eligible fields, of which 728, or around 1.1%, had escaping addresses). Less fortunately, determining whether a field’s address is taken requires whole-program information. We don’t need LTO to handle this, though: all that we need to do to disable PFP for a field is replace certain instructions with NOPs, and this is fairly easy for a linker to do. More details on how this will work are provided in the RFC for deactivation symbols.
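For example, in the following sketch (Node is a hypothetical type), PFP would have to be disabled for Node::next, because loads and stores through the returned pointer must see a plain pointer:

struct Node {
  virtual ~Node(); // non-standard-layout, so Node::next is PFP-eligible
  Node *next;
};

Node **escape(Node *n) {
  return &n->next; // the field's address escapes this translation unit
}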
Similarly to code that takes the address of a field, we must also consider code that uses offsetof. Code that uses offsetof to compute a field offset can be understood to mean that a pointer to the field will subsequently be formed by adding the offset to a struct address, so the presence of an offsetof evaluation for a specific field anywhere in the program must disable PFP for that field. This is implemented by tracking all evaluations of offsetof in the ASTContext. Although offsetof is technically only defined for standard-layout types (which would mean that no special handling is necessary, because standard layout types are not subject to PFP), we found that enough code uses offsetof on non-standard-layout types to make this necessary.
The same applies to code that forms a data member pointer, which is also permitted on non-standard-layout types. If a data member pointer is evaluated during code generation, that triggers PFP disablement for the field.
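For example, either of the following constructs, evaluated anywhere in the program, would disable PFP for the field p of the hypothetical type Outer (note that offsetof on a non-standard-layout type is only conditionally-supported by the language):

#include <cstddef>

struct Outer {
  virtual ~Outer(); // non-standard-layout, so Outer::p is PFP-eligible
  int *p;
};

const std::size_t off = offsetof(Outer, p); // offsetof evaluation
int *Outer::*mp = &Outer::p;                // data member pointer formation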
You’re adding latency to every pointer field load; won’t this slow everything down?
This was also our concern at the start of the project. According to public information, the latency of the AUT instruction on Apple M1 is 6-7 cycles, and the latency on Neoverse N2 is 5 cycles. To begin with, I relied on the following intuition: memory loads are, on average, relatively high latency operations anyway, so adding a few more cycles is unlikely to make a significant difference. But the only way to be sure was to actually implement it and measure the performance difference. Using the prototype linked below, I collected the following result: QPS (queries per second) of a large realistic internal Google benchmark decreases by around 1.5-2% with PFP enabled, depending on the architecture and microarchitecture. These numbers may be acceptable to some users, but for users with more exacting performance requirements, we intend to implement techniques for reducing the overhead, such as using profile data to disable PFP for frequently accessed fields (see also this thread).
We are also considering another “structure protection” technique, which is to use struct padding to store the “lock” for structs with padding bytes. In principle, this should move the check out of the critical path of loads, but its effect on overall performance will need to be studied.
How do you pass structs by value?
PFP is only intended to protect the heap, not the stack or values in registers. Therefore, pointer fields are decoded into regular pointers as part of the calling convention for structs passed and returned by value.
How are prebuilt/uninstrumented libraries handled? (e.g. libc, libraries written in non-C languages, etc.)
As long as such a library exposes a C interface and does not depend on the C++ standard library, it will work without problems. This is because, for interoperability among other reasons, structure protection only affects the C++ ABI and not the C ABI.
How it works
When Clang generates code that takes the address of a field that requires PFP, instead of using a normal GEP instruction, it emits a call to the llvm.protected.field.ptr intrinsic. For example, to load from a pointer field named field_name at offset 8 from a non-trivially-copyable struct named StructName, it may emit an intrinsic call and a load like this:
%1 = call ptr @llvm.protected.field.ptr(ptr %0, i64 8, metadata !"StructName.field_name", i1 true)
%2 = load ptr, ptr %1
The first argument is the struct pointer, the second is the field offset in bytes, the third is the field identifier (used for disabling PFP for address-taken fields) and the fourth is true if the struct is not trivially copyable. An instrumentation pass replaces the intrinsic with a regular GEP, and replaces load/store users with the required pointer manipulations following/preceding the load/store. If a user other than a load or store is encountered, it is considered to escape the address of the field, and PFP is disabled for the field.
Clang is also modified to disable certain optimizations on non-trivially-copyable structs with PFP fields, such as those inserting a call to memcpy.
To support globally disabling PFP for a particular field, instructions that sign or authenticate a pointer to be stored in a field are relocated using a deactivation symbol. Translation units that escape the address of a PFP field or contain an evaluation of offsetof or a member pointer for a PFP field will define a deactivation symbol for the field.
To support globals with PFP fields, Clang is extended to emit ptrauth constant expressions, which were originally introduced for the PAuth ABI, for pointer field members subject to PFP. The ptrauth constant expression is extended in the following ways:
- The address diversity form is extended to also add the discriminator to the address if it is given. This allows a negative number to be provided to subtract the field offset.
- The constant expression is extended to contain a reference to the deactivation symbol for the field if any.
When generating an initializer for a ptrauth constant as a global initializer, the compiler generates code for an IFUNC resolver which returns the value to be stored as the initializer for the field, and initializes the field using the resolver. This approach was chosen over using or extending the R_AARCH64_AUTH_* relocations added for the PAuth ABI for two main reasons:
- It is more flexible. For example, it easily allows the initializer to support deactivation symbols by using the deactivation symbol to relocate the instruction in the IFUNC resolver that signs the pointer. By comparison, a new relocation type would take two symbol operands (the referent and the deactivation symbol), which is unprecedented in ELF and would likely require fundamental extensions to the relocation format just for this niche requirement. It also enables support for using Emulated PAC to relocate ptrauth relocations (see link below). Furthermore, as we extend to other architectures lacking pointer authentication, we avoid needing to hardcode details of our pointer encoding into various dynamic loaders.
- It avoids a dependency on a new version of the dynamic loader that supports the PAuth ABI relocation types, or any new relocation types/formats that would be introduced under the approach described in the previous point.
The current version of LLD does not properly support this usage of IFUNC relocations (it introduces a PLT thunk for IFUNC references even when not strictly necessary); an LLD patch will be proposed to address this.
To optimize accesses to PFP fields, the SROA pass is taught to recognize the new intrinsic and to treat intrinsic calls passing the same arguments as references to the same field.
The full prototype may be found here and the fork of TCMalloc with pointer tagging support may be found here. The current prototype can do the following:
- Build LLVM/Clang/MLIR and pass most tests
- Build various foundational Google open source projects including ABSL/Protobuf/gRPC and pass their tests
- Build most of the subset of Google’s internal codebase that is HWASan clean (i.e. the subset that is compatible with pointer tagging and does not contain UAF bugs detectable by running tests) and pass most of its tests
Most remaining issues are believed to be caused by undefined behavior in the code itself, such as code that uses memcpy to copy a non-trivially-copyable struct, or code that uses memset(0) over a struct to set its pointer fields to NULL.
Because deactivation symbols and global initializers are currently only implemented for Arm, the current prototype only works on other architectures for trivial programs.
To support running PFP programs on AArch64 machines lacking pointer authentication support, we also introduce Emulated PAC.
Next steps
I propose to upstream the prototype, after adding additional LLVM lit tests to cover the newly added code and addressing the remaining TODOs. To begin with, PFP will be enabled using an experimental Clang driver flag: -fexperimental-pointer-field-protection.
Special thanks
Dmitry Vyukov proposed the general concept of structure protection using unused bits as well as the idea of using struct padding, and provided some feedback. Qinkun Bao implemented the Clang attribute and wrote some of the lit tests.
This work couldn’t have been created, or at least would have been much more difficult, without the ability to use pointer authentication in LLVM and to use a debugger to record and replay programs on AArch64 with pointer authentication. I would therefore also like to thank Apple, the Asahi Linux team and the rr debugger team for their amazing work.