RFC: Structure protection, a family of UAF mitigation techniques

In this RFC I introduce “structure protection”, which is a family of lightweight techniques for mitigating use-after-free (UAF) bugs by storing data in unused bits of a struct. It is a variant of the “lock and key” approach for memory bug mitigation that has been implemented in various forms such as HWASan and ARMv8.5 Memory Tagging Extension (MTE). The basic idea is similar: we want to store data in the pointer that acts as a key, together with data in memory that acts as a lock. With MTE, we have 4 bits of “lock” per 128 bits of data, and a matching 4-bit “key” in the top bits of the pointer. On memory access, the hardware checks that the lock matches (is equal to) the key. HWASan is similar, but it has an 8-bit lock, an 8-bit key and the checks are implemented in software using shadow memory. The general way that “lock and key” based mitigations protect against UAF and type confusion is that the key acts as a sort of object identity, so if the identity changes (e.g. via deallocation and reallocation), a different key is used and subsequent accesses using the old key will lead to a program crash.

With structure protection, the lock is not stored in a separate memory region but rather in the struct’s memory itself. The first structure protection technique that will be introduced as part of this work is known as pointer field protection (PFP). With pointer field protection, we observe that pointer bits above the maximum width of a virtual address are effectively unused, so we can use them to store the lock. Each pointer field in a struct with PFP has its own lock. Here is an example of a value that we may store to a struct’s pointer field, assuming a VA width of 48:

|63  ..  48|47     ..      0|
|   lock   |     pointer    |

Note that this is only the in-memory representation of the pointer field. When we load from the pointer field, the value that the program sees will end up looking like a conventional pointer:

|63  ..  48|47     ..      0|
|0000000000|     pointer    |

This is achieved by teaching the compiler to add code to remove the lock at load time, and insert the lock at store time. Let’s call the functions to insert and remove the lock InsertLock and RemoveLock respectively. Each struct, or even each field, may have its own pair of InsertLock and RemoveLock functions. To start with, let’s say that they take a single argument: InsertLock takes the pointer that we want to store and RemoveLock takes the pointer value that we loaded. Each implementation of these functions must have the following properties:

  • RemoveLock(InsertLock(ptr)) = ptr
  • Using a pointer derived from RemoveLock must cause a fault with a high probability if there is a mismatch between the InsertLock function and the RemoveLock function (implying a type confusion).

Let’s suppose that we use the functions:

  • InsertLock(ptr) = ptr XOR (hash(T, F) << 48)
  • RemoveLock(ptr) = ptr XOR (hash(T, F) << 48)

where hash(T, F) is a 16-bit hash computed using the name T of the struct containing the pointer field and the name F of the field. (These are just examples of functions that could be used. The functions used in the actual implementation will be introduced later.) Assuming that the hardware checks all 16 upper bits of the pointer, we can see that accesses with an incorrect type (type confusion) will fail with a false negative probability of 1 in 65536. Since the RemoveLock function needs to be called every time we load a pointer from a struct that is subject to PFP, a “check” is effectively placed after every load of a pointer from a struct subject to PFP. (RemoveLock is technically just the first part of the check. The second part is implemented by the hardware when it checks that the pointer is canonical when it is used for a load or store.)
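To make this concrete, here is a minimal, self-contained sketch of the XOR-based functions above. The hash values are made up (standing in for hash(T, F)); the point is only to show that a matching pair round-trips and a mismatched pair yields a non-canonical pointer:

#include <cstdint>
#include <cstdio>

// Hypothetical 16-bit per-field hashes; in a real implementation they would be
// derived from the struct name T and the field name F.
constexpr uint64_t kHashClsPtr = 0xBEEF; // e.g. hash("Cls", "ptr")
constexpr uint64_t kHashOther  = 0x1234; // some other struct/field pair

uint64_t InsertLock(uint64_t ptr, uint64_t hash) { return ptr ^ (hash << 48); }
uint64_t RemoveLock(uint64_t ptr, uint64_t hash) { return ptr ^ (hash << 48); }

int main() {
  uint64_t p = 0x00007fffdeadb000;                 // a canonical 48-bit user-space address
  uint64_t stored = InsertLock(p, kHashClsPtr);    // in-memory representation
  uint64_t ok  = RemoveLock(stored, kHashClsPtr);  // matching field: yields p again
  uint64_t bad = RemoveLock(stored, kHashOther);   // mismatched field: top 16 bits nonzero
  std::printf("ok  = %#llx\n", (unsigned long long)ok);  // dereferenceable
  std::printf("bad = %#llx\n", (unsigned long long)bad); // non-canonical, faults on use
}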

So far I have only presented a basic version of the idea. On some architectures we can also use architecture-specific features to increase the difficulty of creating a false negative.

The first architecture-specific feature that we can use exists on multiple architectures, and it allows certain pointer bits to be ignored during loads and stores. AArch64 calls it Top Byte Ignore, Intel calls it Linear Address Masking, AMD calls it Upper Address Ignore and RISC-V calls it Pointer Masking. With each of these features, we can store the key in the masked bits of the pointer to the struct, so that normal loads and stores are unaffected by the presence of the key. The key acts as an identity for the struct, so it allows detection of use-after-free bugs that do not involve a type confusion (this is also how HWASan and MTE work). So, assuming AArch64 TBI (bits 63:56 ignored), now a pointer looks like this in memory:

|63  ..  56|55  ..  48|47     ..      0|
|   key    |   lock   |     pointer    |

and this in a register:

|63  ..  56|55  ..  48|47     ..      0|
|   key    |0000000000|     pointer    |

A heap allocator must generate a random key for each allocation and insert it into the top bits of the pointer. It is not necessary for the allocator to do anything else with the key, such as storing it to shadow memory. Note that, as with the basic version of the idea, pointers returned by an allocator (or by any other function) do not include the lock as part of the pointer that they return, because the pointer is returned in a register, and the lock is only known when the pointer is stored to memory.
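For illustration, here is a minimal sketch of what such a tagging wrapper could look like (this is not the actual allocator change; the names are hypothetical and it assumes TBI-style hardware that ignores bits 63:56 on data accesses):

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <random>

void *tagged_malloc(std::size_t size) {
  static thread_local std::mt19937_64 rng{std::random_device{}()};
  void *raw = std::malloc(size);
  if (!raw) return nullptr;
  uint64_t key = rng() & 0xff; // 8-bit random key per allocation
  // Place the key in the ignored top byte; nothing is stored to shadow memory.
  return reinterpret_cast<void *>(reinterpret_cast<uint64_t>(raw) | (key << 56));
}

void tagged_free(void *ptr) {
  // Strip the key before handing the pointer back to the underlying allocator
  // (a real allocator would accept tagged pointers natively).
  std::free(reinterpret_cast<void *>(
      reinterpret_cast<uint64_t>(ptr) & 0x00ffffffffffffffull));
}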

Our InsertLock and RemoveLock functions must now take an additional argument, namely the address of the struct that was used to load the pointer, and we must ensure that RemoveLock(InsertLock(ptr, structptr), structptr) = ptr. Here are example implementations:

  • InsertLock(ptr, structptr) = ptr XOR ((structptr >> 56) << 48)
  • RemoveLock(ptr, structptr) = ptr XOR ((structptr >> 56) << 48)

With these functions, we effectively create a linked list of keys in memory. Assume a struct defined like so:

struct S {
  S *ptr;
};

A linked list of structs of type S will look like this:

|63  ..  56|55  ..  48|47     ..      0|
|   key1   |   lock   |     pointer1   |
                               |
+------------------------------+
V
|63  ..  56|55  ..  48|47     ..      0|
|   key2   |   key1   |     pointer2   |
                               |
+------------------------------+
V
|63  ..  56|55  ..  48|47     ..      0|
|   key3   |   key2   |     pointer3   |
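Here is a small model of this key/lock chaining in plain C++, with hard-coded keys standing in for the allocator’s random tags (illustrative only; real code never manipulates the bits by hand):

#include <cstdint>
#include <cstdio>

uint64_t InsertLock(uint64_t ptr, uint64_t structptr) {
  return ptr ^ ((structptr >> 56) << 48);
}
uint64_t RemoveLock(uint64_t ptr, uint64_t structptr) {
  return ptr ^ ((structptr >> 56) << 48);
}

int main() {
  // Two "allocations" with keys 0xA1 and 0xB2 in the TBI byte.
  uint64_t node1 = (0xA1ull << 56) | 0x00007f0000001000;
  uint64_t node2 = (0xB2ull << 56) | 0x00007f0000002000;

  // Storing node2 into node1's pointer field uses node1's key as the lock.
  uint64_t stored = InsertLock(node2, node1);
  std::printf("in memory:  %#llx\n", (unsigned long long)stored); // key2 | key1 | pointer2

  // Loading back through node1 recovers node2. If node1 is freed and the same
  // address is reallocated with a fresh key (0xC3), the stale load yields a
  // non-canonical pointer.
  uint64_t reused = (0xC3ull << 56) | 0x00007f0000001000;
  std::printf("good load:  %#llx\n", (unsigned long long)RemoveLock(stored, node1));
  std::printf("stale load: %#llx\n", (unsigned long long)RemoveLock(stored, reused));
}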

The second feature that we can use is the AArch64-specific Pointer Authentication feature. With this feature, we can also mix a global secret (the pointer authentication key) and the pointer value itself into the lock, use as many pointer bits as are available, and take advantage of a shorter instruction sequence at each pointer load/store compared to doing the computations in software. Now our functions look approximately like this, again assuming a 48-bit VA range:

  • InsertLock(ptr, structptr) = ptr[63:56] : PAHash(ptr, structptr, GlobalKey) : ptr[47:0]
  • RemoveLock(ptr, structptr) = ptr[63:56] : (if ptr[55:48] == PAHash(ptr, structptr, GlobalKey) then 0 else 1) : ptr[47:0]

or these functions if the allocator does not support pointer tagging:

  • InsertLock(ptr) = ptr[63:56] : PAHash(ptr, hash(T, F), GlobalKey) : ptr[47:0]
  • RemoveLock(ptr) = ptr[63:56] : (if ptr[55:48] == PAHash(ptr, hash(T, F), GlobalKey) then 0 else 1) : ptr[47:0]

and our in-memory pointer looks like this:

|63  ..  56|55  ..  48|47     ..      0|
|   key    |  PAHash  |     pointer    |
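For readers familiar with clang’s pointer authentication support, the tagged-allocator variant above corresponds roughly to the following sketch using the <ptrauth.h> intrinsics. This is an assumption-laden illustration, not the code the compiler emits; the actual implementation generates pacda/autda instructions directly, as shown in the disassembly in the Example section below.

#include <ptrauth.h>

void *InsertLock(void *ptr, void *structptr) {
  // Sign the data pointer with the DA key, using the struct address as the
  // discriminator (roughly: pacda ptr, structptr).
  return ptrauth_sign_unauthenticated(
      ptr, ptrauth_key_asda, reinterpret_cast<ptrauth_extra_data_t>(structptr));
}

void *RemoveLock(void *ptr, void *structptr) {
  // Authenticate with the same key and discriminator (roughly: autda ptr,
  // structptr); on mismatch the result is non-canonical and faults when used.
  return ptrauth_auth_data(
      ptr, ptrauth_key_asda, reinterpret_cast<ptrauth_extra_data_t>(structptr));
}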

Effectiveness

The following factors impact the effectiveness of pointer field protection:

  • The number of ignored address bits (e.g. 8 on AArch64) – this bounds the effectiveness of the UAF defense.
  • The number of non-canonical address bits – this bounds the effectiveness of forged pointer detection.
  • The fraction of exploit chains prevented by an inability to exploit a UAF by means of a pointer field.

The first two factors imply a low probability of a false negative: at most 1 in 256, and lower still in some cases (e.g. with a smaller VA size or with an exploit chain requiring access to multiple freed pointers). We plan to conduct a full internal evaluation of the last factor once the initial version of PFP is upstreamed.

Here is a breakdown of the threat model in various cases:

  • With the generic implementation and an allocator without support for pointer tagging, UAFs involving a type confusion between two fields of pointer type are caught deterministically, assuming no type hash collisions (probability 1 in 65536 of a collision) and in the absence of an active adversary. The latter caveat is needed because, for example, if a structure with a pointer field is reallocated as a structure with an attacker-controlled non-pointer field at the same address, pointer forging is possible.
  • With the generic implementation and an allocator with support for pointer tagging, all UAFs are detected with high probability, but the tradeoff is that it relies on secrecy of the locks, so it assumes the attacker doesn’t have a read primitive.
  • With the pointer authentication based implementation, a pointer cannot be forged except by chance because of the global secret, and a userspace read primitive cannot be used to disclose the global secret because it is stored only in CPU system registers and/or kernel memory.

Practicalities regarding the choice of InsertLock and RemoveLock functions

Going back to the original implementations of InsertLock and RemoveLock, which are needed for the generic implementation of PFP:

  • InsertLock(ptr) = ptr XOR (hash(T, F) << 48)
  • RemoveLock(ptr) = ptr XOR (hash(T, F) << 48)

These functions are suboptimal because they require relatively long instruction sequences and possibly an additional register to materialize the shifted hash constant and then apply the XOR. Instead, we can take advantage of the relatively common architectural support for add and subtract instructions that take an 8-bit immediate, combined with a rotate instruction that moves the required-to-be-zero high bits of the pointer into the low bits of the stored value. This gives us the following alternative implementations, each of which costs two short instructions with constants in immediate operands on common architectures; these are the functions that PFP actually uses in the generic implementation:

  • InsertLock(ptr) = ROL(ptr, 16) + hash(T, F)
  • RemoveLock(ptr) = ROR(ptr - hash(T, F), 16)

The rotation count of 16 (64-48) was chosen because 48 is a relatively common size for the VA range and it does not conflict with 6-bit LAM on x86_64. The maximum possible size of the VA range on x86_64 was recently increased to 57 bits; detection is still likely if an application uses more than 48 bits but the likelihood decreases as the application uses more VA space.
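A minimal sketch of the rotate-and-add encoding, with a made-up 8-bit hash; the mismatch case shows how a wrong hash ends up in the non-canonical top bits:

#include <cstdint>
#include <cstdio>

constexpr uint64_t kHash = 0x45; // hypothetical hash(T, F)

uint64_t Rol(uint64_t v, unsigned n) { return (v << n) | (v >> (64 - n)); }
uint64_t Ror(uint64_t v, unsigned n) { return (v >> n) | (v << (64 - n)); }

uint64_t InsertLock(uint64_t ptr) { return Rol(ptr, 16) + kHash; }
uint64_t RemoveLock(uint64_t v)   { return Ror(v - kHash, 16); }

int main() {
  uint64_t p = 0x00007fffdeadb000;
  uint64_t stored = InsertLock(p);
  // Matching hash: subtracting it restores the rotated pointer exactly.
  std::printf("round trip: %#llx\n", (unsigned long long)RemoveLock(stored));
  // Mismatched hash: the difference lands in bits 63:48, so the pointer is
  // non-canonical and faults when dereferenced.
  std::printf("mismatch:   %#llx\n", (unsigned long long)Ror(stored - 0x17, 16));
}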

It is possible to mix bits of the original pointer into the lock in such a way that pointer corruptions would be detected, but this comes at the cost of more instructions. We plan to implement an alternative mode of PFP that mixes the original pointer, after investigating the possible instruction sequences.

On standard layout types

PFP may not be applied to fields of structs that are standard-layout (this includes structs that may be shared with C). This is for two main reasons:

  1. It facilitates compatibility with precompiled C libraries and other tools that may assume that structs have a standard representation. (PFP implies a change to the C++ ABI, but it does not need to imply a change to the C ABI.)
  2. Even a restriction to a subset of standard layout types turns out to be unworkable. One option that we considered was to allow PFP just for non-trivially-destructible standard layout types (informally, “non-POD types”). However, even given this restriction, the rules of the language make it very difficult to distinguish a pointer to a field of pointer type from any other pointer to a pointer, or to know which fields must have PFP disabled when encountering particular constructs. To give one example of the difficulties, the code snippet below would need to disable PFP for the field B::c, but f is allowed to be compiled without complete types for A or B.
struct B {
  int *c;
  ~B();
};

struct A {
  B b;
};

int **f(A *a) {
  return (int **)a;
}

Because the current implementations of various standard library types, including std::unique_ptr, are standard layout, their pointers would not be protected by PFP under this rule. For this reason, we also propose to make various common standard library types non-standard layout when PFP is enabled. Although this is a C++ ABI change, it should be acceptable because PFP already changes the C++ ABI.

This could be done by, for example, declaring a class with two base class subobjects of the same type and having the standard library types inherit from it, taking advantage of the C++ rule in [class.prop] that a standard-layout class “has at most one base class subobject of any given type”, such as the following:

class __force_nonstandard_layout_base1 {};
class __force_nonstandard_layout_base2 : __force_nonstandard_layout_base1 {};
class __force_nonstandard_layout : __force_nonstandard_layout_base1, __force_nonstandard_layout_base2 {};

For the time being, we propose to add the above base class to certain libc++ types (only if PFP is enabled). However, we also intend to introduce an attribute that has the same effect. This would not only be more concise and self-documenting, especially for user code, but would also allow C structs to opt into PFP.
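As a sanity check, the following compiles as C++17 and shows the effect of mixing in the base class; my_unique_ptr is a hypothetical stand-in for a libc++ type:

#include <type_traits>

class __force_nonstandard_layout_base1 {};
class __force_nonstandard_layout_base2 : __force_nonstandard_layout_base1 {};
class __force_nonstandard_layout : __force_nonstandard_layout_base1,
                                   __force_nonstandard_layout_base2 {};

// Hypothetical smart-pointer-like type opting out of standard layout so that
// its pointer member becomes eligible for PFP.
template <class T>
class my_unique_ptr : __force_nonstandard_layout {
  T *ptr_ = nullptr;
};

// Two base class subobjects of type __force_nonstandard_layout_base1 violate
// the [class.prop] requirement quoted above.
static_assert(!std::is_standard_layout_v<my_unique_ptr<int>>, "");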

On trivially destructible and trivially copyable types

If the data structure is trivially destructible (and therefore possibly trivially copyable), we cannot use a variant of InsertLock and RemoveLock that takes a structptr argument, because the struct pointer may change outside of our control as a result of the structure being copied via memcpy.

I referred to trivially destructible (formally: “has a trivial, non-deleted destructor”) structs above. One might think that this could just read “trivially copyable”. However, to my surprise, I discovered that this does not work. This is because, by the rules of the language, it is possible for a struct A to be not trivially copyable, while another struct B which has A as a base or member is trivially copyable. This means that memcpying B, as allowed by the language, will invalidate A’s members.

Let’s recap the rule for whether a class is trivially copyable:

A trivially copyable class is a class:
(6.1) — where each copy constructor, move constructor, copy assignment operator, and move assignment operator (15.8, 16.5.3) is either deleted or trivial,
(6.2) — that has at least one non-deleted copy constructor, move constructor, copy assignment operator, or move assignment operator, and
(6.3) — that has a trivial, non-deleted destructor (15.4).

(This is from C++17. C++20 changed the wording to account for concepts but I don’t think the change was material for this issue.)

Suppose that A has a trivial CC, MC, CAO, and a non-trivial MAO. The non-trivial MAO causes it to be non-trivially-copyable. Then B, which has A as a member, introduces a deleted MAO. This hides A’s non-trivial MAO and causes B to be trivially copyable. Here is an example case extracted from the protobuf test suite (assuming the std::tuple from libc++):

#include <tuple>

class AnythingMatcher {
};

class PairMatcher {
 public:
  PairMatcher(int) : first_matcher_(0) {}

 private:
  const char *const first_matcher_;
  const AnythingMatcher second_matcher_;
};

PairMatcher s(0); // not trivially-copyable
std::tuple<PairMatcher> st(s); // trivially copyable!

To make this work, we need a property X where “not X” for a base class/field type implies both “not X” and “not trivially copyable” for derived classes and classes with a field of that type. “Trivially destructible” was identified as a suitable property X because of clause 6.3 above combined with the fact that a class cannot have a trivial destructor if its bases or fields have non-trivial destructors.
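Here is a minimal, self-contained reduction of the same surprise (checked with static_asserts; A plays the role of the non-trivially-copyable member, B the trivially copyable container):

#include <type_traits>

struct A {
  A() = default;
  A(const A &) = default;
  A(A &&) = default;
  A &operator=(const A &) = default;
  A &operator=(A &&) { return *this; } // user-provided: non-trivial move assignment
  ~A() = default;
};

struct B {
  const A a; // const member: B's assignment operators are implicitly deleted
};

// A is not trivially copyable (non-trivial move assignment), yet B, which
// contains A, is: B's deleted assignment operators satisfy "deleted or trivial".
static_assert(!std::is_trivially_copyable_v<A>, "");
static_assert(std::is_trivially_copyable_v<B>, "");
// Both are trivially destructible, which is why PFP keys off trivial
// destructibility rather than trivial copyability.
static_assert(std::is_trivially_destructible_v<A>, "");
static_assert(std::is_trivially_destructible_v<B>, "");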

Example

The following example program is used to show how loads and stores are compiled:

struct Cls {
  virtual ~Cls();
  long *ptr;
};

long *load(Cls *c) {
  return c->ptr;
}

void store(Cls *c, long *l) {
  c->ptr = l;
}

When targeting x86_64, the functions are compiled like this:

0000000000000000 <_Z4loadP3Cls>:
   0:	48 8b 47 08          	mov    0x8(%rdi),%rax
   4:	48 83 c0 45          	add    $0x45,%rax
                              // hash(T) = -0x45
   8:	48 c1 c0 30          	rol    $0x30,%rax
   c:	c3                   	ret

0000000000000000 <_Z5storeP3ClsPl>:
   0:	48 c1 c6 10          	rol    $0x10,%rsi
   4:	48 83 c6 bb          	add    $0xffffffffffffffbb,%rsi
                              // -0xffffffffffffffbb = -hash(T) = 0x45
   8:	48 89 77 08          	mov    %rsi,0x8(%rdi)
   c:	c3                   	ret

When targeting AArch64, they are compiled like this:

0000000000000000 <_Z4loadP3Cls>:
   0:	f9400408 	ldr	x8, [x0, #8]
   4:	dac11808 	autda	x8, x0
   8:	aa0803e0 	mov	x0, x8
   c:	d65f03c0 	ret

0000000000000000 <_Z5storeP3ClsPl>:
   0:	dac10801 	pacda	x1, x0
   4:	f9000401 	str	x1, [x0, #8]
   8:	d65f03c0 	ret

Structure protection vs HWASan and MTE

Here is a comparison of structure protection against HWASan and MTE:

  • Structure protection is intended only as a UAF mitigation, whereas HWASan and MTE also target buffer overflows. (Structure protection can mitigate some buffer overflows, but this is context-dependent.)
  • Unlike HWASan and MTE, it stores the “lock” in existing bits of storage that are already being fetched, so there is no additional pressure on the memory system. HWASan and MTE’s use of separate storage for locks has certain costs:
    • With HWASan, the shadow memory region is unassociated with the main memory at the memory system level; it has its own set of cache lines and TLB entries. This increases pressure on the cache and TLB and, together with explicit checks, typically makes HWASan unsuitable for production deployment due to high overhead.
    • With MTE, the tags are typically associated at the hardware level with the cache line and possibly the physical memory itself, so the TLB costs and most of the cache costs are avoided, but the system still needs to fetch tags to fill cache lines and (depending on the implementation) may need to reserve 3% of all memory for tag storage.
  • It scales linearly with the number of checks that are required, in terms of both runtime and memory overhead. With HWASan and MTE, the allocator must retag every memory region on deallocation and sometimes on allocation, and with MTE every access to a tagged region is usually checked (checks can be disabled with specific instructions, but these have their own costs), so enabling checks has a fixed cost even if we just want to check a few fields per struct. On the other hand, with structure protection, the per-allocation cost is very low (we just need to generate a random number per allocation), and we can use the compiler to only insert checks which provide high ROI.
  • Compared to MTE, it reduces the likelihood of a false negative. MTE has 4 key bits, while structure protection has 8, and has even more on AArch64 via mixing of a global secret. This implies a false negative probability of 1 in 256 or less.
  • On AArch64 it utilizes pointer authentication, which is currently available in more microarchitectures than MTE.
  • Structure protection only works for certain C++ struct types (i.e. non-standard layout types) as a consequence of avoiding modifications to the C ABI. HWASan and MTE are compatible with all object types.
  • Structure protection has a basic mode of operation which is architecture-independent.

Frequently Anticipated Questions

How do you handle code that takes the address of a struct field?

We must make sure that loads/stores via that pointer are equivalent to loads/stores via the struct field. The basic strategy for handling this situation is to disable PFP for fields whose address is taken (more precisely: fields whose address leaks outside of the translation unit). Fortunately, it is relatively uncommon for fields to have their address escape (in one large internal Google binary, there were 66875 PFP-eligible fields, of which 728, or around 1.1%, had escaping addresses). Less fortunately, determining whether a field’s address is taken requires whole-program information. We don’t need LTO to handle this, though: we can observe that all we need to do to disable PFP for a field is replace certain instructions with NOPs, and this is fairly easy for a linker to do. More details on how this will work are provided in the RFC for deactivation symbols.
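For example, the following (made-up) pattern escapes the address of Node::data, so PFP would have to be disabled for that field via its deactivation symbol:

struct Node {
  virtual ~Node() {}
  long *data;
};

long **escape(Node *n) {
  return &n->data; // the field's address leaves the function, so the raw
                   // (unencoded) representation must be used for Node::data
}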

Similarly to code that takes the address of a field, we must also consider code that uses offsetof. Code that uses offsetof to compute a field offset can be understood to mean that a pointer to the field will subsequently be formed by adding the offset to a struct address, so the presence of an offsetof call for a specific field anywhere in the program must disable PFP for that field. This is implemented by tracking all evaluations of offsetof in the ASTContext. Although offsetof is technically only defined for standard-layout types (which means that technically no special handling is necessary for offsetof, because standard layout types are not subject to PFP), we found that enough code uses offsetof on non-standard-layout types to make it necessary to implement this.

The same applies to code that forms a data member pointer, which is also permitted on non-standard-layout types. If a data member pointer is evaluated during code generation, that triggers PFP disablement for the field.
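A hypothetical example of both constructs (Packet is made up; note that Packet is not standard-layout, so the offsetof is only conditionally supported, yet such code exists in practice):

#include <cstddef>

struct Packet {
  virtual ~Packet() {}
  char *payload;
};

// Either of the following, anywhere in the program, disables PFP for Packet::payload.
std::size_t off = offsetof(Packet, payload); // offset later added to a Packet*
char *Packet::*mp = &Packet::payload;        // data member pointer to the field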

You’re adding latency to every pointer field load, won’t this slow everything down?

This was also our concern at the start of the project. According to public information, the latency of the AUT instruction on Apple M1 is 6-7 cycles, and the latency on Neoverse-N2 is 5 cycles. To begin with, I relied on the following intuition: memory loads are, on average, relatively high latency operations anyway, so adding a few more cycles is unlikely to make a significant difference. But the only way to be sure was to actually implement it and measure the performance difference. Using the prototype linked below, I collected the following result: QPS (queries per second) of a large realistic internal Google benchmark decreases by around 1.5-2% with PFP enabled, depending on the architecture and microarchitecture. These numbers may be considered acceptable by some users, but for users with more exacting performance requirements, we intend to implement techniques for reducing the overhead, by using profile data to disable PFP for frequently accessed fields (see also this thread).

We are also considering another “structure protection” technique, which is to use struct padding to store the “lock” for structs with padding bytes. In principle, this should move the check out of the critical path of loads, but its effect on overall performance will need to be studied.

How do you pass structs by value?

PFP is only intended to protect the heap, not the stack or values in registers. Therefore, pointer fields are decoded into regular pointers as part of the calling convention for structs passed and returned by value.

How are prebuilt/uninstrumented libraries handled? (e.g. libc, libraries written in non-C languages, etc.)

As long as the library itself exposes a C interface and does not depend on the C++ standard library, such libraries will work without problems. This is because, for interoperability among other reasons, structure protection only affects the C++ ABI and not the C ABI.

How it works

When Clang generates code that takes the address of a field that requires PFP, instead of using a normal GEP instruction, it emits a call to the llvm.protected.field.ptr intrinsic. For example, to load from a pointer field named field_name at offset 8 from a non-trivially-copyable struct named StructName, it may emit an intrinsic call and a load like this:

%1 = call ptr @llvm.protected.field.ptr(ptr %0, i64 8, metadata !"StructName.field_name", i1 true)
%2 = load ptr, ptr %1

The first argument is the struct pointer, the second is the field offset in bytes, the third is the field identifier (used for disabling PFP for address-taken fields) and the fourth is true if the struct is not trivially copyable. An instrumentation pass replaces the intrinsic with a regular GEP, and replaces load/store users with the required pointer manipulations following/preceding the load/store. If a user other than a load or store is encountered, it is considered to be escaping the address of the field and will trigger disabling PFP for the field.

Clang is also modified to disable certain optimizations on non-trivially-copyable structs with PFP fields, such as those inserting a call to memcpy.

To support globally disabling PFP for a particular field, instructions that sign or authenticate a pointer to be stored in a field are relocated using a deactivation symbol. Translation units that escape the address of a PFP field or contain an evaluation of offsetof or a member pointer for a PFP field will define a deactivation symbol for the field.

To support globals with PFP fields, Clang is extended to emit ptrauth constant expressions, which were originally introduced for the PAuth ABI, for pointer field members subject to PFP. The ptrauth constant expression is extended in the following ways:

  1. The address diversity form is extended to also add the discriminator to the address if it is given. This allows a negative number to be provided to subtract the field offset.
  2. The constant expression is extended to contain a reference to the deactivation symbol for the field if any.

When generating an initializer for a ptrauth constant as a global initializer, the compiler generates code for an IFUNC resolver which returns the value to be stored as the initializer for the field, and initializes the field using the resolver. This approach was chosen over using or extending the R_AARCH64_AUTH_* relocations added for the PAuth ABI for two main reasons:

  1. It is more flexible. For example, it easily allows the initializer to support deactivation symbols by using the deactivation symbol to relocate the instruction in the IFUNC resolver that signs the pointer. By comparison, a new relocation type would take two symbol operands (the referent and the deactivation symbol), which is unprecedented in ELF and would likely require fundamental extensions to the relocation format just for this niche requirement. It also enables support for using Emulated PAC to relocate ptrauth relocations (see link below). Furthermore, as we extend to other architectures lacking pointer authentication, we avoid needing to hardcode details of our pointer encoding into various dynamic loaders.
  2. It avoids a dependency on a new version of the dynamic loader that supports the PAuth ABI relocation types or any new relocation types/formats that would be introduced as part of (1).

The current version of LLD does not properly support this usage of IFUNC relocations (it introduces a PLT thunk for IFUNC references even when not strictly necessary); an LLD patch will be proposed to address this.

To optimize accesses to PFP fields, the SROA pass is taught to recognize the new intrinsic and to treat intrinsic calls passing the same arguments as references to the same field.

The full prototype may be found here and the fork of TCMalloc with pointer tagging support may be found here. The current prototype can do the following:

  • Build LLVM/Clang/MLIR and pass most tests
  • Build various foundational Google open source projects including ABSL/Protobuf/gRPC and pass their tests
  • Build most of the subset of Google’s internal codebase that is HWASan clean (i.e. the subset that is compatible with pointer tagging and does not contain UAF bugs detectable by running tests) and pass most of its tests

Most remaining issues are likely caused by undefined behavior in the code itself, such as code that uses memcpy to copy a non-trivially-copyable struct, or code that uses memset(0) over a struct to set its pointer fields to NULL.
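To make those failure modes concrete, here is an illustration with a made-up type (both patterns are exactly what the PFP representation breaks):

#include <cstring>

struct Holder {
  virtual ~Holder() {}
  int *p;
};

void broken(Holder *dst, const Holder *src) {
  // Copies a lock that was computed for src's address into dst, so a later
  // load of dst->p fails to authenticate.
  std::memcpy(dst, src, sizeof(Holder));
  // All-zero bytes are not the in-memory encoding of a null PFP field (and
  // this also clobbers the vptr), so dst->p is not a usable null pointer.
  std::memset(dst, 0, sizeof(Holder));
}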

Because deactivation symbols and global initializers are currently only implemented for Arm, the current prototype only works on other architectures for trivial programs.

To support running PFP programs on AArch64 machines lacking pointer authentication support, we also introduce Emulated PAC.

Next steps

I propose to upstream the prototype, after adding additional LLVM lit tests to cover the newly added code and addressing the remaining TODOs. To begin with, PFP will be enabled using an experimental Clang driver flag: -fexperimental-pointer-field-protection.

Special thanks

Dmitry Vyukov proposed the general concept of structure protection using unused bits as well as the idea of using struct padding, and provided some feedback. Qinkun Bao implemented the Clang attribute and wrote some of the lit tests.

This work couldn’t have been created, or at least would have been much more difficult, without the ability to use pointer authentication in LLVM and to use a debugger to record and replay programs on AArch64 with pointer authentication. So I would also like to thank Apple, the Asahi Linux team and the rr debugger team for their amazing work.


A specific question on the use of IFUNCs. Not sure if this is best done here (for more visibility) or in the LLD PR “ELF: Do not rewrite IFUNC relocations to point to PLT if no GOT/PLT needed” (llvm/llvm-project#133531).

Is the reason to use IFUNCs because they are run before static initialization? Just wondering whether something like the existing static initialization via .init_array could be used instead.

On the way that the mechanism works. Presumably the ifunc resolver places the value for the static initializer in the .got entry created (with the R_*_IRELATIVE that runs the resolver function). The static initialization code loads the value from the .got entry. As the result of the resolver is not a function no PLT entry is required as the initializing value is only used once.

The alternative is that R_*_IRELATIVE is applied to the field location directly? I have a vague memory of glibc throwing out R_ARM_IRELATIVE unless it was in .got (rejected in .got.plt). If you’re planning to use this, it could be worth seeing whether there are any dynamic linker restrictions on some targets.

Out of interest what if the field to be initialized is itself a pointer to an ifunc?

I’ve still got to go through this in detail, will likely have some more questions on this or the deactivation symbols. I’m particularly interested in how this will interact with the PAuthABI. On first reading I don’t think they are fundamentally incompatible, but there may be some interactions to do with code-pointers in structs that will need some thought.

I guess that if the address of any global struct is exported into the dynamic symbol table (address taken from outside the executable/shared-object) then Structure protection would need to be disabled for that struct.

I do not know why it implies a change to the C++ ABI. Did you say that earlier in the proposal and I just missed it, or does this refer to the example immediately below?

Is the reason to use IFUNCs because they are run before static initialization? Just wondering whether something like the existing static initialization via .init_array could be used instead.

Yes, we need this initialization to happen during relocation processing to avoid order of initialization issues. I considered initializing via .init_array with e.g. priority 0 but if we did that, that would force many data structures into .data (instead of .data.rel.ro). IFUNCs are applied while RELRO is writable so we wouldn’t have that problem.

On the way that the mechanism works. Presumably the ifunc resolver places the value for the static initializer in the .got entry created (with the R_*_IRELATIVE that runs the resolver function). The static initialization code loads the value from the .got entry. As the result of the resolver is not a function no PLT entry is required as the initializing value is only used once.

The alternative is that R_*_IRELATIVE is applied to the field location directly? I have a vague memory of glibc throwing out R_ARM_IRELATIVE unless it was in .got (rejected in .got.plt). If you’re planning to use this, it could be worth seeing whether there are any dynamic linker restrictions on some targets.

There would be an IRELATIVE applied to the field location directly, and the relocations end up in .rela.dyn. I’ve been doing all my development so far on glibc based systems and haven’t encountered any restrictions like that. Bionic doesn’t have any restrictions like this as far as I’m aware though admittedly I haven’t tried using this on Android yet (and musl doesn’t support ifuncs at all), so I think our bases are covered.

Out of interest what if the field to be initialized is itself a pointer to an ifunc?

Ouch, I didn’t consider that case. I suppose it could work by having the compiler-generated resolver generate a call to the original resolver… but then the compiler would need to know which symbols are ifuncs. Maybe one possibility would be to introduce a new relocation type to be used in the compiler-generated resolver that would either be relocated as BL target ; NOP to call the resolver (if target is an ifunc) or ADRP x0, target ; ADD x0, x0, :lo12:target to take its address (otherwise). Then the calling code would be written to expect either possibility (e.g. would save/restore x30).

I’m particularly interested in how this will interact with the PAuthABI. On first reading I don’t think they are fundamentally incompatible, but there may be some interactions to do with code-pointers in structs that will need some thought.

I think we can mostly make it work by only enabling PFP on fields of data pointer type when both features are enabled.

One subtlety is around how data pointers containing a reinterpret_casted function pointer are represented. With PAuth ABI these pointers are currently signed using IA with discriminator 0 and would not be able to be stored to a PFP field. I’m not sure of a good way to handle this case in general other than to use the no_field_protection attribute to opt out data pointer fields that will store casted function pointers, or to change the ABI so that function pointers casted to data pointers are not signed at all. (The pointers would still be signed in memory as long as they are stored to a PFP field.)

I guess that if the address of any global struct is exported into the dynamic symbol table (address taken from outside the executable/shared-object) then Structure protection would need to be disabled for that struct.

That’s not implemented and it would be an incomplete solution to the problem of compatibility between multiple DSOs because the address could be obtained in other ways such as by calling a function in the other DSO. At least for the time being, multiple DSOs are outside of the intended usage model for PFP, so the behavior of cross-DSO field access is undefined. I posted ideas in the deactivation symbol RFC about how we can make multiple DSOs work in general.

The C++ ABI change is due to the change to the representation of pointer fields.

“Specifies that a type is a standard layout type. Standard layout types are useful for communicating with code written in other programming languages.”

So you are preserving the C ABI by not changing the representations of pointers that would be shared with C.

You are applying it to pointers in structures that are only used in C++, which does change the C++ ABI.

My confusion was that I thought you were changing the pointer representation in all cases, which would change the C and C++ ABIs, but you are not doing this.

So “does not need to” means that we could apply this to those structures but then it would change the C ABI and there are costs associated with doing that.

Have you considered how debugging will work with this feature? I think the outlook is quite positive.

For Pointer Authentication and Top Byte Ignore on Linux, LLDB will ask the kernel for the mask registers and unconditionally remove all those bits every time you call a function via a pointer or you look up a symbol.

This is LLDB on a Linux machine without Pointer Authentication:

(lldb) process status -v
<...>
Addressable code address mask: 0xff00000000000000
Addressable data address mask: 0xff00000000000000
Number of bits used in addressing (code): 56

IIRC LLDB assumes Top Byte Ignore for all AArch64 Linux, then the kernel would give us the Pointer Authentication bits if there were any; there aren’t any here.

On hardware with Pointer Authentication:

(lldb) process status -v
Addressable code address mask: 0xff7f000000000000
Addressable data address mask: 0xff7f000000000000
Number of bits used in addressing (code): 49

That one nibble is 0x7 because the top bit is the sign bit I think. In userspace it will be zero anyway. Masking should replace all non-address bits with the value of that sign bit.

There is a setting you can change to emulate this:

(lldb) settings set target.process.virtual-addressable-bits 48
(lldb) process status -v
<...>
Addressable code address mask: 0xffff000000000000
Addressable data address mask: 0xffff000000000000
Number of bits used in addressing (code): 48

So I tried a small example that I think looks like the proposed scheme:

#include <stdint.h>

int example() { return 99; }

typedef int (*ExampleFnPtr)();

typedef struct {
  ExampleFnPtr fn_ptr;
} FnPtrStruct;

ExampleFnPtr mangle_ptr(ExampleFnPtr fn_ptr) {
  intptr_t int_ptr = (intptr_t)fn_ptr;
  int_ptr |= (intptr_t)0x1234 << 48;
  return (ExampleFnPtr)int_ptr;
}

int main()
{
  FnPtrStruct fps;
  fps.fn_ptr = mangle_ptr(example);
  return fps.fn_ptr();
}

On Top Byte Ignore only hardware, we can’t call the function or lookup the symbol, as you’d expect:

(lldb) p fps.fn_ptr
(ExampleFnPtr) 0x1234aaaaaaaaa714
(lldb) p fps.fn_ptr()

error: Expression execution was interrupted: signal SIGSEGV: address not mapped to object (fault address=0x34aaaaaaaaa714).

Then I change it to 48 virtual address bits:

(lldb) settings set target.process.virtual-addressable-bits 48
(lldb) process status -v
<...>
Addressable code address mask: 0xffff000000000000
Addressable data address mask: 0xffff000000000000
Number of bits used in addressing (code): 48
(lldb) p fps.fn_ptr
(ExampleFnPtr) 0x1234aaaaaaaaa714 (actual=0x0000aaaaaaaaa714 test.o`example at test.c:3:24)
(lldb) p fps.fn_ptr()

error: Expression execution was interrupted: signal SIGSEGV: address not mapped to object (fault address=0x34aaaaaaaaa714).

We are removing the non-address bits from the symbol when doing lookups but not doing the same when we try to call it. The call only removes the top byte.

I think this is a bug, because we should be able to call a function via a Signed Pointer as well. I will investigate.

Assuming that gets fixed, I think you could use this setting to debug the software only mode using LLDB. For AArch64 at least, other architectures might already use this mechanism but I’ve never tested it. We could use a test case like the one above to find out, given that it doesn’t require hardware features.

More complex than I thought (llvm/llvm-project#134247: “lldb cannot call a function via a pointer that includes non-address bits (that aren’t part of Top Byte Ignore)”), but the rest of the basic lldb commands do work.

Before I start: suppose we have something like

typedef void Fptr(void);

struct S1 {
  Fptr* fp1;
  Fptr* fp2;
};

struct S2 {
  Fptr* fp1;
};

extern void fn();
extern Fptr *init_fn();

// Initialized via .init_array calling _GLOBAL__sub_I_fp.cpp
// The call to init_fn() could be via the PLT
S1 s1 = { &fn, init_fn() };
// Initialized via relocation. 
S2 s2 = { &fn };
// Initialized via relocation.
Fptr *global_ptr = &fn;

In a structure protection world, I’m assuming that the ifunc initialization would only be used for S2. We would want to avoid a call to the PLT from within an ifunc resolver as that can crash if the dynamic loader hasn’t processed the PLT relocation yet.

For the case where fn in the example is an ifunc: IIUC, your proposed change to LLD would mean that, instead of the structure field being redirected to the iPLT entry for fn, the ifunc resolver writes directly into the place. At the moment this would happen for global variables like global_ptr in my example, as well as fields like S2::fp1 from s2.

In the ifunc resolver for s2’s S2::fp1 field, I thought that a possible alternative to your suggestion could be for the static linker to indirect via an iPLT entry if the target (fn in my example) is an ifunc. The address of the iPLT entry is known at static link time, so if that is “locked” and returned by the initializer then it should be independent of whatever the ifunc resolver for fn returns. However, this would not compare equal to a non-structure-field global like global_ptr, as that no longer indirects via a PLT entry (assuming your LLD change). I’m guessing this could be resolved by using a different relocation directive for the structure field initializer, for example R_AARCH64_ADDRINIT_ABS64, which, if targeting an ifunc, does not generate an iPLT entry, leaving the regular R_AARCH64_ABS64 relocations to behave as they do today with GNU ld and LLD. The advantage of indirecting via an iPLT is that we don’t need to call the ifunc resolver via a BL, potentially running it twice (or more) if there are other uses of the ifunc.

Hope that makes some kind of sense. I find it difficult to keep all the corner cases for ifuncs in my head.

Another possibility, assuming we can successfully detect when this happens at static link time, is to just give an “unsupported error” message and abort. My guess is that this case is very rare.

I’ve read through the whole proposal now, I’ve mentioned it to a few people internally as well.

Some of the feedback was along the lines of “That sounds interesting, it goes into implementation detail really quickly, and I’ll need to come back later when I’ve got more time (assuming that ever happens).”

Perhaps it would be worth a separate TL;DR post with a summary of the high-level feature, its limitations, and a comparison to existing UAF mitigation techniques. While not strictly necessary for LLVM developers, it could be useful for gaining support for adding the feature, as I expect you’ll need reviewers from clang, llvm, lld and libc++.

On the topic of gaining support for adding the feature. It would be good to know who this feature is aimed at, and what deployment could look like. For example:

  • Is this intended to be used in production as a mitigation, or as a debug tool to find UAF that’s easier to use than MTE/HWASAN?
  • How could this be deployed as an option for a clang user to use in their programs on a Linux Distro? It sounds like a structure protection aware libc++ might be needed?

Overall I like the idea of the feature. Some parts of the ifunc changes make me a bit nervous as ifuncs are almost always trouble, however that is a relatively small part of the implementation detail.


Yes, that’s all correct.

Yes, we were thinking about this. I think that what is needed is a new DWARF attribute on the field that marks it as a PFP field. If the debugger sees one of these attributes it would need to mask the signature when reading from the field. On AArch64 we could use the “Addressable data address mask” (minus the TBI bits) and I imagine that something similar would need to be added for other architectures. In your example you showed what happens when calling a function pointer but I think that ideally we would mask the signature when loading the pointer (matching the compiler behavior) so that the function call (or the dereference, or whatever else is done with the pointer) wouldn’t see it in the first place.

Where things get more complicated is writing to fields as this would need to take account of whether the struct is trivially destructible and whether there was a deactivation symbol in order to compute the correct signature. To begin with I think we can forbid writing to these pointers from the debugger.

Correct (with the proviso that S2 in your example is standard layout, so something would need to be added to make it non-standard layout); s1 is initialized dynamically (via .init_array) and that will continue to happen.

Correct.

Correct.

Because the instruction relocation referring to fn would exist in the compiler generated ifunc resolver for S2::fp1, fn would be replaced with an iPLT (even in the global_ptr initializer) because HAS_DIRECT_RELOC would be set by the instruction relocation. That happens with or without my change. Therefore, function pointer equality is still preserved. So I think this would work as long as “normal” (user-written) ifunc resolvers only ever return function pointers. Maybe this is fine for now but I would be slightly concerned about this restriction because we have had use cases for returning non-function-pointers from resolvers, such as __hwasan_shadow in the HWASan runtime.

This would be an alternative to my LLD change.

But I would not worry about the cost of calling the resolver multiple times. I would in fact argue that that change is likely an optimization in the general case because it trades off a tiny amount of startup time to call a resolver multiple times against the cost of indirecting every ifunc call via an iPLT even when not necessary. (Ifuncs are used if the function is expected to be called frequently, so I would expect the number of calls to be much larger than the number of startups.)

I’d rather not create a new relocation type for this because that would need to be repeated for every architecture. I suppose it would be fine with me to introduce a -z flag that eliminates iPLT indirection wherever possible and then the Clang driver could pass this when -fexperimental-pointer-field-protection is passed at link time.

It is rare for sure, and this case didn’t come up in our internal testing so far, but it is legal in C++ to initialize a field with (say) the address of a function like memset which is likely to be an ifunc, and features like FMV exist which automatically introduce ifuncs, so I think it needs to be supported somehow.

Thank you for reading and sharing the proposal and the feedback so far.

Maybe I’ll try to add a TL;DR section to the start of my initial post since that’s where people will look first. There are some things like this in the “Effectiveness” and “Structure protection vs HWASan and MTE” sections but this is perhaps burying the lede somewhat and it would be good to have this information highlighted.

This is intended as a low overhead production mitigation.

Maybe we could allow linking simple programs statically by shipping a separate libc++.a in the install directory under a separate path which would get added to -L by passing the right flag at link time. This would be useful for the sanitizers as well.

Getting this to work on arbitrary Linux distribution packages would be much more difficult, because the usage model assumes static linking and Linux distributions are usually dynamically linked. With the recent rise of immutable distributions, maybe there’s something that can be done by extending the idea of LLD partitions to “statically link” all binaries in the distribution together without the overhead of duplicate libraries. That would allow not only this but also many other kinds of ABI breakage, as well as avoiding dynamic linking overhead, but that’s a whole separate topic.

There will be changes in C++26 due to trivial relocatability. I’m not sure whether that will impact your solution.

This subject has come up in libc++ before, but we never implemented a solution. IIRC sanitizers were indeed one of the motivations.

Looking through the paper, my initial read is that I think my proposal is compatible with trivial relocatability, but I will think about it more to be certain. The paper requires that trivial relocations must be done by calling std::trivially_relocate, and we can have the compiler generate an implementation of that function that authenticates and re-signs any pointer fields belonging to non-trivially-destructible objects. This is basically the same as would be required for vtable pointers under the PAuth ABI as mentioned in the FAQ section 15.1.

I was discussing my proposal with @waffles_the_dog and we came across the following case, which resembles the case that prompted the implementation-defined carveout for polymorphic union members to support PAuth ABI:

struct Foo trivially_relocatable_if_eligible {
  Foo(const Foo &);
  public:
    size_t p;
  private:
    void *q;
};

struct Bar trivially_relocatable_if_eligible {
  Bar(const Bar &);
  public:
    void *p;
  private:
    size_t q;
};

struct MyType trivially_relocatable_if_eligible {
  virtual ~MyType();
  union U {
    Foo f;
    Bar b;
  } u;
};

void f(MyType *t1, MyType *t2) {
  std::trivially_relocate(t1, t1+1, t2);
}

Foo and Bar are non-trivially-copyable (which means that PFP will address-discriminate their pointers), and under P2786R13 MyType is required to be trivially relocatable because neither Foo nor Bar is polymorphic, but std::trivially_relocate does not have enough information to know how to trivially relocate the object, because it does not know which member is active and cannot authenticate and re-sign the correct pointer.

Oliver kindly offered to bring this issue up at the Sofia WG21 meeting so that we can hopefully relax the carveout to accommodate PFP.

Oh one thing for people to consider is that reference types trigger the object offset problem. When I was thinking about this originally I fixated on the explicit &someField example and was trying to think of ways an author could annotate them. It wasn’t until I was talking to @pcc that I actually thought about reference parameters and realized that they’re very common, syntactically invisible, and obviously are just syntactic sugar around &field.
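A small illustration of that point (made-up types): the call below escapes the address of Widget::data even though no address-of operator appears in the source.

struct Widget {
  virtual ~Widget() {}
  int *data = nullptr;
};

void consume(int *&p) { p = nullptr; } // reference parameter

void f(Widget *w) {
  // Binding the field to a reference parameter is syntactic sugar for
  // &w->data, so the field's address escapes invisibly.
  consume(w->data);
}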

So while I initially thought (hoped?) we could add some kind of annotation on specific use cases, that’s not actually practically possible.