[RFC] Proposal for optimizations to transparently make data structures PIC-friendly

Hi all,

I’d like to propose an LLVM optimization for transparently changing the layout of globals to make them more PIC-friendly. The intention is to reduce relro memory by applying similar techniques done with relative vtables but for user-defined data structures. What I’d like are peoples’ opinions/thoughts on whether or not something like this is worthwhile investing in.

Motivation

.data.rel.ro is a data section that normally contributes to a lot of relro usage. Much of this section contains user-defined globals. It’s not uncommon for binaries to have large data structures like string tables, vtable-like objects, or generic lookup tables to other globals. It would be nice to apply the same technique for vtables to these globals. Much of the remainder of relro usage in Chromium and other Fuchsia binaries are from .data.rel.ro and much of this section is these globals.

Background

relro refers to “relocatable read-only”. This effectively refers to memory that is readonly after the dynamic linker resolves relocations. If your code is position-independent (PIC), then any “constant” globals that contain pointers to other globals will have relocations that the dynamic linker needs to resolve. This is because your PIC binary can be loaded almost anywhere into memory, so the linker doesn’t know where the addresses of these other globals are until all relevant binaries have been loaded, after which the dynamic linker can change protections on pages holding these binaries to readonly. An issue here is that these pages cannot be shared because they’re initially loaded on writable pages in order to patch in the relocations. The relro memory overhead comes from multiple processes loading the same binary onto writable pages even though .data.rel.ro globals are effectively readonly at runtime.

VTables are a good example of this. VTables are data structures that are readonly at runtime and only contain pointers to other globals (mainly virtual functions). The relative vtables algorithm involves replacing each 64-bit function pointer with a 32-bit offset between the vtable and its virtual function. If the vtable and virtual function are in the same binary, then this offset is always statically known because they travel together in the same binary regardless of where it’s loaded. When loading a virtual function from a vtable, rather than loading the 64-bit function pointer, we instead load the 32-bit offset, and add it back to the vtable address to get the virtual function address (let’s call this a “relative load”). Filling the vtable with statically known offsets moves the vtable to .rodata which can be loaded onto a pure readonly page.

Note that this changes the ABI. Because the vtable layout changes, any binary that wants to use a relative vtable must also be built with the relative vtables ABI. If you have the luxury of building your own “system binaries” (like libc++ or libc++abi), then this might not be an issue. Additionally, if your code doesn’t make any assumptions about the C++ ABI, then enabling this will be nearly transparent from a user perspective because vtables are compiler-emitted. That is, C++ developers don’t need to care about their layout as long as what’s guaranteed by the C++ standard “just works”.

Another example are lookup tables: they are generated by the compiler, and the table and other relevant symbols tend to be within the same binary.

Proposals

I would like to extend this idea of changing the data layout to user-defined constant globals to reduce .data.rel.ro. That is, replace pointers to other globals with offsets. To get a static offset, both symbols must be non-interposable and be defined in the same binary.

(1) A non-intrusive compiler IR transformation

Preconditions

The idea behind this is that users could use an opt-in compiler transform that transparently makes candidate globals PIC-friendly without needing to change any source code, but replacing global elements from pointers to offsets introduces ABI changes. In order to prevent breaking user code, an optimization that replaces a global’s layout must be conservative such that it doesn’t break user code or violate C++ language rules. In order to do this, the optimization would check for candidate globals before changing its layout. The conditions for a global to be PIC-friendly would be:

  1. We must be able to “see” all usages of this global. That is, it cannot be runtime or link-time interposable, and we know all references to it in code or other globals. Effectively, this means the global must have static linkage. If LTO is enabled, then the linkage can be global but its visibility must be hidden.
  2. All pointer components in the global must be to dso-local non-preemptible globals. This ensures that we know all pointees are dso-local symbols.
  3. All references to this global must be done through a series of GEPs followed by a load. Each of these GEPs can be replaced with the correct relative load. This also ensures that these globals don’t escape the current compilation unit via “rets” and we don’t instrument accesses done via bitcasts to i8 ptrs where pointer arithmetic on char/uint8_t ptrs and loads/stores are done manually.

Transformation

Once we’ve asserted a global meets these criteria, we can change its layout and usages via the following algorithm. Note that these steps assume pointers are 64-bit and a small code model (where symbols are assumed to be within 32-bit offsets of each other), but this can be extended to non-64-bit architectures:

  1. For each 64-bit pointer slot: if it is an integral constant, create a private unnamed_addr dso_local global whose value is the original constant. In ELF, this results in an anonymous mergeable static symbol that just contains a single word. Replace that pointer slot with a reference to this new global. (In practice, NULL is the only relevant constant here. We can call this null_proxy).
  2. Create one anonymous dso-local symbol that contains a single relocatable word: its own load address. Let’s call this load_addr_proxy. On ELF, this could effectively just be the load address of the binary itself (__ehdr_start), but I don’t know how to represent this in IR.
  3. For each 64-bit pointer slot, replace it with two 32-bit offsets:
    1. If the original slot was an integral constant (like NULL), the first 32-bit offset will be the offset from the proxy created for it (null_proxy) to the current location (.). The second 32-bit offset will be zero.
    2. Else, the original slot was a reference to some dso-local global (plus an optional addend). The first 32-bit offset will be the offset from the symbol created in step 2 (load_addr_proxy) to the current location to the global (.). The second 32-bit offset will be the offset from the original dso-local symbol we reference (plus any addends) to load_addr_proxy. On ELF, both are statically computable since all offsets mentioned are between locations in the same binary.
  4. When loading from this new PIC-friendly global, do the following:
    1. Load both of the two 32-bit offsets (as opposed to normally loading the 64-bit pointer directly, or loading a single 32-bit word in the “relative load”).
    2. Add the first 32-bit offset to the original pointer. This yields the address of the proxy (which is either null_proxy or load_addr_proxy in this example).
    3. Load the 64-bit value from this proxy. This yields either the original integral constant or the address of load_addr_proxy.
    4. Add the second 32-bit value back onto this loaded 64-bit value. This will be the final address we want. (If the original value was some constant like NULL, then the value we loaded from the previous step is the original constant, and we added zero to it in this step. If the original value was some offset into a global requiring a dynamic relocation, like &kGlobalArr[20], then the value we loaded from the previous step is the load address proxy, and we just added to it the offset between &kGlobalArr[20] and the load address proxy, resulting in &kGlobalArr[20]).

This would make something like:

const char * const symbols[] = {
  "Howdy",
  NULL,
  "Hallo",
};
void external(const char *const);
void loader(int i) {
  external(symbols[i]);
}

For x86_64 ELF, this would be lowered to

_Z6loaderi:
        movslq  %edi, %rax
        leaq    _ZL7symbols(%rip), %rcx
        movq    (%rcx,%rax,8), %rdi
        jmp     _Z8externalPKc@PLT

        .section        .data.rel.ro,"aw",@progbits
_ZL7symbols:
        .quad   .L.str
        .quad   0
        .quad   .L.str.1

        .section        .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "Howdy"
.L.str.1:
        .asciz  "Hallo"

But after the transformation, it will instead be lowered to:

        .section        .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "Howdy"
.L.str.1:
        .asciz  "Hallo"

// Step 1 - Make the integral proxies
.L.null_proxy:
        .quad   0

// Step 2 - Make the load address proxy
        .section        .data.rel.ro,"aw",@progbits
.L.load_addr_proxy:
        .quad .L.load_addr_proxy

// Step 3 - Replace the components of the global
        .section        .rodata,"a",@progbits
_ZL7symbols:
        // 3-1 - Replace ptr component with the two relative offsets
        .long   .L.load_addr_proxy - .
        .long   .L.str - .L.load_addr_proxy
        // 3-2 - Replace the integral ptr component with one offset and a zero
        .long   .L.null_proxy - .
        .long   0
        .long   .L.load_addr_proxy - .
        .long   .L.str.1 - .L.load_addr_proxy

// Step 4 - Replace the original load
        .section        .text
_Z6loaderi:                             # @_Z6loaderi
        leaq    _ZL7symbols(%rip,%rdi,8), %rcx      // rcx = offset into an array of now 32-bit offset pairs, rdi = index (int i)
        movslq  (%rcx), %rdx                        // Load the first 32-bit offset
        movslq  4(%rcx), %rdi                       // Load the second 32-bit offset
        addq    (%rcx,%rdx), %rdi                   // *(original ptr + first offset) + second offset
        jmp     _Z8externalPKc@PLT

This effectively moves all of the original table into rodata at the cost of one dynamic relocation for .L.load_addr_proxy per linkage unit. This requires three loads instead of one from a regular or relative load: one for the first offset, one for the second offset, and one for the actual relative load. The second load might not be as cycle-heavy though since both offsets are right next to each other and will likely be on the same cache line. An alternative method for reducing the number of loads is to instead do a 64-bit load, then truncate this value to get the first 32-bit offset and do a signed right-shift to get the second 32-bit offset. This replaces the second load with a truncate and a shift.

This should also ensure functions like offsetof, sizeof, and alignof return the same expected values.

(2) Language-level support

The above optimization has a few restrictions, namely (without LTO) it will not capture hidden globals, and it dismisses globals that can escape to other compilation/linkage units. This is where language-level support can help. If we have some way in the frontend to indicate across all source files that a particular global can be made PIC-friendly, then we can always control the codegen for accessing that global (ie. ensure accesses to it are all relative loads). In the frontend, we can emit the PIC-friendly layout for this data structure and ensure a correct relative load is always used. The only requirements would just be that the globals must be non-interposable and be defined in the same binary.

What I would like to implement is some language-level support like an attribute. For this example, we can call it __attribute__((relocatable)) and it would have these properties:

  • relocatable can either be added to a global declaration that is either a pointer pointing to another dso-local global, or an aggregate data structure like an array or struct that contains pointers to dso-local globals.
    • dso-local here means all globals involved have either hidden visibility or are static.
  • A pointer within a relocatable global indicates any pointer in it will contain the offset between that location in the global and the address it actually points to. Loading this pointer will involve a relative load which is dereferencing this address to get the offset, then adding it back to the original address.
  • relocatable will operate similarly to the type qualifiers const and volatile: it is not tied to a specific type, but rather some instance of data.
  • relocatable must only be used on const globals.
    • This is because we do not want to “store” an address in the global that is not dso-local. We would not be able to guarantee any offset between symbols in two separate binaries fit within our offset size.
  • Casting that removes or adds relocatable is allowed, but accessing a casted global that does not match the original relocatableness of the global’s original declaration is undefined behavior.
    • More specifically, casting relocatable out from a global originally declared with relocatable then accessing it results in UB, but if it’s casted back to relocatable, accesses into it are OK since the compiler will know to use relative loads.

From a user perspective, if someone wants to make their global PIC-friendly, they would only need to add this attribute/keyword to their global declaration and the entire structure would be PIC-friendly. This operates similarly to “relative pointers” in the Jai programming language.

Thoughts? These are high-level descriptions of what I would like to implement but wanted to get peoples’ thoughts before investing time into this and getting measurements. I skipped out on a couple of details/corner cases for the sake of length, but I can go over them in the comments.

How hard would it be to modify the scheme to compress pointers to 32 bits? If we’re assuming everything is in the same DSO anyway, we might as well take advantage of that to reduce binary size further.

On AArch64, “small” code model goes up to 4GB, so you can’t use negative offsets there, I think.

I believe representing pointers with 32 bits can be done already with ILP32 which is supported for x86 and arm and is supported by a bunch of platforms already, so modifying this I think would be mostly orthogonal and wouldn’t impact the idea too much. I believe reducing the pointer size could reduce relro usage by a factor of that reduction, but we’d still have the issue of pointers resulting in dynamic relocations (just of a smaller size).

I believe representing pointers with 32 bits can be done already with ILP32

Not what I was trying to get at.

Given the restrictions “We must be able to “see” all usages of this global” and " All references to this global must be done through a series of GEPs followed by a load" we can rewrite the layout of the global however we want; we don’t need to preserve the size/alignment/etc. of the original global because we’re going to rewrite all the uses anyway. Say you have:

static const char* const s[] = { "A", "B", "A", "A" }`
void external(const char *const);
void loader(int i) {
  external(symbols[i]);
}

You can rewrite to (in pseudo-C):

static const uint32_t sym = { "A"-__ehdr_start, "B"-__ehdr_start, "A"-__ehdr_start, "B"-__ehdr_start };
void external(const char *const);
void loader(int i) {
  external(__ehdr_start + sym[i]);
}

So the global is half the size (but we haven’t actually changed the size of a pointer).

Or actually, for that particular case, there are other possible rewrites available; for example:

static const char sym = { 0, 1, 0, 0 };
void external(const char *const);
void loader(int i) {
  external(sym[i] ? "A" : "B");
}

In this case, you’ve shrunk the pointers from 8 bytes to 1 byte.

Oh, I see. Thanks for the clarification. By compress pointers to 32 bits, you mean more their “underlying representation”. I’ve thought about this before and yes I believe they can be compressed further but only if a few extra conditions are met.

In your pseudo-C example, we can cut the size of sym in half because we know that every component in it can be reduced to a 32-bit static relocation. If this static relocation is simply the offset between two dso-local symbols, then the entire object can be 32-bit offsets. Where we might run into trouble is if one of the original string values was instead NULL (or some integer casted to a char ptr). So

static const char* const s[] = { "A", "B", NULL, "A" }`

Here, we can’t uniformly have the layout be offsets because NULL - __ehdr_start results in a 64-bit dynamic reloc. So we could technically compress the size further only if we know for a fact all of the pointer components are in fact pointers to other dso-local non-interposable symbols. This is one of the reasons we could do this with vtables since every virtual function points to some non-null symbol.

This is actually an extension to the transformation I would want to do incrementally: if we additionally know all pointer components point to a dso-local non-interposable global, then we can use 32 offset instead of the pair I proposed above, and loading would be just one relative load (just like with vtables). I don’t know if there are any language features though in either C or IR that quite represent this. nonnull or dereferencable might be the closest candidates. So if instead sym were an array of nonnull dereferencable pointers, then we know we can store one offsets 32 bit offset and do only one relative load.

In practice, I don’t know what distribution of constant globals there are that have only references to other globals, since it’s not uncommon for a string table to maybe have a rogue NULL in it. The schema I described above was meant to handle the more “general” case I saw where const globals could have pointers to other globals, but have some “integral pointers” sprinkled in.

If the concern is specifically null, you could use a sentinel. Something like:

static const uint32_t sym = { "A"-__ehdr_start, "B"-__ehdr_start, 0, "B"-__ehdr_start };
void external(const char *const);
void loader(int i) {
  external(sym[i] ? __ehdr_start + sym[i] : 0);
}

If a global can contain an arbitrary mixture of global pointers and integers, and some of the loads have variable offsets which mean you can’t prove whether some loads from the global are loading a global pointer or an integer, then I guess something like your proposed scheme makes sense.

I don’t know if there are any language features though in either C or IR that quite represent this.

Not sure I follow this. If we’re doing this as an optimization, we can see all the globals/loads; we don’t need any additional attributes. If doing it as a language feature, we always have to assume the worst case.

I’ve considered the sentinel idea. One thing I wanted to avoid though is explicit branches. Logically it’s possible to handle the NULL case without branching via clever bitmasking and shifting, but when I was experimenting with different ways of doing this they all added significantly more instructions. The extra layer of indirection I proposed I think simplifies any text bloat a lot.

I don’t know if there are any language features though in either C or IR that quite represent this.

Not sure I follow this. If we’re doing this as an optimization, we can see all the globals/loads; we don’t need any additional attributes. If doing it as a language feature, we always have to assume the worst case.

Oh yeah, this was meant more for mitigating the extra layer of indirection for language support. As an optimization where we would see all uses and confirm all loads for a global can be via one relative loads, then yes the attributes are unnecessary. As a language feature where we may not be able to see all uses, this idea was more for fine-tuning codegen. That is, if I have something like:

__attribute__((visibility("hidden"), relocatable)) const char* const s[4];
void external(const char *const);
void loader(int i) {
  external(symbols[i]);
}

Then the proposed scheme would be needed to handle cases where s could have constants like NULL, but if we had another indicator that said s only contained dynamic relocs (like __attribute__((visibility("hidden"), relocatable, only_contains_ptrs_to_globals)) const char* const s[4]), then the compiler would know it could safely just do one relative load:

void loader(int i) {
  external(__ehdr_start + sym[i]);
}

Since we’d know sym is in the form static const uint32_t sym = { "A"-__ehdr_start, "B"-__ehdr_start, "A"-__ehdr_start, "B"-__ehdr_start };