Hi all,
I’d like to propose an LLVM optimization for transparently changing the layout of globals to make them more PIC-friendly. The intention is to reduce relro memory by applying similar techniques done with relative vtables but for user-defined data structures. What I’d like are peoples’ opinions/thoughts on whether or not something like this is worthwhile investing in.
Motivation
.data.rel.ro is a data section that normally contributes to a lot of relro usage. Much of this section contains user-defined globals. It’s not uncommon for binaries to have large data structures like string tables, vtable-like objects, or generic lookup tables to other globals. It would be nice to apply the same technique for vtables to these globals. Much of the remainder of relro usage in Chromium and other Fuchsia binaries are from .data.rel.ro and much of this section is these globals.
Background
relro refers to “relocatable read-only”. This effectively refers to memory that is readonly after the dynamic linker resolves relocations. If your code is position-independent (PIC), then any “constant” globals that contain pointers to other globals will have relocations that the dynamic linker needs to resolve. This is because your PIC binary can be loaded almost anywhere into memory, so the linker doesn’t know where the addresses of these other globals are until all relevant binaries have been loaded, after which the dynamic linker can change protections on pages holding these binaries to readonly. An issue here is that these pages cannot be shared because they’re initially loaded on writable pages in order to patch in the relocations. The relro memory overhead comes from multiple processes loading the same binary onto writable pages even though .data.rel.ro globals are effectively readonly at runtime.
VTables are a good example of this. VTables are data structures that are readonly at runtime and only contain pointers to other globals (mainly virtual functions). The relative vtables algorithm involves replacing each 64-bit function pointer with a 32-bit offset between the vtable and its virtual function. If the vtable and virtual function are in the same binary, then this offset is always statically known because they travel together in the same binary regardless of where it’s loaded. When loading a virtual function from a vtable, rather than loading the 64-bit function pointer, we instead load the 32-bit offset, and add it back to the vtable address to get the virtual function address (let’s call this a “relative load”). Filling the vtable with statically known offsets moves the vtable to .rodata which can be loaded onto a pure readonly page.
Note that this changes the ABI. Because the vtable layout changes, any binary that wants to use a relative vtable must also be built with the relative vtables ABI. If you have the luxury of building your own “system binaries” (like libc++ or libc++abi), then this might not be an issue. Additionally, if your code doesn’t make any assumptions about the C++ ABI, then enabling this will be nearly transparent from a user perspective because vtables are compiler-emitted. That is, C++ developers don’t need to care about their layout as long as what’s guaranteed by the C++ standard “just works”.
Another example are lookup tables: they are generated by the compiler, and the table and other relevant symbols tend to be within the same binary.
Proposals
I would like to extend this idea of changing the data layout to user-defined constant globals to reduce .data.rel.ro. That is, replace pointers to other globals with offsets. To get a static offset, both symbols must be non-interposable and be defined in the same binary.
(1) A non-intrusive compiler IR transformation
Preconditions
The idea behind this is that users could use an opt-in compiler transform that transparently makes candidate globals PIC-friendly without needing to change any source code, but replacing global elements from pointers to offsets introduces ABI changes. In order to prevent breaking user code, an optimization that replaces a global’s layout must be conservative such that it doesn’t break user code or violate C++ language rules. In order to do this, the optimization would check for candidate globals before changing its layout. The conditions for a global to be PIC-friendly would be:
- We must be able to “see” all usages of this global. That is, it cannot be runtime or link-time interposable, and we know all references to it in code or other globals. Effectively, this means the global must have static linkage. If LTO is enabled, then the linkage can be global but its visibility must be hidden.
- All pointer components in the global must be to dso-local non-preemptible globals. This ensures that we know all pointees are dso-local symbols.
-
All references to this global must be done through a series of GEPs followed by a load. Each of these GEPs can be replaced with the correct relative load. This also ensures that these globals don’t escape the current compilation unit via “rets” and we don’t instrument accesses done via bitcasts to i8 ptrs where pointer arithmetic on
char
/uint8_t
ptrs and loads/stores are done manually.
Transformation
Once we’ve asserted a global meets these criteria, we can change its layout and usages via the following algorithm. Note that these steps assume pointers are 64-bit and a small code model (where symbols are assumed to be within 32-bit offsets of each other), but this can be extended to non-64-bit architectures:
- For each 64-bit pointer slot: if it is an integral constant, create a
private unnamed_addr dso_local
global whose value is the original constant. In ELF, this results in an anonymous mergeable static symbol that just contains a single word. Replace that pointer slot with a reference to this new global. (In practice,NULL
is the only relevant constant here. We can call thisnull_proxy
). - Create one anonymous dso-local symbol that contains a single relocatable word: its own load address. Let’s call this
load_addr_proxy
. On ELF, this could effectively just be the load address of the binary itself (__ehdr_start
), but I don’t know how to represent this in IR. - For each 64-bit pointer slot, replace it with two 32-bit offsets:
- If the original slot was an integral constant (like
NULL
), the first 32-bit offset will be the offset from the proxy created for it (null_proxy
) to the current location (.
). The second 32-bit offset will be zero. - Else, the original slot was a reference to some dso-local global (plus an optional addend). The first 32-bit offset will be the offset from the symbol created in step 2 (
load_addr_proxy
) to the current location to the global (.
). The second 32-bit offset will be the offset from the original dso-local symbol we reference (plus any addends) toload_addr_proxy
. On ELF, both are statically computable since all offsets mentioned are between locations in the same binary.
- If the original slot was an integral constant (like
- When loading from this new PIC-friendly global, do the following:
- Load both of the two 32-bit offsets (as opposed to normally loading the 64-bit pointer directly, or loading a single 32-bit word in the “relative load”).
- Add the first 32-bit offset to the original pointer. This yields the address of the proxy (which is either
null_proxy
orload_addr_proxy
in this example). - Load the 64-bit value from this proxy. This yields either the original integral constant or the address of
load_addr_proxy
. - Add the second 32-bit value back onto this loaded 64-bit value. This will be the final address we want. (If the original value was some constant like
NULL
, then the value we loaded from the previous step is the original constant, and we added zero to it in this step. If the original value was some offset into a global requiring a dynamic relocation, like&kGlobalArr[20]
, then the value we loaded from the previous step is the load address proxy, and we just added to it the offset between&kGlobalArr[20]
and the load address proxy, resulting in&kGlobalArr[20]
).
This would make something like:
const char * const symbols[] = {
"Howdy",
NULL,
"Hallo",
};
void external(const char *const);
void loader(int i) {
external(symbols[i]);
}
For x86_64 ELF, this would be lowered to
_Z6loaderi:
movslq %edi, %rax
leaq _ZL7symbols(%rip), %rcx
movq (%rcx,%rax,8), %rdi
jmp _Z8externalPKc@PLT
.section .data.rel.ro,"aw",@progbits
_ZL7symbols:
.quad .L.str
.quad 0
.quad .L.str.1
.section .rodata.str1.1,"aMS",@progbits,1
.L.str:
.asciz "Howdy"
.L.str.1:
.asciz "Hallo"
But after the transformation, it will instead be lowered to:
.section .rodata.str1.1,"aMS",@progbits,1
.L.str:
.asciz "Howdy"
.L.str.1:
.asciz "Hallo"
// Step 1 - Make the integral proxies
.L.null_proxy:
.quad 0
// Step 2 - Make the load address proxy
.section .data.rel.ro,"aw",@progbits
.L.load_addr_proxy:
.quad .L.load_addr_proxy
// Step 3 - Replace the components of the global
.section .rodata,"a",@progbits
_ZL7symbols:
// 3-1 - Replace ptr component with the two relative offsets
.long .L.load_addr_proxy - .
.long .L.str - .L.load_addr_proxy
// 3-2 - Replace the integral ptr component with one offset and a zero
.long .L.null_proxy - .
.long 0
.long .L.load_addr_proxy - .
.long .L.str.1 - .L.load_addr_proxy
// Step 4 - Replace the original load
.section .text
_Z6loaderi: # @_Z6loaderi
leaq _ZL7symbols(%rip,%rdi,8), %rcx // rcx = offset into an array of now 32-bit offset pairs, rdi = index (int i)
movslq (%rcx), %rdx // Load the first 32-bit offset
movslq 4(%rcx), %rdi // Load the second 32-bit offset
addq (%rcx,%rdx), %rdi // *(original ptr + first offset) + second offset
jmp _Z8externalPKc@PLT
This effectively moves all of the original table into rodata at the cost of one dynamic relocation for .L.load_addr_proxy
per linkage unit. This requires three loads instead of one from a regular or relative load: one for the first offset, one for the second offset, and one for the actual relative load. The second load might not be as cycle-heavy though since both offsets are right next to each other and will likely be on the same cache line. An alternative method for reducing the number of loads is to instead do a 64-bit load, then truncate this value to get the first 32-bit offset and do a signed right-shift to get the second 32-bit offset. This replaces the second load with a truncate and a shift.
This should also ensure functions like offsetof
, sizeof
, and alignof
return the same expected values.
(2) Language-level support
The above optimization has a few restrictions, namely (without LTO) it will not capture hidden globals, and it dismisses globals that can escape to other compilation/linkage units. This is where language-level support can help. If we have some way in the frontend to indicate across all source files that a particular global can be made PIC-friendly, then we can always control the codegen for accessing that global (ie. ensure accesses to it are all relative loads). In the frontend, we can emit the PIC-friendly layout for this data structure and ensure a correct relative load is always used. The only requirements would just be that the globals must be non-interposable and be defined in the same binary.
What I would like to implement is some language-level support like an attribute. For this example, we can call it __attribute__((relocatable))
and it would have these properties:
-
relocatable
can either be added to a global declaration that is either a pointer pointing to another dso-local global, or an aggregate data structure like an array or struct that contains pointers to dso-local globals.- dso-local here means all globals involved have either hidden visibility or are static.
- A pointer within a
relocatable
global indicates any pointer in it will contain the offset between that location in the global and the address it actually points to. Loading this pointer will involve a relative load which is dereferencing this address to get the offset, then adding it back to the original address. -
relocatable
will operate similarly to the type qualifiersconst
andvolatile
: it is not tied to a specific type, but rather some instance of data. -
relocatable
must only be used onconst
globals.- This is because we do not want to “store” an address in the global that is not dso-local. We would not be able to guarantee any offset between symbols in two separate binaries fit within our offset size.
- Casting that removes or adds
relocatable
is allowed, but accessing a casted global that does not match the originalrelocatable
ness of the global’s original declaration is undefined behavior.- More specifically, casting
relocatable
out from a global originally declared with relocatable then accessing it results in UB, but if it’s casted back torelocatable
, accesses into it are OK since the compiler will know to use relative loads.
- More specifically, casting
From a user perspective, if someone wants to make their global PIC-friendly, they would only need to add this attribute/keyword to their global declaration and the entire structure would be PIC-friendly. This operates similarly to “relative pointers” in the Jai programming language.
Thoughts? These are high-level descriptions of what I would like to implement but wanted to get peoples’ thoughts before investing time into this and getting measurements. I skipped out on a couple of details/corner cases for the sake of length, but I can go over them in the comments.