A memory model for LLVM IR supporting limited type punning
Introduction
This post introduces a memory model for LLVM IR. Compared to previous approaches, the aim of the model is to support limited type punning between pointers and integers and allow memcpy to be represented in LLVM without introducing a byte type. The aim is not for this model to be adopted as-is as it is intentionally simplified for clarity, but instead to demonstrate that a weaker model that doesn’t need such a byte type is possible.
Background
Miscompilations
I’ll provide a brief description of the issue. Currently LLVM IR treats values as both having provenance from an allocation and as an integer with no such information. This results in miscompilations.
For example,
if (x == y) {
return y;
}
could reasonably be transformed to
if (x == y) {
return x;
}
Integer optimizations would like to take advantage of this kind of transformations. However it does not hold if y is a pointer as the comparison is partial for pointers as they contain additional provenance data unavailable at runtime. Disabling such optimizations would fix the issue, but we’d like to keep doing them for performance reasons.
Another way to solve this inconsistency is to ensure that integers have no provenance. That lets us keep the integer optimizations and we only have to disable them for pointer types. This however results in new problems. memcpy written in C becomes undefined because the i8 used for char can not be used to copy pointer values. memcpy can also not be replaced with regular integer loads and stores. Such approaches to solve this inconsistency introduce a new byte type to allow for these behaviors.
For more information I’d recommend the talk Byte types, or how to get rid of i8 abuse for chars in LLVM IR and Ralf Jung’s blog post series (part 1, part 2 and part 3).
Type punning
Type punning is the act of storing one type in memory and loading it as another type. This model focuses on type punning pointers as integers. This is an example of such:
void *ptr = malloc(1);
uintptr_t raw_value = *(uintptr_t*)&ptr;
Here ptr stores a pointer, but we load it as an integer. The casts only serves to change the type of the pointer, but it does not itself cast the pointer to an integer.
This is in contrast to the following example where we do an explicit pointer to integer cast:
void *ptr = malloc(1);
uintptr_t address = (uintptr_t)ptr;
The address here is not equivalent to the raw_value in the earlier example. Consider a backend which encrypts pointers. raw_value would still be encrypted, but address would be decrypted by the explicit pointer to integer cast.
This model considers type punning, but does not specify casts between pointers and integers. memcpy uses type punning so we do not need to consider such casts to allow a lowering of it.
Memory and values
An allocation consists of an address and a vector of bytes which is the storage of the allocation. alive indicates if the allocation can still be accessed.
struct Allocation {
bool alive;
size_t address;
vector<Byte> storage;
};
We’ll call a permission to access some memory a capability. Pointer values in memory will carry such capabilities. This is the same concept as pointers having a provenance value, but I’ll borrow the capability terminology from the capability-based security field here.
In this model there’s only one form of capability, the permission to access an allocation. This makes our definition simple:
struct Capability {
shared_ptr<Allocation> allocation;
};
We’ll say that a capability is equal to another capability if the allocation object they point to is the same.
A bit has an optional value, which we’ll call the integer value. If the bit has no integer value we’ll say the bit has poison.
A bit also has an optional capability.
struct Bit {
optional<bool> value;
optional<Capability> capability;
};
LLVM values consists of an array of Bits. If any of the bits of a value have poison we say that the value has poison. If any of the bits of a value have a capability, we say that the value has a capability. If all of the bits have an integer value, we say that the value has an integer value, which can be formed by concatenating the integer values of the bits.
A byte is an array of 8 bits.
typedef Bit Byte[8];
Semantics
The poison literal gives a value whose bits all have poison, but no capabilities.
Literal numbers have all integer values, but no capabilities.
alloca creates a new allocation with the alive field set to true. A capability to it is returned in the capability bits of the result. The integer bits are non-deterministic and the same as the address of the allocation. The address of the allocation returned does not overlap with other allocations. It is undefined behavior if the address space cannot fit the new allocation. The last byte of the address space will not be included in an allocation’s range. When the function that called alloca returns, the alive field of the allocation will be set to false.
Memory accesses require that the pointer used have all capability bits and all integer bits. All the capabilities of the bits must be equal, which mean that they all refer to the same allocation object. The alive field of that allocation object must be true. The base of the allocation is subtracted from the integer bits of the pointer as if they were both unbounded integers and the result is the index into the allocation’s storage field. The access must not go outside the bounds of the storage field.
Both load and store does not alter any bits, but simply moves them. The size of the accesses must be aligned to a byte. It is undefined behavior for these instruction not to follow the rules for memory access.
Access to memory requires you to have the right capability bits. These bits are only created by alloca. This is what allows alias analysis to reason that memory operations whose pointer is without any dataflow dependency on the result of an alloca does not alias the allocation created by that alloca.
Conditional branch instruction like br and switch are undefined if their input has a capability or has poison. This being undefined for capabilities is quite important as that information is not available when lowered to real hardware, so we cannot make a control flow decision on it. We also cannot ignore the capability, we couldn’t do the transformation we saw in the background section in that case.
ptrtoint and inttoptr is not present in this model as it focuses on type punning and not casts between pointers and integers. There are however multiple known models for such casts that it can be extended with.
Integer operations including icmp return poison if there are any capability bits or if any bits have poison. Comparing pointers is intentionally unspecified as it can interact with the choice of model for casts between pointers and integers.
freeze transforms any bits that have poison non-deterministically to integer bits, but leaves capability bits alone.
getelementptr does not modify capability bits. It returns poison if any of its inputs have poison. The validation of inbounds pointers has an additional rule that the capabilities of the pointer’s bits must be equal. Otherwise the validation works as it currently does, with the allocation the capabilities refer to being the “allocated object”.
Implications
With these semantics, memcpy can be represented by using plain i8 types. Lowering memcpy to wider types such as i64 will no longer spread poison. Such memcpy-like patterns can also be split and combined.
Programs which use type punning to store a pointer in an integer will have well defined behavior. An example of such in C is:
int *a = malloc(8);
uintptr_t *ptr_to_a = (uintptr_t*)&a;
uintptr_t ptr_to_a_as_int = *ptr_to_a;
*ptr_to_a = ptr_to_a_as_int;
*a = 1;
This assumes that strict aliasing is also disabled.
Modifying a pointer in integer form by integer operations does however not work:
int *a = malloc(8);
uintptr_t *ptr_to_a = (uintptr_t*)&a;
uintptr_t ptr_to_a_as_int = *ptr_to_a;
*ptr_to_a = ptr_to_a_as_int + 1; // Results in poison
*a = 1; // Undefined behavior
Compared to an approach which does not allow pointer bits in integers, this introduces new obligations for integer transformations. For example x to x + 0 is not a legal transformation. This is because x could have a capability which means x + 0 would be poison where previously it was just x. Additionally it is not allowed because if x is partially poison, x + 0 becomes fully poison. Integer operations can still be transformed given existing integer operations, so it’s unclear if it is problematic in practice.
Capabilities are non-monotonic, which means that transformations that introduce capabilities where there previously was none may introduce undefined behavior.
If it turns out that these new obligations indeed are problematic we can introduce a !nocap attribute on load instructions which cause them to return poison if it would otherwise have any capability bits. This is something languages like Rust and C with strict aliasing (excluding char) could do, but you do have to give up the ability to store pointers in integers and implementing memcpy with regular integer types in the front-end language. This attribute is similar to the approach taken with !noundef and freeze.
Conclusion
We defined poison to be bitwise to prevent it from spreading in memory operations. We allowed integers to hold pointers if no integer operations are used on them. These are the key novel ideas of this model. I hope this shows that we can avoid introducing a byte type, give defined behavior to more programs compared to an approach where integers cannot hold pointers, while still keeping our optimizations.