A memory model for LLVM IR supporting limited type punning
Introduction
This post introduces a memory model for LLVM IR. Compared to previous approaches, the aim of the model is to support limited type punning between pointers and integers and allow memcpy
to be represented in LLVM without introducing a byte
type. The aim is not for this model to be adopted as-is as it is intentionally simplified for clarity, but instead to demonstrate that a weaker model that doesn’t need such a byte
type is possible.
Background
Miscompilations
I’ll provide a brief description of the issue. Currently LLVM IR treats values as both having provenance from an allocation and as an integer with no such information. This results in miscompilations.
For example,
if (x == y) {
return y;
}
could reasonably be transformed to
if (x == y) {
return x;
}
Integer optimizations would like to take advantage of this kind of transformations. However it does not hold if y
is a pointer as the comparison is partial for pointers as they contain additional provenance data unavailable at runtime. Disabling such optimizations would fix the issue, but we’d like to keep doing them for performance reasons.
Another way to solve this inconsistency is to ensure that integers have no provenance. That lets us keep the integer optimizations and we only have to disable them for pointer types. This however results in new problems. memcpy
written in C becomes undefined because the i8
used for char
can not be used to copy pointer values. memcpy
can also not be replaced with regular integer loads and stores. Such approaches to solve this inconsistency introduce a new byte
type to allow for these behaviors.
For more information I’d recommend the talk Byte types, or how to get rid of i8 abuse for chars in LLVM IR and Ralf Jung’s blog post series (part 1, part 2 and part 3).
Type punning
Type punning is the act of storing one type in memory and loading it as another type. This model focuses on type punning pointers as integers. This is an example of such:
void *ptr = malloc(1);
uintptr_t raw_value = *(uintptr_t*)&ptr;
Here ptr
stores a pointer, but we load it as an integer. The casts only serves to change the type of the pointer, but it does not itself cast the pointer to an integer.
This is in contrast to the following example where we do an explicit pointer to integer cast:
void *ptr = malloc(1);
uintptr_t address = (uintptr_t)ptr;
The address
here is not equivalent to the raw_value
in the earlier example. Consider a backend which encrypts pointers. raw_value
would still be encrypted, but address
would be decrypted by the explicit pointer to integer cast.
This model considers type punning, but does not specify casts between pointers and integers. memcpy
uses type punning so we do not need to consider such casts to allow a lowering of it.
Memory and values
An allocation consists of an address and a vector of bytes which is the storage of the allocation. alive
indicates if the allocation can still be accessed.
struct Allocation {
bool alive;
size_t address;
vector<Byte> storage;
};
We’ll call a permission to access some memory a capability. Pointer values in memory will carry such capabilities. This is the same concept as pointers having a provenance value, but I’ll borrow the capability terminology from the capability-based security field here.
In this model there’s only one form of capability, the permission to access an allocation. This makes our definition simple:
struct Capability {
shared_ptr<Allocation> allocation;
};
We’ll say that a capability is equal to another capability if the allocation
object they point to is the same.
A bit has an optional value
, which we’ll call the integer value. If the bit has no integer value we’ll say the bit has poison.
A bit also has an optional capability.
struct Bit {
optional<bool> value;
optional<Capability> capability;
};
LLVM values consists of an array of Bit
s. If any of the bits of a value have poison we say that the value has poison. If any of the bits of a value have a capability, we say that the value has a capability. If all of the bits have an integer value, we say that the value has an integer value, which can be formed by concatenating the integer values of the bits.
A byte is an array of 8 bits.
typedef Bit Byte[8];
Semantics
The poison
literal gives a value whose bits all have poison, but no capabilities.
Literal numbers have all integer values, but no capabilities.
alloca
creates a new allocation with the alive
field set to true
. A capability to it is returned in the capability bits of the result. The integer bits are non-deterministic and the same as the address of the allocation. The address of the allocation returned does not overlap with other allocations. It is undefined behavior if the address space cannot fit the new allocation. The last byte of the address space will not be included in an allocation’s range. When the function that called alloca
returns, the alive
field of the allocation will be set to false
.
Memory accesses require that the pointer used have all capability bits and all integer bits. All the capabilities of the bits must be equal, which mean that they all refer to the same allocation object. The alive
field of that allocation object must be true
. The base of the allocation is subtracted from the integer bits of the pointer as if they were both unbounded integers and the result is the index into the allocation’s storage
field. The access must not go outside the bounds of the storage
field.
Both load
and store
does not alter any bits, but simply moves them. The size of the accesses must be aligned to a byte. It is undefined behavior for these instruction not to follow the rules for memory access.
Access to memory requires you to have the right capability bits. These bits are only created by alloca
. This is what allows alias analysis to reason that memory operations whose pointer is without any dataflow dependency on the result of an alloca
does not alias the allocation created by that alloca
.
Conditional branch instruction like br
and switch
are undefined if their input has a capability or has poison. This being undefined for capabilities is quite important as that information is not available when lowered to real hardware, so we cannot make a control flow decision on it. We also cannot ignore the capability, we couldn’t do the transformation we saw in the background section in that case.
ptrtoint
and inttoptr
is not present in this model as it focuses on type punning and not casts between pointers and integers. There are however multiple known models for such casts that it can be extended with.
Integer operations including icmp
return poison
if there are any capability bits or if any bits have poison. Comparing pointers is intentionally unspecified as it can interact with the choice of model for casts between pointers and integers.
freeze
transforms any bits that have poison non-deterministically to integer bits, but leaves capability bits alone.
getelementptr
does not modify capability bits. It returns poison
if any of its inputs have poison. The validation of inbounds
pointers has an additional rule that the capabilities of the pointer’s bits must be equal. Otherwise the validation works as it currently does, with the allocation the capabilities refer to being the “allocated object”.
Implications
With these semantics, memcpy
can be represented by using plain i8
types. Lowering memcpy
to wider types such as i64
will no longer spread poison. Such memcpy
-like patterns can also be split and combined.
Programs which use type punning to store a pointer in an integer will have well defined behavior. An example of such in C is:
int *a = malloc(8);
uintptr_t *ptr_to_a = (uintptr_t*)&a;
uintptr_t ptr_to_a_as_int = *ptr_to_a;
*ptr_to_a = ptr_to_a_as_int;
*a = 1;
This assumes that strict aliasing is also disabled.
Modifying a pointer in integer form by integer operations does however not work:
int *a = malloc(8);
uintptr_t *ptr_to_a = (uintptr_t*)&a;
uintptr_t ptr_to_a_as_int = *ptr_to_a;
*ptr_to_a = ptr_to_a_as_int + 1; // Results in poison
*a = 1; // Undefined behavior
Compared to an approach which does not allow pointer bits in integers, this introduces new obligations for integer transformations. For example x
to x + 0
is not a legal transformation. This is because x
could have a capability which means x + 0
would be poison
where previously it was just x
. Additionally it is not allowed because if x
is partially poison, x + 0
becomes fully poison
. Integer operations can still be transformed given existing integer operations, so it’s unclear if it is problematic in practice.
Capabilities are non-monotonic, which means that transformations that introduce capabilities where there previously was none may introduce undefined behavior.
If it turns out that these new obligations indeed are problematic we can introduce a !nocap
attribute on load
instructions which cause them to return poison
if it would otherwise have any capability bits. This is something languages like Rust and C with strict aliasing (excluding char
) could do, but you do have to give up the ability to store pointers in integers and implementing memcpy
with regular integer types in the front-end language. This attribute is similar to the approach taken with !noundef
and freeze
.
Conclusion
We defined poison to be bitwise to prevent it from spreading in memory operations. We allowed integers to hold pointers if no integer operations are used on them. These are the key novel ideas of this model. I hope this shows that we can avoid introducing a byte
type, give defined behavior to more programs compared to an approach where integers cannot hold pointers, while still keeping our optimizations.