A memory model for LLVM IR supporting limited type punning

Introduction

This post introduces a memory model for LLVM IR. Compared to previous approaches, the aim of the model is to support limited type punning between pointers and integers and to allow memcpy to be represented in LLVM without introducing a byte type. The aim is not for this model to be adopted as-is, since it is intentionally simplified for clarity, but rather to demonstrate that a weaker model that doesn’t need such a byte type is possible.

Background

Miscompilations

I’ll provide a brief description of the issue. Currently, LLVM IR treats values both as having provenance from an allocation and as plain integers carrying no such information. This results in miscompilations.

For example,

if (x == y) {
   return y;
}

could reasonably be transformed to

if (x == y) {
   return x;
}

Integer optimizations would like to take advantage of this kind of transformation. However, it does not hold if y is a pointer: the comparison is partial for pointers because they carry additional provenance data that is unavailable at runtime. Disabling such optimizations would fix the issue, but we’d like to keep doing them for performance reasons.

Another way to solve this inconsistency is to ensure that integers have no provenance. That lets us keep the integer optimizations; we only have to disable them for pointer types. This, however, creates new problems. memcpy written in C becomes undefined because the i8 used for char cannot be used to copy pointer values. memcpy also cannot be replaced with regular integer loads and stores. Approaches that solve the inconsistency this way introduce a new byte type to allow for these behaviors.

For more information I’d recommend the talk Byte types, or how to get rid of i8 abuse for chars in LLVM IR and Ralf Jung’s blog post series (part 1, part 2 and part 3).

Type punning

Type punning is the act of storing one type in memory and loading it as another type. This model focuses on type punning pointers as integers. This is an example of such:

void *ptr = malloc(1);
uintptr_t raw_value = *(uintptr_t*)&ptr;

Here ptr stores a pointer, but we load it as an integer. The cast only serves to change the type of the pointer; it does not itself cast the pointer to an integer.

This is in contrast to the following example where we do an explicit pointer to integer cast:

void *ptr = malloc(1);
uintptr_t address = (uintptr_t)ptr;

The address here is not equivalent to the raw_value in the earlier example. Consider a backend which encrypts pointers. raw_value would still be encrypted, but address would be decrypted by the explicit pointer to integer cast.

This model considers type punning, but does not specify casts between pointers and integers. Since memcpy uses type punning, we do not need to consider such casts in order to give it a lowering.

Memory and values

An allocation consists of an address and a vector of bytes, which is the storage of the allocation. The alive field indicates whether the allocation can still be accessed.

struct Allocation {
   bool alive;
   size_t address;
   vector<Byte> storage;
};

We’ll call a permission to access some memory a capability. Pointer values in memory will carry such capabilities. This is the same concept as pointers having a provenance value, but I’ll borrow the capability terminology from the capability-based security field here.

In this model there’s only one form of capability, the permission to access an allocation. This makes our definition simple:

struct Capability {
   shared_ptr<Allocation> allocation;
};

We’ll say that a capability is equal to another capability if the allocation object they point to is the same.

A bit has an optional value, which we’ll call the integer value. If the bit has no integer value we’ll say the bit has poison.

A bit also has an optional capability.

struct Bit {
    optional<bool> value;
    optional<Capability> capability;
};

LLVM values consist of an array of Bits. If any of the bits of a value have poison, we say that the value has poison. If any of the bits of a value have a capability, we say that the value has a capability. If all of the bits have an integer value, we say that the value has an integer value, which can be formed by concatenating the integer values of the bits.

A byte is an array of 8 bits.

typedef Bit Byte[8];
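
As a rough C++ sketch of these definitions (the exact types and the bit order here are my own choices, not fixed by the model), a value’s integer value can be formed by concatenating the integer values of its bits, and fails to exist if any bit has poison:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

struct Allocation {
    bool alive = true;
    std::size_t address = 0;
};

struct Capability {
    std::shared_ptr<Allocation> allocation;
};

struct Bit {
    std::optional<bool> value;            // nullopt => this bit has poison
    std::optional<Capability> capability;
};

// A value "has an integer value" only if every bit has one; a single
// poison bit means no integer can be formed. Bit 0 is treated as the
// least significant bit in this sketch.
std::optional<std::uint64_t> integer_value(const std::vector<Bit>& bits) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (!bits[i].value) return std::nullopt;  // poison bit
        if (*bits[i].value) result |= std::uint64_t{1} << i;
    }
    return result;
}
```

Note that a value can have an integer value and a capability at the same time; the two are tracked independently per bit.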

Semantics

The poison literal gives a value whose bits all have poison, but no capabilities.

Literal numbers have all integer values, but no capabilities.

alloca creates a new allocation with the alive field set to true. A capability to the allocation is returned in the capability bits of the result. The integer bits of the result equal the address of the allocation, which is chosen non-deterministically. The address range of the new allocation does not overlap with other allocations. It is undefined behavior if the address space cannot fit the new allocation. The last byte of the address space will not be included in any allocation’s range. When the function that called alloca returns, the alive field of the allocation is set to false.

Memory accesses require that the pointer used has all capability bits and all integer bits. All the capabilities of the bits must be equal, which means that they all refer to the same allocation object. The alive field of that allocation object must be true. The base address of the allocation is subtracted from the integer bits of the pointer, as if both were unbounded integers, and the result is the index into the allocation’s storage field. The access must not go outside the bounds of the storage field.
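
These rules can be sketched as a single validity check in C++ (a self-contained sketch; the wrap-around subtraction stands in for the model’s unbounded-integer subtraction, and the names are illustrative):

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <optional>
#include <vector>

struct Allocation;  // forward declaration; capabilities point at allocations

struct Capability {
    std::shared_ptr<Allocation> allocation;
};

struct Bit {
    std::optional<bool> value;            // nullopt => poison
    std::optional<Capability> capability;
};

struct Allocation {
    bool alive = true;
    std::size_t address = 0;
    std::vector<std::array<Bit, 8>> storage;  // one entry per byte
};

// A load/store of `size` bytes through `ptr` is defined only if: every
// bit has an integer value and a capability, all capabilities refer to
// the same allocation, the allocation is alive, and the byte range
// [addr - base, addr - base + size) lies within storage.
bool access_ok(const std::vector<Bit>& ptr, std::size_t size) {
    std::shared_ptr<Allocation> alloc;
    std::size_t addr = 0;
    for (std::size_t i = 0; i < ptr.size(); ++i) {
        if (!ptr[i].value || !ptr[i].capability) return false;  // UB otherwise
        if (i == 0) alloc = ptr[i].capability->allocation;
        else if (alloc != ptr[i].capability->allocation) return false;
        if (*ptr[i].value) addr |= std::size_t{1} << i;
    }
    if (!alloc || !alloc->alive) return false;
    std::size_t index = addr - alloc->address;  // wraps if addr < base, then fails below
    return index <= alloc->storage.size() &&
           size <= alloc->storage.size() - index;
}
```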

Neither load nor store alters any bits; they simply move them. The size of an access must be a whole number of bytes. It is undefined behavior for these instructions not to follow the rules for memory accesses.

Access to memory requires you to have the right capability bits. These bits are only created by alloca. This is what allows alias analysis to conclude that memory operations whose pointer has no dataflow dependency on the result of an alloca do not alias the allocation created by that alloca.

Conditional branch instructions like br and switch are undefined if their input has a capability or has poison. Being undefined for capabilities is quite important: that information is not available when lowered to real hardware, so we cannot make a control-flow decision based on it. We also cannot simply ignore the capability; if we did, we couldn’t do the transformation we saw in the background section.

ptrtoint and inttoptr are not present in this model as it focuses on type punning rather than casts between pointers and integers. There are, however, multiple known models for such casts with which this model can be extended.

Integer operations including icmp return poison if there are any capability bits or if any bits have poison. Comparing pointers is intentionally unspecified as it can interact with the choice of model for casts between pointers and integers.

freeze non-deterministically transforms any bits that have poison into integer bits, but leaves capability bits alone.
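
A minimal sketch of this rule, with the non-determinism modelled by an ordinary RNG (only an approximation; the model leaves the choice fully unconstrained):

```cpp
#include <memory>
#include <optional>
#include <random>

struct Allocation { bool alive = true; };
struct Capability { std::shared_ptr<Allocation> allocation; };
struct Bit {
    std::optional<bool> value;            // nullopt => poison
    std::optional<Capability> capability;
};

// freeze: a poison bit becomes an arbitrary integer bit; capability
// bits and already-defined integer bits pass through untouched.
Bit freeze_bit(Bit b, std::mt19937& rng) {
    if (!b.value)
        b.value = std::uniform_int_distribution<int>(0, 1)(rng) != 0;
    return b;
}
```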

getelementptr does not modify capability bits. It returns poison if any of its inputs have poison. The validation of inbounds pointers has an additional rule that the capabilities of the pointer’s bits must be equal. Otherwise the validation works as it currently does, with the allocation the capabilities refer to being the “allocated object”.

Implications

With these semantics, memcpy can be represented by using plain i8 types. Lowering memcpy to wider types such as i64 will no longer spread poison. Such memcpy-like patterns can also be split and combined.

Programs which use type punning to store a pointer in an integer will have well defined behavior. An example of such in C is:

   int *a = malloc(8);
   uintptr_t *ptr_to_a = (uintptr_t*)&a;
   uintptr_t ptr_to_a_as_int = *ptr_to_a;
   *ptr_to_a = ptr_to_a_as_int;
   *a = 1;

This assumes that strict aliasing is also disabled.

Modifying a pointer in integer form by integer operations does however not work:

   int *a = malloc(8);
   uintptr_t *ptr_to_a = (uintptr_t*)&a;
   uintptr_t ptr_to_a_as_int = *ptr_to_a;
   *ptr_to_a = ptr_to_a_as_int + 1; // Results in poison
   *a = 1; // Undefined behavior

Compared to an approach which does not allow pointer bits in integers, this introduces new obligations for integer transformations. For example, x to x + 0 is not a legal transformation. This is because x could have a capability, in which case x + 0 would be poison where previously it was just x. It is also not allowed because if x is partially poison, x + 0 becomes fully poison. Integer operations can still be transformed where existing integer operations are present, so it’s unclear whether this is problematic in practice.
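
The obligation can be made concrete with a bit-level sketch of integer addition under the model (the struct names follow the post; the helper int_add and its fixed-width carry loop are illustrative):

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <vector>

struct Allocation { bool alive = true; };
struct Capability { std::shared_ptr<Allocation> allocation; };
struct Bit {
    std::optional<bool> value;            // nullopt => poison
    std::optional<Capability> capability;
};

// Integer add under the model: if any input bit is poison or carries a
// capability, the whole result is poison. This is why `x + 0` is not
// the same as `x` when x holds a pointer value. Assumes x and y have
// the same width.
std::vector<Bit> int_add(const std::vector<Bit>& x, const std::vector<Bit>& y) {
    std::vector<Bit> r(x.size());  // default Bits: all poison, no capabilities
    for (const auto& b : x) if (!b.value || b.capability) return r;
    for (const auto& b : y) if (!b.value || b.capability) return r;
    unsigned carry = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        unsigned s = unsigned(*x[i].value) + unsigned(*y[i].value) + carry;
        r[i].value = (s & 1) != 0;
        carry = s >> 1;
    }
    return r;
}
```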

Capabilities are non-monotonic, which means that transformations that introduce capabilities where there previously was none may introduce undefined behavior.

If it turns out that these new obligations are indeed problematic, we can introduce a !nocap attribute on load instructions which causes them to return poison if the result would otherwise have any capability bits. This is something languages like Rust, and C with strict aliasing (excluding char), could opt into, but you do have to give up the ability to store pointers in integers and to implement memcpy with regular integer types in the front-end language. This attribute is similar to the approach taken with !noundef and freeze.

Conclusion

We defined poison to be bitwise to prevent it from spreading in memory operations, and we allowed integers to hold pointers as long as no integer operations are used on them. These are the key novel ideas of this model. I hope this shows that we can avoid introducing a byte type and give defined behavior to more programs than an approach where integers cannot hold pointers, while still keeping our optimizations.


I would prefer you didn’t; it risks confusion with the capabilities in CHERI/Arm’s Morello, where this model will never be a thing: once broken up, pointers are invalid and you must re-derive them from some other still-valid pointer. Is there a reason not to just call this the provenance and avoid such confusion?

If you break uintptr_t arithmetic then good luck compiling real-world code.

If I’m following correctly, this is basically turning integer types into bytes in disguise. The primary difference between this and the byte proposal is that converting a byte to an integer is an implicit operation instead of an explicit one. We could implement essentially the same semantics for C code on top of an explicit byte type if we wanted: we just translate all load/store/etc. operations on C integers to byte types, then explicitly cast them to LLVM IR integer types when we need to perform an integer operation.

I’m a bit wary of the implicit conversions; it’s tricky for optimizations to reason about them. But maybe it’s worth avoiding the code churn involved in using explicit byte types.

I think in practice programmers do sometimes need integer arithmetic on pointers, for example to round them up to alignment. Is there a way we can preserve the capabilities of their bits? What would we lose, optimization-wise?

Note that this proposal only covers what happens if you type-pun pointer values in memory (convert a char** to a long*, then mess with the bits of the long). It doesn’t cover explicitly casting from char* to long and back to char*, which is what you’d normally think of as “integer arithmetic on pointers”.

Broadly speaking, there are two kinds of models of pointer provenance that allow integer arithmetic on pointers. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2577.pdf calls them “PVI” and “PVNI”. PVI says that integers have provenance; PVNI says they don’t, but explicit integer-to-pointer conversions give the resulting pointer provenance. This proposal implies a PVNI model.

I’d argue the opposite point. I think understanding that LLVM IR and C’s abstract machines have a capability system and how that system relates to CHERI’s capability system is important to understanding how CHERI works.

That does seem to be equivalent. It would probably have some compile-time impact so I don’t think it’s as practical as just lowering char to the byte type.

It kind of seems like integer optimizations can treat values as integers if they see existing integer operations. That seems like a precondition for useful optimizations anyway, so I wonder if any existing ones break under this model. I also cannot think of miscompilations as a result of integer optimizations ignoring these new obligations.

There are two known ways to support that.

The best one optimization-wise would be Rust’s new experimental strict provenance API. The addr method on pointers removes all capabilities, allowing the result to be used as an integer. You’d use with_addr to explicitly restore the capabilities. As a bonus you gain CHERI compatibility. It’s possible C and C++ will pick these up eventually too, but those languages don’t move as fast.

You can also use traditional casts between pointers and integers. You’d need a model which erases capabilities on pointer-to-integer casts. These don’t optimize as well, since you have to consider pointers cast to integers as escaping into the “integer domain”. This is what the existing ptrtoint and inttoptr are intended to support.

There are certain optimizations which straddle the gap between non-integer and integer. For example, x < 0 ? a : 0 → (x >> 31) & a is illegal under your proposal.

I figured control-flow had to be involved somehow. At least that example has to show that a is an integer already to rule out the case where a is poison.

I’m trying to fully understand your proposal. Here are a few checks:

(1) Does your proposal imply that the source program of “A miscompilation bug with unsigned char” (Issue #36817 · llvm/llvm-project · GitHub) must raise UB? If so, the C++ standard must be changed first, I guess.

(2) Is transforming select cond, x + 1, x into x + zext(cond) legal? There are a few InstCombine transformations that introduce arithmetic operations.

(3) In some instances, LLVM can introduce a branch on the comparison of integers based on pointers. Is loop versioning supported in this model?

for (i = 0; i < n; ++i)
  p[i] = q[i];
=>
if ((intptr_t)(q+n) <= (intptr_t)p ||
    (intptr_t)(p+n) <= (intptr_t)q) {
  // [p,p+n) and [q,q+n) are disjoint
  for (i = 0; i < n; i += 4) p[i..i+3] = q[i..i+3];
} else {
  for (i = 0; i < n; ++i) p[i] = q[i];
}

I guess that depends on how to lower things. Say the cast from unsigned char to int lowers to zext, then the result is poison and the program has UB. If the cast is considered to be some form of weird cast between a pointer and integer and suitably lowered, then it could be fine. The issue does lack some justification of why the C++ standard allows integer operations on pointer values however.

That is not legal in the case where cond = false and x has a capability. I do wonder now whether defining x + 0 and 0 + x as x would cause further problems.

This seems to just be regular pointer to integer casts, which the model doesn’t define, but should work with a suitable extension.

Hi,

In your example of:

if (x == y) {
   return y;
}

could reasonably be transformed to

if (x == y) {
   return x;
}

Just to add an alternative viewpoint on the pointer case.
As a C programmer who understands provenance, one might wish to state that one actually wants “return y” in all cases.
In that case your proposed optimizations take that level of control away from the programmer.
My idea would therefore be to change the C/C++ standards such that the programmer is given more control over pointer provenance. A step in this direction would be for the C programmer to be able to express pointer arithmetic without resorting to uintptr_t.
For example, let the C programmer express “pointer_align_up” and “pointer_align_down” in ways that result in LLVM IR that does not do any conversion from pointer to int and back again, thus allowing the compiler to make correct optimizations.
I think it might make compiler optimization cleaner if pointer arithmetic were given its own set of LLVM IR instructions, separate from the integer ones.

Then there is the memcpy aspect, where a pointer is saved in RAM, possibly memcpy’d somewhere else, and later loaded as a pointer. Preserving provenance across that is more complex, and I am not sure how one could express it in C/C++.
The C standard probably states that provenance is not preserved in this case, but if it can be preserved, extra optimizations may become available to the compiler.
There will of course be cases where provenance cannot be preserved, but increasing the number of cases where it can would improve optimizations.

As a C programmer, there are always several methods to achieve a goal, but if one can choose a method that preserves provenance, one can then know that the compiler is more likely to optimize it better.

We upstreamed this to Clang as __builtin_align_up/__builtin_align_down/__builtin_is_aligned. The only problem upstream is that these are lowered using ptrtoint, though the first two won’t use an inttoptr: they do a GEP to add the difference between the unaligned and aligned addresses to the original pointer.
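
For illustration, here is a self-contained sketch of that lowering pattern (align_up is a hypothetical helper written for this post, not Clang’s implementation): the integer cast is used only to compute the byte offset, while the result comes from pointer arithmetic on the original pointer, so it is lowered as a GEP and the pointer’s provenance survives.

```cpp
#include <cstdint>

// Round `p` up to `align` (a power of two). The uintptr_t cast only
// computes how many bytes to add; the result is produced by pointer
// arithmetic on `p` itself, so no inttoptr is needed and the original
// pointer's provenance/capability is preserved.
static char* align_up(char* p, std::uintptr_t align) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t offset = (0 - addr) & (align - 1);  // bytes to add
    return p + offset;  // a GEP after lowering
}
```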

Thank you. That adds a few more calculations, but by using “GEP to add the difference” one is achieving the aim of preserving provenance and controlling it from C.
There are some edge cases where it might not work on some targets, but this is something I can use, so thank you.

Yes, that is also how I would frame that proposal. (Though to be precise it is not actually a conversion, the representation is the same. It’s just that all bit sequences can be stored in an i64, but only some subset of those is actually supported by arithmetic and logical operations.) I am not sure what the LLVM community’s thinking on a byte type currently is (I recall originally it was received with a lot of skepticism), so this might be an alternative if a byte type is considered “too much”.

A model like the one I suggested here, for example. :wink:

That’s exactly how Rust’s new with_addr operation works. However, unfortunately this sometimes does not optimize as well in LLVM as the original code, even though it should give the compiler more information (see here).