Does LLVM support allocations that grow or shrink?

Alternative titles: Does LLVM do optimizations that are incompatible with mmap? Can we document that particular uses of mmap are fine with LLVM?

One reasonable use of mmap is to implement something that is equivalent to an allocation that grows and shrinks – first reserve a chunk of address space (but don’t map any pages there), and then use more mmap calls to allocate and free pages inside that range as memory demand comes and goes. I am assuming here that pages only get (un)mapped at the end – i.e., there’s some contiguous set of pages starting at the base address that was returned by the initial reservation, and the only thing that changes is the end of this region of mapped pages. (It is certainly interesting to discuss generalizations of this where pages can also be unmapped at the beginning or in the middle, but let’s not get ahead of ourselves.) To my knowledge, C programmers will generally assume that this is something one can do, and we’d like to officially support it in Rust as well.

The trouble is, it is unclear whether LLVM is actually compatible with this kind of pattern. LLVM obviously doesn’t know anything about mmap, but LLVM does know about the concept of an “allocation” or “allocated object” – and if we want to freely access the currently mapped range of our growing-and-shrinking allocation, we need to be careful not to run afoul of assumptions LLVM is making here.

(Can we avoid this region being one large allocated object that changes its size? I don’t think so. Allocated objects have to be backed by memory that’s actually mapped – LLVM assumes we can load from an allocated object any time and that will never trap. We could consider each mmap that actually maps pages to create its own allocated object, but that would mean we cannot do pointer arithmetic across the boundaries of these blocks, which is not an acceptable restriction.)

I was told before by @nlopes that in fact LLVM does not support this pattern. If that is indeed the current consensus then I would ask what it would take to make LLVM support this pattern – as I said above, this is something we’d really like to support in Rust. Currently we have to tell people that they can only use mmap in very restrictive ways, which is unfortunate, but we’re apparently limited by LLVM.

There are two places where I see potential issues.

getelementptr inbounds

This operation requires the pointer to remain in-bounds of some allocated object. If allocated objects can change their size, then does this mean pointers have to be in-bounds of the current size of the object? That would make moving getelementptr inbounds down across a shrinking operation incorrect, as the offset may not be inbounds any more after shrinking. Maybe it has to be inbounds of the largest size the object ever had? Then moving getelementptr inbounds down will always be possible, but moving it up across a growing operation would not work.

Or maybe the correct way to view this is something entirely different, like – allocated objects correspond to reservations in the address space that may or may not actually be backed by pages right now, and the fact that an address is inbounds of a live allocated object does not imply that the address is actually mapped?

This would be elegantly solved by using a “nowrap” operation for pointer arithmetic instead of an “inbounds” operation, but that’s a larger change that so far didn’t get much traction. Still, it would very easily answer almost all subtle questions about getelementptr inbounds, giving it clear semantics while only requiring minimal changes to existing optimizations (if I recall correctly).

dereferenceability

Consider the following pseudocode:

%val1 = load i32, ptr %ptr ; %ptr is dereferenceable for 4 bytes here
%_ = call ...
%val2 = load i8, ptr %ptr
if (...) {
  %val3 = load i32, ptr %ptr
  ...
}

Can %val3 be hoisted out of the if?

  • If allocations can never shrink, then the answer is yes – the fact that, after the call, %ptr remains dereferenceable for one byte implies that the allocation has not been freed, so it is still dereferenceable for 4 bytes.
  • However, if the call shrinks the allocation, then maybe after the call %ptr points to 1 byte of still-existing memory, followed by a page boundary and 3 bytes on a page that no longer exists, so hoisting %val3 would introduce a page fault.

Does LLVM currently do any sort of reasoning like this, where dereferenceability is carried across function calls based on “a big access happened before the call and a small access after the call, therefore the big access must also still be legal after the call”? If yes, then that seems fundamentally incompatible with the idea of shrinking an allocation, and therefore with unmapping no-longer-needed pages at the end of such a growing-and-shrinking allocation.

If LLVM does not currently do such reasoning, can we get this documented in the LangRef so that frontends can rely on it?


Even though this is almost verbatim from what’s in the doc, it’s a misconception (and perhaps the doc should be updated).

The gep operation does not know if the object is allocated, freed or even valid. It does not need to know, as it only performs the address calculation (offset from a base pointer), not the actual load.

See The Often Misunderstood GEP Instruction — LLVM 19.0.0git documentation

Not quite. There is no past or current object sizes, as gep operates on compile-time types. If you access two different objects from the same pointer location, you’ll need two different geps operating on two different struct types (or different offsets on the same opaque pointer).

If you only use pointers, there is no way to calculate inbounds. If you use structs or arrays of structs, it’s only meaningful to calculate inbounds of a “potential” object of that type inside an array that is expected to be still valid.

To “know” the object size, or whether it’s still allocated, you need runtime information or a very clever compile-time semantics – neither of which is available to LLVM IR (it’s too low level).

All of that is controlled by the runtime. LLVM can only trust you’re doing “the right thing”.

This has been discussed in the context of MLIR, but only because we control semantics at that level.

To do this at the LLVM level you’d need to create operations / intrinsics / address spaces that are known to have certain properties (e.g. bounds, liveness, containership), which is a big diversion from the current design, so it is unlikely to work upstream.

This looks similar to pointer provenance and capability control. Take a look at the CHERI papers.

@davidchisnall

As a proof-of-concept, I suggest you try some special intrinsics whose semantics a pass could infer using some global analysis, relying on gep only for the address calculation (as it was intended). Long term, you should also take a look at MLIR as a way to keep those semantics in a dialect and then lower to LLVM at the end.

I think it was also noted that “growing” allocations (i.e. treating “grown”/“shrunk” object via mmap as the same allocation and not a new one) are not compatible with arches, like CHERI: Clarifying GEP semantics - #22 by jrtc27

Link to a previous discussion on this same subject:

I’ll also link an old miscompile bug report, caused by “dereferenceable” attribute’s incorrect semantics w.r.t. C++ object model:

The document you reference is very clear about this and IMO supports my interpretation of GEPi semantics:

“With the inbounds keyword, the result value of the GEP is poison if the address is outside the actual underlying allocated object and not the address one-past-the-end.”

This means the result of GEP depends on the size of the actual underlying allocated object. It doesn’t depend on whether the object has been freed already, but it does depend on there being such an object and what its size is.

How else could this sentence be interpreted?
(Cc @nikic, with whom I had past discussions about GEPi and we seemed to agree on how it is specified.)

GEP is a runtime operation. I am asking about its runtime semantics. I am not asking about the compile-time inferences that can be deduced from those – they are a secondary, derived property. The primary behavior of GEP is entirely defined in terms of its runtime effect on the runtime state of the LLVM Abstract Machine. (This means things like poison are real, distinct values during the execution, i.e. this is different from the runtime behavior on the final hardware, where some distinctions have been erased. To ensure there is a coherent view of what an LLVM program does, it is crucial that there exists a precise way to specify its dynamic semantics, including all the poison and UB cases, that all optimizations are consistent with.)

A different way to put this is to consider an example:
(I hope I got the syntax for just adding a byte offset to a pointer with GEPi right.)

%ptr = call my_alloc ... ; calls `mmap` to map a single page of memory
%ptr_inner1 = getelementptr inbounds i8, ptr %ptr, i64 8000 ; returns poison because 8000 is OOB?
call my_grow ... ; calls `mmap` to map a second page of memory after the first
%ptr_inner2 = getelementptr inbounds i8, ptr %ptr, i64 8000 ; now this returns a usable pointer as the allocation is big enough?
%val = load i8, ptr %ptr_inner2
call my_shrink ... ; calls `munmap` to unmap the second page
%ptr_inner3 = getelementptr inbounds i8, ptr %ptr, i64 8000 ; now this returns poison again?

Of course the compiler can’t know what mmap and munmap do, but the compiler can make assumptions like “getelementptr can always be moved up/down across other calls”, which are relevant for programs where the compiler has partial information. So the question is whether a program like the above has well-defined semantics under the LLVM LangRef – which requires considering the dynamic behavior of GEP.

If the compiler sees the load from ptr_inner2 and then concludes that offset 8000 was already inbounds before my_grow then that seems like it could lead to problems.

The problem is that LLVM doesn’t specify sufficiently precisely what “the right thing” is, that’s why I am here asking about it. :slight_smile:

Yes, this is indeed related to pointer provenance – as far as I understand, GEPi uses pointer provenance to determine the allocation that the pointer has to be in-bounds of. That is, at least, the usual interpretation I have seen. (However, if GEPi is intended to ignore provenance while preserving it, and use the concrete integer address to determine which object the pointer must be in-bounds of, then that’s fine as well and doesn’t really change anything about my question.)

CHERI is largely unrelated.

Yes, there are a bunch of idioms that are non-trivial to port to CHERI. That doesn’t mean they should not be supported on other targets, though.

For CHERI, what I would suggest is that the initial reservation mmap creates a pointer with the capability to access the entire memory range, and that the later mmaps that add and remove backing pages don’t have to do anything with capabilities. This means programs can segfault despite using a pointer within its capability, but they can’t use that pointer to do any rogue accesses, as the entire range has been reserved, so the security properties CHERI aims for should be preserved.

oops looks like I forgot about my own prior question here.^^ Sorry for that.

The discussion did get derailed a bit at some points, but it seemed like you and @efriedma-quic both agreed that it would make sense to make LLVM support such code, i.e. to say that code that uses munmap to shrink the dereferenceable range of an allocation is well-defined (and presumably the same goes for code that uses mmap to grow the dereferenceable range of an allocation).

What is not clear to me is how to best go about this. Where would that even be documented?

FWIW, this proposal by @nikic would also make it possible for Rust to entirely bypass the getelementptr-related question by entirely avoiding the use of inbounds, so dereferenceability is the only question that remains:

You’re still mixing compile time with run time. The interpretation is correct for compile time, not for run time.

By that I mean: at compile time, there is no way to know if the memory has been allocated, freed, moved. The compiler only knows the shape of the object you’re trying to dereference and can tell you if the offset you’re asking is in-bounds or out-of-bounds of some theoretical object of that type.

It may be possible to infer at compile time if a pointer has enough allocated space if the allocation is immediately followed by use, but not if the allocated pointer is passed to functions from multiple entry points, or if reallocations occur in between, etc.

Even whole program analysis may be insufficient if you’re creating a library that will be used by arbitrary programs.

It can’t – not without “knowing” the semantics of some runtime library like mmap or my_version_of_mmap, which is totally out of the question.

You can only know these things at compile time if the language permits you via strong memory semantics (like Verona), but LLVM IR has to cope with unsafe languages like C and therefore cannot make assumptions about the memory model.

“LLVM IR is a compiler IR”. It’s not a representation of some particular language’s semantics, library implementation, or runtime behaviour. It has to be the lowest common denominator across all languages, and when one of them is C, there isn’t much you can do.

As an example, when LLVM tries to be smart about runtime libraries, it messes up: Orc Jit on windows cannot find __divti3

If it can be inferred at compile time, sure. Not all pointer provenances can.

You’re particularly asking about runtime provenance (mmap, remap), and that’s something GEP cannot do, because it has compile-time semantics.

This has to be done on top of LLVM IR. You must add semantics around IR constructs, run time calls that verify state, or use hardware controlled scoping (like CHERI), or define language controlled scoping (like Verona), and lower that to IR that “will do the right thing” based on your language/API/hardware semantics.

That’s why I suggested MLIR. It’s easier to wrap semantics around LLVM IR as an MLIR dialect, than it is creating new LLVM intrinsics and teaching LLVM passes to obey them.

No, I’m not. I am well aware of the distinction of what happens at runtime vs what part of that the compiler can deduce at compile time. You seem to be assuming that the inbounds restriction of GEP, and the questions around poison, are purely compile-time questions. That is not the case. LLVM IR has a runtime semantics that fully determines all questions of poison, Undefined Behavior, things like that. A semantics like that is the basis for tools like Alive and for formal (or even sufficiently precise but informal) analysis of whether an optimization pass is correct.

It is often the case that these semantics are not talked about a lot in compiler development, which leads to massive issues such as miscompilations and serious spec holes. I want to talk about those semantics to avoid exactly those kinds of issues proactively, instead of waiting until they occur in production code and then merely reacting to them.

Yes, I know all that and it is entirely irrelevant for this question. I am asking what the specified behavior of GEPi is for an omniscient compiler that fully knows everything, or equivalently for an interpreter that runs LLVM IR while fully tracking all state that is relevant to determine whether an operation produces poison or whether a program has UB, or equivalently for the Abstract Machine that specifies the behavior of LLVM IR programs, or equivalently for a tool like Alive that can automatically prove certain optimizations to be correct. All these use-cases need a fully defined semantics for each IR operation, and none of them ever involves the question of “does the compiler know something”.

Compiler analyses are but imperfect approximations of these dynamic semantics, whose correctness is judged by whether they correctly (i.e., conservatively) approximate these semantics.

To give a concrete example: the result of an add %x, %y might not be known by the compiler, but this operation always has a very specific well-defined result (and sometimes it is poison), and what the result is depends in no way on what the compiler “knows”. Sometimes the compiler knows the result is a particular constant, sometimes the compiler knows the result is even, sometimes the compiler knows nothing. None of this is ever mentioned in the specification of what add does, as it is not part of the specification. The specification only says: this produces the sum of the two inputs, or poison if one input (or both) is poison. Whether the compiler “knows” that a value is poison or even or whatever is not relevant for specifying the behavior of add. getelementptr inbounds is fundamentally the same, the spec is just more complicated.

The only project that can define the semantics of IR constructs is the LLVM project itself. The correctness of LLVM optimization passes depends on what these semantics are, so it is the responsibility of the LLVM project to fully describe what these semantics are. This also helps the LLVM project itself because it is the basis for being able to judge whether an optimization is correct (and it is used for very helpful tools such as Alive).

LLVM IR is generally fairly good at this, thanks to the LangRef, but there are some gaps. This thread is about one of them.

I’ll start by answering the second question, about dereferenceability, because it is easier: for your IR sample, we can only assume that the location is still dereferenceable after the call if the call is nofree (in any of the forms that can take, e.g. via one of the attributes, or because the callee does not have access to the object in the first place).

I believe that originally nofree was designed with just whole-allocation freeing in mind, but I think it could be extended to also cover partial deallocation from the end without any significant impact. I don’t see a problem with accepting a LangRef change to that effect.

Of course, if the dereferenceability is derived from dereferenceable rather than a load, then this is not the case, because dereferenceability currently applies to the whole function. That’s a different discussion.

It’s worth noting that we make very little use of nofree right now (I think there might be just one transform, for loop load PRE?) so this is mostly a theoretical thing at this point.


The question about getelementptr inbounds is trickier. From a practical perspective, as long as LLVM is not told about the size of the allocation, things will be fine. For the actual operational semantics, I don’t have an immediate answer to this.

I think for the most part, GEP inbounds would be fine with just being in bounds of a hypothetical allocated object that could exist at that location. This puts limits on the size of the inbounds range (half the address space) and excludes the null pointer location.

However, we also make use of GEP inbounds in alias analysis, to determine that a GEP inbounds with offset N cannot alias an identified object with size smaller than N. This is where we can get into trouble with growing allocations. If we initially have an allocation of size N, and a GEP inbounds with offset M > N, and the allocation later grows to size K > M, how does that work out if we want GEPs to be freely hoistable and sinkable? I don’t really know.

That’s great, thanks!

Any idea where in the LangRef this would best go? This is not really the property of any particular operation or attribute, instead it is about a universal property governing the entire Abstract Machine.

Yes of course, if there is such an attribute then the allocation must not be shrunk below that size while the attribute applies (i.e., while the function whose parameter carries the attribute runs).

That sounds like we could say that allocated objects have two sizes: a maximal size and a current size. The maximal size is subject to LLVM’s usual constraints (not larger than half the address space, not wrapping around the address space, not containing the null pointer). The maximal size is fixed when the object is created and never changed. The current size can change, but never beyond the maximal size.

For the mmap use case, the maximal size would be set when reserving the region in the address space. That’s when we consider the allocation to be created, but its initial current size is 0. It then grows and shrinks over time.

What is not clear to me is whether allocated objects need to be disjoint according to their maximal size, or whether two allocated objects’ “maximal region” can overlap as long as their “current region” does not overlap.

One way to make this work could be to say that “an identified object with size smaller N” really means “an object that will definitely not grow beyond size N”, i.e. the maximal size mentioned in the previous paragraph is known to be N (or less). For instance, stack allocations via alloca cannot grow, so if the small object with size no more than N is an alloca, then the aliasing reasoning remains valid: no getelementptr inbounds of more than N can ever be valid inside this alloca. For malloc we’d have to decide whether growing this later is allowed or not (and if we say it is allowed, then alias analysis has to be changed to no longer take into account size information from malloc). For mmap we can safely say that it can grow because LLVM already does not make use of the size information here.

If we initially have an allocation of size N, a GEP inbounds with offset M > N and the allocation later grows to size K > M, how does that work out if we want GEPs to be freely hoistable and sinkable? I don’t really know.

The allocation gets created with initial current size N and maximal size K. So the GEP with offset M is always allowed.