Clarifying GEP semantics

Hello, I’m trying to figure out what’s legal and what’s not legal to do with GEP instruction in llvm and have some confusion that you can hopefully help me resolve.

Reading The Often Misunderstood GEP Instruction — LLVM 17.0.0git documentation:

Without the inbounds keyword, there are no restrictions on computing out-of-bounds addresses. Obviously, performing a load or a store requires an address of allocated and sufficiently aligned memory. But the GEP itself is only concerned with computing addresses.

seems to imply that non-inbounds gep could be used to produce out-of-bounds address and it’s perfectly legal to use that address to load/store as long as it points to sufficiently aligned “allocated” memory.

However, reading different section on the same page The Often Misunderstood GEP Instruction — LLVM 17.0.0git documentation:

It’s invalid to take a GEP from one object, address into a different separately allocated object, and dereference it.

which seemingly contradicts the statement above (unless there is an implicit “inbounds” everywhere).

Also:

Can I compute the distance between two objects, and add that value to one address to compute the other address?
As with arithmetic on null, you can use GEP to compute an address that way, but you can’t use that pointer to actually access the object if you do, unless the object is managed outside of LLVM.

What is meant by “object is managed outside of LLVM”? Does it mean that if some “managed outside” condition is meant, then it’s legal to load/store from this pointer inside llvm? Or was the intent here to only allow producing such address in the IR visible to llvm (so managed in this case should be read as “accessed”)?

Taking into the account all this, is the following transformation currently considered legal?

inttoptr ((ptrtoint %p) + 42)
->
gep %p, 42 ; no inbounds

what if this address is immediately used by load or store?

non-inbound gep can produce OOB pointers. But those can’t be dereferenced.

This is not legal in general because pointers and integers have different rules for refinement.
This is a very long story; stay way from int2ptr unless you are willing to spend a few hours reading about it. (I think this is a decent starting point: https://web.ist.utl.pt/nuno.lopes/pres/pointers-eurollvm18.pdf)

1 Like

Without the inbounds keyword, there are no restrictions on computing out-of-bounds addresses. Obviously, performing a load or a store requires an address of allocated and sufficiently aligned memory. But the GEP itself is only concerned with computing addresses.

seems to imply that non-inbounds gep could be used to produce out-of-bounds address and it’s perfectly legal to use that address to load/store as long as it points to sufficiently aligned “allocated” memory.

The statement a “non-inbounds gep could be used to produce out-of-bounds address” is true. The second statement, “it’s perfectly legal to use that address to load/store as long as it points to sufficiently aligned “allocated” memory” is false.

If you have a gep that produces an out-of-bounds address followed by a load (or any other memory operation) of that result, undefined behavior has occurred no matter what. The question is which instruction causes that undefined behavior. If the inbounds keyword is not present, then the undefined behavior occurs in the load: the gep was a legal instruction that produced a well-defined value, but the resulting pointer still fails the provenance check of the load. If the inbounds keyword is present, then the undefined behavior essentially occurs in the gep–the gep returns a poison value [1].

Put differently, load (gep (gep x, y), -y) is a valid program even if the inner gep goes out of bounds, whereas load (gep (gep inbounds x, y), -y) is not a valid program if the inner gep goes out of bounds, even though the outer gep would immediately pull it back in bounds.

inttoptr ((ptrtoint %p) + 42)

As mentioned by Nuno, the story of inttoptr (ptrtoint) is far more complicated, and really requires far more detail than is worth going into here. Essentially, the issue is that pointers track provenance via the ability to track data dependence chains (recursively looking at the operands of instructions), whereas optimizations intentionally do not preserve these dependence chains for integer values. Consequently, converting a pointer to an integer and back into a pointer broadens the scope of things the pointer may point to, so even if the numerical value of the pointer is the same, the pointer does not have the same provenance results.

[1] A poison value is not undefined behavior in and of itself, but it tends to propagate itself to a use that causes undefined behavior, so it’s useful to think of poison as a kind of “delayed” undefined behavior, even if that isn’t the most precise summation of its semantic behavior.

2 Likes

Hm, so what does the following mean then in this context?

Obviously, performing a load or a store requires an address of allocated and sufficiently aligned memory.

Are there any circumstances under which performing a load/store from non-inbounds GEP that became out of bounds legal in current IR? Judging by your response - there aren’t many, so probably this statement needs to be changed to something like “Obviously, performing a load or a store requires a sufficiently aligned address inside an actual underlying allocated object”.

What if there are no objects at the address yet and the following store is “creating” an object?

It looks like due to these semantic distinctions these two equivalent C functions compile to non-equivalent IR and are optimized differently: Compiler Explorer
If LLVM is unable to reason about inttoptrs due to the potential refinement issues (even though the source code doesn’t appear to have them, although I could be wrong), is there some forward way to make foo optimize the same as bar?
Probably teaching SCEV about IntToPtr might work, but that might be another can of worms.

What if there are no objects at the address yet and the following store is “creating” an object?

Stores don’t create objects; those are created by allocations (e.g., alloca, call to malloc, global variable).

It looks like due to these semantic distinctions these two equivalent C functions compile to non-equivalent IR and are optimized differently: Compiler Explorer

I haven’t read closely the current C pointer provenance papers to be certain, but those two C functions are not in fact equivalent. My recollection of the current leading PVNI proposal is that the second function is UB if the four writes straddle an object boundary, whereas the first function is still defined in that case (because provenance is not tracked through integers).

If LLVM is unable to reason about inttoptrs due to the potential refinement issues (even though the source code doesn’t appear to have them, although I could be wrong), is there some forward way to make foo optimize the same as bar?

Probably not. inttoptr may look easy to optimize, but it has very nonobvious pointer provenance ramifications. However, the other issue with pointer provenance is that proper specification for it is still an area of active academic research. There’s no proper specification of pointer provenance in the C specification (though it is still generally seen as implied), and the specific wording for pointer provenance currently in the LLVM IR LangRef is itself somewhat inconsistent with the current understanding. This makes being very specific about what is and isn’t legal around pointer provenance, and especially inttoptr, challenging, and without that specificity, we need to be extremely wary about optimizing inttoptr. The standing guideline around inttoptr is to try to avoid generating it wherever possible, so that the hard questions around semantics just don’t kick in in the first place.

Stores don’t create objects; those are created by allocations (e.g., alloca , call to malloc , global variable).

What object does malloc create? What is its type and size? Same for alloca.

haven’t read closely the current C pointer provenance papers to be certain, but those two C functions are not in fact equivalent. My recollection of the current leading PVNI proposal is that the second function is UB if the four writes straddle an object boundary, whereas the first function is still defined in that case (because provenance is not tracked through integers).

Assume that p = (long)malloc(N);, where N > 4; What boundary could be straddled here?

Probably not.

Being unable to turn store i8, (inttoptr x+i) into store <i8 x N> (inttoptr x) is unfortunate as GCC doesn’t seem to have an issue optimizing such code and it doesn’t appear to be that uncommon pattern.

malloc creates a new object with the requested size. Same as alloca.

We’ve seen some cases where gcc miscompiles code because it has no trouble optimizing int2ptr/ptr2int casts.

If you really want to know more, I think our paper gives an ok intro to memory models and the issues around pointer casts: https://web.ist.utl.pt/nuno.lopes/pubs/llvmmem-oopsla18.pdf

Alright, so it’s not in a higher level C++ sense. Object is just some “allocated slab of memory with a specific size”, correct?
So as long as we can guarantee that no stores/loads were done outside of original “memory slab” pointed to by inttoptr %p, or even if we are fine with UB if this happens (due to higher level semantics of the language the IR was generated from, although not sure how to express it in IR yet), this transformation is ok?

Yep. Plus a flag stating whether its alive or not, etc.

It’s not that simple, sorry. You have to go and read more about the subject if you want to understand it in full.

1 Like

Assume that p = (long)malloc(N);, where N > 4; What boundary could be straddled here?

If we knew the object that p pointed to came from such a malloc, then sure, we could conceivably optimize it away [1]. But in the program fragment given, the compiler doesn’t know that p has to come from such a malloc.

[1] I would still be very hesitant to sanction such an optimization until pointer provenance is pinned down better; there’s too many edge cases of pointers having the same integer value yet different provenances to feel comfortable optimizing it without a formal semantics. This is a space with a long history of miscompiles due to underspecified provenance concerns.

1 Like

Is there a way to express this notion in IR currently? Say, you want to generate IR for something similar to memset that takes intptr instead of regular pointer parameter but has the same “contract” that all accessed memory belongs to the same original allocation.

You mean the 2018 paper? Does it reflect the current state of things in LLVM (it’s been 5 years)? Haven’t finished it yet (especially the concrete operation semantics), but according to it does non-inbounds gep on wildcard-provenance ptr produce non-wildcard provenance? The examples look as if they only discuss inbounds gep, but the description (deferring UB to the use) reads as non-inbounds gep.

Maybe that’s why GCC has long-standing bugs like this or this where code using int2ptr casts gets compiled incorrectly.

If you are curious about the plethora of pitfalls around int-to-ptr casts, I will also shamelessly plug my blog post on the subject, and this follow-up.

I have another clarifying question. Is it correct to state that in the current model all objects/allocations are continuous? I.e. if LLVM sees load %p and load (gep %p, 42) it can (not sure if it currently does) assume that %p is dereferenceable(42), correct?

That’s my understanding of it, yes.