Pointers Are Complicated III, or: Pointer-integer casts exposed

Hi all,

I wrote a(nother) blog post on the semantics of pointer-integer and integer-pointer casts:
Pointers Are Complicated III, or: Pointer-integer casts exposed. I come mostly from a Rust and C perspective but this also interacts with LLVM IR semantics design, so I wonder what your thoughts on that are. :slight_smile:

Kind regards,
Ralf


The model that we use in CHERI C is very similar to the Rust Strict Provenance model. We, like Rust, have a single-provenance model, but I believe that it could be generalised to a multi-provenance model quite easily. The current state is:

  • Pointers are lowered to IR pointers.
  • [u]intptr_t is a typedef for a built-in type that is also lowered to IR pointers.
  • Pointer to / from [u]intptr_t are currently IR bitcasts but will become no-ops with the opaque pointer work.
  • Arithmetic on pointers / [u]intptr_t uses a get-address intrinsic, then the op, then a set-address intrinsic. This propagates the provenance.
  • Pointer (including [u]intptr_t) to integer casts use a get-address intrinsic.
  • Integer-to-pointer casts use a set-address intrinsic on a null pointer.
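
The lowering rules above can be sketched as a toy model in Rust. This is only an illustration, not how CHERI LLVM is implemented: the names Ptr, get_address, set_address, ptr_add and int_to_ptr are made up here, and provenance is reduced to an optional allocation ID.

```rust
/// Toy model (hypothetical names): a pointer is an address plus an
/// optional provenance, represented here as an allocation ID.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ptr {
    prov: Option<u32>,
    addr: u64,
}

/// The null pointer carries no provenance in this model.
const NULL: Ptr = Ptr { prov: None, addr: 0 };

/// Models the get-address intrinsic: drops provenance, keeps the address.
fn get_address(p: Ptr) -> u64 {
    p.addr
}

/// Models the set-address intrinsic: keeps the provenance of `p`
/// and replaces its address.
fn set_address(p: Ptr, addr: u64) -> Ptr {
    Ptr { prov: p.prov, addr }
}

/// Arithmetic on a pointer or [u]intptr_t: get-address, the op, set-address.
fn ptr_add(p: Ptr, offset: u64) -> Ptr {
    set_address(p, get_address(p).wrapping_add(offset))
}

/// Integer-to-pointer cast: set-address on a null pointer.
fn int_to_ptr(addr: u64) -> Ptr {
    set_address(NULL, addr)
}
```

In this model, arithmetic keeps the provenance of its pointer operand, while a round trip through a plain integer yields a pointer with the same address but no provenance.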

We may still have some inttoptr and ptrtoint uses, but we'd like to remove them entirely. I believe that this could be generalised to a multi-provenance semantics with two small tweaks:

First, allow set-address to take two pointers as arguments. a+b on [u]intptr_t then becomes:

%0 = @llvm.address.get(%a)
%1 = @llvm.address.get(%b)
%2 = add %0, %1
%3 = @llvm.address.set.multi(%a, %b, %2)

We are then explicitly tracking that %3 has the provenance of the union of %a and %b. This wouldn't work for CHERI, but I believe it would for C/C++ on non-CHERI platforms.
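
As a rough sketch of the union semantics, here is the same a+b sequence in a toy Rust model. The type MultiPtr and the functions make_ptr, set_address_multi and uintptr_add are hypothetical names chosen for this example; provenance is modelled as a set of allocation IDs.

```rust
use std::collections::BTreeSet;

/// Toy model (hypothetical): an address plus a *set* of provenances.
#[derive(Clone, Debug, PartialEq)]
struct MultiPtr {
    prov: BTreeSet<u32>,
    addr: u64,
}

/// Convenience constructor for the example.
fn make_ptr(prov: &[u32], addr: u64) -> MultiPtr {
    MultiPtr { prov: prov.iter().copied().collect(), addr }
}

/// Models get-address: only the address survives.
fn get_address(p: &MultiPtr) -> u64 {
    p.addr
}

/// Models the two-pointer set-address: the result has the union of
/// both operands' provenance and the given address.
fn set_address_multi(a: &MultiPtr, b: &MultiPtr, addr: u64) -> MultiPtr {
    MultiPtr { prov: a.prov.union(&b.prov).copied().collect(), addr }
}

/// a + b on [u]intptr_t: get both addresses, add, set-address (multi).
fn uintptr_add(a: &MultiPtr, b: &MultiPtr) -> MultiPtr {
    let sum = get_address(a).wrapping_add(get_address(b));
    set_address_multi(a, b, sum)
}
```

One consequence of taking the union is that, in this model, the addition is commutative: uintptr_add(a, b) and uintptr_add(b, a) give the same address and the same provenance set.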

Second, add a DataLayout flag for an address space that tells you if set-address on a null pointer gives wildcard or empty provenance. For CHERI, this would define empty provenance: void *a = (void*)0x12345678; on CHERI gives a pointer that cannot be dereferenced (enforced by the hardware) and so @llvm.address.set(null, 0x12345678) would give a pointer that the optimisers know cannot alias any valid pointer and where any attempt to load or store via it is UB. In contrast, on non-CHERI platforms, this would be assumed to have wildcard provenance: it may alias any other valid pointer.
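
To illustrate what such a flag would mean for alias analysis, here is a small Rust sketch. Everything in it is an assumption made for the example (the enums NullSetAddress and Provenance, and the functions int_to_ptr_provenance and may_alias); it only looks at provenance and ignores addresses entirely.

```rust
/// Hypothetical per-address-space flag: what provenance does
/// set-address on a null pointer produce?
#[derive(Clone, Copy, Debug, PartialEq)]
enum NullSetAddress {
    Wildcard, // non-CHERI: may alias any valid pointer
    Empty,    // CHERI-like: cannot be dereferenced at all
}

/// Toy provenance lattice for the example.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Provenance {
    Alloc(u32),
    Wildcard,
    Empty,
}

/// Provenance of the result of an integer-to-pointer cast under the flag.
fn int_to_ptr_provenance(flag: NullSetAddress) -> Provenance {
    match flag {
        NullSetAddress::Wildcard => Provenance::Wildcard,
        NullSetAddress::Empty => Provenance::Empty,
    }
}

/// Could accesses through pointers with these provenances touch the
/// same memory? (Provenance only; addresses are ignored.)
fn may_alias(a: Provenance, b: Provenance) -> bool {
    match (a, b) {
        // Empty provenance can never be used for an access.
        (Provenance::Empty, _) | (_, Provenance::Empty) => false,
        // Wildcard must be assumed to alias anything valid.
        (Provenance::Wildcard, _) | (_, Provenance::Wildcard) => true,
        (Provenance::Alloc(x), Provenance::Alloc(y)) => x == y,
    }
}
```

With the Empty setting, an optimiser using this model could conclude that a pointer cast from an integer never aliases a known allocation; with Wildcard it has to assume that it might.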

It might be useful to make this a property of the cast, rather than a global property of the DataLayout. If the source language (e.g. Rust) can make strong guarantees that provenance-free casts from integer to pointer give a non-dereferenceable pointer (i.e. the pointer type is used for storage, but the only valid thing to do with it is convert it back to an integer, which is useful for unions of pointer and integer types) then they may wish to opt in to the CHERI-like behaviour to expose more optimisation opportunities.

I enjoyed reading the post. I knew a bit about this already but your examples really helped make it clear.

(one typo, "hold arbitrary data in a container of a given time" should be "type"?)

I don't know enough to comment on the choices proposed but I have been doing a lot of work in lldb to cope with pointers that are more than just an address. On AArch64 we have top byte tagging, memory tagging and pointer authentication which can use the "non-address" bits.

From reading this, those bits could be considered part of the "provenance" of such pointers, so I am very interested to see to/from_raw_parts in Rust, where I assume the metadata could be tagging bits like those. You could then choose whether the result would be allowed to access the location or not.

That certainly seems useful because it's similar to what lldb wants to do or is already doing: given a pointer, remove everything that isn't the address and read that location. (Of course we sidestep a lot of permissions issues, being a debugger.)

If x and y are both uintptr_t, then what does x+y return? Which side's provenance does it use?

oops thanks for pointing that out!

In our first version, it was consistently the left side. @arichardson improved that somewhat, so that it's the left side if it's ambiguous, but if one side is promoted from an integer type then it's the other side (so 1 + (intptr_t)somePtr will take the provenance from somePtr). In the ambiguous cases, the compiler will warn and say something like 'provenance taken from the left, check that that's what you meant'. You can silence the warning by casting one side via an unambiguous integer type. CHERI C provides vaddr_t as a thing that is large enough to hold an address but does not carry provenance. On all current platforms this is the same as size_t, but it's possible (in theory, at least) to have a system where capabilities have 64-bit bases but can't have lengths >2^32, so size_t would be the same type as uint32_t, but vaddr_t would be uint64_t. Rust would require both to be u64 because Rust makes far fewer allowances for unconventional architectures than C. The flexibility in C has been very helpful for CHERI, and so I suspect that the Rust approach will turn out to be a terrible idea in 20-30 years' time when someone comes up with the next big shift in CPU abstract machines.
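
The selection rule described above can be written down as a small decision function. This is a simplified Rust sketch, not the CHERI Clang implementation: Operand, Sum and add_uintptr are hypothetical names, and each operand is classified only as "derived from a pointer" or "promoted from a plain integer".

```rust
/// Hypothetical classification of a [u]intptr_t operand.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Operand {
    /// Value that came from a pointer and carries provenance.
    FromPointer { prov: u32, addr: u64 },
    /// Plain integer promoted to [u]intptr_t; no provenance.
    Promoted(u64),
}

/// Result of the addition, plus whether a warning would be emitted.
#[derive(Debug, PartialEq)]
struct Sum {
    prov: Option<u32>,
    addr: u64,
    ambiguous: bool,
}

/// Picks the provenance source for `lhs + rhs` following the rule in
/// the text: the non-promoted side wins; if both sides carry
/// provenance, take the left one and flag the case as ambiguous.
fn add_uintptr(lhs: Operand, rhs: Operand) -> Sum {
    use Operand::{FromPointer, Promoted};
    match (lhs, rhs) {
        (FromPointer { prov, addr: a }, FromPointer { addr: b, .. }) => {
            Sum { prov: Some(prov), addr: a.wrapping_add(b), ambiguous: true }
        }
        (FromPointer { prov, addr }, Promoted(n))
        | (Promoted(n), FromPointer { prov, addr }) => {
            Sum { prov: Some(prov), addr: addr.wrapping_add(n), ambiguous: false }
        }
        (Promoted(a), Promoted(b)) => {
            Sum { prov: None, addr: a.wrapping_add(b), ambiguous: false }
        }
    }
}
```

So the model's equivalent of 1 + (intptr_t)somePtr takes its provenance from somePtr without a warning, and only the case where both sides carry provenance is order-dependent.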

NB: vaddr_t is deprecated and only exists in CheriBSD; we added ptraddr_t several years ago as the type to use instead. The released PDF of the programming guide is however outdated in that regard.

On the LLVM level that seems like a problem since it makes addition non-commutative (unless that warning can be promoted to UB).

I won't drive this off-topic by discussing design choices around integer types in Rust. :wink: But I think with Strict Provenance, Rust will have excellent support for CHERI targets, without needing a new integer type. (And the CHERI people that are active on the Rust side seem to agree.)

At the LLVM IR level, addition is not defined on pointer types (except via GEP, which is already non-commutative, it is pointer + set of integers). In a multi-provenance model, addition would be the sequence that I described above (get address on both operands, add, multi-provenance version of set-address).

I believe that's true for CHERI. CHERI was not considered when C99 picked its set of integer types, but having separate types for size_t, ptrdiff_t, intptr_t and so on has been valuable for us. By separating out the intent, C had a flexible set of primitives that allowed it to be adapted fairly easily to architectures that were quite different from any existing ones. Rust lacks that flexibility. CHERI is not the end of CPU architecture evolution. In the next 20 years or so, I expect that someone will come up with something else new that has performance or security advantages. I am much more confident that C will be able to adapt to take advantage of it, whatever it happens to be, than I am of Rust because C designed flexibility in, whereas Rust made strong assumptions about address spaces.


An example of that would be things like 128-bit RISC-V as people push large shared address spaces further; one sketch for that reintroduces near and far pointers, and you could imagine having a 128-bit address space but still a 64-bit size_t (because why would you allocate more than 2^64 bytes for one object…). C can accommodate this but Rust cannot even with the strict provenance changes.

I agree it is a challenge. But this is not the place to discuss that – this is an LLVM forum, I thought, not a "C vs Rust" forum. :wink: There are people that have thought about these issues far more than I did (which is basically not at all) and you can find them e.g. in the Rust internals forum or the Rust Zulip if you want to have a discussion about how Rust could be adjusted to architectures like that. :slight_smile:

For now, I'd be more than happy if we can resolve the nasty questions of pointer-integer roundtrips on the platforms that already do exist. I have laid out my ideas for what Rust could do there, C is gravitating towards PNVI-ae-udi, but as far as I can tell the semantics of LLVM in this space remain an open question – which of course is a challenge for C/Rust frontends that want to use LLVM as a backend.


That was an interesting read! Just a note that there's an updated version of the C Provenance paper; WG14 doesn't have the nice infra that WG21 has (e.g. P-Papers where you can go to wg21.link/Pxxxx and get taken to the newest revision), so it's worth occasionally checking the paper log for updates.


@h-vetinari thanks! I fixed the URL.

It is great to see progress being made in this area. The progress made by the Rust community is especially impressive to me.

FWIW, opt now has a --disable-i2p-p2i-opt flag that disables the int-ptr roundtrip cast.
It was added by a student during last GSoC.
Compiling LLVM with LLVM on x86-64 with the flag turned on resulted in a very small number of assembly changes (< 10 asm files, as of last summer), IIRC.
For AArch64, the number was slightly bigger (a few hundred files). It was largely due to memcpy(dst, src, 8) → load/store i64. After the transformation, many ptr<->int casts are made by GVN, etc.

To avoid using ptr-int casts whose semantics are unclear ATM, we can utilize the llvm.ptrmask intrinsic function.
However, in the case of C/C++, it was unclear how the clang frontend could detect pointer-masking expressions and replace them with llvm.ptrmask. Hence, it is not being used AFAIR.
I wonder how pointer-masking expressions are represented in Rust. If there is a dedicated Rust API function that people are expected to use, lowering it to the llvm.ptrmask intrinsic would be a good option.


FWIW, opt now has a --disable-i2p-p2i-opt flag that disables the int-ptr roundtrip cast.

That sounds great. :slight_smile: Are there any plans for making it the default?

For what it's worth, the biggest blocker for Rust that I am aware of is the fact that LLVM will replace one icmp-equal pointer by another, which is (in general) clearly unsound under any model I have seen anyone propose – those pointers might have different provenance (they might be "based on" different other pointers), so even if their address is the same, it matters which one is being used. If that LLVM issue were fixed, Rust could do the rest by compiling ptr2int and int2ptr casts to calls to some opaque FFI functions, which would force LLVM to treat them as an optimization barrier. That fixes all the remaining miscompilation examples I am aware of. Most Rust code can avoid these optimization barriers by using the Strict Provenance APIs instead.
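
A small Rust example may help show the kind of situation where equal addresses do not imply interchangeable pointers. The function names below are made up for this sketch, and the claim about p rests on my understanding of the aliasing models that Miri checks (Stacked Borrows and Tree Borrows), so treat it as an illustration rather than a specification.

```rust
/// Returns two raw pointers with the same address that are derived
/// from different halves of one slice (hypothetical helper).
fn equal_but_distinct(a: &[u8]) -> (*const u8, *const u8) {
    assert!(a.len() >= 2);
    let (left, right) = a.split_at(1);
    // One past the end of `left`: same address as the start of `right`,
    // but derived from `left`. It is never dereferenced here.
    let p = left.as_ptr().wrapping_add(1);
    // Start of `right`: derived from `right`, so it may read a[1].
    let q = right.as_ptr();
    (p, q)
}

/// Reads a[1] through `q` after comparing the two pointers.
fn read_via_q(a: &[u8]) -> Option<u8> {
    let (p, q) = equal_but_distinct(a);
    if p == q {
        // SAFETY: `q` points to the first element of `right`, which is
        // in bounds and borrowed for the duration of this call.
        // Rewriting this load to use `p` "because p == q" would change
        // which pointer the access is based on.
        Some(unsafe { *q })
    } else {
        None
    }
}
```

Here p and q compare equal, yet only q is based on a borrow that covers the second byte. An optimisation that replaces q with p inside the branch turns a valid access into one through a pointer that, as far as I understand those models, is not allowed to perform it.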

That issue was retitled to say that the memory model "needs more rigor", but at this point it seems fairly clear that no matter what the rigorous model will look like, this particular optimization has no chance of being correct. So maybe it would be possible to remove it without waiting for all the answers to materialize?

They are expressed using addr and with_addr. Those APIs are more general and can still fully express the intent of ptrmask, namely not escaping the pointer and avoiding any ambiguity about which pointer is 'based on' which other pointer:

  • addr is like ptrtoint but cannot be cast back to a pointer, i.e., the compiler does not have to consider this operation as 'escaping' the pointer. (In PNVI-ae-udi terms, it just strips the provenance but does not expose the underlying storage instance.)
  • with_addr has the same signature as ptrmask, but the semantics are more like getelementptr ptr, addr - ptrtoint(ptr). IOW, the new pointer address is fully determined by the integer argument, but the new pointer is "based on" the ptr argument.

With these, %result_ptr = ptrmask(ptrty %ptr, intty %mask) can simply be implemented as follows (forgive my sloppy LLVM syntax):

%addr = llvm.addr(ptrty %ptr)
%masked_addr = and intty %addr %mask
%result_ptr = llvm.with_addr(ptrty %ptr, intty %masked_addr)
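
On the Rust side, the same three steps can be sketched with ordinary pointer arithmetic. The helpers addr_of, with_addr_sketch and ptrmask_sketch are hypothetical stand-ins written for this example, not the standard library methods. In particular, addr_of uses a plain `as usize` cast, which (unlike the real addr) may count as exposing the pointer; with_addr_sketch spells out the "gep ptr, addr - ptrtoint(ptr)" reading using wrapping_offset.

```rust
/// Stand-in for `addr`: returns the address as an integer.
/// Assumption: a plain cast is used here for portability of the sketch.
fn addr_of<T>(p: *const T) -> usize {
    p as usize
}

/// Stand-in for `with_addr`: the result has address `addr` but is
/// derived from `p` by byte-wise pointer arithmetic, so it stays
/// "based on" `p`.
fn with_addr_sketch<T>(p: *const T, addr: usize) -> *const T {
    let delta = (addr as isize).wrapping_sub(addr_of(p) as isize);
    (p as *const u8).wrapping_offset(delta) as *const T
}

/// ptrmask expressed as addr, then `and`, then with_addr.
fn ptrmask_sketch<T>(p: *const T, mask: usize) -> *const T {
    with_addr_sketch(p, addr_of(p) & mask)
}

/// Usage: stash a tag in the low bit of an aligned pointer, then
/// mask it off again before dereferencing.
fn tag_roundtrip() -> u32 {
    let value: u32 = 7;
    let p = &value as *const u32;
    // u32 is 4-byte aligned, so the low bit of the address is free.
    let tagged = with_addr_sketch(p, addr_of(p) | 1);
    let untagged = ptrmask_sketch(tagged, !1usize);
    // SAFETY: `untagged` has the same address as `p` and was derived
    // from `p`, which points to the live local `value`.
    unsafe { *untagged }
}
```

The tagged pointer is never dereferenced; only the masked one is, and at no point is an integer turned back into a pointer.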

I didn't know LLVM has ptrmask; maybe that means y'all would also be open to add addr and with_addr which are equally well-defined but more versatile? :smiley:

I believe the answer is yes, and we already have llvm::canReplacePointersIfEqual (https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Analysis/Loads.h#L177, https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/Loads.cpp#L646). Maybe I can write a patch in a few days that updates EarlyCSE to use this function. For GVN or NewGVN, I am not familiar with the code unfortunately.

Not at the moment, sadly :confused:

Well, it made me think that we must simply use 'gep p, (i-p)' and make the meaning of the gep clear.
Folding 'gep p, (i-p)' into 'i' was removed in D98588 [InstCombine] Restrict a GEP transform to avoid changing provenance and D98611 [InstSimplify] Restrict a GEP transform to avoid provenance changes (Compiler Explorer).
What do you think about writing a LangRef patch that nails down that 'gep p, (i-p)' is based on p only, justifying the patches?

Sadly the function still defaults to true even when it has no clue if both pointers have the same provenance. So even making GVN use that function won't fix the example in the bugreport.

The problem is that generating such a GEP leads to codegen regressions. (See some discussion here.) I was hoping that with a dedicated intrinsic, codegen could produce better results. :smiley:

Is that even up for discussion? The "based on" docs seem pretty clear here.
But yeah, I'd be in favor of such a patch. (I won't have time to write it, though.)