Clarifiying the semantics of ptrtoint

During the recent discussion on clarifying non-integral pointer types, I noticed that ptrtoint is extremely underspecified. The current LangRef wording just says

The ‘ptrtoint ’ instruction converts value to integer type ty2 by interpreting the pointer value as an integer and either truncating or zero extending that value to the size of the integer type. If value is smaller than ty2 then a zero extension is done. If value is larger than ty2 then a truncation is done. If they are the same size, then nothing is done (no-op cast ) other than a type change.

This is probably fine for most architectures where a pointer representation is just the address (although it might be a bit more interesting if you are using the high bits for extra metadata). However, with non-integral pointers we really need to clarify what this conversion does.
Which of the following behaviours should ptrtoint use if we have something like a fat pointer struct of four i64s { base_addr; max_addr; offset; flags}:

  1. Return the bitwise representation of the fat pointer as if we had stored it to memory as a ptr addrspace(N) and loaded back as i256
  2. Return the underlying address of the pointer, so in this case base_addr+offset.
  3. Something else?

One problem with option 1 that I see when looking at the latest C standard draft

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space

And the icmp eq ptr semantics since right now say:

If the operands are pointer typed, the pointer values are compared as if they were integers.

This suggests to me that icmp eq ptr %a, %b is equivalent to icmp eq i64 ptrtoint(%a), ptrtoint(%b) and if we want icmp to behave as defined by the C standard we must compare the underlying address here instead of the full bitwise comparison (which would always return false for fat pointers to adjacent objects). As far as I can tell this C requirement also does not hold for MTE tagged pointers where adjacent objects use different MTE tags and thus compare different despite having the same address.

My mental model of ptrtoint so far has been option 2, i.e. discard all metadata and return the underlying address. I will upload a PR to update the LangRef to clarify ptrtoint (and icmp eq/ne ptr) once we have consensus on the desired semantics.

The main challenge I see with using the underlying address is that we don’t currently have a datalayout property for the address range of pointers and would need to add this for fat pointers. In the CHERI world we have been able to get by with using the index width (as that always happens to match the address range), but I believe AMDGPU fat pointers have a 48-bit address range with a 32-bit index. The simplest solution to this problem would be adding another datalayout component after the index width (I originally proposed this in ⚙ D135158 [DataLayout] Introduce DataLayout::getPointerIntegralSize(AS) but that review stalled).

CC: @nikic @davidchisnall @krzysz00

1 Like

I am not sure that we want to include C semantics in LLVM IR. I would expect an icmp eq to respect the substitution principle. If %1 and %2 compare equal then an optimiser can replace all uses of %2 with %1 in every basic block dominated by the comparison. This proposal would need us to carefully audit optimisers to make sure that they all have an ‘except pointers’ special case.

It seems simpler to make the C front end emit a non-escaping ptrtoint on such comparisons.

This is not true today.
Currently, pointer comparison is defined as comparing addresses. But even if two pointers have the same addresses, they may have different provenance.

I’m not sure we need to offer such strong semantics (of comparing addresses), though.

So, if we want icmp eq ptr to mean address comparison - ignoring provenance, fat pointer stuff, etc, then I claim that the relation ptr addrspace(N) %a == %b => ptrtoint(%a) == ptrtoint(%b) needs to not be true for non-integral pointers, since my definition of ptrtoint is to include all the data needed to construct the pointer in an integer form you can hack on.

However, my definition of pointer comparison in LowerBufferFatPointers out in AMDGPU just compares the bits, not caring about what the underlying address is. I could reasonably change the implementation - I don’t think anyone’s relying on this behavior - but it would be a decision we’d need to think about.

Well, there’s the long-standing disagreement as to what CHERI C semantics should be. “Big CHERI” uses address comparison but used to do full capability comparison, whilst CHERIoT switched back to full capability comparison. Both have their benefits and drawbacks, and someone needs to do an empirical study to determine which is least disruptive to real-world C.

I don’t think you can sanely have ptrtoint include metadata bits if icmp ptr doesn’t take them into account.

Regardless of what semantics CHERI C uses at the C abstract machine level, we still need to clarify what icmp and ptrtoint mean at the LLVM IR level.

If I understand @nlopes correctly, right now LLVM optimizations assume that icmp eq ptr %a, %b ignores provenance into account, so %a cannot be subsitituted for %b. If we allow replacing such an icmp with a ptrtoint integer icmp, it implies that ptrtoint returns only the address and therefore option 2 is the current behaviour used by LLVM.

Somewhat related: A while ago I was considering add a icmp eq exact ptr %a, %b which would compare not just the address but also compiler-level provenance (for existing architectures) and HW provenance (for CHERI). If such an icmp returns true then you can assume the pointers are usable interchangeably. This would allow ISO C to continue using icmp eq to perform address comparisons, but allow CHERI compilers to opt into strict comparison behaviour using icmp exact.
Is this something that would be useful to solve the current ambiguity in the LangRef?

Not quite. We do not diverge from CHERI C/C++ semantics here because it would make porting code exciting. We provide a Capability class in C++ that implements spaceship operator comparisons. Ordered comparisons are defined to return a partial ordering, equality is exact equality.

I would like to make the change because, so far, it’s been one of the biggest sources of confusion for people writing code on the platform: if a == b but using a works and using b in the same place traps, people get very confused (and everyone encounters this very shortly after they start actually using CHERI features). People who use our wrappers don’t experience this confusion. But that’s not a large enough sample set to be sure.

I like this and using it instead of our current exact-equal intrinsic would probably enable more optimisations.

I think that the semantics of ptrtoint as far as upstream LLVM is concerned are fairly clear: It returns the integral representation of the whole pointer, not some part of it. The ptrtoint result type (without zero-extension or truncation) is the same size as the pointer type, not the size of the pointer index type or some other size. You’ll observe this most clearly if you let InstCombine canonicalize ptrtoints with different result type widths.

ptrtoint is essentially a bitcast that also exposes the pointer provenance – though the details of what those words mean are undecided.

Now, if we want to change this by introducing a new property for the address size / ptrtoint size of a pointer, a key question for me would be: If ptrtoint doesn’t return the full integral value anymore, how can you obtain it? Is this only going to be possible by transmuting it, in the sense of storing the pointer to memory and reading it back as an integer?

I kind of wonder whether it might make sense to separate ptrtoint into two instructions. One is the current ptrtoint that returns the full integral value and exposes provenance, and the other is a ptrtoaddr that returns only the address portion and does not expose provenance either. CHERI would then use ptrtoaddr rather than ptrtoint. ptrtoaddr is also what we’d want to use for pointer subtraction, SCEV, etc.

My original plan was to add a flag to ptrtoint that indicates that it doesn’t expose provenance, but making it an entirely separate instruction might make more sense, given that this substantially alters semantics.


As for icmp, there again I believe the current upstream semantics are a comparison of the full value, not just some subset like the pointer index. icmp eq on pointers is the same as icmp eq on the ptrtoints, with the exception that provenance is not exposed.

Not really, it just means that pointers have additional hidden state that is not covered by icmp. This hidden state still exists even if pointers have capabilities because of things like noalias. Replacing one pointer with another, even if they have the same integral value, just generally isn’t safe.

I don’t think it’s really possible to define an operation that “compares” provenance. The only thing you can do is say that the icmp returns poison if the provenance does not match.

2 Likes

To probe some edges of the proposal:

%x = ptrtoaddr ptr addrspace(N) %p to iA , where A is the “address size” of a pN, returns the byte address that %p points to as an iA.

For “normal” pointers, ptrtoaddr %p === ptrtoint %p.

For an “annotated” pointer (has a stable bit pattern but doesn’t cover a flat address space), this produces value, which may or may not be a subset of the bits of the pointer?

There’s no corresponding addrtoptr - that’s named inttoptr. inttoptr(ptrtoaddr(%p)) == %p for “normal” pointers, but isn’t, in general, possible for an annotated (or worse, one of those unstable GC) pointer.

For AMDGPU’s resource types:

  • ptr addrspace(8), which is the LLVM representation of a hardware V#, ptrtoint returns a 128-bit value, but ptrtoaddr returns a 48 bit value, the low 48 bits of the ptrtoint return
  • ptr addrspace(7) is not a real hardware object, but a convenience wrapper around ptr addrspace(8) that tracks a 32-bit offset so that you can, say, GEP it. Its ptrtoint returns an i160. Its %x = ptrtoaddr ptr addrspace(7) %p would, if I understand the suggestion correctly,be implemented as
ptr addrspace(8) %r || i32 %o = lower_to_parts(ptr addrspace(7) %p)
%r.addr = ptrtoaddr ptr addrspace(8) %r to i48
%o.zext = zext i32 %o to i48
%x = add nuw nsw i48 %r.addr, %o.zext

This is a decently useful operation, especially since a sign extension of that %x can get you back to a non-fat pointer.

(We also have, in the SPIR-V-to-LLVM code downstream, a addrspace(9) which carries two separate offset fields whose ptrtoaddr would be a bit complicated to write out and involve a lot of bit twiddling to get things like the swizzle parameters)


Now, as to icmp eq on these pointers, if I have two ptr addrspace(8)s, I … don’t have any concrete reason to expect any particular comparison behavior. However, given that these descriptors were (and are) historically stored as a <4 x i32>, I’d think the least surprising semantics for comparing them is bitwise equality. I’d never expect the comparison to return poison, even if it’s comparing two resources that have the same base address but different flags.

For address space 7, since that’s meant to be used more like a pointer, I’d accept a “comparing two pointers with different V# parts is poison” semantics, though that feels a bit weird - and I’d be tempted to implement this as bitwise equality anyway - in the terms above, two pointers with different “provenance” are definitionally not equal, even if they point at the same place. But that’s also an arbitrary choice to make, so, in general … poison semantics for those sorts of comparisons feel like they wouldn’t break anything.

(That being said, @jayfoad as a proxy for the rest of the SPIR-V/graphics/… people to see if there’re any nuances I’m unaware of from the compute perspective that would make these comparison wrong)

And I’m not sure I agree that it’s impossible to define provenance comparisons? At the end of the day, there exists some hardware state that’s attached to one of these annotated/provanced/fat/… pointers, which consists of some number of bits, and those bits can be compared for equality. I’m not sure that’d be useful, but I’m curious as to why it can’t be done.

This is not true for pointers and cannot be true (and LLVM’s GVN pass is unsound due to that). See https://www.ralfj.de/blog/2020/12/14/provenance.html for an example.

So indeed all optimizations need to be carefully audited for this. That ship has sailed long ago when alias analysis started making use of “derived from” information.

How would you compile this to assembly, given that compiler-level provenance is not preserved? I don’t think such an operation can exist. The best one can do is “for pointers with equal address but different provenance, either result may be returned non-deterministically”.

(IOW, I agree with what @nikic wrote above.)

On CHERI platforms, provenance is preserved, which is the point of this discussion. If two pointers compare bitwise equal (including tag bits), they either have the same provenance or have followed isomorphic provenance chains and can be substituted. There is an instruction that does this comparison.

There are other instructions that compare addresses, but if two pointer have the same address they may have different bounds or different permissions and one may not actually be a valid pointer at all and will trap if you use it. This means that address equality is insufficient for substitution.

No. On CHERI platforms, there exists a runtime capability on top of the always-present compiler/language-level notion of provenance, but the compiler-level provenance is still a thing independently of CHERI capabilities, and two pointers with the same CHERI capabilities can have different compiler-level provenance (and vice versa). @arichardson correctly distinguished the two in the post I was replying to (they call both of them “provenance” though which might be confusing), and my reply explicitly talks only about compiler-level provenance, not CHERI capabilities / “HW provenance”.

Ptrtoint must give you the machine representation of the pointer, in exactly the same bits that the corresponding target backend would use to spill it to the stack during register allocation (or otherwise record it in some non-register context). If it doesn’t do that today for some targets, I think that’s a problem.

Embedding the C++ pointer provenance model in IR so deeply that it infects icmp sounds very bad to me. It doesn’t correlate with the reality of code as expressed in hardware, only with the C++ abstract machine, which needs to be lowered as language agnostic IR with sane semantics as best we can manage.

To give another motivating example (somewhat hypothetical, but I suspect we can look at graphics buffers like this) consider an architecture where the main pointer is [segment ID]:[offset] (both of which are, to save my fingers, bytes). So a pointer could look like %p = 1:0x80 or %q = 2:0x00

Then

%pi = ptrtoint %p to i16 ; => 0x0180
%qi = ptrtoint %q to i16 ; => 0x0200
%c = icmp eq %p, %q ; => poison, since this comparison being true/false depends on the base address of segments 1 and 2

So I think the semantics where ptrtoint returns just the address part aren’t a good idea, and we’d want ptrtoaddr for that.

Proposed resolution to the question

That is, I think we’ve had a decent slate of opinions and I’d like to summarize so that we can get objections before someone weaves this into the LangRef.

Semantics of ptrtoint

ptrtoint ptr addrspace(A) %p to iN, where N is the width of the pointer, returns a N-bit integer containing all the bits needed to recreate %p if it were written to memory. In the case of annotated pointers (which have stable bit representations but may carry additional metadata) this conversion includes such metadata. If A is a non-integral (in the GC sense) address space, the result of this operation is unstable.

If N is not the width of the pointer type of address space A, as given in the data layout, the integer representation of %p is sign-extended or truncated as required.

For an integral pointer, ptrtoint returns the address to which %p points to, and additions to this value are equivalent to getelementptr (aside from potentially breaking alias and escoape analysis). That is

%q = getelementptr i8, ptr addrspace(A) %p, iL %o

and

%pi = ptrtoint ptr addrspace(A) %p to iL
%qi = add iL %pi, %o
%q = inttoptr iL %qi to %q

can be rewritten to each other freely.

For a GC-type pointer, the compiler must not introduce inttoptr/ptrtoint pairs, as those values may be unstable.

For an annotated pointer, the compiler must not use integer arithmetic to manipulate the values of such pointers (since the arithmetic may disturb the metadata). [but may assume that inttoptr(ptrtoint(%p)) == %p - do we need this? Can we impose this? That’s the one uncertain bit.]

Pointer equality

In the case of integral pointers, icmp eq ptr addrspace(A) %a, %b implies that their ptrtoint results are equal.

In the case of annotated pointers, if the two pointers have the same metadata part, then they may be compared for equality, which will test if they point to the same location.

If the metadata portion of the pointers differs, the result of comparisons is poison. Targets may, following the usual principle that poison values can be substituted for any legal value of their type, define stricter semantics for such comparisons, but target-independent passes may not rely on them.

To give an example, suppose I have (simplifying the notation and omitting flags) two pointers into buffers %x = 0x1000:0x08 and %y = 0x1008:0x0, icmp eq ptr %x, %y is poison.

Note that, because of the possibility of poison return values, the implication icmp eq ptr %a, %b <=> icmp eq (ptrtoint %a), (ptrtoint %b) is only true for integral pointers.

Note also that this definition allows, but does not require, icmp eq on annotated pointers to compare the pointed-to locations and ignore metadata.

ptrtoaddr

In order to facilitate inspection of the address pointed to by a pointer, we introduce ptrtoaddr ptr addrspace(A) to iW, where W is the “address width” of the pointer, a new property. ptrtoaddr has the same width-correction semantics as ptrtoint

For integral pointers, the address width is equal to the pointer width, and ptrtoaddr behaves identically to ptrtoint

For GC pointers, ptrtoaddr is unstable just like ptrtoint.

For annotated pointers, ptrtoaddr computes the address of the location the pointer points to. This address is not required to exist within the pointer’s bit representation and may need to be computed.

If icmp eq ptr %a, %b is true, than icmp eq iW (ptrtoaddr %a) (ptrtoaddr %b) is true, but the converse is not true. Two annotated pointers may point to the same location and be unequal as pointers.

There is no inverse addrtoptr operation, as implementing such an operation requires target-specific metadata to be available.

This sentence is problematic. It implies that integers are not just integers.
If that was the case, then all arithmetic operations would have to define what they do with the extra metadata. And that’s a can of worms.
Also, I’m allowed to write a pointer to the screen and ask the user to type it back. It’s just an integer, but can be cast back to %p.

Integers must be integers only and ptrtoint does more than just returning an integer.

I think I left out an important bit there (and have now edited it in): N is the width of the pointer.

That is, ptrtoint includes and flags, provenance data, etc etc. that would need to get written out to memory if you were strong the pointer somewhere. And that integer is just an integer … but it’s not the pointer address and shouldn’t be manipulated as if it were

|ptrtoint ptr addrspace(A) %p to iN| returns a |N|-bit integer
containing all the bits needed to recreate |%p| if it were written to
memory.

This is probably a quibble over wording as opposed to semantics (but
since it’s a major part of semantics here, it does need to be called out
early and frequently): as written, this text implies that integers are
going to get provenance, which is definitely not right.

(Also, I don’t like that this text is written as if N is any size here
and that the clarification for non-pointer-width N is in the second
paragraph. I’d prefer a sentence that starts a lot more along the lines
of “The conversion of a pointer to its native integer width…” or
something like that.)

In the case of annotated pointers (which have stable bit
representations but may carry additional metadata) this conversion
includes such metadata. If |A| is a non-integral (in the GC sense)
address space, the result of this operation is unstable.

This makes me think we need a dedicated section for describing the
semantics of pointer types, rather than sort of haphazardly stringing
this information in different sections.

For an integral pointer, |ptrtoint| returns the address to which |%p|
points to, and additions to this value are equivalent to
|getelementptr| (aside from potentially breaking alias and escoape
analysis). That is

|%q = getelementptr i8, ptr addrspace(A) %p, iL %o |

and

|%pi = ptrtoint ptr addrspace(A) %p to iL %qi = add iL %pi, %o %q =
inttoptr iL %qi to %q |

can be rewritten to each other freely.

The “rewritten to each other freely” part isn’t true, though, with our
pointer provenance model, I thought–it’s only a one-way rewrite.

In the case of integral pointers, |icmp eq ptr addrspace(A) %a, %b|
implies that their |ptrtoint| results are equal.

In the case of annotated pointers, if the two pointers have the same
metadata part, then they may be compared for equality, which will test
if they point to the same location.

This doesn’t cover what happens non-integral pointers.