[RFC] Introducing the opaque pointer type

For background on opaque pointer types, see [1] and many other patches/threads searchable with “opaque pointers”.

While there’s been lots of work around making opaque pointers work, we don’t actually have a type like that in LLVM yet. https://reviews.llvm.org/D101704 introduces the opaque pointer type within LLVM so we can start playing around with it and see what goes wrong. Much of the patch above is based on TNorthover’s branch from a couple of years ago [2].

The opaque pointer type is essentially just a PointerType with a null pointee type. Calling getElementType() on an opaque pointer asserts.
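As a rough mental model only, here is a toy Python sketch of that idea. It is not LLVM's C++ `PointerType`; the class name `ToyPointerType` and its methods are hypothetical and exist only for illustration. The pointee slot is simply empty for an opaque pointer, and the element-type accessor refuses to answer in that case:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyPointerType:
    """Hypothetical stand-in for a pointer type; not LLVM's real class."""

    address_space: int = 0
    pointee: Optional[str] = None  # None models the "null pointee type"

    def is_opaque(self) -> bool:
        return self.pointee is None

    def element_type(self) -> str:
        # Asking an opaque pointer for its element type is treated as a
        # programming error, so fail loudly instead of returning a guess.
        assert not self.is_opaque(), "opaque pointer has no element type"
        return self.pointee


typed = ToyPointerType(pointee="i32")      # models a typed pointer to i32
opaque = ToyPointerType(address_space=5)   # models an opaque pointer in AS5
```

In this sketch the only thing an opaque pointer still knows about itself is its address space.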

Since the bitcode representation for non-opaque pointers contains the pointee type, we need a new bitcode type code for opaque pointers, which only contains the address space.

For the textual IR representation, the current proposal is to represent an opaque pointer type with “ptr” with an optional “addrspace(N)”. This seems consistent with existing uses of “addrspace(N)” and “ptr” seems right.
There are a couple of alternatives. TNorthover’s version uses “pN” where “N” is the address space, so most pointers would be “p0”, and a pointer in address space #5 would be “p5”. I initially attempted something like “ptr(N)”, but the spelling is slightly ambiguous with function types. We could also simply use a void pointer, which LLVM currently does not allow [3].
Feel free to bikeshed.
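
To make the two main spellings easier to compare side by side, here is a small Python sketch. The function names are made up, and it assumes that the “optional” `addrspace(N)` is omitted only for address space 0, which is an assumption about the proposal rather than something stated above:

```python
def spell_ptr_addrspace(address_space: int = 0) -> str:
    """Proposed spelling: `ptr`, plus `addrspace(N)` for non-zero N (assumed)."""
    if address_space == 0:
        return "ptr"
    return f"ptr addrspace({address_space})"


def spell_pn(address_space: int = 0) -> str:
    """Alternative spelling from TNorthover's branch: always `pN`."""
    return f"p{address_space}"
```

Under these assumptions the default address space comes out as `ptr` versus `p0`, and address space 5 as `ptr addrspace(5)` versus `p5`.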

Hi Arthur,

Thank you for doing this, and the approach seems largely good to me, except for one important point: We’ve been moving steadily towards making addrspace 0 be non-special for a long time now, so I strongly prefer a spelling that always has an address space. I don’t care too much about the exact spelling, pN and ptr(N) both seem fine to me assuming technical issues can be sorted out. pN has the benefit of already being used in codegen contexts, so count that as a mild preference for that spelling.

Cheers,
Nicolai

There are many other places in the textual IR where we use the “addrspace(N)” syntax – and AFAIK they all default to 0 right now. So my first inclination would be to agree with Arthur that it’s a shame to have this syntax diverge from that. But – do you have plans to change the behavior of those other contexts in the future?

Somebody pointed out to me that there’s very little actual documentation on opaque pointer types. I’ll try to write up some documentation so that the motivation and tradeoffs can be better discussed.

> There are many other places in the textual IR where we use the “addrspace(N)” syntax – and AFAIK they all default to 0 right now. So my first inclination would be to agree with Arthur that it’s a shame to have this syntax diverge from that. But – do you have plans to change the behavior of those other contexts in the future?

+1 from somebody not super familiar with address spaces.

I think requiring an address space would be too confusing for a majority of use
cases. Would it help if instead of defaulting to 0, the default address space
was target dependent?

- Tom

For CHERI targets, the default address space is ABI dependent: AS0 is a 64-bit integer that's relative to the default data capability, AS200 is a 128-bit capability (on 64-bit platforms). It can also differ between code, heap, and stack.

If this is purely a syntactic thing in the text serialisation, would it be possible to put something in the DataLayout that is ignored by everything except the pretty-printer / parser?

David

Could you give an example?

Also, perhaps we should separate the opaque pointer types transition from any changes to address spaces. Currently the proposal is basically unchanged from the current status quo in terms of pointer address spaces. We definitely should have a “default” pointer type in some shape or form which is represented by “ptr”, or else writing IR tests is too cumbersome. Currently that means AS0, but we can change that in the future if we want independently of opaque pointers.

+1 to this - pointers already carry their address space with explicit
syntax and I think it's OK to do that for this transition. Though I
wouldn't be opposed to a change in the future to roll it into the
pointer type name if that seems suitable.

- Dave

If there’s a larger effort to change how address spaces work then I’d be happy to change the representation, since mass updating tests once is better than twice. But I’m worried that this may start becoming intertwined with more address space work, and the opaque pointers project has gone on long enough (like many other LLVM projects).

And of course, there’s always time before we do mass test updates to easily change the textual representation.

I agree. I think it would be a mistake to add an unnecessary difference vs. typed pointers along some other axis (address space, or otherwise). Opaque pointers have enough of their own challenges to solve.

I am very much a beginner with opaque pointers, but I am also a minimalist, in the sense that entities shouldn't be multiplied but rather divided where applicable.

Can someone point me to article(s) describing what problems opaque pointers solve that can't be solved with forward declarations, typed pointers, etc.?

My first gut feeling, when learning about the idea of opaque pointers, was that they're not much more than void*, with all its issues from the perspective of static analysis, compiler design, code readability, code quality, and code security. Can someone correct a newbie? I'm very open to changing my mind.

-Pawel

On Tue, 11 May 2021, 02:35, Duncan P. N. Exon Smith via llvm-dev <llvm-dev@lists.llvm.org> wrote:

> My first gut feeling, when learning about the idea of opaque pointers, was that they're not much more than void*

Yep, that’s basically what they are. Though this is only relative to the IR design, not source language design.

> with all its issues from the perspective of static analysis, compiler design, code readability, code quality, and code security. Can someone correct a newbie? I'm very open to changing my mind.

LLVM doesn’t provide any guarantees about pointer types. C++, by contrast, has type-based aliasing guarantees about pointers: roughly, if you have an int* you know it can’t point to the same object as a float*. That property doesn’t hold in LLVM IR. The information can be carried separately in type-based alias analysis (TBAA) metadata, but it’s not inherent in the IR pointers themselves. So the pointee type provides limited value (it’s somewhat useful for frontends generating IR to have some intended type information carried around as the IR is being constructed) and it inhibits optimizations: converting between pointer types involves instructions (GEPs or bitcasts) that optimizations have to know to skip over or look through.

So instead, we’re moving to a model where pointers don’t have a type (since it’s not informative to optimizations anyway) - and operations carry type information (instead of “load from this int pointer” it’ll be “load an integer from this opaque pointer”).

If you look at LLVM IR today, you’ll see these explicit types on operations. For example, the load instruction has an explicit type parameter, which currently looks redundant with the type of the pointer operand passed to it. In the future that pointer operand won’t carry any pointee type information, and the load will rely entirely on its explicit type parameter.
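
A loose analogy in Python may help here. This is not LLVM IR, and the helper names are invented for illustration. Memory is treated as untyped bytes, the “pointer” is just an offset with no pointee type attached, and each load states which type it wants to read:

```python
import struct

memory = bytearray(8)  # untyped storage
ptr = 4                # an "opaque pointer": only an address, no pointee type


def store_i32(mem: bytearray, address: int, value: int) -> None:
    """Store a little-endian signed 32-bit integer at `address`."""
    struct.pack_into("<i", mem, address, value)


def load_i32(mem: bytearray, address: int) -> int:
    """Load a little-endian signed 32-bit integer; the load names the type."""
    return struct.unpack_from("<i", mem, address)[0]


def load_f32(mem: bytearray, address: int) -> float:
    """Load a little-endian 32-bit float from the very same address."""
    return struct.unpack_from("<f", mem, address)[0]


store_i32(memory, ptr, 0x3F800000)  # bit pattern of the float 1.0
```

Here `load_i32(memory, ptr)` and `load_f32(memory, ptr)` read the same four bytes through the same untyped address. No cast of the pointer is needed, because the type lives on the operation.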

- Dave

OK, cool. If that makes LLVM better, that's cool with me. Just don't spread it to the language spec. One void* issue in a programming-language spec is more than enough trouble from the perspective of a dude working on static analysis and the other topics mentioned.

-Pawel

On Tue, 11 May 2021, 09:20, David Blaikie <dblaikie@gmail.com> wrote:

> Could you give an example?

An example of what?

> Also, perhaps we should separate the opaque pointer types transition from any changes to address spaces. Currently the proposal is basically unchanged from the current status quo in terms of pointer address spaces. We definitely should have a "default" pointer type in some shape or form which is represented by "ptr", or else writing IR tests is too cumbersome. Currently that means AS0, but we can change that in the future if we want independently of opaque pointers.

I agree that doing this incrementally is probably the right thing, but I disagree on the tests side. If we used a p{address space} notation then writing p0 is less to type than ptr, so writing tests that want AS0 is less effort and writing tests that want another address space is even less effort than writing `ptr addrspace(42)`.

David

There are a few problems with the current representation, and they largely mirror the old problem with signed vs unsigned integers in the IR 15 years ago. In early versions of LLVM, integers were explicitly signed. This meant that the IR was cluttered with bitcasts from signed to unsigned integers, which slowed down analysis and didn't convey any useful semantics. Worse, there were a bunch of things conflated: for example, does unsigned imply wrapping? Some time in the 2.x series (2.0? My memory is fuzzy here), LLVM moved to just i{size} types for integers and moved all of the semantics to the operations. It's now explicit whether an operation is signed or unsigned, whether overflow wraps or has undefined behaviour, and so on.
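
The integer half of that analogy can be sketched in Python. The helper names are hypothetical, and this only models 32-bit two's-complement arithmetic, not LLVM itself. A single bit pattern has no signedness of its own; the division operation decides how to interpret it:

```python
MASK32 = 0xFFFFFFFF


def as_signed(bits: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def udiv(a: int, b: int) -> int:
    """Unsigned division on 32-bit patterns."""
    return ((a & MASK32) // (b & MASK32)) & MASK32


def sdiv(a: int, b: int) -> int:
    """Signed division, truncating toward zero, returned as a 32-bit pattern."""
    x, y = as_signed(a), as_signed(b)
    q = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        q = -q
    return q & MASK32


bits = 0xFFFFFFF8  # one 32-bit value: 4294967288 if unsigned, -8 if signed
```

With these definitions, `udiv(bits, 2)` gives `0x7FFFFFFC` while `sdiv(bits, 2)` gives `0xFFFFFFFC` (that is, -4). The value's type is the same in both cases; only the operation differs.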

Pointers have a similar set of problems. Pointers carry a type, but that type doesn't actually carry any semantics. There are a lot of things that don't care about the type of the pointer, but they have no way of specifying this and generally use i8*. This means that the IR is full of bitcasts from {something}* to i8* and then back again.

This is particularly important for code that wants to use non-zero address spaces, because a lot of code does casts via i8* and forgets to change this to i8*-in-another-address-space.

The fact that a pointer is a pointer to some struct type currently doesn't imply anything about the pointed-to data, and it's completely valid to bitcast a pointer to a random type and back again in an optimisation. The real type info (where applicable) is carried by TBAA metadata, dereferenceability info by attributes, and so on.

TL;DR: The pointee type has no (or worse, misleading) semantics and forces a load of bitcasts. Opaque pointers remove this.

David

OK, cool. I'm starting to understand now. Thank you.

-Pawel

On Tue, 11 May 2021, 11:19, David Chisnall via llvm-dev <llvm-dev@lists.llvm.org> wrote:

A quick doc on opaque pointers: https://reviews.llvm.org/D102292