[RFC] Introduce sentinel pointer value to `DataLayout`

The value of a null pointer is not always 0. For example, on AMDGPU, the null pointer in address spaces 3 and 5 is 0xffffffff. Currently, there is no target-independent way to get this information, making it difficult and error-prone to handle null pointers in target-agnostic code.

We do have ConstantPointerNull, but its name is misleading: it represents a pointer with an all-zero value, which is not necessarily the real nullptr.

PR#131557 introduces the concept of a sentinel pointer value to DataLayout, representing the actual nullptr value for a given address space. The changes include:

  • A new interface function:

    APInt getSentinelPointerValue(unsigned AS)
    

    This function returns an APInt representing the sentinel pointer value for the given address space AS. An APInt is used instead of a literal integer to support cases where pointers are wider than 64 bits (e.g., AMDGPU’s address space 8).

  • An extension to the data layout string format:

    p[n]:<size>:<abi>[:<pref>[:<idx>[:<sentinel>]]]
    

    The new <sentinel> component specifies the sentinel value for the corresponding pointer. It currently supports two values:

    • 0 for an all-zero value
    • f for a full-bit set value

    These two values are the most common representations of nullptr. It is unlikely that any target would define nullptr as a random value.
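As a rough illustration of the two pieces above, here is a minimal, self-contained sketch that parses one pointer entry of the extended layout string and computes the corresponding sentinel bits. It does not use LLVM: `PointerSpec`, `parsePointerSpec`, and `sentinelWords` are hypothetical names made up for this sketch, and a vector of 64-bit words stands in for the APInt that getSentinelPointerValue would return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one pointer entry of a data layout string.
// This is NOT LLVM's real data structure.
struct PointerSpec {
  unsigned AddrSpace = 0;
  unsigned SizeInBits = 0;
  bool AllOnesSentinel = false; // "0" -> false, "f" -> true
};

// Parses a short, non-empty decimal number; rejects anything else.
static bool parseUInt(const std::string &S, unsigned &Out) {
  if (S.empty() || S.size() > 9)
    return false;
  unsigned V = 0;
  for (char C : S) {
    if (C < '0' || C > '9')
      return false;
    V = V * 10 + static_cast<unsigned>(C - '0');
  }
  Out = V;
  return true;
}

// Parses "p[n]:<size>:<abi>[:<pref>[:<idx>[:<sentinel>]]]".
// Only the address space, size, and sentinel are kept; the alignment and
// index fields are checked to be numeric and then ignored.
bool parsePointerSpec(const std::string &Spec, PointerSpec &Out) {
  std::vector<std::string> Fields;
  std::stringstream SS(Spec);
  for (std::string F; std::getline(SS, F, ':');)
    Fields.push_back(F);
  if (Fields.size() < 3 || Fields.size() > 6 || Fields[0].empty() ||
      Fields[0][0] != 'p')
    return false;

  PointerSpec PS;
  std::string AS = Fields[0].substr(1);
  if (!AS.empty() && !parseUInt(AS, PS.AddrSpace))
    return false;
  if (!parseUInt(Fields[1], PS.SizeInBits) || PS.SizeInBits == 0)
    return false;
  // Fields 2..4 are <abi>, <pref>, <idx>; field 5 is the sentinel.
  for (std::size_t I = 2; I < Fields.size() && I < 5; ++I) {
    unsigned Ignored = 0;
    if (!parseUInt(Fields[I], Ignored))
      return false;
  }
  if (Fields.size() == 6) {
    if (Fields[5] == "f")
      PS.AllOnesSentinel = true;
    else if (Fields[5] != "0")
      return false;
  }
  Out = PS;
  return true;
}

// Returns the sentinel as little-endian 64-bit words, so pointers wider
// than 64 bits are representable (the role APInt plays in the proposal).
std::vector<uint64_t> sentinelWords(const PointerSpec &PS) {
  std::vector<uint64_t> Words((PS.SizeInBits + 63) / 64,
                              PS.AllOnesSentinel ? ~uint64_t(0) : 0);
  unsigned Rem = PS.SizeInBits % 64;
  if (PS.AllOnesSentinel && Rem != 0)
    Words.back() &= (uint64_t(1) << Rem) - 1; // clear bits above the width
  return Words;
}
```

With this sketch, a made-up entry such as `p3:32:32:32:32:f` yields the single word 0xffffffff, and an entry without the sixth field falls back to the all-zero sentinel.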

A follow-up patch series will introduce an equivalent of ConstantPointerNull that represents the actual nullptr, built on top of PR#131557.

Could you please provide some more context on the ways in which this new “sentinel value” concept is going to be used? Maybe some examples of where/how you plan to use it?


It’s probably worth noting that, due to LLVM’s allocated object rules, only a zero-size object may be located at address -1, regardless of address space. So in a sense, it’s always a sentinel.

The problem is we aren’t getting any null check optimizations on non-0 address spaces. The non-0 null pointer values are just a subset of the problem.

This impacts direct checks on pointers derived from alloca, globals, and arguments marked with nonnull. Issue #58617 in llvm/llvm-project ("Comparison of object against addrspacecasted null pointer are not optimized out") is one example. We want to opt in to normal null handling for most address spaces (except for 3 of them where we want the -1 value to behave identically and respect all the known-non-null annotations).

I think this part is straightforward and should be uncontroversial.

Are you saying that nonnull/!nonnull are going to change meaning to “not equal to the sentinel pointer”? Currently these mean that the integral value is not zero.

Generally, what I’m missing in this RFC is a specification of what the “sentinel pointer” actually implies in terms of semantics and how it interacts with existing IR constructs, or what new ones it introduces.

The interaction with nonnull is one of the things that should be clarified. Also the interaction with null_pointer_is_defined – should that become sentinel_pointer_is_defined? Also, how is this sentinel value going to be spelled in IR (i.e. what is the equivalent to ptr null in the default address space)?

Are there any other special semantics? For example, if I take a sentinel value in one address space and perform an address space cast to a different address space, will I also get back the sentinel pointer in that address space? So e.g. addrspacecast (ptr null to ptr addrspace(N)) would be guaranteed to result in ptr addrspace(N) sentinel (or whatever the spelling is)?

FWIW, I work on an embedded compiler where null pointers are intentionally neither 0 nor all ones.

For my stuff, I’ll still have to use custom null pointer logic as my environment has some other complexities, such as only some bits contributing to nullness depending on context, and global pointers that can’t be represented as LLVM Constants because of the bit shifts, masking, and XORs required for segmented pointers. For example, I might have to treat FDFDxxxx and FFFF0000FxxxFxxx and Fxxxxxxx0000FFFF as null pointers all pointing to the same address space.
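To make the "only some bits contribute to nullness" point concrete, here is a small sketch of a mask-based null test. It assumes that each `x` in the patterns above is a don't-care hex digit and that the 8-digit and 16-digit patterns belong to 32-bit and 64-bit pointers respectively; both are guesses about the post, not facts about that compiler. `NullPattern`, `examplePatterns`, and `isNullLike` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one "null-like" bit pattern: a pointer of the
// given width is treated as null when (Value & Mask) == Bits.
struct NullPattern {
  unsigned WidthInBits;
  uint64_t Mask;
  uint64_t Bits;
};

// Patterns transcribed from the post, ASSUMING each 'x' is a don't-care
// hex digit: FDFDxxxx, FFFF0000FxxxFxxx, Fxxxxxxx0000FFFF.
inline std::vector<NullPattern> examplePatterns() {
  return {
      {32, 0xFFFF0000ull, 0xFDFD0000ull},
      {64, 0xFFFFFFFFF000F000ull, 0xFFFF0000F000F000ull},
      {64, 0xF0000000FFFFFFFFull, 0xF00000000000FFFFull},
  };
}

// Returns true if the pointer bits match any pattern of the same width.
inline bool isNullLike(unsigned WidthInBits, uint64_t Value,
                       const std::vector<NullPattern> &Patterns) {
  for (const NullPattern &P : Patterns)
    if (P.WidthInBits == WidthInBits && (Value & P.Mask) == P.Bits)
      return true;
  return false;
}
```

A single per-address-space sentinel value cannot express a set of masked patterns like this, which is why such a target would still need its own null logic.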

Thanks for the feedback. I need to think through the concern regarding the interaction with nonnull and null_pointer_is_defined.

Also, how is this sentinel value going to be spelled in IR (i.e., what is the equivalent to ptr null in the default address space)?

This will be proposed in a follow-up RFC. We don’t want to change existing IR semantics, so we plan to introduce a new representation:

ptr [addrspace(N)] nullptr

This would explicitly denote the sentinel pointer in address space N. Meanwhile, ptr addrspace(N) null will continue to represent an all-zero pointer value in address space N.

I understand that ptr nullptr might be confusing. Any suggestions for an alternative spelling for the new representation would be greatly appreciated.

So, e.g., addrspacecast (ptr null to ptr addrspace(N)) would be guaranteed to result in ptr addrspace(N) sentinel (or whatever the spelling is)?

Yes, that’s the plan. Currently, addrspacecast (ptr null to ptr addrspace(N)) cannot be folded, but with the introduction of sentinel pointers, it will be possible to fold it.
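A toy model of that fold, outside of LLVM, might look like the following. The types and functions (`AddrSpaceInfo`, `Layout`, `PtrConst`, `nullBits`, `foldAddrSpaceCast`) are hypothetical and only model the idea: once every address space has a known null value, a cast of null can be folded to the destination's null, while a cast of any other constant is left alone because it needs target knowledge.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical per-address-space description; not an LLVM type.
struct AddrSpaceInfo {
  unsigned WidthInBits; // at most 64 in this sketch
  bool NullIsAllOnes;
};

using Layout = std::map<unsigned, AddrSpaceInfo>;

// A constant pointer modeled as raw bits in some address space.
struct PtrConst {
  unsigned AddrSpace;
  uint64_t Bits;
};

// Bit pattern of the null pointer for one address space.
inline uint64_t nullBits(const AddrSpaceInfo &Info) {
  if (!Info.NullIsAllOnes)
    return 0;
  return Info.WidthInBits >= 64 ? ~uint64_t(0)
                                : (uint64_t(1) << Info.WidthInBits) - 1;
}

// Tries to fold a cast of Src into DstAS. Only null-to-null is folded;
// any other value is refused because its mapping is target-specific.
inline bool foldAddrSpaceCast(const Layout &L, PtrConst Src, unsigned DstAS,
                              PtrConst &Out) {
  auto SrcIt = L.find(Src.AddrSpace);
  auto DstIt = L.find(DstAS);
  if (SrcIt == L.end() || DstIt == L.end())
    return false;
  if (Src.Bits != nullBits(SrcIt->second))
    return false;
  Out = PtrConst{DstAS, nullBits(DstIt->second)};
  return true;
}
```

With an AMDGPU-like table (address space 0 is 64-bit with a zero null, address space 3 is 32-bit with an all-ones null), casting the null of address space 0 to address space 3 folds to 0xffffffff, whereas casting the all-zero value of address space 3 is not folded because zero is not null there.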

Thanks for the information. That’s good to know. Maybe we could extend this to support arbitrary values, or at least provide a way to indicate that a target has no regular sentinel value, so that callers know they can’t rely on getSentinelPointerValue.

Having ptr null be the same thing as ptr nullptr on some targets and different on others should be expected to produce a steady stream of inane bugs.

Yes, the naming leaves a lot to be desired. But in the common case, we can just fold ptr nullptr to ptr null.

I think this either needs to use a more distinct name (ptr sentinel instead of ptr nullptr), or we should actually change the semantics of ptr null to no longer refer to the “all zero pointer”. Having both ptr null and ptr nullptr is too confusing.

I agree that having both ptr null and ptr nullptr is confusing. I’d prefer if we could have ptr null be the canonical nullptr representation for that address space and no longer assume that it is zero in every address space.
Since there are only a handful of address spaces that use a non-zero null pointer, would it be possible to handle this in autoupgrade and change existing ptr addrspace(n) null to ptr addrspace(n) zeroinitializer if the datalayout is being upgraded to have a sentinel pointer value?

Alternatively, we could deprecate ptr null and only use ptr nullptr/ptr zeroinitializer for canonical null/zero?

Regarding the interaction with nonnull, dereferenceable_or_null, and null_pointer_is_valid, it depends on how much we want to preserve the current semantics of null, which currently represents zero in LLVM.

I think @arichardson made a great point. Rather than introducing a new concept called “sentinel pointer”, using nullptr and deprecating null would make things clearer and less confusing.

Proposed Changes

  • A new IR representation ptr addrspace(n) nullptr will be introduced to represent the actual or canonical nullptr.
  • A new class ConstantNullPointer will be introduced.
  • ConstantPointerNull will likely be kept for now, based on feedback from previous discussions.
    • It will correspond to ptr addrspace(n) zeroinitializer.
    • All existing ptr addrspace(n) null will be auto-upgraded to ptr addrspace(n) zeroinitializer.
    • However, we will update its usage wherever the semantics allow.

Interaction with metadata

nonnull and dereferenceable_or_null will be replaced with nonnullptr and dereferenceable_or_nullptr, respectively.

My experience and knowledge in this area are fairly limited, but my understanding is that the key concern behind these attributes is probably not whether the pointer literally holds a zero value but whether it implies the actual nullptr. If nullptr in address space N is not zero, does it really matter whether a pointer is literally zero? We probably care more about whether it is actually a nullptr in that context.

Interaction with attributes

null_pointer_is_valid is a bit trickier. Based on my understanding, it only applies to address space 0 and is specifically used for the null address. I think we should keep the name but adjust its semantics to refer to nullptr instead of null.

I expect this change will not have any actual effect on existing code because:

  1. We always initialize the pointer specification for address space 0, and the default sentinel value is 0 unless it is overridden by the data layout string.
  2. All existing upstream LLVM targets currently use 0 for nullptr.

Handling Legacy Attributes

If I remember correctly, we recently removed nocapture and replaced it with captures(...). How do we currently handle cases where we encounter the old nocapture attribute? @nikic

An “Easier” Alternative

The proposal above is to avoid confusion by replacing a “deceiving” terminology with clearer alternatives. However, as @nikic and @arichardson pointed out, we could also modify the semantics of existing terms instead of introducing new ones.

A more straightforward approach would be to redefine the meaning of null pointer across LLVM to represent the actual nullptr in its corresponding address space, while still keeping the null spelling.

We will still introduce the new nullptr representation in DataLayout, ensuring each address space has a well-defined nullptr value, but we modify the semantics of null to match the nullptr value defined for each address space.

This is an enhancement to the existing approach and doesn’t make things worse, even if we redefine null to mean nullptr. In most places, LLVM already avoids assuming null is always 0, though there are exceptions, such as this bug.

For handling ConstantPointerNull, we first replace all existing uses of ConstantPointerNull with Constant::getNull(PtrTy), and then put back ConstantPointerNull only in contexts where the pointer is not intended to represent a literal zero.

After making this change, we can safely assume that null represents nullptr in all contexts, and use it for further development.

What do you think? @arsenm @arichardson @nikic

How does this interact with zeroinitializer? That’s generally used for default initialised globals, but C rules at least state that pointers are null-initialised, not zero-initialised. Those are just conveniently typically the same. Is there the expectation that zeroinitializer is really nullinitializer for pointers?

C rules at least state that pointers are null-initialised, not zero-initialised

There are a few different cases where null-initialization is not zero-initialization. Null pointers in exotic address spaces, as discussed in this RFC, but also C++ member pointers. clang already handles this, and does not generate zeroinitializer. (See CodeGenTypes::isZeroInitializable).

There is no confusion about zeroinitializer—it explicitly indicates that we want a zero value. With this RFC, the frontend needs to lower things correctly. When a nullptr is required, it should lower to ptr addrspace(n) null (or whatever representation we decide for the actual nullptr). When a zero-value pointer is needed, it should use ptr addrspace(n) zeroinitializer.
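The distinction between null-initialization and zero-initialization can be sketched with a toy type model. This is only loosely analogous to the check clang performs in CodeGenTypes::isZeroInitializable and is not its implementation; `Field`, `isZeroInitializable`, and `nullInitImage` are hypothetical names here, and padding is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical field description for a flat struct without padding.
struct Field {
  bool IsPointer;
  unsigned SizeInBytes;
  bool NullIsAllOnes; // only meaningful when IsPointer is true
};

// True when "all bytes zero" is also a valid null-initialization, i.e. no
// pointer field lives in an address space with a non-zero null.
inline bool isZeroInitializable(const std::vector<Field> &Fields) {
  for (const Field &F : Fields)
    if (F.IsPointer && F.NullIsAllOnes)
      return false;
  return true;
}

// Byte image a frontend would have to emit for default initialization:
// integers become zero, pointers become their address space's null.
inline std::vector<uint8_t> nullInitImage(const std::vector<Field> &Fields) {
  std::vector<uint8_t> Bytes;
  for (const Field &F : Fields) {
    uint8_t Fill = (F.IsPointer && F.NullIsAllOnes) ? 0xFF : 0x00;
    Bytes.insert(Bytes.end(), F.SizeInBytes, Fill);
  }
  return Bytes;
}
```

In this model, a struct holding an i32 and a 32-bit pointer with an all-ones null is not zero-initializable, so the frontend would have to spell out the initializer instead of emitting zeroinitializer.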

Hey folks, any new feedback about my updated proposal in here?

@arsenm @nikic @JonChesterfield @jrtc27 @efriedma-quic

My general inclination is to go with the “easier” variant, that is to change the semantics of existing constructs rather than renaming them. At least when looking at the final outcome, we’re not really gaining anything by doing renames like nonnull → nonnullptr, but introducing a good bit of churn. Especially as the new semantics are only relevant with a new (not yet used) data layout, code can be migrated incrementally.

ConstantPointerNull has ~260 mentions in llvm/, but from a quick survey only a small number of them will require adjustment to more strongly distinguish zero vs null.

What I am more concerned about in terms of migration is various other APIs like Constant::isNullValue() and Constant::getNullValue(). These should probably be split into Constant::isZeroValue() and Constant::isNullPtrValue(), as these are no longer the same.

However, a problem here is that the isZeroValue() query requires DataLayout to determine whether a null pointer has value zero or not, and getting DL into all the necessary places can be a pain. Though probably a lot of the uses are only actually working on integers.
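A miniature model shows why the two queries diverge and why only one of them needs the layout. `MiniDataLayout` and `MiniConstant` are hypothetical stand-ins, not LLVM classes; the method names simply mirror the split suggested above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical miniature of DataLayout: records which address spaces use an
// all-ones null. Address spaces that are not listed use a zero null.
struct MiniDataLayout {
  std::map<unsigned, bool> NullIsAllOnes;

  bool nullIsZero(unsigned AS) const {
    auto It = NullIsAllOnes.find(AS);
    return It == NullIsAllOnes.end() || !It->second;
  }
};

// Hypothetical miniature of a constant: either an integer or the null
// pointer of some address space.
struct MiniConstant {
  enum KindTy { Integer, NullPointer } Kind;
  uint64_t IntValue;  // used when Kind == Integer
  unsigned AddrSpace; // used when Kind == NullPointer

  // Answerable from the constant alone.
  bool isNullPtrValue() const { return Kind == NullPointer; }

  // Integers need no layout, but a null pointer is only "zero" if the
  // layout says the null of its address space is all-zero.
  bool isZeroValue(const MiniDataLayout &DL) const {
    if (Kind == Integer)
      return IntValue == 0;
    return DL.nullIsZero(AddrSpace);
  }
};
```

In this model the integer path never consults the layout, which matches the observation that many existing callers only work on integers.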

The old nocapture attribute is upgraded during bitcode reading: see llvm/lib/Bitcode/Reader/BitcodeReader.cpp in llvm/llvm-project at commit fd800487382e2ee3944493d58a961b6f48290243.
As this is a common attribute, it is also upgraded during IR parsing (this is generally not required): see llvm/lib/AsmParser/LLParser.cpp at the same commit.


Probably worth noting that “allow specifying that zero is the null pointer in a non-default address space” and “allow specifying that there is a null pointer other than zero” do not necessarily have to be done at the same time. The former is a lot simpler than the latter.

A quick update on my side. I’m still working on this. The current approach is to change the semantics of ptr null.

IR

  • ptr addrspace(N) null will represent the actual nullptr in address space N.
  • ptr addrspace(N) zeroinitializer will represent a zero-value pointer in address space N. Currently, it looks like ptr addrspace(N) zeroinitializer is turned into ptr addrspace(N) null, so this needs to be changed.

API

  • Constant::getNullValue will return a null value. It is the same as the current semantics except for PointerType, for which it will return a real nullptr constant. That is, a real nullptr corresponds to a ConstantPointerNull, and a zero-value pointer corresponds to a ConstantExpr (effectively inttoptr i8 0 to ptr addrspace(N)).
  • Constant::getZeroValue will return a zero value constant. It is exactly the same as the current semantics.
  • Correspondingly, there will be both Constant::isNullValue and Constant::isZeroValue.

Currently, I’m still trying to figure out how to handle ConstantAggregateZero, since it explicitly says “zero” in its class name, so changing its semantics will definitely not work. Do we want to create a ConstantAggregateNull? If we do, then ConstantAggregateNull will be the same as ConstantAggregateZero except for pointer types.

These changes make sense to me. The only thing we might need to do is rename Constant::isNullValue to trigger compiler errors, so that we don’t accidentally change the meaning of code that assumes this function checks for zero-ness. For Constant::getNullValue() this seems less concerning, but it is probably worth deprecating it and adding something like a new Constant::getNullptr().

Isn’t ConstantAggregateZero used for zeroinitializer in LLVM IR? I would imagine this does not need to change.

If you want an array of “non-zero null pointers”, I assume you’d just need to expand it out? If this has significant compile-time impact, we could consider adding a new nullinitializer?