RFC: implicit null checks in LLVM

Hi all,

I would like to propose a mechanism that would allow LLVM to fold null
pointer checks into "nearby" memory operations, subject to runtime
support. This is related to but not exactly the same as a proposal
floated by Peter Collingbourne earlier [1]. The obvious use cases are
managed languages like Java, C# and Go that require a null check on
pointers before they're used in certain ways (loads, stores, virtual
dispatches etc.). I'm sure there are other less obvious and more
interesting use cases.

I plan to break the design into two parts, roughly following the
statepoint philosophy:

# invokable @llvm.(load|store)_with_trap intrinsics

We introduce two new intrinsic families

  T @llvm.load_with_trap(T*) [modulo name mangling]
  void @llvm.store_with_trap(T, T*) [modulo name mangling]

They cannot be `call`ed, they can only be `invoke`d.

Semantically, they try to load from or store to the pointer passed to
them as normal load / store instructions do. @llvm.load_with_trap
returns the loaded value on the normal return path. If the load or
store traps then they dispatch to their unwind destination. The
landingpad for the unwind destination can only be a cleanup
landingpad, and the result of the landingpad instruction itself is
always undef. The personality function in the landingpad instruction
is ignored.

These intrinsics require support from the language runtime to work.
During code generation, the invokes are lowered into normal load or
store instructions, followed by a branch (explicit `jmp` or just
fall-through) to the normal destination. The PC for the unwind
destination is recorded in a side-table along with the PC for the load
or store. When a load or store traps or segfaults at runtime, the
runtime searches this table to see if the trap is from a PC recorded
in the side-table. If so, the runtime jumps to the unwind
destination; otherwise it aborts.
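
As a rough sketch, the runtime side could look something like the
following on Linux/x86-64. The table layout, symbol names and the
install function are all made up for illustration; the only real APIs
used are sigaction and the glibc ucontext accessors:

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>
  #include <ucontext.h>

  /* Hypothetical layout of one side-table entry. */
  typedef struct {
    uintptr_t faulting_pc; /* PC of the lowered load or store      */
    uintptr_t unwind_pc;   /* PC of the unwind destination's block */
  } NullCheckEntry;

  /* Assumed to be emitted by the compiler and located by the runtime. */
  extern NullCheckEntry __implicit_null_check_table[];
  extern size_t __implicit_null_check_table_size;

  static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    ucontext_t *uc = (ucontext_t *)ctx;
    uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    for (size_t i = 0; i < __implicit_null_check_table_size; i++) {
      if (__implicit_null_check_table[i].faulting_pc == pc) {
        /* The fault came from a recorded load/store: resume execution
           at its unwind destination instead of crashing.              */
        uc->uc_mcontext.gregs[REG_RIP] =
            (greg_t)__implicit_null_check_table[i].unwind_pc;
        return;
      }
    }
    abort(); /* not one of ours: a genuine segfault */
  }

  void install_implicit_null_check_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
  }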

Note that the signal handler / runtime do not themselves raise
exceptions at the ABI level (though they do so conceptually), but the
landing pad block can raise one if it wants to.

The table mapping load/store PCs to unwind PCs can be reported to the
language runtime via an __llvm_stackmaps like section. I am strongly
in favor of making this section as easy to parse as possible.

# optimization pass to create invokes to @llvm.(load|store)_with_trap

With the @llvm.(load|store)_with_trap intrinsics in place, we can
write an LLVM pass that folds null checks into nearby memory
operations on that same pointer. As an example, we can turn

  // r0 is a local register
  if (p != null) {
    r0 += 5;
    *(p + 16) = 42;
    ...
  } else {
    throw_NullPointerException();
    unreachable;
  }

into

  // r0 is a local register
  r0 += 5;
  invoke @llvm.store_with_trap(42, p + 16) to label %ok, unwind label %not_ok

not_ok:
  %unused = landingpad .....
  throw_NullPointerException();
  unreachable;

ok:
  ...

A slight subtlety here is that the store to (p + 16) can trap (and we
can branch to not_ok) even if p is not null. However, in that case
the store would have happened in the original program anyway, and the
behavior of the original program is undefined.

A prerequisite for this optimization is that in the address space we
operate on, loads from and stores to pointers in some small
neighborhood starting from address `null` trap deterministically. The
above transform is correct only if we know that *(null + 16) will trap
synchronously and deterministically. Even on platforms where the 0th
page is always unmapped, we cannot fold null checks into memory
operations on an arbitrary offset away from the pointer to be null
checked (e.g. array accesses are not suitable for implicit null
checks).
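
To make the constraint concrete, here is a small invented C-level
example (assume an LP64 target with a 4K unmapped page at address
zero):

  struct Obj {
    long a, b, c;           /* c lives at offset 16              */
    long elements[1 << 20]; /* large inline array starting at 24 */
  };

  long field_load(struct Obj *p) {
    /* Offset 16 is well inside the unmapped page, so if p is null this
       load faults deterministically -- the null check can be folded.  */
    return p->c;
  }

  long array_load(struct Obj *p, long i) {
    /* Offset 24 + 8*i is unbounded; for large i, (null + offset) can
       land in mapped memory, so a null p would not fault here and the
       null check must stay explicit.                                  */
    return p->elements[i];
  }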

Running this pass sufficiently late in the optimization pipeline will
allow for all the usual memory related optimization passes to work as
is -- they won't have to learn about the special semantics for the new
(load|store)_with_trap intrinsics to be effective.

This pass will have to be a profile-guided optimization pass for
fundamental reasons: implicit null checks are a pessimization even if
a small fraction of the implicit null checks fail. Typically language
runtimes that use page-fault based null checks recompile methods with
failed implicit null checks to use an explicit null check instead (e.g. [2]).
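
As a sketch of the kind of healing policy such a runtime might use
(all names and the threshold below are hypothetical, not an existing
API):

  #include <stdint.h>

  #define FAULT_THRESHOLD 32    /* arbitrary cut-off, for illustration */

  typedef struct Method Method; /* opaque handle to a compiled method  */

  /* Assumed runtime services. */
  extern unsigned *fault_counter_for(Method *m, uintptr_t pc);
  extern void request_recompile(Method *m, int use_explicit_null_checks);
  extern void throw_null_pointer_exception(void);

  /* Called once the trap handler has attributed a fault to an implicit
     null check site (see the handler sketch earlier in this mail).     */
  void on_implicit_null_check_fault(Method *m, uintptr_t pc) {
    if (++*fault_counter_for(m, pc) > FAULT_THRESHOLD) {
      /* This site keeps faulting: recompile so that it uses an explicit
         compare-and-branch null check instead.                         */
      request_recompile(m, /*use_explicit_null_checks=*/1);
    }
    throw_null_pointer_exception();
  }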

What do you think? Does this make sense?

[1]: https://groups.google.com/d/msg/llvm-dev/mMQzIt_8z1Y/cnE7WH1HNaoJ
[2]: lcm.cpp in openjdk-mirror/jdk7u-hotspot on GitHub

-- Sanjoy

Hi all,

I would like to propose a mechanism that would allow LLVM to fold null
pointer checks into "nearby" memory operations, subject to runtime
support. This is related to but not exactly the same as a proposal
floated by Peter Collingbourne earlier [1]. The obvious use cases are
managed languages like Java, C# and Go that require a null check on
pointers before they're used in certain ways (loads, stores, virtual
dispatches etc.). I'm sure there are other less obvious and more
interesting use cases.

This feature will keep being requested. I agree LLVM should support it, and am happy to see it being done right.

I plan to break the design into two parts, roughly following the
statepoint philosophy:

# invokable @llvm.(load|store)_with_trap intrinsics

We introduce two new intrinsic families

T @llvm.load_with_trap(T*) [modulo name mangling]
void @llvm.store_with_trap(T, T*) [modulo name mangling]

They cannot be `call`ed, they can only be `invoke`d.

Semantically, they try to load from or store to the pointer passed to
them as normal load / store instructions do. @llvm.load_with_trap
returns the loaded value on the normal return path. If the load or
store traps then they dispatch to their unwind destination. The
landingpad for the unwind destination can only be a cleanup
landingpad, and the result of the landingpad instruction itself is
always undef. The personality function in the landingpad instruction
is ignored.

These intrinsics require support from the language runtime to work.
During code generation, the invokes are lowered into normal load or
store instructions, followed by a branch (explicit `jmp` or just
fall-through) to the normal destination. The PC for the unwind
destination is recorded in a side-table along with the PC for the load
or store. When a load or store traps or segfaults at runtime, the
runtime searches this table to see if the trap is from a PC recorded
in the side-table. If so, the runtime jumps to the unwind
destination; otherwise it aborts.

The intrinsics need to be lowered to a pseudo instruction just like patchpoint (so that a stackmap can be emitted). In my mind the real issue here is how to teach this pseudo instruction to emit the proper load/store for the target.

Note that the signal handler / runtime do not themselves raise
exceptions at the ABI level (though they do so conceptually), but the
landing pad block can raise one if it wants to.

The table mapping load/store PCs to unwind PCs can be reported to the
language runtime via an __llvm_stackmaps like section. I am strongly
in favor of making this section as easy to parse as possible.

Let’s just be clear that it is not recommended for the frontend to produce these intrinsics. They are a compiler backend convenience. (I don’t want InstCombine or any other standard pass to start trafficking in these.)

# optimization pass to create invokes to @llvm.(load|store)_with_trap

With the @llvm.(load|store)_with_trap intrinsics in place, we can
write an LLVM pass that folds null checks into nearby memory
operations on that same pointer. As an example, we can turn

// r0 is a local register
if (p != null) {
   r0 += 5;
   *(p + 16) = 42;
   ...
} else {
   throw_NullPointerException();
   unreachable;
}

into

// r0 is a local register
r0 += 5;
invoke @llvm.store_with_trap(42, p + 16) to label %ok, unwind label %not_ok

not_ok:
%unused = landingpad .....
throw_NullPointerException();
unreachable;

ok:
...

A slight subtlety here is that the store to (p + 16) can trap (and we
can branch to not_ok) even if p is not null. However, in that case
the store would have happened in the original program anyway, and the
behavior of the original program is undefined.

A prerequisite for this optimization is that in the address space we
operate on, loads from and stores to pointers in some small
neighborhood starting from address `null` trap deterministically. The
above transform is correct only if we know that *(null + 16) will trap
synchronously and deterministically. Even on platforms where the 0th
page is always unmapped, we cannot fold null checks into memory
operations on an arbitrary offset away from the pointer to be null
checked (e.g. array accesses are not suitable for implicit null
checks).

This is a platform dependent intrinsic. There’s nothing wrong with a platform specific size for the unmapped page zero if we don’t already have one.

Running this pass sufficiently late in the optimization pipeline will
allow for all the usual memory related optimization passes to work as
is -- they won't have to learn about the special semantics for the new
(load|store)_with_trap intrinsics to be effective.

Good. This is a codegen feature. We can’t say that enough. If you really cared about the best codegen this would be done in machine IR after scheduling and target LoadStore opts.

This pass will have to be a profile-guided optimization pass for
fundamental reasons: implicit null checks are a pessimization even if
a small fraction of the implicit null checks fail. Typically language
runtimes that use page-fault based null checks recompile methods with
failed implicit null checks to use an explicit null check instead (e.g. [2]).

I don’t think making it profile-guided is important. Program behavior can change after compilation and you’re back to the same problem. I think recovering from repeated traps is important. That’s why you need to combine this feature with either code invalidation points or patching implemented via llvm.stackmap, patchpoint, (or statepoint) — they’re all the same thing.

What do you think? Does this make sense?

Well, you need the features that patchpoint gives you (stackmap entries) and you’ll need to use patchpoints or stackmaps anyway for invalidation or patching. So why are you bothering with a totally new, independent intrinsic? Why not just extend the existing intrinsics? We could have a variant that

- emits a load instead of a call

- looks at the landing pad to generate a special stackmap entry in addition to the normal exception table (I don’t even see why you need this, except that the runtime doesn’t know how to parse an exception table.)

Andy

This feature will keep being requested. I agree LLVM should support it,
and am happy to see it being done right.

+1

> I plan to break the design into two parts, roughly following the
> statepoint philosophy:
>
> # invokable @llvm.(load|store)_with_trap intrinsics
>
> We introduce two new intrinsic families
>
> T @llvm.load_with_trap(T*) [modulo name mangling]
> void @llvm.store_with_trap(T, T*) [modulo name mangling]
>
> They cannot be `call`ed, they can only be `invoke`d.

Why not allow non-nounwind calls, in other words, an intrinsic call that
may throw?

In most languages with implicit null checks, there are far more functions
that do field accesses and method calls than there are functions that catch
exceptions. The common case is that the frame with the load will have
nothing to do other than propagate the exception to the parent frame, and
we should allow the runtime to handle that efficiently.

Essentially, in this model, the signal handler is responsible for
identifying the signal as a null pointer exception (i.e. SIGSEGVs on a
small pointer value with a PC in code known to use this EH personality) and
transitioning to the exception handling machinery in the language runtime.
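
A rough sketch of that classification step, with the helpers for
mapping a PC to managed code and for entering the runtime's dispatcher
left as hypothetical externs:

  #include <signal.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define NULL_GUARD_SIZE 4096 /* size of the unmapped region at address 0 */

  /* Assumed runtime facilities; not real APIs. */
  extern uintptr_t pc_from_context(void *ctx);
  extern bool pc_uses_implicit_null_checks(uintptr_t pc);
  extern void enter_managed_exception_dispatch(void *ctx); /* no return */

  static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    uintptr_t fault_addr = (uintptr_t)info->si_addr;
    uintptr_t pc = pc_from_context(ctx);
    if (fault_addr < NULL_GUARD_SIZE && pc_uses_implicit_null_checks(pc)) {
      /* A null dereference in managed code: raise the language-level
         exception through the normal EH machinery.                    */
      enter_managed_exception_dispatch(ctx);
    }
    /* Anything else is a real crash: fall back to the default action. */
    signal(SIGSEGV, SIG_DFL);
    raise(SIGSEGV);
  }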

> Semantically, they try to load from or store to the pointer passed to
> them as normal load / store instructions do. @llvm.load_with_trap
> returns the loaded value on the normal return path. If the load or
> store traps then they dispatch to their unwind destination. The
> landingpad for the unwind destination can only be a cleanup
> landingpad, and the result of the landingpad instruction itself is
> always undef. The personality function in the landingpad instruction
> is ignored.

The landingpad personality normally controls what kind of EH tables are
emitted, so if you want something other than the __gxx_personality_v0 LSDA
table, you could invent your own personality and use that to control what
gets emitted. This might be useful for interoperating with existing
language runtimes.

> These intrinsics require support from the language runtime to work.
> During code generation, the invokes are lowered into normal load or
> store instructions, followed by a branch (explicit `jmp` or just
> fall-through) to the normal destination. The PC for the unwind
> destination is recorded in a side-table along with the PC for the load
> or store. When a load or store traps or segfaults at runtime, the
> runtime searches this table to see if the trap is from a PC recorded
> in the side-table. If so, the runtime jumps to the unwind
> destination; otherwise it aborts.

The intrinsics need to be lowered to a pseudo instruction just like
patchpoint (so that a stackmap can be emitted). In my mind the real issue
here is how to teach this pseudo instruction to emit the proper
load/store for the target.

Does it really have to be a per-target pseudo? The way I see it, we can
handle this all in selection dag. All we need to do is emit the before
label, the load/store operation, and the end label, and establish control
dependence between them all to prevent folding. Does that seem reasonable,
or is this an overly simplistic perspective? :)

> Note that the signal handler / runtime do not themselves raise
> exceptions at the ABI level (though they do so conceptually), but the
> landing pad block can raise one if it wants to.
>
> The table mapping load/store PCs to unwind PCs can be reported to the
> language runtime via an __llvm_stackmaps like section. I am strongly
> in favor of making this section as easy to parse as possible.

Let’s just be clear that it is not recommended for the frontend to produce
these intrinsics. They are a compiler backend convenience. (I don’t want
InstCombine or any other standard pass to start trafficking in these.)

Would you be OK with simply documenting that these intrinsics are
optimization-hostile, in the same way that early safepoint insertion is?
There are some language constructs (__try / __except) that allow catching
memory faults like this. Such constructs are rare and don't really need to
be optimized. I just want to make sure that mid-level optimizations don't
actively break these.

> Running this pass sufficiently late in the optimization pipeline will
> allow for all the usual memory related optimization passes to work as
> is -- they won't have to learn about the special semantics for the new
> (load|store)_with_trap intrinsics to be effective.

Good. This is a codegen feature. We can’t say that enough. If you really
cared about the best codegen this would be done in machine IR after
scheduling and target LoadStore opts.

I agree, with the caveat above. Mid-level passes shouldn't actively break
these intrinsics.

The intrinsics need to be lowered to a pseudo instruction just like patchpoint (so that a stackmap can be emitted). In my mind the real issue here is how to teach this pseudo instruction to emit the proper load/store for the target.

Do you have a specific set of issues that you think could be problematic?

Let’s just be clear that it is not recommended for the frontend to produce these intrinsics. They are a compiler backend convenience. (I don’t want InstCombine or any other standard pass to start trafficking in these.)

Agreed.

This is a platform dependent intrinsic. There’s nothing wrong with a platform specific size for the unmapped page zero if we don’t already have one.

Agreed.

Good. This is a codegen feature. We can’t say that enough. If you really cared about the best codegen this would be done in machine IR after scheduling and target LoadStore opts.

I thought about doing this very late, but I figured that pattern
matching on control flow would be much easier while we are still in
LLVM IR.

I don’t think making it profile-guided is important. Program behavior can change after compilation and you’re back to the same problem. I think recovering from repeated traps is important. That’s why you need to combine this feature with either code invalidation points or patching implemented via llvm.stackmap, patchpoint, (or statepoint) — they’re all the same thing.

Without profiling information, how will the frontend communicate to
the backend that an explicit null check "never" fails and hence may be
replaced with an implicit null check (or that it has been observed to
fail, and should not be replaced with an implicit null check)? In the
scheme I'm thinking of, a null check is made implicit only if the (branch)
probability of it failing is approximately zero.

Well, you need the features that patchpoint gives you (stackmap entries) and you’ll need to use patchpoints or stackmaps anyway for invalidation or patching. So why are you bothering with a totally new, independent intrinsic? Why not just extend the existing intrinsics? We could have a variant that

- emits a load instead of a call

- looks at the landing pad to generate a special stackmap entry in addition to the normal exception table (I don’t even see why you need this, except that the runtime doesn’t know how to parse an exception table.)

This sounds good -- reusing the patchpoint intrinsics will save me a
lot of work.

-- Sanjoy

Why not allow non-nounwind calls, in other words, an intrinsic call that may
throw?

In most languages with implicit null checks, there are far more functions
that do field accesses and method calls than there are functions that catch
exceptions. The common case is that the frame with the load will have
nothing to do other than propagate the exception to the parent frame, and we
should allow the runtime to handle that efficiently.

Essentially, in this model, the signal handler is responsible for
identifying the signal as a null pointer exception (i.e. SIGSEGVs on a small
pointer value with a PC in code known to use this EH personality) and
transitioning to the exception handling machinery in the language runtime.

I have no problems in semantically allowing unwinding calls to these
intrinsics that get your runtime to unwind the stack and propagate an
exception object. But this proposal is specifically to only introduce
LLVM side changes to generate the appropriate side-tables for implicit
exceptions. I think getting signal handlers and runtimes to throw
exceptions will be quite a bit of work beyond that.

One thing to note: it is usually impractical for managed languages to
use implicit null pointer exceptions unless they have some way to
"heal" the implicit null check sites into explicit null checks as they
fail. Unless you have a way to quickly converge your program to a
point where no implicit null checks actually fail at runtime, checking
for null pointers via virtual memory tricks is a pessimization.

This can be done via code patching / invalidation etc. as Andy
mentioned.

The landingpad personality normally controls what kind of EH tables are
emitted, so if you want something other than the __gxx_personality_v0 LSDA
table, you could invent your own personality and use that to control what
gets emitted. This might be useful for interoperating with existing language
runtimes.

That's a great idea. Do you think it is reasonable to standardize a
"__llvm_implicit_null_check" personality function that emits
information into the stackmaps section instead?

Does it really have to be a per-target pseudo? The way I see it, we can
handle this all in selection dag. All we need to do is emit the before
label, the load/store operation, and the end label, and establish control
dependence between them all to prevent folding. Does that seem reasonable,
or is this an overly simplistic perspective? :)

That was my impression also. :)

Would you be OK with simply documenting that these intrinsics are
optimization-hostile, in the same way that early safepoint insertion is?
There are some language constructs (__try / __except) that allow catching
memory faults like this. Such constructs are rare and don't really need to
be optimized. I just want to make sure that mid-level optimizations don't
actively break these.

Personally, I'm okay with your proposal; but I will wait for Andy to
respond on this.

I agree, with the caveat above. Mid-level passes shouldn't actively break
these intrinsics.

Agreed. Mid-level passes should treat these as opaque calls.

-- Sanjoy

One thing to note: it is usually impractical for managed languages to
use implicit null pointer exceptions unless they have some way to
"heal" the implicit null check sites into explicit null checks as they
fail. Unless you have a way to quickly converge your program to a
point where no implicit null checks actually fail at runtime, checking
for null pointers via virtual memory tricks is a pessimization.

This can be done via code patching / invalidation etc. as Andy
mentioned.

^^ you can ignore these two paragraphs. They're not incorrect, but I
realized that they're not relevant to what you were suggesting.

The intrinsics need to be lowered to a pseudo instruction just like patchpoint (so that a stackmap can be emitted). In my mind the real issue here is how to teach this pseudo instruction to emit the proper load/store for the target.

Do you have a specific set of issues that you think could be problematic?

One problem is that we’re bypassing instruction selection. You’ll probably need to encode the load width as an instruction operand. You might only want to handle integer loads.

I don’t think making it profile-guided is important. Program behavior can change after compilation and you’re back to the same problem. I think recovering from repeated traps is important. That’s why you need to combine this feature with either code invalidation points or patching implemented via llvm.stackmap, patchpoint, (or statepoint) — they’re all the same thing.

Without profiling information, how will the frontend communicate to
the backend that an explicit null check "never" fails and hence may be
replaced with an implicit null check (or that it has been observed to
fail, and should not be replaced with an implicit null check)? In the
scheme I'm thinking of, a null check is made implicit only if the (branch)
probability of it failing is approximately zero.

It’s better to have PGO. I’m saying that it is neither necessary nor sufficient. You can speculate that the check does not fail and recover from repeated failures by patching or recompiling. What you absolutely need is runtime profiling of failed checks. But my argument is irrelevant from the LLVM side. I think the decision point is whether the pass should require BlockFrequency, and I agree that it should.

Well, you need the features that patchpoint gives you (stackmap entries) and you’ll need to use patchpoints or stackmaps anyway for invalidation or patching. So why are you bothering with a totally new, independent intrinsic? Why not just extend the existing intrinsics? We could have a variant that

- emits a load instead of a call

- looks at the landing pad to generate a special stackmap entry in addition to the normal exception table (I don’t even see why you need this, except that the runtime doesn’t know how to parse an exception table.)

This sounds good -- reusing the patchpoint intrinsics will save me a
lot of work.

Yeah, the goal is to reuse the infrastructure. It could still be a separate intrinsic with shared implementation if that turns out to be easier. I’m unconvinced either way. An interesting question is how you will handle repeated traps: by invalidating the whole compiled method or just patching in an explicit branch. I suspect you’ll invalidate the compiled method, in which case a separate intrinsic is fine. To do the latter you would obviously need all the features of a patchpoint.

Andy

The intrinsics need to be lowered to a pseudo instruction just like patchpoint (so that a stackmap can be emitted). In my mind the real issue here is how to teach this pseudo instruction to emit the proper load/store for the target.

Does it really have to be a per-target pseudo? The way I see it, we can handle this all in selection dag. All we need to do is emit the before label, the load/store operation, and the end label, and establish control dependence between them all to prevent folding. Does that seem reasonable, or is this an overly simplistic perspective? :)

It would be really nice to just emit a label next to an instruction and assume the stackmap entry will end up at the right address. I don’t think the backend can make that guarantee. Any Machine pass can insert instructions wherever it wants without looking for labels.

I think labels are only employed just before code emission. Is there any existing example of labels being used earlier (I think GCRootLowering is the earliest)?

Note that the signal handler / runtime do not themselves raise
exceptions at the ABI level (though they do so conceptually), but the
landing pad block can raise one if it wants to.

The table mapping load/store PCs to unwind PCs can be reported to the
language runtime via an __llvm_stackmaps like section. I am strongly
in favor of making this section as easy to parse as possible.

Let’s just be clear that it is not recommended for the frontend to produce these intrinsics. They are a compiler backend convenience. (I don’t want InstCombine or any other standard pass to start trafficking in these.)

Would you be OK with simply documenting that these intrinsics are optimization-hostile, in the same way that early safepoint insertion is? There are some language constructs (__try / __except) that allow catching memory faults like this. Such constructs are rare and don’t really need to be optimized. I just want to make sure that mid-level optimizations don’t actively break these.

Documentation is fine. There’s no reason to prevent a frontend from generating these. I just want the intention to be clear.

Andy

SelectionDAGBuilder::lowerInvokable() inserts EH_LABELs into the DAG for
normal invokes, and it more or less works out.

I don't know much about what transformations MI passes typically do, but my
guess is that you're concerned that some pass might re-schedule possibly
trapping operations inside the label range?

The scheduler itself doesn’t move anything around labels. But any pass can perform code motion or load/store optimization. Also, any pass can insert an instruction, like a copy, between the label and the load.

I’m not really sure how EH_LABEL ends up translating into exception tables, but my guess is that it’s encoding a range that may include any arbitrary instructions as long as the call is within the range. So as long as calls aren’t reordered with labels and appear to have side effects, it would work.

So, you could add a different kind of stack map entry that encodes ranges instead of exact addresses, then survey all passes to ensure they don’t optimize loads across labels. I would have more confidence doing this with a pseudo instruction though.

Andy

I guess your concern is that you need an exact address to do patching.

If this feature is limited to simply handling exceptions raised from
loads and stores, then I'm not worried about copies, adds, or other code
being moved into the label range. We don't support catching traps from
those kinds of instructions. I'm only worried about memory accesses being
moved into the range. If we can address that, I think the EH_LABEL approach
works, and it generalizes to other trapping instructions like division or
FP exceptions.

I think this is riskier than EH_LABELing calls because targets do aggressive load/store optimization. It does sound plausible though. I imagine teaching those passes to treat labels as equivalent to unknown stores.

We don’t need to support patching at the load. Patch points will be needed to “heal” bad implicit null checks, but that is probably better done by patching call sites into the optimized code. Eventually, someone may want to be able to patch their implicit null checks, and they’ll just need to use a patchpoint to do that instead.

Obviously, if this approach works, it’s better to define a separate llvm.load_with_trap intrinsic.

FWIW: Now that I think about it, the EH_LABEL approach makes sense and is more elegant (although it’s risky), but implementing the pseudo instruction would also be fairly straightforward. You would just mimic the patchpoint logic for lowering and MC streaming.

Andy

I don't think we can expose the memory operations directly from a
semantic, theoretical point of view. Whether practically we can do
this or not is a different question.

Does LLVM do optimizations like these at the machine instruction
level?

   if (condition)
     T = *X // normal load, condition guards against null

   EH_LABEL // clobbers all
   U = *X // implicit null check, branches out on fault
   EH_LABEL // clobbers all
   ...

=>

  since the second "load" from X always happens, X must be
  dereferenceable

   T = *X // miscompile here

   EH_LABEL // clobbers all
   U = *X // implicit null check, branches out on fault
   EH_LABEL // clobbers all
   ...

The fundamental problem, of course, is that we're hiding the real
control flow which is

if (!is_dereferenceable(X)) branch_out;
U = *X

We don’t need to support patching at the load. Patch points will be needed
to “heal” bad implicit null checks, but that is probably better done by
patching call sites into the optimized code. Eventually, someone may want to
be able to patch their implicit null checks, and they’ll just need to use a
patchpoint to do that instead.

Agreed.

-- Sanjoy

Hi,

I'd like to make sure this is headed in a direction that we'll be able to use/extend it for LLILC/CoreCLR (C#). A few questions:

1) We've got a runtime that will generate an ABI-level exception object in response to the machine trap. I'd like the compiler to be able to support targeting a runtime with that behavior (if nothing else, it saves us the code size of a call instruction per [group of] never-dynamically-failing null check[s]). So, given an explicit check that looks like this:

        %NullCheck = icmp eq %Ty* %2, null
        br i1 %NullCheck, label %ThrowNullRef, label %3

      ; <label>:3 ; preds = %1
        %4 = <index into %2 by constant offset smaller than guard page>
        %5 = load i32, i32* %4, align 8
        ... (code that uses %5)

      ThrowNullRef: ; preds = %1
        invoke void ThrowNullReferenceException() #2
                to label %6 unwind label %ExceptionDispatch

      ; <label>:6 ; preds = %ThrowNullRef
        unreachable

      ExceptionDispatch: ; preds = %ThrowNullRef
        %ExnData = landingpad { i8*, i32 } personality void ()* @ProcessCLRException
      ...

I understand the proposal here is to add a pass that changes this to:

        %5 = invoke @llvm.load_with_trap(...) %2, <index>
              to label %ok, unwind label %not_ok
      ok:
        ... (code that uses %5)

      not_ok:
        %unused = landingpad { i8*, i32 } personality void ()* @__llvm_implicit_null_check
        br %ThrowNullRef

      ThrowNullRef: ; preds = %1
        invoke void ThrowNullReferenceException() #2
                to label %6 unwind label %ExceptionDispatch

      ; <label>:6 ; preds = %ThrowNullRef
        unreachable

      ExceptionDispatch: ; preds = %ThrowNullRef
        %ExnData = landingpad { i8*, i32 } personality void ()* @ProcessCLRException
      ...

But what I eventually need is to also eliminate the call:

        %5 = invoke @llvm.load_with_trap(...) %2, <index>
              to label %ok, unwind label %ExceptionDispatch
      ok:
        ... (code that uses %5)

      ExceptionDispatch: ; preds = %ThrowNullRef
        %ExnData = landingpad { i8*, i32 } personality void ()* @ProcessCLRException
      ...

And I'm trying to understand how I might achieve that. Allowing a target configuration to direct the null check folding to also fold away the call seems most straightforward; is that the right idea? Or would it need to be something more like a separate pass that can be run for such targets immediately after this pass, which does the extra folding?

Related, regarding these restrictions:

      > The landingpad for the unwind destination can only be a cleanup landingpad, and the result of the landingpad instruction itself is always undef.

will they be enforced by the verifier? Could we perhaps make that conditional on the personality routine?

2) Is the signature

        T @llvm.load_with_trap(T*) [modulo name mangling]
        void @llvm.store_with_trap(T, T*) [modulo name mangling]

meant to imply that the pointer must be in address space zero, or that any address space is acceptable? In our case, since we're planning to follow the statepoint design and use address spaces to distinguish GC pointers, the pointers whose null checks we'll be wanting to fold will most often not be in address space zero.

3) I didn't see a follow-up to this point:

      > If you really cared about the best codegen this would be done in machine IR after scheduling and target LoadStore opts.

Can you elaborate on whether/why this is/isn't the plan? I'd hate for the use of implicit checks to come at the cost of worse load/store codegen, especially since we'll have null checks on most heap loads/stores.

Thanks
-Joseph

I'd like to make sure this is headed in a direction that we'll be able to use/extend it for LLILC/CoreCLR (C#).

Great!

And I'm trying to understand how I might achieve that. Allowing a target configuration to direct the null check folding to also fold away the call seems most straightforward; is that the right idea? Or would it need to be something more like a separate pass that can be run for such targets immediately after this pass, which does the extra folding?

I'd say it depends on the semantics of your runtime. If this is
strictly an optimization then I'd have a preference for the latter.
If this is a correctness issue then I think the former makes more
sense, since otherwise the GenerateImplicitNullChecks (or
whatever we call it) pass won't be meaning preserving.

Related, regarding these restrictions:

      > The landingpad for the unwind destination can only be a cleanup landingpad, and the result of the landingpad instruction itself is always undef.

will they be enforced by the verifier? Could we perhaps make that conditional on the personality routine?

Yes, we can make these dependent on the personality routine.

2) Is the signature

        T @llvm.load_with_trap(T*) [modulo name mangling]
        void @llvm.store_with_trap(T, T*) [modulo name mangling]

meant to imply that the pointer must be in address space zero, or that any address space is acceptable?

We use statepoints also, so non-zero address space pointers will have to work.

3) I didn't see a follow-up to this point:

      > If you really cared about the best codegen this would be done in machine IR after scheduling and target LoadStore opts.

Can you elaborate on whether/why this is/isn't the plan? I'd hate for the use of implicit checks to come at the cost of worse load/store codegen, especially since we'll have null checks on most heap loads/stores.

So there are two related issues here:

1. it may just be too difficult to do this correctly once we're out
    of LLVM IR. For instance, we may need to call on alias analysis:

      if (x != null) {
        x.f = 42;
        y.f = 100;
      }

    can be transformed, by LLVM's normal optimization pipeline, to

      if (x != null) {
        y.f = 100;
        x.f = 42;
      }

    if it can prove that x does not alias y.

    Now we cannot use the store to x.f to throw an NPE
    (NullPointerException) because then we'd have thrown an NPE after
    the store to y.f while the original program had the NPE throw
    before the store to y.f. To implicit-ify this null check we'd
    have to swap the order of the two stores and essentially reverse
    the above transform. To do this safely, we'd need alias analysis
    to prove that x does not alias y.

    Re-discovering that pointer X is pointer Y with an offset of 24
    bytes may also be difficult. This is important since in many cases
    the load doing the null check will be off a pointer derived from
    the pointer we want to check for nullness (a small example follows
    after this list).

2. we may be able to get away with representing the trapping load in
    a way that looks like a normal load to the backend. I'm not sure
    if we can do this safely (the backend cannot assume that the load
    won't trap) but in principle this should allow most of the
    load/store optimizations to kick in.
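
To illustrate the derived-pointer point from (1), here is a small
invented example (the layout is only for exposition):

  struct Inner { long first; };
  struct Outer { long pad[3]; struct Inner inner; }; /* inner at offset 24 */

  long read_field(struct Outer *a) {
    struct Inner *q = &a->inner; /* q = a + 24; no memory access yet */
    /* The front end's null check is on 'a', but the trapping load is off
       'q'; a late pass has to rediscover that q == a + 24 to know that
       this load can stand in for the null check on 'a'.                 */
    return q->first;
  }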

-- Sanjoy

That’s a good description of the problem.

Lowering to real loads will *probably* just work because you are being saved by EH_LABEL instructions, which are conservatively modeled as having unknown side effects. The feature that saves you will also defeat optimization of those loads. I don't see any advantage of this in terms of optimizing codegen. It is just a workaround to avoid defining pseudo instructions.

The optimal implementation would be to leave the explicit null check in place. Late in the pipeline, just before post-ra scheduling, a pass would combine and+cmp+br+load when it is profitable using target hooks like getLdStBaseRegImmOfsWidth(). Note that we still have alias information in the form of machine mem operands.

You could take a step in that direction without doing much backend work by lowering to pseudo-loads during ISEL instead of using EH_LABEL. Then the various load/store optimizations could be taught to explicitly optimize normal loads and stores over the pseudo loads but not among them.

Andy

I decided to go with Andy's suggestion of lowering explicit null
checks into implicit null checks late, after register allocation. The
tip of the change is at D10201, "[CodeGen] Add a pass to fold null checks into nearby memory operations".

-- Sanjoy