RFC: Add guard intrinsics to LLVM

This is a proposal to add guard intrinsics to LLVM.

A couple of glossary items: when I say "interpreter" I mean "the most
conservative tier in the compilation stack" which can be an actual
interpreter, a "splat compiler" or even a regular JIT that doesn't
make optimistic assumptions. By "bailing out to the interpreter" I
mean "side exit" as defined by Philip in
http://www.philipreames.com/Blog/2015/05/20/deoptimization-terminology/

# high level summary

Guard intrinsics are intended to represent "if (!cond) leave();" like
control flow in a structured manner. This kind of control flow is
abundant in IR coming from safe languages due to things like range
checks and null checks (and overflow checks, in some cases). From a
high level, there are two distinct motivations for introducing guard
intrinsics:

- To keep control flow as simple and "straight line"-like as is
   reasonable

- To allow certain kinds of "widening" transforms that cannot be
   soundly done in an explicit "check-and-branch" control flow
   representation

## straw man specification

Signature:

declare void @llvm.guard_on(i1 %predicate) ;; requires [ "deopt"(...) ]

Semantics:

`@llvm.guard_on` bails out to the interpreter (i.e. "deoptimizes" the
currently executing function) if `%predicate` is `false`, meaning that
after `@llvm.guard_on(i1 %t)` has executed `%t` can be assumed to be
true. In this way, it is close to `@llvm.assume` or an `assert`, with
one very important difference -- `@llvm.guard_on(i1 <false>)` is well
defined (and not UB). `@llvm.guard_on` on a false predicate bails to
the interpreter and that is always safe (but slow), and so
`@llvm.guard_on(i1 false)` is basically a `noreturn` call that
unconditionally transitions the current compilation to the
interpreter.

Bailing out to the interpreter involves re-creating the state of the
interpreter frames as-if the compilee had been executing in the
interpreter all along. This state is represented and maintained using
a `"deopt"` operand bundle attached to the call to `@llvm.guard_on`.
The verifier will reject calls to `@llvm.guard_on` without a `"deopt"`
operand bundle. `@llvm.guard_on` cannot be `invoke`ed (that would be
meaningless anyway, since the method it would've "thrown" into is
about to go away).

Example:

  ...
  %rangeCheck0 = icmp ult i32 %arg0, 10
  call void @llvm.guard_on(i1 %rangeCheck0) [ "deopt"(/* deopt state 0 */) ]
  %rangeCheck1 = icmp ult i32 %arg0, 12
  call void @llvm.guard_on(i1 %rangeCheck1) [ "deopt"(/* deopt state 1 */) ]
  ...

# details: motivation & alternatives

As specified in the high level summary, there are two key motivations.

The first, more obvious one is that we want the CFG to be less
complex, even if that involves "hiding" control flow in guard
intrinsics. I expect this to benefit both compile time and
within-a-block optimization.

The second, somewhat less obvious motivation to use guard intrinsics
instead of explicit control flow is to allow check widening.

## check widening

Consider:

  ...
  %rangeCheck0 = icmp ult i32 6, %len   ;; for a[6]
  call void @llvm.guard_on(i1 %rangeCheck0) [ "deopt"(/* deopt state 0 */) ]
  call void @printf("hello world")
  %rangeCheck1 = icmp ult i32 7, %len   ;; for a[7]
  call void @llvm.guard_on(i1 %rangeCheck1) [ "deopt"(/* deopt state 1 */) ]
  access a[6] and a[7]
  ...

we'd like to optimize it to

  ...
  %rangeCheckWide = icmp ult i32 7, %len
  call void @llvm.guard_on(i1 %rangeCheckWide) [ "deopt"(/* deopt state 0 */) ]
  call void @printf("hello world")
  ;; %rangeCheck1 = icmp ult i32 7, %len  ;; not needed anymore
  ;; call void @llvm.guard_on(i1 %rangeCheck1) [ "deopt"(/* deopt state 1 */) ]
  access a[6] and a[7]
  ...

This way we do a range check only on `7` -- if `7` is within bounds,
then we know `6` is too. This transform is sound only because we know
that the guard on `7 ult %len` will not simply throw an exception if
the said predicate fails, but will bail out to the interpreter with
the abstract state `/* deopt state 0 */`. In fact, if `%len` is `7`,
the pre-transform program is supposed to print `"hello world"` and
*then* throw an exception, and bailing out to the interpreter with `/*
deopt state 0 */` will do exactly that.

In other words, we're allowed to do speculative and aggressive
transforms that make a guard fail when it would not have failed in the
initial program. This is okay because a failed guard only bails to the
interpreter, and the interpreter always Does The Right Thing(TM). In
fact, it is correct (though unwise) to replace every guard's predicate
with `false`.

## the problem with check widening and explicit control flow

Widening is difficult to get right in an explicit "check-and-branch"
representation. For instance, the above example in a
"check-and-branch" representation would be (in pseudo C, and excluding
the printf):

  ...
  if (!(6 < %len)) { call @side_exit() [ "deopt"(P) ]; unreachable; }
  if (!(7 < %len)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
  ...

The following transform is invalid:

  ...
  if (!(7 < %len)) { call @side_exit() [ "deopt"(P) ]; unreachable; }
  if (!(true)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
  ...

since we do not know if the first side exit had been optimized under
the assumption `!(6 < %len)` (via JumpThreading etc.). E.g. the
"original" IR could have been

  ...
  if (!(6 < %len)) { call @side_exit() [ "deopt"(!(6 < %len)) ]; unreachable; }
  if (!(7 < %len)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
  ...

which got optimized to

  ...
  if (!(6 < %len)) { call @side_exit() [ "deopt"(true) ]; unreachable; }
  if (!(7 < %len)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
  ...

before the widening transform. The widening transform will now
effectively pass in an incorrect value for `!(6 < %len)`.

This isn't to say it is impossible to do check widening in an explicit
control flow representation, just that it is more natural to do it with
guards.

# details: semantics

## as-if control flow

The observable behavior of `@llvm.guard_on` is specified as:

  void @llvm.guard_on(i1 %pred) {
  entry:
    %unknown_cond = < unknown source >
    %cond = and i1 %unknown_cond, %pred
    br i1 %cond, label %left, label %right

  left:
    call void @bail_to_interpreter() [ "deopt"() ] noreturn
    unreachable

  right:
    ret void
  }

So, precisely speaking, `@llvm.guard_on` is guaranteed to bail to the
interpreter if `%pred` is false, but it **may** bail to the
interpreter if `%pred` is true. It is this bit that lets us soundly
widen `%pred`, since all we're doing is "refining" `< unknown source >`.

`@bail_to_interpreter` does not return to the current compilation, but
it returns to the `"deopt"` continuation that it has been given (once
inlined, the empty "deopt"() continuation will be fixed up to have the right
continuation).

## applicable optimizations

Apart from widening, most of the optimizations we're interested in are
what's allowed by an equivalent `@llvm.assume`. Any conditional
branches dominated by a guard on the same condition can be folded,
multiple guards on the same condition can be CSE'ed, loop invariant
guards can be hoisted out of loops etc.
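
For concreteness, a sketch of what this enables (the deopt state is a
placeholder, as in the earlier examples):

  %c = icmp ult i32 %i, %len
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* deopt state */) ]
  ;; after the guard, %c is known to be true:
  ;;  - a later `br i1 %c, label %fast, label %slow` can be folded to
  ;;    `br label %fast`
  ;;  - a second `call void @llvm.guard_on(i1 %c)` is redundant and
  ;;    can be CSE'ed away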

Ultimately, we'd like to recover the same quality of optimization as
we currently get from the "check-and-branch" representation. With the
"check-and-branch" representation, the optimizer is able to sink
stores and computation into the slow path. This is something it cannot
do in the guard_on representation, and we'll have to lower the
guard_on representation to the "check-and-branch" representation at a
suitable point in the pipeline to get this kind of optimization.
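
For example, a sketch of the kind of sinking we want to eventually
recover (%v and the labels are made up for illustration):

  ;; "check-and-branch" form, where %v is only used on the slow path:
    %v = add i32 %a, %b
    br i1 %c, label %fast, label %slow
  slow:
    call void @side_exit() [ "deopt"(%v) ]
    unreachable
  fast:
    ...

  ;; after sinking %v out of the hot path (legal: %v has no other uses):
    br i1 %c, label %fast, label %slow
  slow:
    %v = add i32 %a, %b
    call void @side_exit() [ "deopt"(%v) ]
    unreachable
  fast:
    ...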

## lowering

At some point in the compilation pipeline, we will have to lower
`@llvm.guard_on` into explicit control flow, by "inlining" "an
implementation" of `@llvm.guard_on` (or by some other means). I don't
have a good sense on when in the pipeline this should be done -- the
answer will depend on what we find as we make LLVM more aggressive
around optimizing guards.

## environments without support for bailing to the interpreter

Our VM has deeply integrated support for deoptimizations, but not all
language runtimes do. If there is a need for non-deoptimizing guards, we
can consider introducing a variant of `@llvm.guard_on`:

declare void @llvm.exception_on(i1 %predicate, i32 %exceptionKind)

with this one having the semantics that it always throws an exception
if `%predicate` fails. Only the non-widening optimizations for
`@llvm.guard_on` will apply to `@llvm.exception_on`.

## memory effects (unresolved)

[I haven't come up with a good model for the memory effects of
`@llvm.guard_on`, suggestions are very welcome.]

I'd really like to model `@llvm.guard_on` as a readonly function,
since it does not write to memory if it returns; and e.g. forwarding
loads across a call to `@llvm.guard_on` should be legal.
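
For instance (a sketch; the deopt state is a placeholder):

  %v0 = load i32, i32* %p
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* deopt state */) ]
  %v1 = load i32, i32* %p
  ;; %v1 can be replaced with %v0 iff the guard is known not to write
  ;; to memory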

However, I'm in a quandary around representing the "may never return"
aspect of `@llvm.guard_on`: I have to make it illegal to, say, hoist a
load from `%ptr` across a guard on `%ptr != null`. There are a couple
of ways I can think of dealing with this, none of them are both easy
and neat:

- State that since `@llvm.guard_on` could have had an infinite loop
   in it, it may never return. Unfortunately, the LLVM IR has some
   rough edges on readonly infinite loops (since C++ disallows that),
   so LLVM will require some preparatory work before we can do this
   soundly.

- State that `@llvm.guard_on` can unwind, and thus has non-local
   control flow. This can actually work (and is pretty close to
the actual semantics), but is somewhat of a hack since
   `@llvm.guard_on` doesn't _really_ throw an exception.

  - Special case `@llvm.guard_on` (yuck!).

What do you think?

-- Sanjoy

This is a proposal to add guard intrinsics to LLVM.

A couple of inline comments building on Sanjoy's points follow.

A couple of glossary items: when I say "interpreter" I mean "the most
conservative tier in the compilation stack" which can be an actual
interpreter, a "splat compiler" or even a regular JIT that doesn't
make optimistic assumptions. By "bailing out to the interpreter" I
mean "side exit" as defined by Philip in
http://www.philipreames.com/Blog/2015/05/20/deoptimization-terminology/

# high level summary

Guard intrinsics are intended to represent "if (!cond) leave();" like
control flow in a structured manner. This kind of control flow is
abundant in IR coming from safe languages due to things like range
checks and null checks (and overflow checks, in some cases). From a
high level, there are two distinct motivations for introducing guard
intrinsics:

  - To keep control flow as simple and "straight line"-like as is
    reasonable

  - To allow certain kinds of "widening" transforms that cannot be
    soundly done in an explicit "check-and-branch" control flow
    representation

## straw man specification

Signature:

declare void @llvm.guard_on(i1 %predicate) ;; requires [ "deopt"(...) ]

Semantics:

`@llvm.guard_on` bails out to the interpreter (i.e. "deoptimizes" the
currently executing function) if `%predicate` is `false`, meaning that
after `@llvm.guard_on(i1 %t)` has executed `%t` can be assumed to be
true. In this way, it is close to `@llvm.assume` or an `assert`, with
one very important difference -- `@llvm.guard_on(i1 <false>)` is well
defined (and not UB). `@llvm.guard_on` on a false predicate bails to
the interpreter and that is always safe (but slow), and so
`@llvm.guard_on(i1 false)` is basically a `noreturn` call that
unconditionally transitions the current compilation to the
interpreter.

It's also worth noting that @llvm.guard_on(i1 true) is useful and well defined as well. When lowering, such a guard simply disappears, but it can be useful to keep around in the IR during optimization. It gives a well defined point for widening transforms to apply with a well known deoptimization state already available.
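
A sketch of that use (the states and values are placeholders; the
widening step assumes %c is computable at the anchor point):

  ;; an anchor guard with deopt state S already captured:
  call void @llvm.guard_on(i1 true) [ "deopt"(/* state S */) ]
  ...
  %c = icmp ult i32 %i, %len
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* state T */) ]

  ;; a widening transform can absorb the later check into the anchor:
  %c = icmp ult i32 %i, %len
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* state S */) ]
  ...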

I'd suggest a small change to Sanjoy's declaration. I think we should allow additional arguments to the guard, not just the condition. What exactly those arguments mean would be up to the runtime, but various runtimes might want to provide additional arguments to the OSR mechanism.

Bailing out to the interpreter involves re-creating the state of the
interpreter frames as-if the compilee had been executing in the
interpreter all along. This state is represented and maintained using
a `"deopt"` operand bundle attached to the call to `@llvm.guard_on`.
The verifier will reject calls to `@llvm.guard_on` without a `"deopt"`
operand bundle.

This introduces a very subtle point. The side exit only affects the *function* which contains the guard. A caller of that function in the same module may be returned to by either the function itself, or the interpreter after running the continuation implied by the guard. This introduces a complication for IPA/IPO; any guard (really, any side exit, of which guards are one form) has to be treated as a possible return point from the callee with an unknowable return value and memory state.

In our prototype implementation, we've implemented this by doing a linear scan of the function in each IPO pass and bailing if we see a side exit. One approach we could consider upstream is to introduce a function attribute which indicates "can exit via side exit or guard". Any function containing a guard would be required to have the attribute. InlineFunction could propagate it to the caller when inlining a callee with it. FunctionAttrs could remove it if no guards (or callees with guards) were found. This would make the implementation of the bail in our IPO passes much cheaper, but it wouldn't remove the need for it.
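
A sketch of what that could look like (the attribute name is made up
for illustration):

  ;; any function containing a guard must carry the attribute:
  define i32 @callee(i32 %x) #0 {
    %c = icmp ne i32 %x, 0
    call void @llvm.guard_on(i1 %c) [ "deopt"(/* state */) ]
    ret i32 %x
  }

  ;; InlineFunction propagates #0 to callers when inlining @callee;
  ;; FunctionAttrs can drop it if no guards (or guarded callees) remain.

  attributes #0 = { "may-side-exit" }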

`@llvm.guard_on` cannot be `invoke`ed (that would be
meaningless anyway, since the method it would've "thrown" into is
about to go away).

I disagree with this bit. It needlessly complicates the inliner. Allowing an invoke of a guard which SimplifyCFG converts to calls just like it would a nothrow function seems much cleaner.
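
That is (a sketch; labels are placeholders):

  ;; an invoke of a guard, e.g. left behind by the inliner:
  invoke void @llvm.guard_on(i1 %c) [ "deopt"(/* state */) ]
      to label %normal unwind label %lpad

  ;; SimplifyCFG rewrites it to a plain call, exactly as it does for
  ;; invokes of nothrow functions:
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* state */) ]
  br label %normal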

Example:

   ...
   %rangeCheck0 = icmp ult i32 %arg0, 10
   call void @llvm.guard_on(i1 %rangeCheck0) [ "deopt"(/* deopt state 0 */) ]
   %rangeCheck1 = icmp ult i32 %arg0, 12
   call void @llvm.guard_on(i1 %rangeCheck1) [ "deopt"(/* deopt state 1 */) ]
   ...

# details: motivation & alternatives

As specified in the high level summary, there are two key motivations.

The first, more obvious one is that we want the CFG to be less
complex, even if that involves "hiding" control flow in guard
intrinsics. I expect this to benefit both compile time and
within-a-block optimization.

I want to really emphasize this point. I expect that using guard intrinsics would be a major compile time win for higher-level languages. (I can't say for sure; we currently use the explicit control flow model.)

Given many of our transforms are block-at-a-time, using guards might also lead to better optimization. I'm particularly interested in seeing how this influences instruction selection.

The second, somewhat less obvious motivation to use guard intrinsics
instead of explicit control flow is to allow check widening.

## check widening

Consider:

   ...
   %rangeCheck0 = icmp ult i32 6, %len   ;; for a[6]
   call void @llvm.guard_on(i1 %rangeCheck0) [ "deopt"(/* deopt state 0 */) ]
   call void @printf("hello world")
   %rangeCheck1 = icmp ult i32 7, %len   ;; for a[7]
   call void @llvm.guard_on(i1 %rangeCheck1) [ "deopt"(/* deopt state 1 */) ]
   access a[6] and a[7]
   ...

we'd like to optimize it to

   ...
   %rangeCheckWide = icmp ult i32 7, %len
   call void @llvm.guard_on(i1 %rangeCheckWide) [ "deopt"(/* deopt state 0 */) ]
   call void @printf("hello world")
   ;; %rangeCheck1 = icmp ult i32 7, %len  ;; not needed anymore
   ;; call void @llvm.guard_on(i1 %rangeCheck1) [ "deopt"(/* deopt state 1 */) ]
   access a[6] and a[7]
   ...

This way we do a range check only on `7` -- if `7` is within bounds,
then we know `6` is too. This transform is sound only because we know
that the guard on `7 ult %len` will not simply throw an exception if
the said predicate fails, but will bail out to the interpreter with
the abstract state `/* deopt state 0 */`. In fact, if `%len` is `7`,
the pre-transform program is supposed to print `"hello world"` and
*then* throw an exception, and bailing out to the interpreter with `/*
deopt state 0 */` will do exactly that.

In other words, we're allowed to do speculative and aggressive
transforms that make a guard fail when it would not have failed in the
initial program. This is okay because a failed guard only bails to the
interpreter, and the interpreter always Does The Right Thing(TM). In
fact, it is correct (though unwise) to replace every guard's predicate
with `false`.

Another optimization this allows is a more code size friendly version of IRCE. Today, we have to generate pre and post loops when removing bounds checks from an inner loop unless we can *prove* that no iterations would run in those loops. With a guard construct, we can find a guard before the loop and "widen" it to account for the pre and post iteration spaces. This transformation is sometimes known as "loop predication".
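
A sketch of the shape of that transform (assuming a loop running %i
over [0, %n) with an in-loop check `%i u< %len`; the deopt states are
placeholders):

  ;; before: one guard per iteration
  loop:
    %c = icmp ult i32 %i, %len
    call void @llvm.guard_on(i1 %c) [ "deopt"(/* state */) ]
    ...

  ;; after loop predication: a single widened guard before the loop
  ;; covers every iteration (%n u<= %len implies %i u< %len for all
  ;; %i in [0, %n)), with no pre/post loops:
    %c.wide = icmp ule i32 %n, %len
    call void @llvm.guard_on(i1 %c.wide) [ "deopt"(/* state */) ]
  loop:
    ...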

## the problem with check widening and explicit control flow

Widening is difficult to get right in an explicit "check-and-branch"
representation. For instance, the above example in a
"check-and-branch" representation would be (in pseudo C, and excluding
the printf):

   ...
   if (!(6 < %len)) { call @side_exit() [ "deopt"(P) ]; unreachable; }
   if (!(7 < %len)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
   ...

The following transform is invalid:

   ...
   if (!(7 < %len)) { call @side_exit() [ "deopt"(P) ]; unreachable; }
   if (!(true)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
   ...

since we do not know if the first side exit had been optimized under
the assumption `!(6 < %len)` (via JumpThreading etc.). E.g. the
"original" IR could have been

   ...
   if (!(6 < %len)) { call @side_exit() [ "deopt"(!(6 < %len)) ]; unreachable; }
   if (!(7 < %len)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
   ...

which got optimized to

   ...
   if (!(6 < %len)) { call @side_exit() [ "deopt"(true) ]; unreachable; }
   if (!(7 < %len)) { call @side_exit() [ "deopt"(Q) ]; unreachable; }
   ...

before the widening transform. The widening transform will now
effectively pass in an incorrect value for `!(6 < %len)`.

This isn't to say it is impossible to do check widening in an explicit
control flow representation, just that it is more natural to do it with
guards.

# details: semantics

## as-if control flow

The observable behavior of `@llvm.guard_on` is specified as:

   void @llvm.guard_on(i1 %pred) {
   entry:
     %unknown_cond = < unknown source >
     %cond = and i1 %unknown_cond, %pred
     br i1 %cond, label %left, label %right

   left:
     call void @bail_to_interpreter() [ "deopt"() ] noreturn
     unreachable

   right:
     ret void
   }

So, precisely speaking, `@llvm.guard_on` is guaranteed to bail to the
interpreter if `%pred` is false, but it **may** bail to the
interpreter if `%pred` is true. It is this bit that lets us soundly
widen `%pred`, since all we're doing is "refining" `< unknown source >`.

Unless I'm misreading this, it looks like Sanjoy got the IR in the specification wrong. The intrinsic is specified to side exit if the condition is false (so that it's true in the caller after the guard), not the other way around. The text description appears correct.

`@bail_to_interpreter` does not return to the current compilation, but
it returns to the `"deopt"` continuation that it has been given (once
inlined, the empty "deopt"() continuation will be fixed up to have the right
continuation).

This "bail_to_interpreter" is a the more general form of side exit I mentioned above.

## applicable optimizations

Apart from widening, most of the optimizations we're interested in are
what's allowed by an equivalent `@llvm.assume`. Any conditional
branches dominated by a guard on the same condition can be folded,
multiple guards on the same condition can be CSE'ed, loop invariant
guards can be hoisted out of loops etc.

Ultimately, we'd like to recover the same quality of optimization as
we currently get from the "check-and-branch" representation. With the
"check-and-branch" representation, the optimizer is able to sink
stores and computation into the slow path. This is something it cannot
do in the guard_on representation, and we'll have to lower the
guard_on representation to the "check-and-branch" representation at a
suitable point in the pipeline to get this kind of optimization.

## lowering

At some point in the compilation pipeline, we will have to lower
`@llvm.guard_on` into explicit control flow, by "inlining" "an
implementation" of `@llvm.guard_on` (or by some other means). I don't
have a good sense on when in the pipeline this should be done -- the
answer will depend on what we find as we make LLVM more aggressive
around optimizing guards.

As a starting point, I'd likely do this just before code gen prep with some custom sinking logic to pull instructions only used on the failing path into the newly introduced block. Long term, we'd probably want to do the same thing over MI, but mucking with control flow at that layer is a bit more complicated.

Rather than framing this as inlining, I'd frame it as expansion to a well known body (pretty much the one Sanjoy gives above). The @bail_to_interpreter construct could be lowered directly to a function call to some well known symbol name (which JIT users can intercept and bind to whatever they want.) Something like __llvm_side_exit seems like a reasonable choice.
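
Concretely, the expansion might look like (a sketch; block names are
placeholders):

  ;; the guard:
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* state */) ]

  ;; expands, roughly, to:
    br i1 %c, label %cont, label %deopt
  deopt:
    call void @__llvm_side_exit() [ "deopt"(/* state */) ]
    unreachable
  cont:
    ...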

## environments without support for bailing to the interpreter

Our VM has deeply integrated support for deoptimizations, but not all
language runtimes do. If there is a need for non-deoptimizing guards, we
can consider introducing a variant of `@llvm.guard_on`:

declare void @llvm.exception_on(i1 %predicate, i32 %exceptionKind)

with this one having the semantics that it always throws an exception
if `%predicate` fails. Only the non-widening optimizations for
`@llvm.guard_on` will apply to `@llvm.exception_on`.

Not sure we actually need this. A valid implementation of the side exit handler (which would do the OSR for us) is to call a runtime routine which generates and throws the exception. The only bit we might need is a distinction between widenable and non-widenable guards.

## memory effects (unresolved)

[I haven't come up with a good model for the memory effects of
  `@llvm.guard_on`, suggestions are very welcome.]

I'd really like to model `@llvm.guard_on` as a readonly function,
since it does not write to memory if it returns; and e.g. forwarding
loads across a call to `@llvm.guard_on` should be legal.

However, I'm in a quandary around representing the "may never return"
aspect of `@llvm.guard_on`: I have to make it illegal to, say, hoist a
load from `%ptr` across a guard on `%ptr != null`.

Modeling this as memory dependence just seems wrong. We already have to model control dependence on functions which may throw. I don't think there's anything new here.

The only unusual bit is that we're going to want to teach AliasAnalysis that the guard doesn't write to any memory location (to allow forwarding) while still preserving the control dependence.

There are a couple
of ways I can think of dealing with this, none of them are both easy
and neat:

  - State that since `@llvm.guard_on` could have had an infinite loop
    in it, it may never return. Unfortunately, the LLVM IR has some
    rough edges on readonly infinite loops (since C++ disallows that),
    so LLVM will require some preparatory work before we can do this
    soundly.

  - State that `@llvm.guard_on` can unwind, and thus has non-local
    control flow. This can actually work (and is pretty close to
the actual semantics), but is somewhat of a hack since
    `@llvm.guard_on` doesn't _really_ throw an exception.

Er, careful. Semantically, the guard *might* throw an exception. It could be that that's what the interpreter does when evaluating the continuation implied by the guard, and any of our callers have to account for the fact that the function which contains the guard might throw. The easiest way to ensure that is to model the guard call as possibly throwing.

Replies inline.

At a high level, it feels like we'll eventually need a new instruction
to represent the kind of control flow a guard entails (to be clear: we
should probably still start with an intrinsic) -- they are fairly
well-behaved, i.e. readonly, nounwind etc. as far as the immediate
"physical" caller is concerned, but not so as far as its callers's
callers are concerned.

one very important difference -- `@llvm.guard_on(i1 <false>)` is well
defined (and not UB). `@llvm.guard_on` on a false predicate bails to
the interpreter and that is always safe (but slow), and so
`@llvm.guard_on(i1 false)` is basically a `noreturn` call that
unconditionally transitions the current compilation to the
interpreter.

It's also worth noting that @llvm.guard_on(i1 true) is useful and well
defined as well. When lowering, such a guard simply disappears, but it can
be useful to keep around in the IR during optimization. It gives a well
defined point for widening transforms to apply with a well known
deoptimization state already available.

Yes! Actually, I had exactly this in an earlier version of this
writeup, which I removed to make the (already long) RFC shorter.

I'd suggest a small change to Sanjoy's declaration. I think we should allow
additional arguments to the guard, not just the condition. What exactly
those arguments mean would be up to the runtime, but various runtimes might
want to provide additional arguments to the OSR mechanism.

We'll still have to make a call on the signature of the intrinsic (or
are you suggesting a varargs intrinsic?)

I suppose we could also have a family of intrinsics that take an
argument of variable type.

Bailing out to the interpreter involves re-creating the state of the
interpreter frames as-if the compilee had been executing in the
interpreter all along. This state is represented and maintained using
a `"deopt"` operand bundle attached to the call to `@llvm.guard_on`.
The verifier will reject calls to `@llvm.guard_on` without a `"deopt"`
operand bundle.

This introduces a very subtle point. The side exit only affects the
*function* which contains the guard. A caller of that function in the same
module may be returned to by either the function itself, or the interpreter
after running the continuation implied by the guard. This introduces a
complication for IPA/IPO; any guard (really, any side exit, of which guards
are one form) has to be treated as a possible return point from the callee
with an unknowable return value and memory state.

This is a really good point. This has strong implications for the
guard's memory effects as well -- even though a guard can be seen as
readonly in its containing function, things that call the containing
function have to see the guard as read-write. IOW @foo below is
read/write, even though v0 can be forwarded to v1:

int @foo(int cond) {
  int v0 = this->field;
  guard_on(cond);
  int v1 = this->field;
  return v0 + v1;
}

As you point out, we're also introducing a newish kind of control flow
here. It is not fundamentally new, since longjmp does something
similar (but not quite the same).

I hate to say this, but perhaps we're really looking at (eventually) a
new instruction here, and not just a new intrinsic.

`@llvm.guard_on` cannot be `invoke`ed (that would be
meaningless anyway, since the method it would've "thrown" into is
about to go away).

I disagree with this bit. It needlessly complicates the inliner. Allowing
an invoke of a guard which SimplifyCFG converts to calls just like it would
a nothrow function seems much cleaner.

SGTM.

The observable behavior of `@llvm.guard_on` is specified as:

   void @llvm.guard_on(i1 %pred) {
   entry:
     %unknown_cond = < unknown source >
     %cond = and i1 %unknown_cond, %pred
     br i1 %cond, label %left, label %right

   left:
     call void @bail_to_interpreter() [ "deopt"() ] noreturn
     unreachable

   right:
     ret void
   }

So, precisely speaking, `@llvm.guard_on` is guaranteed to bail to the
interpreter if `%pred` is false, but it **may** bail to the
interpreter if `%pred` is true. It is this bit that lets us soundly
widen `%pred`, since all we're doing is "refining" `< unknown source >`.

Unless I'm misreading this, it looks like Sanjoy got the IR in the
specification wrong. The intrinsic is specified to side exit if the
condition is false (so that it's true in the caller after the guard), not
the other way around. The text description appears correct.

Yes, and thanks for catching. The branch should have been "br i1
%cond, label %right, label %left".

`@bail_to_interpreter` does not return to the current compilation, but
it returns to the `"deopt"` continuation that it has been given (once
inlined, the empty "deopt"() continuation will be fixed up to have the
right continuation).

This "bail_to_interpreter" is a the more general form of side exit I
mentioned above.

How is it more general?

As a starting point, I'd likely do this just before code gen prep with some
custom sinking logic to pull instructions only used on the failing path into
the newly introduced block. Long term, we'd probably want to do the same
thing over MI, but mucking with control flow at that layer is a bit more
complicated.

Rather than framing this as inlining, I'd frame it as expansion to a well
known body (pretty much the one Sanjoy gives above). The
@bail_to_interpreter construct could be lowered directly to a function call
to some well known symbol name (which JIT users can intercept and bind to
whatever they want.) Something like __llvm_side_exit seems like a
reasonable choice.

SGTM.

with this one having the semantics that it always throws an exception
if `%predicate` fails. Only the non-widening optimizations for
`@llvm.guard_on` will apply to `@llvm.exception_on`.

Not sure we actually need this. A valid implementation of the side exit
handler (which would do the OSR for us) is to call a runtime routine which
generates and throws the exception. The only bit we might need is a
distinction between widenable and non-widenable guards.

Yes, and that's the only distinction I'm trying to make here.

## memory effects (unresolved)

[I haven't come up with a good model for the memory effects of
  `@llvm.guard_on`, suggestions are very welcome.]

I'd really like to model `@llvm.guard_on` as a readonly function,
since it does not write to memory if it returns; and e.g. forwarding
loads across a call to `@llvm.guard_on` should be legal.

However, I'm in a quandary around representing the "may never return"
aspect of `@llvm.guard_on`: I have to make it illegal to, say, hoist a
load from `%ptr` across a guard on `%ptr != null`.

Modeling this as memory dependence just seems wrong. We already have to
model control dependence on functions which may throw. I don't think
there's anything new here.

I am trying to model this as control dependence, but the difficult bit
is to do that while still maintaining that the call does not clobber
any memory. I'm worried that there may be reasons (practical or
theoretical) why "readonly" functions always have to terminate and
be nothrow.

The only unusual bit is that we're going to want to teach AliasAnalysis that
the guard doesn't write to any memory location (to allow forwarding) while
still preserving the control dependence.

So you're saying that we model the guard as otherwise read/write (thus
sidestepping the readonly non-returning quandary) but teach
AliasAnalysis that it doesn't clobber any memory? That would work.

We can also use the same tool to solve the "may return to its caller's
caller with arbitrary heap state" issue by teaching AA that a guard
does not alias with reads in its own (physical) function, but clobbers
the heap for other (physical) functions.

Notation: I'm differentiating between physical functions == functions
that create actual stack frames and inlined functions == logical Java
functions that don't create separate physical frames. Inlining IR
from one Java-level function into another usually creates a physical
function that contains more than one logical function.

There are a couple
of ways I can think of dealing with this, none of them are both easy
and neat:

  - State that since `@llvm.guard_on` could have had an infinite loop
    in it, it may never return. Unfortunately, the LLVM IR has some
    rough edges on readonly infinite loops (since C++ disallows that),
    so LLVM will require some preparatory work before we can do this
    soundly.

  - State that `@llvm.guard_on` can unwind, and thus has non-local
    control flow. This can actually work (and is pretty close to
the actual semantics), but is somewhat of a hack since
    `@llvm.guard_on` doesn't _really_ throw an exception.

Er, careful. Semantically, the guard *might* throw an exception. It could
be that that's what the interpreter does when evaluating the continuation implied
by the guard, and any of our callers have to account for the fact that the
function which contains the guard might throw. The easiest way to ensure
that is to model the guard call as possibly throwing.

Yes, it does not throw an exception into its own caller, but may throw
into its caller's caller.

-- Sanjoy

Replies inline.

At a high level, it feels like we'll eventually need a new instruction
to represent the kind of control flow a guard entails (to be clear: we
should probably still start with an intrinsic) -- they are fairly
well-behaved, i.e. readonly, nounwind etc. as far as the immediate
"physical" caller is concerned, but not so as far as its callers's
callers are concerned.

I think you're jumping ahead a bit here. I'm not sure the semantics are anywhere near as weird as you're framing them to be. :)

I'd suggest a small change to Sanjoy's declaration. I think we should allow
additional arguments to the guard, not just the condition. What exactly
those arguments mean would be up to the runtime, but various runtimes might
want to provide additional arguments to the OSR mechanism.

We'll still have to make a call on the signature of the intrinsic (or
are you suggesting a varargs intrinsic?)

I suppose we could also have a family of intrinsics that take an
argument of variable type.

I was proposing a varargs _intrinsic_. (Not varargs as in C/C++, but varargs as in polymorphic over all static signatures.)

Bailing out to the interpreter involves re-creating the state of the
interpreter frames as-if the compilee had been executing in the
interpreter all along. This state is represented and maintained using
a `"deopt"` operand bundle attached to the call to `@llvm.guard_on`.
The verifier will reject calls to `@llvm.guard_on` without a `"deopt"`
operand bundle.

This introduces a very subtle point. The side exit only affects the
*function* which contains the guard. A caller of that function in the same
module may be returned to by either the function itself, or the interpreter
after running the continuation implied by the guard. This introduces a
complication for IPA/IPO; any guard (really, any side exit, of which guards
are one form) has to be treated as a possible return point from the callee
with an unknowable return value and memory state.

This is a really good point. This has strong implications for the
guard's memory effects as well -- even though a guard can be seen as
readonly in its containing function, things that call the containing
function have to see the guard as read-write. IOW @foo below is
read/write, even though v0 can be forwarded to v1:

int @foo(int cond) {
   int v0 = this->field;
   guard_on(cond);
   int v1 = this->field;
   return v0 + v1;
}

Essentially, we'd be introducing an aliasing rule along the following: "reads nothing on normal path, reads/writes world if guard is taken (in which case, does not return)." Yes, implementing that will be a bit complicated, but I don't see this as a fundamental issue.

As you point out, we're also introducing a newish kind of control flow
here. It is not fundamentally new, since longjmp does something
similar (but not quite the same).

I hate to say this, but perhaps we're really looking at (eventually) a
new instruction here, and not just a new intrinsic.

`@bail_to_interpreter` does not return to the current compilation, but
it returns to the `"deopt"` continuation that is has been given (once
inlined, the empty "deopt"() continuation will be fixed up to have the
right
continuation).

This "bail_to_interpreter" is a the more general form of side exit I
mentioned above.

How is it more general?

You can express a guard as a conditional branch to a @bail_to_interpreter construct. Without the @bail_to_interpreter (which is the thing which has those weird aliasing properties we're talking about), you're stuck.

(In our tree, we've implemented exactly the "bail_to_interpreter" construct under a different name, marked as readonly with an exception in FunctionAttrs to resolve the IPO problem.)

What the guard nodes add is a) conciseness, b) potentially better optimization, and c) the ability to widen.

Thinking about it further, I think we should probably have started with proposing the @bail_to_interpreter construct, gotten that working fully upstream, then done the guard, but oh well. We can do both in parallel since we have a working implementation downstream of @b_to_i.

## memory effects (unresolved)

[I haven't come up with a good model for the memory effects of
   `@llvm.guard_on`, suggestions are very welcome.]

I'd really like to model `@llvm.guard_on` as a readonly function,
since it does not write to memory if it returns; and e.g. forwarding
loads across a call to `@llvm.guard_on` should be legal.

However, I'm in a quandary around representing the "may never return"
aspect of `@llvm.guard_on`: I have to make it illegal to, say, hoist a
load form `%ptr` across a guard on `%ptr != null`.

Modeling this as memory dependence just seems wrong. We already have to
model control dependence on functions which may throw. I don't think
there's anything new here.

I am trying to model this as control dependence, but the difficult bit
is to do that while still maintaining that the call does not clobber
any memory. I'm worried that there may be reasons (practical or
theoretical) why "readonly" functions always have to terminate and
be nothrow.

I think we should not bother modeling the call as readonly. As we've seen with @llvm.assume, we can get something which is readonly in practice without it being readonly per se. :) And, as above, it's not clear that readonly is even a correct way to model a guard at all.

The only unusual bit is that we're going to want to teach AliasAnalysis that
the guard doesn't write to any memory location (to allow forwarding) while
still preserving the control dependence.

So you're saying that we model the guard as otherwise read/write (thus
sidestepping the readonly non-returning quandary) but teach
AliasAnalysis that it doesn't clobber any memory? That would work.

Yes. As we've done for @llvm.assume and a number of other one off cases. Once we've implemented the exact semantics we want, we can decide if there's a property lurking which we can extract as a new attribute.

We can also use the same tool to solve the "may return to its caller's
caller with arbitrary heap state" issue by teaching AA that a guard
does not alias with reads in its own (physical) function, but clobbers
the heap for other (physical) functions.

I'd have to think about this a bit further before agreeing, but on the surface this seems reasonable.

I think you're jumping ahead a bit here. I'm not sure the semantics are
anywhere near as weird as you're framing them to be. :)

I now think this weirdness actually does not have anything to do with
guard_on or bail_to_interpreter, but it has to do with deopt bundles
itself. Our notion of "deopt bundles are readonly" is broken to
begin with, and that is what's manifesting as the complication we're
seeing here.

Consider something like

declare @foo() readonly
def @bar() { call @foo() [ "deopt"(XXX) ] }
def @baz() { call @bar() [ "deopt"(YYY) ] }

Right now according to the semantics of "deopt" operand bundles as in
the LangRef, every call site above is readonly. However, it is
possible for @baz() to write to memory if @bar is deoptimized at the
call site with the call to @foo.

You could say that it isn't legal to mark @foo as readonly, since the
action of deoptimizing one's caller is not a readonly operation. But
that doesn't work in cases like this:

global *ptr
declare @foo() readwrite
def @bar() { call @foo() [ "deopt"(XXX) ]; *ptr = 42 }
def @baz() { call @bar() [ "deopt"(YYY) ]; int v0 = *ptr }

Naively, it looks like an inter-proc CSE can forward 42 to v0, but
that's unsound, since @bar could get deoptimized at the call to
@foo(), and then who knows what'll get written to *ptr.

My interpretation here is that we're not modeling the deopt
continuations correctly. Above, the XXX continuation is a delimited
continuation that terminates at the boundary of @bar, and seen from
its caller, the memory effect (and any other effect) of @bar has to
take into account that the "remainder" of @bar() after @foo has
returned is either what it can see in the IR, or the XXX continuation
(which it //could// analyze in theory, but in practice is unlikely
to).

This is kind of a bummer since what I said above directly contradicts
the "As long as the behavior of an operand bundle is describable
within these restrictions, LLVM does not need to have special
knowledge of the operand bundle to not miscompile programs containing
it." bit in the LangRef. :frowning:

Essentially, we'd be introducing an aliasing rule along the following:
"reads nothing on normal path, reads/writes world if guard is taken (in
which case, does not return)." Yes, implementing that will be a bit
complicated, but I don't see this as a fundamental issue.

Yup, and this is a property of deopt operand bundles, not just guards.

How is it more general?

You can express a guard as a conditional branch to a @bail_to_interpreter
construct. Without the @bail_to_interpreter (which is the thing which has
those weird aliasing properties we're talking about), you're stuck.

I thought earlier you were suggesting bail_to_interpreter is more
general than side_exit (when I thought they were one and the same
thing), not that bail_to_interpreter is more general than guard.

Aside: theoretically, if you have @guard() as a primitive then
bail_to_interpreter is just @guard(false).

-- Sanjoy

Some minor additions to what I said earlier:

My interpretation here is that we're not modeling the deopt
continuations correctly. Above, the XXX continuation is a delimited
continuation that terminates at the boundary of @bar, and seen from
its caller, the memory effect (and any other effect) of @bar has to
take into account that the "remainder" of @bar() after @foo has
returned is either what it can see in the IR, or the XXX continuation
(which it //could// analyze in theory, but in practice is unlikely
to).

A related question is: down below, we're clearly allowed to forward 42
into v0 after inlining through the call to @bar (while before inlining
we weren't, as shown earlier) -- what changed?

global *ptr
declare @foo() readwrite
def @bar() { call @foo() [ "deopt"(XXX) ]; *ptr = 42 }
def @baz() { call @bar() [ "deopt"(YYY) ]; int v0 = *ptr }

inlining ==>

global *ptr
declare @foo() readwrite
def @bar() { call @foo() [ "deopt"(XXX) ]; *ptr = 42 }
def @baz() { call @foo() [ "deopt"(YYY XXX) ]; *ptr = 42; int v0 = *ptr }

What changed is that inlining composed the normal (non-deopt)
continuation in @baz() with the non-deopt continuation in @bar (and
the deopt continuation with the deopt continuation). Thus the
non-deopt continuation in @baz no longer sees a merge of the final
states of the deopt and non-deopt continuations in @bar and can thus
be less pessimistic.

This is kind of a bummer since what I said above directly contradicts
the "As long as the behavior of an operand bundle is describable
within these restrictions, LLVM does not need to have special
knowledge of the operand bundle to not miscompile programs containing
it." bit in the LangRef. :frowning:

Now that I think about it, this isn't too bad. This means "deopt"
operand bundles will need some special handling in IPO passes, but
they get that anyway in the inliner.

-- Sanjoy

So, to summarize, the action item here is (let me know if you
disagree): I need to go and fix the
semantics of deopt operand bundles around IPO, and once that is done,
the weirdness around guards being readonly only in their immediate
callers will no longer be an issue.

-- Sanjoy

Hi Sanjoy,

A quick question here. With the bailing to the interpreter support that you’re envisioning (“deopt operand bundle”), it appears to overlap quite a bit with the existing stack maps. What’s the story/interaction/etc here? I agree that a simpler control flow is great when bailing to the interpreter - doing it with phi nodes is a recipe for pain and long compile times.

Thanks!

-eric

Hi Eric,

(That's a very valid question, thanks for asking!)

The key difference between stackmaps and operand bundles is that it is more
natural to teach the optimizer to manipulate deopt operand bundles, as
opposed to values live into a stackmap. This becomes a requirement[0]
around inlining, since (say) we have:

void foo() {
  if (ptr == null) bail_to_interpreter() [ "deopt"(i32 101) ];
}

void bar() {
  foo() [ "deopt"(i32 237) ]
}

and we wish to inline foo into bar. Then the inlined body has to look like

void bar() {
  if (ptr == null)
    bail_to_interpreter() [ "deopt"(i32 237, i32 101) ]
}

// definition of foo() is elided

This is because if ptr is null, and we do have to bail to the
interpreter, we cannot just create one frame, but have to create *two*
interpreter frames, the younger of which is in state "i32 101"
(obviously a real deopt state will be more complex) and the older of
which is in state "i32 237". Once these frames have been created,
execution will continue in the interpreter.

(Just to be very explicit about it -- the deopt state "i32 237", when
resumed to, will re-start interpreted execution in the method @bar()
"as if" foo() had just returned.)

We want the inliner to do this kind of adjustment for us [1], and
doing this over an intrinsic is awkward. For instance, an
intrinsified version of the above will look like:

void foo() {
  if (ptr == null) call @call_with_deopt_state(bail_to_interpreter, i32 101)
}

void bar() {
  call @call_with_deopt_state(foo, i32 237)
}

and the inliner and other IPO passes will have to have special
knowledge of the @call_with_deopt_state intrinsic. Given how
different the semantics of @call_with_deopt_state are from a normal
call, it seemed more natural to make deoptimization state a first
class citizen of LLVM IR.

Secondly, the issues with deopt bundles and IPO this discussion
uncovered are also applicable to stackmaps; and since deopt bundles
are a distinct IR construct, phrasing the IPO / IPA restrictions will
be easier.

Does this answer your question?

[0]: This isn't a problem for clients who don't do much optimization
within LLVM, but it is for us.

[1]: https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Utils/InlineFunction.cpp#L1486

-- Sanjoy

Sanjoy gave the long answer, let me give the short one. :)

“deopt” argument bundles are used in the middle end, they are lowered into a statepoint, and generate the existing stackmap format. i.e. one builds on the other.

Heh. I like his more, but I see where you’re coming from as well - the easy way (for me at least) to look at the guard is as an implicit branch to a side exit that restores (or communicates) state back to the interpreter. The lowering will likely need to surface some amount of that control flow. Though I’m not quite sure how you want the intrinsic to perform in the face of optimization. What’s legal to move across etc. There were some comments earlier in the thread, but getting it explicitly laid out in the next writeup would be good.

More thoughts:

It might make sense for the stackmap intrinsic to go away after this - I don’t think it’s clear though. I’m open to arguments either way.

That said, I didn’t see a representation for the deopt bundles? I’m assuming they’re normal operands to the IR instruction yes?

-eric

Heh. I like his more, but I see where you're coming from as well - the easy
way (for me at least) to look at the guard is as an implicit branch to a
side exit that restores (or communicates) state back to the interpreter. The

By "this" do you mean "explicit conditional branch to
bail_to_interpreter"? That mostly works, except that it does not
allow "widening" type optimizations, as discussed in the very first
email.

lowering will likely need to surface some amount of that control flow.
Though I'm not quite sure how you want the intrinsic to perform in the face
of optimization. What's legal to move across etc. There were some comments
earlier in the thread, but getting it explicitly laid out in the next
writeup would be good.

Yes. The current status here is that there are some (not specific to
guard intrinsics) issues with the deopt bundles representation that
need to be fixed before we can proceed with implementing guard
intrinsics.

More thoughts:

It might make sense for the stackmap intrinsic to go away after this - I
don't think it's clear though. I'm open to arguments either way.

We can consider dropping the IR representation of stackmaps, but
that's a separate discussion. Especially given that they're
intrinsics, I don't think we're paying a high cost in keeping them.

That said, I didn't see a representation for the deopt bundles? I'm assuming
they're normal operands to the IR instruction yes?

Deopt bundles are normal operands to the IR instruction, and they've
been in tree for some time now:
http://llvm.org/docs/LangRef.html#deoptimization-operand-bundles

-- Sanjoy

Heh. I like his more, but I see where you’re coming from as well - the easy
way (for me at least) to look at the guard is as an implicit branch to a
side exit that restores (or communicates) state back to the interpreter. The

By “this” do you mean “explicit conditional branch to
bail_to_interpreter”? That mostly works, except that it does not
allow “widening” type optimizations, as discussed in the very first
email.

I don’t use “this” in my above comment so I’m not sure what you’re asking?

lowering will likely need to surface some amount of that control flow.
Though I’m not quite sure how you want the intrinsic to perform in the face
of optimization. What’s legal to move across etc. There were some comments
earlier in the thread, but getting it explicitly laid out in the next
writeup would be good.

Yes. The current status here is that there are some (not specific to
guard intrinsics) issues with the deopt bundles representation that
need to be fixed before we can proceed with implementing guard
intrinsics.

OK.

More thoughts:

It might make sense for the stackmap intrinsic to go away after this - I
don’t think it’s clear though. I’m open to arguments either way.

We can consider dropping the IR representation of stackmaps, but
that’s a separate discussion. Especially given that they’re
intrinsics, I don’t think we’re paying a high cost in keeping them.

That’s fair. I don’t have a strong preference anyhow.

That said, I didn’t see a representation for the deopt bundles? I’m assuming
they’re normal operands to the IR instruction yes?

Deopt bundles are normal operands to the IR instruction, and they’ve
been in tree for some time now:
http://llvm.org/docs/LangRef.html#deoptimization-operand-bundles

Ha. That’s what I get for popping up so late in your work :slight_smile:

-eric

I misread “his” as “this”, sorry.

– Sanjoy

Heh. I like his more, but I see where you’re coming from as well - the easy
way (for me at least) to look at the guard is as an implicit branch to a
side exit that restores (or communicates) state back to the interpreter. The

By “this” do you mean “explicit conditional branch to
bail_to_interpreter”? That mostly works, except that it does not
allow “widening” type optimizations, as discussed in the very first
email.

I don’t use “this” in my above comment so I’m not sure what you’re asking?

I misread “his” as “this”, sorry.

No worries, I think we're on the same page, I just wanted to make sure :)

-eric

This is a proposal to add guard intrinsics to LLVM.

A couple of glossary items: when I say "interpreter" I mean "the most
conservative tier in the compilation stack" which can be an actual
interpreter, a "splat compiler" or even a regular JIT that doesn't
make optimistic assumptions. By "bailing out to the interpreter" I
mean "side exit" as defined by Philip in
http://www.philipreames.com/Blog/2015/05/20/deoptimization-terminology/

# high level summary

Guard intrinsics are intended to represent "if (!cond) leave();" like
control flow in a structured manner. This kind of control flow is
abundant in IR coming from safe languages due to things like range
checks and null checks (and overflow checks, in some cases). From a
high level, there are two distinct motivations for introducing guard
intrinsics:

- To keep control flow as simple and "straight line"-like as is
  reasonable

- To allow certain kinds of "widening" transforms that cannot be
  soundly done in an explicit "check-and-branch" control flow
  representation

## straw man specification

Signature:

declare void @llvm.guard_on(i1 %predicate) ;; requires [ "deopt"(...) ]

Thanks Sanjoy. Sorry it took a couple days for me to read this.

I like this approach a lot. I can envision multiple intrinsics leveraging the basic framework of operand bundles, but each with different semantics and lowering. I'm no longer considering extending the patchpoint design to cover these cases. Your semantics for @llvm.guard_on are perfect for managed runtimes.

Just to give you the idea of a similar intrinsic we may want to introduce... I would also like to add a @llvm.trap_on(i1 %pred) intrinsic, with side effects and lowering as such:

  br i1 %cond, label %succeed, label %fail

fail:
  call void @llvm.trap()
  unreachable

This clearly doesn't need operand bundles, but using an intrinsic would permit special code motion semantics. It could be hoisted and merged with other traps, but the condition could never be widened beyond the union of the original conditions. Unlike deoptimizing guards, it would need to be sequenced with memory barriers, but could otherwise be hoisted as readnone. The semantics of "non deoptimizing guards" are different enough that they should be a different intrinsic. @trap_on would essentially be halfway between @guard_on and @exception_on.
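
For example (a sketch, following the guard_on-style convention above
that the intrinsic traps when its predicate is false, and assuming the
intervening code contains no memory barriers the traps must be
sequenced with):

  call void @llvm.trap_on(i1 %c0)
  ...
  call void @llvm.trap_on(i1 %c1)

  ;; the two traps can be merged; the merged condition is the
  ;; conjunction, so the trap fires exactly on the union of the
  ;; original failure conditions and is never widened beyond it:
  %c = and i1 %c0, %c1
  call void @llvm.trap_on(i1 %c)
  ...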

## memory effects (unresolved)

[I haven't come up with a good model for the memory effects of
`@llvm.guard_on`, suggestions are very welcome.]

I'd really like to model `@llvm.guard_on` as a readonly function,
since it does not write to memory if it returns; and e.g. forwarding
loads across a call to `@llvm.guard_on` should be legal.

Of course it would be nice to consider these intrinsics readonly. (One thing we never fixed with patchpoints is the memory effects and that was probably hurting performance.) But has this bug been fixed yet: http://llvm.org/pr18912 Optimizer should consider "maythrow" calls (those without "nounwind") as having side effects? I vaguely remember some work being done to improve the situation, but last I checked LICM was still violating it, and who knows what else?

BTW, if you do want readonly semantics, why would you want readonly to be implicit instead of explicit?

However, I'm in a quandary around representing the "may never return"
aspect of `@llvm.guard_on`: I have to make it illegal to, say, hoist a
load from `%ptr` across a guard on `%ptr != null`. There are a couple
of ways I can think of dealing with this, none of them are both easy
and neat:

- State that since `@llvm.guard_on` could have had an infinite loop
  in it, it may never return. Unfortunately, the LLVM IR has some
  rough edges on readonly infinite loops (since C++ disallows that),
  so LLVM will require some preparatory work before we can do this
  soundly.

- State that `@llvm.guard_on` can unwind, and thus has non-local
  control flow. This can actually work (and is pretty close to
  the actual semantics), but is somewhat of a hack since
  `@llvm.guard_on` doesn't _really_ throw an exception.

- Special case `@llvm.guard_on` (yuck!).

What do you think?

I think you would need to mark @guard_on as may-unwind AND fix any lingering assumptions in LLVM that readonly calls are nounwind. (Loads can be CSE'd across unwind calls but can't be hoisted across them.)
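
To spell out that distinction with a minimal sketch (illustrative IR; @f stands for any readonly, may-unwind call):

  ;; CSE across the call is fine -- nothing in between writes to %p:
  %a = load i32, i32* %p
  call void @f()
  %b = load i32, i32* %p  ;; %b can reuse %a

  ;; hoisting the load above the call is not -- if @f() unwinds, the
  ;; original program never executes the load, so executing it early
  ;; can introduce a fault the original program didn't have.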

An alternative is to *not* mark them as readonly but rely on alias analysis. I think doing that would properly model the "unwind" path for the purpose of IPA/Inlining. Introducing TBAA on calls should be easy and sufficient to unblock the memory optimization you need.
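
Purely to give the flavor of this route -- a sketch using LLVM's existing scoped-noalias metadata on the call site (TBAA on calls would be analogous); the !1 scope list is hypothetical:

  ;; the call is *not* readonly, but AA is told its accesses don't
  ;; alias anything in scope !1, which covers the frontend's memory:
  call void @llvm.guard_on(i1 %c) [ "deopt"(/* state */) ], !noalias !1
  %v = load i32, i32* %ptr, !alias.scope !1  ;; forwardable across the call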

Of course, IPA still needs to know about side exits, which it really should with any may-unwind call. Other than that, I don't see the problem.

Another good point mentioned later in the thread is that a readonly callee should *not* imply a readonly @guard_on or other "deopt" call site. The simple solution for this is to disallow "deopt" of readonly calls.

-Andy

Hi Andy,

Thanks for replying, responses inline below:

This clearly doesn't need operand bundles, but using an intrinsic
would permit special code motion semantics. It could be hoisted and
merged with other traps, but the condition could never be widened
beyond the union of the original conditions. Unlike deoptimizing
guards, it would need to be sequenced with memory barriers, but could

By memory barrier, do you mean things like fences?

otherwise be hoisted as readnone.

Can you clarify this a little bit? Are you talking about things like:

  *ptr = 42
  @trap_if(%foo)
=>
  @trap_if(%foo)
  *ptr = 42

? I.e. if a failing guard postdominates a point "P" in the program,
then the guard can be failed early at P.

(We'd want a stronger condition than postdomination, since we won't
want to hoist

    while (!*cond) { }
    @trap_if(*cond)
  =>
    @trap_if(*cond)
    while (!*cond) { }
)

[snip]

Of course it would be nice to consider these intrinsics
readonly. (One thing we never fixed with patchpoints is the memory
effects, and that was probably hurting performance.) But has this bug
been fixed yet: http://llvm.org/pr18912 ("Optimizer should consider
"maythrow" calls (those without "nounwind") as having side effects")?
I vaguely remember some work being done to improve the situation, but
last I checked LICM was still violating it, and who knows what else?

The specific case in the bug was fixed by David in
http://reviews.llvm.org/rL256728. But I agree with your concern that
this notion of "readnone calls are always okay to remove" may have
leaked into other parts of LLVM.

BTW, if you do want readonly semantics, why would you want readonly
to be implicit instead of explicit?

I don't understand the question. :) What's explicit vs. implicit here?

I think you would need to mark @guard_on as may-unwind AND fix any
lingering assumptions in LLVM that readonly calls are nounwind. (Loads
can be CSE'd across unwind calls but can't be hoisted across them.)

Yes, but a failed guard doesn't exactly "unwind". I.e. if A invokes B
and B fails a guard, then the guard may not always unwind to the A->B
callsite's landingpad, but can also return to its normal no-exception
successor; unless you inline B into A, in which case B's guard (now
physically present in the A+B function) will return from A if it
fails. In other words, we'll have to re-purpose the term "unwind" to
mean either "throwing an exception" or "failing a guard". I'm fine
with this, but it is a choice we need to make explicitly.
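
A sketch of the two cases (illustrative; unrelated details elided):

  define void @A() {
    ;; if a guard fails inside @B, it deoptimizes @B's frame and
    ;; *returns* here, to %normal -- it does not branch to %lp:
    invoke void @B() to label %normal unwind label %lp
    ...
  }

  ;; whereas after inlining @B into @A, the same guard (now physically
  ;; present in the A+B function) deoptimizes @A's frame: a failing
  ;; guard "returns from" @A.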

Another good point mentioned later in the thread is that a readonly
callee should *not* imply a readonly @guard_on or other "deopt" call
site. The simple solution for this is to disallow "deopt" of readonly
calls.

Assuming we're talking about the same thing, the concern was more
along the lines of: you cannot do IPA on a callsite that calls
something that contains a potential deoptimization point, be it
a @guard_on call or a "normal" deoptimization safepoint. This is
because if you have:

void @something()

void callee() {
  call @something() [ "deopt"(state0) ]
  *ptr = 100;
}

void caller() {
  call @callee()  // whether this is a "deopt" call site or
                  // a normal call site doesn't matter
  int k = *ptr;
}

you cannot, say, fold `k` to `100` in @caller() because the deopt
state, state0, if resumed to, may actually write something else to
*ptr.

Making the call to @something() read/write/may unwind does not solve
the problem -- even if the call to @something() wrote to *ptr (or
threw an exception), we could still fold k to 100. The novel control
flow here is that `@callee`, if deoptimized with `state0`, will
execute some arbitrary code, and **return** to @caller at the callsite
to @callee.

(Btw: things are a little different inside our JVM because of the way
we register dependencies, but I'm trying to make an LLVM-centric
argument here.)

At this point, I think it is most straightforward to do a slight
variant of what Philip suggested to me offline: if a function can
deoptimize, it needs to be marked as "mayBeOverridden", preferably by
introducing a new function attribute or by setting the linkage type to
one of those checked by GlobalValue::mayBeOverridden. This is the
responsibility of the frontend, and resuming to the "deopt"
continuation in a physical frame where the enclosing function was not
marked mayBeOverridden is UB.
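
To sketch the intent (strawman syntax: `interposable` here is just a placeholder name for such a new attribute, not existing IR):

  ;; @callee contains a deoptimization point, so the frontend marks it
  ;; interposable; IPA must treat its body as replaceable, and may not
  ;; e.g. conclude that *ptr == 100 after a call to @callee returns.
  define void @callee() interposable {
    call void @something() [ "deopt"(/* state0 */) ]
    store i32 100, i32* @ptr
    ret void
  }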

-- Sanjoy

Hi Andy,

Thanks for replying, responses inline below:

This clearly doesn't need operand bundles, but using an intrinsic
would permit special code motion semantics. It could be hoisted and
merged with other traps, but the condition could never be widened
beyond the union of the original conditions. Unlike deoptimizing
guards, it would need to be sequenced with memory barriers, but could

By memory barrier, do you mean things like fences?

Yes, the ‘fence’ instruction (not to be confused with GC barriers)

otherwise be hoisted as readnone.

Can you clarify this a little bit? Are you talking about things like:

  *ptr = 42
  @trap_if(%foo)
=>
  @trap_if(%foo)
  *ptr = 42

? I.e. if a failing guard postdominates a point "P" in the program,
then the guard can be failed early at P.

That’s right. Conservatively, I would not hoist at the LLVM level past
opaque non-readonly calls or fences.

(We'd want a stronger condition than postdomination, since we won't
want to hoist

   while (!*cond) { }
   @trap_if(*cond)
=>
   @trap_if(*cond)
   while (!*cond) { }
)

Sorry, if you need an infinite loop you need to hide it in an opaque call or perform a
volatile read.
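
For example (illustrative IR), the volatile-read version:

  loop:
    %c = load volatile i1, i1* %cond  ;; volatile: an observable effect
                                      ;; on every iteration, so the loop
                                      ;; cannot be treated as removable
    br i1 %c, label %exit, label %loop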

Of course it would be nice to consider these intrinsics
readonly. (One thing we never fixed with patchpoints is the memory
effects, and that was probably hurting performance.) But has this bug
been fixed yet: http://llvm.org/pr18912 ("Optimizer should consider
"maythrow" calls (those without "nounwind") as having side effects")?
I vaguely remember some work being done to improve the situation, but
last I checked LICM was still violating it, and who knows what else?

The specific case in the bug was fixed by David in
http://reviews.llvm.org/rL256728. But I agree with your concern that
this notion of "readnone calls are always okay to remove" may have
leaked into other parts of LLVM.

I've begun to think that may-unwind (or may fail guard) and readonly
should be mutually exclusive. readonly refers to all system memory,
not just LLVM-visible memory. To achieve the effect of
"aliases-with-nothing-in-llvm" it's cleaner to use alias analysis.

BTW, if you do want readonly semantics, why would you want readonly
to be implicit instead of explicit?

I don't understand the question. :) What's explicit vs. implicit here?

I meant that readonly can always be a property of the call site, as
opposed to an intrinsic property, for more frontend control.

I think you would need to mark @guard_on as may-unwind AND fix any
lingering assumptions in LLVM that readonly calls are nounwind. (Loads
can be CSE'd across unwind calls but can't be hoisted across them.)

Yes, but a failed guard doesn't exactly "unwind". I.e. if A invokes B
and B fails a guard, then the guard may not always unwind to the A->B
callsite's landingpad, but can also return to its normal no-exception
successor; unless you inline B into A, in which case B's guard (now
physically present in the A+B function) will return from A if it
fails. In other words, we'll have to re-purpose the term "unwind" to
mean either "throwing an exception" or "failing a guard". I'm fine
with this, but it is a choice we need to make explicitly.

I was thinking that "unwind" would be repurposed as you say. But
after responding to your points below, I think that could be
misleading. It's probably cleaner to adhere to the rule that unwinding
can only resume in a landing pad.

Another good point mentioned later in the thread is that a readonly
callee should *not* imply a readonly @guard_on or other "deopt" call
site. The simple solution for this is to disallow "deopt" of readonly
calls.

Assuming we're talking about the same thing, the concern was more
along the lines of: you cannot do IPA on a callsite that calls
something that contains a potential deoptimization point, be it
a @guard_on call or a "normal" deoptimization safepoint. This is
because if you have:

void @something()

void callee() {
 call @something() [ "deopt"(state0) ]
 *ptr = 100;
}

void caller() {
 call @callee()  // whether this is a "deopt" call site or
                 // a normal call site doesn't matter
 int k = *ptr;
}

you cannot, say, fold `k` to `100` in @caller() because the deopt
state, state0, if resumed to, may actually write something else to
*ptr.

Making the call to @something() read/write/may unwind does not solve
the problem -- even if the call to @something() wrote to *ptr (or
threw an exception), we could still fold k to 100. The novel control
flow here is that `@callee`, if deoptimized with `state0`, will
execute some arbitrary code, and **return** to @caller at the callsite
to @callee.

(Btw: things are a little different inside our JVM because of the way
we register dependencies, but I'm trying to make an LLVM-centric
argument here.)

At this point, I think it is most straightforward to do a slight
variant of what Philip suggested to me offline: if a function can
deoptimize, it needs to be marked as "mayBeOverridden", preferably by
introducing a new function attribute or by setting the linkage type to
one of those checked by GlobalValue::mayBeOverridden. This is the
responsibility of the frontend, and resuming to the "deopt"
continuation in a physical frame where the enclosing function was not
marked mayBeOverridden is UB.

I understand your problem. I was ignoring an aspect of analyzing the
unwind path and the fact that LLVM could make assumptions about where
control will resume.

mayBeOverridden makes sense. I would think that the symbol at the deopt
call site needs to be marked mayBeOverridden.

-Andy

Hi Andy,

Responses inline:

By memory barrier, do you mean things like fences?

That’s right. Conservatively, I would not hoist at the LLVM level past
opaque non-readonly calls or fences.

Just out of curiosity: why do you care about memory fences specifically?

I've begun to think that may-unwind (or may fail guard) and readonly
should be mutually exclusive. readonly refers to all system memory,
not just LLVM-visible memory. To achieve the effect of
"aliases-with-nothing-in-llvm" it's cleaner to use alias analysis.

(I've put this on the bug as well:)

https://llvm.org/bugs/show_bug.cgi?id=18912#c3

(In reply to comment #2)
> The test case in this bug report is fixed by
> http://reviews.llvm.org/rL256728.
>
> I'm closing this because I don't have a test case and I no longer think it
> makes much sense to mark may-unwind calls "readonly". An unwind path always
> touches some memory. The fact that it doesn't alias with LLVM-visible

I'm not sure about this -- why does an unwind path *have* to write
some memory? Can't you implement unwind as "read the appropriate
exception handler RPC from a table, and return to that RPC"?

memory access can be handled by alias analysis.

Btw, I think in this interpretation it is incorrect for -functionattrs
to infer readnone for @foo below (which it does today):

define void @foo({ i8*, i32 } %val) personality i8 8 {
  resume { i8*, i32 } %val
}

I meant that readonly can always be a property of the call site, as
opposed to an intrinsic property, for more frontend control.

Yes, I agree that if we decide that readonly semantics are fine, we
can always explicitly mark these as readonly, and not have to worry
about specifying the memory effects as part of the intrinsic's
definition. Part of the discussion here has devolved into whether
"readonly" is correct in the first place.

I was thinking that "unwind" would be repurposed as you say. But
after responding to your points below, I think that could be
misleading. It's probably cleaner to adhere to the rule that unwinding
can only resume in a landing pad.

Ok.

I understand your problem. I was ignoring an aspect of analyzing the
unwind path and the fact that LLVM could make assumptions about where
control will resume.

mayBeOverridden makes sense. I would think that the symbol at the deopt
call site needs to be marked mayBeOverridden.

Ok. I'll probably split off an "introduce an `interposable` function
attribute" change into an independent review (I think re-using a
linkage type for this would be somewhat of a hack).

Thanks!
-- Sanjoy

I think you're jumping ahead a bit here. I'm not sure the semantics are
anywhere near as weird as you're framing them to be. :)

I now think this weirdness actually does not have anything to do with
guard_on or bail_to_interpreter, but with deopt bundles
themselves. Our notion of "deopt bundles are readonly" is broken to
begin with, and that is what's manifesting as the complication we're
seeing here.

Consider something like

declare @foo() readonly
def @bar() { call @foo() [ "deopt"(XXX) ] }
def @baz() { call @bar() [ "deopt"(YYY) ] }

Right now, according to the semantics of "deopt" operand bundles as
specified in the LangRef, every call site above is readonly. However,
it is possible for @baz() to write to memory if @bar is deoptimized at
its call to @foo().

You could say that it isn't legal to mark @foo as readonly, since the
action of deoptimizing one's caller is not a readonly operation. But
that doesn't work in cases like this:

global *ptr
declare @foo() readwrite
def @bar() { call @foo() [ "deopt"(XXX) ]; *ptr = 42 }
def @baz() { call @bar() [ "deopt"(YYY) ]; int v0 = *ptr }

Naively, it looks like an inter-proc CSE can forward 42 to v0, but
that's unsound, since @bar could get deoptimized at the call to
@foo(), and then who knows what'll get written to *ptr.

Ok, I think this example does a good job of getting at the root issue. You claim this is not legal, I claim it is. :) Specifically, because the use of the inferred information will never be executed in @baz. (See below.)

Specifically, I think the problem here is that we're mixing a couple of notions. First, we've got the state required for the deoptimization to occur (i.e. deopt information). Second, we've got the actual deoptimization mechanism. Third, we've got the *policy* under which deoptimization occurs.

The distinction between the latter two is subtle and important. The *mechanism* of exiting the callee and replacing it with an arbitrary alternate implementation could absolutely break the deopt semantics, as you've pointed out. The policy we actually use does not. Specifically, we've got the following restrictions:
1) We only replace callees with more general versions of themselves. Given we might be invalidating a speculative assumption, this could be a *much* more general version, one that includes actions and control flow that invalidate any attribute inference done over the callee.
2) We invalidate all callers of @foo which could have observed the incorrect inference. (This is required to preserve correctness.)

I think we probably need to separate out something to represent the interposition/replacement semantics implied by invalidation deoptimization. In its most generic form, this would model the full generality of the mechanism and thus prevent nearly all inference. We could then clearly express our *policy* as a restriction over that full generality.

Another interesting case to consider:

global *ptr
declare @foo() readwrite
def @bar() { call @foo() [ "deopt"(XXX) ]; *ptr = 42 }
def @baz() {
   v0 = 42;
   while (C) {
     call @bar() [ "deopt"(v0) ];
     int v0 = *ptr
   }
}

Could we end up deoptimizing with an incorrect deopt value for v0 based on circular logic? We can infer that v0 is always 42 in this example. I claim that's legal precisely up to the point at which we deoptimize @bar and @baz together. If we deoptimized @bar, let @baz run another loop iteration, and then invalidated @baz, that would be incorrect.

My interpretation here is that we're not modeling the deopt
continuations correctly. Above, the XXX continuation is a delimited
continuation that terminates at the boundary of @bar, and seen from
its caller, the memory effect (and any other effect) of @bar has to
take into account that the "remainder" of @bar() after @foo has
returned is either what it can see in the IR, or the XXX continuation
(which it *could* analyze in theory, but in practice is unlikely
to).

This is kind of a bummer since what I said above directly contradicts
the "As long as the behavior of an operand bundle is describable
within these restrictions, LLVM does not need to have special
knowledge of the operand bundle to not miscompile programs containing
it." bit in the LangRef. :frowning:

Per above, I think we're fine for invalidation deoptimization.

For side exits, the runtime function called can never be marked readonly (or given just about any other restricted semantics) precisely because it can execute an arbitrary continuation. In principle, we could do bytecode inference to establish restricted semantics per call site.

Philip