Managed Languages BOF @ Dev Meeting

Sanjoy, Joseph, and I will be hosting a BoF on using LLVM to build compilers for managed languages. We’re curious to hear who’s planning to attend and what topics you’d like to hear about. Depending on the expected audience, we’re happy to do anything from a rough “what to expect getting started” overview to a down-in-the-weeds working session on relevant optimization or ABI topics. If you’re planning to attend, please let us know, and if there are particular topics you’d like to hear about or discuss, please reply to this thread.

The major areas we’ve thought of include:

  • High-level architectures for successful projects: debuggable architectures, working with the community, etc.
  • Dealing with ABI requirements like calling conventions, exceptions, patchable calls, inline caches, frame inspection, and deoptimization
  • Lowering/compilation strategies for high level languages
  • Differences in canonical IR from traditional C code and resulting optimization challenges

Philip

I'm planning on attending. You've mentioned some of these, but
specific topics of interest for me include:
- Dealing with the explosion of basic blocks that come up with
languages where almost every function call, implicit and explicit, can
raise exceptions.
- Proper ABI implementation without reimplementing Clang's
TargetInfo.cpp or embedding Clang.
- The issue of knowing what the canonical IR is to begin with and then
how to deal with the differences.
- Preserving correct semantics of language specific constructs without
hindering optimizations.

-- Joe Ranieri

I’m planning to attend. I’m generally interested in hearing about use cases for LLVM’s JIT infrastructure, and looking for opportunities to improve our support for them.

Cheers,
Lang.

I won’t be able to attend, but I’d be interested in hearing if any conclusions are reached on several of these topics:

> - Dealing with the explosion of basic blocks that come up with
> languages where almost every function call, implicit and explicit, can
> raise exceptions.

This also relates to non-call exceptions. There was a proposal a few years ago to adopt a model a bit closer to WHIRL’s exception regions, where the unwind destination would become a property of the basic block, not the call site (this also maps fairly closely to the underlying implementation, which deals with PC ranges). You would need to split blocks in places to ensure that every value used in the unwind target is live on entry to any block that can unwind to it, but that should still reduce the basic block explosion.
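To make the contrast concrete, here is a rough LLVM-IR-flavoured sketch; the per-block “unwinds to” annotation is the hypothetical extension, not current IR, and landingpad/personality details are elided:

```llvm
; Today: every potentially-throwing call must be an invoke, and an
; invoke must terminate its basic block, so N calls cost at least N blocks.
entry:
  invoke void @a() to label %bb1 unwind label %lpad
bb1:
  invoke void @b() to label %bb2 unwind label %lpad
bb2:
  ret void

; Proposed: the unwind destination becomes a property of the block, so
; the same calls can be ordinary instructions within a single block.
entry:                       ; hypothetical: unwinds to %lpad
  call void @a()
  call void @b()
  ret void
```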

> - Proper ABI implementation without reimplementing Clang's
> TargetInfo.cpp or embedding Clang.

Although the implementation is a bit hairy, GCC’s JIT interfaces are a lot cleaner in this respect, as they deal with C types.

I’m not opposed to having to embed a chunk of clang (clang itself isn’t that huge, and in a shared-library build the size is easily amortised), but it would be very nice to have some of this exposed via stable interfaces. I believe that Apple folks have done some work in this direction for Swift - perhaps they could be persuaded (either as part of, or separately from, the open sourcing of Swift) to put some work into a stable library interface for external consumers?

> - The issue of knowing what the canonical IR is to begin with and then
> how to deal with the differences.

This is largely a documentation issue. Various passes all have different notions of canonical forms for various constructs in their input and output, and these are all undocumented.

David

Hi,

I’m planning to attend. I’m interested in hearing about efforts to add or enhance middle-end and back-end optimizations in LLVM. For example, will there be something like Polly but more integrated into LLVM and not limited to SCoPs?

Best,
Dounia

I saw this come up in a recent discussion about LLILC. One simple way to do this is to introduce a new intrinsic, like “llvm.nullcheck(x)” which symbolically represents the null check and bail out in one instruction - instead of the fully elaborated basic block representation (which slows everything down by increasing the height of the CFG). These null checks can be optimized away through standard techniques, and any remaining ones expanded out to LLVM IR before codegen, or as part of lowering to MachineInstrs.
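A sketch of the two shapes, in LLVM-IR-flavoured pseudocode (the intrinsic name and signature here are hypothetical, following Chris’s suggestion):

```llvm
; Fully elaborated form: each check adds a branch and two blocks.
  %isnull = icmp eq i32* %p, null
  br i1 %isnull, label %throw, label %ok
throw:
  call void @throw_npe()
  unreachable
ok:
  %v = load i32, i32* %p

; Symbolic form: the check and bail-out collapse into one instruction,
; keeping the CFG flat until late expansion.
  call void @llvm.nullcheck(i32* %p)    ; hypothetical intrinsic
  %v = load i32, i32* %p
```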

-Chris

FYI, there are details about this work at the last devmtg:
http://llvm.org/devmtg/2014-10/#talk18

-Chris

> This also relates to non-call exceptions. There was a proposal a

> few years ago to adopt a model a bit closer to WHIRL’s exception
> regions, where the unwind destination would become a property of the
> basic block, not the call site (this also maps fairly closely to the
> underlying implementation, which deals with PC-ranges). You would
> need to split blocks in places to ensure that every used value in the
> unwind target is live on entry to a block that can unwind to it, but
> that should still reduce the basic block explosion.

Supporting only basic block level granularity for "try ranges" may not
be sufficient for Java -- if a basic block has more than one null check
in it then throwing the NullPtrException for the first null check (if
it fails) is semantically different from throwing the NullPtrException
for the second null check (the constructed exceptions will have
different stack traces attached to them, for instance [1]). So we'd
have to repeat the null checks in the unwind block, like

   superblock: # unwinds to unwind_block
     null_check(ptr_a)
     do_something
     null_check(ptr_b)
     do_something_again

   unwind_block:
     ;; either ptr_a is null or ptr_b is null
     if (ptr_a == null)
       throw_nullptrexception(bci = 42)
     else ;; if (ptr_b == null)
       throw_nullptrexception(bci = 43)

So the code explosion problem still exists (in unwind_block), and
we've duplicated a bunch of code.

Having said that, I'd love to take a look at the proposal that was
made -- will it be easy for you to dig up a link?

-- Sanjoy

[1]: there is a "hot exception" optimization that some JVMs can do
   that may allow us to get around this
   (http://jawspeak.com/2010/05/26/hotspot-caused-exceptions-to-lose-their-stack-traces-in-production-and-the-fix/),
   but that's an _optimization_ that Java programmers should be able to
   turn off.


> I saw this come up in a recent discussion about LLILC. One simple

> way to do this is to introduce a new intrinsic, like
> “llvm.nullcheck(x)” which symbolically represents the null check and
> bail out in one instruction - instead of the fully elaborated basic
> block representation (which slows everything down by increasing the
> height of the CFG).

Is this compile time slowdown just a generic effect of the fact that a
lot of things in LLVM have O(f(# BBs)) complexity? Or are there
specific things (like domtree construction) that suffer due to lots of
basic blocks?

> These null checks can be optimized away through
> standard techniques, and any remaining ones expanded out to LLVM IR
> before codegen, or as part of lowering to MachineInstrs.

That won't be free, since keeping the control flow explicit does give
us better optimization in cases where we can sink work into the cold
throwing path, like these:

  *ptr = 42
  if (a == null) { throw(); unreachable; }
  *ptr = 43

==>

  if (a == null) { *ptr = 42; throw(); unreachable; }
  *ptr = 43

And

  val = *ptr;
  if (a == null) { throw(some_arg = val); unreachable; }

==>

  if (a == null) { val = *ptr; throw(some_arg = val); unreachable; }

>> I saw this come up in a recent discussion about LLILC. One simple
>> way to do this is to introduce a new intrinsic, like
>> “llvm.nullcheck(x)” which symbolically represents the null check and
>> bail out in one instruction - instead of the fully elaborated basic
>> block representation (which slows everything down by increasing the
>> height of the CFG).

> Is this compile time slowdown just a generic effect of the fact that a
> lot of things in LLVM have O(f(# BBs)) complexity?

Yes.

> Or are there
> specific things (like domtree construction) that suffer due to lots of
> basic blocks?

Yes. :)

Some things work in time proportional to the depth of the CFG, and each null check adds a level.

>> These null checks can be optimized away through
>> standard techniques, and any remaining ones expanded out to LLVM IR
>> before codegen, or as part of lowering to MachineInstrs.

> That won't be free, since keeping the control flow explicit does give
> us better optimization in cases where we can sink work into the cold
> throwing path, like these:

I don’t claim it to be optimal. That said, we do sinking at the machine level as well, so we will catch at least some of these cases.

-Chris

> Supporting only basic block level granularity for "try ranges" may not
> be sufficient for Java -- if a basic block has more than one null check
> in it then throwing the NullPtrException for the first null check (if
> it fails) is semantically different from throwing the NullPtrException
> for the second null check (the constructed exceptions will have
> different stack traces attached to them, for instance [1]).

[ Footnote inlined ]

> [1]: there is a "hot exception" optimization that some JVMs can do
> that may allow us to get around this
> (http://jawspeak.com/2010/05/26/hotspot-caused-exceptions-to-lose-their-stack-traces-in-production-and-the-fix/),
> but that's an _optimization_ that Java programmers should be able to
> turn off.

There are some quite exciting security vulnerabilities associated with this optimisation (it turns out that running a few instructions after a security-critical check has failed is not always a good thing to do) so I agree that it’s a good idea to avoid depending on it.

> So we'd
> have to repeat the null checks in the unwind block, like
>
>    superblock: # unwinds to unwind_block
>      null_check(ptr_a)
>      do_something
>      null_check(ptr_b)
>      do_something_again
>
>    unwind_block:
>      ;; either ptr_a is null or ptr_b is null
>      if (ptr_a == null)
>        throw_nullptrexception(bci = 42)
>      else ;; if (ptr_b == null)
>        throw_nullptrexception(bci = 43)
>
> So the code explosion problem still exists (in unwind_block), and
> we've duplicated a bunch of code.

Does it? I guess it depends on how you are implementing the exceptions. I was expecting that the non-call exceptions would be implemented on top of signals (or something SEH-like on Windows), so the trap handler would have some generic code to identify the faulting instruction, map this back to something useful, and then throw the exception. You wouldn’t be throwing the null pointer exception from the unwind target, that would be generated in the signal handler and the unwind target would only have to do the normal unwinding.

In a JIT environment, the unwind targets could probably also be lazily emitted, so in your initial IR you’d just have a single patchpoint for the unwind target. My knowledge of common Java idioms is slightly rusty, but my impression was that, while lots of Java code abuses exceptions for non-exceptional behaviour, this is quite rare for null-pointer exceptions. Is there much (any?) real-world code where the handling of null pointer exceptions is performance critical?

> Having said that, I'd love to take a look at the proposal that was
> made -- will it be easy for you to dig up a link?

It was this mailing list, but your ability to search it is probably as good as mine. My brain, unfortunately, does not contain a well indexed cross-reference of the archive. I believe that the general consensus was that something like this would be good to have, but not sufficiently important to anyone with time to spend actively hacking on stuff to do. Given that people are now seriously using LLVM for languages that are not so C-like, this might be different...

David

David Chisnall wrote:
>> So we'd
>> have to do repeat the null checks in the unwind block, like
>>
>> superblock: # unwinds to unwind_block
>> null_check(ptr_a)
>> do_something
>> null_check(ptr_b)
>> do_something_again
>>
>> unwind_block:
>> ;; either ptr_a is null or ptr_b is null
>> if (ptr_a == null)
>> throw_nullptrexception(bci = 42)
>> else ;; if (ptr_b == null)
>> throw_nullptrexception(bci = 43)
>>
>> So the code explosion problem still exists (in unwind_block), and
>> we've duplicated a bunch of code.
>

> Does it? I guess it depends on how you are implementing the
> exceptions. I was expecting that the non-call exceptions would be
> implemented on top of signals (or something SEH-like on Windows), so
> the trap handler would have some generic code to identify the faulting
> instruction, map this back to something useful, and then throw the
> exception. You wouldn’t be throwing the null pointer exception from
> the unwind target, that would be generated in the signal handler and
> the unwind target would only have to do the normal unwinding.

I see what you mean. You'd keep some bookkeeping information on each
"faulting_load" instruction that specifies how the exception being
thrown has to be constructed. So my example would look something like
this:

    superblock: # unwinds to unwind_block
      faulting_load(ptr_a) # exception_construction_args = (42)
      do_something
      faulting_load(ptr_b) # exception_construction_args = (43)
      do_something_again

Did I understand the scheme correctly?

The interesting bit (I'm sure you've thought of this, I'm still trying
to verify that I've understood the scheme correctly) here is that the
faulting_load instruction will not have the same reordering semantics
as a normal load. You cannot reorder two `faulting_load` instructions
as that would change which NPE you get. E.g. if you were supposed to
get an NPE on bci 42, when loading ptr_a, in the above example, the
optimizer cannot change that to getting an NPE on bci 43, when loading
ptr_b. It therefore cannot re-order faulting_load(ptr_a) and
faulting_load(ptr_b) even though they're control equivalent, and even if
aliasing permits.
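In the same pseudocode, the constraint looks like this (faulting_load and the bci annotations are the hypothetical constructs from above):

```llvm
; Control-equivalent and possibly non-aliasing, yet not reorderable:
; swapping the two loads changes which bci the resulting NPE reports.
%x = faulting_load(i32* %ptr_a)   ; on fault: NPE with bci = 42
%y = faulting_load(i32* %ptr_b)   ; on fault: NPE with bci = 43
```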

> In a JIT environment, the unwind targets could probably also be
> lazily emitted, so in your initial IR you’d just have a single
> patchpoint for the unwind target. My knowledge of common Java idioms
> is slightly rusty, but my impression was that, while lots of Java code
> abuses exceptions for non-exceptional behaviour, this is quite rare
> for null-pointer exceptions.

Yes, that is accurate. In fact, LLVM has an implementation of
page-fault based null check elimination [1] that we've been using for
a while. However, this is not done by a new "faulting_load"
instruction; but is done by late matching "test Reg, Reg; jz throw"
and trying to "fold" the null check into a "nearby memory operation".

There are two reasons why we chose this late matching approach:

  1. Keeping the control flow explicit lets most of the optimizer elide
     redundant null checks (without us teaching it about another new
     thing).

  2. It helps make the optimization profile-guided and optional -- if
     you don't do anything then you still have an explicit null check
     (the "conservative" case, in this scenario) to fall back on, and
     not an implicit null check.

     Since taking a page fault is so much more expensive than a fully
     predicted explicit null check branch, getting this wrong even in a
     handful of cases can cause a net regression (because an occasional
     "catch NullPointerException" can heavily skew the scales). So we
     have to rely on recompilation to avoid having a failing implicit
     null check in an application's steady state. As soon as we detect
     that an implicit null check has failed, we kick off a recompile of
     the same method with information that tells LLVM that that
     specific null check must not be made implicit this time.

The problem with this scheme is that it does not address the basic
block explosion issue that started this discussion.
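For reference, the IR side of this late-matching scheme looks roughly like the following. The !make.implicit metadata is the actual mechanism described in the FaultMaps documentation [1]; the function names are made up:

```llvm
declare void @throw_npe()

define i32 @get(i32* %o) {
entry:
  %isnull = icmp eq i32* %o, null
  ; The explicit branch stays visible to the mid-level optimizer; the
  ; ImplicitNullChecks machine pass may later fold it into a faulting
  ; load whose PC is recorded in the fault map.
  br i1 %isnull, label %throw, label %ok, !make.implicit !0
throw:
  call void @throw_npe()
  unreachable
ok:
  %v = load i32, i32* %o
  ret i32 %v
}

!0 = !{}
```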

> Is there much (any?) real-world code
> where the handling of null pointer exceptions is performance critical?

I'd expect "evolution" to have taken care of most such code. :)

> Given that people are now seriously using LLVM for
> languages that are not so C-like, this might be different...

See [1]. :)

-- Sanjoy

[1]: http://llvm.org/docs/FaultMaps.html

> David Chisnall wrote:
>> So we'd
>> have to do repeat the null checks in the unwind block, like
>>
>> superblock: # unwinds to unwind_block
>> null_check(ptr_a)
>> do_something
>> null_check(ptr_b)
>> do_something_again
>>
>> unwind_block:
>> ;; either ptr_a is null or ptr_b is null
>> if (ptr_a == null)
>> throw_nullptrexception(bci = 42)
>> else ;; if (ptr_b == null)
>> throw_nullptrexception(bci = 43)
>>
>> So the code explosion problem still exists (in unwind_block), and
>> we've duplicated a bunch of code.
>

> Does it? I guess it depends on how you are implementing the
> exceptions. I was expecting that the non-call exceptions would be
> implemented on top of signals (or something SEH-like on Windows), so
> the trap handler would have some generic code to identify the faulting
> instruction, map this back to something useful, and then throw the
> exception. You wouldn’t be throwing the null pointer exception from
> the unwind target, that would be generated in the signal handler and
> the unwind target would only have to do the normal unwinding.

> I see what you mean. You'd keep some bookkeeping information on each
> "faulting_load" instruction that specifies how the exception being
> thrown has to be constructed. So my example would look something like
> this:
>
>   superblock: # unwinds to unwind_block
>     faulting_load(ptr_a) # exception_construction_args = (42)
>     do_something
>     faulting_load(ptr_b) # exception_construction_args = (43)
>     do_something_again
>
> Did I understand the scheme correctly?

Yes, in much the same way that we already emit data into the LSDA. Whatever catches the trap (signal handler, SEH handler) would receive the PC of the faulting load, look up the relevant information, and then prepare the exception object.

> The interesting bit (I'm sure you've thought of this, I'm still trying
> to verify that I've understood the scheme correctly) here is that the
> faulting_load instruction will not have the same reordering semantics
> as a normal load. You cannot reorder two `faulting_load` instructions
> as that would change which NPE you get. E.g. if you were supposed to
> get an NPE on bci 42, when loading ptr_a, in the above example, the
> optimizer cannot change that to getting an NPE on bci 43, when loading
> ptr_b. It therefore cannot re-order faulting_load(ptr_a) and
> faulting_load(ptr_b) even though they're control equivalent, and even if
> aliasing permits.

Yes, that would probably be the trickiest part of this. Note that this is also somewhat language dependent. If the source language does not define ordering of memory operations, then reordering them would be fine. I’m definitely not sufficiently familiar with the Java spec to say exactly when this would or wouldn’t be permitted - I think Java is a lot stricter than C/C++ with respect to evaluation order. We’d probably need some metadata on each basic block, along with the unwind target, to specify what the requirements are. Alternatively, we could use the atomic load and store instructions with a new ordering that would guarantee ordering within a single thread but not actually guarantee atomicity.

> In a JIT environment, the unwind targets could probably also be
> lazily emitted, so in your initial IR you’d just have a single
> patchpoint for the unwind target. My knowledge of common Java idioms
> is slightly rusty, but my impression was that, while lots of Java code
> abuses exceptions for non-exceptional behaviour, this is quite rare
> for null-pointer exceptions.

> Yes, that is accurate. In fact, LLVM has an implementation of
> page-fault based null check elimination [1] that we've been using for
> a while. However, this is not done by a new "faulting_load"
> instruction; but is done by late matching "test Reg, Reg; jz throw"
> and trying to "fold" the null check into a "nearby memory operation".
>
> There are two reasons why we chose this late matching approach:
>
> 1. Keeping the control flow explicit lets most of the optimizer elide
>    redundant null checks (without us teaching it about another new
>    thing).
>
> 2. It helps make the optimization profile-guided and optional -- if
>    you don't do anything then you still have an explicit null check
>    (the "conservative" case, in this scenario) to fall back on, and
>    not an implicit null check.
>
>    Since taking a page fault is so much more expensive than a fully
>    predicted explicit null check branch, getting this wrong even in a
>    handful of cases can cause a net regression (because an occasional
>    "catch NullPointerException" can heavily skew the scales). So we
>    have to rely on recompilation to avoid having a failing implicit
>    null check in an application's steady state. As soon as we detect
>    that an implicit null check has failed, we kick off a recompile of
>    the same method with information that tells LLVM that that
>    specific null check must not be made implicit this time.
>
> The problem with this scheme is that it does not address the basic
> block explosion issue that started this discussion.

I can definitely see the attraction of keeping flow control explicit. Note that moving the unwind target to the basic block also simplifies flow control for explicit (call) exceptions though - it’s not limited to non-call exceptions. If you have a language like Java with large try blocks, containing a number of calls, all with the same jump target, then the IR will be a lot simpler for cases where the catch block doesn’t touch any variables modified in the try block (you’d still need to split the try block into basic blocks where it does).

The recompilation case would be similar in both cases, only this time you’d be replacing an implicit null check with an explicit one, rather than removing some metadata.

David

> My knowledge of common Java idioms
> is slightly rusty, but my impression was that, while lots of Java
> code abuses exceptions for non-exceptional behaviour, this is quite
> rare for null-pointer exceptions.

Actually such abuse is heavily frowned upon. They use it only where unavoidable - i.e. when calling into the JDK's implementations of file I/O and JDBC.
They certainly don't expect the exceptional path to be fast; I have yet to see code where the hot path goes through a catch clause.

> Is there much (any?) real-world
> code where the handling of null pointer exceptions is performance
> critical?

I doubt that NPEs are really used for that purpose. The catch block would never be sure that the NPE doesn't originate from a true null pointer dereference somewhere in the code.
If I must imagine a Java programmer who uses exceptions for control flow, they'd likely use some exception that they think cannot happen. If they are mildly competent, they'll define their own exception class just to be on the safe side.

Of course, there are shoddy programmers who do that kind of thing, and worse. I don't think LLVM needs to worry about the speed of that kind of code, though :)

Actually Java is relatively relaxed in general, but some language constructs impose strict guarantees.

See https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html for the programmer perspective, and the JSR-133 Cookbook (http://gee.cs.oswego.edu/dl/jmm/cookbook.html) for the implementer perspective.

> Whatever catches the trap (signal handler, SEH handler) would receive the PC of the faulting load, look up the relevant information, and then prepare the exception object.

This is exactly how the CLR's trap handler works (and likewise for SEH).

> I see what you mean. You'd keep some bookkeeping information on each
> "faulting_load" instruction that specifies how the exception being
> thrown has to be constructed. So my example would look something like
> this:
>
>   superblock: # unwinds to unwind_block
>     faulting_load(ptr_a) # exception_construction_args = (42)
>     do_something
>     faulting_load(ptr_b) # exception_construction_args = (43)
>     do_something_again

In fact, we'll need it to look something like this to be usable in LLILC, and I intend to extend null-check-folding to support generating something like that.

> You cannot reorder two `faulting_load` instructions as that would change which NPE you get.

MSIL specifically allows this reordering.

> I can definitely see the attraction of keeping flow control explicit

As can I :). The two approaches I have some familiarity with from other codebases are:
1) Have something like faulting_load throughout the IR; split try blocks at each exception, but don't generate throw blocks (or explicit null compares).
2) Don't break basic blocks at exceptions; annotate handlers on blocks, but take special care when reasoning about liveness/dependence across potentially-faulting instructions (e.g. using a special annotation for stores that are live-out along "early" exception edges but not necessarily out of their block's normal exit).

With #1, despite being concerned about it, we never did identify basic block count as a scaling issue, aside from generally needing to code flow walks with iteration over an explicit worklist instead of recursion. Most of our throughput-intensive analyses/optimizations were more sensitive to the SSA graph than the CFG (though of course the same may not be true for LLVM?). The very large PHIs at very large joins for very large try regions' handlers were a continual issue, and we generally had to avoid doing any super-linear PHI processing.

With #2, the special semantics for live-out stores were a nuisance, but I believe this has been the more common approach for MS compilers.

It's not clear to me how much IR is too much IR, how many blocks are too many blocks, or what the right tradeoff is here. I like the approach of getting things working with the model that is most elegant for analyses/optimizations and introduces the fewest special IR constructs, and then measuring to get a concrete handle on what it costs. I expect we'll be doing exactly that as part of the LLILC effort.

> cases where the catch block doesn’t touch any variables modified in the try block

That's a cool idea for trying to get most of the best of both worlds. I'd be interested to know how often try blocks are mergeable by this reasoning in practice.

Thanks
-Joseph

David Chisnall wrote:
> Yes, that would probably be the trickiest part of this. Note that
> this is also somewhat language dependent. If the source language does
> not define ordering of memory operations, then reordering them would
> be fine. I’m definitely not sufficiently familiar with the Java spec
> to say exactly when this would or wouldn’t be permitted - I think Java
> is a lot stricter than C/C++ with respect to evaluation order. We’d
> probably need some metadata on each basic block, along with the unwind
> target, to specify what the requirements are. Alternatively, we could
> use the atomic load and store instructions with a new ordering that
> would guarantee ordering within a single thread but not actually
> guarantee atomicity.

This is what the spec says: "The Java programming language guarantees
that the operands of operators appear to be evaluated in a specific
evaluation order, namely, from left to right."

Apart from re-ordering, we need to be careful about removing
faulting_load instructions too -- removing an unused load instruction
does not mean we can remove the implied null check.
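Concretely, with the hypothetical faulting_load from earlier in the thread:

```llvm
; The result is dead, but the implied null check is a side effect:
%dead = faulting_load(i32* %a)

; A pass deleting the dead load must leave the check behind, e.g.:
%isnull = icmp eq i32* %a, null
br i1 %isnull, label %throw, label %cont
```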

Joachim Durchholz wrote:
> Actually Java is relatively relaxed in general, but some language
> constructs impose strict guarantees.

> See https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
> for the programmer perspective, and
> http://gee.cs.oswego.edu/dl/jmm/cookbook.html for the implementer
> perspective.

The multi-threaded memory model does allow "re-ordering", but
semantically it is phrased in terms of which read sees which write,
not re-ordering memory operations.

-- Sanjoy