Question about implementing exceptions, especially to the VMKit team

Hi all, I’m working on implementing exceptions in Pyston, and I was hoping to get some guidance from the list. I think I’ve learned enough about C++-style DWARF exceptions to convince myself I have a workable approach, but then I found this VMKit paper (2010), which says:

The exception manager: To manage exceptions, J3 reserves a
word for the pending exception in the local storage of each thread.
After each method invocation, the word is tested. If an exception
has been raised and the function is able to trap the exception, a
branch to the exception handler is executed. Otherwise, the function
returns to the caller. This implementation is not optimal because it
requires a test and branch after each method invocation. A more
efficient implementation would use exception tables. However, at
time of writing the paper, the exception tables generated by LLVM
rely on the GCC runtime library [19] which is not optimized for
dynamically generated tables, at least on Linux.
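
To make the quoted scheme concrete, here is a minimal sketch of check-after-call propagation. It is not J3's code: `Exception`, `pending_exception`, and the three functions are hypothetical names chosen for illustration, and an integer return value stands in for a real method result.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; this is not J3's actual implementation.
struct Exception { std::string type; };

// One word of thread-local storage for the pending exception.
thread_local const Exception* pending_exception = nullptr;

static const Exception kZeroDivision{"ZeroDivisionError"};

// A callee "throws" by setting the pending word and returning a dummy value.
int checked_div(int a, int b) {
    if (b == 0) {
        pending_exception = &kZeroDivision;
        return 0;
    }
    return a / b;
}

// A caller that cannot trap the exception tests the word after the call
// and returns at once, which propagates the exception to its own caller.
int div_plus_one(int a, int b) {
    int q = checked_div(a, b);
    if (pending_exception) return 0;  // the per-call test and branch
    return q + 1;
}

// A caller that can trap the exception branches to its handler instead.
int safe_div_plus_one(int a, int b, int fallback) {
    int r = div_plus_one(a, b);
    if (pending_exception) {
        assert(pending_exception->type == "ZeroDivisionError");
        pending_exception = nullptr;  // handled, so clear the word
        return fallback;
    }
    return r;
}
```

Every call site pays for the test even when nothing is thrown, which is the cost the paper describes as "not optimal".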

So now I’m reexamining whether C++-style exceptions are a good choice for Python, which I would guess throws more exceptions than Java.

Does anyone have advice on the subject? I’m wary of the costs of having to branch after every function call, but maybe that ends up not being too much of a performance drain if the CPU can predict them correctly? But then again, it looks like VMKit has moved to a setjmp-longjmp approach, which makes me think that there must have been some overhead that they wanted to avoid.

I’m also curious about why libgcc is slow on dynamically generated tables, and if that’s still true with MCJIT (I assume the VMKit change was made while on the old JIT).

thanks,
kmod

Hi Kevin,

I have begun a new implementation of exceptions in VMKit using
DWARF exception tables. Unfortunately, I have not finished yet, so I
cannot answer :) I'm pretty sure that, today, DWARF is a better
solution than our test after each method invocation. But I don't have
any evidence for that...

Anyway, using the DWARF exception tables with VMKit is not so
complicated, so you can also try. If you need help, just tell me.
I have not yet tested them exhaustively (though they do work),
mainly because I'm rewriting all of VMKit to use MCJIT!

Gaël

I have some data points on this.

  • The style of exception handling employed by VMKit has in the past been used as an optimization. For example, OpenVM switched to it after using C++ exceptions for a long time, because it was a net performance win. Java-like languages tend to throw more exceptions than the C++ runtime is tuned for, and the cost of unwinding in C++ is too great. Paying one branch after every call therefore ends up cheaper, for many benchmarks, than using C++ exceptions, because all benchmarks do some throwing (usually at least internally in the standard library). Such high exception volume may be a Javaism.

  • WebKit uses LLVM and does exceptions. The implementation is incomplete (i.e. we won’t use the LLVM JIT for some exception code paths), but the basic idea is there: 1) Throwing an exception through a function that doesn’t catch it only requires unwind metadata. That’s easy with compact_unwind. 2) Catching can be done either by deoptimization (deoptimizing the entire catching function) or by ensuring that state live in a catch is spilled. It’s best to do both: use deoptimization if profiling tells you that the catch doesn’t execute, and switch to spilling if it does. The deopt approach is very fast if no throwing happens, and the spilling approach lets you implement very efficient unwinding when throwing happens in high volume.

I’m not sure this can completely answer your question. Exception implementation is a tricky subject with many strategies, and it’s an area where you could easily come up with fresh ideas. :)

-Filip

We have a local implementation which supports both check-after-call and unwind tables. I don’t have good performance numbers across a wide suite of benchmarks (yet), but our early numbers showed C++-style exceptions to be a significant win on the normal path. We haven’t measured the exception path yet.

An interesting point is that (at least in theory) you don’t need to pick one strategy. You could use C++-style exceptions for “cold throws” and check-after-call for “warm throws”. This is quite a bit of infrastructure to build, mind you. :)

We’ve also considered the idea of speculating that functions don’t throw and de-optimizing if it turns out they do. We haven’t implemented this yet, but on the surface it sounds like a pretty good idea. Note that you’d need a recompilation mechanism to support the best use of this.

As for why libgcc is slow on dynamically generated tables, I don’t know the answer. I’m not familiar with the mechanism the old JIT provided in this area.

Philip

Hi Kevin,

To elaborate on Philip's point, depending on the state Pyston's
runtime already is in, you may have the choice of using a hybrid of a
"pending exception" word in your runtime thread structure, and an
implicit alternate ("exceptional") return address for calls into
functions that may throw. This lets you elide the check on the
pending exception word after calls by turning them into invokes that
unwind into a landingpad containing a generic exception handler. This
generic exception handler then checks the type of the pending
exception word and handles the exception (which may involve rethrowing
to the caller if the current frame doesn't have catch handler).

Instead of relying on libgcc to unwind when you throw you can then
parse the [call PC, generic exception handling PC] pairs from the
.eh_frame section, and when throwing to your caller, look up the
generic exception handling PC (using the call PC pushed on the stack)
and "return" to that instead. Rethrow is similar.
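
The lookup half of this scheme can be sketched as a sorted table searched by call PC. This is only an illustration under assumptions: `CallSiteEntry` and `HandlerTable` are hypothetical names, the entries are supplied directly rather than parsed from .eh_frame, and the actual "return" to the handler PC (which needs platform-specific stack manipulation) is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record: one entry per call site that has a landing pad.
struct CallSiteEntry {
    std::uintptr_t call_pc;     // return address pushed by the call
    std::uintptr_t handler_pc;  // generic exception handler for that call
};

// Table of [call PC, handler PC] pairs, kept sorted by call PC.
// A real runtime would fill it by parsing unwind metadata.
class HandlerTable {
public:
    explicit HandlerTable(std::vector<CallSiteEntry> entries)
        : entries_(std::move(entries)) {
        std::sort(entries_.begin(), entries_.end(),
                  [](const CallSiteEntry& a, const CallSiteEntry& b) {
                      return a.call_pc < b.call_pc;
                  });
        // Each call PC should appear at most once.
        assert(std::adjacent_find(entries_.begin(), entries_.end(),
                                  [](const CallSiteEntry& a, const CallSiteEntry& b) {
                                      return a.call_pc == b.call_pc;
                                  }) == entries_.end());
    }

    // Returns the handler PC for an exact call PC, or 0 if there is none.
    std::uintptr_t lookup(std::uintptr_t call_pc) const {
        auto it = std::lower_bound(
            entries_.begin(), entries_.end(), call_pc,
            [](const CallSiteEntry& e, std::uintptr_t pc) { return e.call_pc < pc; });
        if (it != entries_.end() && it->call_pc == call_pc) return it->handler_pc;
        return 0;
    }

private:
    std::vector<CallSiteEntry> entries_;
};
```

A throw would read the return address from the current frame, call `lookup`, and transfer control to the result; a result of 0 here stands for "no handler recorded", which the scheme as described should not hit for calls compiled as invokes.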

This scheme has the disadvantage of "returning" through every active
frame on an exception throw, even if a particular frame never had an
exception handler and could've been skipped safely. However, this
scheme allows you to easily switch to one of two other implementations
based on profiling data on a per-callsite basis:

1. high exception volume -- if an invoke has seen too many exception
    throws, recompile by replacing the invoke with a call followed by
    a test of "pending exception" and branch. The logic to generate
    the branch target should largely be the same as logic to generate
    the landing pad block.

2. low exception volume -- keep the invoke, but put a deoptimization
    trap in the landing pad block.
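
The per-callsite decision above could be driven by a small profile record. The following is a sketch only: the enum, the struct, and especially the thresholds (`min_calls`, `hot_throw_ratio`) are assumptions made up for illustration, not values from any real runtime.

```cpp
#include <cassert>
#include <cstdint>

// All names and thresholds here are hypothetical.
enum class CallStrategy {
    InvokeWithDeoptTrap,  // low volume: landing pad only triggers deoptimization
    InvokeWithHandler,    // default: landing pad runs the generic handler
    CallThenCheck         // high volume: plain call, then test the pending word
};

struct CallSiteProfile {
    std::uint64_t calls;
    std::uint64_t throws;
};

// Pick the strategy to use the next time this call site is recompiled.
CallStrategy choose_strategy(const CallSiteProfile& p,
                             std::uint64_t min_calls = 1000,
                             double hot_throw_ratio = 0.01) {
    assert(p.throws <= p.calls);
    // Too little data: stay with the default invoke plus generic handler.
    if (p.calls < min_calls) return CallStrategy::InvokeWithHandler;
    // Never seen a throw: speculate that it does not happen.
    if (p.throws == 0) return CallStrategy::InvokeWithDeoptTrap;
    double ratio = static_cast<double>(p.throws) / static_cast<double>(p.calls);
    return ratio >= hot_throw_ratio ? CallStrategy::CallThenCheck
                                    : CallStrategy::InvokeWithHandler;
}
```

The point of the sketch is that all three variants share the same handler logic, so switching is a recompilation decision local to one call site.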

We did some rough benchmarking, and using such implicit exceptions
(i.e. not explicitly checking the pending exception word) reduces
non-throwing call overhead by 20-25%. I don't have any numbers on how
it affects the performance of exceptional control flow though.

-- Sanjoy

That’s definitely good confirmation to hear that the test+branch for every call does in fact add noticeable overhead – thanks for the datapoints.

What I’m taking away from this is that even within the space of “unwind-based exception handling using DWARF CFI side-tables”, there is a fair amount of room for different approaches with different tradeoffs, and also potentially room for a custom-tailored unwinder to beat libgcc. That’s definitely good to know, and you guys have encouraged me to peel back the magic one more layer and try to implement my own unwinder :)

As for switching between unwind-based exceptions and checked-status-code exceptions, I’m not quite sure I buy that that can completely be done by the catching function, since the throwing function also needs to use the matching mechanism. I think if you truly want to do this, you need to compile separate variants of whatever functions you might call (including whatever functions they might call), one for each exception mechanism you want to use. I’m thinking about doing this, but only for certain built-in functions that are expected to throw a lot. Another option I’m thinking of is to inline those particular functions and then create an optimization pass that will know that py_throw always throws, and stitch up the CFG appropriately. Anyway, lots to chew on, thanks everyone for the responses!

Aside about Python exceptions: Python has interesting for loops, which are always for-each loops and implement the termination condition using exceptions:

PyObject* iterator; // what we’re iterating over
while (true) {
    PyObject* i;
    try {
        i = iterator.next();
    } except (StopIteration) {
        break;
    }
    // do stuff
}

Percentage-wise, throwing the StopIteration might be rare, but I would wager that most loops get terminated this way (as opposed to a “break” statement) so it’s certainly not never; I think this means the exception gets thrown enough that it’s better to handle the exception in-line rather than do a deopt-on-throw. Microbenchmarks suggest that for-loop overhead is important enough that it’s further worth trying to avoid any exception-related unwinding entirely, but I’m not sure how true that is for larger programs (probably somewhat true).
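
One way to avoid unwinding for loop termination is to let the iterator report exhaustion through its return value. The sketch below illustrates that idea only; `IntIterator` and `sum_all` are hypothetical, it iterates over plain ints instead of Python objects, and it is not how Pyston implements for loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical iterator: next() returns false when exhausted instead of
// throwing a StopIteration-like exception.
class IntIterator {
public:
    explicit IntIterator(const std::vector<int>& items) : items_(items) {}

    bool next(int& out) {
        assert(pos_ <= items_.size());
        if (pos_ == items_.size()) return false;  // plays the role of StopIteration
        out = items_[pos_++];
        return true;
    }

private:
    const std::vector<int>& items_;
    std::size_t pos_ = 0;
};

// The for-each loop becomes a plain test and branch; no unwinding happens.
int sum_all(const std::vector<int>& items) {
    IntIterator it(items);
    int total = 0;
    int i = 0;
    while (it.next(i)) {
        total += i;  // "do stuff"
    }
    return total;
}
```

User-defined iterators that really do raise StopIteration would still need the exception path; this only covers iterators the runtime controls.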

Fair warning, I have absolutely no idea if our current implementation is actually a good idea or not. We need to get back to that and actually benchmark the various options. :) We’ve been experimenting wildly, but without much rigour. We’ve been mainly focused on identifying the possible options within LLVM.

I think what we do at the moment is always set the ‘pending exception’ flag, even if we’re going to use the unwind-table-based dispatching. As a result, any frame can decide to use either mechanism. I’ll point out though that this is purely an accident of implementation. We didn’t purposely design it this way. :)

I’ll just mention that you really really want to translate throw/catch pairs in the same function into a direct jump where possible. :) In fact, LLVM should be doing this for you during inlining if you structure your IR properly. Are you not seeing this in practice?

For the for-loop case in particular, you probably want to avoid throwing exceptions at all. If you inline the next() function to expose the throw, you should be able to convert the “throw; catch;” into a branch to the exit block. This will really really help your performance as compared to just about any other option.

Philip

Just a quick follow-up: I ended up using libgcc and C++ exception handling for the initial implementation. I managed to do dynamic eh_frame registration with libunwind, and got to the point that I was starting to implement a custom exception manager. Not insurmountable, but then I discovered that LLVM already has libgcc dynamic registration support (via __register_frame), so a number of problems were sidestepped by just directly using libgcc and C++ exception handling functions (e.g. __cxa_throw).

I’m sure it’s not the most performant implementation: in order to implement Python exception semantics on top of the C++ exception handler, every Python-level exception handler is implemented as a “catch all” at the C++ level, which then executes the Python matching semantics, and potentially allocates and throws a new C++-level exception to continue unwinding. The longer term plan is to implement a custom exception manager on libunwind, but for now the current libgcc setup seems good enough. There’s still the goal to be able to catch Python-thrown exceptions in C++, which is easy when Python throws C++ exceptions, but it might require some trickery to avoid using existing C++ exception-handling runtime functions (e.g. __cxa_begin_catch).
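
As a rough illustration of the "catch all, then match" pattern, here is a sketch in ordinary C++. The names are hypothetical (`PyExc`, `py_matches`, `inner`, `outer`), string comparison stands in for Python's real matching rules, and hand-written functions stand in for JIT-compiled code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a Python exception carried by a C++ throw.
struct PyExc {
    std::string type_name;
};

// Stand-in for Python's matching semantics (really a subclass-aware check).
bool py_matches(const PyExc& exc, const std::string& handler_type) {
    return exc.type_name == handler_type;
}

void lookup_missing_key() { throw PyExc{"KeyError"}; }

// Models:  try: lookup_missing_key()  except ValueError: return -1
int inner() {
    try {
        lookup_missing_key();
        return 0;
    } catch (PyExc& exc) {  // catches every Python-level exception
        if (py_matches(exc, "ValueError")) return -1;
        // No match: throw a new C++ exception to continue unwinding.
        throw PyExc{exc.type_name};
    }
}

// Models:  try: inner()  except KeyError: return -2
int outer() {
    try {
        return inner();
    } catch (PyExc& exc) {
        if (py_matches(exc, "KeyError")) return -2;
        throw PyExc{exc.type_name};
    }
}
```

Each non-matching handler costs a full catch and a fresh throw, which is one reason this layering is unlikely to be the fastest option.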

Also, I think I figured out why dynamically-registered frames are slower to search than static ones, at least in libunwind – the static information ends up getting put into a binary search table in .eh_frame_hdr, but dynamically registered frames go into a linked list of 1-element binary search trees. (I haven’t read through libgcc since the plan is to move away from it eventually, but I assume it does something similar.) I think this is solvable in libunwind by periodically compacting the linked list and combining the search trees, which looks like it can be done completely through the libunwind API.
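
The compaction idea can be sketched independently of any unwinder. The code below does not use the libunwind API; `FrameRange`, `compact`, and `find_frame` are hypothetical, and it assumes the registered PC ranges do not overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <list>
#include <vector>

// Hypothetical record for one dynamically registered function: the PC
// range [start, end) plus an opaque pointer to its unwind info.
struct FrameRange {
    std::uintptr_t start;
    std::uintptr_t end;
    const void* unwind_info;
};

// Merge a linked list of single-entry registrations into one sorted array.
std::vector<FrameRange> compact(const std::list<FrameRange>& registered) {
    std::vector<FrameRange> table(registered.begin(), registered.end());
    std::sort(table.begin(), table.end(),
              [](const FrameRange& a, const FrameRange& b) { return a.start < b.start; });
    return table;
}

// Binary search for the range containing pc; nullptr if no range does.
const FrameRange* find_frame(const std::vector<FrameRange>& table, std::uintptr_t pc) {
    // Find the first entry starting after pc; the candidate is the one before it.
    auto it = std::upper_bound(
        table.begin(), table.end(), pc,
        [](std::uintptr_t value, const FrameRange& r) { return value < r.start; });
    if (it == table.begin()) return nullptr;
    --it;
    return pc < it->end ? &*it : nullptr;
}
```

Lookup over the compacted table is logarithmic in the number of registered functions, whereas walking the uncompacted list is linear.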

kmod