One way to support unwind on x86

Hi,

I want to support the unwind instruction on x86. Specifically I want to:

* Provide an efficient runtime implementation that does not
    depend on reading the DWARF EH information.
* It should be self hosted, meaning the runtime is static
    linked in. I want to use it kernel mode.
* Unwinding should be a read-only operation regarding the
    stack, so I can create a stack dump in the landing pad.
* Provide a pass that raises C++ exception handling to just
    unwind instructions and thread-local data.

My idea on handling the DWARF EH actions is to compile it to machine
instructions. Fx. given an Instruction Pointer, unwinding a call frame
might be described as:

    add esp, 12 ; Remove the current call frame.
    jmp unwind ; Continue the unwinding process.

Other call frames might be more complex to handle. It depends on the
moves needed to restore the registers of the previous call frame (the
caller) and to remove the current frame.

I want to tag all calls and invokes in a manner that can be easily
recognized by a runtime. I can tolerate a small overhead on calls. The
idea is to do something like this:

      call some_function
      jmp continue
      jmp case_of_unwinding
    continue:

The case-of-unwinding will either unwind the current call frame or
jump to the landing pad (if it was an invoke instruction). I think the
overhead will in fact be quite small because of branch prediction on
modern processors. The size overhead by tagging all calls, compared to
only tagging unwind areas and actions in DWARF, does not matter much
to me.

On x86 a call stores the next Instruction Pointer address on the
stack. The runtime would be something like this (assuming the first
jmp instruction after the call is always short):

    unwind:
      ; We are on a call frame boundary. Assuming we can
      ; always destroy the content of EAX (true for C calling
      ; convention), we get the next Instruction Pointer (the
      ; instruction after the call):
      pop eax
      ; We advance our virtual Instruction Pointer to after
      ; the first jmp. Here we assume it is a two-byte
      ; encoded short jmp, but in practice we can support
      ; both short and long jmps:
      inc eax
      inc eax
      ; Then we invoke the local unwind handling. It will
      ; in turn either unwind another call frame or jump to
      ; the landing pad:
      jmp eax

The pass to raise C++ exceptions to use unwind is another discussion.
The most valuable thing for me is to support unwind.

What do you think about the idea?
Can it be achieved or did I overlook something important?

I guess it should be implemented as a few passes. I think the one that
inserts code after each call machine instruction is the most difficult
to implement. Is it even possible to insert jmp instructions without
them being optimized away? I also need to access information about the
call frame.

My last question: Is it possible to implement a flag for the back-end
that selects which kind of exception handling to use? Ie. my idea
versus DWARF + libgcc.

Cheers,
Bjarke Walling

Hello, Bjarke

* Provide a pass that raises C++ exception handling to just
   unwind instructions and thread-local data.

Are you familiar with C++ EH? How would you handle catches? Cleanups?

Other call frames might be more complex to handle. It depends on the
moves needed to restore the registers of the previous call frame (the
caller) and to remove the current frame.

The devil is in details. How would you restore callee-saved registers?
In general, you need to know what they are in the unwind point and
their exact position on stack.

What do you think about the idea?

The idea looks sane. But only for 'basic' unwind. Otherwise you'll end
with rewriting libgcc unwind runtime.
Also, how

My last question: Is it possible to implement a flag for the back-end
that selects which kind of exception handling to use? Ie. my idea
versus DWARF + libgcc.

EH is complex. For C++ EH you will need to have a frontend support.
Actually, calls to libgcc unwinding runtime is inserted by frontend.
Backend is only responsible for emission of codegen-time information.
So, if you're planning to hack on C++ EH somehow, you will need to
hack on frontend as well.

Have you looked into libunwind? Maybe it will be easy to hook it into
llvm somehow?

Hi Bjarke,

* Provide an efficient runtime implementation that does not
    depend on reading the DWARF EH information.

why? The DWARF EH info encodes two things: (1) how to restore
registers; and (2) matching rules for exception objects, and
what to do with them. You will need something along the lines
of (1) if you unwind out of the middle of functions. As for (2),
if you don't do any matching of exceptions against types, this is
an extremely minimal amount of info. In any case, it is entirely
up to the personality function what happens here - you can always
write your own. Check out the C personality function (yes, C not
C++!) in gcc/unwind-c.c to get an idea of what a small personality
function looks like.

* It should be self hosted, meaning the runtime is static
    linked in. I want to use it kernel mode.

You can link statically with libgcc.

* Unwinding should be a read-only operation regarding the
    stack, so I can create a stack dump in the landing pad.

You can get stack dumps with gcc dwarf eh. The Ada front-end
does this for example - very convenient.

* Provide a pass that raises C++ exception handling to just
    unwind instructions and thread-local data.

Take a look at libunwind (http://www.hpl.hp.com/research/linux/libunwind/).
Another possibility, very close you yours and currently used by the vmkit
project, is to modify all functions so they return two values, the usual
return value and an additional boolean value indicating whether an exception
was thrown during the call or not. Callers then branch to an appropriate
place based on this value. Thus there is no special stack unwinding, it
is just functions returning. This adds some distributed overhead, but
unwinding is fast. You can always return something more complicated than
a boolean of course.

My idea on handling the DWARF EH actions is to compile it to machine
instructions. Fx. given an Instruction Pointer, unwinding a call frame
might be described as:

    add esp, 12 ; Remove the current call frame.
    jmp unwind ; Continue the unwinding process.

This is a lot like what you get using the "function returns an extra
boolean" method.

Other call frames might be more complex to handle. It depends on the
moves needed to restore the registers of the previous call frame (the
caller) and to remove the current frame.

If you really plan to unwind out of the middle of functions you will
have to do magic to restore registers. Do you plan to use the dwarf
frame moves info for this? If you use the "functions return an
extra boolean" method then you don't have to do anything, since
functions return and have registers restored in the usual way.

I want to tag all calls and invokes in a manner that can be easily
recognized by a runtime. I can tolerate a small overhead on calls. The
idea is to do something like this:

      call some_function
      jmp continue
      jmp case_of_unwinding
    continue:

This again looks a lot like the "have functions return an extra boolean"
method.

The case-of-unwinding will either unwind the current call frame or
jump to the landing pad (if it was an invoke instruction). I think the
overhead will in fact be quite small because of branch prediction on
modern processors. The size overhead by tagging all calls, compared to
only tagging unwind areas and actions in DWARF, does not matter much
to me.

The vmkit experience is that unwinding using their method is a lot faster
than using the dwarf unwinder. I don't know if the distributed overhead
is noticeable or not.

Ciao,

Duncan.

Hi Duncan, Hi Bjarke,

Duncan Sands wrote:

Take a look at libunwind (http://www.hpl.hp.com/research/linux/libunwind/).
Another possibility, very close you yours and currently used by the vmkit
project, is to modify all functions so they return two values, the usual
return value and an additional boolean value indicating whether an exception
was thrown during the call or not. Callers then branch to an appropriate
place based on this value. Thus there is no special stack unwinding, it
is just functions returning. This adds some distributed overhead, but
unwinding is fast. You can always return something more complicated than
a boolean of course.
  
That's correct. Although, to make things clear, I'm not using the multiple return value features of LLVM. The signature of functions only return one value. If an exception happens, the boolean value is stored in a thread local specific variable. But conceptually, this pretty much is the same than returning two values.

The vmkit experience is that unwinding using their method is a lot faster
than using the dwarf unwinder. I don't know if the distributed overhead
is noticeable or not.
  
It's *hardly* noticeable. If your language/runtime throws a lot of exceptions, that's the way to go currently (there are other sophisticated techniques, but much more complex). Dwarf is great if your language/runtime basically never throws exceptions. I can't give you an accurate measure of the real cost of Dwarf on Java applications because, unfortunately, libgcc did not optimize its dwarf info lookup on dynamically registered frames. It's doing a linear search of the info, and that takes *a lot* of time.

Nicolas

Nicolas Geoffray wrote:

It's *hardly* noticeable. If your language/runtime throws a lot of
exceptions, that's the way to go currently (there are other
sophisticated techniques, but much more complex). Dwarf is great if your
language/runtime basically never throws exceptions. I can't give you an
accurate measure of the real cost of Dwarf on Java applications because,
unfortunately, libgcc did not optimize its dwarf info lookup on
dynamically registered frames. It's doing a linear search of the info,
and that takes *a lot* of time.

gcj uses DWARF unwinder data, and the overhead is quite significant;
The unwinder code is fairly high in the list when profiling, mostly
because of the very unfortunate way that ClassLoader.loadClass throws
exceptions when a class could not be found.

It should be easy enough to fix libgcc to do something more efficient
with dynamically registered frames.

Andrew.

Hello Anton,

Are you familiar with C++ EH? How would you handle catches? Cleanups?

I guess I'm not familiar with C++ EH in details. Maybe I'm begin
naive, but I think exceptions doesn't have to be that complex.

When compiling C++ the front-end inserts a llvm.eh.selector intrinsic
in the landing pad. It references the types you want to catch and a
personality routine. If you use this intrinsic the type information is
added to the DWARF .eh_frame section. The unwind runtime knows how to
handle the actions described in this section. It is sort of a virtual
machine, if I understand it correctly. The instructions and their
semantics are described in the DWARF specification. The unwind runtime
execute these instructions, unwinding frames and call the personality
routine when it reaches a catch block to ask whether the catch block
can handle this exception.

I read the LLVM documentation on exceptions and regarding C++ cleanups
it sounds a bit more complicated. I would have to find out how exactly
libgcc, the DWARF actions and the personality routine plays together.

I know DWARF was made to provide a platform-independant way to add
debugging information. That is why they made a virtual machine for
stack unwinding. If you have a compiled C++ program you have DWARF
actions and a personality routine that is not going to change. If you
don't care about being ABI compliant, then you could compile this
virtual machine and all its actions to native machine code.

The idea looks sane. But only for 'basic' unwind. Otherwise you'll end
with rewriting libgcc unwind runtime.
Also, how

Maybe so. What is 'basic' unwind? Maybe I shouldn't focus on C++ for
now, but just about supporting invoke/unwind. If I don't use the
llvm.eh.exception and llvm.eh.selector intrinsics, then the landing
pad of an invoke becomes a catch-all. The landing pad code can check
whether it accepts the exception stored as a pointer in a thread-local
store. If it doesn't support the exception it unwind's again. I still
need to handle stack unwinding, but the information is already there.
It is written in .eh_frame and .gcc_except_table. What I want to do is
translating this information, the actions, to native machine code for
great efficiency.

EH is complex. For C++ EH you will need to have a frontend support.
Actually, calls to libgcc unwinding runtime is inserted by frontend.
Backend is only responsible for emission of codegen-time information.
So, if you're planning to hack on C++ EH somehow, you will need to
hack on frontend as well.

I think you're right I shouldn't focus on C++. It has quite
complicated and featureful exception handling. I guess the real pain
is multiple inheritance.

Have you looked into libunwind? Maybe it will be easy to hook it into
llvm somehow?

I will look at that, thank you.

I haven't timed the unwinding proces yet, but I want to do: How many
clock cycles does it take to handle various types of try/catch +
throw. I don't think libgcc is very efficient, but I can't be sure
before I measure it. Also adding 50KB to your executable is a lot in
my opinion. It is about 13000 machine instructions (I measured that).

Bjarke Walling

Hi Duncan,

why? The DWARF EH info encodes two things: (1) how to restore
registers; and (2) matching rules for exception objects, and
what to do with them. You will need something along the lines
of (1) if you unwind out of the middle of functions. As for (2),
if you don't do any matching of exceptions against types, this is
an extremely minimal amount of info. In any case, it is entirely
up to the personality function what happens here - you can always
write your own. Check out the C personality function (yes, C not
C++!) in gcc/unwind-c.c to get an idea of what a small personality
function looks like.

Yes, I need (1) to restore registers. I don't see why the type
checking can't be done in the landing pad. Yes, it is an overhead, but
not more than interpreting DWARF gives me.

I will look at the C personality function, thank you for that.

You can link statically with libgcc.

Yes, I know, but I think 50KB is a lot. It's not a microkernel I'm
writing anymore.

Also I don't get the benefits of invoke/unwind. LLVM handles function
inlining with invoke/unwind quite nicely. I'm not sure it can do that
to the same degree with calls to libgcc?

* Unwinding should be a read-only operation regarding the
    stack, so I can create a stack dump in the landing pad.

You can get stack dumps with gcc dwarf eh. The Ada front-end
does this for example - very convenient.

Very convenient. Does libgcc provide that too? I like the features of
DWARF, just not the time and space overhead.

Take a look at libunwind (http://www.hpl.hp.com/research/linux/libunwind/).

I will, thank you.

Another possibility, very close you yours and currently used by the vmkit
project, is to modify all functions so they return two values, the usual
return value and an additional boolean value indicating whether an exception
was thrown during the call or not. Callers then branch to an appropriate
place based on this value. Thus there is no special stack unwinding, it
is just functions returning. This adds some distributed overhead, but
unwinding is fast. You can always return something more complicated than
a boolean of course.

Maybe that's an option. I think I know how to do that already by
writing a LLVM pass. I guess it would be translated to something like
this in machine code:

    call some_function
    test ebx, ebx ; Check second return value
    jz handle_unwind ; If nessescary handle unwinding

That is a fairly small overhead.

My idea on handling the DWARF EH actions is to compile it to machine
instructions. Fx. given an Instruction Pointer, unwinding a call frame
might be described as [...]

This is a lot like what you get using the "function returns an extra
boolean" method.

Yes, you're right and it's conceptually easier to handle.

Other call frames might be more complex to handle. It depends on the
moves needed to restore the registers of the previous call frame (the
caller) and to remove the current frame.

If you really plan to unwind out of the middle of functions you will
have to do magic to restore registers. Do you plan to use the dwarf
frame moves info for this?

Yes. The information is available. It is stored as a DWARF virtual
machine (CFA instructions). I "just" need to translate this to machine
instructions. I thought that it would be somewhat easy to hook into
LLVM to get the DWARF instructions it would write and then emit
machine instructions instead.

If you use the "functions return an
extra boolean" method then you don't have to do anything, since
functions return and have registers restored in the usual way.

Yes, I see.

I want to tag all calls and invokes in a manner that can be easily
recognized by a runtime. I can tolerate a small overhead on calls. The
idea is to do something like this [...]

This again looks a lot like the "have functions return an extra boolean"
method.

Except there's a less overhead on a jmp instruction than test+jz
instructions. Maybe it's worth it. It is definitely easier to
implement than my first idea.

The vmkit experience is that unwinding using their method is a lot faster
than using the dwarf unwinder. I don't know if the distributed overhead
is noticeable or not.

It can be measured. Maybe I should look at the vmkit code as well.

Bjarke Walling

Hi Nicolas,

Duncan Sands wrote:

Another possibility, very close you yours and currently used by the vmkit
project, is to modify all functions so they return two values, the usual
return value and an additional boolean value indicating whether an
exception
was thrown during the call or not. Callers then branch to an appropriate
place based on this value. Thus there is no special stack unwinding, it
is just functions returning. This adds some distributed overhead, but
unwinding is fast. You can always return something more complicated than
a boolean of course.

That's correct. Although, to make things clear, I'm not using the multiple
return value features of LLVM. The signature of functions only return one
value. If an exception happens, the boolean value is stored in a thread
local specific variable. But conceptually, this pretty much is the same than
returning two values.

I see. So you check this value stored in a thread-local variable after
each call? And you lower invoke to a call and branch with regard to
this value?

It's *hardly* noticeable. If your language/runtime throws a lot of
exceptions, that's the way to go currently (there are other sophisticated
techniques, but much more complex). Dwarf is great if your language/runtime
basically never throws exceptions. I can't give you an accurate measure of
the real cost of Dwarf on Java applications because, unfortunately, libgcc
did not optimize its dwarf info lookup on dynamically registered frames.
It's doing a linear search of the info, and that takes *a lot* of time.

In my view I would use exceptions a lot more if they were very
efficient. Exceptions are nice, because they break the current code
flow and send a signal/message in the dynamic call stack.

What are these sophisticated techniques you are talking about? My time
frame for implementing this is, not unlimited, but fairly long. Less
than a year (of spare time) would be ok. I still need to feel
progress.

Another option I'm thinking about is creating a runtime that, when
initialized, compiles the DWARF information to native code. It could
create an Instruction Pointer lookup hash table associated with unwind
actions. Maybe libgcc/libunwind does this already? Starting your
executable would have a small overhead, but unwinding would be
efficient. I didn't start implementing this, mainly because I'm not
sure it's already implemented and also because I think exceptions,
like many other programming concepts, should be placed as a
responsibility of the compiler, not the executable. If I could place
as much overhead for the unwinding proces as possible in the compiler
instead of the runtime it would be great. I thought that instead of
the runtime created lookup table you could encode it as jmp
instructions after each call. It is like the optimisation of
malloc/free where you write a size value just before the allocated
memory block. The memory is in itself a hash table if the hashes are
memory locations.

Bjarke Walling

Bjarke Walling wrote:

Another option I'm thinking about is creating a runtime that, when
initialized, compiles the DWARF information to native code. It could
create an Instruction Pointer lookup hash table associated with unwind
actions.

JIT-compiling the unwinder data, yes. Given that the unwinder data
is, basically, the source for a specialized bytecode interpreter I
can't see any reason this wouldn't work.

Maybe libgcc/libunwind does this already?

It doesn't.

Andrew.

Hi Bjarke,

Bjarke Walling wrote:

I see. So you check this value stored in a thread-local variable after
each call? And you lower invoke to a call and branch with regard to
this value?
  
Yes, that's correct.

What are these sophisticated techniques you are talking about? My time
frame for implementing this is, not unlimited, but fairly long. Less
than a year (of spare time) would be ok. I still need to feel
progress.

I'm talking about this:

I don't know how difficult it would be to implement it in LLVM's codegen.

Nicolas

Nicolas Geoffray wrote:

It's *hardly* noticeable. If your language/runtime throws a lot of exceptions, that's the way to go currently (there are other sophisticated techniques, but much more complex). Dwarf is great if your language/runtime basically never throws exceptions. I can't give you an accurate measure of the real cost of Dwarf on Java applications because, unfortunately, libgcc did not optimize its dwarf info lookup on dynamically registered frames. It's doing a linear search of the info, and that takes *a lot* of time.

Plus, for those of us who care, the linear search is protected by a single mutex because there appears to be the potential that dynamically registered frames can be removed from the list (in addition to being inserted) and a mutex is the easiest way to protect against inconsistency.

Unfortunately, this means that threads that throw a lot will end up serializing here, plus there's a decent amount of cache-line pinging for read-only use of the list.

There are other ways to deal with the potential concurrent removal from the list that force the remover to do much more overhead (rcu, node pools rather than deallocation, etc), but I've never gotten around to actually doing it.

Luke

Hi Andrew,

JIT-compiling the unwinder data, yes. Given that the unwinder data
is, basically, the source for a specialized bytecode interpreter I
can't see any reason this wouldn't work.

I might look into that. It will be a good challenge to understand the
EH data in detail and program my own stack unwinding routine. If it
works out I can think of how to do most of the work compile-time. I
still don't see the complexities in stack unwinding: Restore caller
registers, remove the frame, lookup next frame, etc. The complexity
might be in the details of course. If I start implementing my own
unwinding routine I can prove myself wrong in how easy I think it is.

It would be nice with a suite of test cases that use invoke/unwind in
various situations (complex stack frames, simple stack frames, unwind
as the first thing/last thing/middle of function, unwind few/many
frames, etc.) I can run this suite to see if and where my runtime
fails. I'm not sure such a thing exists? Even with the test suite I
would have to run it using different compiler settings and
optimisations.

Bjarke Walling

Hi Nicolas,

Bjarke Walling wrote:

What are these sophisticated techniques you are talking about? My time
frame for implementing this is, not unlimited, but fairly long. Less
than a year (of spare time) would be ok. I still need to feel
progress.

I'm talking about this:

CiteSeerX — A Study of Exception Handling and Its Dynamic Optimization in Java

I don't know how difficult it would be to implement it in LLVM's codegen.

That definitely looks sophisticated. Maybe I read it wrong, but it
looks to me that LLVM is capable of doing some of it already:
Profiling (has to be setup to profile the exception path), aggressive
function inlining (on a specific code path?), turn unwind into a
branch to the landing pad. I don't know to which degree it will work
though. The method sounds exciting, but it seems to be made for
JIT-interpreting, not static compilation.

It cites some papers that actually talks about what I want to do: To
optimize the exception code path using stack unwinding while having a
(near) zero-cost overhead on the normal path. One of the papers talk
about a branch table, close to my original idea. Stack cutting is not
interesting for me, because it makes the normal code path slow. I
think it's close to what the enable-correct-eh-support flag does.

Bjarke Walling

Hello,

still don't see the complexities in stack unwinding: Restore caller
registers, remove the frame, lookup next frame, etc.

This is not enough for C++, for example. You would need to run destructors, be able to propagate exception object upper, do a type-based catch, etc.
You might want to read http://www.codesourcery.com/public/cxx-abi/abi-eh.html
Also, studying libgcc unwinding runtime and different personality functions is a good start.

It would be nice with a suite of test cases that use invoke/unwind in
various situations (complex stack frames, simple stack frames, unwind
as the first thing/last thing/middle of function, unwind few/many
frames, etc.) I can run this suite to see if and where my runtime
fails. I'm not sure such a thing exists?

gcc testsuite has some amount of such tests.

I read that link. Yes, I know there's more to handle in C++. I think
I've become aware that it's not just making some type-checking and
branching in the landing pad. The "just" is actually complex to do and
involves the actions taken by the personality routine. However I will
start by trying to support the unwind instruction and see where it
leads me.

Thank you all for your constructive criticism and the discussion. It
has definitely raised some thoughts.

Bjarke Walling