x86 unwind support

1. Is there already a push underway to get it in?

2. If not, how's this change sound:

ECX is not a callee-saved register, so callers assume it gets nuked
anyway. So for LLVM functions, ECX gets a flag indicating whether
unwinding is taking place. At each callsite for "call", check ECX and
bail out if the unwind flag is set. At the callsite for "invoke",
check ECX and jump to the unwind label if ECX is set; otherwise, jump
to the regular return label.

It doesn't add to register pressure since ECX gets clobbered by
function calls anyway. It doesn't access memory for LLVM-to-LLVM
calls. The only overhead to callsites is a conditional branch on a
register value.
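
For illustration, here is a rough C model of the scheme, assuming the flag can be treated as a second return value instead of a real register. The names ret_t, thrower, middle, catcher and self_check are made up for this sketch; the actual change would live in the x86 backend's call lowering, not in source code.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the pair (EAX, ECX): the normal return value plus the
   "unwinding in progress" flag the proposal would keep in ECX. */
typedef struct {
    int  value;      /* stands in for EAX */
    bool unwinding;  /* stands in for ECX */
} ret_t;

/* Models a function that may execute "unwind". */
static ret_t thrower(int x) {
    if (x < 0)
        return (ret_t){ 0, true };    /* set the flag and return */
    return (ret_t){ x * 2, false };
}

/* Models a plain "call" site: check the flag and bail out if set. */
static ret_t middle(int x) {
    ret_t r = thrower(x);
    if (r.unwinding)
        return r;                     /* propagate to our own caller */
    return (ret_t){ r.value + 1, false };
}

/* Models an "invoke" site: branch to the unwind label if the flag is set. */
static int catcher(int x) {
    ret_t r = middle(x);
    if (r.unwinding)
        goto unwind_label;
    return r.value;                   /* regular return label */
unwind_label:
    return -1;                        /* result chosen by the handler */
}

/* Exercises both the normal path and the unwinding path. */
void self_check(void) {
    assert(catcher(3) == 7);
    assert(catcher(-5) == -1);
}
```

The only per-callsite cost in this model is the test of r.unwinding, which corresponds to the conditional branch on ECX.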

Now when calling external functions, this obviously won't work.
Perhaps a thread-local global could be checked only on returns from
external functions. Or perhaps unwinds coming from external functions
just don't get supported for now.

3. Perhaps a pass that lowers unwinds to an EH intrinsic? Would that
map well without adding more overhead than the current setjmp/longjmp
lowering pass?

Hello, Kenneth

ECX is not a callee-saved register, so callers assume it gets nuked
anyway.

2 problems here at least:
1. ECX is used as parameter passing register in some calling conventions
2. ECX is used as chain holding register for nested functions

PS: What about x86-64?

1. Which ones? I know that Windows uses it for the "this" pointer.

Anyway, unless the callee is required to preserve it in a given
calling convention, that doesn't preclude us using it for a *return*
value. It would be checked after calls return, and wouldn't affect
the use of the register for passing values in before the call is made.
The callee would set it right before return.

2. Does LLVM support nested functions? I must have missed that.

Anyway, I haven't looked too deeply into X86-64, but I was thinking
that a similar scheme with one of its non-callee-saved registers would
work there.

1. Which ones? I know that Windows uses it for the "this" pointer.

The internal fastcc convention and the Windows fastcall convention off
the top of my head.

Anyway, unless the callee is required to preserve it in a given
calling convention, that doesn't preclude us using it for a *return*
value. It would be checked after calls return, and wouldn't affect
the use of the register for passing values in before the call is made.
The callee would set it right before return.

Right, so that sounds okay.

2. Does LLVM support nested functions? I must have missed that.

To the extent required to implement the gcc nested functions
extension, yes. The specific relevant behavior here is that if a
parameter is marked with the nest attribute, it gets passed in ECX.

-Eli

Kenneth Uildriks wrote:

3. Perhaps a pass that lowers unwinds to an EH intrinsic? Would that
map well without adding more overhead than the current setjmp/longjmp
lowering pass?

In the past there have been suggestions that a good approach would be to target the libunwind library interface (The libunwind project) in a lowering pass. This could provide both low "availability" overhead and low "use" overhead.

If libunwind had setjmp/longjmp implementations for your platform (I think they're currently only available on IA64), then it would be trivial to use a setjmp/longjmp lowering pass and get what you want.

I keep wanting to do this but it always seems to get bumped off of my critical path.

Luke

Hello, Kenneth

1. Which ones? I know that Windows uses it for the "this" pointer.

Many :-)

1. windows fastcall
2. LLVM's own fastcc
3. arguments marked inreg (consider e.g. gcc's attribute inreg(3), etc).

2. Does LLVM support nested functions? I must have missed that.

Nested functions in gcc's sense. They are lowered in a funky way, etc.

Just pulled down libunwind 0.99. README says I'm out of luck on x86
as far as longjmp goes.

According to this page:

data coming from L1 is only about three times as expensive as data
coming from a register. So putting a register check after *every*
call is probably not going to be profitable, compared to a
thread-local global variable check after every invoke... if they
happen often on a thread, that variable will probably be in cache, and
if they don't happen often, the performance impact will be minimal.

Of course if most methods have variables with destructors, I'll end up
with a check of some kind after almost every (non-nounwind) call
anyway, so a register check would be better. On the other hand,
implementing the register check would seem to require native codegen
changes at callsites as opposed to an IR-modifying pass with a
possible new intrinsic or two.

Anyway, here's my new plan:

1. A thread local global variable, type i8*, initialized to zero.
2. At invoke callsites, right before the invoke, call a native method
(mysetjmp) that:

a. Saves ESI, EDI, EBX, EBP, ESP to a buffer alloca'd within the
method containing the invoke site.
b. Sets EAX to 0
c. Returns.

3. The return value of that native method (EAX) is checked, and if
nonzero, branch to unwind label. Otherwise, save the value of the
thread-local-global into the buffer, write the address of that
alloca'd buffer into the thread-local global and make the call.

4. After the call returns, copy the old thread-local-global value out
of the alloca'd buffer back to the thread-local-global.

The unwind instruction will then:

1. Load the thread-local-global value. If it's zero, there's nowhere
to unwind to, so abort.
2. Restore ESI, EDI, EBX, EBP, ESP, and the thread-local-global value
from the buffer.
3. Set EAX to 1.
4. Jump to 2c. (the return instruction for the native method mysetjmp).

The native method will return with all callee-saved registers restored
and a return value in EAX of 1, which will cause the following check
to branch to the unwind label.
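
As a sketch only, the plan above can be modelled in portable C by letting the standard setjmp/longjmp stand in for the hand-written mysetjmp and the jump back into it, so this is not the native routine itself. The names unwind_frame, current_frame, do_unwind, may_unwind, invoke_site and self_check are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* One record per active invoke site, allocated in the invoking frame. */
typedef struct unwind_frame {
    jmp_buf              regs;  /* saved register context */
    struct unwind_frame *prev;  /* previous value of the thread-local */
} unwind_frame;

/* Thread-local pointer to the innermost active invoke site. */
static _Thread_local unwind_frame *current_frame = NULL;

/* Models the "unwind" instruction. */
static void do_unwind(void) {
    unwind_frame *f = current_frame;
    if (f == NULL)
        abort();                 /* nowhere to unwind to */
    current_frame = f->prev;     /* restore the thread-local value */
    longjmp(f->regs, 1);         /* restore context; setjmp returns 1 */
}

/* A callee that unwinds for negative input. */
static int may_unwind(int x) {
    if (x < 0)
        do_unwind();
    return x * 2;
}

/* Models a function containing one invoke of may_unwind. */
static int invoke_site(int x) {
    unwind_frame frame;          /* the alloca'd buffer */
    int result;
    if (setjmp(frame.regs) != 0)
        return -1;               /* unwind label */
    frame.prev = current_frame;  /* save the old thread-local value */
    current_frame = &frame;      /* publish this invoke site */
    result = may_unwind(x);      /* the call itself */
    current_frame = frame.prev;  /* normal return: pop the record */
    return result;
}

/* Exercises the normal path, the unwind path, and the final chain state. */
void self_check(void) {
    assert(invoke_site(4) == 8);
    assert(invoke_site(-1) == -1);
    assert(current_frame == NULL);
}
```

The standard setjmp may save more than the five registers listed above, so this model says nothing about the cost of the real thing.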

Invoke sites only write five callee-saved registers to the stack, and
read/write one pointer to a single thread-local global variable, and
make one direct call. Unwind sites make one direct call, read five
callee-saved registers from the stack (some distance up, so those
memory values might not be warm) and read/write one pointer to a
single thread-local global variable.

The next step would be to replace the mysetjmp call with a new
intrinsic, and then I'd have to save EIP and do an indirect jump to it
at the unwind site instead of jumping to a constant offset within the
native mysetjmp. Making mylongjmp call a new intrinsic will
necessitate no other modifications.

Hi Kenneth, this way of implementing unwind won't interact properly with
dwarf exception handling. That's rather bad.

Ciao,

Duncan.

Arrgh. Exception handling uses invoke and doesn't use unwind! Or did
I just miss it?

Since unwind doesn't take any operands, is there *any* possible
implementation of unwind that fits with exception handling/invoke?

Implemented as an optional pass, my scheme can go unused when you're
using the g++ front-end or something else that uses __cxa_throw.

OK, I've read through http://www.llvm.org/docs/ExceptionHandling.html
several times now.

Let's see if I understand this...

1. Everywhere inside a "try" block, the C++ front-end emits "invoke"
instructions instead of "call" instructions. Without any
transformations, this "invoke" instruction compiles down to assembly
code that doesn't seem to do anything different from a "call"
instruction. Also, "unwind" compiles down to nothing. However, every
function gets some DWARF info compiled into it by LLVM, and part of it
is information about the invoke site.

2. To throw an exception, call __cxa_allocate_exception to allocate an
exception object, and __cxa_throw to throw it.

3. Every function gets some DWARF info compiled into it by LLVM. The
__cxa_throw function uses it to find the function that issued the
"invoke" and find the "landing pads" and jump to the right landing pad
based on the exception type.

4. The landing pad uses exception-handling intrinsics to match the
exception type and to get the exception object.

The lowerinvoke pass adds SJLJ-based unwinding, which is a separate
mechanism based on GCC sjlj exception handling.

My proposed pass adds a lighter-weight setjmp/longjmp-style unwinding.

How does either of these prevent DWARF exception handling from working?
Would a landing pad expecting to get an exception object from the
exception intrinsics fail to get one in the case of an unwind and
crash?

Did I misunderstand anything I outlined above?

Is the exception-throwing function call expected to become an
intrinsic or an instruction in the future? Will it replace unwind?

(Perhaps I should put all this aside and just have my compiler handle
my invoke/unwind logic instead of trying to use invoke/unwind
instructions.)

Hi,

How does either of these prevent DWARF exception handling from working?

If you throw an exception using your proposed unwind implementation,
then it wouldn't be caught by dwarf catch/cleanup regions (e.g. invoke).

Would a landing pad expecting to get an exception object from the
exception intrinsics fail to get one in the case of an unwind and
crash?

The landing pad would never be executed in the first place. This
is rather bad, for example cleanups won't be run.

(Perhaps I should put all this aside and just have my compiler handle
my invoke/unwind logic instead of trying to use invoke/unwind
instructions.)

For the moment that is the best solution I think.

Ciao,

Duncan.

Hi,

Can I interject something at this point?

Can I suggest that invoke/unwind be renamed DWARF_invoke/DWARF_unwind to warn the unwary that if they want lightweight exception handling in their Python/ML/whatever implementation they should use some other method.

PS.
Kenneth, why don't you just use setjmp/longjmp directly?
Or, if you want, I can email you my lightweight versions.

Mark.

Duncan Sands wrote:

Hi Mark,

Can I suggest that invoke/unwind be renamed DWARF_invoke/DWARF_unwind to warn the unwary that if they want lightweight exception handling in their Python/ML/whatever implementation they should use some other method.

probably there should be a switch to choose whether codegen should turn
unwind/invoke into dwarf or setjmp/longjmp style code.

Ciao,

Duncan.

probably there should be a switch to choose whether codegen should turn
unwind/invoke into dwarf or setjmp/longjmp style code.

There is, but it happens before codegen and is slow.
-enable-correct-eh-support will translate invoke/unwind into
setjmp/longjmp pairs for the correct behavior. See:
http://llvm.org/docs/Passes.html#lowerinvoke

Nick

Nick Johnson wrote:

probably there should be a switch to choose whether codegen should turn
unwind/invoke into dwarf or setjmp/longjmp style code.

It seems to me that there is an implicit, and undocumented, assumption that unwinding needs to handle stack-allocated objects.

In languages without stack-allocated objects (ie. most languages that support exceptions) there is no need to unwind frame-by-frame, the unwind simply needs to make a single jump to the invoke instruction and restore the context (which in x86 is just 6 registers).

There is, but it happens before codegen and is slow.
-enable-correct-eh-support will translate invoke/unwind into
setjmp/longjmp pairs for the correct behavior. See:
http://llvm.org/docs/Passes.html#lowerinvoke

*Begin rant*

It is possible to implement invoke/unwind in such a way that both invoke *and* unwind are fast, when unwind just unwinds and doesn't perform any magic behind-the-scenes operations.

After all, isn't it the job of the front-end to insert all the clean-up code for stack-allocated objects?
Java, C#, Python, and Ruby have destructors (finalizers), but they are managed by the garbage collector.
C++ is the odd one out, so why do the semantics of an llvm instruction depend on C++ semantics?

*End rant* ;-)

Mark.

Mark Shannon wrote:

Nick Johnson wrote:

probably there should be a switch to choose whether codegen should turn
unwind/invoke into dwarf or setjmp/longjmp style code.

It seems to me that there is an implicit, and undocumented, assumption
that unwinding needs to handle stack-allocated objects.

In languages without stack-allocated objects (ie. most languages that
support exceptions) there is no need to unwind frame-by-frame, the
unwind simply needs to make a single jump to the invoke instruction and
restore the context (which in x86 is just 6 registers).

Not quite. It's also necessary to execute all the pending POSIX
pthread_cleanup_pop() actions.

It is possible to implement invoke/unwind in such a way that both invoke
  *and* unwind are fast, when unwind just unwinds and doesn't perform
any magic behind-the-scenes operations.

I don't think so. Not for pthreads, anyway.

Andrew.

Andrew Haley wrote:

Mark Shannon wrote:

Nick Johnson wrote:

probably there should be a switch to choose whether codegen should turn
unwind/invoke into dwarf or setjmp/longjmp style code.

It seems to me that there is an implicit, and undocumented, assumption that unwinding needs to handle stack-allocated objects.

In languages without stack-allocated objects (ie. most languages that support exceptions) there is no need to unwind frame-by-frame, the unwind simply needs to make a single jump to the invoke instruction and restore the context (which in x86 is just 6 registers).

Not quite. It's also necessary to execute all the pending POSIX
pthread_cleanup_pop() actions.

POSIX pthread_cleanup_pop() can only be called directly from C++/C.
C doesn't have exceptions.
So yet again, this is a C++ issue.

It is possible to implement invoke/unwind in such a way that both invoke
  *and* unwind are fast, when unwind just unwinds and doesn't perform
any magic behind-the-scenes operations.

I don't think so. Not for pthreads, anyway.

This is C++ specific.

In languages without stack-allocated objects (ie. most languages that
support exceptions) there is no need to unwind frame-by-frame, the
unwind simply needs to make a single jump to the invoke instruction and
restore the context (which in x86 is just 6 registers).

No. Java, C#, Ruby and Python all support the finally/ensure block;
C# supports the using( IDisposable x =...) {} construct. Both
constructs require support for a frame-by-frame unwind; as these
constructs can be nested, a single throw may visit many landing pads
(which may come from different compilation units).

It doesn't have anything to do with stack-allocated vs heap-allocated,
but rather with the language's guarantees about exceptions.
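
To make the nesting point concrete, here is a small C model. It assumes a setjmp/longjmp handler chain rather than DWARF tables, and the names handler, top, cleanups_run, throw_exception, with_finally, run and self_check are invented for the sketch. Each finally frame acts as a landing pad: it runs its cleanup and then resumes the unwind, so one throw visits every enclosing frame.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* One entry per active try region, linked innermost-first. */
typedef struct handler {
    jmp_buf         ctx;
    struct handler *prev;
} handler;

static _Thread_local handler *top = NULL;
static int cleanups_run = 0;   /* counts "finally" bodies executed */

/* Transfers control to the innermost handler, popping it first. */
static void throw_exception(void) {
    handler *h = top;
    if (h == NULL)
        abort();               /* uncaught */
    top = h->prev;
    longjmp(h->ctx, 1);
}

/* A frame with a "finally" block; nests depth more such frames,
   and the innermost one throws. */
static void with_finally(int depth) {
    handler h;
    if (setjmp(h.ctx) != 0) {
        cleanups_run++;        /* landing pad: run the finally body */
        throw_exception();     /* then resume unwinding outward */
    }
    h.prev = top;
    top = &h;
    if (depth == 0)
        throw_exception();
    else
        with_finally(depth - 1);
    top = h.prev;
    cleanups_run++;            /* normal exit also runs the finally body */
}

/* Outermost frame: a real "catch" that stops the unwind.
   Returns how many finally bodies ran on the way out. */
static int run(int depth) {
    handler h;
    cleanups_run = 0;
    if (setjmp(h.ctx) != 0)
        return cleanups_run;
    h.prev = top;
    top = &h;
    with_finally(depth);
    top = h.prev;
    return -1;                 /* not reached: the innermost frame throws */
}

/* Two nested frames plus the throwing frame give three landing pads. */
void self_check(void) {
    assert(run(2) == 3);
    assert(run(0) == 1);
    assert(top == NULL);
}
```

A single jump straight to the catch in run would skip all of the cleanups_run increments, which is exactly the guarantee finally/ensure blocks are meant to provide.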

It is possible to implement invoke/unwind in such a way that both invoke
*and* unwind are fast, when unwind just unwinds and doesn't perform
any magic behind-the-scenes operations.

Why? Exceptions are supposed to occur in exceptional situations. In
general, one should try to optimize for the common case, which does
not include invoke/unwind.

One should certainly not slow down a function call which never throws
just because other functions may throw. Paraphrasing Bjarne
Stroustrup, "If you don't use it, you shouldn't pay for it."

Nick

Mark Shannon wrote:

Andrew Haley wrote:

Mark Shannon wrote:

Nick Johnson wrote:

probably there should be a switch to choose whether codegen should turn
unwind/invoke into dwarf or setjmp/longjmp style code.

It seems to me that there is an implicit, and undocumented, assumption
that unwinding needs to handle stack-allocated objects.

In languages without stack-allocated objects (ie. most languages that
support exceptions) there is no need to unwind frame-by-frame, the
unwind simply needs to make a single jump to the invoke instruction and
restore the context (which in x86 is just 6 registers).

Not quite. It's also necessary to execute all the pending POSIX
pthread_cleanup_pop() actions.

POSIX pthread_cleanup_pop() can only be called directly from C++/C.
C doesn't have exceptions.

But it does have pthread_exit().

So yet again, this is a C++ issue.

No, it isn't:

       The effect of calling longjmp() or siglongjmp() is undefined if there
       have been any calls to pthread_cleanup_push() or pthread_cleanup_pop()
       made without the matching call since the jump buffer was filled.
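
A small C illustration of the pthread_exit() point, assuming a POSIX system (some toolchains need -pthread when building). The names cleaned_up, release, worker, cleanup_ran_on_exit and self_check are hypothetical. The thread leaves through pthread_exit() while a cleanup handler is still pushed, and the handler runs even though no C++ code is involved.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int cleaned_up = 0;

/* Cleanup handler: records that it ran. */
static void release(void *arg) {
    *(int *)arg = 1;
}

static void *worker(void *arg) {
    (void)arg;
    pthread_cleanup_push(release, &cleaned_up);
    pthread_exit(NULL);          /* pops and runs pending cleanup handlers */
    pthread_cleanup_pop(0);      /* required lexical match; not reached */
    return NULL;
}

/* Returns 1 if the handler ran during pthread_exit(), -1 on setup failure. */
int cleanup_ran_on_exit(void) {
    pthread_t t;
    cleaned_up = 0;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return cleaned_up;
}

/* Confirms the handler ran when the thread exited. */
void self_check(void) {
    assert(cleanup_ran_on_exit() == 1);
}
```

Replacing the pthread_exit() call with a longjmp() out of worker's push/pop region would fall under the undefined behaviour quoted above.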

Andrew.