Proposal: stack/context switching within a thread

Right now this functionality is available, on some platforms, from the
C runtime library (the ucontext routines). But embedded environments
(often running a limited standard library) and server environments
would benefit heavily from a standard way to specify context switches
within a single thread in the style of
makecontext/swapcontext/setcontext, and built-in support for these
operations would also open the way for optimizers to begin handling
these execution paths.
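
For concreteness, here is roughly what the existing C-level interface
looks like where it is available: a minimal ping-pong between a main
context and one coroutine using the POSIX routines (the stack size and
names are arbitrary).

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void co_body(void) {
    printf("coroutine: started\n");
    swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    printf("coroutine: resumed\n");
}                                      /* returning resumes uc_link (main_ctx) */

int main(void) {
    getcontext(&co_ctx);               /* initialize, then repoint at a new stack */
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, co_body, 0);

    swapcontext(&main_ctx, &co_ctx);   /* run the coroutine until it yields */
    printf("main: coroutine yielded\n");
    swapcontext(&main_ctx, &co_ctx);   /* resume it; it runs to completion */
    printf("main: coroutine finished\n");
    return 0;
}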

The use cases for these operations, and things like coroutines built
on top of them, will only increase in the future as developers look
for ways to get more concurrency while limiting the number of
high-overhead and difficult-to-manage native threads, locks, and
mutexes.

context.txt (8.57 KB)

I took the liberty of forwarding this to the Stackless Python list,
since they switch stacks, and I got a response at
http://thread.gmane.org/gmane.comp.python.stackless/4464/focus=4467.
The upshot is that they really need the ability to allocate only a
tiny amount of space for each thread and grow that as the thread
actually uses more stack. The way they accomplish that now is by
copying the entire stack to the heap on a context switch, and having
all threads share the main C stack. This isn't quite as bad as it
sounds because it only happens to threads that call into C extension
modules. Pure Python threads operate entirely within heap Python
frames. Still, it would be nice to support this use case.

Kenneth, I don't want to insist that the first version of this be both
a floor wax _and_ a dessert topping, but is there a natural extension
to supporting what Stackless needs that we could add later? It looks
like swapcontext() simply repoints the stack pointer and restores the
registers, while Stackless wants it to be able to allocate memory and
copy the stack. Maybe that implies a "mode" argument?

Alternately, Stackless could probably work with a segmented stack
mechanism like Ian Taylor implemented in gcc for Go. Do you see
anything that would prevent us from layering segmented stacks on top
of this context switching mechanism later?

Thanks,
Jeffrey

As I see it, the context switching mechanism itself needs to know
where to point the stack register when switching. The C routines take
an initial stack pointer when creating the context, and keep track of
it from there. If we don't actually need to interoperate with
contexts created from the C routines, we have a lot more freedom.

Anyway, one approach would be to expose intrinsics to interrogate an
inactive context, to get its initial stack pointer (the one it was
created with) and its current stack pointer, and also to modify both
before making the context active again.
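
Sketched as C-level prototypes (none of these names exist in LLVM
today; they are just shorthand for the intrinsics described above):

/* Hypothetical interface only. */
void *context_initial_sp(void *ctx);       /* stack pointer the context was created with  */
void *context_current_sp(void *ctx);       /* stack pointer saved at the last switch away */
void  context_set_initial_sp(void *ctx, void *sp);
void  context_set_current_sp(void *ctx, void *sp);  /* adjust before making ctx active again */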

I don't see any reason why this scheme wouldn't also be compatible
with segmented stacks. In fact, one could segment an inactive context
stack at any time rather than copying the whole thing, as long as you
can assume that there aren't any pointers into the context's stack
living outside the context.

In fact, one could segment an inactive context
stack at any time rather than copying the whole thing, as long as you
can assume that there aren't any pointers into the context's stack
living outside the context.

Never mind that... stacks by default don't have back pointers to
previous stack frames. We'd still need runtime support apart from the
context switching bit, and without such support we'd have to copy the
whole stack.

On the other hand, stack manipulation really ought to be handled by
the target, since only the target knows the details of how the stack
is laid out to begin with. Also, if we have stack manipulation calls
in the IR, optimization quickly becomes very difficult. Unless we
just allow optimizers to ignore the stack manipulations and assume
they're doing the "right" thing.

On the gripping hand, we don't want the target emitting memory
allocation calls in order to grow the stack (unless a function pointer
to malloc or its equivalent is passed in from the IR).

The way they accomplish that now is by
copying the entire stack to the heap on a context switch, and having
all threads share the main C stack. This isn't quite as bad as it
sounds because it only happens to threads that call into C extension
modules. Pure Python threads operate entirely within heap Python
frames. Still, it would be nice to support this use case.

This wouldn't hold in IR, since virtual registers regularly get
spilled to the stack; every context, regardless of the language,
would have to have its stack saved. Also, this method would mean that
a context cannot be used in any native thread other than the one that
created it, right?

Having read through Stackless Python's web pages a bit:

1. They're doing pretty much what I'd like to do, except that I don't
want to be tied to a particular language and I'd like to be able to
use the stack. (Also, stack use is inescapable with LLVM, as far as I
can tell).

2. We should be able to support "hard switching" in Stackless Python
by adding a llvm.getcontextstacktop intrinsic. If, as in Kristján's
example, llvm.getcontext is used to create context A, and then
execution continues until context B is created with
llvm.swapcontext(B, A), the region of memory between
llvm.getcontextstacktop(A) and llvm.getcontextstacktop(B) can be saved
and later restored when B is resumed. Of course that usage would
throw a monkey wrench into a segmented stack scheme... it assumes that
context stack areas actually behave like contiguous stacks. Not only
that, it assumes that no pointers to a context's stack exist outside
of the context... when the context is inactive, a pointer into a
context's stack won't be valid!

But in the case of Stackless Python, these caveats can be addressed
with a simple "Don't do that!", since it's all tied into the language.
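
As a hedged sketch of that save/restore, assuming a downward-growing,
contiguous stack and no outside pointers into the saved region
(llvm_getcontextstacktop stands in for the proposed intrinsic and is
not an existing LLVM facility):

#include <stdlib.h>
#include <string.h>

extern char *llvm_getcontextstacktop(void *ctx);  /* hypothetical intrinsic */

struct slice { char *base; char *copy; size_t len; };

/* On switching away from B: stash the region B used beyond A's stack top. */
void save_slice(struct slice *s, void *ctx_a, void *ctx_b) {
    char *high = llvm_getcontextstacktop(ctx_a);  /* SP when A was created */
    char *low  = llvm_getcontextstacktop(ctx_b);  /* SP when B was created (deeper) */
    s->base = low;
    s->len  = (size_t)(high - low);
    s->copy = malloc(s->len);
    memcpy(s->copy, s->base, s->len);
}

/* Before resuming B: put its frames back where they were. */
void restore_slice(struct slice *s) {
    memcpy(s->base, s->copy, s->len);
    free(s->copy);
}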

3. I would need to run some benchmarks, but in some cases it might be
better to use mmap to swap stacks between contexts... that way nothing
would need to be copied.

4. I'm hoping that LLVM ends up growing optimization passes that
minimize the actual physical use of contexts in many use cases. Also,
we might be able to guarantee small stack usage with a pass that
forces recursive calls to spawn a new context and turns large alloca's
into malloc's, making it safer to have a bunch of little stacks
without any needed juggling.

As I see it, the context switching mechanism itself needs to know
where to point the stack register when switching. The C routines take
an initial stack pointer when creating the context, and keep track of
it from there. If we don't actually need to interoperate with
contexts created from the C routines, we have a lot more freedom.

I guess the reason to interoperate with contexts from the C routines
would be to support ucontext_t's passed into signal handlers? But then
the LLVM intrinsics need to specify that their context's layout is the
same as ucontext_t's, on platforms where ucontext_t exists.
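
For reference, the signal-handler case being talked about is the
standard POSIX one, nothing LLVM-specific:

#include <signal.h>
#include <ucontext.h>

static void on_signal(int sig, siginfo_t *info, void *ctx_void) {
    ucontext_t *uc = (ucontext_t *)ctx_void;  /* machine state at the interrupted point */
    (void)sig; (void)info; (void)uc;
    /* interoperating with uc (e.g. uc->uc_mcontext) is the case in question */
}

void install_handler(void) {
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;     /* request the three-argument handler form */
    sa.sa_sigaction = on_signal;
    sigaction(SIGUSR1, &sa, NULL);
}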

Anyway, one approach would be to expose intrinsics to interrogate an
inactive context, to get its initial stack pointer (the one it was
created with) and its current stack pointer, and also to modify both
before making the context active again.

I don't see any reason why this scheme wouldn't also be compatible
with segmented stacks.
...
On the other hand, stack manipulation really ought to be handled by
the target, since only the target knows the details of how the stack
is laid out to begin with. Also, if we have stack manipulation calls
in the IR, optimization quickly becomes very difficult. Unless we
just allow optimizers to ignore the stack manipulations and assume
they're doing the "right" thing.

On the gripping hand, we don't want the target emitting memory
allocation calls in order to grow the stack (unless a function pointer
to malloc or its equivalent is passed in from the IR).

In gcc's split-stacks
(http://gcc.gnu.org/ml/gcc/2009-02/msg00429.html; I got the name wrong
earlier), Ian planned to call a known global name to allocate memory
(http://gcc.gnu.org/ml/gcc/2009-02/msg00479.html). I'm not sure what
he actually wound up doing on the gccgo branch. LLVM could also put
the allocation/deallocation functions into the context, although it'd
probably be better to just follow gcc.

The way they accomplish that now is by
copying the entire stack to the heap on a context switch, and having
all threads share the main C stack. This isn't quite as bad as it
sounds because it only happens to threads that call into C extension
modules. Pure Python threads operate entirely within heap Python
frames. Still, it would be nice to support this use case.

This wouldn't hold in IR, since virtual registers regularly get
spilled to the stack; every context, regardless of the language,
would have to have its stack saved. Also, this method would mean that
a context cannot be used in any native thread other than the one that
created it, right?

Well, a frontend can generate code in continuation-passing style or do
all of its user-level "stack" frame manipulation on the heap. Then it
only uses a constant amount of C-stack space, which might not be part
of the context that needs to be switched. Only foreign calls
necessarily use a chunk of C stack. Stackless's approach does seem to
prevent one coroutine's foreign code from using pointers into another
coroutine's stack, and maybe they could/should create a new context
each time they need to enter a foreign frame instead of trying to copy
the stack...
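
A hedged sketch of that heap-frame style (made-up frame layout, just to
show that the generated code's native stack usage can stay constant):

/* One user-level "stack" frame, allocated on the heap by the front-end's
 * generated code.  The trampoline below never recurses on the C stack. */
struct frame {
    struct frame *caller;                   /* user-level return link */
    struct frame *(*step)(struct frame *);  /* runs a bit, returns the next frame to run */
};

void run(struct frame *start) {
    struct frame *f = start;
    while (f)
        f = f->step(f);  /* call, return, or suspend -- all without growing the C stack */
}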

2. We should be able to support "hard switching" in Stackless Python
by adding a llvm.getcontextstacktop intrinsic. If, as in Kristján's
example, llvm.getcontext is used to create context A, and then
execution continues until context B is created with
llvm.swapcontext(B, A), the region of memory between
llvm.getcontextstacktop(A) and llvm.getcontextstacktop(B) can be saved
and later restored when B is resumed.

Wait, what stack top does swapcontext get? I'd thought that A's and
B's stack top would be the same since they're executing on the same
stack.

Of course that usage would
throw a monkey wrench into a segmented stack scheme... it assumes that
context stack areas actually behave like contiguous stacks. Not only
that, it assumes that no pointers to a context's stack exist outside
of the context... when the context is inactive, a pointer into a
context's stack won't be valid!

But in the case of Stackless Python, these caveats can be addressed
with a simple "Don't do that!", since it's all tied into the language.

And users shouldn't need both stack copying and split stacks. Just one
should suffice.

3. I would need to run some benchmarks, but in some cases it might be
better to use mmap to swap stacks between contexts... that way nothing
would need to be copied.

Presumably the user would deal with that in allocating their stacks
and switching contexts, using the intrinsics LLVM provides? I don't
see a reason yet for LLVM to get into the mmap business.

4. I'm hoping that LLVM ends up growing optimization passes that
minimize the actual physical use of contexts in many use cases.

That sounds very tricky...

Also,
we might be able to guarantee small stack usage with a pass that
forces recursive calls to spawn a new context and turns large alloca's
into malloc's, making it safer to have a bunch of little stacks
without any needed juggling.

This sounds like a stopgap until real split stacks can be implemented.
http://gcc.gnu.org/wiki/SplitStacks#Backward_compatibility describes
some of the other difficulties in getting even this much to work.
(foreign calls, and function pointers, at least)

As I see it, the context switching mechanism itself needs to know
where to point the stack register when switching. The C routines take
an initial stack pointer when creating the context, and keep track of
it from there. If we don't actually need to interoperate with
contexts created from the C routines, we have a lot more freedom.

I guess the reason to interoperate with contexts from the C routines
would be to support ucontext_t's passed into signal handlers? But then
the LLVM intrinsics need to specify that their context's layout is the
same as ucontext_t's, on platforms where ucontext_t exists.

Or perhaps it can be an argument to the target code generator, if
there's any need to switch "compatibility mode" off and on. All that
the intrinsics require is that context creators be given
a memory area of at least size llvm.context.size() to write contexts
into, and that nothing besides the intrinsics mess with the context
structure.
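
In other words, the usage contract is roughly this (llvm_context_size
and llvm_getcontext are stand-ins for the eventual intrinsics, not
real names):

#include <stdlib.h>

extern size_t llvm_context_size(void);  /* hypothetical: bytes the target needs per context */
extern void   llvm_getcontext(void *buf);

/* The buffer is opaque; only the context intrinsics may touch its contents. */
void *new_context_storage(void) {
    void *buf = malloc(llvm_context_size());
    llvm_getcontext(buf);
    return buf;
}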

Anyway, one approach would be to expose intrinsics to interrogate an
inactive context, to get its initial stack pointer (the one it was
created with) and its current stack pointer, and also to modify both
before making the context active again.

I don't see any reason why this scheme wouldn't also be compatible
with segmented stacks.
...
On the other hand, stack manipulation really ought to be handled by
the target, since only the target knows the details of how the stack
is laid out to begin with. Also, if we have stack manipulation calls
in the IR, optimization quickly becomes very difficult. Unless we
just allow optimizers to ignore the stack manipulations and assume
they're doing the "right" thing.

On the gripping hand, we don't want the target emitting memory
allocation calls in order to grow the stack (unless a function pointer
to malloc or its equivalent is passed in from the IR).

In gcc's split-stacks
(http://gcc.gnu.org/ml/gcc/2009-02/msg00429.html; I got the name wrong
earlier), Ian planned to call a known global name to allocate memory
(http://gcc.gnu.org/ml/gcc/2009-02/msg00479.html). I'm not sure what
he actually wound up doing on the gccgo branch. LLVM could also put
the allocation/deallocation functions into the context, although it'd
probably be better to just follow gcc.

The way they accomplish that now is by
copying the entire stack to the heap on a context switch, and having
all threads share the main C stack. This isn't quite as bad as it
sounds because it only happens to threads that call into C extension
modules. Pure Python threads operate entirely within heap Python
frames. Still, it would be nice to support this use case.

This wouldn't hold in IR, since virtual registers regularly get
spilled to the stack; every context, regardless of the language,
would have to have its stack saved. Also, this method would mean that
a context cannot be used in any native thread other than the one that
created it, right?

Well, a frontend can generate code in continuation-passing style or do
all of its user-level "stack" frame manipulation on the heap. Then it
only uses a constant amount of C-stack space, which might not be part
of the context that needs to be switched. Only foreign calls
necessarily use a chunk of C stack. Stackless's approach does seem to
prevent one coroutine's foreign code from using pointers into another
coroutine's stack, and maybe they could/should create a new context
each time they need to enter a foreign frame instead of trying to copy
the stack...

I see what you mean. I'll have to look up the conditions under which
an llvm call instruction can avoid creating a new physical stack
frame, but you've convinced me that it could be made to work better
than I thought when I wrote that.

2. We should be able to support "hard switching" in Stackless Python
by adding a llvm.getcontextstacktop intrinsic. If, as in Kristján's
example, llvm.getcontext is used to create context A, and then
execution continues until context B is created with
llvm.swapcontext(B, A), the region of memory between
llvm.getcontextstacktop(A) and llvm.getcontextstacktop(B) can be saved
and later restored when B is resumed.

Wait, what stack top does swapcontext get? I'd thought that A's and
B's stack top would be the same since they're executing on the same
stack.

No, A's stack top would be whatever the stack pointer was when
llvm.getcontext was called to create it. B's stack top would be
whatever the stack pointer was when llvm.swapcontext was called to
create it... it would be further from the common base than A's stack
top. The region between them is what needs to be restored before B can
become active again, assuming that A's stack space remained valid.

Of course that usage would
throw a monkey wrench into a segmented stack scheme... it assumes that
context stack areas actually behave like contiguous stacks. Not only
that, it assumes that no pointers to a context's stack exist outside
of the context... when the context is inactive, a pointer into a
context's stack won't be valid!

But in the case of Stackless Python, these caveats can be addressed
with a simple "Don't do that!", since it's all tied into the language.

And users shouldn't need both stack copying and split stacks. Just one
should suffice.

Exactly.

3. I would need to run some benchmarks, but in some cases it might be
better to use mmap to swap stacks between contexts... that way nothing
would need to be copied.

Presumably the user would deal with that in allocating their stacks
and switching contexts, using the intrinsics LLVM provides? I don't
see a reason yet for LLVM to get into the mmap business.

Me either. Stack copying, mmap'ing, slicing, or whatever should be
done by the scheduler. LLVM would not include any part of a scheduler,
just intrinsics to allow a scheduler to create and switch contexts.

4. I'm hoping that LLVM ends up growing optimization passes that
minimize the actual physical use of contexts in many use cases.

That sounds very tricky...

One common case would be for a coroutine to be inlined with the help
of indirect branches.

Also,
we might be able to guarantee small stack usage with a pass that
forces recursive calls to spawn a new context and turns large alloca's
into malloc's, making it safer to have a bunch of little stacks
without any needed juggling.

This sounds like a stopgap until real split stacks can be implemented.
http://gcc.gnu.org/wiki/SplitStacks#Backward_compatibility describes
some of the other difficulties in getting even this much to work.
(foreign calls, and function pointers, at least)

True. Some front ends have more control over this than others, though.
And users of a C-like front end would find this pass helpful even
though they're ultimately responsible for only calling "safe"
functions within a small-stack-space context.

Anyway, I updated the document to take a lot of this discussion into
account. I hope that the assumptions I made actually are universally
applicable in (non-split-stack-enabled) LLVM.

context.txt (14.2 KB)

2. We should be able to support "hard switching" in Stackless Python
by adding a llvm.getcontextstacktop intrinsic. If, as in Kristján's
example, llvm.getcontext is used to create context A, and then
execution continues until context B is created with
llvm.swapcontext(B, A), the region of memory between
llvm.getcontextstacktop(A) and llvm.getcontextstacktop(B) can be saved
and later restored when B is resumed.

Wait, what stack top does swapcontext get? I'd thought that A's and
B's stack top would be the same since they're executing on the same
stack.

No, A's stack top would be whatever the stack pointer was when
llvm.getcontext was called to create it. B's stack top would be
whatever the stack pointer was when llvm.swapcontext was called to
create it... it would be further from the common base than A's stack
top. The region between them is what needs to be restored before B can
become active again, assuming that A's stack space remained valid.

I forgot to mention that this depends on the assumption that the
function that created context A did not return to its caller before
the llvm.swapcontext that created context B was executed. And while
I'm at it, what if the function that created context A made a tail
call in the meantime?

As I see it, the context switching mechanism itself needs to know
where to point the stack register when switching. The C routines take
an initial stack pointer when creating the context, and keep track of
it from there. If we don't actually need to interoperate with
contexts created from the C routines, we have a lot more freedom.

I guess the reason to interoperate with contexts from the C routines
would be to support ucontext_t's passed into signal handlers? But then
the LLVM intrinsics need to specify that their context's layout is the
same as ucontext_t's, on platforms where ucontext_t exists.

Or perhaps it can be an argument to the target code generator, if
there's any need to switch "compatibility mode" off and on. All that
the intrinsics require is that context creators be given
a memory area of at least size llvm.context.size() to write contexts
into, and that nothing besides the intrinsics mess with the context
structure.

Yeah, that sounds right.

2. We should be able to support "hard switching" in Stackless Python
by adding a llvm.getcontextstacktop intrinsic. If, as in Kristján's
example, llvm.getcontext is used to create context A, and then
execution continues until context B is created with
llvm.swapcontext(B, A), the region of memory between
llvm.getcontextstacktop(A) and llvm.getcontextstacktop(B) can be saved
and later restored when B is resumed.

Wait, what stack top does swapcontext get? I'd thought that A's and
B's stack top would be the same since they're executing on the same
stack.

No, A's stack top would be whatever the stack pointer was when
llvm.getcontext was called to create it. B's stack top would be
whatever the stack pointer was when llvm.swapcontext was called to
create it... it would be further from the common base than A's stack
top. The region between them is what needs to be restored before B can
become active again, assuming that A's stack space remained valid.

Oops. Either I can't read, or I was confused by the fact that x86
stacks grow down. Probably the reading.

I forgot to mention that this depends on the assumption that the
function that created context A did not return to its caller before
the llvm.swapcontext that created context B was executed. And while
I'm at it, what if the function that created context A made a tail
call in the meantime?

Yep. The opengroup manpages are pretty bad about describing the limits
of setcontext():
http://www.opengroup.org/onlinepubs/007908775/xsh/getcontext.html.
Could you sketch out the restrictions in your document? They may be
identical to setjmp/longjmp, which opengroup does document:
http://www.opengroup.org/onlinepubs/007908775/xsh/longjmp.html.

Me either. Stack copying, mmap'ing, slicing, or whatever should be
done by the scheduler. LLVM would not include any part of a scheduler,
just intrinsics to allow a scheduler to create and switch contexts.

+1

Anyway, I updated the document to take a lot of this discussion into
account. I hope that the assumptions I made actually are universally
applicable in (non-split-stack-enabled) LLVM.

Thanks! Here are some thoughts on your additions. (just the "Stack
management" section, right?)

A working document like this may work better on a wiki, in
http://codereview.appspot.com, or in a public repository rather than
as a series of email attachments. :)

"The context will not carry any information about the maximum stack
space available to it" <- and this is the only line that would need to
change to add split stacks, I think.

"1. A function call will ..." -> "1. A non-tail function call will ..." ?

Item 3 starts referring to "active" and "inactive" contexts, but I
don't think you've introduced the terms.

The opengroup manpages of swapcontext() and friends don't mention
whether it's possible to use them to move a context from one thread to
another. *sigh*. I suspect it works fine, but there's a chance it does
the wrong thing to thread-local variables. Could you add that as an
open question? It'd be nice for LLVM to allow it.

Point 4 is a bit confusing. Normally, it's fine for a thread to share
some of its stack space with another thread, but your wording seems to
prohibit that.

I'll forward your next draft back to the stackless folks, unless you
want to pick up the thread with them.

Thanks,
Jeffrey

I created a wiki at http://code.google.com/p/llvm-stack-switch/

Right now I just copied and formatted the document as-is... I'll go
back over it with your comments in mind soon. One more question,
which you can answer here or there:

Point 4 is a bit confusing. Normally, it's fine for a thread to share
some of its stack space with another thread, but your wording seems to
prohibit that.

Really? How does that work?

I'll forward your next draft back to the stackless folks, unless you
want to pick up the thread with them.

If you're willing to be the go-between, I really appreciate it. I
don't think I have the time to really get involved with Stackless
Python, especially as I would have to learn regular Python first.

I created a wiki at http://code.google.com/p/llvm-stack-switch/

Right now I just copied and formatted the document as-is... I'll go
back over it with your comments in mind soon. One more question,
which you can answer here or there:

Point 4 is a bit confusing. Normally, it's fine for a thread to share
some of its stack space with another thread, but your wording seems to
prohibit that.

Really? How does that work?

void thread1() {
  Foo shared_var;                        // lives in thread1's stack frame
  queue.send(&shared_var);               // hand thread2 a pointer into that frame
  int result = otherqueue.recv();        // block until thread2 is finished with it
  return;                                // only now is it safe to unwind this frame
}

void thread2() {
  Foo* shared_var = queue.recv();        // pointer into thread1's live stack
  otherqueue.send(work_on(shared_var));  // use it, then let thread1 proceed
}

is legal with posix threads. It's just illegal to return out of a
function while its stack space is used by another thread. I've seen
this used inside a condition variable implementation, among other
places.

I'll forward your next draft back to the stackless folks, unless you
want to pick up the thread with them.

If you're willing to be the go-between, I really appreciate it. I
don't think I have the time to really get involved with Stackless
Python, especially as I would have to learn regular Python first.

Sure.

I'm very interested in seeing support for stack/context switching in LLVM, if only for prototyping language ideas. I'm particularly interested in mechanisms that would make it possible to implement full asymmetric coroutines as described in "Revisiting Coroutines" (Moura & Ierusalimschy, Feb 2009 TOPLAS). From skimming the thread and looking at the llvm-stack-switch wiki, it looks like you're headed more in the direction of symmetric coroutines.

I've read that there is a Lua JIT based on LLVM, but haven't looked into the details of how coroutines are implemented there.

In skimming through this thread I see some apparent requirements that I would hope could be avoided - e.g. the existence of mmap, any memory allocation going on "under the covers", or a requirement that a front-end do CPS conversion - it looks like later email has made this same point, so perhaps this is not being considered any longer.

One thing I don't think I've seen mentioned so far is the interplay between swapcontext() and register allocation - I would hope a high performance implementation would exist that would only result in registers that are currently live being saved/restored at these points, not just a general save/restore of register state.

I'm very interested in seeing support for stack/context switching in LLVM, if only for prototyping language ideas. I'm particularly interested in mechanisms that would make it possible to implement full asymmetric coroutines as described in "Revisiting Coroutines" (Moura & Ierusalimschy, Feb 2009 TOPLAS). From skimming the thread and looking at the llvm-stack-switch wiki, it looks like you're headed more in the direction of symmetric coroutines.

According to the paper you linked, asymmetric coroutines "provide
two control-transfer operations: one for invoking a coroutine and one for
suspending it, the latter returning control to the coroutine invoker. While
symmetric coroutines operate at the same hierarchical level, an asymmetric
coroutine can be regarded as subordinate to its caller, the relationship between
them being somewhat similar to that between a called and a calling
routine." The constructs proposed for LLVM are intended to support
both symmetric and asymmetric coroutines (along with fibers and other
things) - each context carries a "linked" context that represents the
invoker of the given one, and control can be transferred back to it.
The front-end can support a "coreturn" statement that does this
automatically.
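
A hedged sketch of what a front-end's "coreturn" could lower to, with
placeholder names for the linked-context operations described above:

extern void  llvm_swapcontext(void *save_into, void *resume);  /* hypothetical */
extern void *llvm_getlink(void *ctx);                          /* hypothetical: the invoking context */

/* Asymmetric suspend: hand control back to whoever invoked this coroutine. */
void coreturn(void *self) {
    llvm_swapcontext(self, llvm_getlink(self));
}

/* Symmetric transfer is the same primitive, just ignoring the link. */
void cotransfer(void *self, void *target) {
    llvm_swapcontext(self, target);
}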

I've read that there is a Lua JIT based on LLVM, but haven't looked into the details of how coroutines are implemented there.

In skimming through this thread I see some apparent requirements that I would hope could be avoided - e.g. the existence of mmap, any memory allocation going on "under the covers", or a requirement that a front-end do CPS conversion - it looks like later email has made this same point, so perhaps this is not being considered any longer.

Right now, I'm envisioning these more as possible strategies to allow
huge numbers of context stacks to exist in a limited address space,
not a requirement imposed by or on LLVM to do the context switching.
As far as I can tell, any conceivable strategy (other than segmented
stacks) really belongs outside of LLVM, and should be handled by a
scheduler, runtime library, or front-end.

One thing I don't think I've seen mentioned so far is the interplay between swapcontext() and register allocation - I would hope a high performance implementation would exist that would only result in registers that are currently live being saved/restored at these points, not just a general save/restore of register state.

I'd like to see that too, and it's one of several things that's
convincing me that simply lowering to the C routines, even in
environments where that would work, is not really what we want to end
up doing unless we need compatibility for some reason.

One simple strategy would be to model a swapcontext as clobbering
every register. Then only the values that are actually live across it
get spilled to the context stack beforehand and restored afterward,
rather than saving and restoring the entire register file.

Their reply: http://thread.gmane.org/gmane.comp.python.stackless/4464/focus=4475

Instead of @llvm.getcontextstacktop(%context), they want
@llvm.getcurrentstacktop() so that they can copy the stack out before
switching, and copy it back in after switching. I'm not sure this
exactly works either, since before "copying the stack back", any
allocas are pointing at old-frame data and so are invalid. They'd need
to guarantee that any relevant data is in machine registers rather
than the stack during the switch, but LLVM doesn't provide a way to do
that. The cleaner way to do this is to switch to an intermediate stack
scheduler context with its own stack, have it replace the stacks, and
then switch to the target context back on the original stack. But that
requires two swapcontext() calls, which seems less than ideal
performance-wise.
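
For comparison, the two-hop version might look roughly like this,
using the POSIX routines for concreteness; save_stack/restore_stack
are placeholders for whatever copying scheme the scheduler uses, and
the point is that the copying happens while neither tasklet's stack is
live:

#include <ucontext.h>

static ucontext_t sched_ctx;           /* the scheduler runs on its own small stack */
static ucontext_t *current, *pending;  /* tasklet being left / tasklet to resume    */

extern void save_stack(ucontext_t *ctx);     /* placeholder: copy ctx's stack out to the heap */
extern void restore_stack(ucontext_t *ctx);  /* placeholder: copy ctx's stack back in         */

void switch_to(ucontext_t *target) {
    pending = target;
    swapcontext(current, &sched_ctx);      /* hop 1: onto the scheduler's stack */
}

void scheduler(void) {
    for (;;) {
        save_stack(current);               /* safe: current's frames are not live here */
        restore_stack(pending);
        current = pending;
        swapcontext(&sched_ctx, current);  /* hop 2: into the target tasklet */
    }
}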

Since the backend _can_ ensure that the relevant data is in machine
registers, do you think it makes sense to provide a swapcontext() that
also moves the stack? Or is there another way to do this that I'm
missing?

Jeffrey

I'll forward your next draft back to the stackless folks, unless you
want to pick up the thread with them.

Their reply: http://thread.gmane.org/gmane.comp.python.stackless/4464/focus=4475

(From the reply)

But any function call we perform after the
swapcontext() may trample the unsaved stack, if the source and destination stack positions
overlap.

I'm having trouble visualizing that situation. Is there a "main"
context that will handle all this saving and restoring of stacks? If
so, does it actually share stacks with other contexts in the way
they're expected to share it with each other?

Instead of @llvm.getcontextstacktop(%context), they want
@llvm.getcurrentstacktop() so that they can copy the stack out before
switching, and copy it back in after switching. I'm not sure this
exactly works either, since before "copying the stack back", any
allocas are pointing at old-frame data and so are invalid. They'd need
to guarantee that any relevant data is in machine registers rather
than the stack during the switch, but LLVM doesn't provide a way to do
that. The cleaner way to do this is to switch to an intermediate stack
scheduler context with its own stack, have it replace the stacks, and
then switch to the target context back on the original stack. But that
requires two swapcontext() calls, which seems less than ideal
performance-wise.

OK, I see you don't want to have a "main" context that does this and
dispatches other contexts if it can be avoided. Compared to copying
stacks, though, I'm not sure that two swapcontexts will be that bad,
especially if we don't implement them as posix library function calls.

Since the backend _can_ ensure that the relevant data is in machine
registers, do you think it makes sense to provide a swapcontext() that
also moves the stack? Or is there another way to do this that I'm
missing?

Can it really ensure that on all targets? Some of them don't exactly
have an abundance of registers.

I'll forward your next draft back to the stackless folks, unless you
want to pick up the thread with them.

Their reply: http://thread.gmane.org/gmane.comp.python.stackless/4464/focus=4475

(From the reply)

But any function call we perform after the
swapcontext() may trample the unsaved stack, if the source and destination stack positions
overlap.

I'm having trouble visualizing that situation. Is there a "main"
context that will handle all this saving and restoring of stacks? If
so, does it actually share stacks with other contexts in the way
they're expected to share it with each other?

Instead of @llvm.getcontextstacktop(%context), they want
@llvm.getcurrentstacktop() so that they can copy the stack out before
switching, and copy it back in after switching. I'm not sure this
exactly works either, since before "copying the stack back", any
allocas are pointing at old-frame data and so are invalid. They'd need
to guarantee that any relevant data is in machine registers rather
than the stack during the switch, but LLVM doesn't provide a way to do
that. The cleaner way to do this is to switch to an intermediate stack
scheduler context with its own stack, have it replace the stacks, and
then switch to the target context back on the original stack. But that
requires two swapcontext() calls, which seems less than ideal
performance-wise.

OK, I see you don't want to have a "main" context that does this and
dispatches other contexts if it can be avoided. Compared to copying
stacks, though, I'm not sure that two swapcontexts will be that bad,
especially if we don't implement them as posix library function calls.

Good point about the relative cost of copying the stack. I'll suggest
the "main" context to them.

Since the backend _can_ ensure that the relevant data is in machine
registers, do you think it makes sense to provide a swapcontext() that
also moves the stack? Or is there another way to do this that I'm
missing?

Can it really ensure that on all targets? Some of them don't exactly
have an abundance of registers.

I guess I mean that the relevant data has to be inside the context
data. Registers would accomplish that, but they're not necessary.

Jeffrey