[RFC] Coroutines

Hi llvmdev!

I've been working on adding coroutines to LLVM. Mentioned below is the
implementation plan I'm following, for suggestions, flames and other
input. Using segmented stacks is a prerequisite.

The idea is to associate every coroutine with a coroutine descriptor. A
coroutine descriptor consists of four words: w0, w1, w2 and w3.

w0 always contains the _launcher_, and invoking a coroutine always
involves setting a predefined register (%eax or %rax on x86) to point to
the descriptor and calling into w0. The contents of w1, w2 and w3 are
decided by the state the coroutine is in. As of now, there are two
possibilities:

A. The coroutine has just been created:

w0 is set to __launch_coroutine, which starts the coroutine by mallocing
some initial stack space, calling into
__generic_morestack_set_initial_sp (in libgcc) and then finally calling
into the "main" function for that coroutine. Once the "main" function
returns, a cleanup routine frees the stack.

Before calling the "main" function, it also sets w0 to __continue_coroutine.

In this state:

w1 contains a pointer to the "main" function
w2 contains a pointer that is passed as an argument to the "main" function
w3 contains the initial stack size.

B. The coroutine has already been invoked once:

w0 is set to __continue_coroutine. __continue_coroutine sets the stack
pointer correctly, and jumps to where the last coroutine switch left
off. In this state:

w1 contains the IP to jump to.
w2 contains the stack pointer.
w3 is undefined.

Directly based on previous discussions on the list, I think it makes
sense to add the following intrinsics to LLVM:

// Returns the size of a coroutine descriptor.
i32 llvm.coroutine_descriptor_size()

// Creates a new coroutine, which starts off with the function
// `function' being called with the argument `baton'. For this to
// work correctly, function must be of the type `void () (i8 *)'.
// `stack_size' will be the initial stack size. `descriptor' is a
// block of memory with at least as many bytes as returned by
// llvm.coroutine_descriptor_size().

void llvm.coroutine_new(i8 *descriptor, i8 *function, i8 *baton, i32
stack_size)

// Store the state of the current coroutine in `this' and switch to
// the one described by `new'.
void llvm.coroutine_switch(i8 *this, i8 *new)

It should be possible to have the LLVM register allocator correctly
handle persisting the registers by replacing llvm.coroutine_switch with
a pseudo instruction which has every register in its Def list. This will
also have the advantage of not saving unneeded registers.

One issue is deciding a home for __launch_coroutine and
__continue_coroutine. Currently I'm coding based on these two being in a
separate library of their own; but they can be merged into compiler-rt
if required.

Thanks!

Hi llvmdev!

I've been working on adding coroutines to LLVM. Mentioned below is the
implementation plan I'm following, for suggestions, flames and other
input. Using segmented stacks is a prerequisite.

I think my only comment is that, while this would probably work, implementing it in C with a bit of assembly for saving/restoring registers also does.

What are the benefits of doing it in llvm itself?

The idea is to associate every coroutine with a coroutine descriptor. A
coroutine descriptor consists of four words: w0, w1, w2 and w3.

w0 always contains the _launcher_, and invoking a coroutine always
involves setting a predefined register (%eax or %rax on x86) to point to
the descriptor and calling into w0. The contents of w1, w2 and w3 are
decided by the state the coroutine is in. As of now, there are two
possibilities:

A. The coroutine has just been created:

w0 is set to __launch_coroutine, which starts the coroutine by mallocing
some initial stack space, calling into
__generic_morestack_set_initial_sp (in libgcc) and then finally calling
into the "main" function for that coroutine. Once the "main" function
returns, a cleanup routine frees the stack.

Before calling the "main" function, it also sets w0 to __continue_coroutine.

In this state:

w1 contains a pointer to the "main" function
w2 contains a pointer that is passed as an argument to the "main" function
w3 contains the initial stack size.

B. The coroutine has already been invoked once:

w0 is set to __continue_coroutine. __continue_coroutine sets the stack
pointer correctly, and jumps to where the last coroutine switch left
off. In this state:

w1 contains the IP to jump to.
w2 contains the stack pointer.
w3 is undefined.

Directly based on previous discussions on the list, I think it makes
sense to add the following intrinsics to LLVM:

// Returns the size of a coroutine descriptor.
i32 llvm.coroutine_descriptor_size()

// Creates a new coroutine, which starts off with the function
// `function' being called with the argument `baton'. For this to
// work correctly, function must be of the type `void () (i8 *)'.
// `stack_size' will be the initial stack size. `descriptor' is a
// block of memory with at least as many bytes as returned by
// llvm.coroutine_descriptor_size().

void llvm.coroutine_new(i8 *descriptor, i8 *function, i8 *baton, i32
stack_size)

// Store the state of the current coroutine in `this' and switch to
// the one described by `new'.
void llvm.coroutine_switch(i8 *this, i8 *new)

It should be possible to have the LLVM register allocator correctly
handle persisting the registers by replacing llvm.coroutine_switch with
a pseudo instruction which has every register in its Def list. This will
also have the advantage of not saving unneeded registers.

One issue is deciding a home for __launch_coroutine and
__continue_coroutine. Currently I'm coding based on these two being in a
separate library of their own; but they can be merged into compiler-rt
if required.

Thanks!

Cheers,
Rafael

Please look at architectures with register window like SPARC. They are
the hard case for efficient co-routines.

Joerg

Hi,

The benefits are quite simple.

When you do it with setjmp/longjmp oder context (which is deprecated in OS X Lion) your program uses lots of memory because each coroutine need its own stack (I believe for context its 32k per thread).
As ucontext is a kernel call, this uses a lot of time as well. The setjmp/longjmp implementations usually invalidate pointers to objects on other coroutines calling stacks.
Also, the debugging with gdb isn't really nice.
Implementing coroutines with C macros is not really an option either.

Doing everything with llvm would allow the compiler to do everything without stack switching. You could turn each coroutine into a state machine. If a context change occurs, llvm would return from one coroutine to the "llvm scheduler" without unwinding the state machines calling stack (this should be possible because of the usage of segmented stacks). Then it would execute the other coroutine from its current state.
This can also be imagined as splitting each coroutines into small tasks ( the program between two context switches would become one task). And then the "llvm scheduler" would execute a full task and after it is done it executes the next task (the one that the coroutine wanted to switch to).

Also, there are lots of optimization opportunities if llvm knows what a coroutine is. For example, quite often it is possible that you don't need to switch to the other coroutine but can just inline some simple code.

Toralf Niebuhr

Hi,

The benefits are quite simple.

When you do it with setjmp/longjmp or context (which is deprecated in OS X Lion) your program uses lots of memory because each coroutine need its own stack (I believe for context its 32k per thread).
As ucontext is a kernel call, this uses a lot of time as well. The setjmp/longjmp implementations usually invalidate pointers to objects on other coroutines calling stacks.
Also, the debugging with gdb isn't really nice.
Implementing coroutines with C macros is not really an option either.

Doing everything with llvm would allow the compiler to do everything without stack switching. You could turn each coroutine into a state machine. If a context change occurs, llvm would return from one coroutine to the "llvm scheduler" without unwinding the state machines calling stack (this should be possible because of the usage of segmented stacks). Then it would execute the other coroutine from its current state.
This can also be imagined as splitting each coroutines into small tasks ( the program between two context switches would become one task). And then the "llvm scheduler" would execute a full task and after it is done it executes the next task (the one that the coroutine wanted to switch to).

Also, there are lots of optimization opportunities if llvm knows what a coroutine is. For example, quite often it is possible that you don't need to switch to the other coroutine but can just inline some simple code.

Toralf Niebuhr

(sorry for double post….I used the wrong mail address)