LLVM and coroutines/microthreads

I saw this was mentioned briefly last year, but there seemed to be
some confusion as to what coroutines entailed and the thread died out.
This technique has an unfortunate number of names, but it does get a
lot of use, including popular languages like Ruby.

I'm currently working on a programming language called Minnow
(http://www.minnow-lang.org). It's an actor-based language, where
each actor has its own microthread. In the current implementation I'm
manually saving and restoring the stack and live variables to get
this effect. After spending months wrestling with these manual
continuations, I can safely say they are a total pain in the butt.
The story doesn't improve much with 3rd-party coroutine libraries,
which have their own limitations (for instance, can you resume a
microthread on a different OS thread than the one it was created on?).

The million dollar question is: can LLVM capture the live variables
and stack into a buffer? Or, even just save off register variables to
the stack, and give me a way to capture and restore that stack (and
pc)?

Thanks,

Jonathan

> I saw this was mentioned briefly last year, but there seemed to be
> some confusion as to what coroutines entailed and the thread died out.
> This technique has an unfortunate number of names, but it does get a
> lot of use, including popular languages like Ruby.
>
> I'm currently working on a programming language called Minnow
> (http://www.minnow-lang.org). It's an actor-based language, where
> each actor has its own microthread. In the current implementation I'm
> manually saving and restoring the stack and live variables to get
> this effect. After spending months wrestling with these manual
> continuations, I can safely say they are a total pain in the butt.
> The story doesn't improve much with 3rd-party coroutine libraries,
> which have their own limitations (for instance, can you resume a
> microthread on a different OS thread than the one it was created on?).

First, I will assume that you have read
http://www.nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt
and if you have not, do so.

I am also working on an Actor-based language. I initially started
with Erlang (too difficult to bind into C++, where I have some
required libraries). I then went to Stackless Python, which is almost
stupid-easy to bind with, but the switching was too slow any time
there was a native frame in an Actor's callstack, as it did what you
are doing: copying the stack around. Without a native frame on the
callstack, switching was pretty fast, but since Python does not
operate over multiple threads simultaneously I was forced to create
multiple processes and link them as a mesh, a la Erlang. That worked
well, but was slower than heck due to all of the serialization and
such. I then created two implementations in C++: one was messy stack
copying as you are currently doing, but it had a nice interface; the
second was ugly and required a lot of scaffolding, but held all
values that needed to be moved around on the heap instead of the
stack. It was much faster, and although I had no qualms with all the
scaffolding, others did not even understand what was going on.

Honestly, I would use Erlang, but I have a few restrictions. First, I
need to bind with a rather large C++-only library, and binding C++
and Erlang (by creating an Erlang driver, or by creating an
Erlang-compatible C++ node) was a new level of horror (although I do
foresee that a really good C++ binding library could be made, with
heavy use of templates to get rid of much of the ugliness). Second,
some of the things I need to use it in are larger apps that I have no
control over, so I need to be able to embed it into an app (which
Stackless Python was able to do beautifully; I made a nice bit of
reusable scaffolding code for that). If not for these two
restrictions Erlang would be pretty nice, and it is still quite fast
(although I would certainly not balk at something closer to C speed
for linear code).

And yes, I have to agree, every C/C++ coroutine library sucked beyond
all heck: either stack fiddling, or they use crap like Microsoft's
micro-thread implementation on Windows (whoever chose to use that is
a moron; did you know that if you tell it you only need, say, 100k of
stack space for a micro-thread, it still allocates and gives it
memory equal to a system thread, meaning about 4 megs by default? So
yeah, that does not scale well at all unless you set a compile option
to vastly reduce your system thread size as well, which cannot always
be done...).

Thus enters LLVM. As C++ is generally called C with classes (although
it is so much more), I started creating a language that could be
described as C with Actors. First thing: remove every concept of
global state, so no global variables, etc. I want this to be a 'safe'
language; if someone really needs power they can write something in
C/C++ and link it in. Over a few months I actually started replacing
a bit of C syntax, so it does not really look purely like C anymore,
but it is easier to write, in my opinion.

Basically I have two types of 'functions' in my language. There are
normal C-style functions: they use the stack, they are fast, they can
be well optimized, etc. However, they can only call other normal
functions and they cannot suspend execution (also note, this language
is cooperatively microthreaded, not pre-emptive, so it requires a bit
more work on the programmer's part, but I intend to minimize that
with new language primitives; this way it is a lot faster and has a
great deal more control over its own execution). The other type is
much more free: it can call others that can also suspend, as well as
normal functions. When a normal function is called, it is called
like any normal C function, uses the stack, etc. The other function
type (an actor function), when compiled, becomes lots of little
functions, broken up at every call to another actor function (and I
intend to add a few more places to break it up, such as analyzing
loops and, if they are heavy, breaking them up at the loop iterations
too). The above-linked document describes well how I break it up,
with a few slight changes. An actor function also uses the stack;
however, if any variable has any uses past another actor function
call, it is stuffed into a struct and passed to the actor function
call as the first parameter. This made coding my front-end notably
more difficult, since I have to put it in SSA form directly: the
alloca-to-SSA pass does not work if you 'alloca' from some passed-in
memory (I really wish that could be fixed; I am thinking of copying
that pass to a new name and altering it, which would save me so much
work). The actor function's first (hidden) parameter is a pointer to
some memory: a stack. When an actor function calls another actor
function (meaning that this is the end of the function and it will be
continued at one of the split points), it 'pushes' each SSA variable
that has a use past this point onto the passed-in 'stack', and
increments the pointer by the size of the memory that the SSA
variable took up, like a real stack. At the resume point it 'pops'
all the memory out of the passed-in stack back into SSA variables,
which LLVM can put in registers or on the real stack.

As stated, an actor function can only be called by other actor
functions, and together they make up a single callstack on the Actor.
However, a normal function (as well as an actor function) can 'new' a
new Actor stack. Basically, the Actor that will start the new
microthread is passed to a scheduler along with a constant integer
that represents the maximum 'stack' space the passed-in Actor stack
will require (reference:
http://www.nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt).
The scheduler allocates the necessary memory and calls the Actor with
the newly created memory passed in (plus some scaffolding: the
beginning of the 'stack' also contains a pointer to exception
handling structures and so forth). I handle exceptions very
differently in this language, as well as destructors: they are
appended onto a list of function calls that will clean up memory,
handle exceptions, etc. as necessary, since they all have to be
contained in this allocated memory range instead of on the real
stack. So my exception handling has some overhead, which is just a
memory write of 2*sizeof(void*) any time a new destructor or
exception handler needs to be registered. Throwing is faster, since
that stack is not unwound or read; it just immediately jumps through
the handlers, running destructors, until it gets to something that
wants to stop and handle it.

As you can see, an Actor's callstack will stay inside that Actor
until an explicit switch is asked for (I may add some implicit ones
in tight loops and such; I'm still debating how much I want hidden
from the programmer, and if I do, I will add a keyword to deactivate
it for a given loop regardless). A switch can be a basic switch or a
message pass; both are handled differently than in Erlang, I would
imagine. A basic switch (no message pass) is simple: it asks the
scheduler for one thing, a pointer to the top of the memory stack of
the next microthread that needs to run. It then pops off a function
pointer (an actor function that is split up; all the splits have the
same signature, taking a pointer to some untyped memory and another
pointer to a struct of a certain format, used for the message passing
described later, but null in the switch case), serializes up the SSA
values it needs to keep for its own continuation (a switch is also a
split point, as should be obvious), gives that to the scheduler to
schedule as it sees fit, then executes the popped function pointer,
passing the new microthread's top-of-'stack' to it (then returning,
so it is always a tail call; this is why I asked on the list a while
back whether indirect function calls are tail-call optimized, as this
design requires them to be, so it is fast). A normal switch should be
minimized if possible, occurring only on actor function call
boundaries; if the programmer wishes to pause, they should handle a
message while they are at it.

Passing a message is similar to a switch, but it asks a different
scheduler for a specific microthread that is actively waiting for a
message. If it is not actively waiting (the receiving microthread is
at a normal split that does not handle messages, for example, or is
running on another OS thread), then it gets rather heavy: it queries
the main world handler to see whether the microthread even exists (it
could have returned, for example). If it does not exist, the message
is deallocated (put back into a reusable queue; I may add garbage
collection to this part, but it works fine as is, just have another
actor query it once in a while to see what has not been used recently
and free that memory). If it does exist but is just not waiting for
a message, then right now the message is given to a freshly created
Actor (these are designed to be short-lived, using almost no stack
space, no exception handling and so forth, and their memory is always
kept around so they can be reused as necessary) that just keeps
trying to send the message every once in a while. These make up a
'fake' queue/mailbox (remember, the Actor model is unbounded
nondeterminism, and not everything will need queues, so even queues
are modeled as Actors). I do intend to overhaul this design, though:
I'm thinking of creating, per Actor, another Actor as necessary that
holds a queue internally to do dispatching, replacing the main
Actor's link to itself so it then handles all messaging. Just like
in Erlang, it will be good to receive all messages (regardless of
matching) just to clear things out, although it would be quite
possible for a queue Actor to throw an exception onto the stack of
its main Actor if there are too many messages in the queue.
Hopefully there should be few messages queued, as most messages sent
will be instantly received without any queuing; I need to go over the
design, though, to make sure it will scale well, and make changes as
necessary.

To get a message, the Actor serializes up its SSA variables and
passes its stack to another scheduler (one that does not actually do
any scheduling; it just handles message passing), then asks the main
scheduler for the next thing to run and runs it as in a normal
switch. Thus it is a normal switch, except that the Actor will not
continue until a message is passed, instead of just after some
arbitrary time.

As stated earlier, when an Actor is created I pre-allocate all the
stack that the function will require: all necessary exception handler
space, destructor pointer space, and space for the SSA values that
need to be saved, not only for the actor function itself, but also
for all other actor functions it will directly call. I also intend
to add the ability to call actor function pointers, and that will
only take up sizeof(void*) on the stack, since the called actor
function will need to create its own extra space on top of that (when
the address of an actor function is taken, I intend to wrap it in
another function that creates the stack as necessary, so I do not
have to create two versions of that possibly large actor function).

When you are waiting for a message in Erlang, you can also specify a
timeout. I have not implemented that yet either, but I intend to:
when a timeout is specified, another message match is implicitly
added that matches a timeout message, and a message is sent off to a
timer scheduler, which will send back a timeout message carrying a
certain ID after the time delay. If a message is received before the
timeout, a cancel message is sent out (although the timeout message
could have been sent first but not yet received, which is why the ID
has to match when the Actor receives it; otherwise the Actor discards
the timeout message, since it was probably old). Do note, in the
compiled code much of this handling is just function calls, so there
is no real overhead.

I use no locking primitives for the things that are multi-threaded
(such as the timer), relying instead on liberal use of instructions
such as atomic compare-and-swap (my favorite, or at least most used;
I also wrote the assembly directly for those in the C++ actor
libraries I created long ago). I want every thread to be as busy as
possible. Any actor may end up being resumed on a different thread
every time, if the timing works out that way (luck of the draw).

I also intend to implement mesh functionality, like in Erlang, and I
intend to set that up with more lightweight Actors that just act as
proxies. When a pid (in Erlang terms) of an Actor on a remote system
is received, a local proxy Actor is set up that handles the network
passing from the local to the remote system, while the remote proxy
Actor converts the network messages back into normal messages and
passes them to the real Actor. I intend for these proxy Actors to be
situation-dependent: if you have two programs on the same machine,
they will use shared memory to pass between them; perhaps TCP/IP if
someone wants a license-free networkable way, but for my main network
method I intend to create a RakNet binding, since it is a great deal
faster and easier to handle than TCP/IP while still fully reliable.
I intend to make it quite simple to create bindings, so a binding to
a serial port would be easy to make, or whatever else anyone may
need. If I feel up for it I may make a UDP binding with reliability
built on top, since it should be faster than TCP/IP when the
reliability is set up in such a way that it matches the system.

I do realize that stuffing the live SSA values into my passed-in
stack is not perfectly efficient, but since LLVM supplies no way to
specify a different stack pointer for alloca to use instead of the
system stack, I do not have much choice.

So far the bit I have created works well enough; I just have had no
time to work on it the past 4 months due to my final year in school
plus work taking up every bit of free time I have (and much of my
sleep time; I only average 3 hours a night, ugh...). I intend to
continue in about one month, when I will have vastly more free time.
Hopefully you can take this design I am using and use it as you wish
as well. Right now I do the function splitting in my own tree, but I
have been throwing around the idea of making a module pass to do it
at the LLVM level with some annotation information tossed in; it
would simplify my code quite a bit since LLVM holds use information
and so forth. If I do end up making such a pass, it may become
useful for your project as well, if you choose to use a similar
design. Although if you manage to create one before I get my free
time back in a month, I would not complain. :)

> The million dollar question is: can LLVM capture the live variables
> and stack into a buffer? Or, even just save off register variables to
> the stack, and give me a way to capture and restore that stack (and
> pc)?

So yes, LLVM can capture the live variable information and stack into
a buffer, but it currently has no functionality to do so; you have to
do it yourself, as I do. The above way is vastly faster than
memcpy'ing entire stacks around (as we do in C++ itself to simulate
full coroutines), though.

Also, I apologize for the length; once I start explaining things I
find it hard to stop. You should have seen how I was going on about
how s-expressions are better than XML yesterday at work (and no, we
use neither; it was a philosophical argument)...

- OvermindDL1

> First, I will assume that you have read
> http://www.nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt
> and if you have not, do so.

I hadn't. That's very similar to what I tried early on, but I found
it was actually slower than managing my own stacks lazily (on
continuation, the functions check a state variable for whether or not
to freeze live variables; the state var is kept pretty warm). I
haven't gone nearly as deeply as you have, though.

It looks like the article is suggesting that you keep a pointer and
do a bunch of derefs on it in all cases. That just sounds
inefficient, unless I'm missing something.

> And yes, I have to agree, every C/C++ coroutine library sucked beyond
> all heck, either stack fiddling, or they use crap like Microsoft's

Looks like pretty much all the options are going to require
bookkeeping, hackery, or both. Even my stack copying suggestion has
restrictions on stack size.

> timeout, I have not implemented that yet either, but I intend to:
> when a timeout is specified, another message match is implicitly
> added that matches a timeout message, and a message is sent off to a
> timer scheduler, which will send back a timeout message carrying a
> certain ID after the time delay.

That's a handy idea. This technique reminds me of Orc; if you haven't
seen it, you should check it out: http://orc.csres.utexas.edu/

> I use no locking primitives for the things that are multi-threaded
> (such as the timer), relying instead on liberal use of instructions
> such as atomic compare-and-swap (my favorite, or at least most used;

Funny you should mention this. I've spent the better part of the last
few weeks trying to get a CAS solution to be rock-solid. Then it
turned out I could make a locky solution that performed similarly to
the CAS solution by using tiny (2-3 instruction) locked areas. If
you look through The Art of Multiprocessor Programming, you'll see
that a lot of the solutions use multiple CASes per action. As best I
can tell, if you CAS more than twice per task, you're starting to get
into tiny-lock territory performance-wise (at least on Intel
platforms).

> system is received, a local proxy Actor is set up that handles the
> network passing from the local to the remote system, while the
> remote proxy Actor converts the network messages back

Sounds good. I've thought about something like this, but shelved the
idea for the first release.

> So yes, LLVM can capture the live variable information and stack into
> a buffer, but it currently has no functionality to do so; you have to
> do it yourself, as I do. The above way is vastly faster than
> memcpy'ing entire stacks around (as we do in C++ itself to simulate
> full coroutines), though.

From what you're telling me, as far as LLVM is concerned there are
ways to do it manually; you just have to pick one that suits the
project. There aren't any automatic solutions, nor hooks to attach
your own. I admit I assumed that might be the case, since there isn't
any "standard" coroutine style.

Jonathan

> It looks like the article is suggesting that you keep a pointer and
> do a bunch of derefs on it in all cases. That just sounds
> inefficient, unless I'm missing something.

It is just treating it identically to how the real system stack is
treated. As stated, most of the code will still keep it in registers
and so forth since the front-end outputs the SSA form directly.

> Looks like pretty much all the options are going to require
> bookkeeping, hackery, or both. Even my stack copying suggestion has
> restrictions on stack size.

My method has no such restriction, and I do have a decorator keyword
for actor functions that causes a new stack to be created at that
point (good for a function that may use a lot of stack space but is
called rarely, like one with an alloca'd known-size array). The only
real limit is system memory.

> That's a handy idea. This technique reminds me of Orc; if you haven't
> seen it, you should check it out: http://orc.csres.utexas.edu/

I had not; thanks for the link, I will look it over shortly.

> Funny you should mention this. I've spent the better part of the last
> few weeks trying to get a CAS solution to be rock-solid. Then it
> turned out I could make a locky solution that performed similarly to
> the CAS solution by using tiny (2-3 instruction) locked areas. If
> you look through The Art of Multiprocessor Programming, you'll see
> that a lot of the solutions use multiple CASes per action. As best I
> can tell, if you CAS more than twice per task, you're starting to get
> into tiny-lock territory performance-wise (at least on Intel
> platforms).

The only place such things are ever needed is when something is
pushed onto a queue. In this pattern no memory is, nor can be,
aliased by any other Actor, so things stay very self-contained; the
only synchronization necessary is when pushing something onto a list,
which should require only one atomic CAS. I have not yet run into
anything that needed more (although it is possible; the language is
nowhere near finished yet, and I need some free time again for that).

On Thu, Apr 16, 2009 at 2:44 PM, Jonathan D. Turner
<jonathan.d.turner@gmail.com> wrote:

> From what you're telling me, as far as LLVM is concerned there are
> ways to do it manually; you just have to pick one that suits the
> project. There aren't any automatic solutions, nor hooks to attach
> your own. I admit I assumed that might be the case, since there isn't
> any "standard" coroutine style.

The one thing I wish for so much is the ability to specify your own
pre-allocated memory for alloca instructions; then the passes that
raise allocas to SSA and so forth would just work fine.