Efficient Green Thread Context-Switching

Hi LLVM devs,

I’d like to describe my problem, and then propose new LLVM features that would solve it efficiently.

I’m building a new statically compiled programming language, and I plan on using LLVM as the backend. The language will have a runtime with cooperatively managed green threads (user-space “mini-threads”, each with their own dynamically allocated stack). A single OS thread will be responsible for managing many green threads (perhaps millions).

Essentially, the program flow is like this:

  1. Program is executing Green Thread 1 (GT1)
  2. Program reaches a “yield point” which does three things:
    a. Saves all live registers to the stack.
    b. Calls a runtime function runtimeYield().
    c. Pops the saved registers from the stack, restoring the original CPU state (we’ll get to this later).
  3. The runtimeYield function is called with a calling convention that stores the return address on top of the stack.
  4. Inside runtimeYield, we record the stack pointer (%rsp) within a queue maintained by the scheduler (used to keep track of all existing green threads).
  5. Now we need to resume some other green thread (GT2). First we set the stack pointer (%rsp) to reference GT2’s stack. Then we pop GT2’s return address from the top of its stack and “jmp” to it, landing exactly where GT2 previously yielded.
  6. As GT2 resumes execution, it’s on the second half of a “yield point”, so it restores its original CPU state by popping saved registers from the stack.

This model has the potential to be extremely efficient, likely more so than existing library-based green thread implementations, but it can only achieve that efficiency through compiler support. It is borrowed from a research paper published in 2013 (found here: https://dl.acm.org/doi/10.1145/2400682.2400695). The paper describes how, by designing the context switch as a compiler intrinsic instead of a runtime library function, many existing compiler optimizations can be automatically applied to user-space context switches.

The paper shows the results of many benchmarks, all of which show this model heavily outperforming existing models of context switching, in both speed and memory usage. For example, it mentions that the POSIX functions makecontext() and swapcontext() save a structure that’s 936 bytes in size (768 on my machine), but their model generates context switches that save only 8-80 bytes of CPU state, depending on which existing optimizations are able to apply to the specific context switch. In some cases, the entire context switch is reduced to 1 or 2 instructions.

Conveniently, the authors of that research paper actually implemented their model in a fork of LLVM! (found here: https://github.com/stedolan/llvm). The model is essentially implemented with just two additions: a new calling convention called “swapstack”, and a new intrinsic function called “newstack”. They demonstrate that these two primitives can be used not just for user-level context-switching, but also for implementing efficient coroutines and generators, all while taking advantage of existing LLVM optimizations. I encourage you to read the entire research paper, as it explains the implementation of “swapstack” and “newstack” in great detail.

I also found a conversation in the mailing list from 2010, which shows someone else interested in the same functionality (found here: http://lists.llvm.org/pipermail/llvm-dev/2010-April/030914.html).

I propose that the LLVM devs consider adding “swapstack” and “newstack” into the LLVM project officially.

Thank you for taking the time to read this proposal, and thank you for all your hard work on LLVM.


  • Joshua Wise

Take a look at the setjmp/longjmp intrinsics. IMO it is dishonest to
compare with makecontext/swapcontext, which generally have to save the
full FPU state and that's where the majority of the data is.


Sorry, I certainly didn’t mean to be dishonest; I was just repeating one of the comparisons given by the research paper. Regardless, even setjmp() uses a structure 148 bytes in size (on my machine). The main point is that with compiler support, many context switches can be reduced to just a few instructions and only 8 bytes of memory, and even in the worst case they will outperform setjmp() by a factor of 2. That’s very significant for modern servers, which could expect millions of contexts to exist simultaneously. It would also really help small IoT processors multitask, which is something they’re normally bad at. I think the use cases and the efficiency achieved are compelling (especially as computing becomes more and more parallel/concurrent).

Let me repeat. Please take a look at the setjmp/longjmp intrinsics. On
supported architectures they boil down to at most 5 pointers.


The first time you said “setjmp/longjmp intrinsics” I thought it was a typo, and that you were talking about the C standard library functions. If you’re actually talking about LLVM intrinsics, are they documented? I don’t see any intrinsics with those names in the language reference (https://llvm.org/docs/LangRef.html).

https://llvm.org/docs/ExceptionHandling.html .


Ahhh I see, I wasn’t looking in the right place.

I’ve spent the last few hours experimenting with the setjmp/longjmp intrinsics, and I don’t think they can achieve the desired result, for two reasons:

  1. While switching stacks, it’s necessary to pass the original stack pointer (or jmp_buf pointer) to the longjmp destination, which is why the research paper suggests a new calling convention. The longjmp intrinsic doesn’t appear to allow passing parameters.
  2. Using setjmp/longjmp requires an extra branch (depending on whether setjmp returned zero or non-zero), which should not be necessary.