Hi LLVM devs,
I’d like to describe my problem, and then propose new features of LLVM which would solve it efficiently.
I’m building a new statically-compiled programming language, and I plan on using LLVM as the backend. The language will have a runtime with cooperatively managed green threads (user-space “mini-threads”, each with their own dynamically allocated stack). A single OS thread will be responsible for managing many (perhaps millions) of green threads.
Essentially, the program flow is like this:
- Program is executing Green Thread 1 (GT1)
- Program reaches a “yield point” which does three things:
a. Saves all active registers to the stack.
b. Calls a runtime function runtimeYield().
c. Pops the saved registers from the stack, restoring the original CPU state (we’ll get to this later).
- The runtimeYield function is called with a calling convention that stores the return address on top of the stack.
- Inside runtimeYield, we record the stack pointer (%rsp) within a queue maintained by the scheduler (used to keep track of all existing green threads).
- Now we need to resume some other green thread (GT2). First we set the stack pointer (%rsp) to reference GT2’s stack. Then, we pop GT2’s return address from the top of its stack, and “jmp” to it, which should jump us to where GT2 previously yielded.
- As GT2 resumes execution, it’s on the second half of a “yield point”, so it restores its original CPU state by popping saved registers from the stack.
This model has the potential to be extremely efficient (perhaps more efficient than any other existing green thread implementation). But it can only achieve that efficiency through compiler support. This model is borrowed from a research paper (found here: https://dl.acm.org/doi/10.1145/2400682.2400695) published in 2013. The paper describes how, by designing the context switch as a compiler intrinsic instead of a runtime library function, many existing compiler optimizations can be automatically applied to user-space context switches.
The paper shows the results of many benchmarks, all of which show this model heavily outperforming existing models of context switching, in both speed and memory usage. For example, it mentions that the POSIX functions makecontext() and swapcontext() save a structure that’s 936 bytes in size (768 on my machine), but their model generates context switches that save only 8-80 bytes of CPU state, depending on which existing optimizations are able to apply to the specific context switch. In some cases, the entire context switch is reduced to 1 or 2 instructions.
Conveniently, the authors of that research paper actually implemented their model in a fork of LLVM! (found here: https://github.com/stedolan/llvm). The model is essentially implemented with just two additions: a new calling convention called “swapstack”, and a new intrinsic function called “newstack”. They demonstrate that these two primitives can be used not just for user-level context-switching, but also for implementing efficient coroutines and generators, all while taking advantage of existing LLVM optimizations. I encourage you to read the entire research paper, as it explains the implementation of “swapstack” and “newstack” in great detail.
I also found a conversation in the mailing list from 2010, which shows someone else interested in the same functionality (found here: http://lists.llvm.org/pipermail/llvm-dev/2010-April/030914.html).
I propose that the LLVM devs consider adding “swapstack” and “newstack” into the LLVM project officially.
Thank you for taking the time to read this proposal, and thank you for all your hard work on LLVM.
- Joshua Wise