New ideas about how to improve garbage collection support


I currently write a LLVM backend for RPython, a programming language and
toolchain to write dynamic language interpreters. For example, this
toolchain powers PyPy, a fast Python interpreter. In the moment the
implementation uses a shadowstack to manage garbage collection stack
roots. To improve performance, I experimented with LLVM's garbage
collection support.

It works by marking a stack slot as containing a garbage collection
root. This enables easy code generation for frontends which use the
mem2reg pass to construct the SSA form. Such frontends just need to emit
a @llvm.gcroot() intrinsic referring to the AllocaInst. RPython uses a
SSA representation of variables internally. To use LLVM's GC support, it
has to reassign SSA values back to stack slots, which is hard. Another
issue (not only for RPython) is that using the @llvm.gcroot() intrinsic
prevents mem2reg from promoting these stack slots to registers,
disabling optimizations. Ending the live time of a stack root is only
possible by storing null to the stack slot.

There were some discussions on this list about how to to mark SSA values
that are stack roots. None one them resulted in implementation of a
better GC root support. This was partly because most ideas assumed that
a stack root must be a pointer. With the current scheme, a stack root
doesn't need to be a pointer. The stack slot referred by a
@llvm.gcroot() intrinsic can contain anything. Not only pointers, but
also tagged unions and compressed pointers. Other ideas involved changes
to the type system.

To support GC roots of any type without changing the type system, a new
instruction needs to be introduced. It could be used like this:

     %gcvar = call %gctype* @allocate_type()
     mark_gcroot %gctype* %gcvar

For a proof-of-concept, which only supports pointers as GC roots, an
intrinsic is enough:

     %gcvar = call %gctype* @allocate_type()
     %tmp = bitcast %gctype* %gcvar to i8*
     call void @gcroot.mark(i8* %tmp)

What part of the garbage collector needs support from the code
generator? When a collection cycle is triggered, the garbage collector
needs to know what's contained in variables pointing to garbage
collected memory. A moving garbage collector might need to change these
variables to point to the new memory location. The code generator emits
tables that describe the layout of the stack. Using these tables and
target-specific code, the garbage collection runtime is able to "walk
the stack", reading and possibly writing variables pointing to garbage
collected memory.

What do these tables contain? To simplify things, I assume a simple,
non-concurrent garbage collector. For each call site that could trigger
a garbage collection cycle, the tables contain the size of the frame in
which the call happens and the relative locations of the GC roots.

Where are the stack roots stored? The naive approach is to push all gc
roots on the stack before a call and pop them after the call. A more
clever approach is to integrate with the register allocator. Registers
which are spilled anyway doesn't need to be pushed on the stack by the
garbage collection support code. Also, the values don't need to be
reloaded from stack until needed to reduce register pressure. I'd call
this "forced spilling".

Although my C++ coding skills are quite limited, I'd like to try
implementing this. My open questions are:

* How to pass "this is a GC root" through SelectionDAG to the post-ISel
* How to force spilling of a register that lives across a call?
* How to encode in IR which calls can trigger a garbage collection