I am an active user of LLVM's GC infrastructure. Here are some notes on
how I currently use gcroot():
Wow, thank you for sharing! Your usage is far beyond anything else I've
heard of. Can I ask which project/language this is? Or is that private?
Just a personal project: https://code.google.com/p/foster/source/browse/
The interesting bits are in Haskell, so you probably won't be able to reuse
much code apart from, maybe, the custom LLVM passes.
1) I use several IRs. A high-level (CPS/SSA hybrid) IR is the target of
inlining and other such high-level optimization; most allocation is
implicit in the high level IR. Closure conversion/lambda-lifting produces a
variant of that IR, extended with explicit allocation constructs. A
separate pass introduces the allocations themselves.
2) Next, a dataflow-based root insertion pass inserts the minimal(ish) set
of roots and root reloads to preserve the invariant that every live GCable
pointer is in a root at each potential GC point. The root insertion pass
also computes liveness and uses it to minimize the set of roots via slot
coloring. The result of root insertion is (very close to) LLVM IR.
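To make the invariant concrete, here is a minimal sketch (in Python, purely illustrative; the author's actual pass is written against Hoopl and LLVM IR) of backward liveness plus the "every pointer live across a safepoint must be in a root" check:

```python
def liveness(blocks, succs):
    """Standard backward liveness to a fixpoint. blocks maps a block
    name to a list of (defs, uses, is_gc_point) instructions; succs
    maps a block name to its successor block names."""
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            out = set()
            for s in succs.get(b, []):
                out |= live_in(blocks[s], live_out[s])
            if out != live_out[b]:
                live_out[b] = out
                changed = True
    return live_out

def live_in(insns, out):
    """Propagate a live-out set backward through one block."""
    live = set(out)
    for defs, uses, _ in reversed(insns):
        live -= set(defs)
        live |= set(uses)
    return live

def roots_needed(insns, out):
    """GCable pointers live across a potential GC point must be rooted."""
    live, rooted = set(out), set()
    for defs, uses, is_gc in reversed(insns):
        if is_gc:
            rooted |= live      # live-out of the safepoint => needs a root
        live -= set(defs)
        live |= set(uses)
    return rooted
```

A pointer that dies before the safepoint never gets rooted, which is exactly where the savings over the naïve reload-everything scheme come from.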
I would be really interested in hearing more about your implementation for
this part. We need to solve a similar problem in the statepoint lowering
code. What we have to date is a simple greedy scheme that works reasonably
well in practice, but leaves lots of room for improvement.
Ack, I was tired when I wrote that -- ignore the "via slot coloring". I'm
also doing greedy reuse of dead slots with no backtracking or anything. The
implementation is rather hairy, in part because the underlying problem goes
against the grain of the dataflow library I'm using (Hoopl).
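The greedy reuse of dead slots can be sketched like this (illustrative Python, not the Hoopl implementation): each root live range takes the first slot whose previous occupant has died, with no backtracking:

```python
def assign_slots(ranges):
    """Greedy slot reuse over (name, start, end) live ranges.
    A slot is reused as soon as its occupant's range has ended;
    otherwise a fresh slot is opened. No backtracking."""
    slots = []        # slots[i] = end point of the range currently in slot i
    assignment = {}
    for name, start, end in sorted(ranges, key=lambda r: r[1]):
        for i, busy_until in enumerate(slots):
            if busy_until < start:         # slot i's occupant is dead
                slots[i] = end
                assignment[name] = i
                break
        else:
            assignment[name] = len(slots)  # no dead slot: open a new one
            slots.append(end)
    return assignment, len(slots)
```

This is essentially linear scan without spilling decisions; a real slot-coloring pass could do better, but greedy reuse already collapses most of the slot count.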
3) Potential GC points are determined inter-procedurally, and call sites
which are statically known to (transitively) not GC do not break GCable
pointer live ranges.
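The interprocedural may-GC property is a simple fixpoint over the call graph: a function may GC if it allocates directly (or is unknown) or transitively calls something that may GC. A hedged sketch, with hypothetical function names:

```python
def may_gc(calls, allocates):
    """calls: fn -> set of callees; allocates: functions that directly
    allocate (treat external/unknown callees as allocating too).
    Propagates "may GC" up the call graph to a fixpoint."""
    gc = set(allocates)
    changed = True
    while changed:
        changed = False
        for fn, callees in calls.items():
            if fn not in gc and callees & gc:
                gc.add(fn)      # calls something that may GC
                changed = True
    return gc
```

Call sites whose callee is not in the resulting set are not safepoints, so GCable pointer live ranges can flow straight across them.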
I was planning something like this for LLVM at some point. It's
relatively low on my priority list since IPO tends to be of minimal
interest when JITing which is my primary use case.
Ah, that makes sense. I think it's a much, much bigger deal for a static compiler.
4) I use a lightly-modified variant of the OCaml plugin's stackmap format.
5) I don't currently use load or store barriers, but do eventually plan to.
6) I do support GCing through global variables.
7) I wrote a custom LLVM pass to verify that values loaded from GC roots
aren't used across GC points.
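The property the verifier checks can be illustrated on a toy straight-line IR (hypothetical instruction tuples; the real pass works on LLVM IR with frontend-supplied metadata): a value loaded from a root is invalidated by any subsequent GC point, so later uses must reload it first.

```python
def check_block(insns):
    """Flag uses of root-loaded values across a GC point without an
    intervening reload. insns is a list of ("load", val, root),
    ("gc",), or ("use", val) tuples."""
    valid = {}      # val -> True while no GC has happened since its load
    errors = []
    for insn in insns:
        if insn[0] == "load":
            valid[insn[1]] = True
        elif insn[0] == "gc":
            valid = {v: False for v in valid}   # GC may move objects
        elif insn[0] == "use" and valid.get(insn[1]) is False:
            errors.append(insn[1])              # stale use: report it
    return errors
```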
Cool. We have a similar one for statepoints. (If you can share, getting
both into the common tree would be nice.)
My code's at
-- I'm not sure how useful it would be in the common tree, though, because
it relies on the frontend adding metadata reflecting interprocedural may-GC information.
When I first started using gcroot(), I used the naïve approach of
reloading roots after every call, and encountered significant overhead. The
optimizations sketched above significantly reduced that overhead.
Unfortunately, I don't know of any meaningful language-independent
benchmark suites for measuring that sort of overhead, and of course the
overhead on any given program will strongly depend on details of how that
program is written...
Also, FWIW when I first started, I got up and running with the shadow
stack, just as Gordon described, before transitioning to the "real"
infrastructure. As long as it doesn't impose a significant burden on the
implementation side, I think it's worth having, because it significantly
eases the learning curve for new users.
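For readers unfamiliar with the shadow stack: the idea is that generated code maintains an explicit linked list of frames, each listing its own GC roots, so the collector can find roots without any stack maps at all. A toy model (Python; LLVM's shadow-stack GC emits the equivalent in compiled code):

```python
class Frame:
    """One shadow-stack entry: a frame's root values plus a link to
    the caller's entry. Pushed on function entry, popped on exit."""
    def __init__(self, roots, caller):
        self.roots = roots
        self.caller = caller

def walk(top):
    """Yield every root visible from the topmost frame downward."""
    frame = top
    while frame is not None:
        yield from frame.roots
        frame = frame.caller
```

The cost is that every frame pays the push/pop and root-store traffic even when no GC happens, which is why it's a stepping stone rather than the end state.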
I'm planning on retaining the shadow stack mechanism.
OK, direct answers to some of your questions:
* I think custom stack map formats have small but non-zero value. There
have been a few papers (most in the early 90's, I think) which showed that
stack maps can make up a non-trivial fraction of a binary's size, and thus
are increasingly desirable to optimize as programs grow larger. My verdict:
if support for custom formats is ever actively impeding forward progress,
toss 'em; otherwise, there should (eventually) be a more detailed look at
the costs and benefits.
My preferred usage model would be: LLVM generates standard format, runtime
parses and saves in custom binary format.
Having said that, retaining the capability doesn't seem to involve too
much complexity. I see no reason to kill it since multiple folks seem to be using it.
Ah, yeah -- again, this is more important for a static compiler, since the
stackmaps are embedded into the compiled binary.
* As above, I think the primary benefit of one-stackmap-per-safepoint is
saving space at the cost of time (and root traffic). AFAIK the primary
virtue of the one-stack-slot-per-gcroot() implementation is implementation
simplicity, nothing more.
* The ultimate strength & weakness of gcroot() is that basically
everything is left to the frontend. There's very little unavoidable
overhead imposed on a frontend that is willing to go the distance to
generate non-naïve code -- any knowledge that the frontend has can be used
to generate better code. Unfortunately, this also places a rather heavy
burden on the frontend to generate valid & efficient code.
Since I haven't had the chance to look beyond mailing list & blog posts
on statepoints, I can't comment much on their tradeoffs vs gcroots. The
high-level impression I get is that statepoints will allow a
less-sophisticated frontend to get better results than naïve usage of
gcroot(). It also looks like statepoints will do a better job of steering
frontends away from the pitfalls of using GC-invalidated pointer values. I
have no idea whether there will be any codegen benefit for more
sophisticated frontends, but I'll be happy to port my implementation to
statepoints once they've gotten a chance to settle down somewhat, and
provide more detailed feedback to help determine what to eventually do with
gcroot(). I can currently only do ad-hoc benchmarking, but hopefully by the
time gcroot() is on the chopping block, I'll have a more extensive
automated performance regression test suite.
I'll be very interested in your results. I'm quite sure you'll find
stability and performance bugs. The existing code 'works' but has really
only been hammered in one particular use case (source language, virtual
machine). Every new consumer will help to make the code more robust. Let
me know when you're ready to start prototyping this. I'm happy to answer
questions and help guide you around any bugs you might find.
I suspect from what you've said above that we might want to extend the
mechanism slightly to allow you to retain your preassigned stack slots.
You may be getting better spilling code than the current implementation
would give you by default. This seems like it might be a generally useful
mechanism. We could use a call attribute on the gc.relocate, a bit
of metadata, or possibly an optional third argument that specifies an
'abstract slot'. It would depend on your initial results and how much
we've managed to improve the lowering code by then.
Will have to measure to find out. No rush, though; I probably won't be
able to get around to it for several months.
One quick question: I think I understand why address spaces are needed to
distinguish GCable pointers for late safepoint placement, but are they also
needed if the frontend is inserting all safepoints itself?
Nope. They might allow some additional sanity checks, but nothing that is
currently checked in relies on address spaces in any way. My plan is to
make the pointer distinction mechanism (addrspace, gcroot, vs custom) a
property of the GCStrategy with easily controllable extension points.
Sounds like a solid plan!
Finally, thank you for taking on the burden of improving LLVM's GC infrastructure.