Jon Harrop wrote:
A while back there was a discussion thread about whether an accurate,
concurrent garbage collector could be "generic" in the sense of being
able to support multiple different languages efficiently. After having
done some work on this, I now believe that this is the case - using C++
policy-based design principles, you can create a set of modules that
represent different aspects of collector behavior (such as
mark-and-sweep, stop-the-world, and so on) along with different aspects
of the runtime environment (object tracing strategies, heap structures,
threading primitives, atomics), and encode these various behaviors as
template classes which can be bound together to create an efficient
This is great news! I have some questions...
And thank you for your interest!
I had great success using some seemingly-unusual techniques in my experiments.
Firstly, rather than using a single one-word pointer to represent a reference, I chose to use three words, including a pointer to the type and a pointer to the value (as well as metadata). This allows typed nulls, which addresses an important deficiency found in most other VMs, including the CLR. Is Scarcity able to handle such references, or does its implementation of stack frames require references to be a single word?
As you no doubt know, there are a couple of different approaches to obtaining the type information for an object reference. These can be divided into static typing, where the compiler knows in advance the type of the reference, and object tagging, in which the type information is stored with the object being referred to (usually in the object itself, although in your case it's with the reference).
Scarcity attempts to unify these two models by declaring that object tagging is a special case of static typing. For every object reference, the compiler is required to supply a tracing strategy for that reference, in the form of a function that knows how to trace objects of that static type. That tracing strategy may know the entirety of the object's structure at compile time, in which case object tags are not needed; or, if the reference is polymorphic, the tracing strategy function can inspect the object's tag in order to look up the metadata that describes where the references within the object are located. This is up to the language implementer to decide. In fact, you could conceivably have more than one object hierarchy, each with its own tag format, and so long as the compiler can determine statically which tag format is being used for a particular reference, it should all work.
The strategy function enumerates all of the references in the object and passes them to the trace visitor, which is a functor object supplied by the collection algorithm. The strategy function passes into the visitor a pointer to each reference, rather than the reference itself - the reason for this is to support copying collectors which can modify the reference to point to the new location.
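To make the two kinds of strategy and the pointer-to-reference convention concrete, here is a rough C++ sketch. Every name in it (Node, NodeTraceStrategy, TypeInfo, TaggedObject, Pair, TaggedTraceStrategy, RelocatingVisitor) is invented for illustration and is not taken from Scarcity's actual API; the forwarding table is only a stand-in for what a copying collector would do.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Every name here is hypothetical; none is taken from Scarcity's sources.

// Case 1: layout fully known at compile time, so no tag is needed.
struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
    int value = 0;
};

struct NodeTraceStrategy {
    template <typename Visitor>
    static void trace(Node* obj, Visitor& visit) {
        // Hand over the address of each slot, not the reference it holds,
        // so that a moving collector can overwrite the slot in place.
        visit(&obj->left);
        visit(&obj->right);
    }
};

// Case 2: polymorphic reference; the strategy reads a tag to find the slots.
struct TypeInfo {
    std::vector<std::size_t> refOffsets;  // byte offsets of reference fields
};

struct TaggedObject {
    const TypeInfo* tag;
};

struct Pair {  // standard layout: header first, then the reference fields
    TaggedObject header;
    TaggedObject* first;
    TaggedObject* second;
};

struct TaggedTraceStrategy {
    template <typename Visitor>
    static void trace(TaggedObject* obj, Visitor& visit) {
        char* base = reinterpret_cast<char*>(obj);
        for (std::size_t offset : obj->tag->refOffsets) {
            visit(reinterpret_cast<TaggedObject**>(base + offset));
        }
    }
};

// Stand-in for a copying collector's visitor: rewrites any slot whose
// target has an entry in the forwarding table.
struct RelocatingVisitor {
    std::unordered_map<void*, void*> forwarding;  // old address -> new address
    int visited = 0;

    template <typename T>
    void operator()(T** slot) {
        ++visited;
        auto it = forwarding.find(*slot);
        if (it != forwarding.end()) {
            *slot = static_cast<T*>(it->second);
        }
    }
};

// Usage: pretend b was copied to c, then trace a and check its slot moved.
inline void demoRelocation() {
    Node a, b, c;
    a.left = &b;
    RelocatingVisitor visitor;
    visitor.forwarding[&b] = &c;
    NodeTraceStrategy::trace(&a, visitor);
    assert(a.left == &c);
    assert(visitor.visited == 2);
}
```

Both strategies present the same interface to the visitor, so the collection algorithm does not need to know whether the slots were found from static layout or from a tag.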
This means that for your purposes, Scarcity does not dictate the format of object metadata nor the way objects are traced. However, some of the collector algorithms assume that the pointer-to-reference points to a plain one-word object pointer, so you wouldn't be able to use those algorithms with three-word references.
In other words, you probably won't be able to use the Scarcity collection algorithms for what you want, but what you can do is build a collector within the Scarcity framework that does what you want. Whether the various modules within Scarcity are valuable enough to be used in this way is up to you.
Secondly, I used LLVM to JIT compile per-type code for garbage collection in order to traverse data structures as efficiently as possible. Moreover, I chose to write the entire GC in an unsafe subset of the language that I was implementing so that I could rely upon tail calls and so forth. Does Scarcity require the GC code to be written entirely in C?
I imagine that most languages will want to write as much as possible in the hosted language rather than in C. The question is where to draw the dividing line. There are several issues which must be considered:
1) Small snippets of code such as write barriers should be inlined wherever possible. This pretty much requires that they be written in the same language as the code that they will be embedded within.
2) Many high-level languages really aren't very good at dealing with raw memory, because doing so requires you to introduce pointer arithmetic and other "unsafe" operations, as you point out. For some languages, adding the necessary unsafe operations can distort and corrupt the language design to the point where it feels like you are writing a second compiler. If such is the case, then it makes sense to offload some of these operations to a lower-level language such as C.
3) In a hybrid approach where you are doing high-level stuff in your language, and then grunging around with pointers in C, there's some efficiency loss due to the fact that the compiler for each language cannot inline or optimize code written in the other language. (Although if you compile the C part with clang and use link-time optimization, you can mitigate that somewhat.) One has to think very carefully how to structure the API to avoid frequent jumping back and forth across the language barrier.
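To illustrate point 1, here is a minimal sketch of a write barrier small enough to inline at every reference store. It uses card marking purely as an example; the thread does not say which barrier Scarcity's collectors need, and the names (heapSlots, cardTable, cardIndexFor, writeRef) and the 512-byte card size are assumptions made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy heap: a fixed array of reference slots.
constexpr std::size_t kSlotCount = 4096;
constexpr std::size_t kCardShift = 9;  // 512-byte cards (assumed size)
constexpr std::size_t kCardCount =
    (kSlotCount * sizeof(void*)) >> kCardShift;

static void* heapSlots[kSlotCount];
static unsigned char cardTable[kCardCount];

// Maps a slot inside heapSlots to the index of the card that covers it.
inline std::size_t cardIndexFor(void** slot) {
    std::size_t byteOffset =
        static_cast<std::size_t>(reinterpret_cast<char*>(slot) -
                                 reinterpret_cast<char*>(heapSlots));
    return byteOffset >> kCardShift;
}

// The barrier itself: one store plus one byte write, cheap enough that the
// compiler can inline it at every reference assignment.
inline void writeRef(void** slot, void* value) {
    *slot = value;
    cardTable[cardIndexFor(slot)] = 1;  // mark the card as dirty
}

// Usage: store a reference and check that only the covering card is dirty.
inline void demoBarrier() {
    writeRef(&heapSlots[128], &heapSlots[0]);
    assert(heapSlots[128] == &heapSlots[0]);
    assert(cardTable[cardIndexFor(&heapSlots[128])] == 1);
    assert(cardTable[0] == 0);
}
```

If the barrier lived on the C side of a language boundary, each reference store would become an out-of-line call, which is the cost point 3 warns about.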
Because Scarcity is designed as a framework of modules, you can decide where to draw that line by deciding which modules you are going to use. For example, you might choose not to use any of Scarcity's collection algorithms at all, but you still might use its stop-the-world thread manager, or its memory heap.
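As a rough picture of what mixing modules looks like with policy-based design, the sketch below binds a heap policy and a thread-manager policy into a runtime, and leaves the collection algorithm to the caller. All of the names (MallocHeap, CountingThreadManager, Runtime) are hypothetical and do not correspond to Scarcity's real modules; the thread manager only records pauses rather than actually suspending threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical heap policy: malloc/free plus a live-block counter.
struct MallocHeap {
    std::size_t liveBlocks = 0;
    void* allocate(std::size_t bytes) {
        ++liveBlocks;
        return std::malloc(bytes);
    }
    void release(void* block) {
        --liveBlocks;
        std::free(block);
    }
};

// Hypothetical thread-manager policy. A real stop-the-world manager would
// bring mutator threads to a safe point; this one only records the pause.
struct CountingThreadManager {
    bool stopped = false;
    int pauses = 0;
    void stopTheWorld() { stopped = true; ++pauses; }
    void resumeTheWorld() { stopped = false; }
};

// The policies are bound together at compile time, so calls into them can
// be inlined by the compiler.
template <typename HeapPolicy, typename ThreadPolicy>
class Runtime {
public:
    void* allocate(std::size_t bytes) { return heap_.allocate(bytes); }

    // The collection algorithm is supplied by the language implementer and
    // runs while the world is stopped.
    template <typename CollectFn>
    void collect(CollectFn&& algorithm) {
        threads_.stopTheWorld();
        algorithm(heap_);
        threads_.resumeTheWorld();
    }

    HeapPolicy& heap() { return heap_; }
    ThreadPolicy& threads() { return threads_; }

private:
    HeapPolicy heap_;
    ThreadPolicy threads_;
};

// Usage: framework-style heap and thread manager, custom "collector".
inline void demoRuntime() {
    Runtime<MallocHeap, CountingThreadManager> rt;
    void* block = rt.allocate(16);
    rt.collect([&](MallocHeap& heap) { heap.release(block); });
    assert(rt.heap().liveBlocks == 0);
    assert(rt.threads().pauses == 1);
}
```

Swapping in a different heap or thread manager means changing a template argument, with no change to the code that calls collect.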
Finally, do you have any references describing the techniques you have used to implement a concurrent GC? I have been reading up on the subject for several years now, with a view to implementing my own concurrent GC.
There's a wiki page on the Scarcity site that lists references to a number of papers on garbage collection techniques.