Garbage Collection Roots

Hi all,

I've been looking through the documentation (http://llvm.org/docs/GarbageCollection.html) on how to implement a garbage collector for LLVM and there are a couple of things that I don't quite understand. Specifically, it says that when a stack variable goes out of scope, you're supposed to assign a null value to it to indicate that the value is no longer live.

What I don't quite understand is how the GC can efficiently use this information.

The GC will generally have some data structure that represents a list of all current roots. Calling llvm.gcroot adds a pointer to the list. When the variable goes out of scope, you'll want to remove the pointer from the list. But merely assigning null to the value doesn't cause this to happen; it only prevents the object previously pointed to from being traced. If the root pointer persists in the table long enough, that same stack address may get reused for a different variable, which will be non-null, and cause bad things to happen.

It might be clearer if I explained what I want to do. The way that I manage root pointers is as follows: For each application thread, there's a pair of arrays stored in thread-local data. One array is for adding roots, the other is for deleting roots. In other words, each time a root gets added, its address gets appended to the 'add' array; each time a root is deleted, its address gets appended to the 'delete' array. Since both arrays are thread-local, there's no locking or synchronization required. If either list gets full, a collection cycle is triggered.
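In C++ terms, the bookkeeping looks roughly like this (a sketch only; the names and buffer capacity are just for illustration):

    #include <cstddef>
    #include <vector>

    const std::size_t kMaxRoots = 4096;   // illustrative buffer capacity
    void gc_collect();                    // hypothetical: runs a collection cycle

    // Per-thread root bookkeeping; thread-locality avoids any locking.
    struct ThreadRootBuffers {
        std::vector<void**> added;        // roots registered since the last cycle
        std::vector<void**> deleted;      // roots retired since the last cycle
    };
    thread_local ThreadRootBuffers roots;

    void gc_root_add(void** slot) {
        roots.added.push_back(slot);
        if (roots.added.size() >= kMaxRoots)
            gc_collect();                 // 'add' list full: trigger a cycle
    }

    void gc_root_delete(void** slot) {
        roots.deleted.push_back(slot);
        if (roots.deleted.size() >= kMaxRoots)
            gc_collect();                 // 'delete' list full: trigger a cycle
    }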

Since the roots aren't actually used *except* during a collection cycle, we don't bother compacting the lists until the start of the next collection. At that time, the 'add' list is compacted by removing both duplicate entries and any entries found in the 'delete' list. The 'add' list is then used as the root set for the cycle.
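The compaction step at the start of a cycle would then be something like (again, a sketch):

    #include <unordered_set>
    #include <vector>

    // Compact the 'add' list against the 'delete' list; the result is
    // the root set for this collection cycle.
    std::vector<void**> compact_roots(std::vector<void**>& added,
                                      std::vector<void**>& deleted) {
        std::unordered_set<void**> dead(deleted.begin(), deleted.end());
        std::unordered_set<void**> seen;
        std::vector<void**> live;
        for (void** slot : added)
            if (seen.insert(slot).second && !dead.count(slot))
                live.push_back(slot);     // deduplicated and not deleted
        added = live;                     // compacted list becomes the root set
        deleted.clear();
        return live;
    }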

In order for this to work, however, I definitely need a way to know when a reference has gone out of scope so that it can be added to the 'delete' list.

-- Talin

Hi Talin,

I've been looking through the documentation (http://llvm.org/docs/GarbageCollection.html) on how to implement a garbage collector for LLVM and there are a couple of things that I don't quite understand. Specifically, it says that when a stack variable goes out of scope, you're supposed to assign a null value to it to indicate that the value is no longer live.

What I don't quite understand is how the GC can efficiently use this information.

The GC intrinsics exist for the type of collector which depends on metadata compiled alongside the function. At runtime, such a collector walks the dynamic call stack and merges that with the static metadata to identify dynamically live roots. Like "zero cost exception handling" or debug info, collecting this kind of metadata requires backend code generator support, which again is the reason these intrinsics exist.

The GC will generally have some data structure that represents a list of all current roots. Calling llvm.gcroot adds a pointer to the list. When the variable goes out of scope, you'll want to remove the pointer from the list. But merely assigning null to the value doesn't cause this to happen; it only prevents the object previously pointed to from being traced. If the root pointer persists in the table long enough, that same stack address may get reused for a different variable, which will be non-null, and cause bad things to happen.

It might be clearer if I explained what I want to do. The way that I manage root pointers is as follows: For each application thread, there's a pair of arrays stored in thread-local data. One array is for adding roots, the other is for deleting roots. In other words, each time a root gets added, its address gets appended to the 'add' array; each time a root is deleted, its address gets appended to the 'delete' array. Since both arrays are thread-local, there's no locking or synchronization required. If either list gets full, a collection cycle is triggered.

Since the roots aren't actually used *except* during a collection cycle, we don't bother compacting the lists until the start of the next collection. At that time, the 'add' list is compacted by removing both duplicate entries and any entries found in the 'delete' list. The 'add' list is then used as the root set for the cycle.

Okay.

In order for this to work, however, I definitely need a way to know when a reference has gone out of scope so that it can be added to the 'delete' list.

From what you've described, your collector does not need any code generator support. You could easily ignore the llvm.gc* intrinsics. Instead, your front-end could simply insert 'add' and 'remove' fragments as source variables enter and exit scope.
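That is, the front-end would emit code shaped roughly like this (a sketch, reusing the hypothetical runtime entry points from your description):

    struct Object;
    Object* allocate_object();            // hypothetical allocator
    void use(Object*);
    void gc_root_add(void** slot);        // append to the 'add' array
    void gc_root_delete(void** slot);     // append to the 'delete' array

    void generated_function() {
        Object* x = allocate_object();
        gc_root_add(reinterpret_cast<void**>(&x));     // x enters scope
        use(x);
        gc_root_delete(reinterpret_cast<void**>(&x));  // x exits scope
    }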

That said, it is possible to easily adapt LLVM's model to your runtime if you prefer to use the intrinsics. First, observe that the stack roots in a stack frame are valid until the function returns. Given that, you can adjust the 'add' and 'remove' arrays in a batch at the start and end of the function. To avoid keeping objects alive unnecessarily long, null the root as the variable goes out of scope, just as the LLVM manual suggests.
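In that adaptation, the bookkeeping is batched per frame rather than per variable, along these lines (sketch; the batch entry points are invented):

    struct Object;
    Object* allocate_object();
    void use(Object*);
    void gc_roots_add_batch(void** slots, int n);      // hypothetical batched calls
    void gc_roots_delete_batch(void** slots, int n);

    void generated_function() {
        void* frame_roots[2] = { 0, 0 };       // every root slot in this frame
        gc_roots_add_batch(frame_roots, 2);    // one 'add' at function entry
        frame_roots[0] = allocate_object();
        use(static_cast<Object*>(frame_roots[0]));
        frame_roots[0] = 0;                    // null the root as the variable dies
        gc_roots_delete_batch(frame_roots, 2); // one 'delete' at function return
    }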

The existing LowerGC pass also finds (and removes) gcroot intrinsics and transforms them into LLVM code at function entry and return. You could easily adapt that pass to cooperate with your runtime.

Hope that helps.

— Gordon

So, if I understand what you are saying correctly, the "llvm.gcroot" intrinsic isn't actually an "instruction" in the sense I normally think of as an action to be taken; it's more of a marker which is used by the code generator to create a data structure that describes the stack layout. Is that correct?

If that's the case, then my next question is: How do the LLVM garbage collection intrinsics work in a multi-threaded environment? If there's a data structure that describes the layout of the stack, and you have multiple stacks, what happens?

BTW, I'm also noticing that the current SemiSpace example code seems quite incomplete.

I’ve been looking through the documentation (http://llvm.org/docs/GarbageCollection.html) on how to implement a garbage collector for LLVM and there are a couple of things that I don’t quite understand. Specifically, it says that when a stack variable goes out of scope, you’re supposed to assign a null value to it to indicate that the value is no longer live.

What I don’t quite understand is how the GC can efficiently use this information.

The GC intrinsics exist for the type of collector which depends on metadata compiled alongside the function. At runtime, such a collector walks the dynamic call stack and merges that with the static metadata to identify dynamically live roots. Like “zero cost exception handling” or debug info, collecting this kind of metadata requires backend code generator support, which again is the reason these intrinsics exist.

So, if I understand what you are saying correctly, the “llvm.gcroot” intrinsic isn’t actually an “instruction” in the sense I normally think of as an action to be taken; it’s more of a marker which is used by the code generator to create a data structure that describes the stack layout. Is that correct?

That’s correct; the gcroot intrinsic is an annotation of an alloca. This is not without precedent. The llvm.noalias intrinsic is an annotation of an SSA pointer value.

If that’s the case, then my next question is: How do the LLVM garbage collection intrinsics work in a multi-threaded environment? If there’s a data structure that describes the layout of the stack, and you have multiple stacks, what happens?

The intrinsics are entirely neutral to collector implementation, and thus to threading. They could easily be used to implement reference counting, for instance, which may or may not be implemented in a threadsafe manner. However, as with your algorithm, reference counting does not require code generator support, and so would not justify the introduction of the intrinsics.

But to elaborate upon the described class of collectors: The emitted GC tables do not describe the dynamic call stack; rather, they are static data compiled into the executable. Only when cross-referenced with the dynamic call stack do they identify the roots. The garbage collectors for .NET and Java work in this manner.

Here’s an example of the sort of data these tables might contain. (I generated this using the -print-gc feature which I have added to llc locally; the llc in Subversion cannot print this data because it does not collect it.) This test case has a simple function with 2 roots and 1 call.

GC roots for fun:
        0 8[sp]
        1 4[sp]
GC safe points for fun:
        label 1: post-call, live = { 0, 1 }

Provided this information, the collector can easily identify the live roots within fun’s stack frame at runtime—so long as fun is paused at the safe point, which it must be if it is not the topmost stack frame.
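Concretely, decoding one such paused frame might look like this (a sketch only; the data-structure layout is invented to mirror the dump above):

    #include <cstddef>

    // Static, compiler-emitted metadata mirroring the dump above.
    struct SafePoint {
        const void*  address;       // code address of the safe-point label
        const int*   live;          // indices into FrameMap::root_offsets
        std::size_t  num_live;
    };
    struct FrameMap {
        const std::ptrdiff_t* root_offsets;   // e.g. { 8, 4 }: offsets from sp
        const SafePoint*      safe_points;
        std::size_t           num_safe_points;
    };

    // Visit every root live in a frame paused at 'return_address'.
    template <typename Visitor>
    void visit_frame_roots(const FrameMap& map, const void* return_address,
                           char* sp, Visitor visit) {
        for (std::size_t i = 0; i < map.num_safe_points; ++i) {
            const SafePoint& pt = map.safe_points[i];
            if (pt.address != return_address)
                continue;                     // not the safe point we're paused at
            for (std::size_t j = 0; j < pt.num_live; ++j)
                visit(reinterpret_cast<void**>(sp + map.root_offsets[pt.live[j]]));
            return;
        }
    }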

Walking the stack; identifying the current safe point and stack pointer; stopping concurrent threads at safe points; and actually tracing the heap—all left as exercises to the interested reader. :)

Here’s a resource you might find interesting for further reading:

http://www.cs.kent.ac.uk/people/staff/rej/gcbib/gcbib.html

BTW, I’m also noticing that the current SemiSpace example code seems quite incomplete.

I observed that myself.

— Gordon

Gordon Henriksen wrote:

The intrinsics are entirely neutral to collector implementation, and thus to threading. They could easily be used to implement reference counting, for instance, which may or may not be implemented in a threadsafe manner. However, as with your algorithm, reference counting does not require code generator support, and so would not justify the introduction of the intrinsics.

Well, I might rethink my algorithm based on what I learn here. :)

But to elaborate upon the described class of collectors: The emitted GC tables do not describe the dynamic call stack; rather, they are static data compiled into the executable. Only when cross-referenced with the dynamic call stack do they identify the roots. The garbage collectors for .NET and Java work in this manner.

Here's an example of the sort of data these tables might contain. (I generated this using the -print-gc feature which I have added to llc locally; the llc in Subversion cannot print this data because it does not collect it.) This test case has a simple function with 2 roots and 1 call.

GC roots for fun:
        0 8[sp]
        1 4[sp]
GC safe points for fun:
        label 1: post-call, live = { 0, 1 }

Provided this information, the collector can easily identify the live roots within fun's stack frame at runtime—so long as fun is paused at the safe point, which it must be if it is not the topmost stack frame.

How do you get the info for the non-topmost stack frames in this case? Am I going to need to implement my own version of llvm_cg_walk_gcroots()?

My current mad scheme (which I'm not sure will work) for safe points in a multi-threaded environment uses a global POSIX reader/writer lock. Each running thread keeps a read lock on the global lock when it's running. The old-generation collector attempts to acquire a write-lock when it needs to do a collection. At some semi-regular interval, the application threads will need to momentarily release the lock, telling the global collector it's OK to run. (This is only for the old generation - young generation collections are handled per-thread, hopefully. Although cross-thread references make that complex...) The advantage of this scheme is that it doesn't require preemptive thread freezing, icky pagefault-based write barriers or other platform-specific hacks; the disadvantage is that all threads have to wait if some thread doesn't want to release the read lock in a timely manner.
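In pthreads terms, the skeleton would be roughly (sketch):

    #include <pthread.h>

    pthread_rwlock_t gc_lock = PTHREAD_RWLOCK_INITIALIZER;

    // Each mutator thread holds the read side whenever it is running.
    void mutator_start() { pthread_rwlock_rdlock(&gc_lock); }

    // Called at semi-regular intervals: pulse the lock so that a
    // pending old-generation collection gets a chance to run.
    void gc_yield_point() {
        pthread_rwlock_unlock(&gc_lock);
        pthread_rwlock_rdlock(&gc_lock);
    }

    // The old-generation collector excludes all mutators via the write side.
    void collect_old_generation() {
        pthread_rwlock_wrlock(&gc_lock);
        /* ... trace from the root set ... */
        pthread_rwlock_unlock(&gc_lock);
    }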

The hard part will be analyzing loops in the generated IR to decide whether a call to pulse the lock needs to be inserted in the loop body or not. Obviously, dropping and reacquiring a lock is expensive, so if the loop is provably short, I don't want to do that.

So it sounds like I should be passing the pointer to the stack root list at the same time I release the lock? I suppose I could store it in the TLD, which I guess wouldn't be too expensive.

Walking the stack; identifying the current safe point and stack pointer; stopping concurrent threads at safe points; and actually tracing the heap—all left as exercises to the interested reader. :)

Here's a resource you might find interesting for further reading:

    Richard Jones' Garbage Collection Bibliography

Cool. I have tons of papers on garbage collection already, though. :) But I will add it to the list.

Gordon Henriksen wrote:

But to elaborate upon the described class of collectors: The emitted GC tables do not describe the dynamic call stack; rather, they are static data compiled into the executable. Only when cross-referenced with the dynamic call stack do they identify the roots. The garbage collectors for .NET and Java work in this manner.

Here’s an example of the sort of data these tables might contain. (I generated this using the -print-gc feature which I have added to llc locally; the llc in Subversion cannot print this data because it does not collect it.) This test case has a simple function with 2 roots and 1 call.

GC roots for fun:
        0 8[sp]
        1 4[sp]
GC safe points for fun:
        label 1: post-call, live = { 0, 1 }

Provided this information, the collector can easily identify the live roots within fun’s stack frame at runtime—so long as fun is paused at the safe point, which it must be if it is not the topmost stack frame.

How do you get the info for the non-topmost stack frames in this case?

The runtime needs to walk the machine stack. The return address of the inner frame will be the address of label 1 while fun is paused in the call that precedes it. An actual codegen example of the same module is here:

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20070903/053242.html

Llabel1 is the safe point.
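A frame-pointer walk would recover that return address for each frame, roughly like so (sketch; assumes frame pointers are emitted, and lookup_frame_map is a hypothetical search over the static tables):

    struct FrameMap;  // per-function static table, as sketched earlier
    const FrameMap* lookup_frame_map(const void* return_address);
    void process_frame(const FrameMap& map, void** fp);  // enumerate live roots

    // Walk saved frame-pointer/return-address pairs up the machine stack.
    void walk_stack(void** frame_pointer) {
        while (frame_pointer) {
            void* return_address = frame_pointer[1];    // e.g. the address of Llabel1
            if (const FrameMap* map = lookup_frame_map(return_address))
                process_frame(*map, frame_pointer);     // cross-reference with tables
            frame_pointer = static_cast<void**>(frame_pointer[0]);  // caller's frame
        }
    }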

Am I going to need to implement my own version of llvm_cg_walk_gcroots()?

If you choose not to use the LowerGC transformation and the attendant shadow stack.

My current mad scheme (which I’m not sure will work) for safe points in a multi-threaded environment uses a global POSIX reader/writer lock. Each running thread keeps a read lock on the global lock when it’s running. The old-generation collector attempts to acquire a write-lock when it needs to do a collection. At some semi-regular interval, the application threads will need to momentarily release the lock, telling the global collector it’s OK to run. (This is only for the old generation - young generation collections are handled per-thread, hopefully. Although cross-thread references make that complex…) The advantage of this scheme is that it doesn’t require preemptive thread freezing, icky pagefault-based write barriers or other platform-specific hacks; the disadvantage is that all threads have to wait if some thread doesn’t want to release the read lock in a timely manner.

Could work.

The hard part will be analyzing loops in the generated IR to decide whether a call to pulse the lock needs to be inserted in the loop body or not. Obviously, dropping and reacquiring a lock is expensive, so if the loop is provably short, I don’t want to do that.

Sure. And if there already exists an unconditional safe point within the loop body, then you also do not need to insert one in the loop. This is a very common need, and is one of the target-independent features I want to provide in the collector infrastructure. However, although I have made provisions for it, I have no immediate need for it. Patches are welcome.

So it sounds like I should be passing the pointer to the stack root list at the same time I release the lock? I suppose I could store it in the TLD, which I guess wouldn’t be too expensive.

— Gordon