Garbage collection

One of the more interesting subjects of conversation at the 2008 developer day was garbage collection. With the increasing number of LLVM-based VMs and other projects, I suspect that the desire for more comprehensive garbage collection support in LLVM is only going to increase. (I am now involved in two different open-source projects, both of which will eventually have a strong requirement for a high-performance LLVM-compatible collector.)

The last time I looked, LLVM's support for GC was fairly austere, consisting of (a) the bare minimum set of low-level intrinsics needed to integrate a collector with a runtime environment, and (b) a couple of example collectors, neither of which, AFAIK, takes full advantage of the low-level intrinsics to get optimal performance.

The IR-level intrinsics themselves don't so much help you *write* a GC as integrate one with LLVM. What is provided is essentially a mechanism for walking the stack and a means to insert read/write barriers into the generated code, which together form a tiny fraction of what it would take to design a working collector.

Part of the reason why there isn't more direct support for GC is the theory that there is no such thing as a one-size-fits-all collector. The argument goes that a really efficient collector design requires detailed knowledge of the object model of the language being compiled. Some languages use explicit type information embedded in the object, which has a specific offset and a specific format that is likely to be different for different languages. Other languages have the benefit of compile-time knowledge of the type of an entire object graph, and thus do not require the overhead of a per-object type field.
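For illustration (a hand-wavy C++ sketch; the names are invented), the two styles look roughly like this:

    // (a) Per-object type information: every object carries a pointer to a
    // type descriptor at a fixed offset; the format varies by language.
    struct TypeInfo;              // describes object size, which fields are pointers, etc.
    struct DynObject {
        const TypeInfo* type;     // header word read by the collector at run time
        // ... fields ...
    };

    // (b) Static type knowledge: the compiler knows the layout of the whole
    // object graph at compile time, so no per-object header word is needed.
    struct Point { double x, y; };   // traced via generated code, not a header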

On the other hand, it is possible to make a counter-argument to this theory that goes like this: The Java VM has been used to implement a large number of front-end languages efficiently, without requiring a special garbage collector for each language. Furthermore, I've observed that the Java VM, more than any other runtime environment, tends to be used as a platform for academic research into GC algorithms, without requiring changes to the set of languages which are hosted by the VM. In other words, both the language and the GC algorithm can be varied independently, from which we can conclude that perhaps the choice of language and GC aren't as tightly coupled as we might think.

It might also be argued that many of the languages that people are wanting to build with LLVM have Java-like object semantics anyway, at least at the level of GC. (Although finalization is an issue, since this is an area in which language semantics vary widely.)

It also seems to me that even radically different collector designs could utilize some common building blocks for heap management, work queuing, and so on. It might make sense to create a library of such components as an auxiliary project to LLVM. For example, while there are a large number of collector algorithms, there are only a fairly small number of designs for write barriers, so the library could provide several implementations of these. Similarly, there are only so many ways to "stop the world" (for concurrent collectors that require this), and these could be provided as well.
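To make the write-barrier point concrete, here is a minimal sketch of one classic design such a library might provide: a card-marking barrier. (The names are mine, not an existing LLVM API, and the globals are assumed to be initialized at heap setup.)

    #include <cstddef>
    #include <cstdint>

    static const size_t kCardShift = 9;   // 512-byte cards
    static uint8_t*  card_table;          // one byte per card
    static uintptr_t heap_base;

    // Called (or inlined by the compiler) on every pointer store into the heap.
    inline void writeBarrier(void** slot, void* new_value) {
        *slot = new_value;
        size_t card = ((uintptr_t)slot - heap_base) >> kCardShift;
        card_table[card] = 1;   // this card may now hold an interesting pointer
    }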

Of course, there is always a danger when creating libraries of the "ivory tower syndrome": putting a lot of effort into components that don't actually get used. This is why it would be even better to create a standard, high-performance collector for LLVM that actually uses these components.

As I see it, such a collector would include two tracing strategies, one based on per-object type information, and one based on static type knowledge. Having both types of tracing in a single collector would not, I think, cause a serious performance degradation, especially as object models such as Java's require both anyway.

Specific object model details, such as the offset of the type info within an object, can be dealt with by implementing the tracing functions in LLVM IR, so that all of the various inefficiencies caused by parameterizing the object layout get optimized away. It should be relatively easy for a frontend to generate the IR for a tracing function for each concrete class, while the collector library supplies the code that is called by the tracer for each pointer location in memory. Tracing functions aren't that complicated anyway; it's the interaction between the tracer and the mutator where things get complex, and that is largely independent of things like object memory layouts.
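As a sketch of that division of labor, transliterated into C++ for readability (in practice the front end would emit the equivalent LLVM IR, with all offsets folded to constants; the names here are hypothetical):

    // Supplied by the collector library: process one pointer location.
    extern void visitPointer(void** slot);

    struct Foo {
        void* a;      // pointer field
        long  count;  // non-pointer field, skipped by the tracer
        void* b;      // pointer field
    };

    // Generated by the front end for each concrete class: one call per
    // pointer location, with the field offsets baked in.
    void traceFoo(Foo* obj) {
        visitPointer(&obj->a);
        visitPointer(&obj->b);
    }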

In other words, I think that it should be possible to create a fairly comprehensive and efficient collector implementation in which 80% of the implementation lives in a library, and the language front end generates the remaining 20%. Moreover, it should be possible to design this library in a way that makes it possible for people to replace significant parts of the collector, so that it can be used as a basis for developing a diverse selection of collection strategies. And because all of the parts that are specific to the object model are generated by the front-end, the collector should work for a large number of languages with different object semantics.

The hardest part will be issues involving concurrency and threading. It has been my observation that support for concurrency is the one thing that you can't build up incrementally - that is, with most aspects of a GC you can start with a simple implementation and gradually add improvements (multiple generations and so on), but once you add concurrency you pretty much have to start over. Unfortunately, I am not a huge concurrent programming whiz, which is part of the reason why I haven't simply gone and implemented the collector myself. However, this does suggest to me that if we want to support concurrency (which I believe we will) then it has to be supported from the very beginning.

As I might have mentioned before on this list, I have written a few modest components that I plan to use as the basis for a collector to be named 'Scarcity'. One of these is a heap manager which replaces malloc/free and which has fairly efficient functions for things like "free all blocks for which this callback function returns true", which can be used to implement mark and sweep.
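Such a heap-manager interface might look something like the following (a guess at the shape, not Scarcity's actual API). The "free all blocks matching a predicate" operation is exactly the sweep half of mark-and-sweep:

    #include <cstddef>

    // Return true if 'block' should be freed.
    typedef bool (*BlockPredicate)(void* block, size_t size, void* context);

    class Heap {
    public:
        void* allocate(size_t size);                          // replaces malloc
        void  release(void* block);                           // replaces free
        void  freeAllIf(BlockPredicate pred, void* context);  // walk all blocks, free matches
    };

    // Sweep phase: free every block whose mark bit is clear (isMarked is a
    // hypothetical query against the collector's mark bitmap).
    //   heap.freeAllIf([](void* b, size_t, void*) { return !isMarked(b); }, nullptr);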

-- Talin

Hello,

2009/2/26 Talin <viridia@gmail.com>

The IR-level intrinsics themselves don't so much help you *write* a GC
as integrate one with LLVM. What is provided is essentially a mechanism
for walking the stack and a means to insert read/write barriers into
the generated code, which together form a tiny fraction of what it
would take to design a working collector.

<…>

In other words, I think that it should be possible to create a fairly
comprehensive and efficient collector implementation in which 80% of the
implementation lives in a library, and the language front end generates
the remaining 20%. Moreover, it should be possible to design this
library in a way that makes it possible for people to replace
significant parts of the collector, so that it can be used as a basis
for developing a diverse selection of collection strategies. And because
all of the parts that are specific to the object model are generated by
the front-end, the collector should work for a large number of languages
with different object semantics.

IMO LLVM should try to keep its Low Level design. Garbage collection is already very high level.
A library which supports building garbage collectors, on the other hand, would of course be very helpful.
Basically such a library boils down to an operating system abstraction library. Functions like "stop the world" are completely managed by the OS and not the CPU. I'm not sure if such functionality is part of LLVM's goals.

Having said that, functions like:

  • Enumerate threads

  • Pause/Resume thread

  • Get root pointers for a thread (including registers)

  • Get a list of modified memory-pages (GetWriteWatch in Windows - used in the .NET GC)

for different platforms would definitely help in building a GC. Just look at the source code of the Boehm GC: it's a completely unmaintainable mess of #ifdefs.
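For what it's worth, the portable core of such a library could be a single abstract interface with one implementation per platform instead of #ifdef soup. A sketch with invented names (GetWriteWatch is the real Windows API mentioned above; everything else here is hypothetical):

    #include <cstdint>
    #include <vector>

    struct ThreadId { uintptr_t handle; };

    class GCPlatform {
    public:
        virtual ~GCPlatform() {}
        virtual std::vector<ThreadId> enumerateThreads() = 0;
        virtual void pauseThread(ThreadId t) = 0;    // stop-the-world support
        virtual void resumeThread(ThreadId t) = 0;
        // Fill 'roots' with the stack extent and saved registers of t.
        virtual void getRootPointers(ThreadId t, std::vector<void*>& roots) = 0;
        // Pages written since the last call (GetWriteWatch on Windows;
        // soft-dirty bits or mprotect tricks elsewhere).
        virtual std::vector<void*> getModifiedPages() = 0;
    };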

A little bit off topic: Has anybody tried building a concurrent GC - running in a different process, instead of a thread?
The idea: To perform a collection you do a fork(). The child process collects all unreferenced memory regions and reports them back to the parent process. The parent process waits for the result (in a separate thread) and, when it gets the result, frees the memory regions.
This way you do not have to worry about barriers and all the nasty stuff. The maximum pause time is close to zero.
Of course, this is only useful with a working "copy-on-write" implementation of fork().

- Ralf

Ralf Schneider wrote:

A little bit off topic: Has anybody tried building a concurrent GC - running in a different _process_, instead of a thread?
The idea: To perform a collection you do a fork(). The child process collects all unreferenced memory regions and reports them back to the parent process.

I remember reading a paper in ACM SIGPLAN Notices (I think)
many years back, describing exactly such a system.

Not sure if it's the one you're referring to, but looking for the idea I found this:

http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

Jonathan

Hello,


A little bit off topic: Has anybody tried building a concurrent GC - running
in a different _process_, instead of a thread?

Yes, I had a proof-of-concept implementation of a GC with
- shared memory as the GC arena,
- (C++) throw-catch-based marking, and
- simple lookup rules for (in-arena) associated
  instance metadata.

I never had the need to finish the implementation, but
the fork approach worked reasonably well, and the mark
and sweep parts ran in the forked process, with the
shared memory communicating back the collection results.

The most amusing thing was to see how the stack unwinding
in the forked process did the marking and the original process
was not harmed.

I hope to extend the concept to multiple threads by (m)protecting
increasing parts of the arena and hoping that all threads get
caught with time. Finally, the last running threads must be
stopped and forced to fork. This last step, and how to recover
the threads from the protection violations in the original process,
are the big open questions to be solved.

Cheers,

    Gabor

What do you do if the program is multi-threaded? fork()-ing a
multithreaded program can lead to problems ...
The set of functions you can call after forking is very limited according to
POSIX: "If a multi-threaded process calls fork(), the new process
shall contain a replica of the calling thread and its entire address
space, possibly including the states of mutexes and other resources.
Consequently, to avoid errors, the child process may only execute
async-signal-safe operations until such time as one of the exec
functions is called."
(http://www.opengroup.org/onlinepubs/9699919799/functions/exec.html)

Best regards,
--Edwin

With the increasing
number of LLVM-based VMs and other projects, I suspect that the desire
for more comprehensive garbage collection support in LLVM is only going
to increase.

Absolutely!

Part of the reason why there isn't more direct support for GC is the
theory that there is no such thing as a one-size-fits-all collector. The
argument goes that a really efficient collector design requires detailed
knowledge of the object model of the language being compiled.

Yes, you do need to have some knowledge about the object model. However, it would be perfectly reasonable for LLVM to support and include multiple different collectors for different classes of language.

On the other hand, it is possible to make a counter-argument to this
theory that goes like this: The Java VM has been used to implement a
large number of front-end languages efficiently, without requiring a
special garbage collector for each language.

Most importantly to me, the takeaway from Java is that just having something that works "well enough" is really important and helps bootstrap a lot of projects, which can then take the "last 10% of performance" as an optimization opportunity, instead of being blocked from even starting with LLVM.

I'd claim that the Java VM really isn't a good way to implement a Lisp VM or something like that. However, the perf delta induced by the Java VM may just *not matter* in the big picture. At least with LLVM, a Lisp implementation could be brought up on an "OOP GC" and switched to something more custom as the project develops.

It also seems to me that even radically different collector designs
could utilize some common building blocks for heap management, work
queuing, and so on.

Yes.

Of course, there is always a danger when creating libraries of the
"ivory tower syndrome", putting a lot of effort into components that
don't actually get used. This is why it would be even better to create a
standard, high performance collector for LLVM that actually uses these
methods. <many good thoughts trimmed>

What you see in LLVM right now is really only the second step of the planned GC evolution. The first step was very minimal, but useful for bridging to other existing collectors. The second step was Gordon's (significant!) extensions to the system which allowed him to tie in the OCaml collector and bring some more sanity to codegen.

While people object to adding high level features to LLVM, high level and language-specific features are *great* in llvm as long as they are cleanly separable. I would *love* to see a composable collection of different GC subroutines with clean interfaces built on LLVM "assembly language" GC stuff.

In my ideal world, this would be:

1. Subsystems [with clean interfaces] for thread management, finalization, object model interactions, etc.
2. Within different high-level designs (e.g. copying, mark/sweep, etc) there can be replaceable policy components etc.
3. A couple of actual GC implementations built on top of #1/2. Ideally there would only be a couple of high-level collectors that can be parameterized by replacing subsystems and policies.
4. A very simple language implementation that uses the facilities, on the same order of complexity as the Kaleidoscope tutorial.

As far as I know, there is nothing that prevents this from happening today, we just need leadership in the area to drive it. To avoid the "ivory tower" problem, I'd strongly recommend starting with a simple GC and language and get the whole thing working top to bottom. From there, the various pieces can be generalized out etc. This ensures that there is always a *problem being solved* and something that works and is testable.

One of the annoying reasons that the GC stuff is only halfway fleshed out is that I was working on an out of tree project (which of course got forgotten about when I left) when developing the GC intrinsics, so there is no fully working example in public.

-Chris

ps. Code generation for the GC intrinsics can be improved significantly. We can add new intrinsics that don't pin things to the stack, update optimizations, and do many other things if people started using the GC stuff seriously.

2009/2/26 Török Edwin <edwintorok@gmail.com>

What do you do if the program is multi-threaded? fork()-ing a
multithreaded program can lead to problems …
<...>

Well, I have already thought a little bit about this problem.
If I remember right, only the thread calling fork() will remain active in the child process. That's exactly what we want.
In pseudocode, it could look like this:

Garbage collection thread:

    shared_mem = create_shared_memory()
    loop:
        wait_for_garbage_collection_signal()   # wait until a garbage collection is requested
        pause_all_other_threads()              # stop the world
        root_pointers = get_root_pointers_from_all_threads()
        resume_all_other_threads()
        child_pid = fork()
        if not child_pid:
            # child process - all threads except the current one have been stopped by fork()
            collect_garbage(root_pointers, shared_mem)  # collect unreferenced memory blocks,
                                                        # put the result in shared_mem
            exit()
        else:
            wait_for_process_exit(child_pid)
            free_memory_blocks(shared_mem)     # free all memory blocks the GC has found

=> There is probably no problem with mutexes and other synchronization primitives, because we simply do not care about them in the garbage collection thread.

But maybe I'm missing something.

Regards,
Ralf


What do you do if the program is multi-threaded? fork()-ing a
multithreaded program can lead to problems ...
<...>

My understanding is that the child process would basically only take care of the "mark" phase (and return results via shared memory for a "lazy sweep" by the allocator). Do you anticipate async-signal-unsafe operations would be required for that?

  Daveed
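For what it's worth, the mark-results-in-shared-memory handshake Daveed describes can be sketched in a few lines of POSIX C++ (illustrative only; markFromRoots and the block indexing are assumptions, and error handling is omitted):

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    static uint8_t* mark_bits;   // one byte per heap block, shared with the child
    static size_t   num_blocks;

    void gcInit(size_t blocks) {
        num_blocks = blocks;
        // MAP_SHARED | MAP_ANONYMOUS memory stays shared across fork(),
        // unlike ordinary heap pages, which become copy-on-write.
        mark_bits = (uint8_t*)mmap(nullptr, num_blocks, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }

    void collect() {
        memset(mark_bits, 0, num_blocks);
        pid_t pid = fork();
        if (pid == 0) {
            // Child: the heap is a copy-on-write snapshot; mark it undisturbed.
            // markFromRoots(mark_bits);   // hypothetical mark phase
            _exit(0);                      // async-signal-safe exit
        }
        waitpid(pid, nullptr, 0);
        // Parent: blocks whose byte is still 0 are garbage; the allocator can
        // now sweep them lazily by consulting mark_bits on each allocation.
    }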

I fear that the IR generator and GC are too tightly coupled.

For example, the IR I am generating shares pointers read from the heap even
across function calls. That is built on the assumption that the pointers are
immutable and, therefore, that the GC is non-moving. The generated code is
extremely efficient even though I have not even enabled LLVM's optimizations
yet precisely because of all this shared immutable data.

If you wanted to add a copying GC to my VM you would probably replace every
lookup of the IR register with a lookup of the code to reload it, generating
a lot of redundant loads that would greatly degrade performance so you would
rely upon LLVM's optimization passes to clean it up again. However, I bet
they do not have enough information to recover all of the lost performance.
So there is a fundamental conflict here where a simple GC design decision has
a drastic effect on the IR generator.

Although it is theoretically possible to parameterize the IR generator
sufficiently to account for all possible combinations of GC designs I suspect
the result would be a mess. Consequently, perhaps it would be better to
consider IR generation and the GC as a single entity and, instead, factor
them both out using a common high-level representation not dissimilar to JVM
or CLR bytecode in terms of functionality but much more closely related to
LLVM's IR?

This falls into the category of "future work to improve codegen". Tagging pointers as "magic for GC" is very trivial now, just mark one LLVM "address space" as being a GC heap. Apply magic to that address space, problem solved.

Let's get the basics working well first though. :)

-Chris

Jon Harrop wrote:


I fear that the IR generator and GC are too tightly coupled.

<...>

IMHO, it would be better if support for GC was dropped from llvm altogether. I say this having written a copying GC for my VM toolkit, which also uses llvm to do its JIT compilation. And it works just fine!

I have simply avoided the intrinsics.

The problem with llvm is that to write a GC using the llvm
intrinsics, you have to mess around with the code-gen part of llvm.
When I want to add a generational collector to my toolkit in the future, it is easy to specify write-barriers in the IR. Modifying code-gen to handle the intrinsics is a task I would rather avoid.

Mark.

Mark Shannon wrote:


IMHO, it would be better if support for GC was dropped from llvm
altogether. I say this having written a copying GC for my VM toolkit,
which also uses llvm to do its JIT compilation. And it works just fine!

I have simply avoided the intrinsics.

The problem with the llvm is that to write a GC using the llvm
intrinsics, you have to mess around with the code-gen part of llvm.

When I want to add a generational collector to my toolkit in the future,
it is easy to specify write-barriers in the IR. Modifying code-gen to
handle the intrinsics is a task I would rather avoid.

People need very different things for GC. All we need for Java is the
ability to dump all live object pointers into the stack and generate a bitmap
that describes which words on the stack are object pointers. Also, the optimizer
has to be taught that while objects might move during a collection, this will
never cause a valid object pointer to become NULL, nor will it change the
contents of any non-reference fields. I don't think that this is an
enormous task.

Andrew.
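A rough picture of the data structure Andrew is describing, with invented names (real stack maps use variable-length bitmaps, but the idea fits in a few lines):

    #include <cstdint>

    // Emitted by the code generator for each call site.
    struct StackMapEntry {
        void*    return_address;   // lookup key for the frame being scanned
        uint32_t frame_words;      // frame size in pointer-sized words (<= 32 here)
        uint32_t pointer_bitmap;   // bit i set => word i holds an object pointer
    };

    // At a safe point the runtime walks the stack, finds the entry matching
    // each return address, and visits exactly the flagged words.
    void scanFrame(const StackMapEntry& e, void** frame_base, void (*visit)(void**)) {
        for (uint32_t i = 0; i < e.frame_words; ++i)
            if (e.pointer_bitmap & (1u << i))
                visit(&frame_base[i]);
    }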

Hi,

I realise this might be a bit controversial ;)

Suppose I am writing a VM (such as VMKit), or a VM toolkit, and I want to add a generational GC.

If I want to use the llvm.gcwrite intrinsic for my write barrier, then
I need to write a GC and then implement the gcwrite intrinsic for each and *every* backend.

Now, if I don't use the intrinsic, I need to write my write barrier only *once* in llvm IR. All I need is a nop intrinsic and to ensure that all objects collectable by the GC are reachable from some global variable.
This ensures that the optimisation phases know that they cannot rely on memory objects not moving at GC safe points.

I have a *copying* collector that works with llvm JITted code, so I know that this works. :)

In fact, this leads to a more general point:
ANY intrinsic that is not guaranteed to be implemented by ALL backends is useless, since a front-end that uses llvm to target multiple architectures MUST avoid it.

Mark.

Hi Mark,

I don't think anyone will dispute that it's easier to hack up a shadow stack (or plug into a conservative collector) to get up and running with GC. That is absolutely the route to go if portability trumps performance.

If you review the mailing list history, I think you'll also find that developers who do care about performance have been disappointed with the impact of using a shadow stack, either managed with LLVM intrinsics or by hand. Even the current state of LLVM GC (static stack maps) is a significant performance improvement—but it absolutely does require support from the code generator. Return addresses must be mapped to stack maps, and only the code generator knows where return addresses lie and how the stack frame is laid out.

The ultimate end goal is to support schemes with still-lower execution overhead. The next step for LLVM GC would be elimination of the reload penalty for using GC intrinsics with a copying collector. This, again, requires that the code generator perform bookkeeping for GC pointers.

Jon Harrop wrote:

  

I fear that the IR generator and GC are too tightly coupled.

For example, the IR I am generating shares pointers read from the heap even across function calls. That is built on the assumption that the pointers are immutable and, therefore, that the GC is non-moving.

<...>

Most copying collector designs that I have seen rely on explicit coordination from the mutator, which means that object addresses won't change at any arbitrary time, but only during a "sync point". It's up to the mutator to call "sync" at fairly regular intervals (although calls to allocate memory for an object are implicitly sync points as well). During that call, the heap may get re-arranged, but between sync points pointer values are stable. So you can still share pointer values most of the time.

-- Talin
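The sync-point protocol Talin describes might be sketched like this (hypothetical names; the collector sets a flag, and the mutator only honors it at well-defined points):

    #include <atomic>
    #include <cstddef>

    static std::atomic<bool> gc_requested(false);

    // Called by the mutator at regular intervals; allocation calls it too.
    void sync() {
        if (gc_requested.load(std::memory_order_acquire)) {
            // enterSafePoint();  // hypothetical: publish roots, block while the
            //                    // collector moves objects, then resume
        }
        // Between sync() calls the heap is never rearranged, so pointer
        // values loaded earlier remain valid and can be freely reused.
    }

    void* allocate(size_t n);     // implicitly begins with a sync() as well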

Gordon Henriksen wrote:

Hi Mark,

I don't think anyone will dispute that it's easier to hack up a shadow stack (or plug into a conservative collector) to get up and running with GC. That is absolutely the route to go if portability trumps performance.

Why? LLVM is all about portability AND performance.

If you review the mailing list history, I think you'll also find that developers who do care about performance have been disappointed with the impact of using a shadow stack, either managed with LLVM intrinsics or by hand. Even the current state of LLVM GC (static stack maps) is a significant performance improvement—but it absolutely does require support from the code generator. Return addresses must be mapped to stack maps, and only the code generator knows where return addresses lie and how the stack frame is laid out.

I agree that the code-generator should provide information about stack-layout, and it must be possible to inform the optimisation passes that certain memory locations may be moved.

But information about stack layout is useful for things other than GC and would be useful for interactive debugging as well.

Intrinsics should be named for their function, not for their presumed usage.

The ultimate end goal is to support schemes with still-lower execution overhead. The next step for LLVM GC would be elimination of the reload penalty for using GC intrinsics with a copying collector. This, again, requires that the code generator perform bookkeeping for GC pointers.

Elimination of the reload penalty is impossible, unless the GC can be informed about traceable objects in registers.

I'm not sure where such vociferous concern on this subject arises. All the extant collector plugins I'm aware of operate in conjunction with the target-independent framework and require exactly zero code within each target backend.

No collector plugins actually use gcread/gcwrite, since there are no generational collectors for llvm (as yet).

According to the documentation (http://llvm.org/docs/GarbageCollection.html#runtime),
the GC interface is "a work in progress".

The semantics of llvm.gcroot are vague:
"At compile-time, the code generator generates information to allow the runtime to find the pointer at GC safe points."

Vague, ill-specified interfaces are worse than none.

Fundamentally, implementers of new back-ends shouldn't have to worry about GC, and implementers of GC algorithms should not have to delve into the internals of the back-end.

Mark.

Gordon Henriksen wrote:

The ultimate end goal is to support schemes with still-lower execution overhead. The next step for LLVM GC would be elimination of the reload penalty for using GC intrinsics with a copying collector. This, again, requires that the code generator perform bookkeeping for GC pointers.

Elimination of the reload penalty is impossible, unless the GC can be informed about traceable objects in registers.

Exactly.

I'm not sure where such vociferous concern on this subject arises. All the extant collector plugins I'm aware of operate in conjunction with the target-independent framework and require exactly zero code within each target backend.

No collector plugins actually use gcread/gcwrite, since there are no generational collectors for llvm (as yet).

According to the documentation (http://llvm.org/docs/GarbageCollection.html#runtime),
the GC interface is "a work in progress".

The "runtime interface" is a historical artifact. LLVM does not impose a runtime library on its users. I wouldn't have a problem deleting all mention of it, since LLVM does not impose a contract on the runtime.

The semantics of llvm.gcroot are vague:
"At compile-time, the code generator generates information to allow the
runtime to find the pointer at GC safe points."

Vague, ill-specified interfaces are worse than none.

There's nothing ill-defined about the semantics of gcroot except insofar as GC code generation is pluggable.

Fundamentally, implementers of new back-ends shouldn't have to worry about GC, and implementers of GC algorithms should not have to delve into the internals of the back-end.

This is precisely the current state of affairs.

— Gordon

I agree this could be better. I think it would be prudent of you, being aware of this problem, to structure your compiler so as to limit the number of pieces of code which would be affected when you switch to a copying collector. I think such structure in the back-end for your compiler addresses the same needs as would your "slightly less low-level virtual machine."

In order to address it thoroughly, the LLVM GC infrastructure needs to track register roots so that it doesn't need to conservatively reload from the stack so frequently. This would likely entail a change to the IR (either a 'GC' address space as Chris suggests, a new intrinsic to 'tag' a value as a GC pointer, or even a change to the Type hierarchy)--another reason to isolate GC-handling code in your compiler.

— Gordon

With the increasing number of LLVM-based VMs and other projects, I suspect that the desire for more comprehensive garbage collection support in LLVM is only going to increase.

What you see in LLVM right now is really only the second step of the planned GC evolution. The first step was very minimal, but useful for bridging to other existing collectors. The second step was Gordon's (significant!) extensions to the system which allowed him to tie in the OCaml collector and bring some more sanity to codegen.

I agree; this would be a great contribution, making LLVM much more accessible to the development of novel and existing languages.

While people object to adding high level features to LLVM, high level and language-specific features are *great* in llvm as long as they are cleanly separable. I would *love* to see a composable collection of different GC subroutines with clean interfaces built on LLVM "assembly language" GC stuff.

Absolutely.

It is definitely valuable that the existing infrastructure doesn't bolt LLVM to a particular runtime. With only a few days of work, PyPy was able to try out the LLVM GC intrinsics and static stack maps and saw a big performance boost from it on their LLVM back-end. (Their GCC backend still outperformed LLVM, but by a much smaller margin.) But this in no way prevents providing GC building blocks for projects that are not working with existing runtimes and GCs.

As far as I know, there is nothing that prevents this from happening today, we just need leadership in the area to drive it. To avoid the "ivory tower" problem, I'd strongly recommend starting with a simple GC and language and get the whole thing working top to bottom. From there, the various pieces can be generalized out etc. This ensures that there is always a *problem being solved* and something that works and is testable.

I strongly agree with this as well.

ps. Code generation for the GC intrinsics can be improved significantly. We can add new intrinsics that don't pin things to the stack, update optimizations, and do many other things if people started using the GC stuff seriously.

I've already commented on this elsewhere in the thread. Promoting GC roots into SSA variables from stack slots would allow much more freedom for the middle- and back-end optimizations, and I think is clearly the next logical step.

— Gordon