Improving Garbage Collection

Motivation & Abstract

A number of LLVM users have observed that the current garbage collection framework requires frontends to generate complex and inefficient code. This is primarily because only allocas can be stack roots - it is not possible to trace SSA values directly. Because of this limitation, a frontend wanting to support garbage collection has to manually copy any SSA values containing potential roots to memory before each call instruction (or other ‘safe point’), and reload them back into SSA values after the call instruction.

Background: How the inability to trace SSA values degrades efficiency.

Although it would be ideal if there was a way for registers to be traced directly, in practice this is very difficult, for a number of reasons - for one thing, it requires the collector to have knowledge of the register file of the current processor. It is much simpler if all garbage collection roots can be identified by an address in memory. Doing this requires that before each safe point, all register values must be spilled to memory.

Although this may seem inefficient, in practice it need not be: since most processors have a relatively small register file, chances are that only a few local variables will be live in registers at any given point. Regardless of how deep the stack is (that is, how many call frames there are and how many local variables there are in each frame), the most you should ever need to copy to memory is the entire register file - less if some of the registers contain non-root values such as integers, or dead values.

An additional optimization can be obtained if we realize that in most collectors, the current call frame can be treated as a special case. A collection cycle only happens inside one of the collector’s runtime utility functions, and those functions can be written in such a way as to not require tracing. So it is only the parent of the current call frame, and its ancestors, that need to be traced.

Unfortunately, reaching this theoretical point of maximum efficiency requires a lot of knowledge about which local variables are in registers at any given moment, and which values are live. LLVM-based frontends do not, as a rule, have this knowledge. This forces the frontend to be extremely conservative and pessimistic - it has to treat every SSA value as if it were in a register, and copy it to a stack location whether it really needs to or not. Further, because copying collectors can modify object addresses, the temporary copy on the stack has to be moved back into an SSA value afterwards.

The end result is that if you look at the disassembled code for a typical function, the generated code is a mess - it will have dozens of additional allocas at the beginning of the function, each with its own call to llvm.gcroot(), and each initialized to a “safe” value. And each call to a subroutine is surrounded by a bunch of extra stores and loads. There’s so much extra clutter that the code is hard to read.

Background: Explicit Lifetimes for Roots

Another problem with the current LLVM approach for handling stack roots has to do with the way that you signal to LLVM that a given stack root is going out of scope - which is by assigning NULL to the pointer. Theoretically, this extra store shouldn’t be necessary, as long as your tracer can handle the case where different safe points within the same function can have different stack maps (which is fairly easy to do in the current framework). That is, if there was a way to communicate to LLVM the region for which a given stack root was in scope, then the GC strategy could simply remove that variable from the stack map for all safe points which are outside that region. No store would be needed.
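To make the idea concrete, here is a minimal sketch (in C, with names and layout entirely of my own invention) of per-safe-point stack maps: each safe point gets its own list of live root locations, so a root that has gone out of scope is simply absent from later maps, and no store of NULL is ever needed to retire it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-safe-point stack map: which frame offsets hold live
 * roots at this particular safe point. A root that is out of scope at a
 * given safe point is just not listed in that safe point's map. */
typedef struct {
    void *return_address;      /* identifies the safe point */
    int num_roots;
    ptrdiff_t root_offsets[4]; /* frame offsets of live roots */
} SafePointMap;

/* Find the stack map for the safe point the frame is stopped at. */
static const SafePointMap *find_map(const SafePointMap *maps, int n,
                                    void *ret_addr) {
    for (int i = 0; i < n; i++)
        if (maps[i].return_address == ret_addr)
            return &maps[i];
    return NULL;
}
```

The tracer looks up the map for the current return address and visits only the offsets listed there; a root that died between two safe points simply never appears in the second map.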

Background: What is a root?

Many language runtimes make the assumption that all roots are simple pointers. For example, in Java, there are really only two kinds of data: primitive numerical types (floats and ints), and objects. Furthermore, all objects are type-tagged, so tracing is very simple: If you see a pointer, and it’s non-null, trace it.

However, some languages have a richer set of internal data types, which can impact garbage collection. Three fairly common examples come to mind:

  • LISP - a pointer field can either hold an actual pointer or a small integer. The least significant bit of the pointer is used to determine which type it is.
  • Tagged unions - many languages support “tagged unions” or “disjoint types”. These generally consist of a small struct containing a bitfield and a “payload” area, where the payload area is as large as the largest type in the union. The bitfield is used to determine what type is currently stored in the payload - this affects garbage collection because the collector needs to examine the bitfield in order to determine if the payload contains pointers or not.
  • Variable-length arrays. These generally contain a ‘size’ field that determines how many elements are in the array. If the array contains traceable objects, the collector needs to examine the ‘size’ field in order to determine how many values to trace. (While it is possible that you could get around this by initializing unused array elements to null, in practice this would slow down many array operations.)
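Two of the cases above can be sketched in a few lines of C. This is illustrative only - the type names and tagging convention are mine, standing in for whatever the language runtime actually uses:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* LISP-style convention: the least significant bit of a word
 * distinguishes a small integer (bit set) from a real pointer. */
static int is_small_int(uintptr_t word) { return (word & 1u) != 0; }

/* Variable-length array: the collector must read 'size' to know how
 * many elements to examine. */
typedef struct {
    size_t size;
    void **elems;  /* 'size' possibly-traceable slots */
} VarArray;

/* Count the slots the collector would actually trace: non-null words
 * that are genuine pointers rather than tagged small integers. */
static size_t count_traceable(const VarArray *a) {
    size_t n = 0;
    for (size_t i = 0; i < a->size; i++) {
        uintptr_t w = (uintptr_t)a->elems[i];
        if (w != 0 && !is_small_int(w))
            n++;
    }
    return n;
}
```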

Currently LLVM puts no restriction on the type of data that can be a root - in other words, a root is “whatever the frontend says is a root”. Furthermore, the llvm.gcroot() intrinsic allows arbitrary metadata to be associated with each root, making it relatively easy to handle non-tagged data types - the frontend can use this metadata argument to pass a table of descriptors which tells the collector how to trace that specific value.

Take the case of tagged unions as an example. We can declare the entire tagged union struct to be a “root”, meaning that what gets passed to the collector is not the address of the payload area, but the address of the whole union struct. The collector knows how to examine the bitfield and trace the payload area because of the metadata that is associated with that root - it won’t assume that the root address is the address of a pointer.
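In C terms, the metadata argument of llvm.gcroot() might end up driving a descriptor like the following. This is a sketch under my own assumptions about the runtime's shape, not the actual llvm.gcroot machinery:

```c
#include <assert.h>
#include <stddef.h>

/* A tagged union root: the collector receives the address of the whole
 * struct, not of the payload, and consults the discriminator itself. */
typedef struct {
    int disc;        /* 0 = payload holds a pointer, nonzero = it does not */
    void *payload;
} TaggedUnion;

/* Hypothetical descriptor attached to a root via gcroot metadata: it
 * tells the collector how to trace this particular kind of root. */
typedef struct {
    void *(*trace)(void *root_addr);  /* returns the pointer to trace, or NULL */
} RootDescriptor;

static void *trace_tagged_union(void *root_addr) {
    TaggedUnion *u = root_addr;
    return u->disc == 0 ? u->payload : NULL;
}
```

The point is that the collector never assumes the root address is the address of a pointer; it dispatches through the descriptor supplied by the frontend.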

Now, in practice, the only LLVM data types that will ever be roots are pointers and structs. In the latter case, this will most often be structs containing pointers, but this won’t always be the case (Imagine a tagged union containing both pointer and non-pointer members, but whose largest member contains no pointers.)

In general, it would be best if LLVM treated roots opaquely - that is, it should not attempt to interpret the contents of the root, but should instead just pass it through to the collector.

Overall Proposal: Support marking SSA values as roots (an evolutionary approach)

My proposal consists of three rather significant changes to LLVM:

  • Allow frontends to mark SSA values - or even portions of SSA values - as stack roots.
  • For alloca roots, add a way to communicate to LLVM when a root goes out of scope.
  • Transfer the responsibility for copying stack roots to memory, from the frontend to the LLVM code generators.

Let me take the last point first.

Proposal Pt 3: Copying Stack Roots to Memory

The LLVM code generators and analysis passes have a much more thorough knowledge of SSA value lifetimes than frontends do, and therefore could avoid spilling and reloading of values when it wasn’t needed. So the LLVM libraries should take over the job of creating local allocas for holding SSA values during a safe point, as well as the job of copying those SSA values to those allocas.

Of course, generating “optimal” code (as outlined in the previous section) would require a lot of work on the backends. But we don’t need to do that right away. The first goal should be a “conservative, pessimistic” implementation that’s no better or worse than what we have today (other than being far easier to use). It might be possible to do this as some sort of stand-alone transformation/lowering pass.

This is what I mean by an evolutionary approach - come up with the right interface between the frontends and LLVM that enables us to gradually move towards that eventual goal.

Proposal Pt 1: Marking SSA roots

In order for the backend to be able to properly handle SSA roots, we need some way to communicate to LLVM which values are roots. Currently the way that this is done is via the llvm.gcroot() intrinsic. The intrinsic itself actually does nothing - rather, the presence of the intrinsic instruction is used as a “marker” that is recognized by certain LLVM analysis passes.

One approach would be to extend the current scheme to work with SSA values, possibly using a new intrinsic. This is somewhat problematic because you can’t really have a ‘do-nothing’ function on SSA values - the result of the function is always going to be a copy of the value, not the original value that was passed in.

A different approach would be to somehow annotate an LLVM type with information about its ‘root-ness’. In one sense, this is logical, because it mirrors what the frontend does - use the type to determine how the garbage collector should treat an object.

However, annotating types has some major problems:

  • LLVM normally folds structurally identical types together into a single type. However, you might have two types which are structurally identical when translated into IR, but which are supposed to be treated differently from a GC standpoint. The only way to avoid this is to make the annotations part of the type’s signature - to treat types having different annotations as unique.
  • For garbage collection we need at least 1 bit and 1 pointer per type annotated. (Actually, it’s technically possible that we could do without the pointer, which is used to store the metadata, but it would make a frontend developer’s life considerably harder.) Unfortunately, increasing the size of every type by that much would be a huge cost.
  • Pointers in LLVM have an “address space” attribute, from which we can steal a few bits. Unfortunately, this has two problems: Not all roots are pointers, and a few bits aren’t really enough.
Given that, there are a few different options I can think of - none of which I completely like, but each of which is better than anything else I’ve come up with so far.

Option 1: Make type annotations first-class types in LLVM

The idea here is that you would have a new kind of type expression in LLVM, which “wraps” an existing type and adds an annotation to it. I’m going to randomly pick a character that isn’t used yet in the LLVM ASCII serialization format (#) and craft an example of what this might look like:

%mytype = type #( { %mytype*, i32 }, i8* %annotation )

So basically this is saying that ‘mytype’ is an annotated struct - you can use it just like you would use a struct, but it’s a distinct type that will not be folded with other structs, unless they have an identical annotation.

Multiple annotations can be handled either by nested annotations (one annotation wrapping another), or by allowing the annotation construct to take multiple arguments.

You can have structs that are roots embedded inside non-root structs (this applies to options 2a and 2b as well):

%mytype = type { int, #( { %mytype*, i32 }, i8* %annotation ) }

The way to interpret this is as follows: if you have an SSA value whose type is “mytype”, then only the second member needs to be copied to memory during a safe point, and the address passed to the garbage collector is the address of that member only.

The first advantage of this approach is that it’s “pay for what you use” - only people who use annotations have to pay the memory cost for them.
A second advantage of this approach is that you could use it for other things besides garbage collection - const, volatile, thread-local, and any other value attributes you can think of.

The big downside here is that you would have to go and update the LLVM code base to “unwrap” type expressions - that is, any function that takes an LLVM type as an argument now has to potentially strip off an arbitrary number of annotations before it can work with the type. This is a vast amount of work.

Option 2a: Allow annotations for Structs only

Chris Lattner’s proposal for named struct types has a fortunate side effect, which is that we can now control whether types are folded or not. Unfortunately, this only applies to structs, and not pointers, which are the other type we are interested in. However, we can convert any arbitrary LLVM type into a struct by simply wrapping it in a struct.

So let’s say that in addition to allowing structs to be named, we also allow arbitrary annotations to be added. (In fact, the struct name could be implemented as one type of annotation, so we’re really only adding one extra field.) We make sure that any struct that is a root has a unique name, and add the appropriate ‘root’ annotation to it.

This means that any garbage-collectable pointer type such as X* now has to be transformed into { X* }. This shouldn’t affect anything except that now you have to add an extra 0 to your GEP expressions. Ah well, such is life.
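A rough C analogy of that transformation (names invented for illustration): the single-member struct wrapper costs one extra member access, corresponding to the extra 0 index in the GEP.

```c
#include <assert.h>

/* A hypothetical GC-managed object type. */
typedef struct X { int field; } X;

/* What was a plain X* becomes a one-member struct, so that the struct
 * (which can carry a unique name/annotation) is what marks the root. */
typedef struct { X *ptr; } GCRootX;

static int read_field(GCRootX r) {
    /* The '.ptr' step is the C analogue of the extra 0 in the GEP. */
    return r.ptr->field;
}
```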

Option 2b: Annotations on structs via an auxiliary data structure.

Instead of adding the ability to annotate struct types, we build a parallel data structure which maps the name of the struct to its garbage collection metadata. The upside of this is that it involves the fewest changes to LLVM, but it has a few downsides:

  • You still have to unwrap pointers before using them (via an extra 0 to GEP.)
  • This data structure needs to be serialized as part of the module.
  • The information about the struct is no longer all in one place; the various backend passes that know about stack roots now have to correlate information from two different sources.

Proposal Pt 2: Explicit Lifetimes for Alloca Roots

The remaining part of the proposal is to allow the frontend to tell LLVM when an alloca variable has gone out of scope. This is easy - just add another intrinsic, similar to llvm.gcroot, which is used as a marker.

That doesn’t seem like a huge problem. In the common case, the “marked” SSA value and the “unmarked” original wouldn’t be live simultaneously. Uses of the unmarked version reachable from the mark intrinsic can either be treated as a separate value or just left undefined.

It makes sense to also add an “unmark” intrinsic to define when the SSA stack root goes out of scope.

That’s an interesting suggestion.

I realized, however, that there’s another advantage to using type-based marking rather than an intrinsic - it allows you to declare function parameters as stack roots. Depending on the calling convention, some of the parameters to a function will be located on the stack while others will be passed in registers. The ones already in memory ought to be able to be traced right where they are, but unfortunately there’s no way to mark them as roots with the intrinsic method - which means that the frontend has to copy the parameters into an alloca and then mark that as a root.

Although this would indeed be nice, it is not done by similar platforms in practice. I have looked [very] briefly into whether the CLR or JVM implement garbage collection in their IR, and it does not seem that they do (meaning, the CLR/JVM implementation itself is responsible for garbage collection, not the code generated by the CLR/Java language compilers).

However, there remains the one strong point in favor of taking an IR approach to GC that you pointed out. I have to say that it makes a convincing argument, since the generators and analysis passes would know even more about the lifetime of variables in code than any plug-in GC library would. I’d be curious as to what benefits could be gained from this approach over the current standard way of doing it; but I’m not terribly hopeful.

I’m not sure this is a valid comparison. CLR and JVM IRs are typed, and by nature of those VMs, all pointers on the stack and in registers are automatically GC roots. Unlike LLVM, the JVM (and I think the CLR) doesn’t even have a concept of a non-GC pointer.

Sebastian

For the past few years, my group in Intel Labs has been working on a project similar to LLVM and C--, and perhaps our experience in handling roots and stack walking could be useful in deciding how LLVM should evolve in the GC area. Our project is called Pillar (you can see our paper “Pillar: A Parallel Implementation Language” in Languages and Compilers for Parallel Computing 2008 for a general overview). Pillar is a generic language and runtime designed to support managed languages and parallel languages, although most of our work so far has been on the managed side. We’ve implemented a JVM and a functional language on top of Pillar, so it has demonstrated some generality.

In the Pillar language, we have a generalized ref (similar to llvm.gcroot) data type that encompasses different kinds of pointers and types into the managed heap. These could be regular pointers, a couple of varieties of interior pointers, or user-defined pointer types. The underlying compiler then generates per-callsite stack maps. Those stack maps can indicate when a live ref (local or parameter) is at a known offset from the start of the frame and can also indicate when a live ref is present in a register.

If we want to enumerate the refs of a call stack, we call a runtime function, prtYoungestActivation, and that gives us back a stack iterator for the sequence of activations. The iterator contains the value for esp and pointers to the memory locations where ebp, eip, edi, esi, and ebx were last saved to the stack. (For architectures other than 32-bit x86, the register-specific contents of the activation would of course be defined differently.) In the eip case, this is saved for every activation as part of the call sequence and so provides you with the ability to determine which call site you are at. The other registers may have been saved by the current frame or a subsequent frame. The stack iterator also contains a virtual frame number that basically allows an unwinder to see logical activations instead of physical activations that may combine multiple logical activations via inlining.

You can then call another runtime function, prtEnumerateRootsOfActivation, to ask for the roots associated with the given stack iterator’s current activation to be enumerated. To this function, you pass a PrtRseInfo struct that contains a callback that gets invoked once per live ref found, and an opaque environment value that is passed (unexamined) on to the callback function. To go to the next activation on the call stack, you then call the runtime’s prtNextActivation. There are routines to determine when you’ve got to the bottom of the call stack as well. In addition, there are routines for performing other (non-GC) operations on the stack iterator’s activation, such as looking up data associated with the SPAN construct (taken from C--) or exception handlers.
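The enumeration protocol described above might be driven by a loop along the following lines. To be clear, the signatures here are my guesses for illustration (the real Pillar headers define them), prtPastEndOfStack is a name I made up for the unnamed "bottom of the call stack" routine, and the prt* bodies below are local stubs (a fake 3-frame stack) standing in for the actual runtime:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shapes, for illustration only. */
typedef struct { int frame; } PrtStackIterator;
typedef struct {
    void (*callback)(void *env, void **root); /* invoked once per live ref */
    void *env;                                /* opaque, passed unexamined */
} PrtRseInfo;

/* Local stubs standing in for the Pillar runtime: a 3-frame stack with
 * one root per frame. */
static void *fake_roots[3];
static void prtYoungestActivation(PrtStackIterator *si) { si->frame = 0; }
static int  prtPastEndOfStack(const PrtStackIterator *si) { return si->frame >= 3; }
static void prtNextActivation(PrtStackIterator *si) { si->frame++; }
static void prtEnumerateRootsOfActivation(PrtStackIterator *si, PrtRseInfo *rse) {
    rse->callback(rse->env, &fake_roots[si->frame]);
}

static int roots_seen;
static void count_cb(void *env, void **root) {
    (void)env; (void)root;
    roots_seen++;
}

/* Walk every activation, youngest first, enumerating its roots. */
static void enumerate_all_roots(void (*cb)(void *, void **), void *env) {
    PrtStackIterator si;
    PrtRseInfo rse = { cb, env };
    for (prtYoungestActivation(&si); !prtPastEndOfStack(&si);
         prtNextActivation(&si))
        prtEnumerateRootsOfActivation(&si, &rse);
}
```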

We have a couple of associated constructs in this regard. The first is the concept of code regions. We can define regions of code and define different owners for each region. A typical configuration is for most code to be in a region associated with our compiler, another region for code implemented in the Pillar runtime, and another region for code implemented in our GC. You first register a code info manager with the Pillar runtime and then associate regions of code with that manager. When you register, you provide a set of callback functions to enumerate roots, unwind to the next activation, etc. So, the Pillar compiler’s code info manager uses a callback that consults the previously generated stack maps in order to enumerate roots. The GC could optimize its barriers and provide its own code info manager to handle enumerating roots in those barriers.

Thoughts?

Todd

Below are the relevant portions from our runtime’s include files.

/* This is essentially a pointer-sized integer. */

typedef char *PrtRegister;

/* These are used to document function parameters to indicate whether they are IN, OUT, or INOUT. */

#define PRT_IN

#define PRT_OUT

#define PRT_INOUT

/*

Perhaps you are right to point that out. I can’t claim to be an expert in compiler/language/GC design here.

My thoughts are many, and inline below:

For the past few years, my group in Intel Labs has been working on a project similar to LLVM and C--, and perhaps our experience in handling roots and stack walking could be useful in deciding how LLVM should evolve in the GC area. Our project is called Pillar (you can see our paper “Pillar: A Parallel Implementation Language” in Languages and Compilers for Parallel Computing 2008 for a general overview). Pillar is a generic language and runtime designed to support managed languages and parallel languages, although most of our work so far has been on the managed side. We’ve implemented a JVM and a functional language on top of Pillar, so it has demonstrated some generality.

In the Pillar language, we have a generalized ref (similar to llvm.gcroot) data type that encompasses different kinds of pointers and types into the managed heap. These could be regular pointers, a couple of varieties of interior pointers, or user-defined pointer types. The underlying compiler then generates per-callsite stack maps. Those stack maps can indicate when a live ref (local or parameter) is at a known offset from the start of the frame and can also indicate when a live ref is present in a register.

I assume that by interior pointer you mean a pointer to the inside of another object?

If we want to enumerate the refs of a call stack, we call a runtime function, prtYoungestActivation, and that gives us back a stack iterator for the sequence of activations. The iterator contains the value for esp and pointers to the memory locations where ebp, eip, edi, esi, and ebx were last saved to the stack. (For architectures other than 32-bit x86, the register-specific contents of the activation would of course be defined differently.) In the eip case, this is saved for every activation as part of the call sequence and so provides you with the ability to determine which call site you are at. The other registers may have been saved by the current frame or a subsequent frame. The stack iterator also contains a virtual frame number that basically allows an unwinder to see logical activations instead of physical activations that may combine multiple logical activations via inlining.

That all makes sense.

I would say that your framework is at a somewhat higher level than I would expect LLVM to support - by that, I mean that ideally it should be possible to use the primitives that LLVM supplies to build exactly what you have described. My own use case has slightly different requirements, and as a result I could not adopt your framework as it is - but I see no reason why I should not be able to use the same LLVM primitives to build what I need.

The main difference has to do with the handling of disjoint types in my language. For example, if I declare a variable using the following syntax:

var a:String or float or int;

What the compiler generates is a data structure that looks like this (as it would appear in C):

struct {
    int disc:2;   // 0 = String, 1 = int, 2 = float
    union {
        String *s;
        int i;
        float f;
    };
};

The garbage collector needs to examine the ‘disc’ field in order to know whether to trace ‘s’ or not. The way I handle this is by treating the entire structure as a root, rather than just the pointer field, so that what the garbage collector sees is a pointer to the beginning of the struct.

I suspect that register maps wouldn’t be able to handle this case very well. Since the data structure can’t fit into a single register, it would have to be ‘sliced’ - half in one register and half in another - and the register map would need some way to communicate to the collector which half of the struct is in which register. (Or worse, what if only one half was in a register and the other half on the stack?) I’m guessing that the right answer is “don’t do that” - meaning, don’t allow this data structure to be sliced in that way, at least not during a safe point.
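Concretely, treating the whole struct as the root means the collector’s trace routine for this type might look like the following sketch (names are mine; the struct mirrors the one above, minus the bitfield for simplicity):

```c
#include <assert.h>
#include <stddef.h>

typedef struct String { char c; } String;  /* stand-in heap object */

/* The disjoint type from above: the collector is handed the address of
 * the whole struct, not of the pointer field. */
typedef struct {
    int disc;  /* 0 = String, 1 = int, 2 = float */
    union {
        String *s;
        int i;
        float f;
    };
} StringOrIntOrFloat;

/* Returns the pointer to trace, or NULL if the payload holds no pointer. */
static String *trace_disjoint(void *root_addr) {
    StringOrIntOrFloat *v = root_addr;
    return v->disc == 0 ? v->s : NULL;
}
```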

You can then call another runtime function, prtEnumerateRootsOfActivation, to ask for the roots associated with the given stack iterator’s current activation to be enumerated. To this function, you pass a PrtRseInfo struct that contains a callback that gets invoked once per live ref found, and an opaque environment value that is passed (unexamined) on to the callback function. To go to the next activation on the call stack, you then call the runtime’s prtNextActivation. There are routines to determine when you’ve got to the bottom of the call stack as well. In addition, there are routines for performing other (non-GC) operations on the stack iterator’s activation, such as looking up data associated with the SPAN construct (taken from C--) or exception handlers.

In my own case, I don’t have an iterator for roots within a single activation, rather I have a static data structure that describes how to find all of the roots within the call frame. One of my design goals is to minimize the number of function calls required to trace a pointer, so the different parts of my code accept whole stack maps as arguments instead of single pointers.

However, the LLVM approach to generating stack maps is flexible enough to handle either case - your GCStrategy plugin gets a list of all stack roots for each safe point, and you can transform that into whatever data structures you need.

We have a couple of associated constructs in this regard. The first is the concept of code regions. We can define regions of code and define different owners for each region. A typical configuration is for most code to be in a region associated with our compiler, another region for code implemented in the Pillar runtime, and another region for code implemented in our GC. You first register a code info manager with the Pillar runtime and then associate regions of code with that manager. When you register, you provide a set of callback functions to enumerate roots, unwind to the next activation, etc. So, the Pillar compiler’s code info manager uses a callback that consults the previously generated stack maps in order to enumerate roots. The GC could optimize its barriers and provide its own code info manager to handle enumerating roots in those barriers.

Thoughts?

Todd

Below are the relevant portions from our runtime’s include files.

/* This is essentially a pointer-sized integer. */

typedef char *PrtRegister;

/* These are used to document function parameters to indicate whether they are IN, OUT, or INOUT. */

#define PRT_IN

#define PRT_OUT

#define PRT_INOUT

/*

For the past few years, my group in Intel Labs has been working on a project similar to LLVM and C--, and perhaps our experience in handling roots and stack walking could be useful in deciding how LLVM should evolve in the GC area. Our project is called Pillar (you can see our paper “Pillar: A Parallel Implementation Language” in Languages and Compilers for Parallel Computing 2008 for a general overview). Pillar is a generic language and runtime designed to support managed languages and parallel languages, although most of our work so far has been on the managed side. We’ve implemented a JVM and a functional language on top of Pillar, so it has demonstrated some generality.

In the Pillar language, we have a generalized ref (similar to llvm.gcroot) data type that encompasses different kinds of pointers and types into the managed heap. These could be regular pointers, a couple of varieties of interior pointers, or user-defined pointer types. The underlying compiler then generates per-callsite stack maps. Those stack maps can indicate when a live ref (local or parameter) is at a known offset from the start of the frame and can also indicate when a live ref is present in a register.

I assume that by interior pointer you mean a pointer to the inside of another object?

[TAA]

Yes.

If we want to enumerate the refs of a call stack, we call a runtime function, prtYoungestActivation, and that gives us back a stack iterator for the sequence of activations. The iterator contains the value for esp and pointers to the memory locations where ebp, eip, edi, esi, and ebx were last saved to the stack. (For architectures other than 32-bit x86, the register-specific contents of the activation would of course be defined differently.) In the eip case, this is saved for every activation as part of the call sequence and so provides you with the ability to determine which call site you are at. The other registers may have been saved by the current frame or a subsequent frame. The stack iterator also contains a virtual frame number that basically allows an unwinder to see logical activations instead of physical activations that may combine multiple logical activations via inlining.

That all makes sense.

I would say that your framework is at a somewhat higher level than I would expect LLVM to support - by that, I mean that ideally it should be possible to use the primitives that LLVM supplies to build exactly what you have described. My own use case has slightly different requirements, and as a result I could not adopt your framework as it is - but I see no reason why I should not be able to use the same LLVM primitives to build what I need.

The main difference has to do with the handling of disjoint types in my language. For example, if I declare a variable using the following syntax:

[TAA]

I’m not so sure we couldn’t support your case. If a disjoint type might contain a root, then you could define the possible pointer in the struct as a generalized ref having a user-defined type and extraneous data. That user-defined generalized ref type for you would indicate a disjoint union, and the extraneous data could indicate the address of the discriminator in the union. Whether the data was enregistered or not wouldn’t matter in that case. The GC root callback is passed the user-defined type and extraneous data you declare the field with.

I agree that much of this is at a higher level than LLVM should have to worry about, but having such a generalized ref type in the core seems important so you can handle ref params and refs in registers. You can define it as data structures in the core and implement a library on top.

You can then call another runtime function, prtEnumerateRootsOfActivation, to ask for the roots associated with the given stack iterator’s current activation to be enumerated. To this function, you pass a PrtRseInfo struct that contains a callback that gets invoked once per live ref found, and an opaque environment value that is passed (unexamined) on to the callback function. To go to the next activation on the call stack, you then call the runtime’s prtNextActivation. There are routines to determine when you’ve got to the bottom of the call stack as well. In addition, there are routines for performing other (non-GC) operations on the stack iterator’s activation, such as looking up data associated with the SPAN construct (taken from C--) or exception handlers.

In my own case, I don’t have an iterator for roots within a single activation, rather I have a static data structure that describes how to find all of the roots within the call frame. One of my design goals is to minimize the number of function calls required to trace a pointer, so the different parts of my code accept whole stack maps as arguments instead of single pointers.

However, the LLVM approach to generating stack maps is flexible enough to handle either case - your GCStrategy plugin gets a list of all stack roots for each safe point, and you can transform that into whatever data structures you need.

[TAA]

They seem interchangeable…if you have a data structure you can just as easily write an iterator. I’m curious if you see this as performance critical. In my experience, GC times tend to be completely dominated by walking the live objects and not root set enumeration.

For the past few years, my group in Intel Labs has been working on a project similar to LLVM and C–, and perhaps our experience in handling roots and stack walking could be useful in deciding how LLVM should evolve in the GC area. Our project is called Pillar (you can see our paper “Pillar: A Parallel Implementation Language” in Languages and Compilers for Parallel Computing 2008 for a general overview). Pillar is a generic language and runtime designed to support managed languages and parallel languages, although most of our work so far has been on the managed side. We’ve implemented a JVM and a functional language on top of Pillar, so it has demonstrated some generality.

In the Pillar language, we have a generalized ref (similar to llvm.gcroot) data type that encompasses different kinds of pointers and types into the managed heap. These could be regular pointers, a couple of varieties of interior pointers, or user-defined pointer types. The underlying compiler then generates per-callsite stack maps. Those stack maps can indicate when a live ref (local or parameter) is at a known offset from the start of the frame and can also indicate when a live ref is present in a register.
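A per-callsite stack map of the kind described could be sketched in C roughly as follows. Every name and field here is invented for illustration only; this is not the actual Pillar layout:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative-only sketch of a per-callsite stack-map entry: each live
 * ref is either at a known offset from the frame base or held in a
 * register. Names and layout are assumptions, not Pillar's real data
 * structures. */
typedef enum { REF_IN_FRAME, REF_IN_REGISTER } RefLocationKind;

typedef struct {
    RefLocationKind kind;
    int32_t frame_offset; /* valid when kind == REF_IN_FRAME    */
    int32_t reg_id;       /* valid when kind == REF_IN_REGISTER */
} StackMapEntry;

typedef struct {
    uintptr_t call_site_ip; /* return address identifying the call site */
    size_t num_entries;
    const StackMapEntry *entries;
} CallSiteStackMap;

/* Resolve one entry to the address of the slot holding the ref, given
 * the frame base and an array of saved register values. */
void **resolve_ref_slot(const StackMapEntry *e, char *frame_base,
                        void **saved_regs)
{
    if (e->kind == REF_IN_FRAME)
        return (void **)(frame_base + e->frame_offset);
    return &saved_regs[e->reg_id];
}
```

A collector would iterate the entries for the call site it is stopped at and visit each resolved slot.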

I assume that by interior pointer you mean a pointer to the inside of another object?

[TAA]

Yes.

If we want to enumerate the refs of a call stack, we call a runtime function, prtYoungestActivation, and that gives us back a stack iterator for the sequence of activations. The iterator contains the value for esp and pointers to the memory locations where ebp, eip, edi, esi, and ebx were last saved to the stack. (For architectures other than 32-bit x86, the register-specific contents of the activation would of course be defined differently.) In the eip case, this is saved for every activation as part of the call sequence and so provides you with the ability to determine which call site you are at. The other registers may have been saved by the current frame or a subsequent frame. The stack iterator also contains a virtual frame number that basically allows an unwinder to see logical activations instead of physical activations that may combine multiple logical activations via inlining.
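The stack iterator described above might be modeled roughly like this (32-bit x86 case). The field names and types are assumptions based only on the description, not the real Pillar definitions:

```c
#include <stdint.h>

/* Illustrative sketch of the stack iterator: the esp value itself, plus
 * pointers to the stack slots where the other registers were last saved.
 * All names here are invented for illustration. */
typedef struct {
    uint32_t  esp;          /* stack pointer value for this activation      */
    uint32_t *ebp_slot;     /* where ebp was last saved                     */
    uint32_t *eip_slot;     /* saved return address: identifies call site   */
    uint32_t *edi_slot;
    uint32_t *esi_slot;
    uint32_t *ebx_slot;
    uint32_t  virtual_frame; /* logical frame index within a physical frame
                                that may combine inlined activations        */
} StackIterator;

/* The call site of the current activation is recovered from the saved
 * eip, which is stored by every call sequence. */
uint32_t current_call_site(const StackIterator *it)
{
    return *it->eip_slot;
}
```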

That all makes sense.

I would say that your framework is at a somewhat higher level than I would expect LLVM to support - by that, I mean that ideally it should be possible to use the primitives that LLVM supplies to build exactly what you have described. My own use case has slightly different requirements, and as a result I could not adopt your framework as it is - but I see no reason why I should not be able to use the same LLVM primitives to build what I need.

The main difference has to do with the handling of disjoint types in my language. For example, if I declare a variable using the following syntax:

[TAA]

I’m not so sure we couldn’t support your case. If a disjoint type might contain a root then you could define the possible pointer in the struct as a generalized ref having a user-defined type and extraneous data. That user-defined generalized ref type for you would indicate a disjoint union and the extraneous data could indicate the address of the discriminator in the union. Whether the data was enregistered or not wouldn’t matter in that case. The GC root callback is passed the user-defined type and extraneous data you declare the field with.

Interesting. Let me think about that. It would indeed be nice if that worked. How would you handle a case where the union member was (String *)[3] instead of just String *?

I agree that much of this is at a higher level that LLVM should not have to worry about but having such a generalized ref type in the core seems important so you can handle ref params and refs in registers. You can define it as data structures in the core and implement a library on top.

I’m in agreement here.

You can then call another runtime function, prtEnumerateRootsOfActivation, to ask for the roots associated with the given stack iterator’s current activation to be enumerated. To this function, you pass a PrtRseInfo struct that contains a callback that gets invoked once per live ref found, and an opaque environment variable passed (unexamined) on to the callback function. To go to the next activation on the call stack, you then call the runtime’s prtNextActivation. There are routines to determine when you’ve got to the bottom of the call stack as well. In addition, there are routines for performing other (non-GC) operations on the stack iterator’s activation, such as looking up data associated with the SPAN construct (taken from C–) or exception handlers.
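The callback-driven enumeration could be sketched as follows, with a toy in-memory activation list standing in for the runtime. The function names are loosely modeled on prtEnumerateRootsOfActivation and prtNextActivation, but every signature here is an assumption:

```c
#include <stddef.h>

/* Illustrative sketch only: a callback invoked once per live ref, plus
 * an opaque environment passed through unexamined, as described above. */
typedef void (*RefCallback)(void **ref_slot, void *env);

typedef struct {
    RefCallback callback; /* invoked once per live ref found */
    void *env;            /* passed (unexamined) on to the callback */
} PrtRseInfo;

/* Toy activation record: a frame with a known set of ref slots and a
 * link to its caller, standing in for the real stack iterator. */
typedef struct Activation {
    void **ref_slots[4];
    size_t num_refs;
    struct Activation *caller;
} Activation;

static void enumerate_roots_of_activation(Activation *a, PrtRseInfo *info)
{
    for (size_t i = 0; i < a->num_refs; i++)
        info->callback(a->ref_slots[i], info->env);
}

/* Walk from the youngest activation to the bottom of the stack,
 * enumerating the roots of each frame in turn. */
static void enumerate_all_roots(Activation *youngest, PrtRseInfo *info)
{
    for (Activation *a = youngest; a != NULL; a = a->caller)
        enumerate_roots_of_activation(a, info);
}

/* Example callback: count the refs found. */
static void count_ref(void **slot, void *env)
{
    (void)slot;
    (*(int *)env)++;
}
```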

In my own case, I don’t have an iterator for roots within a single activation, rather I have a static data structure that describes how to find all of the roots within the call frame. One of my design goals is to minimize the number of function calls required to trace a pointer, so the different parts of my code accept whole stack maps as arguments instead of single pointers.

However, the LLVM approach to generating stack maps is flexible enough to handle either case - your GCStrategy plugin gets a list of all stack roots for each safe point, and you can transform that into whatever data structures you need.

[TAA]

They seem interchangeable…if you have a data structure you can just as easily write an iterator. I’m curious if you see this as performance critical. In my experience, GC times tend to be completely dominated by walking the live objects and not root set enumeration.

Well, I’ve always been a premature optimizer :) However, I use the same data structure type to describe both stack roots and heap objects - essentially I treat the stack frame as though it were another object (except that the offsets are usually negative).
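The “stack frame as just another object” idea might look like this in outline - a single descriptor of signed slot offsets serving both heap objects and frames. This is illustrative only, not the actual structures:

```c
#include <stddef.h>

/* Illustrative sketch: one trace descriptor serves both heap objects
 * and stack frames by listing the signed offsets of the ref slots it
 * contains. For a heap object the offsets are non-negative; for a
 * frame, measured from the frame pointer, they are usually negative. */
typedef struct {
    size_t num_refs;
    const ptrdiff_t *ref_offsets; /* signed: may be negative for frames */
} TraceDescriptor;

typedef void (*SlotVisitor)(void **slot, void *env);

/* Visit every ref slot described by `desc`, relative to `base`, which
 * is either an object's start address or a frame pointer. */
static void trace_with_descriptor(const TraceDescriptor *desc, char *base,
                                  SlotVisitor visit, void *env)
{
    for (size_t i = 0; i < desc->num_refs; i++)
        visit((void **)(base + desc->ref_offsets[i]), env);
}

/* Example visitor: count the slots visited. */
static void count_slot(void **slot, void *env)
{
    (void)slot;
    (*(int *)env)++;
}
```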

Oh, one other minor misunderstanding I wanted to clear up: the “llvm.gcroot” intrinsic, as it exists today, is used to mark values, not types, as being roots. One of the things I am proposing is to make it so that the marking of roots is type-based rather than value-based. This would, among other benefits, allow function parameters to be roots.

If we want to enumerate the refs of a call stack, we call a runtime function, prtYoungestActivation, and that gives us back a stack iterator for the sequence of activations. The iterator contains the value for esp and pointers to the memory locations where ebp, eip, edi, esi, and ebx were last saved to the stack. (For architectures other than 32-bit x86, the register-specific contents of the activation would of course be defined differently.) In the eip case, this is saved for every activation as part of the call sequence and so provides you with the ability to determine which call site you are at. The other registers may have been saved by the current frame or a subsequent frame. The stack iterator also contains a virtual frame number that basically allows an unwinder to see logical activations instead of physical activations that may combine multiple logical activations via inlining.

That all makes sense.

I would say that your framework is at a somewhat higher level than I would expect LLVM to support - by that, I mean that ideally it should be possible to use the primitives that LLVM supplies to build exactly what you have described. My own use case has slightly different requirements, and as a result I could not adopt your framework as it is - but I see no reason why I should not be able to use the same LLVM primitives to build what I need.

The main difference has to do with the handling of disjoint types in my language. For example, if I declare a variable using the following syntax:

[TAA]

I’m not so sure we couldn’t support your case. If a disjoint type might contain a root then you could define the possible pointer in the struct as a generalized ref having a user-defined type and extraneous data. That user-defined generalized ref type for you would indicate a disjoint union and the extraneous data could indicate the address of the discriminator in the union. Whether the data was enregistered or not wouldn’t matter in that case. The GC root callback is passed the user-defined type and extraneous data you declare the field with.

Interesting. Let me think about that. It would indeed be nice if that worked. How would you handle a case where the union member was (String *)[3] instead of just String *?

[TAA] So it is either an array of ints or an array of strings? A generalized ref is just another type, so you can declare arrays of them. Your discriminator could even be a bitmask that indicates which members of the array were ints and which were strings, assuming you wanted to mix-and-match rather than have all ints or all strings. Either way it works.
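A sketch of the bitmask-discriminator arrangement, with invented types (the String type and the field names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct String String; /* opaque heap type, for illustration */

/* Illustrative sketch: a union member that is an array of three slots,
 * each holding either an int or a String pointer. Bit i of `mask` says
 * whether slot i currently holds a heap ref. */
typedef struct {
    uint8_t mask; /* bit i set => slots[i] is a String*, clear => an int */
    union {
        intptr_t as_int;
        String  *as_ref;
    } slots[3];
} MixedArray;

/* The GC root callback would consult the mask (reached via the
 * extraneous data) to decide which slots to report as roots. */
static bool slot_is_ref(const MixedArray *a, int i)
{
    return (a->mask >> i) & 1u;
}
```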

You can then call another runtime function, prtEnumerateRootsOfActivation, to ask for the roots associated with the given stack iterator’s current activation to be enumerated. To this function, you pass a PrtRseInfo struct that contains a callback that gets invoked once per live ref found, and an opaque environment variable passed (unexamined) on to the callback function. To go to the next activation on the call stack, you then call the runtime’s prtNextActivation. There are routines to determine when you’ve got to the bottom of the call stack as well. In addition, there are routines for performing other (non-GC) operations on the stack iterator’s activation, such as looking up data associated with the SPAN construct (taken from C–) or exception handlers.

In my own case, I don’t have an iterator for roots within a single activation, rather I have a static data structure that describes how to find all of the roots within the call frame. One of my design goals is to minimize the number of function calls required to trace a pointer, so the different parts of my code accept whole stack maps as arguments instead of single pointers.

However, the LLVM approach to generating stack maps is flexible enough to handle either case - your GCStrategy plugin gets a list of all stack roots for each safe point, and you can transform that into whatever data structures you need.

[TAA]

They seem interchangeable…if you have a data structure you can just as easily write an iterator. I’m curious if you see this as performance critical. In my experience, GC times tend to be completely dominated by walking the live objects and not root set enumeration.

Well, I’ve always been a premature optimizer :) However, I use the same data structure type to describe both stack roots and heap objects - essentially I treat the stack frame as though it were another object (except that the offsets are usually negative).

Oh, one other minor misunderstanding I wanted to clear up: the “llvm.gcroot” intrinsic, as it exists today, is used to mark values, not types, as being roots. One of the things I am proposing is to make it so that the marking of roots is type-based rather than value-based. This would, among other benefits, allow function parameters to be roots.

[TAA] I would be in complete agreement here.

The initial step of your evolutionary approach sounds basically equivalent to what is possible today. When you try to go beyond that you will run into a lot of issues, since the backend is untyped and doesn’t even distinguish between pointers and integers. The closest thing we have to tracking liveness of SSA values throughout the backend is debug info, which is frequently incorrect (although usually not due to the code generator itself).

Who is going to implement support for all of this in the code generator passes and maintain it? Will GC compatibility be a requirement for any further improvements in code generation?

Cameron

[TAA] I don’t know who would do it but perhaps it isn’t as daunting as one might expect. We implemented most of this in the Intel compiler and ended up changing about 1-2% of the lines in the code base. The trend seems to be for most new languages to be garbage collected. Yes, you can do garbage collection (even accurate garbage collection) totally at the user-level but in our experience you may be leaving 10% or more performance on the table by not having a tighter integration with the underlying compiler.

You don't need function parameters to be stack roots. You already
have a stack root for those values in the calling function, right?

Anyway, declaring the function parameters as stack roots means that
you *must* pass pointers into the GC heap to those functions, whether
or not it makes any difference to the function itself. That means,
among other things, that optimization passes for turning GC
allocations into pool allocations or local allocations become more
complicated.

You don't need function parameters to be stack roots. You already
have a stack root for those values in the calling function, right?

Allowing parameters themselves to be stack roots allows an object that
the calling function itself no longer needs to be collected once the
called function (and everyone else, obviously) is done with it.

It would also make tail calls easier to support: if parameters can't
be stack roots, the called function would have to copy each pointer
parameter to a stack root before doing anything else, or at least make
sure the GC doesn't kick in before it does so. And it would have to do
so whether or not it's ever actually tail called, because it (usually)
can't know who calls it and how.

Anyway, declaring the function parameters as stack roots means that
you *must* pass pointers into the GC heap to those functions, whether
or not it makes any difference to the function itself. That means,
among other things, that optimization passes for turning GC
allocations into pool allocations or local allocations become more
complicated.

Not if the GC is smart enough to ignore pointers that don't point into
its heap. This shouldn't be that much extra work to achieve because it
needs to figure out what object a pointer points to anyway (at least
if interior pointers are allowed). If a pointer doesn't point to an
object it knows about, it can simply be ignored.
This does depend on the structure of the GC heap though: if the GC
knows beforehand which pointers point into its own heap the "figure
out what object it points to" algorithm may be able to use some tricks
that aren't safe otherwise.
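A minimal sketch of the "ignore pointers outside the heap" check, assuming a single contiguous heap; a real collector with a fragmented heap might need a page-table or card-table style lookup instead:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: with one contiguous GC heap, deciding whether a
 * candidate root actually points into the heap is a range check. Any
 * pointer that fails it can simply be skipped during tracing. */
typedef struct {
    uintptr_t start;
    uintptr_t end; /* one past the last byte of the heap */
} GcHeap;

static bool points_into_heap(const GcHeap *h, const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= h->start && a < h->end;
}
```

After this check passes, the collector would go on to find the containing object, which it needs to do anyway if interior pointers are allowed.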

Y'all are right... I was in the car and realized I needed to account
for all of that and didn't get a chance to issue a retraction.

Overall Proposal: Support marking SSA values as roots (an evolutionary approach)

My proposal consists of three rather significant changes to LLVM:

  • Allow frontends to mark SSA values - or even portions of SSA values - as stack roots.
  • For alloca roots, add a way to communicate to LLVM when a root goes out of scope.
  • Transfer the responsibility for copying stack roots to memory, from the frontend to the LLVM code generators.

Let me take the last point first.

Proposal Pt 3: Copying Stack Roots to Memory

The LLVM code generators and analysis passes have a much more thorough knowledge of SSA value lifetimes than frontends do, and therefore could avoid spilling and reloading values when it isn’t needed. So the LLVM libraries should take over the job of creating local allocas for holding SSA values during a safe point, as well as the job of copying those SSA values to those allocas.
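To illustrate what the code generator would be taking over from the frontend, here is a small self-contained sketch of the spill-and-reload pattern around a safe point, with a simulated moving collector. All names are illustrative; the "collector" just relocates one object so the reload visibly matters:

```c
#include <stddef.h>

typedef struct { int field; } Object;

/* Two statics stand in for the object before and after a GC move; they
 * hold different field values so a test can tell the copies apart. */
static Object old_copy = { 41 };
static Object new_copy = { 42 };

/* Simulated safe point: the collector may move objects and must fix up
 * every root slot it was given. */
static void safe_point(Object **root_slots[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (*root_slots[i] == &old_copy)
            *root_slots[i] = &new_copy; /* "move" the object */
}

static int use_object_across_call(void)
{
    Object *obj = &old_copy;      /* root live in a register (SSA value) */
    Object *slot = obj;           /* spill to a stack slot before the call */
    Object **roots[] = { &slot };
    safe_point(roots, 1);         /* GC may run here and move the object  */
    obj = slot;                   /* reload: obj now sees the new copy    */
    return obj->field;
}
```

Today the frontend must emit this spill/reload pair itself at every safe point; the proposal is for the code generators to do it, and eventually to elide it when liveness analysis shows it is unnecessary.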

Of course, generating “optimal” code (as outlined in the previous section) would require a lot of work on the backends. But we don’t need to do that right away. The first goal should be a “conservative, pessimistic” implementation that’s no better or worse than what we have today (other than being far easier to use). It might be possible to do this as some sort of stand-alone transformation/lowering pass.

This is what I mean by an evolutionary approach - come up with the right interface between the frontends and LLVM that enables us to gradually move towards that eventual goal.

The initial step of your evolutionary approach sounds basically equivalent to what is possible today. When you try to go beyond that you will run into a lot of issues, since the backend is untyped and doesn’t even distinguish between pointers and integers. The closest thing we have to tracking liveness of SSA values throughout the backend is debug info, which is frequently incorrect (although usually not due to the code generator itself).

Who is going to implement support for all of this in the code generator passes and maintain it? Will GC compatibility be a requirement for any further improvements in code generation?

That’s really the key question. Most of the developers who are paid to work on LLVM are doing so in support of Clang, which doesn’t need this feature. And language experimenters generally don’t have time to work on their own language and at the same time become familiar enough with the LLVM code generator passes to do useful work.

It’s kind of a chicken and egg problem: Managed language creators aren’t attracted to LLVM because it lacks certain features - they are far more likely to choose the JVM as a development platform (as seen in the case of Scala, Groovy, and many others). Conversely, the lack of customers for those features in LLVM makes it difficult to justify putting significant effort into building and maintaining them. This is what I meant when I said that the LLVM community lacks momentum in certain areas.

However, what I’m attempting to do is at least get some sort of consensus on what we’d like to have happen, even if we don’t yet have someone available to do the work. If some GSoC student showed up tomorrow and volunteered to do all this, I don’t think we’d be in agreement as to what we’d want them to do.

I would say that first and foremost, let’s come up with some scheme that allows types, rather than values, to be marked as roots, and which doesn’t impose a huge cost on users that don’t need GC. Pretty much everything else is contingent on that. I suggested several approaches, but I’m sure that there are people on this list who have much better ideas than mine.

So - here’s a really modest proposal:

Based on Renato’s suggestion, there is a way to associate arbitrary annotation data with any LLVM type.

So what I would propose as a first step is to simply replace (or perhaps supplement) the current llvm.gcroot() feature with a type annotation. That is, LLVM would only support roots that are allocas, just as it does today. The backend code would not change at all, except that in addition to detecting the presence of the llvm.gcroot marker, it would also detect the presence of a type annotation. That means that every alloca of a given type would be considered a root - there would be no need to insert the llvm.gcroot intrinsic into each function.

Later we could think about supporting SSA local variables, parameters, and register maps. But this first step is a prerequisite for all of those other things, and I’m thinking that it’s a relatively easy step compared to the others.

– Talin

I may have some fundamental misconceptions of how gcroot works for you today. If you can clarify it for me, then I can better understand your proposal. I always assumed that an essential side effect of the gcroot() intrinsic was to take the address of a local variable, implicitly preventing optimization. But your recent comments indicate that gcroot does nothing beyond annotating a value. Without gcroot, what prevents mem2reg from promoting the alloca? Furthermore, what prevents optimizations such as GVN from optimizing access to the local variable across call sites? I assume stack variables that don't "escape" are fair game for all the usual scalar optimizations--but I could be wrong.

-Andy

Unlike LLVM, the JVM (and I think the CLR) doesn't even have a concept of a non-GC pointer.

.NET has native pointers, which point outside the managed heap.

Cheers,
Jon.