Stack roots and function parameters

So I’ve managed to get my stack crawler working and passing its unit tests - this is the one I’ve been working on as an alternative to shadow-stack: it uses only static constant data structures (no global variables or thread-local data), which means that it’s fully compatible with a multi-threaded environment.

One question that has arisen, however, is what to do about function parameters that may be stack roots. You can’t call llvm.gcroot() on a function parameter. The only thing I can think of is to copy any function parameter that is a pointer, or which contains a pointer, into a local variable. This seems complicated and inefficient to me (for one thing, it means that every stack frame just got 8 bytes bigger due to the need to store the ‘this’ pointer locally) - is there a better way?
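
To make that workaround concrete, here's roughly the kind of IR I think I'd have to emit for each such parameter (a sketch only - the "mygc" strategy name and the opaque %String type are just placeholders):

```llvm
%String = type opaque

declare void @llvm.gcroot(i8** %ptrloc, i8* %metadata)

define void @f2(%String* %s) gc "mygc" {
entry:
  ; Spill the parameter into a stack slot, because only an alloca
  ; (or a bitcast of one) can be handed to llvm.gcroot.
  %s.addr = alloca %String*
  %tmp = bitcast %String** %s.addr to i8**
  call void @llvm.gcroot(i8** %tmp, i8* null)
  store %String* %s, %String** %s.addr
  ; ...the body then works out of %s.addr instead of %s...
  ret void
}
```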

Anyone?

Here’s some examples of what I am talking about (I’ll use Java for the examples):

Example 1:

```java
class Foo {
  static String st = "Hello";

  void f1() {
    f2(st);
  }

  void f2(String s) {
    st = null; // Destroy static root
    // Now allocate something, which might trigger
    // a garbage collection.
    StringBuffer sb = new StringBuffer();

    // Is 's' still live? It's not stored in any
    // alloca, or in any global variable. It only
    // exists as a function parameter, which cannot
    // be declared as a root via llvm.gcroot.
    sb.append(s);
  }
}
```

Example 2:

```java
class Foo {
  static void f1() {
    // Create a new 'Foo' but don't store it
    // anywhere.
    new Foo().f2();
  }

  void f2() {
    // Possibly trigger a collection
    StringBuffer sb = new StringBuffer();

    // Is 'this' still live at this point?
    sb.append(toString());
  }
}
```

Now, I know that the LLVM “Accurate GC” document says that any “intermediate values” must be declared as a GC root. My question is, does merely loading a global variable - or any mutable, non-strictly-local variable for that matter - count as an “intermediate” value in this case?

My problem is that I can’t see how to address this without generating horribly bloated and inefficient code. Pretty much every function argument that isn’t a simple integer or float will have to be copied to a hidden local variable before every function call - in addition to copying it to the stack to create the call frame. Even local variables are not immune if you allow closures in your language. Under this scenario, calling conventions that pass arguments in registers are utterly futile and save nothing, because everything has to get copied to the stack anyway.

I really wish I could simply declare function arguments as llvm.gcroots. For arguments that are not in registers, it would treat it just like an alloca, since they both represent stack slots. For arguments that are passed in registers, LLVM should automatically lower it to a non-register argument, making a copy if needed. (I can’t do this myself, since I’m not supposed to know which arguments are passed in registers or not - that depends on the target and the code generators and such).

That still leaves me with the problem of declaring as roots function arguments that are the result of complex calculations, but those are far fewer - declaring temporary vars for those won’t bloat the code so badly. But having to copy every single load of a non-local variable into a temporary is a nightmare.

So I've managed to get my stack crawler working and passing its unit tests
- this is the one I've been working on as an alternative to shadow-stack: it
uses only static constant data structures (no global variables or
thread-local data), which means that it's fully compatible with a
multi-threaded environment.
One question that has arisen, however, is what to do about function
parameters that may be stack roots. You can't call llvm.gcroot() on a
function parameter. The only thing I can think of is to copy any function
parameter that is a pointer, or which contains a pointer, into a local
variable. This seems complicated and inefficient to me (for one thing, it
means that every stack frame just got 8 bytes bigger due to the need to
store the 'this' pointer locally) - is there a better way?

If a function parameter is a stack root, then so is the value passed
through that parameter. Since they both point to the same thing, only
one stack root is needed to keep it reachable. If you've got a root
for it in the caller, you don't need one in the callee for the
argument.

Anyone?
Here's some examples of what I am talking about (I'll use Java for the
examples):
Example 1:
class Foo {
static String st = "Hello";
void f1() {
f2(st);
}
void f2(String s) {
st = null; // Destroy static root
// Now allocate something, which might trigger
// a garbage collection.
StringBuffer sb = new StringBuffer();
// Is 's' still live? It's not stored in any
// alloca, or in any global variable. It only
// exists as a function parameter, which cannot
// be declared as a root via llvm.gcroot.

s is definitely still live assuming f1 copied st into a root before
using it, which it generally should.

sb.append(s);
}
}
Example 2:
class Foo {
static void f1() {
// Create a new 'Foo' but don't store it
// anywhere.
new Foo().f2();
}
void f2() {
// Possibly trigger a collection
StringBuffer sb = new StringBuffer();
// Is 'this' still live at this point?

Again, yes, if f1 copied the result of "new Foo()" into a root.

sb.append(toString());
}
}
Now, I know that the LLVM "Accurate GC" document says that any "intermediate
values" must be declared as a GC root. My question is, does merely _loading_
a global variable - or any mutable, non-strictly-local variable for that
matter - count as an "intermediate" value in this case?

Yes, I believe it would. In a multithreaded system, if you load a
pointer from a global or the heap, and some other thread changes the
pointer you loaded, you'd better have a stack root for the original
pointer value you loaded while you're using it.

Now if the global is thread-local, you don't need another root for it
since no other threads can see it. And if you can prove that a global
isn't going to change, you also don't need another root for it.
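
Roughly sketched in IR (with placeholder names - "mygc", %String, and the functions just mirror your Java example), f1 would root the loaded global before the call, and then f2 doesn't need a root of its own for the argument:

```llvm
%String = type opaque

@st = global %String* null

declare void @llvm.gcroot(i8** %ptrloc, i8* %metadata)
declare void @f2(%String*)

define void @f1() gc "mygc" {
entry:
  ; The loaded global value is an "intermediate value": root it here,
  ; in the caller, so it stays reachable for the whole call to f2.
  %st.addr = alloca %String*
  %tmp = bitcast %String** %st.addr to i8**
  call void @llvm.gcroot(i8** %tmp, i8* null)
  %v = load %String** @st
  store %String* %v, %String** %st.addr
  call void @f2(%String* %v)
  ret void
}
```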

My problem is that I can't see how to address this without generating
horribly bloated and inefficient code. Pretty much every function argument
that isn't a simple integer or float will have to be copied to a hidden
local variable before every function call - in addition to copying it to the
stack to create the call frame. Even local variables are not immune if you
allow closures in your language. Under this scenario, calling conventions
that pass arguments in registers are utterly futile and save nothing,
because everything has to get copied to the stack anyway.

One way or another, stack roots *must* get copied to the stack, so
that the GC can find them. Remember the GC is usually running in some
other thread and can *only* find pointers that live somewhere in
memory (either the stack or the heap).

I really wish I could simply declare function arguments as llvm.gcroots. For
arguments that are not in registers, it would treat it just like an alloca,
since they both represent stack slots. For arguments that are passed in
registers, LLVM should automatically lower it to a non-register argument,
making a copy if needed. (I can't do this myself, since I'm not supposed to
*know* which arguments are passed in registers or not - that depends on the
target and the code generators and such).

I *believe* the code generator has some optimizations that eliminate
redundant copies within a block. At any rate, if you keep your stack
roots in the caller rather than the callee, then there'll be a number
of cases where the caller needs to keep that stack root for purposes
other than that one call. In those cases, having the stack root
doesn't impose a cost you aren't already paying just to allow the GC
to work.

That still leaves me with the problem of declaring as roots function
arguments that are the result of complex calculations, but those are far
fewer - declaring temporary vars for those won't bloat the code so badly.
But having to copy every single load of a non-local variable into a
temporary is a nightmare.

You only have to copy pointers. If you've got a struct, you only need
to keep a root for the pointer member(s) of that struct.

Here’s another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn’t make an accurate collector that was more efficient than a conservative collector, then no one would bother because it’s such a pain.

However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There’s no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack somewhere.

The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let’s take the ‘this’ pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the ‘this’ pointer on the stack: Because each invocation of the method has to push the ‘this’ pointer on the stack as part of the calling protocol (even if it’s passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the ‘this’ pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the ‘this’ pointer.

As far as the issue of eliminating redundant stores, I’m pretty sure that the optimizers cannot eliminate redundant roots. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don’t want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.)

My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible.

snip

My problem is that I can’t see how to address this without generating
horribly bloated and inefficient code. Pretty much every function argument
that isn’t a simple integer or float will have to be copied to a hidden
local variable before every function call - in addition to copying it to the
stack to create the call frame. Even local variables are not immune if you
allow closures in your language. Under this scenario, calling conventions
that pass arguments in registers are utterly futile and save nothing,
because everything has to get copied to the stack anyway.

One way or another, stack roots must get copied to the stack, so
that the GC can find them. Remember the GC is usually running in some
other thread and can only find pointers that live somewhere in
memory (either the stack or the heap).

I think the logic is that by putting the responsibility for having stack roots on the caller, the callees do not need to concern themselves with their arguments. A function using a gc'ed pointer assumes that a caller somewhere up the call chain has created a gcroot for that pointer that will outlive its invocation, and therefore it doesn't need to do anything with the garbage collector.

In the case of the 'this' pointer, only the function that got the 'this' pointer from somewhere else (a global, the heap) would be responsible for creating a gcroot; once that is done, the 'this' pointer can be passed through any number of calls, and no more roots are required as long as the pointer is not stored somewhere that will outlive the invocation.
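
Sketched roughly in IR (all the names, and the "mygc" strategy, are placeholders), only f1 from the second example needs a root for the new object; f2, and anything it calls, just takes the pointer as an ordinary argument:

```llvm
%Foo = type opaque

declare void @llvm.gcroot(i8** %ptrloc, i8* %metadata)
declare %Foo* @Foo_new()
declare void @Foo_f2(%Foo*)

define void @Foo_f1() gc "mygc" {
entry:
  %obj.addr = alloca %Foo*
  %tmp = bitcast %Foo** %obj.addr to i8**
  call void @llvm.gcroot(i8** %tmp, i8* null)
  store %Foo* null, %Foo** %obj.addr   ; initialize the slot before any safe point
  %obj = call %Foo* @Foo_new()         ; the allocation itself may collect
  store %Foo* %obj, %Foo** %obj.addr   ; from here on, f1 owns the root
  call void @Foo_f2(%Foo* %obj)        ; f2 needs no root of its own for 'this'
  ret void
}
```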

snip

– Talin



– Nathan

Talin wrote:

[snip]

Here's another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn't make an accurate collector that was more efficient than a conservative collector, then no one would bother because it's such a pain.

One very important reason to use precise GC is that it enables a copying collector.
For generational GC, a copying collector is faster and makes the allocator much simpler and faster.

However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There's no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack *somewhere*.

The typical stack is *tiny* compared with the typical heap.
It's pretty small compared with the nursery, so even for minor collections it won't matter.

The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let's take the 'this' pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the 'this' pointer on the stack: Because each invocation of the method has to push the 'this' pointer on the stack as part of the calling protocol (even if it's passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the 'this' pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the 'this' pointer.

If you're really worried about the extra roots, you could do liveness analysis. If it's not live across a GC-safe point, then it's not a root.
In fact, liveness analysis is important anyway. Retaining a dead object
will hurt your performance much more than tracing a live one twice.
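
A rough illustration (placeholder names again): a pointer whose last use comes before any safe point needs no gcroot slot at all, while one that is live across a possible collection point does:

```llvm
%Object = type opaque

@g = global %Object* null

declare void @llvm.gcroot(i8** %ptrloc, i8* %metadata)
declare void @may_collect()   ; stands in for any call that can trigger GC

; No root: %v is dead before the safe point.
define i1 @no_root_needed() {
entry:
  %v = load %Object** @g
  %isnull = icmp eq %Object* %v, null   ; last use of %v
  call void @may_collect()
  ret i1 %isnull
}

; Root: %v is live across the safe point, so it must sit in a gcroot slot.
define i1 @root_needed() gc "mygc" {
entry:
  %v.addr = alloca %Object*
  %tmp = bitcast %Object** %v.addr to i8**
  call void @llvm.gcroot(i8** %tmp, i8* null)
  %v = load %Object** @g
  store %Object* %v, %Object** %v.addr
  call void @may_collect()
  %v2 = load %Object** %v.addr          ; reload: the collector may have updated it
  %isnull = icmp eq %Object* %v2, null
  ret i1 %isnull
}
```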

As far as the issue of eliminating redundant stores, I'm pretty sure that the optimizers cannot eliminate redundant /roots/. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don't want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.)

My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible.

You might want to have a look at my GC(s); it's probably too
entangled with the rest of the toolkit to be useful directly,
but it might be of interest.

http://code.google.com/p/gvmt/

--
-- Talin

Cheers,
Mark

The best way I can think of to streamline garbage collection is to cut
down on the amount of garbage it has to collect. I'm working on an
optimization pass that transforms GC allocations into region
allocations whenever it can... I'll share it when it's done.

One of my goals is to handle enough cases that I can put off
implementing GC for a while :-)

Foo

Here’s another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn’t make an accurate collector that was more efficient than a conservative collector, then no one would bother because it’s such a pain.

However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There’s no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack somewhere.

The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let’s take the ‘this’ pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the ‘this’ pointer on the stack: Because each invocation of the method has to push the ‘this’ pointer on the stack as part of the calling protocol (even if it’s passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the ‘this’ pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the ‘this’ pointer.

As far as the issue of eliminating redundant stores, I’m pretty sure that the optimizers cannot eliminate redundant roots. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don’t want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.)

My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible.

Forgive my top post, but I hate Windows. :-)

I am surprised you (Talin) say that “we know conservative collectors work” because my experience has very much been of them not working. Indeed, if you have 400MB of allocated heap blocks on a 32-bit machine, is there not a 10% chance of each random 32-bit int “pointing” into your heap, i.e. a false positive? I just did a simple test here (create an array with 1M random 32-bit ints and then allocate blocks n=1…2^16 of size n bytes) and Boehm’s conservative GC sometimes takes 13s and other times runs out of memory after several minutes. It even complains about “Repeated allocation of very large block (appr. size 65536)”. Is 64KB really “very large” today? I think not…

Performance is certainly an important reason to implement an accurate collector though. However, the performance of a conservative collector is surely dominated by the cost of searching to see if each int “points” within any allocated block, a completely unnecessary operation that accurate collectors simply do not perform. That overhead will dwarf any doubling up of roots on the stack, but the main problem with conservative collectors, as I say, is not that they are slow but that they leak.

I agree that eliminating redundant roots is a desirable goal but I did only the most basic things for HLVM and was very happy with the results. So I think it is premature to value such optimizations. I personally would much rather see a mature VM built on top of LLVM.

Cheers,

Jon.

Talin wrote:

Here’s another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn’t make an accurate collector that was more efficient than a conservative collector, then no one would bother because it’s such a pain.

However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There’s no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack somewhere.

The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let’s take the ‘this’ pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the ‘this’ pointer on the stack: Because each invocation of the method has to push the ‘this’ pointer on the stack as part of the calling protocol (even if it’s passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the ‘this’ pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the ‘this’ pointer.

As far as the issue of eliminating redundant stores, I’m pretty sure that the optimizers cannot eliminate redundant roots. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don’t want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.)

My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible.

Forgive my top post, but I hate Windows. :-)

I am surprised you (Talin) say that “we know conservative collectors work” because my experience has very much been of them not working. Indeed, if you have 400MB of allocated heap blocks on a 32-bit machine, is there not a 10% chance of each random 32-bit int “pointing” into your heap, i.e. a false positive? I just did a simple test here (create an array with 1M random 32-bit ints and then allocate blocks n=1…2^16 of size n bytes) and Boehm’s conservative GC sometimes takes 13s and other times runs out of memory after several minutes. It even complains about “Repeated allocation of very large block (appr. size 65536)”. Is 64KB really “very large” today? I think not…

I guess in this case, my definition of “works” is that the Boehm collector appears to be the default choice for 90% of open source projects that require garbage collection of some kind. However, I agree that the performance is problematic.

Performance is certainly an important reason to implement an accurate collector though. However, the performance of a conservative collector is surely dominated by the cost of searching to see if each int “points” within any allocated block, a completely unnecessary operation that accurate collectors simply do not perform. That overhead will dwarf any doubling up of roots on the stack, but the main problem with conservative collectors, as I say, is not that they are slow but that they leak.

I agree that eliminating redundant roots is a desirable goal but I did only the most basic things for HLVM and was very happy with the results. So I think it is premature to value such optimizations. I personally would much rather see a mature VM built on top of LLVM.

Well, I’ve managed to reduce the redundant roots to a level that I am satisfied with, at least for the moment.

The thing about stack roots is that stack space isn’t always growable like the heap, and on some platforms isn’t very large. This is especially an issue in environments with thousands of threads.