Marking *some* pointers for gc

Hi,

I just found out that it's not practical to mark only some pointers
for GC. Consider:

%a = i8 addrspace(1)* malloc(...)
%b = i8* alloca(...)

The issue then becomes that ordinary functions declared like:

declare i1 @foo(i8 addrspace(1)*)

must choose between accepting gc'able and non-gc'able pointers. Is
there no way to have a reasonable mix of both?
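
To make the mismatch concrete, here is a minimal sketch in the typed-pointer IR syntax used in this thread (newer LLVM releases spell these types ptr addrspace(1) and ptr). The function @caller and the value names are made up for illustration; only foo comes from the message above.

```llvm
declare i1 @foo(i8 addrspace(1)*)

define void @caller(i8 addrspace(1)* %a) {
entry:
  ; %b is a stack slot, so its type is i8* (address space 0).
  %b = alloca i8

  ; Fine: %a already has the GC address space that @foo expects.
  %ok = call i1 @foo(i8 addrspace(1)* %a)

  ; Does not type-check with typed pointers, because @foo takes
  ; i8 addrspace(1)* and %b is i8*:
  ;   %bad = call i1 @foo(i8* %b)
  ret void
}
```

So a single declaration of foo can serve only one of the two kinds of pointer unless a cast is inserted at the call site.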

Ram

Part of the reason for putting GC'able pointers in addrspace 1 is to
have a *strong* distinction (or as strong as LLVM IR will let us have)
between GC-able and non-GC pointers -- it is not a coincidence that
you cannot "forget" (ruling out addrspacecast) that a pointer is
GC-able and pass it to a function that does not expect a GC-able
pointer. For instance, you cannot dereference a GC-able pointer after
passing a safepoint that did not relocate the GC-able pointer; and
your GC probably cannot relocate non-GC'able pointers. In your
example, foo will have to treat its argument differently depending on
whether it is a GC pointer or not.

Does this answer your question?

Sanjoy Das wrote:

In your
example, foo will have to treat its argument differently depending on
whether it is a GC pointer or not.

In practice, this is not true of many functions that don't call other
functions. Take the example of a simple "print" function that takes a
void * to cast and print, plus a type_int that determines what to cast
it to: why should it care about whether the pointer is GC'able or not?
At the call site, I have this information, and I accordingly emit
statepoint/relocate information. But "print" doesn't call other
functions, and doesn't need to emit statepoint/relocate.

Let's say I made the void * argument addrspace(0). Then, in callsites
where I have an addrspace(1) to pass, I have to emit:

  addrspacecast 1 -> 0
  call print
  addrspacecast 0 -> 1

Is this the ideal workflow, or should we have some sort of addrspaceany?
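
Spelled out in typed-pointer IR, the cast/call/cast sequence above might look like the following sketch. The signature of @print (an i8* plus an i32 type tag), the function @caller and the value names are assumptions for illustration, not taken from any real code.

```llvm
; Assumed signature: a raw pointer plus an integer type tag.
declare void @print(i8*, i32)

define void @caller(i8 addrspace(1)* %obj, i32 %type_int) {
entry:
  ; addrspacecast 1 -> 0: drop the GC address space for the call.
  %raw = addrspacecast i8 addrspace(1)* %obj to i8*
  call void @print(i8* %raw, i32 %type_int)
  ; addrspacecast 0 -> 1: recover a GC-typed pointer afterwards.
  %obj.again = addrspacecast i8* %raw to i8 addrspace(1)*
  ret void
}
```

Note that %obj.again is only meaningful if the collector did not move the object while the raw pointer was live.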

The requirements ought to be captured by the nocapture attribute (though that still places some limitations on the GC - it isn't allowed to relocate an object while a pointer to it is passed to GC-oblivious code, which may not be an invariant that's easy to enforce in some designs).

I'm wary of an addrspaceany attribute though - we have different address spaces with different sizes and different register assignments for calling conventions, so this is a bit broad. I'm not totally convinced by the use of address spaces to indicate GC vs non-GC pointers in this way, because we don't have a good way of describing interactions between address spaces in IR currently. DataLayout can tell us the size and alignment for pointers to each AS, but can't currently tell us:

- Whether one address space is contained within another.

- Whether casts from one address space are lossy (if you do addrspacecast from n->m then back, are you guaranteed the same pointer?).

- Whether address space casts between a pair of address spaces are valid always, never, or sometimes.

Your addrspaceany is really a union of the two pointer types (which your high-level language's type system may or may not like), with the assumption that they have the same representation.

David

Sanjoy Das wrote:

In your
example, foo will have to treat its argument differently depending on
whether it is a GC pointer or not.

In practice, this is not true of many functions that don't call other
functions. Take the example of a simple "print" function that takes a
void * to cast and print, plus a type_int that determines what to cast
it to: why should it care about whether the pointer is GC'able or not?
At the call site, I have this information, and I accordingly emit
statepoint/relocate information. But "print" doesn't call other
functions, and doesn't need to emit statepoint/relocate.

You are right that there are some functions which cannot trigger garbage collection and thus are not sensitive to the 'type' of the pointer they've been given. I've been calling such functions "gc leaf functions" for lack of a better name. However, there's a good chance that your "simple print function" is not, in fact, such a function. If your print routine contains any non gc-leaf call, or a loop whose bounds are not known at compile time, it may in fact need to do relocation. Depending on your collector, the routine may also need a load or store barrier for one use or the other. It's highly unlikely that the code generated for a GC address space pointer and for a non-GC address space pointer is actually the same.

There's lots of room to experiment with a gc-leaf function attribute, and - in particular - the inference of such.

Having said all that, I'm really curious why this matters to you. In practice, we haven't found there to be many functions at all which are needed on both gc and non-gc pointers (where the function is *also* a gc-leaf.) Unless you're seeing a bunch of cases like this, I'd just duplicate the shared routines.

Let's say I made the void * argument addrspace(0). Then, in callsites
where I have an addrspace(1) to pass, I have to emit:

   addrspacecast 1 -> 0
   call print
   addrspacecast 0 -> 1

Is this the ideal workflow, or should we have some sort of addrspaceany?

I strongly advise against introducing such casts. Doing so makes it much harder to reason about correctness. I would be open to a proposal of a "generic address space" mechanism, but that's a large project. I don't really see the motivation for it currently. You'd need to send a proposal to llvmdev and get feedback on the idea.

Philip

The requirements ought to be captured by the nocapture attribute (though that still places some limitations on the GC - it isn't allowed to relocate an object while a pointer to it is passed to GC-oblivious code, which may not be an invariant that's easy to enforce in some designs).

FYI, nocapture is *not* enough. A store to a GC pointer may require a store barrier; a store to a non-gc pointer may not. Just because a pointer isn't *captured* doesn't mean that the 'GCness' doesn't affect the code generated.
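
A rough sketch of the store-barrier point, in the thread's typed-pointer syntax. The hook @my_gc_store_barrier is hypothetical and stands in for whatever a particular collector requires; the two function names are likewise made up.

```llvm
; Hypothetical runtime hook; not part of LLVM.
declare void @my_gc_store_barrier(i8 addrspace(1)* addrspace(1)*)

; Storing a reference into a GC-managed field: the collector may
; need to be told that the field changed.
define void @store_gc(i8 addrspace(1)* addrspace(1)* %field,
                      i8 addrspace(1)* %val) {
entry:
  store i8 addrspace(1)* %val, i8 addrspace(1)* addrspace(1)* %field
  call void @my_gc_store_barrier(i8 addrspace(1)* addrspace(1)* %field)
  ret void
}

; The same store through non-GC pointers needs no barrier.
define void @store_raw(i8** %field, i8* %val) {
entry:
  store i8* %val, i8** %field
  ret void
}
```

The two bodies do the same logical work but differ in generated code, so one function body cannot serve both pointer kinds under such a collector.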

I'm wary of an addrspaceany attribute though - we have different address spaces with different sizes and different register assignments for calling conventions, so this is a bit broad. I'm not totally convinced by the use of address spaces to indicate GC vs non-GC pointers in this way, because we don't have a good way of describing interactions between address spaces in IR currently. DataLayout can tell us the size and alignment for pointers to each AS, but can't currently tell us:

- Whether one address space is contained within another.

- Whether casts from one address space are lossy (if you do addrspacecast from n->m then back, are you guaranteed the same pointer?).

- Whether address space casts between a pair of address spaces are valid always, never, or sometimes.

Your addrspaceany is really a union of the two pointer types (which your high-level language's type system may or may not like), with the assumption that they have the same representation.

David raises a fair point: addrspaceany is clearly unworkable. Something like an addrspacevariant(X, Y) might be workable, but I'm still not really seeing much need for this.

I'm not particularly in support of the addrspaceany mechanism. I believe it might make sense to discuss, but I haven't seen a compelling need for it to date and it adds a major source of complexity and bugs.

Philip

Philip Reames wrote:

There's lots of room to experiment with a gc-leaf function attribute, and -
in particular - the inference of such.

I'm not sure how this would help: do you have optimizations that we
can apply specifically to gc-leaf functions?

Having said all that, I'm really curious why this matters to you.

While refactoring my code to use statepoints, I noticed that I was
alloca'ing things like global string pointers, while malloc'ing things
like arrays; but since my language is untyped, I was boxing them up in
a malloc'ed structure. The problem doesn't affect me in practice
because I'm always passing around boxed objects, but I was wondering
if others who are doing typed languages would need an addrspaceany.

You don't see the need for it, so I suppose the discussion is closed
until someone does.

You don't need a safepoint (i.e. statepoint) on a call to a gc-leaf function. You also don't need all of the explicit relocation in the IR. At least in principle, this can lead to substantially better optimization.
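
As a hedged illustration: to my knowledge, later LLVM releases recognize a "gc-leaf-function" string attribute, and the RewriteStatepointsForGC pass leaves calls to such functions as plain calls. Treat that attribute, the "statepoint-example" strategy name, and the functions @print, @may_collect and @caller below as assumptions to check against the Statepoints documentation for your LLVM version.

```llvm
; Assumed to never reach a safepoint, hence marked as a gc leaf.
declare void @print(i8 addrspace(1)*, i32) "gc-leaf-function"

; An ordinary callee that might trigger a collection.
declare void @may_collect()

define void @caller(i8 addrspace(1)* %obj, i32 %type_int)
    gc "statepoint-example" {
entry:
  ; Expected to stay a plain call: no statepoint, no relocation of %obj.
  call void @print(i8 addrspace(1)* %obj, i32 %type_int)
  ; Expected to be rewritten into a statepoint by the pass.
  call void @may_collect()
  ret void
}
```

With fewer statepoints and relocations in the function, the optimizer sees ordinary calls and ordinary SSA values, which is where the better optimization mentioned above would come from.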

Philip