Should I add intrinsics to write my own automatic reference counting passes?

My experience with LLVM is limited, but I am trying to figure out how to add optimizations for automatic reference counting. The GC documentation mentions that patchpoints could be useful, but it does not say how. If this is a FAQ, please let me know…

So this is my idea at this point:

The context is a C++-like language with an aggregate type that is always reference counted. The type system differentiates between pointers to objects that are shared between threads and those that are not. I also want a pass that turns a shared_ptr into a nonshared_ptr whenever that can be proven safe.

So what I want to do is wrap all the relevant “events” as intrinsics, run some simplification passes, and then use LLVM’s pointer capture/escape analysis to turn shared_ptrs into nonshared_ptrs and to elide nonatomic/atomic acquire/release operations. The intrinsics thus also serve as the type annotations.

The compilation will then follow this pattern:

  1. generate LLVM IR
  2. simplification passes
  3. pass for turning shared_ptr into nonshared_ptr
  4. pass for eliding acquire/release
  5. pass that substitutes the custom intrinsics with function calls
  6. full optimization passes

I think about having the following intrinsics:

ptr = cast_untyped_to_nonshared(ptr) // e.g. used after allocation
ptr = cast_to_shared_irreversible(ptr) // basically a gateway to other threads
nonshared_acquire(ptr)
nonshared_release(ptr)
shared_acquire(ptr)
shared_release(ptr)
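To make the payoff of the shared/nonshared split concrete, here is a minimal sketch of what the runtime side of these intrinsics could look like, assuming a hypothetical object header that holds the refcount (the `RcHeader` name and layout are my invention, not part of the language above). The point is that `shared_*` must use atomic read-modify-write operations, while `nonshared_*` can use plain loads and stores, which is exactly what the proposed pass would buy:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical object header: every refcounted aggregate starts with this.
struct RcHeader {
    std::atomic<std::uint32_t> refcount{1};
};

// shared_*: the object may be visible to other threads, so the
// refcount traffic must be atomic.
inline void shared_acquire(RcHeader* h) {
    h->refcount.fetch_add(1, std::memory_order_relaxed);
}
inline bool shared_release(RcHeader* h) {
    // acq_rel so the destroying thread observes all writes made
    // before the last release on other threads.
    return h->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}

// nonshared_*: provably confined to one thread, so plain (non-RMW)
// operations suffice; no lock prefix, no memory fence.
inline void nonshared_acquire(RcHeader* h) {
    std::uint32_t c = h->refcount.load(std::memory_order_relaxed);
    h->refcount.store(c + 1, std::memory_order_relaxed);
}
inline bool nonshared_release(RcHeader* h) {
    std::uint32_t c = h->refcount.load(std::memory_order_relaxed);
    h->refcount.store(c - 1, std::memory_order_relaxed);
    return c == 1;  // true => last reference gone, caller destroys
}
```

The `cast_*` intrinsics would then just reinterpret which family of operations later passes are allowed to emit for a given pointer.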

I also want weak_ptr at a later stage, but I leave it out for now to keep the complexity manageable.

Is this idea completely unreasonable?

Ola.

Hi,

The main problem for this sort of optimization is that it is difficult to do on an IR like LLVM’s, where the semantic relationships between values that exist in the program have been lowered into a sequence of primitive operations with no remaining structural relationship. What you really want is an IR that preserves those relationships of value definition and use, allowing you to say e.g. that one value is a copy of another, that ownership of a value is passed into something, that a value is used for a certain duration but then is no longer used, and so on. With this sort of representation, the optimization turns into fairly straightforward value-forwarding and lifetime manipulations. Dealing with unrelated operations and retroactively attempting to infer relationships, as LLVM IR must, turns it into a fundamentally difficult analysis that often relies on semantic knowledge that isn’t expressed in the IR, so that you’re actually reasoning about what happens under a “well-behaved” frontend.

John.

The main problem for this sort of optimization is that it is difficult
to do on an IR like LLVM’s, where the semantic relationships between
values that exist in the program have been lowered into a sequence of
primitive operations with no remaining structural relationship.

I understand. One possible advantage, though, would be to also optimize C++ shared_ptr when calling into C++ code (assuming that everything is compiled with a modified version of clang++), or to provide a C library with macros that emit the ARC intrinsics so that the C FFI can benefit from the same optimizations…

Dealing with unrelated operations and retroactively attempting to infer
relationships, as LLVM IR must, turns it into a fundamentally difficult
analysis that often relies on semantic knowledge that isn’t expressed
in the IR, so that you’re actually reasoning about what happens under
a “well-behaved” frontend.

Yes, so basically, if one chooses to do this within the LLVM IR, then it might be difficult to share the ARC optimizations with other front ends. I don’t intend to give direct access to the acquire/release operations or the refcount value, so I think I can assume that the front end provides “well-behaved” IR. I only plan on providing tests for refcount==1 (I guess weak_ptr requires tests for refcount==0 internally as well).
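For readers unfamiliar with why a language would expose a refcount==1 test at all: the classic use is copy-on-write. A minimal sketch in C++ terms (the `CowString` wrapper is hypothetical, built on `std::shared_ptr` purely for illustration; `use_count() == 1` stands in for the refcount==1 intrinsic):

```cpp
#include <memory>
#include <string>

// Hypothetical copy-on-write wrapper. The only refcount query it needs
// is "am I the unique owner?", i.e. refcount == 1.
struct CowString {
    std::shared_ptr<std::string> rep;

    explicit CowString(std::string s)
        : rep(std::make_shared<std::string>(std::move(s))) {}

    void append(const std::string& tail) {
        if (rep.use_count() != 1)                        // refcount == 1 test
            rep = std::make_shared<std::string>(*rep);   // copy before write
        *rep += tail;                                    // safe in-place mutation
    }
};
```

Copies share the representation until one of them mutates; only then is the buffer duplicated. Note that this is exactly the kind of refcount-observing “use” that makes retain/release elision delicate, as discussed below.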

What I worry about with a custom IR is that generic programming can emit a lot of conditionals that should be statically resolved to true/false before the ARC passes run. So if I write my own IR, I would have to reimplement much of what LLVM already has.

But I guess what you are saying is that it will be a significant job either way…

(Also thanks to Florian for the Objective-C pointers.)

Ola.

The main problem for this sort of optimization is that it is difficult
to do on an IR like LLVM’s, where the semantic relationships between
values that exist in the program have been lowered into a sequence of
primitive operations with no remaining structural relationship.

I understand. I think one possible advantage though would be to optimize
C++ shared_ptr as well when calling into C++ code (assuming that everything
is done with a modified version of clang++), or to provide a C library with
macros that will emit the ARC-intrinsics so that C-FFI can benefit from the
same optimizations…

Well, I understand that this is the goal, and that as as goal it’d be
quite valuable. I’m just saying that it’s going to be hard to do correctly
on LLVM IR because the high-level structure will have been lost.

Dealing with unrelated operations and retroactively attempting to infer
relationships, as LLVM IR must, turns it into a fundamentally difficult
analysis that often relies on semantic knowledge that isn’t expressed
in the IR, so that you’re actually reasoning about what happens under
a “well-behaved” frontend.

Yes, so basically, if one chooses to do this within the LLVM IR then it
might be difficult to share the ARC optimizations with other front ends.

That’s true but not really what I mean. In practice, it is hard to specify
the rules that will make your analysis and transformations valid on the
LLVM IR level, and you will find yourself reasoning a lot about what a
reasonable frontend would emit for particular patterns of code. When you
find yourself doing that kind of reasoning, you are almost certainly in a
bad place where you’re not going to meet your goals, and you need to be
doing the optimization at a level that still preserves the semantic
structure you’re trying to reason about. Unfortunately this is a problem
in Clang, because there’s no IR designed for optimization prior to LLVM
IRGen.

I don’t intend to give direct access to the acquire/release or the ref-count
value, so I think I can assume that the front end provides “well behaved”
IR.

The main problems are actually around two points:

  1. Which code that could theoretically access or change the refcount actually does so, and how?
  2. Which code relies on the refcount?

For example, if you have a shared reference to an object, and you don’t
use it, and then you drop the shared reference, it’d be nice to eliminate
the unnecessary refcount traffic. But to do that, you have to actually
prove that nothing in between is relying on the refcount being higher.
In particular, there could be multiple LLVM IR values that represent that
shared reference, something like this:

  call @retain(%shared_ref %0)
  …
  call @use(%shared_ref %1)     // %1 is actually the same as %0
  …
  call @release(%shared_ref %2) // %2 is also actually the same as %0

It is highly unlikely that there is a valid pattern of code where LLVM
could figure out that %0 is the same as %2 but not the same as %1.
(For example, imagine that %1 and %2 are loads from a memory location
that %0 was stored into. %2 is presumably loaded after %1, so if LLVM
can forward the store of %0 to %2, it should be able to forward it to %1
as well.) That is, it is almost certain that your optimization will
either see the above (not optimizable because the retain/release
may be unrelated) or something like the below (still probably not
optimizable because of the intermediate use):

  call @retain(%shared_ref %0)
  …
  call @use(%shared_ref %0)
  …
  call @release(%shared_ref %0)

But it’s possible that LLVM could actually produce this pattern:

  call @retain(%shared_ref %0)
  …
  call @use(%shared_ref %1)     // %1 is actually the same as %0
  …
  call @release(%shared_ref %0)

And then your optimization might fire and eliminate this “unnecessary”
pair of retain/release operations.

The problem here is that IR doesn’t guarantee the structure (the semantic
ties between copies, uses, and destroys) that you need to use to do your
optimization.
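John’s hazard can be made concrete in plain C++ terms (a hypothetical sketch using `std::shared_ptr`, not the language under discussion): if the “use” between the retain and the release observes the refcount, for example via a copy-on-write uniqueness check, then removing the surrounding retain/release pair as “unnecessary” changes what that use does.

```cpp
#include <memory>
#include <string>

// A "use" that relies on the refcount: mutate in place only when unique.
void use(std::shared_ptr<std::string>& p) {
    if (p.use_count() == 1)
        *p += "!";                                       // in place: no other owner
    else
        p = std::make_shared<std::string>(*p + "!");     // copy first
}

std::string demo() {
    auto p = std::make_shared<std::string>("hi");
    auto extra = p;   // the "retain": refcount goes 1 -> 2
    use(p);           // sees use_count() == 2, so it copies before mutating
    return *extra;    // still "hi"; the shared buffer was never touched
    // the "release" happens here as extra is destroyed
}
```

If an optimizer deleted the seemingly redundant `extra` retain/release pair, `use()` would observe a refcount of 1 and mutate the buffer in place, and the program’s behavior would change. This is the sense in which the optimization needs guaranteed semantic ties between the retain, the use, and the release, not just syntactic adjacency.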

John.

That’s true but not really what I mean. In practice, it is hard to specify
the rules that will make your analysis and transformations valid on the
LLVM IR level, and you will find yourself reasoning a lot about what a
reasonable frontend would emit for particular patterns of code. When you
find yourself doing that kind of reasoning, you are almost certainly in a
bad place where you’re not going to meet your goals, and you need to be
doing the optimization at a level that still preserves the semantic
structure you’re trying to reason about. Unfortunately this is a problem
in Clang, because there’s no IR designed for optimization prior to LLVM
IRGen.

Thank you for your thoughtful reply. It seems like you are speaking from experience. I still believe it is possible to do what I intended, but it might limit the passes I could use, and I might end up in a position where it would be difficult to modify my ARC optimization code, which is not good for a prototype-level language anyway. From what you say, it seems like I risk ending up in the IR equivalent of reasoning about spaghetti code.

So I’ve decided to think more about using a more hierarchical high-level IR with clang AST nodes as children, and to see if I can do a “lesser” version on the clang AST instead of the IR. This is just an idea at this point.

I am also rethinking the brute-force approach of using LLVM and will see if I can somehow make subtree results of the high-level IR cacheable, so that the results can be reused between compilations. Then I guess I could accept simplification algorithms on my own IR that are less performant than LLVM’s (e.g. using Z3). I guess that could work out OK in the initial version.

I can always revisit this intrinsic idea at a later point.

Again, thanks for your thoughtful reply,
Ola.