Is there a "callback optimization"?

By that I mean an optimization pass (or a combination of them) that turns:

void useCallback(void (*callbackfn)())
{
    // Do something
    callbackfn();
    // Do something else
}

void myCallback()
{
    // Respond one way
}

void myOtherCallback()
{
    // Respond another way
}

void foo()
{
    useCallback(myCallback);
    useCallback(myOtherCallback);
}

into:

// Keep the original; it'll get removed
// by other passes if it's nonpublic and dead.
void useCallback(void (*callbackfn)())
{
    // Do something
    callbackfn();
    // Do something else
}

void myCallback()
{
    // Respond one way
}

void myOtherCallback()
{
    // Respond another way
}

// Equivalent of useCallback(myCallback), but doesn't call through a
// function pointer.
void useMyCallback()
{
    // Do something
    myCallback();
    // Do something else
}

// Equivalent of useCallback(myOtherCallback), but doesn't call through a
// function pointer.
void useMyOtherCallback()
{
    // Do something
    myOtherCallback();
    // Do something else
}

// Changed to use the specialized versions created above.
void foo()
{
    useMyCallback();
    useMyOtherCallback();
}

With that transform in place, lots of inlining becomes possible, and
direct function calls replace indirect function calls if inlining
isn't appropriate. If this transform is combined with argpromotion
and scalarrepl, it can be used for devirtualization of C++ virtual
function calls.

There seems to be an awful lot of C++ code out there that uses
templates to perform this same optimization in source code.

Hi Kenneth,

yes, LLVM does this. For example, running your example through the LLVM
optimizers gives:

define void @foo() nounwind readnone {
entry:
   ret void
}

As you can see, the indirect function calls were resolved into direct
function calls and inlined.

I don't know which passes take care of this however.

Ciao,

Duncan.

When I used -std-compile-opts -disable-inlining, my transform didn't
happen. I think that in your test, inlining UseCallback into foo made
the function pointer a constant, which turned the indirect call into a
direct call that was then inlined.

If UseCallback is too big to inline and uses the callback parameter
inside a loop, this transform is potentially valuable, particularly if
UseCallback is called multiple times with the same callback parameter.

Interestingly, when I had foo call UseCallback multiple times with
*only* callback1, it yanked the function pointer parameter out of
UseCallback and turned the call into a direct call (I'm guessing
dead argument elimination came into play here). But as soon as I added
a call to UseCallback with callback2 to the mix, it stopped
eliminating the indirect calls.

There's a partial specialization pass that can do this sort of thing
(IIRC not enabled by default); if you really need the transformation
you're describing, you can adapt it appropriately. That said, if the
inliner doesn't decide to inline useCallback, duplicating the function
without inlining it is probably a bad idea in terms of code size. IIRC
the inliner heuristics already strongly favor inlining in cases like
the given example, but it's possible that the inliner heuristics could
use some additional tweaking.

-Eli

It should be relatively simple to write a pass that turns each call
that has constant argument(s) into a call to a specialized version of
the callee. To devirtualize C++ calls it needs to be smarter, since
the argument is not a constant, but a pointer to a struct that points
to a constant. However, the tricky parts are:
1) Knowing when to perform specialization. If the call was not inlined,
the function is probably big. Getting this wrong will generate *a lot*
of code for a very small (if not negative) speed gain.
2) Sharing specializations between different call sites that pass the
same constants.
Getting 1) right is crucial but hard. The easy cases are already
handled by inlining and dead argument elimination. If good profiling
information is available, it can be used to estimate the speed/space
trade-off (specialize calls from hot code).

Eugene

As the number of call sites using the same constant grows, inlining
gets more expensive while specializing does not: the cost of
specializing grows only with the number of unique constant combinations
specialized. So cases where you'd want to specialize but not inline
shouldn't be all that uncommon, and different cost calculations are
needed to set the threshold.

I didn't see the partial specialization pass in the docs, but I'll
take a look at it now.

The partial specialization pass is exactly what I need! Thanks!

It has a bug, though... I put in three calls using one function
pointer and three calls using another, and it created both
specializations but changed all the call sites to use the first
specialization. I'll svn update, try again, and file a bug if needed.

Hi,

1) Knowing when to perform specialization. If the call was not inlined,
the function is probably big. Getting this wrong will generate *a lot*
of code for a very small (if not negative) speed gain.

Could you elaborate on why just having (lots of) more code in the
final executable incurs a performance _penalty_? I was thinking of
something similar, but for type specializations of functions in a
dynamically typed language, so that the frontend creates more than one
function for each function in the source code.

Cornelius

Having more executable code tends to reduce the locality of code,
especially if all that code is loaded and executed during the life of
the process (and if it is never loaded and executed, it would probably
have been eliminated as dead code). Also, a larger process executable
takes longer to spin up (which is a Bad Thing in some situations).

Generic reasons -- more data means more cache misses, TLB misses and
page faults. And some code-specific reasons -- more code pollutes
branch prediction and trace caches on some CPUs.

Eugene