Function specialisation pass

I am interested in adding a function specialisation(*) pass to LLVM. This frequently
comes up as a reason for performance differences in generated code between
GCC and LLVM. In GCC this is implemented in [1] and is enabled at
optimisation level -O3 and up. There have been two previous attempts at adding
this to LLVM: a while back in [2], and very recently in [3].

Both previous attempts were parked at approximately the same point: the
transformation was implemented, but the cost-model to control compile-times
and code-size was lacking. This is crucial, because the goal would be to have
function specialisation enabled by default (at some optimisation level), and function
cloning of course has great potential to increase code-size, while the
interprocedural analysis isn’t cheap either. Thus, I agree with previous comments
on both tickets that we probably need a solid story here before it makes
sense to add this to the LLVM code-base.

Given that GCC has this enabled by default, we are in an excellent position to
evaluate the compile-time/code-size overhead of function specialisation. I have
investigated this overhead for two cases (without evaluating the performance of
the generated code). First, I took SQLite’s amalgamation version, i.e. one source file
that is “220,000 lines long and over 7.5 megabytes in size” [4], as that seemed to me
like a good stress test for an IPA transformation pass. Comparing GCC 10.2.0
with -O3 and “-O3 -fdisable-ipa-cp” to toggle this on and off, respectively,
showed a 0.3% - 1.5% compile-time difference on different systems and a
1.3% object file increase. Second, I timed compilation of LLVM itself, representing a
very different codebase. For LLVM, compile-time differences were all within noise
levels, with only a tiny code-size increase. I haven’t looked into the details here, but I
guess the pass is simply not doing much, which could be the benefit of a well-tuned
cost-model, so that is a good result.

Another reason why I am widening the audience with this email is that
alternative approaches were floated. In [2] it was remarked that “we could
propagate constant sets indicating the functions indirect call sites could possibly
target. Although we would probably want to limit the size of the sets to something
small, the pass could attach the sets via metadata to the calls so that this information
could be consumed by later passes. Such metadata could be used for indirect call
promotion, intersecting the function attributes of the possible targets”.
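
To illustrate what a later pass could do with such a set: once the possible targets of
an indirect call are known, the call can be promoted to a guarded direct call. A
hand-written sketch, assuming a propagated target set of {plus, minus} (the metadata
encoding itself is not specified in [2]):

long plus(long x) { return x + 1; }
long minus(long x) { return x - 1; }

long compute(long x, long (*binop)(long)) {
  if (binop == plus)
    return plus(x);   /* direct call, now visible to the inliner */
  return binop(x);    /* fallback for the remaining known target */
}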

But I am reluctant to go for this, let’s say, more experimental approach, as the GCC
numbers are very encouraging. I.e., both compile-time and code-size seem very
reasonable, and I haven’t yet seen any reason to abandon the approaches in
[2] and [3], which are very similar. So, the approach I would like to take is to complement
[3] with an analysis of the added compile-time/code-size increase, and then propose a
cost-model which hopefully gives results in the range of GCC’s; I think that is what
we should be aiming for. But I am interested to hear if there are more/other opinions on
this.

(*) Motivating examples for function specialisation in previous work were, e.g.:

long plus(long x) {
  return x + 1;
}

long minus(long x) {
  return x - 1;
}

long compute(long x, long (*binop)(long)) {
  return binop(x);
}

int foo(int x, int y) {
  if (y)
    return compute(x, plus);
  else
    return compute(x, minus);
}

If we specialise compute, which then gets inlined into foo, we end up with just a
return x + 1 and a return x - 1 in foo. Other motivating examples pass constants to
functions, which can then be propagated after function specialisation.
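
A minimal hand-written sketch of that pattern (the names are made up for
illustration): specialising scale on its constant k argument gives two clones in
which constant propagation and folding can simplify the loop body:

static void scale(double *dst, const double *src, int n, double k) {
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * k;
}

void copy(double *dst, const double *src, int n) {
  scale(dst, src, n, 1.0);  /* the k == 1.0 clone folds to a plain copy */
}

void negate(double *dst, const double *src, int n) {
  scale(dst, src, n, -1.0); /* the k == -1.0 clone folds to a negation */
}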

[1] https://github.com/gcc-mirror/gcc/blob/master/gcc/ipa-cp.c
[2] https://reviews.llvm.org/D36432
[3] https://reviews.llvm.org/D93838
[4] https://www.sqlite.org/amalgamation.html

Just wanted to chime in here and say that having a good, robust specialization pass would be very valuable. As you highlight, though, the cost modeling is the tricky bit. Working through a cost model that works well in practice - in particular, in combination with the existing inliner’s cost model - is a major task. I suspect, without much evidence, that you’ll end up having to balance the cost model for specialization against the cost model for inlining - and thus probably have to adjust both. That is, to say the least, challenging. If you’re willing to take on that work, I’ll simply say thank you and good luck. :)

One thing you could consider is exploring the notion of “zero cost” specialization. The analogy I’d use is the cost model we use for the new loop unswitch: the old loop unswitch was pretty aggressive, whereas the new one requires the transformation to recover (almost all of) the overhead of the transformed loop.
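
To make the analogy concrete, a minimal hand-written sketch of loop unswitching
(the function names are made up): the invariant test is hoisted out of the loop at
the price of duplicating the loop body, and that duplication is exactly the cost the
new pass’s model insists on recovering:

/* Before: a loop-invariant condition is tested on every iteration. */
void update(int *a, int n, int add) {
  for (int i = 0; i < n; i++) {
    if (add)
      a[i] += 1;
    else
      a[i] -= 1;
  }
}

/* After unswitching: one branch selects between two specialised loops;
   the duplicated body is the code-size cost to be recovered. */
void update_unswitched(int *a, int n, int add) {
  if (add)
    for (int i = 0; i < n; i++) a[i] += 1;
  else
    for (int i = 0; i < n; i++) a[i] -= 1;
}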

I’ve spent some time in the past thinking about what a zero cost specializer might look like. I think there are some very interesting possibilities in looking at it through the lens of function splitting combined with musttail dispatch. Let me give a small example using one of your motivating routines. Starting with:

long compute(long x, long (*binop)(long)) {
  return binop(x);
}

int foo(int x, int y) {
  if (y)
    return compute(x, plus);
  else
    return compute(x, minus);
}

This can become (at zero cost):

long compute(long x, long (*binop)(long)) {
  return binop(x);
}

long computeA(long x) { return compute(x, plus); }

long computeB(long x) { return compute(x, minus); }

int foo(int x, int y) {
  if (y)
    return musttail computeA(x);
  else
    return musttail computeB(x);
}

If you think about what that lowers to, foo ends up simply being the branch dispatch to the other routines. There’s some space overhead due to function alignment, and a lost fallthrough opportunity, but that’s it.
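
(As an aside, this shape can be written in present-day C using clang’s musttail
statement attribute. Clang requires the caller’s and callee’s prototypes to match,
so this sketch gives the wrappers a dummy second parameter purely to satisfy that
check - an illustration of the lowering, not part of the proposal:)

long plus(long x) { return x + 1; }
long minus(long x) { return x - 1; }

long compute(long x, long (*binop)(long)) {
  return binop(x);
}

/* The unused second parameter exists only so the wrapper prototypes
   match foo's, which clang's musttail checking requires. */
static long computeA(long x, long y) { (void)y; return compute(x, plus); }
static long computeB(long x, long y) { (void)y; return compute(x, minus); }

long foo(long x, long y) {
  if (y) {
    __attribute__((musttail)) return computeA(x, y);
  } else {
    __attribute__((musttail)) return computeB(x, y);
  }
}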

Why bother? Well, on this example it’s hard to justify, because the inliner would happily blow through the whole thing, but on larger examples we’ve made the dispatch in foo dramatically easier for the caller to inline. That’s particularly interesting when a subset of callers have analyzable or predictable (in the branch-predictor sense) values for y.

Anyways, just a random thought dump. If any of this is useful, feel free to use it; if not, feel free to ignore. :)

Philip