[RFC] Per-callsite inline intrinsics

Hi folks,

TL;DR: I propose adding three new C/C++ intrinsics for controlling inlining at the call site:

  • __builtin_no_inline(Foo()) – prevents the call to Foo() from being inlined at that particular callsite.
  • __builtin_always_inline(Foo()) – inlines this call to Foo(), if possible.
  • __builtin_flatten_inline(Foo()) – inlines this call to Foo() and (transitively) everything called within Foo’s body.

These intrinsics apply to the outermost call-like expression. It will be possible to use them with function calls, member function calls, operator calls, constructor calls, and indirect calls (through function pointers, member function pointers, or virtual dispatch).

I have posted a patch implementing the first two intrinsics here: https://reviews.llvm.org/D51200. I would really appreciate feedback on the proposed semantics and implementation. I don’t have much experience with Clang, so I’d appreciate any help with the technical problems I mentioned in the code review. Details below.

Motivation:
It’s often the case that the compiler misses an inlining opportunity or inlines a function call excessively. In many cases, a performance regression can be traced to a few wrong inlining decisions. When that happens, we can manually enforce the correct inlining decisions by:

  1. Marking the callees of interest with __attribute__((noinline)), __attribute__((always_inline)), or [[gnu::flatten]]. This affects all call sites of those callees. For more fine-grained control over inlining, one workaround is to create a few copies (or proxies) of the callee, each marked with a different attribute.
  2. Globally changing the inline thresholds (e.g., -mllvm -inline-threshold=K).
  3. Manually modifying the source to change the calculated inlining cost (e.g., splitting a function into a few smaller ones), or even inlining a function by hand by copy-pasting it into the call site.
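For reference, the copies-or-proxies workaround from (1) might look roughly like the sketch below. It assumes a GCC- or Clang-compatible compiler for the __attribute__ syntax, and scale, scale_cold, scale_hot, and check_copies_agree are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical callee; imagine it lives in a header we cannot annotate.
inline int scale(int x) { return x * 3 + 1; }

// Proxy for call sites that should stay out of line. The call to the proxy
// is not inlined into its callers; scale() itself may still be inlined into
// the proxy's body.
__attribute__((noinline)) int scale_cold(int x) { return scale(x); }

// Copy for call sites that must be inlined. A wrapper would not be enough
// here: always_inline on a wrapper only forces the wrapper to be inlined,
// not the call to scale() inside it, so the body is duplicated by hand.
__attribute__((always_inline)) inline int scale_hot(int x) { return x * 3 + 1; }

// The duplicated body has to be kept in sync manually; a check like this is
// one way to catch drift between the copy and the original.
void check_copies_agree(int x) {
  assert(scale_hot(x) == scale(x));
  assert(scale_cold(x) == scale(x));
}
```

Every call site then has to pick scale, scale_cold, or scale_hot by name, and the hand-copied body is exactly the kind of duplication the workaround is criticized for.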

Problems with the existing solutions:

  • (1) and (2) can affect inlining globally instead of only at the places where it matters.
  • (1) and (3) can have the disadvantage of duplicating code and thus making it less maintainable.
  • (1) and (3) sometimes cannot be applied if for some reason we cannot modify the inlined functions. This can be the case when these functions are declared in an external library.

Proposed solution:
I propose to introduce new Clang intrinsics for controlling inlining at the call-site level. This makes it possible to cleanly tell the compiler what should happen to one particular function call. These intrinsics are also self-documenting, in the sense that they are easy for humans to reason about and appear directly in the source code.

The proposed intrinsics are __builtin_no_inline, __builtin_always_inline, and __builtin_flatten_inline.

Example:
int foo(int) { /* … */ }

void baz(int) { /* … */ }

struct S {
  S();
  void bar(int);
  virtual void virt();
  S operator++();
  friend S operator+(const S &, const S &);
};

S *GetS();

int main() {
  // Inline the function call to foo(0) into main.
  int x = __builtin_always_inline(foo(0));

  // Prevent the constructor from being inlined into main.
  S s = __builtin_no_inline(S());

  // Force inline S::bar into main without forcing foo to be inlined.
  __builtin_always_inline(s.bar(foo(x)));

  // Force inline foo into main without forcing S::bar to be inlined.
  s.bar(__builtin_always_inline(foo(x)));

  // Force the outer call to baz to be inlined, then try to
  // transitively inline every function call from baz’s body.
  // Does not force foo to be inlined.
  __builtin_flatten_inline(baz(foo(x)));

  // Force the operator call S + S to be inlined.
  ++__builtin_always_inline(s + s);

  // Try to inline the virtual call to virt, if possible.
  __builtin_always_inline(GetS()->virt());
}

Syntax and semantics:
The inline intrinsics can be applied to function calls, member function calls, constructor calls, virtual calls, function pointer and member function pointer calls, and operator calls. They always affect the outermost call and not subexpressions.

All the intrinsics work on a “best-effort” basis: they make the specified inlining decision happen whenever possible. This is not always possible, e.g. if you wrap an indirect call with __builtin_always_inline and the target cannot be resolved at compile time.
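As an illustration of the indirect-call case, the sketch below contrasts a call through a function pointer whose target is visible to the optimizer with one whose target arrives as a run-time parameter. It is plain C++ without the proposed builtins; twice, negated, call_known, and call_unknown are hypothetical names.

```cpp
#include <cassert>

int twice(int x) { return 2 * x; }
int negated(int x) { return -x; }

using UnaryFn = int (*)(int);

// The pointer is initialized from a known function in the same scope, so an
// optimizer can typically resolve the indirect call to twice() and would
// then be able to honor a request to inline it.
int call_known(int x) {
  UnaryFn f = twice;
  return f(x);
}

// The target is a parameter. Unless call_unknown is itself inlined into a
// caller that passes a known function, the target stays unresolved and a
// best-effort inline request on f(x) would have nothing to inline.
int call_unknown(UnaryFn f, int x) {
  assert(f != nullptr);  // calling through a null pointer is undefined
  return f(x);
}
```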

One thing I’m not sure about is what to do when the expression inside an inline intrinsic is not a call of any kind. It doesn’t make much sense to write something like __builtin_always_inline(1 + 3). In a generic context, however (e.g., __builtin_always_inline(t + u)), it is not known up front whether the expression will operate on primitive types or on user-defined ones that actually make function calls. In my opinion, it will make life easier if inline intrinsics over non-call-like expressions are treated as no-ops in any context, since the compiler can already reason about them and won’t perform any function calls. One option is to silently do nothing when the compiler resolves the expression to a built-in operation, which would be consistent with silently not inlining calls it cannot resolve. Alternatively, we could emit warnings, which would make code that uses these intrinsics easier to maintain.
I’d really like to get feedback on this issue.
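A minimal sketch of the generic-context situation, in standard C++ without the proposed builtins: the same expression t + u is a built-in operation for int and a real function call for a user-defined type. Vec2, add, g_operator_plus_calls, and demo_generic_add are hypothetical names chosen for the example.

```cpp
#include <cassert>

// Hypothetical user-defined type whose operator+ is an ordinary function.
struct Vec2 {
  int x;
  int y;
};

// Counts calls to the user-defined operator so the difference is observable.
static int g_operator_plus_calls = 0;

Vec2 operator+(const Vec2 &a, const Vec2 &b) {
  ++g_operator_plus_calls;
  return Vec2{a.x + b.x, a.y + b.y};
}

// Generic code cannot tell in advance whether t + u is a call. With the
// proposal, the return expression would be spelled
//   __builtin_always_inline(t + u)
// which is shown only as a comment because the builtin is a proposal.
template <typename T, typename U>
auto add(const T &t, const U &u) -> decltype(t + u) {
  return t + u;
}

void demo_generic_add() {
  g_operator_plus_calls = 0;

  // int + int is a built-in operation: no function is called.
  assert(add(1, 3) == 4);
  assert(g_operator_plus_calls == 0);

  // Vec2 + Vec2 resolves to the overloaded operator: one real call.
  Vec2 v = add(Vec2{1, 2}, Vec2{3, 4});
  assert(v.x == 4 && v.y == 6);
  assert(g_operator_plus_calls == 1);
}
```

Whatever rule is chosen for non-call operands (no-op, warning, or error) would have to give a sensible answer for both instantiations of add.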

Implementation:
I have already partially implemented the first two intrinsics (__builtin_no_inline and __builtin_always_inline) here: https://reviews.llvm.org/D51200. Calls wrapped with the inline intrinsics are annotated with the appropriate attributes during code generation. LLVM already seems to handle call sites marked with alwaysinline and noinline. I think it should also be possible to implement a corresponding attribute for flattening, as there is already a gnu::flatten attribute for function declarations.

Let me know what you think,
Kuba

Thank you for the detailed description of the problem and the design rationale. I think this is a reasonable and clean solution to the problem.

Regarding applying the builtins to a non-call: I think this deserves at least an enabled-by-default warning. I’m not sure how compelling the use cases are for applying these builtins within a template – they seem like very surgical tools for controlling inlining, and so applying them to a family of functions is perhaps unwise – and the case where the builtin is applied to a call in some instantiations and to a built-in operator in others does not immediately seem like a primary concern to me, so I’m not too concerned about making this diagnostic an error on that basis. Perhaps we could make this an error in most cases and downgrade it to a warning in template instantiations. (Are there also macro scenarios where you anticipate it being unknown whether the operand is a function call?)

> Hi folks,
>
> TL;DR: I propose to add 3 new C/C++ intrinsics for controlling inlining at callsite:
>
>   • __builtin_no_inline(Foo()) – prevents the call to Foo() from being inlined at that particular callsite.
>   • __builtin_always_inline(Foo()) – inlines this call to Foo(), if possible.
>   • __builtin_flatten_inline(Foo()) – inlines this call to Foo() and (transitively) everything called within Foo’s body.
>
> These intrinsics apply to the outermost call-like expression and it will be possible to use them with: function calls, member function calls, operator calls, constructor calls, indirect calls (with function pointers, member function pointers, virtual calls).

Would it make sense to pre-emptively generalize this as a call-annotation builtin that takes the annotations as a second, string-literal argument? Or maybe even has some special parsing rules so that we can e.g. parse attributes in the later operands? Call-related features tend to end up with a million different variations; you already have three different builtins for three different inlining behaviors, and once this infrastructure is up and working, I expect we’ll want to add more for all sorts of other call-related mechanisms. To me, that means it makes sense to plan this as a more heavyweight language feature, including things like preparing for call sites that the user wants to apply multiple annotations to.

For __builtin_flatten_inline, there are sometimes a lot of implicit calls done on behalf of emitting the primary call: operator->, default arguments, copy constructors for arguments, destructors for arguments, that kind of thing. Are these covered? Not covered because they’re emitted with the caller? Is it okay for the semantics to differ by target for things like argument destructors?
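To make the question concrete, here is a small sketch in standard C++ (no proposed builtins) that counts the implicit operations surrounding one source-level call. All names (Counters, Arg, Widget, Handle, make_default, call_once, demo_implicit_calls) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Counters for the implicit operations around one source-level call.
struct Counters {
  int copies = 0;
  int destroys = 0;
  int default_args = 0;
  int arrows = 0;
};
static Counters g_counts;

struct Arg {
  Arg() = default;
  Arg(const Arg &) { ++g_counts.copies; }
  ~Arg() { ++g_counts.destroys; }
};

int make_default() {
  ++g_counts.default_args;
  return 7;
}

struct Widget {
  // By-value parameter plus a default argument that is itself a call.
  int use(Arg, int n = make_default()) { return n; }
};

struct Handle {
  Widget w;
  Widget *operator->() {
    ++g_counts.arrows;
    return &w;
  }
};

int call_once() {
  Handle h;
  Arg a;
  // The single expression below involves: Handle::operator->, the Arg copy
  // constructor for the by-value parameter, make_default() for the omitted
  // argument, Widget::use itself, and the destructor of the parameter copy.
  return h->use(a);
}

void demo_implicit_calls() {
  g_counts = Counters{};
  int result = call_once();
  assert(result == 7);
  assert(g_counts.arrows == 1);
  assert(g_counts.copies == 1);
  assert(g_counts.default_args == 1);
  // One destructor for the parameter copy, one for the local `a`.
  assert(g_counts.destroys == 2);
}
```

Whether the parameter copy is destroyed by the caller or inside the callee depends on the ABI (as far as I know, the Itanium C++ ABI leaves it to the caller and the Microsoft ABI to the callee), which is presumably where per-target differences for a flatten request would show up.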

> One thing I’m not sure about is what to do when the expression inside inline intrinsic doesn’t happen to be any kind of call. It doesn’t make much sense to be able to write something like: __builtin_always_inline(1 + 3), but what may happen in generic context (e.g., __builtin_always_inline(t + u)), is that it’s not known if expressions will end up operating on primitive types or user-defined ones that actually make function calls. In my opinion, it will make life easier if inline intrinsics over non-call-like expressions will be treated as no-ops, in any context, as the compiler can already reason about them and won’t perform any function calls. One option is to silently not inline when the compiler resolves the call to an operation, which would be consistent with the behavior of silently not inlining calls it cannot resolve. Alternatively we may emit warnings, which would make maintaining code with these intrinsics easier.
> I’d really like to get feedback on this issue.

This should definitely be a parse-time error. In a template, you can recognize and decline to diagnose dependent expressions that might be instantiated to a call, which I believe just means allowing BinaryOperator and UnaryOperator expressions when one of the operands has a dependent type.

John.