Hi folks,
TL;DR: I propose adding three new C/C++ intrinsics for controlling inlining at the call site:
- __builtin_no_inline(Foo()) – prevents the call to Foo() from being inlined at that particular callsite.
- __builtin_always_inline(Foo()) – inlines this call to Foo(), if possible.
- __builtin_flatten_inline(Foo()) – inlines this call to Foo() and (transitively) everything called within Foo’s body.
These intrinsics apply to the outermost call-like expression and can be used with function calls, member function calls, operator calls, constructor calls, and indirect calls (through function pointers, member function pointers, or virtual dispatch).
I have posted a patch implementing the first two intrinsics here: https://reviews.llvm.org/D51200. I would really appreciate feedback on the proposed semantics and implementation. I don’t have much experience with Clang, and I’d appreciate any help with the technical problems I mentioned in the code review. Details below.
Motivation:
It’s often the case that the compiler misses an inlining opportunity or inlines a function call too aggressively. In many cases, a performance regression can be traced back to a few wrong inlining decisions. When that happens, we can manually enforce the correct inlining decisions by:
1. Marking the callees of interest with __attribute__((noinline)), __attribute__((always_inline)), or gnu::flatten. This affects all call sites of those callees. For more fine-grained control over inlining, one workaround is to create a few copies (or proxies) of a function, each marked with a different attribute.
2. Globally changing the inline thresholds (e.g., -mllvm -inline-threshold=K).
3. Manually modifying the source in order to change the calculated inlining cost (e.g., splitting a function into a few smaller ones), or even inlining a function by hand by copy-pasting it into the call site.
Problems with the existing solutions:
- (1) and (2) can affect inlining globally instead of only at the places where it matters.
- (1) and (3) can duplicate code, which makes it less maintainable.
- (1) and (3) sometimes cannot be applied at all because we cannot modify the inlined functions, for example when they are declared in an external library.
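For illustration, the copy/proxy workaround from (1) might look roughly like the sketch below. The names compute, compute_cold, compute_hot and sum_hot_and_cold are made up for this example, and the attribute spelling assumes a GCC- or Clang-compatible compiler.

```cpp
#include <cassert>

// Hypothetical callee; imagine it lives in a header we would rather not edit.
inline int compute(int x) { return x * x + 1; }

// Proxy for cold call sites: the proxy itself is never inlined into its
// callers, although compute() may still be inlined into the proxy.
__attribute__((noinline)) int compute_cold(int x) { return compute(x); }

// Copy for hot call sites: the body is duplicated by hand and marked
// always_inline, which is the maintenance burden described above.
__attribute__((always_inline)) inline int compute_hot(int x) {
  return x * x + 1;
}

// Each call site picks the variant that encodes the desired decision.
int sum_hot_and_cold(int n) {
  assert(n >= 0);
  int total = 0;
  for (int i = 0; i < n; ++i)
    total += compute_hot(i);  // hot loop: we want this call inlined
  total += compute_cold(n);   // rare path: keep this call out of line
  return total;
}
```

Note that every new inlining preference needs yet another wrapper or copy, and none of this helps if the callee's definition cannot be duplicated.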
Proposed solution:
I propose to introduce new Clang intrinsics for controlling inlining at the call-site level. This way, it’s possible to cleanly tell the compiler what should happen to one particular function call only. These intrinsics are also self-documenting, in the sense that they are easy for humans to reason about and appear directly in the source code.
The proposed intrinsics are __builtin_no_inline, __builtin_always_inline, and __builtin_flatten_inline.
Example:
int foo(int) { /* … */ }
void baz(int) { /* … */ }

struct S {
  S();
  void bar(int);
  virtual void virt();
  S operator++();
  friend S operator+(const S &, const S &);
};

S *GetS();

int main() {
  // Inline the function call to foo(0) into main.
  int x = __builtin_always_inline(foo(0));

  // Prevent the constructor from being inlined into main.
  S s = __builtin_no_inline(S());

  // Force inline S::bar into main without forcing foo to be inlined.
  __builtin_always_inline(s.bar(foo(x)));

  // Force inline foo into main without forcing S::bar to be inlined.
  s.bar(__builtin_always_inline(foo(x)));

  // Force the outer call to baz to be inlined, then try to
  // transitively inline every function call from baz’s body.
  // Does not force foo to be inlined.
  __builtin_flatten_inline(baz(foo(x)));

  // Force the operator call S + S to be inlined.
  ++__builtin_always_inline(s + s);

  // Try to inline the virtual call to virt, if possible.
  __builtin_always_inline(GetS()->virt());
}
Syntax and semantics:
The inline intrinsics can be applied to function calls, member function calls, constructor calls, virtual calls, function pointer and member function pointer calls, and operator calls. They always affect the outermost call and not subexpressions.
All the intrinsics work on a “best-effort” basis, and make the specified inline decisions happen whenever possible. This may not always be the case, e.g. if you wrap indirect calls with __builtin_always_inline and the target doesn’t happen to be resolved during compilation.
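To make the indirect-call caveat concrete, here is a small sketch in plain C++ that does not use the proposed intrinsics. The function names are invented for illustration, and the comments describe what I would expect under the best-effort semantics, not what any compiler is guaranteed to do.

```cpp
#include <cassert>

int twice(int x) { return 2 * x; }
int negate(int x) { return -x; }

// The target is a constant visible in this function, so a compiler may be
// able to resolve the indirect call to twice() and then act on a hint.
int resolved_target(int x) {
  int (*fn)(int) = twice;
  return fn(x);
}

// The target depends on a runtime value, so in general there is no single
// callee to inline. Under best-effort semantics a hint wrapped around
// fn(x) would presumably be dropped silently here.
int unresolved_target(bool flag, int x) {
  int (*fn)(int) = flag ? twice : negate;
  return fn(x);
}

// Usage: both functions behave the same with or without any inlining.
inline void demo_indirect_calls() {
  assert(resolved_target(21) == 42);
  assert(unresolved_target(false, 5) == -5);
}
```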
One thing I’m not sure about is what to do when the expression inside an inline intrinsic turns out not to be a call at all. It doesn’t make much sense to write something like __builtin_always_inline(1 + 3), but in a generic context (e.g., __builtin_always_inline(t + u)) it’s not known up front whether the expression will end up operating on primitive types or on user-defined types that actually make function calls. In my opinion, it would make life easier if inline intrinsics over non-call expressions were treated as no-ops in any context, since the compiler can already reason about such expressions and won’t perform any function calls. One option is to silently do nothing when the compiler resolves the expression to a built-in operation, which would be consistent with silently not inlining calls it cannot resolve. Alternatively, we could emit a warning, which would make it easier to maintain code that uses these intrinsics.
I’d really like to get feedback on this issue.
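Here is a sketch of the generic-context situation. CALLSITE_ALWAYS_INLINE, Vec2 and add are hypothetical names for this example. The macro forwards to the proposed __builtin_always_inline only if the compiler reports it through __has_builtin, and otherwise expands to the bare expression, which mirrors the "no-op" option discussed above.

```cpp
#include <cassert>

// Assumption: __builtin_always_inline is the intrinsic proposed in this RFC.
// Compilers without it take the fallback branch below.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_always_inline)
#define CALLSITE_ALWAYS_INLINE(expr) __builtin_always_inline(expr)
#else
#define CALLSITE_ALWAYS_INLINE(expr) (expr)  // no-op fallback
#endif

// Hypothetical user-defined type whose operator+ is a real function call.
struct Vec2 {
  int x, y;
  friend Vec2 operator+(const Vec2 &a, const Vec2 &b) {
    return Vec2{a.x + b.x, a.y + b.y};
  }
};

// Generic code: whether t + u is a function call depends on T and U.
template <typename T, typename U>
auto add(const T &t, const U &u) -> decltype(t + u) {
  return CALLSITE_ALWAYS_INLINE(t + u);
}

// Usage: the same wrapper compiles for built-in and user-defined operands.
inline void demo_generic_add() {
  assert(add(1, 3) == 4);                // built-in +, nothing to inline
  Vec2 v = add(Vec2{1, 2}, Vec2{3, 4});  // calls operator+(Vec2, Vec2)
  assert(v.x == 4 && v.y == 6);
}
```

If the int instantiation were a hard error instead, templates like add would need a separate code path for primitive types, which is the main reason I lean towards the no-op (possibly with a warning) behavior.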
Implementation:
I have already partially implemented the first two intrinsics (__builtin_no_inline and __builtin_always_inline) here: https://reviews.llvm.org/D51200. Calls wrapped with the inline intrinsics are annotated with the appropriate attributes during code generation. LLVM already seems to handle call sites annotated with alwaysinline and noinline. I think it should also be possible to implement an appropriate attribute for flattening, as there’s already a gnu::flatten attribute for function declarations.
Let me know what you think,
Kuba