Approaches to the inliner blowing up

There has been recurring discussion about the inliner blowing up (e.g. [Inline] Unbounded inlining due to locally hot call sites · Issue #181821 · llvm/llvm-project · GitHub, Alwaysinliner time explosion with new pass manager · Issue #59126 · llvm/llvm-project · GitHub). This is typically due to a fundamental flaw in the way we implement bottom-up inlining without a global cost budget: it is very easy to inline small calls in a way that ends up being exponential. For example, each of the calls below individually seems fine to inline when inlining into @f, but in aggregate they blow up compile time and binary size:

define void @i() {
  ; small body, cheap to inline on its own
  ret void
}
define void @h() {
  call void @i()
  call void @i()
  call void @i()
  call void @i()
  ret void
}
define void @g() {
  call void @h()
  call void @h()
  call void @h()
  call void @h()
  ret void
}
define void @f() {
  call void @g()
  call void @g()
  call void @g()
  call void @g()
  ret void
}
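
To make the blowup concrete, here is a back-of-the-envelope sketch (this is only arithmetic about code duplication, not LLVM's actual cost model):

```python
# If every call at every level of the example above is inlined, the
# number of copies of the innermost body in the root function grows
# as fanout ** depth.
def inlined_copies(fanout: int, depth: int) -> int:
    return fanout ** depth

# @f -> @g -> @h -> @i, fan-out 4 at each of the 3 levels:
print(inlined_copies(4, 3))  # 64 copies of @i's body inside @f
```

Each individual decision inlines only 4 small calls, but the product across levels is what makes the result exponential.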

There have been a couple of proposals on limiting inlining. [Inline] Accumulate the cost of the inlined function to the new callsite by dianqk · Pull Request #111104 · llvm/llvm-project · GitHub adds extra cost on inlined call sites. [Inliner] Disable locally hot callsite heuristic for inlined calls by nikic · Pull Request #181827 · llvm/llvm-project · GitHub disables a heuristic threshold increase for inlined call sites.

Talking to @chandlerc, here are some ideas bounced around:

  • First of all, it’s nice to preserve inliner idempotency, meaning running the inliner multiple times shouldn’t cause more calls to be inlined.
    • Having inlining budgets across multiple calls is the easiest way to break idempotency if the budget is only kept in memory and not recorded into the IR.
  • It’s also nice to not have inlining decisions differ depending on the order of processing call sites, which can easily happen with budgets.
  • If we could somehow take into account simplifications in the caller in the cost model, the cost model would be more accurate and we might be able to turn down some thresholds.
    • This would require a lot of work to ensure it doesn’t significantly impact compile time, since it would traverse the entire caller O(#calls) times.
    • This was brought up because of the observation that with a global/per-function inlining budget, later calls that can end up reducing code size may not happen.
    • Today we try to simplify the callee with the specific parameters passed into it and cost model based on that.
  • We can batch more inlining decisions up front so that we have some sort of per-call-site budget. In other words, rather than adding inlined call sites to the queue of call sites to evaluate, when we analyze the cost of a call site we also determine whether to inline the transitive calls.
  • We have a heuristic added in https://reviews.llvm.org/D36199 which increases the inlining threshold for “locally hot” calls (in compiles without profiles), “locally hot” meaning the call happens more than 60 times in the caller execution. This specifically seems to be the cause of many an inlining blowup. In the meantime we could punt on some of these blowups by lowering the inlining boost.
    • This only applies to builds without profile information, so it wouldn’t affect workloads that are optimized with PGO, which have much better information about inlining.
    • The call is presumably in a loop. We can also adjust the threshold based on how many other calls are in the loop, although that goes into the “budget across multiple calls” territory.
    • None of this really solves the fundamental problem listed at the beginning of this post, but may unblock some bugs people are seeing.
  • Some of the PRs linked above disable or penalize further inlining of inlined call sites. We could also scale the penalty with the number of inlined call sites already present in the caller.
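
As a toy illustration of that last bullet (the function name, the penalty constant, and the units are all hypothetical, not LLVM's real cost model), the threshold for a call site produced by inlining could shrink with the number of inlined call sites the caller already contains, so repeated re-inlining gets progressively harder:

```python
# Hypothetical sketch: each inlined call site already present in the
# caller reduces the threshold granted to further inlined call sites.
def effective_threshold(base: int, inlined_sites_in_caller: int,
                        penalty_per_site: int = 25) -> int:
    return max(0, base - penalty_per_site * inlined_sites_in_caller)

print(effective_threshold(225, 0))   # fresh call site: full threshold
print(effective_threshold(225, 8))   # heavily inlined caller: 25
print(effective_threshold(225, 20))  # clamped at 0: no more inlining
```

The clamp at zero acts as a hard backstop once a caller has absorbed enough inlined code.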

Discussion appreciated!

I think this is a bit more promising as a short-term improvement. There isn’t a need to use a budget per se. Instead, the idea would be to only apply the inline threshold boost for “locally hot” calls when there are fewer than N calls within the loop body. That is a directly observable property of the IR already, and so should be durable and idempotent.
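
A sketch of that gating (the constants are placeholders, not LLVM's values): the "locally hot" bonus applies only when the enclosing loop body holds fewer than N calls, a property that can be read directly off the IR, which is what makes the decision durable and idempotent:

```python
HOT_BONUS_MULT = 3      # illustrative threshold multiplier
N_MAX_LOOP_CALLS = 4    # the "N" from the text, illustrative

# Grant the locally-hot boost only when the loop body is call-sparse.
def callsite_threshold(base: int, locally_hot: bool,
                       calls_in_loop_body: int) -> int:
    if locally_hot and calls_in_loop_body < N_MAX_LOOP_CALLS:
        return base * HOT_BONUS_MULT
    return base

print(callsite_threshold(100, True, 2))   # boosted: 300
print(callsite_threshold(100, True, 16))  # many calls in the loop: 100
```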

In general, I think this boost predated the extensive work that has gone in over the years to improve LLVM’s profile-guided inlining, and PGO infrastructure more generally. PGO was much less common at the time than it is now, and so there was much more pressure to devise workarounds based on local profile information such as this boost. So I think it would be broadly worth nudging folks that care about non-PGO performance to experiment with techniques that limit or even eliminate this boost. While not fixing the underlying problem, it would result in simpler thresholds and so seems valuable, generally. And it might reduce the immediate / short-term pressure so we could pursue some more fundamental improvements here.

+1 to what Chandler said here.

I think that the fundamental problem here is that we both

  • rely exclusively on strict bottom-up inlining to prevent exponential inlining, and
  • want to permit limited non-bottom-up inlining for various reasons.

We cannot have both. One option is to eliminate all cases where non-bottom-up inlining can occur, which I don’t think we want to do, because allowing it in certain cases is quite important for making better inlining decisions. The other is to stop relying on bottom-up inlining as the only mechanism that prevents exponential inlining. I think that is the more promising direction.

Here is a random example I constructed that gets us exponential inlining without relying on the locally hot heuristic: Compiler Explorer. The interesting aspect of this test case is that it does not depend on any threshold adjustments; it is based on the inlining having lower cost in a caller further up the call tree. This kind of case (where inlining becomes profitable in a caller higher up due to more call-site information) is exactly the reason why we want to reconsider call sites that we previously declined to inline. But allowing it here leads to exponential inlining.

I’m not sure what exactly we should do here, but I think it has to come in the form of a generic limit of function growth due to inlining, which acts as a backstop for all possible bottom-up inlining bypasses.

I’m thinking something along the lines of:

  1. Record the (SCC) function sizes pre-inlining.
  2. Do one full round of inlining for the initially scheduled calls. This is to avoid order dependence where earlier calls in the function get favorable treatment.
  3. Compute the size growth relative to the pre-inlining size, and record it in a function attribute, so as to preserve idempotence.
  4. Future inlining attempts will now be checked against this attribute based on some heuristic. In particular this includes any inlined calls that were scheduled for another inlining attempt.
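
A minimal sketch of those four steps (the attribute name, the growth limit, and the size bookkeeping are all made up for illustration). The key point is that the growth factor is persisted on the function itself, so re-running the inliner observes the same state and the check stays idempotent:

```python
MAX_GROWTH = 3.0  # hypothetical backstop on cumulative growth

class Function:
    def __init__(self, size):
        self.size = size
        self.attrs = {}

def record_growth(func: Function, pre_inlining_size: int) -> None:
    # Steps 1-3: compare post-round size against the recorded
    # pre-inlining size and fold the growth into an IR attribute.
    growth = func.size / pre_inlining_size
    func.attrs["inline-growth"] = func.attrs.get("inline-growth", 1.0) * growth

def may_inline_into(func: Function) -> bool:
    # Step 4: later attempts (including re-scheduled inlined calls)
    # are checked against the persisted attribute.
    return func.attrs.get("inline-growth", 1.0) < MAX_GROWTH

f = Function(size=100)
f.size = 250            # pretend one full round of inlining grew it
record_growth(f, 100)
print(may_inline_into(f))  # True: cumulative growth 2.5 < 3.0
f.size = 500
record_growth(f, 250)
print(may_inline_into(f))  # False: cumulative growth 5.0
```

Because the attribute accumulates multiplicatively across rounds, a function cannot dodge the backstop by growing a little in each of many rounds.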

Your case is not due to bottom-up inlining, but to the top-down aspect of the current BU inliner implementation: the iterative inlining transformation. Delaying some inlining decisions in the BU step until the context is clearer is a feature.

One reasonable approach is to introduce a caller-size (or CFG-complexity) limit, or a loop-region-based limit that models the instruction working set. To make this work, call sites need to be ordered by profitability so that the budget is not eaten up by low-profit call sites. Different ordering policies can be implemented too.
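
A sketch of that profitability ordering (the profit/cost numbers and units are illustrative): by popping the most profitable call site first from a priority queue, a shared budget goes to the high-value sites rather than being consumed in arrival order:

```python
import heapq

def inline_with_budget(callsites, budget):
    # callsites: list of (profit, cost) pairs; negate profit because
    # heapq is a min-heap and we want the most profitable site first.
    heap = [(-profit, cost) for profit, cost in callsites]
    heapq.heapify(heap)
    inlined = []
    while heap:
        neg_profit, cost = heapq.heappop(heap)
        if cost <= budget:
            budget -= cost
            inlined.append((-neg_profit, cost))
    return inlined

# High-profit sites are considered first and win the budget:
print(inline_with_budget([(1, 40), (10, 40), (5, 40)], budget=80))
```

With a naive in-order scan, the low-profit `(1, 40)` site would have eaten half the budget before the `(10, 40)` site was even considered.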

I think there is an approach that estimates the total cost of doing bottom-up inlining with a higher threshold. For instance, it is odd that the following two entry points can end up with different inlined results: Compiler Explorer.

define void @test1_a() {
  call void @test2_a(i32 0)
  ret void
}

define void @test1() {
  call void @test2(i32 0)
  ret void
}

define void @test2(i32 %a) {
  %add = add i32 %a, 1
  ; call @test3 8 times.
  call void @test3(i32 %add)
  ; ...
  ret void
}

define void @test2_a(i32 %a) {
  %add = add i32 %a, 1
  ; call @test4 64 times.
  call void @test4(i32 %add)
  call void @test4(i32 %add)
  ; ...
  ret void
}

declare void @test3(i32)
declare void @test4(i32)

Another insight that I’m currently trying to investigate is the order in which the bottom-up CGSCC inliner visits the graph. Currently it inlines in DFS post order. To enforce budgets and simultaneously distribute the inline budget to the profitable sites, post-order traversal of the SCCs most likely won’t be sufficient.


If we traverse each SCC bottom-up starting from the leaves, but visit the individual SCCs of the condensation of the call graph in topological order, we would have a clearer context and a way to enforce the budget while still letting the profitable call sites inline first, before the budget runs out.
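
A small sketch of that visit order, using a hypothetical condensation (the call graph here is made up). Edges map caller to callees; Python's `graphlib` treats the mapped nodes as predecessors, so `static_order()` emits callees before their callers, i.e. leaf SCCs first, while the condensation as a whole is walked in topological order:

```python
from graphlib import TopologicalSorter

# Hypothetical condensation: f calls g, g calls h, h calls i.
condensation = {"f": {"g"}, "g": {"h"}, "h": {"i"}, "i": set()}

# Callees come out before callers, giving a bottom-up schedule in
# which every caller sees fully processed callees.
order = list(TopologicalSorter(condensation).static_order())
print(order)  # ['i', 'h', 'g', 'f']
```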

I’m still in the early phases of experimentation, but so far it looks like the current CGSCC inliner is mostly invariant to the visit order. I also have a few ideas on inline budget assignment based on call-graph centrality and profile information. Early suggestions are welcome.