[RFC] Recursive Inlining

In certain benchmarks involving simple recursive programs (e.g. SingleSource/Benchmarks/Shootout in llvm-test-suite), we noticed that GCC outperforms Clang due to what appears to be recursive inlining (see the Compiler Explorer comparison). The idea is to iteratively inline a function into its own recursive call sites up to a certain recursion depth. Such an optimization could be quite beneficial for simple functions with a single recursive call site, providing benefits similar to loop unrolling.
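As a source-level illustration (a hand-written sketch, not actual compiler output), inlining a function with a single recursive call site into itself once looks like this:

// Original: a single recursive call site.
long sum(long n) {
  if (n == 0)
    return 0;
  return n + sum(n - 1);
}

// After recursive inlining to depth 1: the body is cloned into its own
// call site, so each invocation now does two steps of work, much like
// unrolling a loop by a factor of two.
long sum_unrolled(long n) {
  if (n == 0)
    return 0;
  long inner;            // result of the inlined sum(n - 1)
  if (n - 1 == 0)
    inner = 0;
  else
    inner = (n - 1) + sum_unrolled(n - 2);
  return n + inner;
}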

There was a past attempt to enable this, but it is unclear to me why the patch was abandoned:
https://reviews.llvm.org/D6996

There was some discussion here about manually transforming such functions into loops using an explicit stack, but it was noted that such transformations (like the original recursive-inlining approach) interfere with tail recursion elimination (TRE).
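For reference, that stack-based rewrite would look roughly like this for a Fibonacci-shaped function (a hand-written sketch, not an existing pass):

#include <vector>

// fib(n) counts the base-case leaves of its recursion tree, so the
// recursion can be replaced by an explicit work stack. Note that the
// result is neither tail-recursive nor a simple counted loop, which is
// why this shape gets in the way of TRE.
long fib_stack(long n) {
  std::vector<long> work = {n};
  long acc = 0;
  while (!work.empty()) {
    long m = work.back();
    work.pop_back();
    if (m == 1 || m == 2) {
      acc += 1;                // base case contributes 1
    } else {
      work.push_back(m - 1);   // defer both recursive "calls"
      work.push_back(m - 2);
    }
  }
  return acc;
}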

Indeed, when we enabled recursive inlining naively, similarly to the original patch, we found no improvement on the Fibonacci-like examples we were considering. This is understandable: if a function has n recursive call sites, inlining to depth d leaves n^(d+1) call sites, so the total grows exponentially with the depth and eliminates any opportunity for optimizations like TRE.
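Concretely, for the Fibonacci function (n = 2) at depth 1, the naive result looks like this hand-written sketch, with four call sites and none of them in tail position:

// fib has two recursive call sites; cloning the body into both of them
// leaves 2 * 2 = 4 call sites, and none of them is a tail call, so TRE
// cannot fire on the result.
long fib(long n) {
  if (n <= 2)
    return 1;
  long a = (n - 1 <= 2) ? 1 : fib(n - 2) + fib(n - 3); // inlined fib(n - 1)
  long b = (n - 2 <= 2) ? 1 : fib(n - 3) + fib(n - 4); // inlined fib(n - 2)
  return a + b;
}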

Instead, we would like recursive inlining to happen after function simplification passes such as TRE have been applied. Ordinary inlining naturally occurs after the callees have been simplified but before the caller has: as we walk the call graph bottom-up over SCCs, the first pass in the CGSCC pipeline for a given SCC is the inliner pass itself, followed by the function simplification pipeline (adapted to the SCC).

In the case of a recursive function, its SCC has a self-edge, so enabling recursive inlining directly (by generalizing the existing InlinerPass) means we inline at the recursive call sites before the recursive function itself has been simplified, which is not what we want here. Instead, we tried disabling recursive inlining by default and only allowing it in a second InlinerPass inserted after the function simplification pipeline:

ModulePassManager
PassBuilder::buildModuleSimplificationPipeline(...) {
  ...
  CGSCCPassManager &MainCGPipeline = MIWP.getPM();
  // At this point in the pipeline, all non-recursive callees have been
  // simplified and potentially inlined into callers in the current SCC.
  ...

  // Add the core function simplification pipeline nested inside the
  // CGSCC walk.
  MainCGPipeline.addPass(createCGSCCToFunctionPassAdaptor(
      buildFunctionSimplificationPipeline(Level, Phase),
      PTO.EagerlyInvalidateAnalyses, /*NoRerun=*/true));

  // Recursive inlining occurs at this point.
  MainCGPipeline.addPass(
      InlinerPass(InlinerPassOptions(/*OnlyMandatory=*/false, /*Recurse=*/true)));
  ...
}
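Here InlinerPassOptions is a new struct we sketched (it does not exist upstream; the current InlinerPass constructor only takes an OnlyMandatory flag), roughly:

// Hypothetical options struct implied by the snippet above; the field
// names are ours, not upstream API.
struct InlinerPassOptions {
  bool OnlyMandatory = false; // mirrors the existing InlinerPass(bool) flag
  bool Recurse = false;       // allow inlining into self-recursive call sites
};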

With these modifications, we get good results on a simple Fibonacci function with two recursive calls, because TRE gets the chance to reduce the number of recursive call sites to one, which is the ideal scenario for recursive inlining. With a recursion depth of 1, we can transform a function like

define i64 @fib(i64 %n) {
entry:
  switch i64 %n, label %recurse [
    i64 1, label %base
    i64 2, label %base
  ]

base:
  ret i64 1

recurse:
  %0 = add i64 %n, -1
  %recurse1 = call i64 @fib(i64 %0)
  %1 = add i64 %n, -2
  %recurse2 = call i64 @fib(i64 %1)
  %add = add nsw i64 %recurse2, %recurse1
  ret i64 %add
}

into

define i64 @fib(i64 %n) {
entry:
  br label %tailrecurse

tailrecurse:                                      ; preds = %fib.exit, %entry
  %accumulator.tr = phi i64 [ 0, %entry ], [ %add, %fib.exit ]
  %n.tr = phi i64 [ %n, %entry ], [ %3, %fib.exit ]
  switch i64 %n.tr, label %recurse [
    i64 1, label %base
    i64 2, label %base
  ]

base:                                             ; preds = %tailrecurse, %tailrecurse
  %accumulator.ret.tr = add nsw i64 %accumulator.tr, 1
  ret i64 %accumulator.ret.tr

recurse:                                          ; preds = %tailrecurse
  %0 = add i64 %n.tr, -1
  br label %tailrecurse.i

tailrecurse.i:                                    ; preds = %recurse.i, %recurse
  %accumulator.tr.i = phi i64 [ 0, %recurse ], [ %add.i, %recurse.i ]
  %n.tr.i = phi i64 [ %0, %recurse ], [ %2, %recurse.i ]
  switch i64 %n.tr.i, label %recurse.i [
    i64 1, label %fib.exit
    i64 2, label %fib.exit
  ]

recurse.i:                                        ; preds = %tailrecurse.i
  %1 = add i64 %n.tr.i, -1
  %recurse1.i = tail call i64 @fib(i64 %1) #0
  %2 = add i64 %n.tr.i, -2
  %add.i = add nsw i64 %accumulator.tr.i, %recurse1.i
  br label %tailrecurse.i

fib.exit:                                         ; preds = %tailrecurse.i, %tailrecurse.i
  %accumulator.ret.tr.i = add nsw i64 %accumulator.tr.i, 1
  %3 = add i64 %n.tr, -2
  %add = add nsw i64 %accumulator.tr, %accumulator.ret.tr.i
  br label %tailrecurse
}
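To make the resulting structure easier to see, here is the same function rendered back as C++ (a hand-written approximation of the IR above): the TRE loop of the caller now nests the TRE loop of the inlined copy, with a single recursive call left inside.

long fib(long n) {
  long acc = 0;
  for (;;) {                   // outer loop from TRE of the caller
    if (n == 1 || n == 2)
      return acc + 1;
    long acc_i = 0;
    long m = n - 1;            // inlined copy of fib(n - 1)
    for (;;) {                 // inner loop from TRE of the inlined body
      if (m == 1 || m == 2) {
        acc_i += 1;
        break;
      }
      acc_i += fib(m - 1);     // the one remaining recursive call
      m -= 2;
    }
    acc += acc_i;
    n -= 2;
  }
}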

With a maximum depth of 3, we saw a 150% improvement on this Fibonacci benchmark on a RISC-V CPU, and a 65% improvement on a slightly more complicated Ackermann function benchmark. However, as this is a significant change, we wanted to get your feedback on the general approach and surface any concerns we'd need to address.

The optimization should probably be restricted to functions with at most one recursive call site, at least initially. What changes, if any, should be made to the inline cost heuristics to account for the fact that recursive inlining in this setting behaves very much like loop unrolling?
