Impact of stack size on inlining decisions

Hello, everyone!

We (at Sony) have recently observed some stack size increases that are
largely due to more aggressive inlining. During the investigation we
found that clang/llvm does not seem to consider stack size in its inlining
decisions. In the following contrived example

// big_stack.c
#ifndef SVAR_SIZE_K                                         
#define SVAR_SIZE_K (2048)                                  
#endif                                                    
                                                          
extern void init(int *a);                                
static int foo(int x) {                                   
  int a[SVAR_SIZE_K * 1024];
  init(a);
  return a[x];                                            
}                                                         
                                                          
int bar(int x, int y) {                                   
  if (x < y) {                                            
    return foo(x);
  } else {                                                
    return 0;                                             
  }                                                       
}                                                         
// ----------------

foo() is inlined into bar() no matter how large you make SVAR_K_SIZE, even
though the call too foo is not in straightline code and the compiler has no
idea (at least in the absence of any profiling data) whether foo() is
called often, rarely, ot not at all.

Gcc, on the other hand, supports a couple of flags that influence inlining behavior wrt
stack size:

--param large-stack-frame=<N>
--param large-stack-frame-growth=<P>

If you use the first option, a caller’s stack size cannot grow beyond N bytes, although
the documentation seems to imply that there is a little bit of leeway. 256 bytes seems to
be the default on Linux.

Using the second option limits a stack frame’s growth due to inlining to P percent
of the original stack size. So --param large-stack-frame-growth=2000 means the stack
can’t grow by more than 20 times. The default appears to be 1000, i.e. 10 times the
original stack size. If large-stack-frame is also given, it takes preference over
large-stack-frame-growth.

I’d like to first understand if I missed something in the code that already does what I’m looking for, and if not, solicit some opinions on whether we should

  1. mimic gcc’s behavior by adding similar options and thresholds to potentially
    suppress inlining of functions with large stack sizes.

  2. perhaps go further and introduce a more fine-grained approach, where stack size
    and other factors such as call hotness are considered in the inlining decision.

    Some points to consider would be:

    • The stack size of the inlined function directly contributes to the inlining
      cost.
    • The relative call hotness contributes inversely to the inlining cost, i.e.
      the more often a function is called, the less it matters whether its stack
      is allocated as part of its caller’s stack, since its going to be called
      often anyway. The extreme case is when the function is called in straightline
      code. The stack size should then not matter for the inlining decision.

    So roughly, C = max(S - T, 0)/H, with C being the inlining cost contribution,
    S being the stack size, T a threshold below which we don’t consider the stack
    size at all (could also be a percentage of the caller’s stack size), and H a hotness
    measure.

Thanks,
Wolfgang Pieb
Sony Interactive Entertainment (SIE)

https://github.com/llvm/llvm-project/blob/e91a73de24d60954700d7ac0293c050ab2cbe90b/llvm/lib/Analysis/InlineCost.cpp#L1357

There’s a little bit of logic around inlining large allocas, although in this case it’s skipped because the alloca is alloca [2097152 x i32], rather than alloca i32, 2097152. We should make this apply to all static allocas.

Sorry, that is specifically to prevent promoting a potentially large dynamic alloca to a static alloca, which is a separate issue than simply preventing promotion of large static allocas.

The easiest thing to start experimenting with would be a constant penalty for allocas above a certain size. That would fit in with the existing hotness inline thresholds.

Thanks,

I’d like to propose to experimentally add to clang a switch -finline-max-stacksize=<N>, which prevents the inliner from inlining any function with a stack size larger than N bytes. This would effectively mimic gcc’s --param large-stack-frame=<N>. In the absence of the switch the compiler behaves as it does now. We could introduce a sensible default if others are in favor of it.

I’d like to propose to experimentally add to clang a switch -finline-max-stacksize=<N> , which prevents the inliner from inlining any function with a stack size larger than N bytes

This may be your intention, but it’d be great if this also worked with link-time optimization (LTO) since that is where we see a lot more aggressive inlining.

Right. The intent is certainly to translate the option to some sort of -mllvm option and pass it on to both clangcc1 and lld.