[RFC] Automatic static promotion of large local variables in Flang

In SPEC2017/627.cam4_s, specifically in cospsimulator_intr_run() and COSP(), there are some large derived type declarations. These two spots alone account for roughly 30% of the total execution time.

Simplified IR of type(cosp_gridbox) :: gbx_it:

%_QMmod_cosp_typesTcosp_gridbox = type { … }
@…DerivedInit = internal constant %_QMmod_cosp_typesTcosp_gridbox { … }
%75 = alloca %_QMmod_cosp_typesTcosp_gridbox, align 8
call void @llvm.memcpy(%75, @…DerivedInit, i64 17698768, …)

We found a critical hotspot during profiling with type(cosp_gridbox). Because the Fortran standard forces pointers to be initialized to NULL, the compiler has to initialize the huge fixed-size arrays inside the derived type as well. This triggers a massive 17MB llvm.memcpy every time the function is called. It’s killing the stack and wasting a ton of time.

GCC and ICX handle this by moving large local variables to static storage automatically. I tested this manually by adding the SAVE attribute to move the variable off the stack. The results were pretty good: about a 37% speedup on x86 (Intel i9-11900K) and 16% on RISC-V (SpacemiT K1).

SPEC2017 Benchmark Results

X86 (Intel i9-11900K)

llvm21.1.0 | Stack-Allocated Large Objects | Auto-Staticization of Large Variables | Speedup
627.cam4_s | 1485s | 1080s | 1.37×

My plan is to implement this ‘automatic static promotion’ in the Flang frontend. I want to add a -fmax-stack-var-size=n flag. If a variable is too big—and it’s safe to do so—we move it to static storage. This means we only pay the initialization cost once at startup instead of every function call. GCC defaults to 64K, and I’m wondering what threshold we should set.

Also, for OpenMP, GCC, ICX, and LLVM all keep variables on the stack for thread safety. However, this amplifies LLVM’s initialization bottleneck, as the massive overhead (as described above) is incurred on every single call. So I’m planning a more aggressive strategy: we automatically promote these large vars to static but attach the threadprivate attribute. By using TLS, we get thread safety, and eliminate the cost of repetitive initialization.

Test Results with 16 Threads in OpenMP Mode

X86 (Intel i9-11900K)

llvm21.1.0 | Stack-Allocated Large Objects | TLS-Based Static Storage | Speedup
627.cam4_s | 586s | 298s | 1.97×

Does anyone in the community have related plans or suggestions? Many thanks!

But LLVM really chokes on huge stack allocations.

Why not fix that? This shows up a lot in Fortran code, especially in non-OpenMP code.

Thanks for the comment!

To clarify, when I said ‘LLVM chokes’, I was referring specifically to the runtime overhead of repeated initialization, not the stack allocation itself.

Since the Fortran standard mandates pointer initialization (NULL), leaving these large variables on the stack forces the compiler to generate a massive llvm.memcpy (e.g., 17MB) on every single function entry. Even if we optimize the backend’s stack allocation mechanism, we cannot eliminate this O(N) initialization cost as long as the variable lives on the stack.

My proposal to promote them to static storage converts this expensive runtime initialization into a one-time load-time initialization (O(1)), which is the main source of the performance gain (37% on x86).

But Gfortran does not have this issue with -Ofast, which enables stack variables. It seems like something else is going wrong.

I consulted the GCC documentation regarding -fstack-arrays (which is implied by -Ofast). It explicitly defines the scope of this option:

Adding this option makes the Fortran compiler put all arrays of unknown size and array temporaries onto stack memory.

However, the variable in 627.cam4_s (type(cosp_gridbox) :: gbx_it) is a fixed-size scalar instance of a derived type.

I verified this by inspecting the LLVM IR, which shows a compile-time constant size (approx 17MB) and a standard structure allocation:

; Size 17698768 is a compile-time constant
call void @llvm.memcpy(..., i64 17698768, ...)

Since it is a fixed-size scalar, it falls outside the scope of -fstack-arrays. So, it remains subject to the default -fmax-stack-var-size limit.

To confirm this, I compiled with -Ofast -fstack-arrays and -Wsurprising, and Gfortran produced this warning:

Warning: Array ‘gbx_it’ at (1) is larger than limit set by ‘-fmax-stack-var-size=’, moved from stack to static storage. (See attached screenshot)

This confirms that Gfortran does not keep this variable on the stack, even with -Ofast. Instead, it promotes it to static storage due to its size. This supports my proposal: the performance advantage comes from static promotion (avoiding runtime initialization), not from stack allocation.

The standard is actually not what requires this. Flang's implementation does.

The standard only requires ALLOCATABLE entities to be initialized to the unallocated allocation status. For POINTER, it says the association status is undefined.

POINTER and ALLOCATABLE in Flang are implemented as descriptors. Flang chose (or rather needs) to initialize POINTER because leaving the descriptor undefined is a problem for the runtime (and some inline code), which may use the descriptor to get information about such a variable (declared type, rank) before the variable/component is associated.

I think it would be hard to change this and having the assurance that all descriptors are properly set-up before use is actually a quite sane/safe invariant to have for flang runtime. But you could still try to see what it would take to change this.

This seems illegal for the general case of derived type initialization, since the local variable should have the default value on every entry of the function.

Take for instance the following program: if you move `local` to be a static variable, it will have the wrong value on the second entry.

So your optimization would only be safe for functions that the compiler can prove are only entered once, which seems very restrictive.

module m 
implicit none
type t
 integer :: x(1000000) = 0
end type
contains
subroutine foo()
 type(t) :: local
 print *, local%x(1)
 local%x(1) = 1
end subroutine
end module

use m
call foo()
call foo()
end
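For readers who think in terms of generated code, the following is a rough C analogue of the Fortran example above. It is only a sketch under stated assumptions: `foo_automatic` and `foo_promoted` are made-up names, the array is shrunk to 1000 elements, and a plain C `static` stands in for naive static promotion without re-initialization.

```c
#include <string.h>

#define N 1000  /* shrunk from 1000000 to keep the sketch light */

struct t { int x[N]; };

/* Models the required semantics: `local` is default-initialized on every entry. */
int foo_automatic(void) {
    struct t local;
    memset(&local, 0, sizeof local);  /* default initialization: x = 0 */
    int seen = local.x[0];            /* what `print *, local%x(1)` would show */
    local.x[0] = 1;
    return seen;
}

/* Models naive promotion: initialized once at load time, never again. */
int foo_promoted(void) {
    static struct t local;            /* zero-initialized once */
    int seen = local.x[0];
    local.x[0] = 1;                   /* leaks into the next call */
    return seen;
}
```

Calling each function twice shows the difference: `foo_automatic` observes 0 both times, while `foo_promoted` observes 0 and then 1.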

So I’m planning a more aggressive strategy: we automatically promote these large vars to static but attach the threadprivate attribute. By using TLS, we get thread safety, and eliminate the cost of repetitive initialization.

What about recursion, even in single-threaded mode?
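To make the recursion concern concrete, here is a small C sketch. The function names are hypothetical and the "large variable" is reduced to a single int; the point is only that a static local is shared by all activations of a recursive procedure, so inner calls overwrite the value the outer call still needs.

```c
/* Automatic storage: every activation has its own `local`. */
int sum_automatic(int n) {
    int local = n;
    int rest = (n > 0) ? sum_automatic(n - 1) : 0;
    return local + rest;
}

/* Static storage: all activations share one `local`. */
int sum_static(int n) {
    static int local;
    local = n;
    int rest = (n > 0) ? sum_static(n - 1) : 0;
    return local + rest;  /* `local` was overwritten by the inner calls */
}
```

With n = 3 the automatic version returns 3 + 2 + 1 + 0 = 6, while the static version returns 0 because the innermost call leaves `local` at 0 for every outer activation. No threads are involved.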

I think the key point here regarding the SPEC case is the size of the memcpy. There is no need to set up the whole derived type storage, only the pointer component storage, which is a few bytes in that case.

So the easiest and safest optimization seems to be to detect cases where there is a single pointer component (or a few, with a threshold to be defined) for which “implicit init” is required, and to generate code that sets these to NULL instead of setting up the whole type.
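A minimal C sketch of the difference between the two initialization strategies follows. Everything in it is illustrative: `struct gridbox`, `init_whole`, and `init_pointer_only` are hypothetical names, a raw C pointer stands in for a Flang descriptor (which in reality carries more fields than one address), and the array is far smaller than the 17MB type from the benchmark.

```c
#include <stddef.h>
#include <string.h>

#define N 4096  /* stand-in for the large fixed-size arrays */

struct gridbox {
    double data[N];  /* needs no default initialization */
    double *ptr;     /* stand-in for a POINTER component's descriptor */
};

/* Template image, playing the role of the DerivedInit constant. */
static const struct gridbox gridbox_init = { .ptr = NULL };

/* Whole-object strategy: copies sizeof(struct gridbox) bytes on every call. */
void init_whole(struct gridbox *g) {
    memcpy(g, &gridbox_init, sizeof *g);
}

/* Component-wise strategy: writes only the field that requires it. */
void init_pointer_only(struct gridbox *g) {
    g->ptr = NULL;
}

/* Bytes written by each strategy, for comparison. */
size_t bytes_whole(void)        { return sizeof(struct gridbox); }
size_t bytes_pointer_only(void) { return sizeof(double *); }

/* Returns 1 if the component-wise path nulls the pointer of a dirty object. */
int check_pointer_only(void) {
    static struct gridbox g;
    g.ptr = &g.data[0];      /* make the object "dirty" first */
    init_pointer_only(&g);
    return g.ptr == NULL;
}
```

Both strategies leave the pointer in a well-defined state; the second simply leaves the array contents undefined, which is acceptable when those components have no default initialization.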

Thanks for the insight, that’s really helpful.

While initializing only pointer members is a valid approach, it still introduces some runtime overhead.

You are absolutely right regarding the standard semantics. Simply promoting local to static storage without handling re-initialization would indeed violate the default initialization rule (e.g., causing the second call to print 1 instead of 0).

Actually, I tested your example with gfortran, and the second call does correctly print 0. By analyzing the output of -fdump-tree-original, I found that GCC achieves this by:

  1. Promoting the variable to static.

  2. On every function entry, creating a temporary instance on the stack.

  3. Initializing that temporary instance.

  4. Copying (assigning) the temporary to the static variable.

void foo ()
{
  static struct t local = {.x={}};
  {
    struct t t.1;
    {
      integer(kind=8) S.2;

      S.2 = 1;
      while (1)
        {
          if (S.2 > 1000000) goto L.1;
          t.1.x[S.2 + -1] = 0;
          S.2 = S.2 + 1;
        }
      L.1:;
    }
    local = t.1;
  }
  ......
}

This approach is problematic because:

  • Stack Overflow Risk: The temporary variable is still fully allocated on the stack, so the risk of stack overflow remains unchanged.

  • Performance: It incurs double overhead: the cost of initializing the stack temporary plus the cost of copying it to static storage. This is likely even slower than LLVM’s current default stack allocation.

But, this gives us a blueprint. We could adopt a modified version of GCC’s logic:

A. If the derived type has NO default initializer: We can safely promote these large variables to static. We assume the user will manually assign values to the variable before use (as per standard behavior for undefined variables).

B. If the derived type HAS a default initializer: We have two options:

  1. Keep it on the stack: This is the simplest compliant behavior, though the stack overflow risk persists.

  2. (Proposed) Promote to static + In-place Initialization: We promote the variable to static to save stack space. But unlike GCC, we do not use a stack temporary. Instead, we generate code to initialize the static memory directly at the function entry. This ensures correctness and is significantly faster than memcpy-ing the entire derived type.
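As a rough C sketch of the two lowerings for case B, assuming a type whose default initializer is all zeros: `foo_temp_then_copy` mirrors the shape of the GCC dump above, and `foo_in_place` is the proposed variant. Both names are hypothetical, the array is shrunk to 1000 elements, and neither version is safe under recursion or concurrent calls.

```c
#include <string.h>

#define N 1000

struct t { int x[N]; };

/* GCC-like shape: build a stack temporary, then copy it into static storage. */
int foo_temp_then_copy(void) {
    static struct t local;
    struct t tmp;                     /* still occupies sizeof(struct t) of stack */
    memset(&tmp, 0, sizeof tmp);      /* first pass: initialize the temporary */
    local = tmp;                      /* second pass: copy into the static */
    int seen = local.x[0];
    local.x[0] = 1;
    return seen;
}

/* Proposed shape: reset the static storage directly, with no temporary. */
int foo_in_place(void) {
    static struct t local;
    memset(&local, 0, sizeof local);  /* single pass, no extra stack */
    int seen = local.x[0];
    local.x[0] = 1;
    return seen;
}
```

Both versions observe 0 on every call, so the default-initialization rule is preserved; the in-place version just avoids the temporary and the second pass over the data.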

We will explicitly exclude this optimization for procedures marked RECURSIVE or PURE, along with any other unsafe contexts.

RECURSIVE is the default behavior mandated by the Fortran standard since Fortran 2018, and flang being a new compiler is enforcing this by default since it is safer (the fact that there is no recursion cannot be enforced at compile time and is not verified at runtime) and more friendly with parallelization.

This ensures correctness and is significantly faster than memcpy-ing the entire derived type.

Why correlate the way the memory is initialized to the kind of memory? I do not see how choosing to do memcopy vs initializing only the components that require initialization could not apply to stack allocated variable.

though the stack overflow risk persists.

Stack overflow can be solved differently. First, you can run with unlimited stack, and the best compiler solution would be to add an option to allocate array and derived type locals on the heap instead of the stack. The cost of the heap allocation vs stack should be low for such big variables, and heap allocation is safe with regards to re-entrance and parallelism.

Related discussions:

IMHO the best default behavior is what gfortran does: Heap arrays by default, only automatically promote arrays of sufficiently small statically-known size (gfortran: 65kB) to stack.

The reason is that arrays can easily exhaust the default stack size (Linux pthreads: 2MB, even with ulimit set to unlimited). Better to have a slower program by default than a non-working one. We can override the default stack size, but that should be opt-in.

Why not select stack vs. heap based on the dynamic size?

Also IMHO: I think that would be fine if someone implemented it, but it has some additional concerns that may or may not be applicable to Flang:

  • Dynamic alloca may require the llvm.stacksave/llvm.stackrestore intrinsics and/or cannot be moved to a function’s entry BB. Some important optimizations then do not work anymore (mem2reg, SROA, stack coloring), diminishing some of the benefits of having it on the stack. Stack probing might also be more complicated.
  • Some code safety standards do not allow variable-sized stacks on principle (e.g. MISRA, Linux)
  • It could be possible to always reserve that max size on the stack, but use a heap allocation when it does not fit (like llvm::SmallVector). Might be considered waste of memory.
  • The C++ WG21 discussed such a “sometimes stack-based, sometimes heap-based” array and rejected it. I don’t know the reasons though.
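The "reserve the maximum on the stack, fall back to the heap" idea from the list above can be sketched in C as follows. This is an illustration only: `sum_with_scratch` and `INLINE_CAP` are hypothetical, and the threshold is arbitrary.

```c
#include <stdlib.h>

#define INLINE_CAP 256  /* hypothetical threshold, in elements */

/* Sums 0..n-1 through a scratch array: inline buffer when it fits,
   heap otherwise. Returns -1 if the heap allocation fails. */
long sum_with_scratch(size_t n) {
    long inline_buf[INLINE_CAP];        /* always reserved in the frame */
    long *buf = inline_buf;
    if (n > INLINE_CAP) {
        buf = malloc(n * sizeof *buf);  /* fallback for large sizes */
        if (buf == NULL) return -1;
    }
    long total = 0;
    for (size_t i = 0; i < n; ++i) buf[i] = (long)i;
    for (size_t i = 0; i < n; ++i) total += buf[i];
    if (buf != inline_buf) free(buf);   /* release only the heap case */
    return total;
}
```

The frame size stays a compile-time constant, so no dynamic alloca is needed, at the cost of reserving `INLINE_CAP` elements even when the heap path is taken.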

Thanks for the tip!

Our experiments on 627.cam4_s show that the issue isn’t stack vs. heap vs. static storage. It actually comes down to the initialization size.

We plan to move forward with the design to only initialize the pointer components in derived types, instead of the whole thing.

SPEC2017 Benchmark Results

X86 (Intel i9-11900K) | Runtime (s) | Ratio | Speedup
LLVM21 Base | 1492 | 5.94 | 1.00×
Initialize pointer components only | 1113 | 7.96 | 1.34×

Any advice before we implement this?

Thanks again!