In SPEC2017/627.cam4_s, specifically in cospsimulator_intr_run() and COSP(), there are some large derived type declarations. These two spots alone account for roughly 30% of the total execution time.
Simplified IR of type(cosp_gridbox) :: gbx_it:
%_QMmod_cosp_typesTcosp_gridbox = type { … }
@…DerivedInit = internal constant %_QMmod_cosp_typesTcosp_gridbox { … }
%75 = alloca %_QMmod_cosp_typesTcosp_gridbox, align 8
call void @llvm.memcpy(%75, @…DerivedInit, i64 17698768, …)
Profiling showed a critical hotspot around type(cosp_gridbox). Because the Fortran standard forces pointers to be initialized to NULL, the compiler ends up initializing the whole object, including the huge fixed-size arrays inside the derived type. This triggers a massive ~17 MB llvm.memcpy (17,698,768 bytes in the IR above) every time the function is called. It's putting heavy pressure on the stack and wasting a ton of time.
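To make the pattern concrete, here is a minimal sketch of the kind of declaration involved. It is not the real cosp_gridbox: the module, type, component names and array sizes are hypothetical, and the explicit `=> null()` default initialization is an assumption about how the pointer components end up initialized.

```fortran
! Hypothetical stand-in for cosp_gridbox; names and sizes are illustrative only.
module demo_types
  implicit none
  integer, parameter :: npoints = 1024, nlevels = 64

  type :: big_gridbox
    ! A single default-initialized pointer component gives the whole type
    ! default initialization.
    real, pointer :: zlev(:) => null()
    ! Large fixed-size component: 1024*64 default reals.
    real :: q(npoints, nlevels)
  end type big_gridbox

contains

  ! gbx is a non-saved local, so its default initialization is applied on
  ! every invocation of the function.
  function fresh_on_entry() result(fresh)
    logical :: fresh
    type(big_gridbox) :: gbx

    ! Well defined: default initialization guarantees a disassociated pointer.
    fresh = .not. associated(gbx%zlev)
    gbx%q(1, 1) = 1.0
  end function fresh_on_entry

end module demo_types
```

Only the pointer component needs a defined initial state here; the cost described above comes from the whole object, array included, being copied from an initialization template on each call.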
GCC and ICX handle this by moving large local variables to static storage automatically. I tested this manually by adding the SAVE attribute to move the variable off the stack. The results were pretty good: about a 37% speedup on x86 (Intel i9-11900K) and 16% on RISC-V (SpacemiT K1).
SPEC2017 Benchmark Results
| X86 (Intel i9-11900K), llvm21.1.0 | Stack-Allocated Large Objects | Auto-Staticization of Large Variables | SpeedUp |
|---|---|---|---|
| 627.cam4_s | 1485s | 1080s | 1.37× |
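The manual SAVE experiment above can be sketched as follows. This is an illustrative reduction, not the actual change made to cam4: the module, type and function names are hypothetical, and the `ncalls` component exists only to make the behavior observable.

```fortran
! Hypothetical reduction of the SAVE workaround; names are illustrative only.
module demo_save
  implicit none

  type :: big_gridbox
    real, pointer :: zlev(:) => null()
    integer :: ncalls = 0
    real :: q(1024, 64)
  end type big_gridbox

contains

  ! With SAVE, gbx has static storage: its default initialization is applied
  ! once, and its contents persist from one call to the next.
  function bump_saved() result(n)
    integer :: n
    type(big_gridbox), save :: gbx

    gbx%ncalls = gbx%ncalls + 1
    n = gbx%ncalls
  end function bump_saved

end module demo_save
```

Successive calls return 1, 2, 3, and so on, whereas a non-saved local would be re-initialized and return 1 every time. That persistence is the semantic difference any automatic promotion has to prove harmless before it applies.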
My plan is to implement this ‘automatic static promotion’ in the Flang frontend. I want to add a -fmax-stack-var-size=n flag. If a variable is larger than n bytes, and it’s safe to do so, we move it to static storage. This means we only pay the initialization cost once at startup instead of on every function call. GCC defaults to 64K, and I’m wondering what threshold we should set.
Also, when compiling with OpenMP, GCC, ICX, and LLVM all keep these variables on the stack for thread safety. However, this amplifies LLVM’s initialization bottleneck, because the massive memcpy described above is incurred on every single call in every thread. So I’m planning a more aggressive strategy: automatically promote these large variables to static storage but attach the threadprivate attribute. By using TLS, we keep thread safety and eliminate the cost of repeated initialization.
Test Results with 16 Threads in OpenMP Mode
| X86 (Intel i9-11900K), llvm21.1.0 | Stack-Allocated Large Objects | TLS-Based Static Storage | SpeedUp |
|---|---|---|---|
| 627.cam4_s | 586s | 298s | 1.97× |
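At the source level, the TLS-based variant corresponds roughly to a saved local marked with the OpenMP threadprivate directive. The sketch below shows what that would look like if written by hand; it is an assumption about the equivalent source form, not the proposed frontend implementation, and all names are hypothetical.

```fortran
! Hypothetical hand-written equivalent of "static + threadprivate".
module demo_tls
  implicit none

  type :: big_gridbox
    real, pointer :: zlev(:) => null()
    integer :: ncalls = 0
    real :: q(1024, 64)
  end type big_gridbox

contains

  function bump_threadprivate() result(n)
    integer :: n
    ! A local variable named in threadprivate must have the SAVE attribute.
    type(big_gridbox), save :: gbx
    !$omp threadprivate(gbx)

    ! Each thread updates its own copy, so there is no data race on gbx.
    gbx%ncalls = gbx%ncalls + 1
    n = gbx%ncalls
  end function bump_threadprivate

end module demo_tls
```

Without OpenMP enabled the directive is just a comment and this behaves like the plain SAVE version. With OpenMP enabled each thread works on its own copy, so the per-call initialization goes away without sharing one object between threads.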
Does anyone in the community have plans related to this, or suggestions? Many thanks!
