[RFC] Global Function Merging

Function merging in LLVM currently merges only identical functions, which largely overlaps with the linker’s identical code folding (ICF). Swift’s function merger goes slightly further and can merge similar functions, but it has not yet been upstreamed – RFC for moving Swift’s merge function pass to LLVM. Both approaches rely on direct IR comparison to identify identical or similar functions. However, this is less effective in a separate compilation environment because the scope available for comparison is limited.

This RFC proposes a stable hash-based function merging approach, which can be used in separate compilation environments. Similar to Swift’s function merger, it tracks differences in constants. However, instead of direct comparison, it encodes stable functions into a map during the analysis phase.

During the merging step, it compares the hash summary with the current IR being processed. If a match is found, it optimistically creates a merging instance using a thunk to provide its specific context. Unlike traditional mergers, this method does not explicitly merge the call sites for safety. The actual size reduction occurs when identically created merging instances are folded by the linker. This approach is akin to the global function outlining introduced in a previous RFC which has been fully upstreamed – RFC Enhanced Machine Outliner Part 2: ThinLTO & NoLTO. You can also refer to a technical paper which is the basis for this PR – ACM Digital Library.
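As a rough illustration of the analysis/merge split described above, here is a small Python sketch. All names here (`stable_hash`, `analyze`, `try_merge`) are hypothetical; the actual pass hashes LLVM IR with its structural hash and tracks differences in constant operands.

```python
import hashlib

def stable_hash(insts):
    # Hash opcodes and non-constant operands; constants are replaced by a
    # placeholder so that similar functions produce the same hash.
    h = hashlib.sha256()
    diffs = []  # (inst_idx, operand_idx, constant) for each skipped constant
    for i, (op, operands) in enumerate(insts):
        h.update(op.encode())
        for j, v in enumerate(operands):
            if isinstance(v, int):            # treat ints as constants here
                h.update(b"<const>")
                diffs.append((i, j, v))
            else:
                h.update(v.encode())
    return h.hexdigest(), diffs

# Analysis phase: record (hash -> parameterized operand positions) per function.
stable_map = {}
def analyze(name, insts):
    hh, diffs = stable_hash(insts)
    stable_map.setdefault(hh, []).append((name, [(i, j) for i, j, _ in diffs]))

# Merge phase: on a hash match with identical parameter positions, the function
# is optimistically rewritten into <name>.Tgm plus a thunk passing the constants.
def try_merge(name, insts):
    hh, diffs = stable_hash(insts)
    positions = [(i, j) for i, j, _ in diffs]
    for other, other_pos in stable_map.get(hh, []):
        if other != name and other_pos == positions:
            params = [v for _, _, v in diffs]
            return f"{name}.Tgm", params      # thunk supplies the constants
    return None

analyze("f", [("add", ["x", 1]), ("ret", ["r"])])
analyze("g", [("add", ["x", 2]), ("ret", ["r"])])
print(try_merge("g", [("add", ["x", 2]), ("ret", ["r"])]))  # -> ('g.Tgm', [2])
```

Since call sites are not rewritten, each module emits its own thunk and `.Tgm` body; identical `.Tgm` bodies are later folded by the linker’s ICF.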

Results

The implementation was tested by building LLD for arm64 MachO with various configurations:

  • LM: Local merging (-enable-global-merge-func) from this RFC

  • LO: Local outlining (-enable-machine-outliner=always, which is on by default)

  • GM: Global merging (LM + using CG Data for merging) from this RFC

  • GO: Global outlining (LO + using CG Data for outlining) from the prior RFC

  • Gen: -fcodegen-data-generate generates CGData in each object file, which the linker merges into a .cgdata (same as the prior RFC)

  • Use: -fcodegen-data-use uses the CGData to perform GM and/or GO (same as the prior RFC)

  • Two-rounds: -codegen-data-thinlto-two-rounds is a special case of thin-LTO that performs the CGData generation and use in place (same as the prior RFC)



The results indicate that Global Merging (GM) achieves an additional 2-3% reduction in code size on top of the state-of-the-art Global Outlining (GO). Although GM may increase build times, particularly when code generation is repeated in place, this increase is minimal compared to the build times associated with full Link-Time Optimization (LTO). GM is also applicable in full-LTO scenarios, where it performs analysis and merging in place within a single large module.

| Benchmark | Code Size Change |
| --- | --- |
| 7zip/7zip-benchmark | -1.6% |
| Bullet/bullet | -0.3% |
| ClamAV/clamscan | 0.1% |
| consumer-typeset/consumer-typeset | -0.2% |
| kimwitu++/kc | -10.3% |
| lencod/lencod | -0.3% |
| mafft/pairlocalalign | -0.2% |
| SPASS/SPASS | -0.3% |
| sqlite3/sqlite3 | -0.1% |
| tramp3d-v4/tramp3d-v4 | -2.2% |

Additional data from the LLVM test suite targeting arm64 MachO with -Oz -flto=thin -codegen-data-thinlto-two-rounds -enable-global-merge-func indicates potential further savings of up to 10% on top of the state-of-the-art Global Outlining (GO) for certain C++ applications. This pass could be particularly useful for languages like Swift, or others that extensively use generics/templates or lambdas.

Planned PRs

  • Refactoring structural hash [NFC]: PR #112621

  • Structural hash tracking differences with a custom function: PR #112638

  • Function merging summary data (serialization/deserialization): PR #112662

  • llvm-cgdata integration for new stable function map data: PR #112664

  • The main global merging function pass: PR #112671

  • Mach-O LLD integration, currently plugging this pass into the precodegen pass to fit into the thinlto-two-rounds framework mentioned above: PR #112674


cc. @gl387 @amara @jinlin-bayarea @rlavaee @nocchijiang @petrhosek

Hi Kyungwoo,

Thanks for sharing your excellent work! I have been experimenting with the patches for a few days.

Besides some minor issues that I have posted as comments in the main PR, I believe we should improve the performance for non-LTO setups. The current implementation is barely usable for them because the overhead of loading and iterating the global stable function map is far too high. I am willing to take on this work, since there are quite a few projects at ByteDance that do not enable LTO. In case you have already made some unpublished improvements in this regard, I would like to hear your thoughts first.

Thank you for testing this out!

I haven’t yet explored optimizations for non-LTO setups, as we’re mostly interested in (dist) thin-LTO. There could be potential improvements from parallelizing the merging operation and restructuring the stable function map, possibly through compact string encoding or other methods. However, I haven’t implemented any enhancements in this area. If you’re interested in taking on this work, that would be fantastic!

Hi. I have two questions from my testing.

About merging the recursive function

I ran a simple test comparing this RFC’s GlobalMergeFunctions.cpp with the Swift repo’s LLVMMergeFunctions.cpp, using the test case from the Swift repo’s test/LLVMPasses/merge_func.ll.

Because this RFC’s merge pass does not support precise control for minimum instruction count, I modified the code to ignore the cost and always merge all functions. Here is the result:

  • LLVM’s one
define internal void @another_recursive_func(i32 %x) {
  tail call void @another_recursive_func.Tgm(i32 %x, ptr @g1, ptr @another_recursive_func) #0
  ret void
}

; Function Attrs: noinline
define internal void @not_really_recursive.Tgm(i32 %0, ptr %1, ptr %2) #0 {
  br i1 undef, label %bb1, label %bb2

bb1:                                              ; preds = %3
  store i32 %0, ptr %1, align 4
  call void %2(i32 %0)
  br label %bb2

bb2:                                              ; preds = %bb1, %3
  ret void
}

define internal void @not_really_recursive(i32 %x) {
  tail call void @not_really_recursive.Tgm(i32 %x, ptr @g2, ptr @callee1) #0
  ret void
}

define void @call_recursive_funcs(i32 %x) {
  call void @recursive1(i32 %x, i32 %x)
  call void @recursive2(i32 %x, i32 %x)
  call void @another_recursive_func(i32 %x)
  call void @not_really_recursive(i32 %x)
  ret void
}
  • Swift’s one
define internal void @another_recursive_func(i32 %x) {
  tail call void @another_recursive_funcTm(i32 %x, ptr @g1, ptr @another_recursive_func)
  ret void
}

define internal void @another_recursive_funcTm(i32 %0, ptr %1, ptr %2) {
  br i1 undef, label %bb1, label %bb2

bb1:                                              ; preds = %3
  store i32 %0, ptr %1, align 4
  call void %2(i32 %0)
  br label %bb2

bb2:                                              ; preds = %bb1, %3
  ret void
}

define void @call_recursive_funcs(i32 %x) {
  call void @recursive1(i32 %x, i32 %x)
  call void @recursive2(i32 %x, i32 %x)
  call void @another_recursive_func(i32 %x)
  call void @another_recursive_funcTm(i32 %x, ptr @g2, ptr @callee1)
  ret void
}

I found a difference: Swift’s pass replaces (inlines?) the call to @not_really_recursive with the merged function @another_recursive_funcTm, which looks a little strange.

By blaming the test case, I found this relates to a commit: MergeFunctions: handle self recursive functions correctly. · swiftlang/swift@2584078 · GitHub

And source code logic here: swift/lib/LLVMPasses/LLVMMergeFunctions.cpp at 45d15d1268868b42f87daead40eed8bb6bdddeb7 · swiftlang/swift · GitHub

Is this case still needed now? Or is this reasonable, since LLVM has its own inlining pass?

About the GFM and linker ICF

It seems the current GFM relies on the linker to ICF identical merged atoms; you can see the test case result here:

; Function Attrs: noinline
define internal i32 @simple_func1.Tgm(i32 %0, i32 %1, ptr %2) #0 {
  %sum = add i32 %0, %1
  %sum2 = add i32 %sum, %1
  %l = load i32, ptr %2, align 4
  %sum3 = add i32 %sum2, %1
  ret i32 %sum3
}

define i32 @simple_func1(i32 %x, i32 %y) {
  %1 = tail call i32 @simple_func1.Tgm(i32 %x, i32 %y, ptr @g1) #0
  ret i32 %1
}

; Function Attrs: noinline
define internal i32 @simple_func2.Tgm(i32 %0, i32 %1, ptr %2) #0 {
  %sum = add i32 %0, %1
  %sum2 = add i32 %sum, %1
  %l = load i32, ptr %2, align 4
  %sum3 = add i32 %sum2, %1
  ret i32 %sum3
}

define i32 @simple_func2(i32 %x, i32 %y) {
  %1 = tail call i32 @simple_func2.Tgm(i32 %x, i32 %y, ptr @g2) #0
  ret i32 %1
}

Is there any reason why we cannot do the same thing as Swift’s merger and replace all calls to the merged functions with a single one?

Because of this reliance on ICF, I’ve also found that the Global Outliner ([RFC] Enhanced Machine Outliner – Part 2: ThinLTO/NoLTO - #13 by kyulee-com) needs to handle some special cases generated from the GFM.

Because this RFC’s merge pass does not support precise control for minimum instruction count

We can control this when finalizing candidates – llvm-project/llvm/lib/CGData/StableFunctionMap.cpp at main · llvm/llvm-project · GitHub
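For illustration, this kind of gating when finalizing candidates could be sketched as follows. The `finalize`, `min_insts`, and `min_group` names are hypothetical and not the actual StableFunctionMap logic:

```python
def finalize(candidates, min_insts=2, min_group=2):
    # Keep only hash groups that have enough members whose functions are
    # large enough to be worth the thunk/call overhead of merging.
    out = {}
    for h, group in candidates.items():
        group = [g for g in group if g["num_insts"] >= min_insts]
        if len(group) >= min_group:
            out[h] = group
    return out

cands = {0x1: [{"name": "f", "num_insts": 5}, {"name": "g", "num_insts": 5}],
         0x2: [{"name": "tiny", "num_insts": 1}, {"name": "tiny2", "num_insts": 1}]}
print(sorted(finalize(cands)))   # -> [1]  (the too-small group is dropped)
```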

Is this case still needed now?

I’m not entirely sure about the recursion support in the original Swift merging pass, but this new approach intentionally avoids directly folding the merged candidates, ensuring safety.

Is there any reason why we cannot do the same thing as Swift’s merger and replace all calls to the merged functions with a single one?

By design, we should avoid directly merging calls based solely on hash matches, as this could result in mis-compilation. The core concept of this approach is to optimistically generate merging instances (candidates), allowing the linker’s Identical Code Folding (ICF) to safely perform the final merging. This method enables the merging of candidates across different compilation units.

Regarding the existing Swift merging pass in the Swift front-end, I believe we could eventually remove it and rely solely on this pass during LLVM’s code generation – I think you can also effectively disable it by passing threshold = 0. While having both passes shouldn’t be harmful, it might lead to unnecessary processing since the latter subsumes the former. Additionally, this new pass in LLVM can more effectively support ThinLTO and LTO.

but this new approach intentionally avoids directly folding the merged candidates, ensuring safety.

Agreed. Swift’s pass seems to inline during merging, which is not always suitable for LLVM.

allowing the linker’s Identical Code Folding (ICF) to safely perform the final merging. This method enables the merging of candidates across different compilation units.

Actually, if we have the linker’s ICF support, merged symbols generated from different compilation units can be detected and replaced with a shared one. That means:

Current llvm RFC:

  • simple_func1
  • simple_func1.Tgm
  • simple_func2
  • simple_func2.Tgm
  • simple_func3 (from another.ll)
  • simple_func3.Tgm (from another.ll)

Swift’s one

  • simple_func1
  • simple_func1.Tgm
  • simple_func2
  • simple_func3 (from another.ll)
  • simple_func3.Tgm (from another.ll)

These two will result in the same binary if the linker handles ICF correctly, right?

But this may introduce a requirement on the linker… Some old linkers do not have such an optimization, which may cause a regression. It’s OK for LLD, but I haven’t tested whether Apple’s ld-prime (the built-in linker) works for this case.

I guess the choice to rely on ICF is more about DWARF usage? Since we can keep the merged symbols’ info in the linker and generate correct DWARF recording where those merged symbols came from (e.g., when simple_func1.Tgm is called, is that from simple_func1, simple_func2, or simple_func3).

These two will result in the same binary if the linker handles ICF correctly, right?

The outcome depends on how we parameterize differences between similar functions. When choosing parameters among functions within a single module versus across different modules, it can result in distinct sets of parameters. Only if the merging instances happen to be identical will they be folded by the linker.

Note this new approach utilizes CGData, which enables consistent determination of parameters offline or at link time across different modules. This is an improvement over the existing Swift approach, which lacks this capability.
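A toy sketch of why the parameter sets can differ (the `choose_params` helper is hypothetical; real parameterization works over IR operands): when each module picks parameters from only the candidates it sees locally, the resulting merged bodies may not be byte-identical to those derived from the global view, so only coincidentally identical instances fold.

```python
def choose_params(funcs):
    # Parameterize exactly the operand positions where the candidate
    # functions disagree; positions that agree stay hard-coded.
    n = len(funcs[0])
    return sorted(i for i in range(n)
                  if len({f[i] for f in funcs}) > 1)

f1, f2, f3 = [1, 10, 7], [2, 10, 7], [2, 10, 9]   # constant operand lists

# Module A sees only f1 and f2: one parameter (position 0).
print(choose_params([f1, f2]))        # -> [0]
# The global view also sees f3: two parameters (positions 0 and 2).
print(choose_params([f1, f2, f3]))    # -> [0, 2]
```

With CGData, every module consults the same global map, so all of them derive the `[0, 2]` parameterization and emit identical, foldable instances.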

It’s OK for LLD, but I haven’t tested whether Apple’s ld-prime (the built-in linker) works for this case.

Not sure about ld-prime. But regarding the linker’s ICF (Identical Code Folding) feature, it is a relatively common functionality supported by most modern linkers.

I guess the choice to rely on ICF is more about DWARF usage?

Relying on ICF (Identical Code Folding) is primarily about ensuring correctness, as hash-based merging cannot guarantee the identifiability of two functions. Additionally, the use of DWARF for merging instances is not yet fully supported. We are working on upstreaming this in the context of symbolication of stack traces, as mentioned in [RFC] Enhanced Machine Outliner – Part 2: ThinLTO/NoLTO - #11 by kyulee-com. cc @alx32


Here is my plan for improving the performance of loading the global stable function map for non-LTO setups:

  • The Names are unused during actual consumption in the Use pass. I believe we should avoid loading them into the NameToId map to reduce the memory footprint. This should also apply to all scenarios regardless of LTO configuration. To make it happen, we have to move the Names to the end of the serialized file and adjust the existing assertions accordingly (e.g., skipping them during the Use pass).

  • We should make a dedicated mode for non-LTO/traditional one-compiler-process-per-source-file setups in which the map is lazily loaded. The mode should be toggleable via a new flag. This would also lead to a major structural change to the serialized file:

    • Write FuncEntries.size() as before.
    • Write all the (sorted) hashes. They would be the only part that we would unconditionally deserialize into memory for fast binary search.
    • Write the fixed-size fields for each entry. Compute an offset that points to the variable-sized IndexOperandHash vector.
    • Write the IndexOperandHash vector: size first, followed by the entries.
    • Write Names.

    To make the changes transparent to the Use pass, we would need to make a new StableFunctionMap subclass that manages the lazy-loading process: when at() is called for the first time for a given hash during the lifetime of the map instance, it would load all the matching entries from file into memory.
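If it helps, here is a rough Python sketch of the proposed layout and lazy lookup. The field widths are made up, and it assumes one entry per hash for brevity, whereas the real map can hold several entries per hash:

```python
import bisect, io, struct

def serialize(entries):
    # entries: list of (hash, payload-bytes) pairs; hashes are written sorted.
    entries = sorted(entries)
    buf = io.BytesIO()
    buf.write(struct.pack("<I", len(entries)))          # FuncEntries.size()
    for h, _ in entries:                                # sorted hashes: the only
        buf.write(struct.pack("<Q", h))                 # eagerly loaded part
    body = io.BytesIO()
    offsets = []
    for _, payload in entries:                          # variable-sized vectors
        offsets.append(body.tell())
        body.write(struct.pack("<I", len(payload)))     # size first, then data
        body.write(payload)
    for off in offsets:                                 # fixed-size offset table
        buf.write(struct.pack("<I", off))
    buf.write(body.getvalue())                          # Names would follow here
    return buf.getvalue()

class LazyMap:
    """Loads only the hash array eagerly; entries are read on first at()."""
    def __init__(self, data):
        self.data = data
        (self.n,) = struct.unpack_from("<I", data, 0)
        self.hashes = list(struct.unpack_from(f"<{self.n}Q", data, 4))
        self.off_base = 4 + 8 * self.n
        self.body_base = self.off_base + 4 * self.n
        self.cache = {}
    def at(self, h):
        if h in self.cache:
            return self.cache[h]
        i = bisect.bisect_left(self.hashes, h)          # fast binary search
        if i == self.n or self.hashes[i] != h:
            return None
        (off,) = struct.unpack_from("<I", self.data, self.off_base + 4 * i)
        (sz,) = struct.unpack_from("<I", self.data, self.body_base + off)
        payload = self.data[self.body_base + off + 4:][:sz]
        self.cache[h] = payload
        return payload

m = LazyMap(serialize([(0xdead, b"f1"), (0xbeef, b"f2")]))
print(m.at(0xbeef))   # b'f2'
print(m.at(0x1234))   # None (hash not present, nothing loaded)
```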

What are your thoughts on this approach? Any feedback is appreciated.