[RFC] Interprocedural Identical Basic Block Folding

IBBF: Inter-procedural Identical Basic Block Folding

Summary

We looked at folding basic blocks for more code size gains, initial experiments show that folding identical basic blocks can potentially reduce text size of large binaries by 20% or more. This can be done without affecting performance by using profile information to only fold cold basic blocks. This feature, like identical function folding, can produce backtraces that are nonsensical. However, these can be reasonably mitigated with improved debug information representation for folded basic blocks.

Background

Linker ICF, available in lld, has been used to fold identical functions and reduce code size of binaries by 6% to 10%. This was made possible due to the availability of linker function sections which makes it possible to identify and fold identical functions in the linker. Recently, Propeller added support for basic block sections. This sparked the question whether basic blocks could be similarly folded, finer in granularity but potentially more rewarding.

Unlike functions, folding basic blocks is a much harder problem. Basic blocks are entered and exited via jump instructions or fall-throughs, unlike call-ret semantics of functions, and folding basic blocks would require additional patching code to keep track of where to jump back to after exiting. Hence, folding really small basic blocks would not be net profitable and this needs to be done only for basic blocks above a certain threshold. Further, basic block layout is one of the key determining factors of performance and folding hot basic blocks could be counter-productive. Hence, this should be profile-guided and only be applied to cold basic blocks and very selectively to hot basic blocks.

Initial experiments on large binaries, like clang, show that 20% to 35% of basic blocks that are 20 bytes or larger are duplicated and can potentially bring in code size reductions of 20% or more.

Experiments

An experimental tool has been developed to statically analyze binaries for basic block similarity on Meta’s BOLT tool. We looked at two large binaries, clang and a large internal binary. While these experiments were primarily on X86 binaries, we believe that ARM binaries should show similar improvements. Both were built using Linker GC and ICF to eliminate false positives from dead sections and folded functions. The table below shows that estimate in size savings and does not take into account additional code that might be needed to materialize the savings. Further, only basic blocks that are at least 20 bytes are considered for folding. This is because some patch code is needed to materialize the savings and basic blocks below 20 bytes might not be net profitable to fold.

Binary Total Basic Blocks Redundant Basic Blocks (% of Total) Indirect Jump or Ret Basic Blocks (% of Redundant) Code Size of Redundant Blocks in bytes (% of Total) Unique Blocks with Copies (Ave. Copies)
clang 2614601 548441 (21%) 19090 (3.5%) 14774949 (25%) 95505(5)
lld 1420571 218589 (15%) 9451 (4.3%) 5899813 (18%) 59612(3)
Internal _ _ _ _ (29%) _

For example, in clang, 21% of all basic blocks that are at least 20 bytes have a duplicate and can be eliminated as they are redundant. Here, the 21% only counts the redundant blocks and not the single copy that must be kept. 3.5% of such blocks end with a return instruction or an indirect jump and can be easily folded. These redundant blocks account for 25% of the total code size (text size of all basic blocks).

The histogram shows the frequency of the various sizes of redundant basic blocks, with the Y-axis in log scale. Folding basic blocks is not free as some patch code needs to be added to return to the right destination after jumping to the folded block. Larger block sizes make it more profitable to fold.

Attribution to text section type

We looked at the % of basic blocks in the hot, cold and unlikely segments by using profile information for a benchmark. We found that more than 35% of the unlikely and the warm segments contain redundant basic blocks. For performance sensitive binaries, folding basic blocks in non-cold regions could hurt performance. Our data shows that this optimization might still be profitable for cold basic blocks.

Merging Identical Basic Blocks

Unlike functions, merging identical basic blocks is a hard problem. Functions use call-ret semantics and there is no state needed to maintain where to jump to upon return. With basic block merging, the copy of the basic block that is kept needs to know where to jump back to after exiting. However, for basic blocks that end with return or indirect jump instructions, this is straightforward and nothing needs to be done.

Another type of basic block that can be easily folded is those ending with tail call instructions. This experiment does not track tail calls and will be done as a follow-up.

Basic blocks that do not end with “ret” or “indirect jump”

These are the more common kinds of basic blocks that are encountered. Efficient schemes for different architectures needs to be designed to keep the size and performance overhead from this as low as possible. The code below shows an outlined block and a jump to it using register %r12. This register saves the destination that the outlined block needs to jump to after exiting. This is a very simplistic solution to illustrate the folding challenge but shows that some patching would be required to materialize the folding.

Jumping to an outlined block would need patching like this:

pushq   %r12   # Patch : Save and restore r12
movabsq $_Z5firstb.2, %r12  # Patch: Save destination
je      outlined_block  # Patch : Jump to outlined block
popq    %r12  # Patch : Restore r12

An outlined block would have to make an indirect jmp at the end:

.section .text.outlined_block, "ax", @progbits
outlined_block:
 	movabsq	$.L.str.1, %rdi
	movb	$0, %al
	callq	printf
	jmp	*%r12   # Patch : Jump to right destination

The target of the jump from the outlined block would need to restore:

_Z5firstb.2:                            # %if.else
	popq    %r12    # Patch: Restore %r12
	jmp	_Z5firstb.3

Basic blocks that do end with “ret” or “indirect jump”

These account for ~3% of the redundant basic blocks and folding them is quite straightforward without requiring much or any patch code, though some jumps may need to be relaxed and made longer.

Interactions with Identical Function Folding

This optimization must be preferably used on size sensitive binaries and should ideally be done after linker garbage collection and identical function folding. Particularly, Identical Function folding would reveal more opportunities for basic block folding. For example, two non-identical basic blocks which only differ in their call target symbols or function symbol accesses could become identical if the differing symbols are found to be identical after function folding. The above experiments were done enabling gc-sections and icf.

Call Frame Information (CFI) Correctness

Call Frame Information (CFI) is used to accurately unwind the stack and contains information on how to obtain the call frame address and obtain register values that are pushed on the stack. These are emitted as CFI directives. Two basic blocks that have identical machine instructions can differ in CFI instructions. This can be a problem when folding such basic blocks as it will lead to incorrect stack unwinding. One solution is to also match the cfi instructions in the basic block and only fold them if the CFI directives match.

Debug Information Correctness

This is a hard problem but has similarities to how debug information should be handled with functions that are folded. The Identical Code Folding paper discusses DWARF extensions to maintain debug information for all folded copies and retrieving the correct one while unwinding. Similar ideas could be used here.

6 Likes

@jmorse - might be of interest in code size reductions (I mentioned this briefly on our call on Monday)

Those percentages are much larger than anything I’ve ever seen with outlining; enough to make me somewhat doubtful about your estimates. But I’ve never looked at anything anywhere near as large as clang; maybe the percentages change on big C++ binaries.

The sequences you’re choosing for jumping to and from outlined seem a bit strange; my first choice for outlining would be to use call/ret. Can you explain your choice here?

Yes, I was quite surprised by the magnitude too! What binaries did you try it on? Maybe I can run it on those too.

I have not thought too hard on the best way to materialize the folding. call-ret is fine but I was wondering if we maybe able to do better, maybe not. Maybe it is also dependent on where we want to do this optimization. This is cross-module, so it has to be late and at closer to link time.

It will be helpful to do some analysis on some of the duplicates. Likely due to inlining? Also for a binary built with Oz, or Os, I suspect the opportunities will be reduced. Collecting stats for those would be useful too.

My guess is that call/ret would make stack accesses in the called block incorrect, if you’re not using a separate frame pointer. Branch-and-link using a register makes sense to me, although it effectively requires reserving one register globally.

Sorry, nothing I can share. But small programs using C/C-style code.

See llvm-project/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp at 307cd883546348cd658d74699915fd48ae01e9a0 · llvm/llvm-project · GitHub for various ways you can outline short code sequences. I assume the tradeoffs are similar on x86. x86 has the slight advantage that you don’t need to worry about the link register.

As long as you’re not outlining a call, you just need to rewrite sp-relative accesses. It’s not that hard.

2 Likes

@barrelshifter for MachineOutliner, 2018 “What’s New In Outlining”
@kyulee-com for Improving Machine Outliner for ThinLTO and @ellishg for Efficient profile-guided size optimization for native mobile applications
@yroux for Reducing code size with LLVM Machine Outliner on 32-bit Arm targets

MachineOutliner is a similar work, outlining instructions with a finer granularity than one basic block, but does not utilize profile information. MachineOutliner is enabled on all -Oz functions for AArch64/RISC-V. -mllvm -enable-machine-outliner enables it on all functions for other optimization levels. I believe MachineOutliner is primarily used by mobile applications with a focus on size reduction but not optimization. Meta has works to place similar functions together so that LZ77-style compression will work better.

Yes, machine outliner is similar work. AFAICT, it happens earlier in the opt pipeline and is intra-module. I was looking at something cross-module and post-link. I can try to get some numbers on how much more I can squeeze over and above machine outlining.

1 Like

A while back, I worked on using profiles to avoid outlining hot blocks. I’m hoping to try this again soon and collect performance data to see if it’s a good tradeoff between size and performance.
https://reviews.llvm.org/D125743

4 Likes

I agree with both points of view:: a) that outlining should map to changing identical basic blocks into mini/micro-subroutines, I also agree that:: b) this ruins the stack offset layout.

{{aside:: It occurs to me that RISC-V prologue and epilogue sequences are often micro-subroutines but this is not translatable to all ISAs.}}

It also occurs to me that the above aside implies this optimization is being applied at the wrong point in the compile path. If you can find the duplicate basic blocks prior to stack layout and register allocation, the conversion into a micro-subroutine would be more straightforward.