IBBF: Inter-procedural Identical Basic Block Folding
Summary
We looked at folding basic blocks for more code size gains, initial experiments show that folding identical basic blocks can potentially reduce text size of large binaries by 20% or more. This can be done without affecting performance by using profile information to only fold cold basic blocks. This feature, like identical function folding, can produce backtraces that are nonsensical. However, these can be reasonably mitigated with improved debug information representation for folded basic blocks.
Background
Linker ICF, available in lld, has been used to fold identical functions and reduce code size of binaries by 6% to 10%. This was made possible due to the availability of linker function sections which makes it possible to identify and fold identical functions in the linker. Recently, Propeller added support for basic block sections. This sparked the question whether basic blocks could be similarly folded, finer in granularity but potentially more rewarding.
Unlike functions, folding basic blocks is a much harder problem. Basic blocks are entered and exited via jump instructions or fall-throughs, unlike call-ret semantics of functions, and folding basic blocks would require additional patching code to keep track of where to jump back to after exiting. Hence, folding really small basic blocks would not be net profitable and this needs to be done only for basic blocks above a certain threshold. Further, basic block layout is one of the key determining factors of performance and folding hot basic blocks could be counter-productive. Hence, this should be profile-guided and only be applied to cold basic blocks and very selectively to hot basic blocks.
Initial experiments on large binaries, like clang, show that 20% to 35% of basic blocks that are 20 bytes or larger are duplicated and can potentially bring in code size reductions of 20% or more.
Experiments
An experimental tool has been developed to statically analyze binaries for basic block similarity on Meta’s BOLT tool. We looked at two large binaries, clang and a large internal binary. While these experiments were primarily on X86 binaries, we believe that ARM binaries should show similar improvements. Both were built using Linker GC and ICF to eliminate false positives from dead sections and folded functions. The table below shows that estimate in size savings and does not take into account additional code that might be needed to materialize the savings. Further, only basic blocks that are at least 20 bytes are considered for folding. This is because some patch code is needed to materialize the savings and basic blocks below 20 bytes might not be net profitable to fold.
Binary | Total Basic Blocks | Redundant Basic Blocks (% of Total) | Indirect Jump or Ret Basic Blocks (% of Redundant) | Code Size of Redundant Blocks in bytes (% of Total) | Unique Blocks with Copies (Ave. Copies) |
---|---|---|---|---|---|
clang | 2614601 | 548441 (21%) | 19090 (3.5%) | 14774949 (25%) | 95505(5) |
lld | 1420571 | 218589 (15%) | 9451 (4.3%) | 5899813 (18%) | 59612(3) |
Internal | _ | _ | _ | _ (29%) | _ |
For example, in clang, 21% of all basic blocks that are at least 20 bytes have a duplicate and can be eliminated as they are redundant. Here, the 21% only counts the redundant blocks and not the single copy that must be kept. 3.5% of such blocks end with a return instruction or an indirect jump and can be easily folded. These redundant blocks account for 25% of the total code size (text size of all basic blocks).
The histogram shows the frequency of the various sizes of redundant basic blocks, with the Y-axis in log scale. Folding basic blocks is not free as some patch code needs to be added to return to the right destination after jumping to the folded block. Larger block sizes make it more profitable to fold.
Attribution to text section type
We looked at the % of basic blocks in the hot, cold and unlikely segments by using profile information for a benchmark. We found that more than 35% of the unlikely and the warm segments contain redundant basic blocks. For performance sensitive binaries, folding basic blocks in non-cold regions could hurt performance. Our data shows that this optimization might still be profitable for cold basic blocks.
Merging Identical Basic Blocks
Unlike functions, merging identical basic blocks is a hard problem. Functions use call-ret semantics and there is no state needed to maintain where to jump to upon return. With basic block merging, the copy of the basic block that is kept needs to know where to jump back to after exiting. However, for basic blocks that end with return or indirect jump instructions, this is straightforward and nothing needs to be done.
Another type of basic block that can be easily folded is those ending with tail call instructions. This experiment does not track tail calls and will be done as a follow-up.
Basic blocks that do not end with “ret” or “indirect jump”
These are the more common kinds of basic blocks that are encountered. Efficient schemes for different architectures needs to be designed to keep the size and performance overhead from this as low as possible. The code below shows an outlined block and a jump to it using register %r12. This register saves the destination that the outlined block needs to jump to after exiting. This is a very simplistic solution to illustrate the folding challenge but shows that some patching would be required to materialize the folding.
Jumping to an outlined block would need patching like this:
pushq %r12 # Patch : Save and restore r12
movabsq $_Z5firstb.2, %r12 # Patch: Save destination
je outlined_block # Patch : Jump to outlined block
popq %r12 # Patch : Restore r12
An outlined block would have to make an indirect jmp at the end:
.section .text.outlined_block, "ax", @progbits
outlined_block:
movabsq $.L.str.1, %rdi
movb $0, %al
callq printf
jmp *%r12 # Patch : Jump to right destination
The target of the jump from the outlined block would need to restore:
_Z5firstb.2: # %if.else
popq %r12 # Patch: Restore %r12
jmp _Z5firstb.3
Basic blocks that do end with “ret” or “indirect jump”
These account for ~3% of the redundant basic blocks and folding them is quite straightforward without requiring much or any patch code, though some jumps may need to be relaxed and made longer.
Interactions with Identical Function Folding
This optimization must be preferably used on size sensitive binaries and should ideally be done after linker garbage collection and identical function folding. Particularly, Identical Function folding would reveal more opportunities for basic block folding. For example, two non-identical basic blocks which only differ in their call target symbols or function symbol accesses could become identical if the differing symbols are found to be identical after function folding. The above experiments were done enabling gc-sections and icf.
Call Frame Information (CFI) Correctness
Call Frame Information (CFI) is used to accurately unwind the stack and contains information on how to obtain the call frame address and obtain register values that are pushed on the stack. These are emitted as CFI directives. Two basic blocks that have identical machine instructions can differ in CFI instructions. This can be a problem when folding such basic blocks as it will lead to incorrect stack unwinding. One solution is to also match the cfi instructions in the basic block and only fold them if the CFI directives match.
Debug Information Correctness
This is a hard problem but has similarities to how debug information should be handled with functions that are folded. The Identical Code Folding paper discusses DWARF extensions to maintain debug information for all folded copies and retrieving the correct one while unwinding. Similar ideas could be used here.