Hi, we’ve observed some significant speedups after using BOLT on Postgres, particularly for OLTP workloads with short queries, in some cases close to 40%. It’s not very likely we would just run BOLT on regular builds, but we’re thinking about using BOLT as guidance. I started experimenting with that some time ago, and I mostly imagined something like this:
- collect the profile as usual
- use -report-bad-layout to learn about issues in “common” functions
- change the code to fix those issues, after a bit of reasoning about how universal the issue is (i.e. don’t overfit to the profile)
My assumption/expectation was that there would be a small fraction of places/functions responsible for most of the gains, so I’ve been trying to identify those, but without much luck so far.
In particular, my expectation was that if I took all functions referenced in the bad layout report, put them into a file, and passed that to llvm-bolt using the --funcs-file option, that should produce the same optimized binary as a plain BOLT run. That would be useful for determining a smaller subset of functions to actually optimize “manually”.
Unfortunately, that’s not the behavior I’ve observed. Even if I grep all the functions from the report:
llvm-bolt ... -report-bad-layout=1000000 > bad-layout.txt
grep "Binary Func" bad-layout.txt | \
sed 's/Binary Function "//' | \
sed 's/".*//' | \
sed 's|/.*||' | \
sed 's/$/.*/' > funcs.txt
llvm-bolt ... --funcs-file=funcs.txt
the benefits are nowhere near those of just running BOLT without --funcs-file.
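For reference, the grep/sed chain above can be written as a single sed call that also removes duplicate entries. This is only a sketch: the function name `extract_funcs` is made up for illustration, and it assumes each relevant report line contains `Binary Function "name"`, optionally followed by a `/N(*M)` suffix inside the quotes (as in `CommitTransaction/1(*2)`).

```shell
# Hypothetical helper: read a -report-bad-layout dump on stdin and print
# one "name.*" pattern per reported function, sorted and deduplicated.
# Assumes report lines contain: Binary Function "name" or "name/N(*M)".
extract_funcs() {
    # Keep the name up to the first '"' or '/', then append ".*",
    # mirroring the original chain of sed commands.
    sed -n 's|.*Binary Function "\([^"/]*\).*|\1.*|p' | sort -u
}

# Usage (file names as in the commands above):
#   extract_funcs < bad-layout.txt > funcs.txt
#   wc -l < funcs.txt    # number of distinct patterns passed to BOLT
```

Counting the distinct patterns makes it easier to compare the list against the number of functions BOLT reports as having a non-empty execution profile.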
And the output seems quite different too. If I just run BOLT, I get this:
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 644899addd8fd789c93e9a0f0727d37eb1b29c55
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0xa00000, offset 0xa00000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: split function detected on input : brin_desummarize_range.cold. The support is limited in relocation mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: function LockAcquireExtended has an object detected in a padding region at address 0x47e15c
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function HeapTupleSatisfiesVisibility
BOLT-WARNING: skipped 50 functions due to cold fragments
BOLT-INFO: 757 out of 18213 functions in the binary (4.2%) have non-empty execution profile
BOLT-INFO: 32 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: 9566 instructions were shortened
BOLT-INFO: removed 2171 empty blocks
BOLT-INFO: basic block reordering modified layout of 479 functions (63.28% of profiled, 2.59% of total)
BOLT-INFO: splitting separates 79644 hot bytes from 191559 cold bytes (29.37% of split functions is hot).
BOLT-INFO: 9 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
1824914 : executed forward branches
686800 : taken forward branches
348112 : executed backward branches
137496 : taken backward branches
206932 : executed unconditional branches
912616 : all function calls
175478 : indirect calls
38453 : PLT calls
21314565 : executed instructions
...
and if I use --funcs-file (with the ~200 lines extracted from the bad layout report), I get this:
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 644899addd8fd789c93e9a0f0727d37eb1b29c55
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0xa00000, offset 0xa00000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: split function detected on input : brin_desummarize_range.cold. The support is limited in relocation mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: function LockAcquireExtended has an object detected in a padding region at address 0x47e15c
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function CommitTransaction/1(*2)
BOLT-WARNING: skipped 2 functions due to cold fragments
BOLT-INFO: 208 out of 18213 functions in the binary (1.1%) have non-empty execution profile
BOLT-INFO: 581 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: 2506 instructions were shortened
BOLT-INFO: removed 925 empty blocks
BOLT-INFO: basic block reordering modified layout of 205 functions (98.56% of profiled, 1.11% of total)
BOLT-INFO: splitting separates 40312 hot bytes from 79102 cold bytes (33.76% of split functions is hot).
BOLT-INFO: 4 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
864825 : executed forward branches
370854 : taken forward branches
244001 : executed backward branches
101387 : taken backward branches
133993 : executed unconditional branches
421403 : all function calls
65135 : indirect calls
15102 : PLT calls
10179206 : executed instructions
...
Clearly, that’s very different. The second run reports less than half the executed instructions, forward branches, and function calls of the first.
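To put numbers on the difference, the two logs can be compared counter by counter. The following is a rough sketch, not part of BOLT: `compare_dynostats` and the log file names are hypothetical, and it assumes each log has been saved to a file and contains a single dynostats block with lines of the form `COUNT : counter name`, as shown above.

```shell
# Hypothetical helper: print each dynostat counter from two saved BOLT logs
# side by side, plus the second value as a percentage of the first.
# Assumes one dynostats block per log, with lines like "912616 : all function calls".
compare_dynostats() {
    awk -F ' : ' '
        /^ *[0-9]+ : / {
            if (FNR == NR) {
                # First file: remember the counter value by name.
                full[$2] = $1 + 0
            } else if (($2 in full) && full[$2] > 0) {
                # Second file: print both values and their ratio.
                printf "%-32s %12d %12d %5.1f%%\n", $2, full[$2], $1 + 0, 100 * $1 / full[$2]
            }
        }
    ' "$1" "$2"
}

# Usage (hypothetical file names):
#   compare_dynostats bolt-full.log bolt-funcs-file.log
```

Lines that do not start with a number, such as the BOLT-INFO and BOLT-WARNING messages, are ignored.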
So, what am I missing? Or is there a better way to approach this?