Using -report-bad-layout and --funcs-file to optimize functions

Hi, we’ve observed some significant speedups after using BOLT on Postgres, particularly for OLTP workloads with short queries, in some cases close to 40%. It’s not very likely we would just run BOLT on regular builds, but we’re thinking about using BOLT as guidance. I started experimenting with this some time ago, and I mostly imagined something like this:

  1. collect the profile as usual

  2. use -report-bad-layout to learn about issues in “common” functions

  3. change the code to fix those issues, after a bit of reasoning about how universal the issue is (i.e. don’t overfit to the profile)

My assumption/expectation was that there would be a small fraction of places/functions responsible for most of the gains, so I’ve been trying to identify those, but not much luck so far :frowning_face:

In particular, my expectation was that if I take all the functions referenced in the bad-layout report, put them into a file, and pass that to llvm-bolt using the --funcs-file option, it should produce the same optimized binary as running BOLT on everything. That would be useful for determining a smaller subset of functions to actually optimize “manually”.

Unfortunately, that’s not the behavior I’ve observed - even if I grep all the functions from the report

llvm-bolt ... -report-bad-layout=1000000 > bad-layout.txt

# extract the function names from the report and turn each one into a
# regex for --funcs-file: strip the 'Binary Function "' prefix, drop
# everything from the closing quote, drop any "/1"-style fragment
# suffix, and append ".*"
grep "Binary Func" bad-layout.txt | \
  sed 's/Binary Function "//' | \
  sed 's/".*//' | \
  sed 's|/.*||' | \
  sed 's/$/.*/' > funcs.txt

llvm-bolt ... --funcs-file=funcs.txt

the benefits are nowhere near those of running BOLT without --funcs-file=.

And the output seems quite different too - if I just do BOLT, I get this:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 644899addd8fd789c93e9a0f0727d37eb1b29c55
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0xa00000, offset 0xa00000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: split function detected on input : brin_desummarize_range.cold. The support is limited in relocation mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: function LockAcquireExtended has an object detected in a padding region at address 0x47e15c
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function HeapTupleSatisfiesVisibility
BOLT-WARNING: skipped 50 functions due to cold fragments
BOLT-INFO: 757 out of 18213 functions in the binary (4.2%) have non-empty execution profile
BOLT-INFO: 32 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: 9566 instructions were shortened
BOLT-INFO: removed 2171 empty blocks
BOLT-INFO: basic block reordering modified layout of 479 functions (63.28% of profiled, 2.59% of total)
BOLT-INFO: splitting separates 79644 hot bytes from 191559 cold bytes (29.37% of split functions is hot).
BOLT-INFO: 9 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

             1824914 : executed forward branches
              686800 : taken forward branches
              348112 : executed backward branches
              137496 : taken backward branches
              206932 : executed unconditional branches
              912616 : all function calls
              175478 : indirect calls
               38453 : PLT calls
            21314565 : executed instructions
...

and if I use --funcs-file (with the ~200 lines extracted from the bad layout report), I get this:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 644899addd8fd789c93e9a0f0727d37eb1b29c55
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0xa00000, offset 0xa00000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: split function detected on input : brin_desummarize_range.cold. The support is limited in relocation mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: function LockAcquireExtended has an object detected in a padding region at address 0x47e15c
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function CommitTransaction/1(*2)
BOLT-WARNING: skipped 2 functions due to cold fragments
BOLT-INFO: 208 out of 18213 functions in the binary (1.1%) have non-empty execution profile
BOLT-INFO: 581 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: 2506 instructions were shortened
BOLT-INFO: removed 925 empty blocks
BOLT-INFO: basic block reordering modified layout of 205 functions (98.56% of profiled, 1.11% of total)
BOLT-INFO: splitting separates 40312 hot bytes from 79102 cold bytes (33.76% of split functions is hot).
BOLT-INFO: 4 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

              864825 : executed forward branches
              370854 : taken forward branches
              244001 : executed backward branches
              101387 : taken backward branches
              133993 : executed unconditional branches
              421403 : all function calls
               65135 : indirect calls
               15102 : PLT calls
            10179206 : executed instructions
...

Clearly, that’s very different: fewer than half the executed instructions, executed branches, and so on.

So, what am I missing? Or is there a better way to maybe approach this?


I’ll first try to explain why you are not getting the full benefit from using the function list. At least a couple of things come to mind:

  • --report-bad-layout only reports functions whose layout is very clearly “bad” from a profile point of view, namely where cold code sits in the middle of hot code. Functions with a merely “average” layout are not reported, even though BOLT will still improve them.
  • When BOLT optimizes based on a function list, it excludes all other functions from layout optimization and function splitting, so you lose the benefit of those passes for everything not on the list.

I don’t know much about Postgres, but many DB-type workloads have a flat profile where the load is evenly distributed across hundreds if not thousands of functions, with the top functions taking just 1-2% of the CPU time or less.

It’s often close to impossible to fix those issues at the source code level, especially with inlining enabled, because the inlined function’s behavior depends on the caller’s context.

You’re right. In this case the profile is rather “flat” because the workload is very simple, so processing doesn’t spend much time in any single layer (parsing, planning, execution). The “perf top” profile looks like this:

# Overhead  Command          Shared Object      Symbol                                    
# ........  ...............  .................  ..........................................
#
     3.50%  postgres         postgres           [.] base_yyparse
     2.29%  postgres         postgres           [.] palloc0
     1.66%  postgres         postgres           [.] AllocSetAlloc
     1.41%  postgres         postgres           [.] SearchCatCacheInternal
     1.26%  postgres         postgres           [.] hash_search_with_hash_value
     1.20%  postgres         [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
     1.19%  postgres         postgres           [.] expression_tree_walker_impl
     0.99%  pgbench          pgbench            [.] threadRun
     0.95%  postgres         postgres           [.] core_yylex
     0.93%  pgbench          [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
     0.79%  postgres         [kernel.kallsyms]  [k] syscall_return_via_sysret
     0.75%  postgres         postgres           [.] _bt_compare
     0.73%  postgres         [kernel.kallsyms]  [k] entry_SYSCALL_64
     0.67%  pgbench          [kernel.kallsyms]  [k] syscall_return_via_sysret
...

So yeah, there are no “heavy” functions responsible for a significant fraction of the time.

This reminds me, though: when I tried this on an analytics workload (large complex queries processing large amounts of data, often hitting a small number of functions), BOLT failed with a message like this:

BOLT-ERROR: unable to get new address corresponding to input address
            0x2a5185 in function ExecInterpExpr/1(*2). Consider adding
            this function to --skip-funcs=...

and after adding the function to --skip-funcs it worked, but there was almost no improvement. That may not be all that surprising, because ExecInterpExpr is the expression interpreter where most of the expensive stuff happens, so skipping it skips optimizations for all the interesting parts. I wonder if there are ways to allow optimizing those functions - for example, maybe there’s some compiler option with which it would be possible to get a new address for the function?

Yeah, that’s what I was afraid might be happening, but I decided to give it a try. And thanks for explaining the -report-bad-layout stuff.

I believe this is an indication that the function uses the computed-goto extension, which is quite common in interpreter loop implementations. Additionally, the code is likely compiled with -fpic/-fPIC. As a result, the compiler creates dynamic relocations of a kind that BOLT currently does not support.

Even though that’s an important/hot function, skipping it in BOLT is unlikely to affect the overall performance by more than 1%.

Correct, the code does indeed use computed goto. I see we have some sort of workaround for compilers that don’t support it, so I’ll check whether that’s good enough.

But if the benefit really is less than 1%, that’d be a bit disappointing. My (very naive) expectation was that for expression-heavy queries (as in analytics) BOLT would help quite a bit, perhaps similarly to JIT. But the observed behavior is pretty much the exact opposite: little benefit for OLAP, massive benefit for OLTP.

I’m not suggesting this is somehow wrong, just that it goes directly against my layman intuition.

In any case, I very much appreciate the feedback / advice I got here. Thanks!
