[BOLT][RFC] Enhance BOLT for Linux kernel

FLZ101 · January 20, 2025, 3:19am

After adding support for AArch64, relocation mode, and instrumentation, we got BOLT for the Linux kernel working on AArch64 and measured an improvement of more than 8% on our nginx benchmark.

We’d like to contribute our work to the community. Below is a basic plan:

Add support for AArch64. We need to refactor some code first to make it easier to handle different Linux kernel versions and different architectures.
Add support for relocation mode to enable more and better optimizations.
Add support for instrumentation to make BOLT for Linux kernel work on machines without LBR/BRBE.
Improve the documentation to make BOLT for Linux kernel easier to use.

Three kinds of functions in Linux kernel source code need to be specially handled:

Functions whose address must be kept. Their addresses are usually used in addition, subtraction, or comparison. We could gather a list of them in the middle end.
Functions that cannot be changed at all. Some functions in assembly code have additional semantics or constraints. For example, irq_entries_start, defined in arch/x86/include/asm/idtentry.h, is actually an “array”, but BOLT has no way of knowing that and treats it as an ordinary function. If BOLT applies instrumentation, basic block reordering, etc. to it, a runtime error occurs. In short, BOLT cannot reliably handle functions defined in assembly code. We could gather a list of functions defined in C code in the middle end and optimize only those.
Functions whose code can change at run time. The Linux kernel can patch a function at run time. In relocation mode, BOLT could create a copy of such a function, and the Linux kernel would then patch only the copy. To avoid undefined behavior at run time, we fill the patch points in the original function with undef instructions.
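
As a rough illustration of the third case, here is a minimal Python sketch of overwriting patch points in a stale function body with an undefined instruction. It is not BOLT's implementation: the function name `fill_patch_points`, the `(offset, size)` site format, and the sample bytes are assumptions made for this example. It uses AArch64 encodings, where every instruction is 4 bytes and `UDF #0` encodes as four zero bytes.

```python
# Hypothetical sketch, not BOLT code: neutralize patch points in the
# original (now stale) copy of a function so that reaching one traps.

AARCH64_NOP = bytes.fromhex("1f2003d5")  # NOP (0xD503201F), little-endian
AARCH64_UDF0 = bytes(4)                  # UDF #0 (0x00000000)


def fill_patch_points(code: bytes, sites: list[tuple[int, int]],
                      trap: bytes = AARCH64_UDF0) -> bytes:
    """Return a copy of `code` with every (offset, size) site filled with `trap`."""
    out = bytearray(code)
    for offset, size in sites:
        if offset < 0 or offset + size > len(out):
            raise ValueError(f"patch site {offset}+{size} is outside the function")
        if size % len(trap) != 0:
            raise ValueError("patch site size must be a multiple of the trap size")
        out[offset:offset + size] = trap * (size // len(trap))
    return bytes(out)


# Three NOPs stand in for a function body; the middle one is a patch point.
original = AARCH64_NOP * 3
stale_copy = fill_patch_points(original, [(4, 4)])
print(stale_copy.hex())  # 1f2003d5000000001f2003d5
```

The relocated copy keeps its patchable instructions; only the original body is rewritten, so an unexpected jump into it fails loudly instead of executing unpatched code.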

maksfb · January 24, 2025, 10:51pm

That’s an amazing result! Thanks for sharing.

Overall, the plan sounds good to me. More details below:

Add support for AArch64. We need to refactor some code first to make it easier to handle different Linux kernel versions and different architectures.

Do you have more PRs on top of what you’ve already submitted covering this area?

Add support for relocation mode to enable more and better optimizations.

That’s a good going-forward solution, but it comes with many caveats. Did you measure significant wins on top of linker-driven function reordering + non-relocation mode BOLT?

Add support for instrumentation to make BOLT for Linux kernel work on machines without LBR/BRBE.

Improve the documentation to make BOLT for Linux kernel easier to use.

Fantastic. We should also think about upstreaming changes to Linux.

Functions whose address must be kept. Their addresses are usually used in addition, subtraction, or comparison. We could gather a list of them in the middle end.

What do you mean by a middle end? Can the analysis/scan of such functions be performed in BOLT?

Functions that cannot be changed at all. Some functions in assembly code have additional semantics or constraints. For example, irq_entries_start, defined in arch/x86/include/asm/idtentry.h, is actually an “array”, but BOLT has no way of knowing that and treats it as an ordinary function. If BOLT applies instrumentation, basic block reordering, etc. to it, a runtime error occurs. In short, BOLT cannot reliably handle functions defined in assembly code. We could gather a list of functions defined in C code in the middle end and optimize only those.

Currently, we add such functions to the --skip-funcs=... list. Do you keep the original function body, or do you relocate it without changing its contents (modulo relocation processing)?

Functions whose code can change at run time. The Linux kernel can patch a function at run time. In relocation mode, BOLT could create a copy of such a function, and the Linux kernel would then patch only the copy. To avoid undefined behavior at run time, we fill the patch points in the original function with undef instructions.

What is the point of keeping the original? Such functions are currently handled in non-relocation mode. Relocations can help identify new cases, e.g. if the kernel code introduces a new feature that BOLT does not recognize.

maksfb · January 24, 2025, 10:52pm

Regarding the benchmark, could you please share the detailed setup of your nginx configuration?

FLZ101 · January 25, 2025, 6:38am

I hope to create these PRs soon.

FLZ101 · January 25, 2025, 6:52am

These performance tests have not been conducted yet. We want to enable relocation mode mainly because instrumentation requires it. Our AArch64 machines do not have BRBE, so to use BOLT for the Linux kernel on them we have to use instrumentation rather than sampling.

FLZ101 · January 25, 2025, 7:15am

I mean somewhere in the IR pipeline (e.g. the codegen prepare pass). Such an analysis could be performed in BOLT, but it would take more work. At the IR level, most of what we need to do is check whether the address of a function is used as an operand of ptrtoint. Of course, it would be better if this were done in BOLT.
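
To make the ptrtoint check concrete, here is a minimal Python sketch that scans textual LLVM IR for function symbols used directly as a `ptrtoint` operand. It is only a text-level heuristic for illustration: a real analysis would walk the IR through the LLVM APIs, and this version misses addresses that reach `ptrtoint` indirectly as well as quoted symbol names. The function name, the regular expressions, and the sample IR are assumptions made for this example.

```python
import re

# Names introduced by "define"/"declare" lines are treated as functions.
FUNC_DEF = re.compile(r"^\s*(?:define|declare)\b[^@\n]*@([\w.$]+)\s*\(", re.MULTILINE)
# Matches both the instruction form and the constant-expression form.
PTRTOINT = re.compile(r"ptrtoint\s*\(?\s*ptr\s+@([\w.$]+)\s+to\b")


def functions_used_in_ptrtoint(ir_text: str) -> set[str]:
    """Return functions whose address appears directly as a ptrtoint operand."""
    functions = set(FUNC_DEF.findall(ir_text))
    return {name for name in PTRTOINT.findall(ir_text) if name in functions}


SAMPLE_IR = """
@counter = global i32 0

declare void @base()

define void @helper() {
  ret void
}

define i64 @encode() {
  %a = ptrtoint ptr @helper to i64
  %d = sub i64 %a, ptrtoint (ptr @base to i64)
  %g = ptrtoint ptr @counter to i64
  ret i64 %d
}
"""

print(sorted(functions_used_in_ptrtoint(SAMPLE_IR)))  # ['base', 'helper']
```

The global `@counter` is ignored because it is not a function, which is the distinction the address-preservation list needs.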

FLZ101 · January 25, 2025, 7:41am

The skip-funcs approach might not be easy to use for users not familiar with Linux kernel source code. To identify those functions to skip, we often need to test & debug the Linux kernel many times, and still might miss some. That is a lot of work.

In my own experience, especially with relocation mode enabled, I have spent a lot of time fixing compilation and runtime errors related to functions defined in assembly code.

Handling only functions defined in C code makes development easier and makes BOLT more reliable.
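
As a sketch of how the two approaches relate, the snippet below derives a skip list automatically from a list of C-defined functions instead of collecting it by trial and error. It assumes that some earlier step has produced both the full set of function symbols and the set of functions known to come from C code, and that --skip-funcs accepts a comma-separated list. The helper name and the sample symbols (other than irq_entries_start) are hypothetical.

```python
def build_skip_funcs_flag(all_functions, c_defined):
    """Skip every function that is not known to be defined in C code."""
    skipped = sorted(set(all_functions) - set(c_defined))
    return "--skip-funcs=" + ",".join(skipped)


# Hypothetical inputs: symbols in the kernel image, and those known to come from C.
all_functions = ["irq_entries_start", "c_func_a", "asm_helper", "c_func_b"]
c_defined = {"c_func_a", "c_func_b"}

print(build_skip_funcs_flag(all_functions, c_defined))
# --skip-funcs=asm_helper,irq_entries_start
```

For a real kernel the resulting list would be long, so passing it on the command line may not be practical; the point is only that the list is computed rather than discovered through debugging.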

FLZ101 · January 25, 2025, 7:53am

I am not sure that the original code will never be reached, since the Linux kernel source code uses many tricks, including jumps and calls that BOLT might not identify.

FLZ101 · January 25, 2025, 7:58am

I hope to share it soon.

maksfb · January 28, 2025, 7:54pm

Did you try ETM as an alternative to LBRs?

maksfb · January 28, 2025, 7:56pm

The analysis in the compiler should be more robust, but then we introduce the dependency on the compiler. Do you have examples of the pointer arithmetic in the kernel beyond pointer comparison?

maksfb · January 28, 2025, 7:58pm

Did you do anything special with respect to jump tables? E.g., did you have to disable JT code generation with -fno-jump-tables?

nickdesaulniers · January 28, 2025, 11:02pm

Cool! I sent a link to this thread to some folks in Android. I bet they’d be interested in this!

FLZ101 · March 3, 2025, 9:00am

We also need to complete the support for exception tables, alternative instructions, etc.

FLZ101 · March 5, 2025, 8:52am

We haven’t tried it. Do you know how we can use ETM as an alternative to LBR?

FLZ101 · March 5, 2025, 8:55am

Here is an example:

filter.h - include/linux/filter.h - Linux source code v6.13.2 - Bootlin Elixir Cross Referencer

Here a function address is encoded as its difference from the address of the BASE function. When that function is called, we first recover its address by adding back the address of the BASE function. To keep things simple, the BASE function should keep its address.
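
The arithmetic can be sketched in a few lines of Python, with plain integers standing in for addresses. The function names and all addresses are made up for illustration and are not taken from the kernel. The sketch shows that decoding only works when both sides use the same BASE address.

```python
def encode_call(target_addr: int, base_addr: int) -> int:
    """Encode a function address as its offset from the BASE function."""
    return target_addr - base_addr


def decode_call(offset: int, base_addr: int) -> int:
    """Recover the function address by adding the BASE address back."""
    return base_addr + offset


# Hypothetical addresses.
BASE_ADDR = 0x1000
HELPER_ADDR = 0x1400

offset = encode_call(HELPER_ADDR, BASE_ADDR)
print(hex(decode_call(offset, BASE_ADDR)))        # 0x1400, the helper

# If the decoding side sees a different BASE address than the encoding
# side did, the same offset resolves to the wrong place.
MOVED_BASE_ADDR = 0x9000
print(hex(decode_call(offset, MOVED_BASE_ADDR)))  # 0x9400, not the helper
```

Keeping BASE at its original address is the simplest way to guarantee that the encoding and decoding sides never disagree.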

FLZ101 · March 5, 2025, 8:55am

No. I do not remember doing anything special with respect to jump tables.

maksfb · March 6, 2025, 6:12am

For ETM, please try the following guide: Use BOLT with ETM | Arm Learning Paths

maksfb · March 6, 2025, 6:14am

In that case, disabling jump tables may give you even more performance.

maksfb · March 6, 2025, 6:15am

I’m not familiar with this part of the kernel. Is this macro used for static or dynamic code generation?
