After adding support for AArch64, relocation mode and instrumentation, we managed to make BOLT for Linux kernel work on AArch64 and achieved an improvement of 8+% on our nginx benchmark.
We’d like to contribute our work to the community. Below is a basic plan:
Add support for AArch64. We need to refactor some code first to make it easier to handle different Linux kernel versions and different architectures.
Add support for relocation mode to enable more and better optimizations.
Add support for instrumentation to make BOLT for Linux kernel work on machines without LBR/BRBE.
Improve the documentation to make BOLT for Linux kernel easier to use.
Three kinds of functions in Linux kernel source code need to be specially handled:
Functions whose address must be kept. Their address is usually used in add/sub/comparison. We could gather a list of them in the middle end.
Functions that cannot be changed at all. Some functions in assembly code have additional semantics/enforcements. For example, irq_entries_start defined in arch/x86/include/asm/idtentry.h is actually an “array”, but BOLT cannot know that and views it as an ordinary function. If BOLT applies instrumentation, basic block reordering, etc. to it, a runtime error happens. As you can see, BOLT cannot reliably handle functions defined in assembly code. We could gather a list of functions defined in C code in the middle end and optimize only those.
Functions whose code can change at run time. Linux kernel could patch a function at run time. In relocation mode, BOLT could create a copy of such a function and Linux kernel can only patch the copy. To avoid undefined behaviors at run time, we fill patch points in the original function with undef instructions.
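To illustrate the last point, here is a minimal sketch of filling patch points in the original copy with an undefined instruction. It only manipulates a byte buffer; the helper name `fill_patch_points` and the offsets are made up for illustration, and this is not how BOLT itself is implemented. The only hard fact used is that the AArch64 `udf #0` encoding is 0x00000000 and `nop` is 0xD503201F.

```python
# AArch64 "udf #0": a permanently undefined instruction, encoded as all zeros.
UDF_AARCH64 = (0x00000000).to_bytes(4, "little")
# AArch64 "nop", used here only as stand-in function contents.
NOP_AARCH64 = (0xD503201F).to_bytes(4, "little")


def fill_patch_points(code: bytes, patch_offsets, insn: bytes = UDF_AARCH64) -> bytes:
    """Return a copy of `code` with every patch point overwritten by `insn`.

    `patch_offsets` are byte offsets into the function (hypothetical input;
    in practice they would come from the kernel's patch-site tables).
    """
    out = bytearray(code)
    for off in patch_offsets:
        if off % len(insn) != 0 or off + len(insn) > len(out):
            raise ValueError(f"bad patch offset {off:#x}")
        out[off:off + len(insn)] = insn
    return bytes(out)


# A fake 4-instruction function whose 2nd and 4th instructions are patch points.
original = NOP_AARCH64 * 4
patched = fill_patch_points(original, [4, 12])
```

If execution ever reaches a stale patch point in the original copy, it traps immediately instead of running code the kernel believes it has patched.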
Overall, the plan sounds good to me. More details below:
Add support for AArch64. We need to refactor some code first to make it easier to handle different Linux kernel versions and different architectures.
Do you have more PRs on top of what you’ve already submitted covering this area?
Add support for relocation mode to enable more and better optimizations.
That’s a good going-forward solution, but it comes with many caveats. Did you measure significant wins on top of linker-driven function reordering + non-relocation mode BOLT?
Add support for instrumentation to make BOLT for Linux kernel work on machines without LBR/BRBE.
Improve the documentation to make BOLT for Linux kernel easier to use.
Fantastic. We should also think about upstreaming changes to Linux.
Functions whose address must be kept. Their address is usually used in add/sub/comparison. We could gather a list of them in the middle end.
What do you mean by a middle end? Can the analysis/scan of such functions be performed in BOLT?
Functions that cannot be changed at all. Some functions in assembly code have additional semantics/enforcements. For example, irq_entries_start defined in arch/x86/include/asm/idtentry.h is actually an “array”, but BOLT cannot know that and views it as an ordinary function. If BOLT applies instrumentation, basic block reordering, etc. to it, a runtime error happens. As you can see, BOLT cannot reliably handle functions defined in assembly code. We could gather a list of functions defined in C code in the middle end and optimize only those.
Currently, we add such functions to the --skip-funcs=... list. Do you keep the original function body, or do you relocate it without changing the contents (modulo relocation processing)?
Functions whose code can change at run time. Linux kernel could patch a function at run time. In relocation mode, BOLT could create a copy of such a function and Linux kernel can only patch the copy. To avoid undefined behaviors at run time, we fill patch points in the original function with undef instructions.
What is the point of keeping the original? Such functions are currently handled in non-relocation mode. Relocations can help identify new cases, i.e. if the kernel code introduces a new feature not recognized by BOLT.
These performance tests have not been conducted yet. We want to enable relocation mode mainly because instrumentation requires it. Our AArch64 machines do not have BRBE, so to use BOLT for the Linux kernel on them, we have to use instrumentation rather than sampling.
I mean somewhere in the IR pipeline (e.g. the codegen prepare pass). Such an analysis might be performed in BOLT, but with more work. At the IR level, most of what we need to do is check whether the address of a function is used as an operand of ptrtoint. Of course, it would be better if it were done in BOLT.
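As a rough illustration of the ptrtoint check, the sketch below scans textual LLVM IR for functions whose address is an operand of ptrtoint. It is a regex approximation that assumes opaque-pointer IR (`ptr @name`) and one construct per line; a real pass would walk the uses of each function inside the compiler instead. The IR snippet and all names in it are made up.

```python
import re

# Names introduced by "define"/"declare" are functions.
FUNC_DEF = re.compile(r"^\s*(?:define|declare)\b[^@\n]*@([\w.$]+)\s*\(", re.M)
# "ptrtoint ptr @name to ..." as an instruction or a constant expression.
PTRTOINT_USE = re.compile(r"ptrtoint\s*\(?\s*ptr\s+@([\w.$]+)\s+to\b")


def address_taken_by_ptrtoint(ir_text: str) -> list:
    """Return the sorted names of functions that appear as a ptrtoint operand."""
    functions = set(FUNC_DEF.findall(ir_text))
    used = set(PTRTOINT_USE.findall(ir_text))
    return sorted(functions & used)


EXAMPLE_IR = """
@table = global i64 ptrtoint (ptr @base_fn to i64)
@data = global i32 0
define void @base_fn() { ret void }
define void @other() { ret void }
define i64 @user() {
  %a = ptrtoint ptr @other to i64
  %b = ptrtoint ptr @data to i64
  ret i64 %a
}
"""

keep_address = address_taken_by_ptrtoint(EXAMPLE_IR)
```

Here `@data` is filtered out because it is a global variable, not a function, while `@user` never has its address converted.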
The skip-funcs approach might not be easy to use for users not familiar with the Linux kernel source code. To identify the functions to skip, we often need to test and debug the Linux kernel many times, and we still might miss some. That is a lot of work.
In my own experience, especially with relocation mode enabled, I have spent a lot of time fixing compilation/runtime errors related to functions defined in assembly code.
Handling only functions defined in C code makes development easier and makes BOLT more reliable.
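The idea can be sketched as an allow-list turned into a skip list: given all function symbols and the set of functions known to be defined in C, everything else is skipped. Both input lists are hypothetical here (how they are produced is exactly what is being discussed), and the comma-separated form passed to --skip-funcs is an assumption that should be checked against the BOLT option help.

```python
def build_skip_funcs(all_funcs, c_funcs) -> str:
    """Build a --skip-funcs argument for every function not known to come from C."""
    skipped = sorted(set(all_funcs) - set(c_funcs))
    if not skipped:
        return ""
    return "--skip-funcs=" + ",".join(skipped)


# Hypothetical inputs: symbols from the kernel image, and C functions
# reported by the compiler. Only irq_entries_start is a name from this thread.
all_functions = ["irq_entries_start", "c_func_a", "asm_helper", "c_func_b"]
c_functions = ["c_func_a", "c_func_b"]

skip_arg = build_skip_funcs(all_functions, c_functions)
```

The point of the design is that a newly added assembly function is skipped by default, rather than having to be discovered through a runtime failure.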
I am not sure whether the original code will never be reached, since there are many tricks (and so jumps/calls that might not be identified by BOLT) in the Linux kernel source code.
The analysis in the compiler should be more robust, but then we introduce the dependency on the compiler. Do you have examples of the pointer arithmetic in the kernel beyond pointer comparison?
Here a function address is encoded as the difference from the address of the BASE function. When that function is called, we first need to recover its address by adding back the address of the BASE function. To keep things simple, the BASE function should keep its address.
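A small simulation of this encoding, with made-up names and addresses: offsets are recorded relative to BASE, and a lookup adds the runtime address of BASE back. It is a sketch of the arithmetic only, not of the actual kernel code.

```python
def encode_offsets(addresses: dict, base_name: str) -> dict:
    """Store every function as (address - address of BASE)."""
    base = addresses[base_name]
    return {name: addr - base for name, addr in addresses.items() if name != base_name}


def resolve(offsets: dict, name: str, runtime_base: int) -> int:
    """Recover a function address by adding the BASE address back."""
    return runtime_base + offsets[name]


# Hypothetical layout at the time the offsets were recorded.
layout = {"BASE": 0x1000, "handler_a": 0x1400, "handler_b": 0x1A00}
offsets = encode_offsets(layout, "BASE")

# BASE kept in place: the lookup recovers the recorded address.
ok = resolve(offsets, "handler_a", layout["BASE"])

# BASE moved to 0x9000 while the offset table is unchanged:
# the lookup now points somewhere else entirely.
wrong = resolve(offsets, "handler_a", 0x9000)
```

With BASE pinned, only the table entries need to stay consistent; if BASE moves, every stored difference becomes stale at once.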