[RFC] Optimizing the Linux kernel with AutoFDO including ThinLTO and Propeller

Authors

Rong Xu (xur@google.com) and Han Shen (shenhan@google.com)

Contributors

Krzysztof Pszeniczny (kpszeniczny@google.com), Yonghong Song (yhs@meta.com), Luigi Rizzo (lrizzo@google.com), Sriraman Tallam (tmsriram@google.com), Xinliang David Li (davidxl@google.com), and Stephane Eranian (eranian@google.com)

Summary

We would like to make a data-driven case to integrate AutoFDO support into the Linux kernel. AutoFDO is a profile-guided optimization technique that uses hardware sampling to optimize binaries. Compared to instrumentation-based FDO (iFDO), AutoFDO is significantly more user-friendly and straightforward to apply. While iFDO typically yields better profile quality and hence more performance than AutoFDO, our results demonstrate that AutoFDO remains remarkably effective, bringing performance close to iFDO on the benchmarked applications.

In this post, we present performance improvements from optimizing the kernel with FDO, both via hardware sampling and instrumentation, on micro-benchmarks and large warehouse scale applications. Our data makes a strong case for the inclusion of AutoFDO as a supported feature in the upstream kernel.

Introduction

A significant fraction of fleet processing cycles (excluding idle time) from data center workloads are attributable to the kernel. At Google, to maximize performance, the production kernel is optimized with iFDO (a.k.a instrumented PGO, Profile Guided Optimization).

iFDO can significantly enhance application performance but its use within the kernel has raised concerns[1]. AutoFDO is a variant of FDO that uses the hardware’s Performance Monitoring Unit (PMU) to collect profiling data. While AutoFDO typically yields smaller performance gains than iFDO, it presents unique benefits for optimizing kernels.

AutoFDO eliminates the need for instrumented kernels, allowing a single optimized kernel to serve both execution and profile collection. It also minimizes slowdown during profile collection, potentially yielding higher-fidelity profiling, especially for time-sensitive code, compared to iFDO. Additionally, AutoFDO profiles can be obtained from production environments via the hardware’s PMU, whereas iFDO profiles require carefully curated load tests that are representative of real-world traffic.

Finally, AutoFDO facilitates profile collection across diverse targets. Preliminary studies indicate significant variation in kernel hot spots within Google’s infrastructure, suggesting potential performance gains through target-specific kernel customization.

AutoFDO relies on advanced PMU features such as LBR on Intel machines and cannot be used on architectures that lack such support, for instance AMD Zen2[2]. However, newer generations either already support comparable features (BRS on AMD Zen3, SPE and BRBE on ARM) or plan to support them in the future.

Furthermore, other advanced compiler optimization techniques, including ThinLTO and Propeller, can be stacked on top of AutoFDO, just as with iFDO. We have experimented with AutoFDO combined with ThinLTO and Propeller and share the performance numbers below.

Performance tests

The kernel AutoFDO build support is primarily obtained through changes in the build options. As with user-level AutoFDO, it uses the “perf” tool to collect performance profiles[3]. These perf files are subsequently converted into AutoFDO profiles using offline tools such as create_llvm_prof or llvm-profgen. We will soon send patches for both the Linux kernel and LLVM (for the llvm-profgen change).
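To make the workflow concrete, here is a minimal sketch under the assumptions above: the loadtest script is a placeholder, the perf invocation assumes LBR-capable Intel hardware, and the CLANG_AUTOFDO_PROFILE make variable is an assumed name for the integration the pending patches introduce.

```bash
# 1) Sample the running kernel with LBR while it executes a representative
#    load (system-wide, branch-stack sampling, kernel-only cycles event).
perf record -a -b -e cycles:k -o kernel.perf.data -- ./run_loadtest.sh

# 2) Convert the raw perf data into an AutoFDO profile for the kernel image.
#    llvm-profgen ships with LLVM; the kernel-specific handling is part of
#    the llvm-profgen change mentioned above. create_llvm_prof
#    (https://github.com/google/autofdo) is an alternative converter.
llvm-profgen --perfdata=kernel.perf.data --binary=vmlinux --output=kernel.afdo

# 3) Rebuild the kernel with the profile applied (variable name assumed;
#    the real kconfig/make hook is defined by the kernel patches).
make LLVM=1 CLANG_AUTOFDO_PROFILE=kernel.afdo
```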

Warehouse-scale applications

We evaluated AutoFDO and iFDO kernel performance using two real-world warehouse-scale applications: (1) a Google database application with a significant fraction of its cycles spent in the kernel, and (2) a network-intensive Meta service with 30% of its cycles spent in the kernel.

Google Database application

In this experiment, iFDO and AutoFDO used Neper/tcp_rr and Neper/tcp_stream with 200 flows and 200 threads as their training workload. The performance test was conducted in a load-test environment, ensuring identical machines, binaries, and inputs—the only difference being the kernel image. We measured user-level performance metrics to assess the improvement. Each read/write/query benchmark consisted of 3 iterations, with each iteration collecting 55 data points. The final result was determined by aggregating the averages of these data points. This experiment was performed on the 5.10 kernel.

The overall geomean improvement for AutoFDO is 2.6% compared to 2.9% for iFDO.

| Benchmark | Metric | AutoFDO improvement | iFDO improvement |
|---|---|---|---|
| Google Database | read/write/query, etc. | 2.6% | 2.9% |

Meta service

Yonghong Song (yhs@meta.com) from Meta collaborated with us on this experiment, which was conducted in an A/B test setup. Four machines were allocated and divided into two tiers, T1 and T2, each with two machines. A single production server directed the same traffic to both tiers. Profile collection for iFDO and AutoFDO was performed using the same setup, running the service for 1200 seconds. It’s important to note that the system was not under stress during this experiment, with CPU utilization at approximately 50%. The final evaluation was based on overall CPU utilization (specifically MHz_busy) over a period of 12 hours, with lower values indicating better performance. This experiment was performed on the 6.4 kernel.

The experiments revealed that AutoFDO consumed 1.25% more CPU than iFDO, with iFDO utilizing 50.0% CPU and AutoFDO utilizing 50.64% CPU. Meta had already measured iFDO’s impact on this specific service (30% of cycles in the kernel): an overall CPU utilization reduction of ~6%, composed of a ~3% improvement in kernel-space code and ~3% in user-space code. Based on these results, we can extrapolate the following performance numbers:

| Benchmark | Metric | AutoFDO improvement | iFDO improvement |
|---|---|---|---|
| Meta Service | CPU utilization | ~5% | ~6% |

Micro-benchmarks

In this part, we used two micro-benchmarks, Neper and UnixBench, to evaluate AutoFDO kernel performance. We selected Neper because the network subsystem is the component most effectively optimized by iFDO and AutoFDO; UnixBench was chosen as a comprehensive test of kernel operations. It’s important to note that micro-benchmarks may not accurately reflect production performance; they serve to indicate the performance potential.

Our experimental results showed that system profiles collected during low system load do not perform as well as those collected during high system load[4]. While manually removing samples from system idle functions can improve performance, the results are still inferior compared to profiles obtained during high system load.

As a result, we always generate the profile under high system load. Specifically, for the Neper programs we use a 100-flow, 10-thread, local-network configuration during profile collection, whereas for the performance tests we use a 1-flow, 1-thread, local-network configuration to ensure reliable results. For UnixBench, we collect the profile using a 112-instance configuration and evaluate performance for both the 1-instance and 112-instance configurations.
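For concreteness, the two configurations look roughly like the commands below, using Neper’s and UnixBench’s standard command-line options; host, durations, and paths are placeholders.

```bash
# Profile collection: stress the network stack (100 flows, 10 threads,
# local network) while perf sampling is running.
./tcp_rr &                                    # server
./tcp_rr -c -H 127.0.0.1 -F 100 -T 10 -l 60   # client: 100 flows, 10 threads

# Performance measurement: a light 1-flow, 1-thread run for stable numbers.
./tcp_rr -c -H 127.0.0.1 -F 1 -T 1 -l 60

# UnixBench: collect the profile at 112 parallel copies, then evaluate
# at both 1 and 112 copies.
./Run -c 112
./Run -c 1
```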

We did the following to reduce noise in the performance testing: (1) we fixed the machine frequency throughout the tests, and (2) only C-state 0 was enabled, while the deeper C-states were disabled.
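A setup of this kind can be scripted roughly as follows; this is a sketch, the pinned frequency is machine-specific, and cpuidle state numbering can differ across platforms.

```bash
# Pin the CPU frequency: performance governor plus identical min/max limits.
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 2000MHz -u 2000MHz   # example frequency

# Keep only cpuidle state0; disable all deeper C-states.
for s in /sys/devices/system/cpu/cpu*/cpuidle/state[1-9]*; do
    echo 1 | sudo tee "$s/disable" > /dev/null
done
```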

We used the same machine and performed multiple runs to ensure statistically significant results. These experiments were performed on the upstream 6.8 kernel.

Results

| Benchmark | Metric | AutoFDO improvement | iFDO improvement |
|---|---|---|---|
| Neper / tcp_rr | Latency | 10.6% | 11.8% |
| Neper / tcp_stream | Throughput | 6.1% | 6.7% |
| UnixBench (1-instance) | Index score | 2.2% | 3.0% |
| UnixBench (112-instance) | Index score | 2.6% | 2.6% |

In the table, the tcp_rr latency improvement is computed as the geometric mean of the P1, P50, P99, and mean latency improvements.

ThinLTO and Propeller on top of AutoFDO

ThinLTO achieves better runtime performance through whole-program analysis and cross module optimizations. The main difference between traditional LTO and ThinLTO is that the latter is scalable in time and memory. Propeller is a profile-guided, post-link optimizer that improves the performance of large-scale applications compiled with LLVM. It operates by relinking the binary based on an additional round of runtime profiles, enabling precise optimizations that are not possible at compile time.

AutoFDO and iFDO both work with ThinLTO and Propeller, and it is advisable to apply ThinLTO or ThinLTO+Propeller on top of AutoFDO/iFDO to further improve performance. The combination of AutoFDO/iFDO with ThinLTO+Propeller is among the best optimizations current compiler technology can offer.
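At a high level, the stacking looks like the sketch below. The kernel make variables are assumed names for the hooks the pending patches introduce, and the create_llvm_prof flags mirror the user-space Propeller workflow, so treat all of them as illustrative rather than the final interface.

```bash
# AutoFDO + ThinLTO kernel build (CONFIG_LTO_CLANG_THIN already exists
# upstream; the AutoFDO make variable is assumed here).
make LLVM=1 CLANG_AUTOFDO_PROFILE=kernel.afdo

# Propeller requires one more round of sampling against this optimized kernel.
perf record -a -b -e cycles:k -o propeller.perf.data -- ./run_loadtest.sh

# Convert the samples into basic-block layout and symbol-ordering profiles
# with create_llvm_prof (https://github.com/google/autofdo); flag names
# follow the user-space Propeller flow.
create_llvm_prof --format=propeller --binary=vmlinux \
                 --profile=propeller.perf.data --out=propeller_cc.txt \
                 --propeller_symorder=propeller_sym_order.txt

# Relink/rebuild with the Propeller profiles applied (variable name assumed).
make LLVM=1 CLANG_AUTOFDO_PROFILE=kernel.afdo \
     CLANG_PROPELLER_PROFILE_PREFIX=propeller
```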

Experiments on tcp_rr latency show that iFDO+ThinLTO brings a 1.3% latency reduction over iFDO, and iFDO+ThinLTO+Propeller brings a further 2.0% latency reduction over iFDO+ThinLTO. Put together, iFDO+ThinLTO+Propeller reduces latency by 3.3% over iFDO. Similarly, AutoFDO+ThinLTO+Propeller reduces latency by 4.5% over AutoFDO. With ThinLTO and Propeller enabled, AutoFDO obtains performance equivalent to iFDO for tcp_rr.

| | +ThinLTO | +ThinLTO+Propeller |
|---|---|---|
| AutoFDO | 1.8% | 4.5% |
| iFDO | 1.3% | 3.3% |

Meta also measured the performance of AutoFDO with ThinLTO for the previously discussed application. The results showed an improvement of approximately 1%, which is comparable to the ThinLTO improvement over iFDO.

Analysis of the performance gains

It is known that iFDO/AutoFDO also improves code layout by optimizing branch instructions (leading to more fall-throughs) and improving the efficiency of the i-cache and i-TLB. This is also the case for the kernel. The following heat maps are from Neper/tcp_rr:

Instruction heatmap in kernel code space for tcp_rr:

The x-axis is time and the y-axis is the relative instruction virtual address. Both iFDO and AutoFDO demonstrate a more compact hot-instruction band compared to the no-FDO build, as seen from the plot. It is worth noting that AutoFDO and iFDO place the hot text at different relative addresses. This discrepancy arises because AutoFDO does not presume the profile’s accuracy by default; consequently, it refrains from aggressively marking functions as cold, unlike iFDO. However, enabling the profile-accurate option in AutoFDO produces a layout similar to the one shown in the following plot.

[Figure: AutoFDO kernel instruction heatmap with the profile-accurate option enabled]

The plots reveal a striking similarity between AutoFDO and iFDO in terms of their kernel code layout.

The plots below show how ThinLTO and Propeller further improve the code layout. Notably, for both AutoFDO and iFDO, the combination of ThinLTO and Propeller produces the tightest hot instruction band.

Perf stats for tcp_rr

The table below lists several relevant performance event counters, collected during a neper/tcp_rr run of 600,000 transactions. AutoFDO and iFDO both exhibit substantial decreases in taken branches and L1 instruction cache misses. Since this Neper workload has a minuscule memory footprint, the iTLB numbers do not have a significant impact.

| Event | No-FDO | AutoFDO | iFDO | AutoFDO + ThinLTO + Propeller | iFDO + ThinLTO + Propeller |
|---|---|---|---|---|---|
| Instructions | 295,987,538,935 | 275,778,116,664 | 270,035,174,782 | 241,879,509,739 | 234,119,332,317 |
| L1-icache-miss | 3,166,611,857 | 2,200,281,307 | 2,130,689,871 | 1,492,354,080 | 1,290,452,682 |
| iTLB | 9,887,397 | 10,321,133 | 10,216,133 | 12,853,700 | 9,369,720 |
| br_inst_retired.near_taken | 36,955,375,430 | 32,416,170,190 | 32,002,279,883 | 28,490,665,793 | 26,030,665,264 |
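For reference, counters of this kind can be gathered with a perf stat invocation along the following lines. This is a sketch: br_inst_retired.near_taken is an Intel-specific event whose availability depends on the perf event tables for the CPU, and the client command is a placeholder.

```bash
# Count the events of interest system-wide while the benchmark runs.
perf stat -a -e instructions,L1-icache-load-misses,iTLB-load-misses \
          -e br_inst_retired.near_taken \
          -- ./tcp_rr -c -H 127.0.0.1 -F 1 -T 1 -l 60
```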

Appendix A: Neper detailed performance (1-flow, 1-thread, and local network)

| Benchmark | Metric | AutoFDO improvement | iFDO improvement |
|---|---|---|---|
| tcp_rr | Latency P1 | 9.4% | 10.9% |
| tcp_rr | Latency P50 | 11.1% | 12.2% |
| tcp_rr | Latency P99 | 9.9% | 10.8% |
| tcp_rr | Mean latency | 12.3% | 13.6% |
| tcp_stream | Throughput | 6.1% | 6.7% |

Appendix B: UnixBench detailed performance

| 1-instance UnixBench | AutoFDO improvement | iFDO improvement |
|---|---|---|
| Dhrystone 2 using register variables | -0.3% | -0.3% |
| Double-Precision Whetstone | -0.1% | -0.1% |
| Execl Throughput | 1.9% | 2.5% |
| File Copy 1024 bufsize 2000 maxblocks | 3.0% | 2.7% |
| File Copy 256 bufsize 500 maxblocks | 4.3% | 3.7% |
| File Copy 4096 bufsize 8000 maxblocks | 3.4% | 2.7% |
| Pipe Throughput | 0.8% | 0.7% |
| Pipe-based Context Switching | 5.5% | 7.9% |
| Process Creation | 8.2% | 11.3% |
| Shell Scripts (1 concurrent) | 1.7% | 2.9% |
| Shell Scripts (8 concurrent) | 1.9% | 3.6% |
| System Call Overhead | -3.0% | -0.6% |
| System Benchmarks Index Score | 2.2% | 3.0% |

| 112-instance UnixBench | AutoFDO improvement | iFDO improvement |
|---|---|---|
| Dhrystone 2 using register variables | 0% | 0% |
| Double-Precision Whetstone | 0% | 0% |
| Execl Throughput | -7.6% | -3.7% |
| File Copy 1024 bufsize 2000 maxblocks | -3.2% | -3.3% |
| File Copy 256 bufsize 500 maxblocks | 3.5% | 6.7% |
| File Copy 4096 bufsize 8000 maxblocks | 3.0% | -2.9% |
| Pipe Throughput | 0.8% | 3.3% |
| Pipe-based Context Switching | 5.2% | 6.2% |
| Process Creation | 3.1% | 1.9% |
| Shell Scripts (1 concurrent) | 16.6% | 14.6% |
| Shell Scripts (8 concurrent) | 13.2% | 10.4% |
| System Call Overhead | -1.0% | 0.9% |
| System Benchmarks Index Score | 2.6% | 2.6% |

Notes


  1. Linus Torvalds wrote: “That odd decision [to use iFDO] seems to not be documented anywhere, and it seems odd and counter-productive, and causes all that odd special buffer handling and that vmlinux.profraw file etc.“ ↩︎

  2. As Linus observed: “I agree that perf profiling works best on Intel. The AMD perf side works ok in Zen 2 from what I’ve seen, but needs to be a full-system profile (“perf record -a”) to use the better options, and ARM is… But with x86 ranging from “excellent” to “usable”, and ARM hopefully being at least close to getting better proper profile data, I really think it’s the way forward, with instrumentation being a band-aid at best.” ↩︎

  3. In production, a continuous sampler can generate AutoFDO profiles by automating the process. ↩︎

  4. Production sampling of the kernel can happen when the utilization is high to ensure higher quality profiles. ↩︎


Great read, and interesting results!

Some thoughts:

  1. I’m very happy you quoted Linus and brought up the issues around the lack of usable hardware performance counters across distinct architectures and microarchitectures. Please consider adding links to the relevant quotes from lore.kernel.org.
  2. Soon, we will send patches for both the Linux kernel and LLVM (for the llvm_profgen change).

I look forward to seeing these. Please cc me, and consider linking to them from here when they are available. As far as I’m concerned, all of the benchmarks here are theoretical until patches are publicly available for others to reproduce. But I am very happy to see you did work with yhs @ Meta.

  3. The instruction heat maps are cool, what did you use to generate them? I hope it’s a publicly available tool that I’m simply unfamiliar with?

  4. The results showed an improvement of approximately 1%, which is comparable to the ThinLTO improvement over iFDO.

I think that can be slightly reworded to “comparable to the improvement ThinLTO provided on top of iFDO”; otherwise my initial read was that AutoFDO+ThinLTO was being compared to iFDO (without ThinLTO). But maybe it’s just poor reading comprehension on my part.

  5. I know @maksfb has been working on getting BOLT able to build the Linux kernel. That has taken significant effort to produce a bootable kernel image; the kernel has a lot of custom ELF sections that don’t like code motion. I’d love to hear more about:
    a. What were some of the changes necessary to the kernel and propeller framework along those lines. Perhaps posting patches would address this.
    b. A comparison against a BOLT’d kernel. A friendly competition here is in users’ best interest. While the approaches of BOLT and Propeller are distinct, I’m sure there are learnings that can be shared across the fence here.

  6. It would be great if you could provide instructions for others to reproduce this using the standard off-the-shelf Linux perf tool. Google has too many proprietary profile collection utilities, which has meant that getting AutoFDO working for the Linux kernel has generally not been reproducible for folks outside of Google. That is a problem. The process should be clearly documented, ideally in the upstream Linux kernel documentation published to docs.kernel.org.

Great read, great work, nice results! Thanks for all of the work that went into this, and the write up. I look forward to having even faster kernels! When I started work on building the kernel with clang, I dreamed of the day we could employ these LLVM technologies to kernel builds. Thanks for making that dream finally a reality.


Thanks, Nick, I’ll address some of your comments. And Rong can answer the rest of them.

  3. The instruction heat maps are cool, what did you use to generate them? I hope it’s a publicly available tool that I’m simply unfamiliar with?

This is an in-house bash script tool that uses “perf report” and gnuplot to do the work. Here is a similar script we used for the Propeller heatmap. For the kernel, we used a different one, but the idea is the same.
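For anyone who wants to approximate the idea with stock tools, a rough sketch (not the actual in-house script) is shown below: it dumps sampled kernel addresses over time with perf script and scatter-plots them with gnuplot. It assumes gawk and a readable /proc/kallsyms (root or kptr_restrict=0).

```bash
# Dump "timestamp ip" pairs for the sampled kernel instructions.
perf script -i perf.data -F time,ip > samples.txt

# Convert absolute addresses into offsets from the kernel text start (_stext).
BASE=$(gawk '/ _stext$/ { print strtonum("0x" $1) }' /proc/kallsyms)
gawk -v base="$BASE" '{ gsub(":", "", $1);
                        print $1, strtonum("0x" $2) - base }' samples.txt > heatmap.dat

# Scatter-plot time vs. relative instruction address.
gnuplot -e "set terminal png size 1200,600; set output 'heatmap.png'; \
            set xlabel 'time (s)'; set ylabel 'relative instruction address'; \
            plot 'heatmap.dat' using 1:2 with dots notitle"
```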

a. What were some of the changes necessary to the kernel and propeller framework along those lines. Perhaps posting patches would address this.

In short, in the patch for AutoFDO, there are specific bits in each sub-makefile that disable AutoFDO code optimization for that subdirectory or for some files in that subdirectory. These bits need to be maintained as the kernel evolves. (Propeller adopts the same approach.)

  6. It would be great if you could provide instructions for others to reproduce this using the standard off-the-shelf Linux perf tool. Google has too many proprietary profile collection utilities that make it such that getting AutoFDO working for the Linux kernel has generally not been possible for folks outside of Google to reproduce.

All the experiments will be reproducible shortly using standard tools, once Rong lands a patch for llvm-profgen. The most important of those tools are “perf” and “llvm-profgen”. llvm-profgen is part of the LLVM project and maintained upstream; it can be built as part of LLVM. create_llvm_prof, which is maintained by Google, is also a viable alternative to llvm-profgen; we synced that repo to our internal version just two weeks ago. Both should yield similar performance gains.
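For reference, here is a sketch of how the two converters can be built and invoked today; the kernel-specific handling arrives with the pending patches, and the flags shown are the existing user-space interfaces.

```bash
# llvm-profgen is built as a normal LLVM tool from an llvm-project checkout.
cmake -G Ninja -S llvm -B build -DCMAKE_BUILD_TYPE=Release
ninja -C build llvm-profgen
build/bin/llvm-profgen --perfdata=kernel.perf.data --binary=vmlinux \
                       --output=kernel.afdo

# create_llvm_prof from https://github.com/google/autofdo is the alternative.
create_llvm_prof --format=extbinary --binary=vmlinux \
                 --profile=kernel.perf.data --out=kernel.afdo
```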

Thanks Nick for giving us good feedback and suggestions.

For reasons I’m not aware of, I’m unable to make edits to the post anymore. Here I include the references for the quotes by Linus: the first one was in the kernel PGO code review, and the second was in this link.

Definitely. We will CC you for kernel code review. Unlike the kernel iFDO (PGO), the change for AutoFDO kernel support is straightforward. However, as you mentioned, we need to provide clear workflow instructions in the patch.

Neper and UnixBench experiments should be reproducible, and we encourage others to verify the performance. I believe we’ve already included our experimental setup in the post.

As Han has replied, it’s just a timeline plot of retired instructions using the relative virtual address. I’ll find a good place to share this script.

Thanks for pointing this out. Unfortunately I cannot make edits to the post.

As Han already mentioned, we found it also straightforward to apply Propeller to the kernel build. We need some linker-script changes, but they are also needed for the function-sections or machine-function-split changes, which we plan to do in a follow-up patch after the initial AutoFDO patch.

We are also curious about BOLT’s performance on the kernel. We will be happy to do the comparison when BOLT is ready.

Our plan was to document the kernel AutoFDO workflow as an .rst file in the patch. If needed, we will add the documentation to docs.kernel.org afterwards.

So in other words, Propeller recovers most of the performance lost by going from iFDO to the more easily deployable AutoFDO.

Yes, that’s true. Currently, without Propeller, AutoFDO already recovers ~80% of the iFDO gain. Note that stacking Propeller on top of AutoFDO requires another round of sample profiling.

We did limited benchmarks with Propeller. For tcp_rr, Propeller works more effectively on top of AutoFDO. The performance of AutoFDO and iFDO is almost identical with Propeller.

However, this may differ for other workloads, and we are conducting experiments with other programs. Nevertheless, we think it is reasonable for Propeller to deliver further improvements, as it recovers more of the loss caused by profile quality.

At Meta, we successfully applied BOLT from trunk LLVM to a production kernel and measured a 2% performance gain for one of our top applications. For the experiment, BOLT was applied to the Linux kernel on top of iFDO+LTO.

I will write detailed instructions for using BOLT later, but it’s not that much different from using BOLT for user-space applications. The kernel didn’t require any modifications other than a single linker-script patch to reserve 2MB of space in .text. This patch is optional, but it is beneficial to performance.

We had to add several features to BOLT in order to support assembly-level optimizations in the kernel and to make sure debuggability and observability are preserved. As a result, BOLT now provides a unique disassembly view of the kernel where we annotate assembly instructions with extra information that is otherwise impossible to see using tools like objdump. I believe this feature alone (perhaps refactored into a new CLI) is quite handy for working with the kernel.


It is fantastic news that BOLT is now fully functional for production use. Achieving a 2% gain on such applications is a significant accomplishment.

We will be excited to test BOLT and compare it with Propeller. Ideally, these two systems could learn from each other and give users the flexibility to choose what fits their usage.

Yes. The linker script change is needed anyway – the current support for function-sections is broken.


Consider submitting a CFP to Linux Plumber’s Conf’s Toolchain Track regarding this topic: CFP deadline for LPC 2024 Toolchains Track is approaching - Nathan Chancellor

Looks like we are going to cover AutoFDO, ThinLTO, Propeller, and BOLT at LPC 2024.

I published documentation on applying BOLT to the Linux kernel some time ago. Posting it here for reference: Optimizing Linux Kernel with BOLT.


Nice! What is the likelihood of upstreaming those two patches into the kernel proper? I’d recommend wrapping them in some sort of Kconfig choice, so that we’re not reserving space for unused sections otherwise.

Let me [belatedly] add my voice to the chorus and congratulate Google and Meta (including BOLT team) engineers with this impressive achievement!

A 5-7% performance gain (adding the AutoFDO gain reported by @xur to the 2% post-link binary-optimizer gain reported by @maksfb) for all typical[1] WSC-class applications running on Linux is just incredible!

Well done! Can’t wait to try this myself on our workloads!

Andrey


  1. Or at least “all written in unmanaged languages”, as I expect gains for applications written in managed languages, like Go and Java, to be much smaller. ↩︎

@xur @maksfb I guess Google and Meta are mostly concerned with x86-64. I wonder if your developments (AutoFDO, Propeller and BOLT for Linux kernel) are tested / validated on ARM64?

If not, do you expect any ARM64-specific challenges? (I’m aware of LBR vs SPE, but this is probably encapsulated in perf / create_llvm_prof and already supported?)

In general, what is your expectation WRT enabling on ARM64 – do you expect it to work from the get go?

Andrey

Hi Andrey,

We appreciate your interest and support! Our AutoFDO / Propeller patch has been accepted into the kbuild tree of the Linux kernel. You can also find it in the linux-next tree. Hopefully, it will be merged into the main tree soon.

For reference, here is the link to our patch: https://lore.kernel.org/lkml/20241102175115.1769468-1-xur@google.com/

We haven’t yet focused on ARM64 support. However, we believe that with minor work on the tools, SPE should function correctly on ARM64.

Yabin Cui has posted a patch to enable AutoFDO for ARM64 using ETM, on top of our AutoFDO patch.
https://lore.kernel.org/linux-arm-kernel/202411200958.F8A656A080@keescook/T/

He reported: “Experiments on Android show 4% improvement in cold app startup time and 13% improvement in binder benchmarks.”

George Burgess also reported some performance gains using native ARM64 profiles (with the x86 profile as the baseline). This is on ChromeOS.

I hope this helps.

-Rong


@xur Thanks for the update! – good to know. And again, congratulations with this achievement!

Congrats on getting this upstreamed properly into Linux 6.13!
