[RFC] Optimizing the Linux kernel with AutoFDO including ThinLTO and Propeller
Authors
Rong Xu (xur@google.com) and Han Shen (shenhan@google.com)
Contributors
Krzysztof Pszeniczny (kpszeniczny@google.com), Yonghong Song (yhs@meta.com), Luigi Rizzo (lrizzo@google.com), Sriraman Tallam (tmsriram@google.com), Xinliang David Li (davidxl@google.com), and Stephane Eranian (eranian@google.com)
Summary
We would like to make a data-driven case for integrating AutoFDO support into the Linux kernel. AutoFDO is a profile-guided optimization technique that uses hardware sampling to optimize binaries. Compared to instrumentation-based FDO (iFDO), AutoFDO is significantly more user-friendly and straightforward to apply. While iFDO typically yields better profile quality, and hence better performance, than AutoFDO, our results demonstrate that AutoFDO is remarkably effective, bringing performance close to that of iFDO on benchmark applications.
In this post, we present performance improvements from optimizing the kernel with FDO, via both hardware sampling and instrumentation, on micro-benchmarks and large warehouse-scale applications. Our data makes a strong case for including AutoFDO as a supported feature in the upstream kernel.
Introduction
A significant fraction of fleet processing cycles (excluding idle time) in data center workloads is attributable to the kernel. At Google, to maximize performance, the production kernel is optimized with iFDO (a.k.a. instrumented PGO, Profile Guided Optimization).
iFDO can significantly enhance application performance but its use within the kernel has raised concerns[1]. AutoFDO is a variant of FDO that uses the hardware’s Performance Monitoring Unit (PMU) to collect profiling data. While AutoFDO typically yields smaller performance gains than iFDO, it presents unique benefits for optimizing kernels.
AutoFDO eliminates the need for instrumented kernels, allowing a single optimized kernel to serve both execution and profile collection. It also minimizes slowdown during profile collection, potentially yielding higher-fidelity profiling, especially for time-sensitive code, compared to iFDO. Additionally, AutoFDO profiles can be obtained from production environments via the hardware's PMU, whereas iFDO profiles require carefully curated load tests that are representative of real-world traffic.
Finally, AutoFDO facilitates profile collection across diverse targets. Preliminary studies indicate significant variation in kernel hot spots within Google’s infrastructure, suggesting potential performance gains through target-specific kernel customization.
AutoFDO relies on advanced PMU features, such as LBR on Intel machines, and cannot be used on architectures that lack such support, for instance AMD Zen2[2]. However, newer generations either already support comparable features (BRS on AMD Zen3, SPE on ARM, and BRBE on ARM) or plan to support them in the future.
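For illustration, kernel-wide branch-stack sampling with perf might look like the sketch below; the event choice, output path, and 60-second duration are arbitrary assumptions, not part of this proposal:

```sh
# System-wide sampling with branch stacks; -b requires LBR or an
# equivalent PMU feature (e.g. BRS, SPE, or BRBE on non-Intel parts).
perf record -a -b -e cycles -o kernel.perf.data -- sleep 60
```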
Furthermore, other advanced compiler optimization techniques, including ThinLTO and Propeller, can be stacked on top of AutoFDO, just as with iFDO. We have experimented with AutoFDO combined with ThinLTO and Propeller, and share the performance numbers below.
Performance tests
Kernel AutoFDO build support is obtained primarily through changes to the build options. As with user-level AutoFDO, the "perf" tool is used to collect performance profiles[3]. These perf files are subsequently converted into AutoFDO profiles using offline tools such as create_llvm_prof or llvm-profgen. We will soon send patches for both the Linux kernel and LLVM (for the llvm-profgen change).
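A rough sketch of the end-to-end flow is shown below. The options are those of the user-space AutoFDO/LLVM tools; how the profile is wired into the kernel build (a config option versus KCFLAGS, the kernel.afdo file name) is an assumption here and will be defined by the forthcoming patches:

```sh
# Convert the perf data collected above into an AutoFDO profile, using either
# create_llvm_prof from the AutoFDO project or LLVM's llvm-profgen.
create_llvm_prof --binary=vmlinux --profile=kernel.perf.data \
    --format=extbinary --out=kernel.afdo
# (alternatively: llvm-profgen --perfdata=kernel.perf.data --binary=vmlinux \
#                 --output=kernel.afdo)

# Rebuild the kernel with the sample profile; -fprofile-sample-use is the
# Clang flag that consumes AutoFDO profiles.
make LLVM=1 KCFLAGS="-fprofile-sample-use=$(pwd)/kernel.afdo"
```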
Warehouse-scale applications
We evaluated AutoFDO and iFDO kernel performance using two real-world warehouse-scale applications: (1) a Google database application that spends a significant fraction of its cycles in the kernel, and (2) a network-intensive Meta service that spends 30% of its cycles in the kernel.
Google Database application
In this experiment, iFDO and AutoFDO used Neper/tcp_rr and Neper/tcp_stream with 200 flows and 200 threads as their training workload. The performance test was conducted in a load-test environment, ensuring identical machines, binaries, and inputs—the only difference being the kernel image. We measured user-level performance metrics to assess the improvement. Each read/write/query benchmark consisted of 3 iterations, with each iteration collecting 55 data points. The final result was determined by aggregating the averages of these data points. This experiment was performed on the 5.10 kernel.
The overall geomean improvement for AutoFDO is 2.6% compared to 2.9% for iFDO.
Benchmark | Metrics | AutoFDO improvement | iFDO improvement |
---|---|---|---|
Google Database | read/write/query/etc | 2.6% | 2.9% |
Meta service
Yonghong Song (yhs@meta.com) from Meta collaborated with us on this experiment, which was conducted in an A/B test setup. Four machines were allocated and divided into two tiers, T1 and T2, each with two machines. A single production server directed the same traffic to both tiers. Profile collection for iFDO and AutoFDO was performed using the same setup, running the service for 1200 seconds. It’s important to note that the system was not under stress during this experiment, with CPU utilization at approximately 50%. The final evaluation was based on overall CPU utilization (specifically MHz_busy) over a period of 12 hours, with lower values indicating better performance. This experiment was performed on the 6.4 kernel.
The experiments revealed that AutoFDO consumed 1.25% more CPU than iFDO, with iFDO utilizing 50.0% CPU and AutoFDO utilizing 50.64% CPU. Meta, already aware of iFDO’s performance improvement on this specific service (30% of cycles in the kernel), had observed an overall CPU utilization reduction of ~6%, roughly ~3% each from kernel-space and user-space code. Based on these results, we can extrapolate the following performance numbers:
Benchmark | Metrics | AutoFDO improvement | iFDO improvement |
---|---|---|---|
Meta Service | CPU utilization | ~5% | ~6% |
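One way to read the ~5% extrapolation above, assuming the utilization deltas compose multiplicatively (this is our reading of the A/B numbers, not a direct measurement):

```latex
% iFDO reduces CPU utilization by ~6%; AutoFDO uses ~1.25% more CPU than iFDO.
\text{AutoFDO improvement} \approx 1 - (1 - 0.06) \times 1.0125 \approx 0.048 \approx 5\%
```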
Micro-benchmarks
In this part, we used two micro-benchmarks, Neper and UnixBench, to evaluate AutoFDO kernel performance. We selected Neper because the network subsystem is the component most effectively optimized by iFDO and AutoFDO. UnixBench was chosen as a comprehensive test of kernel operations. It’s important to note that micro-benchmarks may not accurately reflect production performance and serve only to indicate the performance potential.
Our experimental results showed that system profiles collected during low system load do not perform as well as those collected during high system load[4]. While manually removing samples from system idle functions can improve performance, the results are still inferior to those obtained with profiles collected during high system load.
As a result, we always use high system load to generate the profile. Specifically, for the Neper programs, we use a 100-flow, 10-thread, local-network configuration during profile collection. In contrast, for the performance tests, we use a 1-flow, 1-thread, local-network configuration to ensure reliable results. In the case of UnixBench, we collect the profile using a 112-instance configuration and evaluate performance for both the 1-instance and 112-instance configurations.
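For reference, the Neper invocations are sketched below using neper's documented client flags; treat the exact spellings and the loopback address as illustrative assumptions:

```sh
# Profile collection: high load, 100 flows spread over 10 threads, local network.
./tcp_rr                                                     # server side
./tcp_rr -c -H 127.0.0.1 --num-flows=100 --num-threads=10    # client side

# Performance measurement: 1 flow, 1 thread, local network.
./tcp_rr -c -H 127.0.0.1 --num-flows=1 --num-threads=1
```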
We did the following to reduce noise in the performance testing: (1) we fixed the machine frequency throughout the tests; (2) only C-state C0 was enabled, while the deeper C-states were disabled.
We used the same machine and performed multiple runs to ensure statistically significant results. These experiments were performed on the upstream 6.8 kernel.
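The frequency and C-state settings can be applied through standard Linux interfaces; a minimal sketch follows (the performance governor is one common way to hold frequency steady, and the sysfs glob assumes the usual cpuidle layout):

```sh
# Hold CPU frequency steady via the performance governor.
cpupower frequency-set -g performance

# Keep only C0: disable every deeper C-state exposed by cpuidle.
for s in /sys/devices/system/cpu/cpu*/cpuidle/state[1-9]*/disable; do
    echo 1 > "$s"
done
```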
Results
Benchmark | Metrics | AutoFDO improvement | iFDO improvement |
---|---|---|---|
Neper / tcp_rr | Latency | 10.6% | 11.8% |
Neper / tcp_stream | Throughput | 6.1% | 6.7% |
UnixBench (1-instance) | Index score | 2.2% | 3.0% |
UnixBench (112-instance) | Index score | 2.6% | 2.6% |
In the table, the latency improvement for tcp_rr is computed as the geometric mean of the P1, P50, P99, and mean latency improvements (see Appendix A).
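For example, plugging the AutoFDO and iFDO tcp_rr numbers from Appendix A into that geometric mean gives:

```latex
\underbrace{\sqrt[4]{0.094 \times 0.111 \times 0.099 \times 0.123}}_{\text{AutoFDO}} \approx 0.106,
\qquad
\underbrace{\sqrt[4]{0.109 \times 0.122 \times 0.108 \times 0.136}}_{\text{iFDO}} \approx 0.118
```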
ThinLTO and Propeller on top of AutoFDO
ThinLTO achieves better runtime performance through whole-program analysis and cross-module optimizations. The main difference between traditional LTO and ThinLTO is that the latter is scalable in time and memory. Propeller is a profile-guided, post-link optimizer that improves the performance of large-scale applications compiled with LLVM. It operates by relinking the binary based on an additional round of runtime profiles, enabling precise optimizations that are not possible at compile time.
AutoFDO and iFDO both work with ThinLTO and Propeller, and it is advisable to apply ThinLTO or ThinLTO+Propeller on top of AutoFDO/iFDO to further improve performance. The combination of AutoFDO/iFDO with ThinLTO+Propeller is among the best optimizations current compiler techniques can offer.
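As a rough illustration of how these optimizations stack at build time, the flags below follow the user-space Clang/LLD Propeller workflow; the kernel-side integration and the profile file names (kernel.afdo and the two propeller_* files) are assumptions:

```sh
# AutoFDO + ThinLTO: ThinLTO on top of the sample profile.
clang -O2 -flto=thin -fprofile-sample-use=kernel.afdo ...

# + Propeller: relink with basic-block layout decisions derived from a second
#   round of LBR profiling (cluster and symbol-order files produced offline).
clang -O2 -flto=thin -fprofile-sample-use=kernel.afdo \
    -fbasic-block-sections=list=propeller_cc_profile.txt \
    -Wl,--symbol-ordering-file=propeller_ld_profile.txt ...
```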
Experiments on tcp_rr latency show that iFDO+ThinLTO brings a 1.3% latency reduction over iFDO, and iFDO+ThinLTO+Propeller brings a further 2.0% latency reduction over iFDO+ThinLTO. Put together, iFDO+ThinLTO+Propeller reduces latency by 3.3% over iFDO. Similarly, AutoFDO+ThinLTO+Propeller reduces latency by 4.5% over AutoFDO. When ThinLTO and Propeller are enabled, AutoFDO obtains performance equivalent to iFDO in the case of tcp_rr.
Latency reduction vs. FDO-only | +ThinLTO | +ThinLTO+Propeller |
---|---|---|
AutoFDO | 1.8% | 4.5% |
iFDO | 1.3% | 3.3% |
Meta also measured the performance of AutoFDO with ThinLTO for the previously discussed application. The results showed an improvement of approximately 1%, which is comparable to the ThinLTO improvement over iFDO.
Analysis of the performance gains
iFDO and AutoFDO are known to improve code layout by optimizing branch instructions (leading to more fall-throughs) and improving the efficiency of the i-cache and i-TLB. This is also the case for the kernel. The following heat maps are from Neper/tcp_rr:
Instruction heatmap in kernel code space for tcp_rr:
The x-axis is time and the y-axis is the relative instruction virtual address. As seen from the plot, both iFDO and AutoFDO demonstrate a more compact hot-instruction band compared to the no-FDO build. It is worth noting that AutoFDO and iFDO place the hot text at different relative addresses. This discrepancy arises because AutoFDO does not presume the profile to be accurate by default; consequently, it refrains from aggressively marking functions as cold, unlike iFDO. However, enabling the profile-accurate option in AutoFDO results in a layout similar to iFDO’s, as shown in the following plot.
The plots reveal a striking similarity between AutoFDO and iFDO in terms of their kernel code layout.
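The profile-accurate behavior mentioned above corresponds to an existing Clang sample-profile flag; a hedged example of how it could be enabled (the kernel build plumbing is an assumption):

```sh
# Treat functions with no samples as cold rather than unknown.
clang -O2 -fprofile-sample-use=kernel.afdo -fprofile-sample-accurate ...
```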
The plots below show how ThinLTO and Propeller further improve the code layout. Notably, for both AutoFDO and iFDO, the combination of ThinLTO and Propeller produces the tightest hot instruction band.
Perf stats for tcp_rr
The table below lists several relevant performance event counters. These counters were collected during an execution of neper/tcp_rr consisting of 600,000 transactions. AutoFDO and iFDO both exhibit substantial decreases in taken branches and L1 instruction cache misses. Since this Neper workload has a minuscule memory footprint, the iTLB numbers do not have a significant impact.
Event | No-FDO | AutoFDO | iFDO | AutoFDO + ThinLTO + Propeller | iFDO + ThinLTO + Propeller |
---|---|---|---|---|---|
Instructions | 295,987,538,935 | 275,778,116,664 | 270,035,174,782 | 241,879,509,739 | 234,119,332,317 |
L1-icache-miss | 3,166,611,857 | 2,200,281,307 | 2,130,689,871 | 1,492,354,080 | 1,290,452,682 |
iTLB | 9,887,397 | 10,321,133 | 10,216,133 | 12,853,700 | 9,369,720 |
br_inst_retired.near_taken | 36,955,375,430 | 32,416,170,190 | 32,002,279,883 | 28,490,665,793 | 26,030,665,264 |
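For reference, counters like those in the table can be gathered with perf stat; the event spellings below are typical for Intel CPUs and may differ on other hardware or perf versions:

```sh
# Count instructions, L1 icache misses, iTLB misses, and taken branches
# system-wide for the duration of a tcp_rr run.
perf stat -a \
    -e instructions,L1-icache-load-misses,iTLB-load-misses,br_inst_retired.near_taken \
    -- ./tcp_rr -c -H 127.0.0.1 --num-flows=1 --num-threads=1
```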
Appendix A: Neper detailed performance (1-flow, 1-thread, and local network)
Benchmark | Metrics | AutoFDO improvement | iFDO improvement |
---|---|---|---|
tcp_rr | Latency p1 | 9.4% | 10.9% |
tcp_rr | Latency p50 | 11.1% | 12.2% |
tcp_rr | Latency p99 | 9.9% | 10.8% |
tcp_rr | Mean Latency | 12.3% | 13.6% |
tcp_stream | Throughput | 6.1% | 6.7% |
Appendix B: UnixBench detailed performance
1 instance UnixBench | AutoFDO improvement | iFDO improvement |
---|---|---|
Dhrystone 2 using register variables | -0.3% | -0.3% |
Double-Precision Whetstone | -0.1% | -0.1% |
Execl Throughput | 1.9% | 2.5% |
File Copy 1024 bufsize 2000 maxblocks | 3.0% | 2.7% |
File Copy 256 bufsize 500 maxblocks | 4.3% | 3.7% |
File Copy 4096 bufsize 8000 maxblocks | 3.4% | 2.7% |
Pipe Throughput | 0.8% | 0.7% |
Pipe-based Context Switching | 5.5% | 7.9% |
Process Creation | 8.2% | 11.3% |
Shell Scripts (1 concurrent) | 1.7% | 2.9% |
Shell Scripts (8 concurrent) | 1.9% | 3.6% |
System Call Overhead | -3.0% | -0.6% |
System Benchmarks Index Score | 2.2% | 3.0% |
112 instances UnixBench | AutoFDO improvement | iFDO improvement |
---|---|---|
Dhrystone 2 using register variables | 0% | 0% |
Double-Precision Whetstone | 0% | 0% |
Execl Throughput | -7.6% | -3.7% |
File Copy 1024 bufsize 2000 maxblocks | -3.2% | -3.3% |
File Copy 256 bufsize 500 maxblocks | 3.5% | 6.7% |
File Copy 4096 bufsize 8000 maxblocks | 3.0% | -2.9% |
Pipe Throughput | 0.8% | 3.3% |
Pipe-based Context Switching | 5.2% | 6.2% |
Process Creation | 3.1% | 1.9% |
Shell Scripts (1 concurrent) | 16.6% | 14.6% |
Shell Scripts (8 concurrent) | 13.2% | 10.4% |
System Call Overhead | -1.0% | 0.9% |
System Benchmarks Index Score | 2.6% | 2.6% |
Notes
[1] Linus Torvalds wrote: “That odd decision [to use iFDO] seems to not be documented anywhere, and it seems odd and counter-productive, and causes all that odd special buffer handling and that vmlinux.profraw file etc.”
[2] As Linus observed: “I agree that perf profiling works best on Intel. The AMD perf side works ok in Zen 2 from what I’ve seen, but needs to be a full-system profile (“perf record -a”) to use the better options, and ARM is… But with x86 ranging from “excellent” to “usable”, and ARM hopefully being at least close to getting better proper profile data, I really think it’s the way forward, with instrumentation being a band-aid at best.”
[3] In production, a continuous sampler can generate AutoFDO profiles by automating the process.
[4] Production sampling of the kernel can happen when the utilization is high to ensure higher quality profiles.