Enabling AutoFDO & Propeller optimizations on Arm with SPE

dhoekwater · May 14, 2024, 9:52pm

Overview & Background

Last month, Google revealed Axion, our new Arm-based server CPUs. Now, we’re excited to announce support for AutoFDO and Propeller optimizations on Arm.

On x86, AutoFDO and Propeller use Last Branch Record (LBR) entries to determine edge weights between basic blocks. Arm’s equivalent PMU branch stack is the Branch Record Buffer Extension (BRBE), but many Arm chips have not yet implemented it. In lieu of LBR-like data, we have added support for branch profiles collected with either the Embedded Trace Macrocell (ETM) or the Statistical Profiling Extension (SPE). AutoFDO v0.20 and above support these profiling sources.

In this post, we’ll walk through optimizing a sample program (the Fleetbench proto benchmark) with SPE samples collected on Arm.

Instructions

The first step is to set up the dependencies. For this post, we use AutoFDO v0.20 and Fleetbench v1.0.0.

# Download the benchmark and the AutoFDO tooling library. 
git clone git@github.com:google/fleetbench.git
git -C fleetbench checkout v1.0.0

git clone --recursive git@github.com:google/autofdo.git
git -C autofdo checkout v0.20

# Build the `create_llvm_prof` tool
cmake -G Ninja -DWITH_LLVM=On -DCMAKE_BUILD_TYPE=Release \
  -S autofdo/ -B autofdo/build/
# Directory for the binaries we build (bash syntax; in fish, use `set BINDIR /tmp/binaries/`)
BINDIR=/tmp/binaries/
mkdir -p "$BINDIR"
ninja -C autofdo/build create_llvm_prof
cp autofdo/build/create_llvm_prof $BINDIR
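
Before going further, it can save time to confirm that the tools used in this walkthrough are on your PATH. The following is only a minimal sketch: `check_tools` is a hypothetical helper name, and the tool list is simply taken from the commands in this post.

```shell
# Report which of the given tools are missing from PATH.
# Returns non-zero if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tool names come from the commands used in this post.
check_tools git cmake ninja bazel perf \
  || echo "install the tools listed above before continuing"
```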

AutoFDO

Next, build the base benchmark. Once it’s built, run the benchmark and collect SPE branch samples of its execution.

# Build the base benchmark
cd fleetbench
bazel build -c opt //fleetbench/proto:proto_benchmark \
  --copt=-gmlt --strip=never

cp -f bazel-bin/fleetbench/proto/proto_benchmark \
  $BINDIR/proto_benchmark_base

perf record --no-switch-events \
  -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/' \
  -c 10007 -N -o /tmp/spe.perf.data -- \
    $BINDIR/proto_benchmark_base --benchmark_min_time=30s \
    --hugepage_text=true --benchmark_filter=all
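
If the perf record command above fails, one thing worth checking is whether the kernel exposes an SPE PMU at all. On Linux, the SPE driver registers devices named `arm_spe_<n>` under `/sys/bus/event_source/devices`. The sketch below assumes that layout; `has_spe_pmu` is a hypothetical helper, and its optional argument exists only so the check can be tried on a machine without SPE.

```shell
# Succeeds if an arm_spe_* PMU directory exists under the given sysfs root.
# The root defaults to the real sysfs location.
has_spe_pmu() {
  root="${1:-/sys/bus/event_source/devices}"
  for dev in "$root"/arm_spe_*; do
    # An unmatched glob stays literal and fails the -d test.
    [ -d "$dev" ] && return 0
  done
  return 1
}

if has_spe_pmu; then
  echo "SPE PMU found"
else
  echo "no arm_spe_* PMU found; the arm_spe_0 event will likely be unavailable"
fi
```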

Once you’ve collected the perf data, process it into an AutoFDO (AFDO) profile with create_llvm_prof, then use that profile to build an AFDO-optimized binary.

$BINDIR/create_llvm_prof --binary=$BINDIR/proto_benchmark_base \
  --profile=/tmp/spe.perf.data \
  --profiler=perf_spe --disassemble_arm_branches=true \
  --format=extbinary \
  --out=/tmp/spe.afdo

bazel build -c opt //fleetbench/proto:proto_benchmark \
  --copt=-gmlt --strip=never \
  --copt=-fprofile-sample-use=/tmp/spe.afdo

cp -f bazel-bin/fleetbench/proto/proto_benchmark \
  $BINDIR/proto_benchmark_afdo
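
Between the two steps above, a quick check that the profile was actually written can catch a failed conversion before a long rebuild. This is a minimal sketch; `require_nonempty` is a hypothetical helper, and it only checks that the file exists and is non-empty, not that its contents are a valid profile.

```shell
# Fail early if a generated profile is missing or empty.
require_nonempty() {
  if [ ! -s "$1" ]; then
    echo "error: $1 is missing or empty" >&2
    return 1
  fi
  echo "ok: $1 ($(wc -c < "$1" | tr -d ' ') bytes)"
}

require_nonempty /tmp/spe.afdo || echo "re-run create_llvm_prof before building"
```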

Running the AFDO-optimized benchmark against the base, we can see that the optimized binary takes less time to execute (about 1.3% less wall time in the runs below).

Running /tmp/binaries/proto_benchmark_base
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9271691 ns      9215129 ns         4558
...

Running /tmp/binaries/proto_benchmark_afdo
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9152992 ns      9099613 ns         4534
...
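
To put a number on the difference, the relative improvement can be computed directly from the reported times. The sketch below uses the BM_Protogen_Arena figures shown above; `pct_faster` is a hypothetical helper name.

```shell
# Percentage reduction in time going from base_ns ($1) to new_ns ($2).
pct_faster() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.2f\n", (base - new) / base * 100 }'
}

pct_faster 9271691 9152992   # wall time, base vs. AFDO: prints 1.28
pct_faster 9215129 9099613   # CPU time, base vs. AFDO: prints 1.25
```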

Propeller

We can further optimize the binary with Propeller post-link optimizations. To do so, we start by building the optimized AFDO binary with Propeller annotations.

bazel build -c opt //fleetbench/proto:proto_benchmark \
  --copt=-gmlt --strip=never \
  --copt=-fprofile-sample-use=/tmp/spe.afdo \
  --copt=-fbasic-block-address-map \
  --linkopt=-fuse-ld=lld \
  --linkopt=-Wl,--lto-basic-block-address-map

cp -f bazel-bin/fleetbench/proto/proto_benchmark \
  $BINDIR/proto_benchmark_annotated

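Once again, we run the benchmark and collect SPE perf profiles, then process the perf profile with create_llvm_prof. Note that, this time, we pass --format=propeller, so create_llvm_prof writes two outputs: a compiler profile (--out) and a linker symbol ordering file (--propeller_symorder).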

perf record --no-switch-events \
  -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/' \
  -c 10007 -N -o /tmp/spe.perf.data -- \
    $BINDIR/proto_benchmark_annotated --benchmark_min_time=30s \
    --hugepage_text=true --benchmark_filter=all

$BINDIR/create_llvm_prof --binary=$BINDIR/proto_benchmark_annotated \
  --profile=@<(printf "/tmp/spe.perf.data\n#") \
  --profiler=perf_spe \
  --format=propeller \
  --out=/tmp/spe_cc_propeller.txt \
  --propeller_symorder=/tmp/spe_ld_propeller.txt
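
The `--profile=@<(printf ...)` argument above relies on bash process substitution, which is not available in every shell. Assuming the `@` prefix tells create_llvm_prof to read profile paths from a list file (our reading of the command above, not something this post spells out), a regular file with the same contents should work as well. The list-file path below is hypothetical.

```shell
# Hypothetical list-file location; any writable path works.
PROFILE_LIST=/tmp/spe_profiles.list

# Same two lines the process substitution above produces:
# the perf data path, then a bare "#".
printf '/tmp/spe.perf.data\n#' > "$PROFILE_LIST"

# Show what would be read when passing --profile=@"$PROFILE_LIST".
cat "$PROFILE_LIST"
```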

Lastly, we optimize the binary with the Propeller profile.

bazel build -c opt //fleetbench/proto:proto_benchmark \
  --copt=-gmlt --strip=never \
  --copt=-fprofile-sample-use=/tmp/spe.afdo \
  --copt=-fbasic-block-sections=list=/tmp/spe_cc_propeller.txt \
  --linkopt=-fuse-ld=lld \
  --linkopt=-Wl,--lto-basic-block-sections=/tmp/spe_cc_propeller.txt \
  --linkopt=-Wl,--symbol-ordering-file=/tmp/spe_ld_propeller.txt \
  --linkopt=-Wl,--no-warn-symbol-ordering

cp -f bazel-bin/fleetbench/proto/proto_benchmark \
  $BINDIR/proto_benchmark_afdo_propeller

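Comparing the Propeller-optimized benchmark against the AFDO-only and base binaries, we can see that the Propeller-optimized binary takes the least time to execute: in the runs below, about 2.4% less wall time than the base binary and about 1.1% less than the AFDO-only binary.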

Running /tmp/binaries/proto_benchmark_base
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9271691 ns      9215129 ns         4558
...

Running /tmp/binaries/proto_benchmark_afdo
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9152992 ns      9099613 ns         4534
...

Running /tmp/binaries/proto_benchmark_afdo_propeller
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9052150 ns      8982803 ns         4690
...

Note: the benchmark is noisy and the differences above are only a few percent, so results may vary from run to run.
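
One way to reduce the effect of noise is to run each binary several times (Google Benchmark's --benchmark_repetitions flag can help here) and compare medians rather than single runs. The sketch below is a hypothetical `median` helper; the input numbers are placeholders for illustration, not measurements.

```shell
# Print the median of whitespace-separated numbers read from stdin.
median() {
  tr -s ' \t' '\n' | grep -v '^$' | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else printf "%.1f\n", (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Placeholder values standing in for five repeated timings.
echo "105 98 101 110 99" | median   # prints 101
```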

Wrap-up

Having a quality source of native profiles means that feedback-driven optimizations don’t have to rely on instrumentation or cross-profiling, bringing PGO on AArch64 to feature parity with x86. We’re looking forward to bringing new and existing profile-guided optimizations to AArch64!

hansw2000 · May 15, 2024, 8:57am

There doesn’t seem to be a single mention of Propeller in the LLVM repository. Would it be possible to improve the documentation for this feature?

dhoekwater · May 15, 2024, 6:55pm

There doesn’t seem to be a single mention of Propeller in the LLVM repository.

The guts of Propeller currently live within AutoFDO, which isn’t part of the LLVM repo. We’ll most likely try to move them into LLVM, but that’s not something I can commit to at this moment.

Would it be possible to improve the documentation for this feature?

Yep! Improved documentation is one of my main priorities for the next ~6 months, so things should be better on that front.

hansw2000 · May 16, 2024, 1:23pm

Excellent, thank you!
