Overview & Background
Last month, Google revealed Axion, our new Arm-based server CPUs. Now, we’re excited to announce support for AutoFDO and Propeller optimizations on Arm.
On x86, AutoFDO and Propeller use Last Branch Record (LBR) entries to determine edge weights between basic blocks, but many Arm chips have not yet implemented Arm’s PMU branch stack: the Branch Record Buffer Extension (BRBE). In lieu of LBR-like data, we have added support for branch profiles collected with either the Embedded Trace Macrocell (ETM) or the Statistical Profiling Extension (SPE). AutoFDO v0.20 and later support both profiling sources.
In this post, we’ll walk through optimizing a sample program (the Fleetbench proto benchmark) with SPE samples collected on Arm.
Instructions
The first step is to set up the dependencies. For this post, we use AutoFDO v0.20 and Fleetbench v1.0.0.
# Download the benchmark and the AutoFDO tooling library.
git clone git@github.com:google/fleetbench.git
git -C fleetbench checkout v1.0.0
git clone --recursive git@github.com:google/autofdo.git
git -C autofdo checkout v0.20
# Build the `create_llvm_prof` tool
cmake -G Ninja -DWITH_LLVM=On -DCMAKE_BUILD_TYPE=Release \
-S autofdo/ -B autofdo/build/
export BINDIR=/tmp/binaries
mkdir -p "$BINDIR"
ninja -C autofdo/build create_llvm_prof
cp autofdo/build/create_llvm_prof $BINDIR
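The recording step in the next section uses the arm_spe_0 event, which is only available when the kernel has registered an SPE PMU. As a minimal sketch, assuming the usual Linux sysfs layout where perf PMUs appear under /sys/bus/event_source/devices, you can check for it before going further. The has_spe_pmu helper name is ours, for illustration only.

```shell
#!/bin/sh
# has_spe_pmu [DIR]
# Return 0 if an SPE PMU (e.g. arm_spe_0) is registered with perf.
# DIR defaults to the real sysfs location; the optional argument exists
# only so the check can be exercised against a fake directory tree.
has_spe_pmu() {
    pmu_dir="${1:-/sys/bus/event_source/devices}"
    for dev in "$pmu_dir"/arm_spe_*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        [ -e "$dev" ] && return 0
    done
    return 1
}

if has_spe_pmu; then
    echo "SPE PMU found"
else
    echo "no SPE PMU found; the arm_spe_0 event will not be available"
fi
```

If the check fails on hardware that implements SPE, the kernel may have been built without the SPE driver, or the driver may not be loaded.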
AutoFDO
Next, build the base benchmark. Once it’s built, run the benchmark and collect SPE branch samples of its execution.
# Build the base benchmark
cd fleetbench
bazel build -c opt //fleetbench/proto:proto_benchmark \
--copt=-gmlt --strip=never
cp -f bazel-bin/fleetbench/proto/proto_benchmark \
$BINDIR/proto_benchmark_base
perf record --no-switch-events \
-e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/' \
-c 10007 -N -o /tmp/spe.perf.data -- \
$BINDIR/proto_benchmark_base --benchmark_min_time=30s \
--hugepage_text=true --benchmark_filter=all
Once you’ve collected the perf data, you can process it into an AFDO profile using create_llvm_prof and use the profile to build an AFDO-optimized binary.
$BINDIR/create_llvm_prof --binary=$BINDIR/proto_benchmark_base \
--profile=/tmp/spe.perf.data \
--profiler=perf_spe --disassemble_arm_branches=true \
--format=extbinary \
--out=/tmp/spe.afdo
bazel build -c opt //fleetbench/proto:proto_benchmark \
--copt=-gmlt --strip=never \
--copt=-fprofile-sample-use=/tmp/spe.afdo
cp -f bazel-bin/fleetbench/proto/proto_benchmark \
$BINDIR/proto_benchmark_afdo
Running the AFDO-optimized benchmark against the base, we can see that the optimized binary takes slightly less time to execute (about 1.3% less wall-clock time on BM_Protogen_Arena).
Running /tmp/binaries/proto_benchmark_base
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9271691 ns      9215129 ns         4558
...
Running /tmp/binaries/proto_benchmark_afdo
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9152992 ns      9099613 ns         4534
...
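To put a number on the difference, the two Time values can be turned into a relative improvement. This is a small sketch rather than part of the AutoFDO tooling; pct_faster is a helper name introduced here for illustration.

```shell
#!/bin/sh
# pct_faster BASE_NS NEW_NS
# Print how much less time NEW_NS is than BASE_NS, as a percentage
# with two decimal places.
pct_faster() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (base - new) / base * 100 }'
}

# Wall-clock times for BM_Protogen_Arena, taken from the two runs above.
echo "AFDO vs. base: $(pct_faster 9271691 9152992)%"
# prints: AFDO vs. base: 1.28%
```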
Propeller
We can further optimize the binary with Propeller post-link optimizations. To do so, we start by building the optimized AFDO binary with Propeller annotations.
bazel build -c opt //fleetbench/proto:proto_benchmark \
--copt=-gmlt --strip=never \
--copt=-fprofile-sample-use=/tmp/spe.afdo \
--copt=-fbasic-block-address-map \
--linkopt=-fuse-ld=lld \
--linkopt=-Wl,--lto-basic-block-address-map
cp -f bazel-bin/fleetbench/proto/proto_benchmark \
$BINDIR/proto_benchmark_annotated
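Once again, we run the benchmark and collect SPE perf profiles, then process the perf profile with create_llvm_prof. Note that, this time, we pass --format=propeller and write two outputs: a compiler profile (--out) and a linker symbol-ordering file (--propeller_symorder).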
perf record --no-switch-events \
-e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/' \
-c 10007 -N -o /tmp/spe.perf.data -- \
$BINDIR/proto_benchmark_annotated --benchmark_min_time=30s \
--hugepage_text=true --benchmark_filter=all
$BINDIR/create_llvm_prof --binary=$BINDIR/proto_benchmark_annotated \
--profile=@<(printf "/tmp/spe.perf.data\n#") \
--profiler=perf_spe \
--format=propeller \
--out=/tmp/spe_cc_propeller.txt \
--propeller_symorder=/tmp/spe_ld_propeller.txt
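The final build consumes both files written by this step, so it can be worth confirming that they exist and are non-empty before starting it. The following is a minimal sketch; require_nonempty is a hypothetical helper, not something shipped with AutoFDO or Propeller.

```shell
#!/bin/sh
# require_nonempty FILE...
# Return non-zero, naming each offender on stderr, if any FILE is
# missing or empty.
require_nonempty() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Hypothetical usage before the Propeller-optimized build:
#   require_nonempty /tmp/spe_cc_propeller.txt /tmp/spe_ld_propeller.txt || exit 1
```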
Lastly, we optimize the binary with the Propeller profile.
bazel build -c opt //fleetbench/proto:proto_benchmark \
--copt=-gmlt --strip=never \
--copt=-fprofile-sample-use=/tmp/spe.afdo \
--copt=-fbasic-block-sections=list=/tmp/spe_cc_propeller.txt \
--linkopt=-fuse-ld=lld \
--linkopt=-Wl,--lto-basic-block-sections=/tmp/spe_cc_propeller.txt \
--linkopt=-Wl,--symbol-ordering-file=/tmp/spe_ld_propeller.txt \
--linkopt=-Wl,--no-warn-symbol-ordering
cp -f bazel-bin/fleetbench/proto/proto_benchmark \
$BINDIR/proto_benchmark_afdo_propeller
Comparing the Propeller-optimized benchmark against the AFDO-only and base binaries, we can see that the Propeller-optimized binary takes the least time to execute.
Running /tmp/binaries/proto_benchmark_base
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9271691 ns      9215129 ns         4558
...
Running /tmp/binaries/proto_benchmark_afdo
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9152992 ns      9099613 ns         4534
...
Running /tmp/binaries/proto_benchmark_afdo_propeller
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_Protogen_Arena      9052150 ns      8982803 ns         4690
...
Note: the benchmark is pretty noisy, so results may vary.
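One way to reduce the effect of that noise is to run each binary several times, save the console output of every run, and compare medians instead of single runs. The sketch below assumes the column layout shown above and one matching row per log; median_time and the log paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# median_time NAME LOG...
# Print the median wall-clock time (the "Time" column, in ns) reported
# for benchmark NAME across the given console logs. With an even number
# of runs, the two middle values are averaged and truncated to whole ns.
# Returns non-zero if no log contains a row for NAME.
median_time() {
    name="$1"
    shift
    awk -v name="$name" '$1 == name { print $2 }' "$@" | sort -n | awk '
        { t[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) printf "%d\n", t[(NR + 1) / 2]
            else        printf "%d\n", (t[NR / 2] + t[NR / 2 + 1]) / 2
        }'
}

# Hypothetical usage, assuming three saved runs of the base binary:
#   median_time BM_Protogen_Arena /tmp/base_run1.log /tmp/base_run2.log /tmp/base_run3.log
```

Google Benchmark, which provides the --benchmark_min_time and --benchmark_filter flags used in this post, can also repeat a benchmark itself via --benchmark_repetitions and report aggregate statistics.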
Wrap-up
Having a quality source of native profiles means that feedback-driven optimizations don’t have to rely on instrumentation or cross-profiling, bringing PGO on AArch64 to feature parity with x86. We’re looking forward to bringing new and existing profile-guided optimizations to AArch64!