Based on discussions in the "Adding ffmpeg in LLVM test suite?" thread, we realized that ffmpeg and dav1d would be useful tests and benchmarks for LLVM's autovectorization support.
Both projects contain decoders (and encoders) for various audio/video formats, with quite hot codepaths that are normally implemented in handwritten assembly. The corresponding C functions should also be autovectorizable by the compiler (although they are usually too generic to exploit all the tricks that the handwritten assembly does). Both projects contain a testing/benchmarking framework for the assembly functions, checkasm, which can also be used for benchmarking the compiler-generated code.
I did the tests on an AWS Graviton 3 server (a c7g.2xlarge, to be exact), testing the AArch64 code generation, on Ubuntu 24.04. But I'll try to detail all the steps I did, to allow others to repeat the same measurements on their architecture of choice.
First off, I tested dav1d, tag 1.5.0.
I tested building with GCC 13.2.0 (from Ubuntu 24.04) and Clang 19.1.2 (built manually).
I set up the compilation like this:
$ cd dav1d
$ mkdir build-gcc
$ cd build-gcc
$ CFLAGS="-fno-tree-vectorize" CC=gcc meson setup .. -Dtrim_dsp=false --default-library static -Denable_asm=true
$ ninja
$ cd ..
$ mkdir build-clang
$ cd build-clang
$ CFLAGS="-fno-vectorize -fno-slp-vectorize" CC=clang meson setup .. -Dtrim_dsp=false --default-library static -Denable_asm=true
$ ninja
(I'm not sure whether -fno-slp-vectorize does anything additional on top of -fno-vectorize; -fno-vectorize disables the loop vectorizer, while -fno-slp-vectorize disables the SLP vectorizer that works on straight-line code, so passing both should cover both passes.)
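As a side note, one way to see which of the two vectorizers actually fires on a given translation unit is Clang's optimization remarks; the file name below is just a placeholder:
$ clang -O3 -Rpass='loop-vectorize|slp-vectorize' -c foo.c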
Within each of the GCC/Clang builds, I varied the following parameters:
- I built with -Denable_asm=true and with -Denable_asm=false
- I omitted the CFLAGS="-fno..." parameter setting (see the sketch after this list for one way to lay out these variants)
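For reference, a sketch of how the extra GCC build variants could be set up (the directory names here are just my own labels; the Clang variants are analogous, with -fno-vectorize -fno-slp-vectorize instead of -fno-tree-vectorize):
$ mkdir build-gcc-noasm && cd build-gcc-noasm
$ CC=gcc meson setup .. -Dtrim_dsp=false --default-library static -Denable_asm=false
$ ninja && cd ..
$ mkdir build-gcc-noasm-novec && cd build-gcc-noasm-novec
$ CFLAGS="-fno-tree-vectorize" CC=gcc meson setup .. -Dtrim_dsp=false --default-library static -Denable_asm=false
$ ninja && cd ..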
For total big-picture numbers, I tested decoding of one long AV1 video clip. I used the Chimera test clip from Netflix, currently downloaded from http://dgql.org/~unlord/Netflix/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf. (It is usually available via https://download.opencontent.netflix.com/ or something similar, but that site was down today.) I ran decoding with a command like ./tools/dav1d -i ~/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o n.null.
I got the following decoding speeds:
- For both GCC and Clang, in the default configuration (with enable_asm=true), I got around 450-460 fps (frames per second, higher is better)
- For GCC with assembly disabled, I got 185 fps
- For GCC with assembly disabled and vectorization disabled, I got 128 fps
- For Clang with assembly disabled, I got 177 fps
- For Clang with assembly disabled and vectorization disabled, I got 118 fps
(These numbers are from multithreaded decoding, and I ran it on an 8-core server. The number of cores affects the absolute numbers a lot, but the absolute values don't matter much here.)
Thus, in short, for this codebase on AArch64, GCC generally performs marginally better. With both compilers, autovectorization gives roughly a 40-50% speedup, but it obviously doesn't get anywhere near the performance of the handwritten assembly.
For microbenchmarking, the checkasm tool can be used. Specifically on AArch64, the dav1d checkasm tool defaults to using pmccntr_el0 as its timer, but this register isn't normally directly accessible from userspace. With a kernel module you can enable direct userspace access to it; otherwise you can edit tests/checkasm/checkasm.h and replace pmccntr_el0 with cntvct_el0. That timer has less precision than you'd want for tuning an individual handwritten assembly function, but it should be good enough for coarse benchmarks.
Additionally, to reduce the long benchmarking runtime, I also edited tests/checkasm/checkasm.h, changing #define BENCH_RUNS (1 << 12) into #define BENCH_RUNS (1 << 10).
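Both of these edits are mechanical, so (assuming the identifiers appear literally in that header) something along these lines should do:
$ sed -i 's/pmccntr_el0/cntvct_el0/' tests/checkasm/checkasm.h
$ sed -i 's/BENCH_RUNS (1 << 12)/BENCH_RUNS (1 << 10)/' tests/checkasm/checkasm.h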
To do a benchmark run over all functions, I ran a command like ./tests/checkasm --bench > bench-clang-vect.txt. (This command took 1-3 minutes to run in this setup; without the modification of BENCH_RUNS it would take 4 times longer.)
This outputs a lot of measurements like this:
warp_8x8_8bpc_c: 584.1 ( 1.00x)
warp_8x8_8bpc_neon: 140.7 ( 4.15x)
This shows that the C version of the warp_8x8_8bpc function/testcase ran in 584.1 timer units, while the handwritten NEON assembly version ran in 140.7 timer units, i.e. the handwritten version was 4.15x faster than the baseline.
(While we're not interested in the runtimes of the assembly versions here, we need to build with -Denable_asm=true, because otherwise the checkasm tool isn't built at all.)
To study the performance of one individual function more closely, one can run e.g. ./tests/checkasm --bench --function=warp* (where warp* is a wildcard pattern matching the base of the function name, i.e. omitting the _c or _neon suffix).
I collected the outputs from both GCC and Clang in this way, and have published them at https://martin.st/temp/dav1d-autovectorize/, including scripts to compare those results. In https://martin.st/temp/dav1d-autovectorize/speedup-clang-vect.txt we see the relative speedup of the Clang autovectorized functions over the non-autovectorized versions; here a number like 2.00 means that the autovectorized function was 2x as fast, i.e. ran in half the time; higher is better.
I also included similar comparisons of how much faster/slower Clang is than GCC, for each individual function.
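The actual comparison scripts are in that directory; as a rough sketch of the idea (assuming the non-vectorized run was saved to a file like bench-clang-novect.txt, and with an output format that is purely illustrative), the per-function speedup can be computed by dividing the C-function timings from two such checkasm outputs:
$ awk '/_c:/ {
      name = $1; sub(":", "", name)
      if (FNR == NR) {                            # first file: C times without vectorization
          novect[name] = $2
      } else if (name in novect && $2 + 0 > 0) {  # second file: C times with vectorization
          printf "%-40s %.2f\n", name, novect[name] / $2
      }
  }' bench-clang-novect.txt bench-clang-vect.txt > speedup-clang-vect.txt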
As trivia, we can note that out of 2214 benchmarked functions, Clang had 218 functions that were <= 0.95x of the non-vectorized version (i.e. where autovectorization made things slower, for the tested case), and 1548 functions that were >= 1.10x of the baseline (i.e. where vectorization made things at least a little bit faster). GCC had 249 functions that became slower and 1457 functions that became faster.
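Counts like these can be pulled out of such a speedup file with a one-liner, e.g. assuming the two-column format sketched above:
$ awk '$2 <= 0.95 { slower++ } $2 >= 1.10 { faster++ } END { print slower+0 " slower, " faster+0 " faster" }' speedup-clang-vect.txt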
FFmpeg also has a lot of similar benchmarkable functions.
I only did measurements on FFmpeg with Clang, because FFmpeg explicitly disables autovectorization with GCC due to issues; see the revert "configure: Enable GCC vectorization on ≥4.9 on x86" (FFmpeg/FFmpeg@fd6dbc5 on GitHub).
I did my measurements on FFmpeg commit c98810ab47fa1cf339b16045e27fbe12b3a19951.
I set up my build like this:
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --disable-linux-perf --optflags='-O3 -fno-vectorize -fno-slp-vectorize'
$ make -j$(nproc) checkasm
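For the vectorized comparison run, I'd assume a second build configured the same way but without the -fno-* flags (the directory name is just my own label):
$ mkdir ffmpeg-build-vect
$ cd ffmpeg-build-vect
$ ../ffmpeg/configure --cc=clang --disable-linux-perf --optflags='-O3'
$ make -j$(nproc) checkasm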
I had made the following changes to the source before building:
diff --git a/libavutil/aarch64/timer.h b/libavutil/aarch64/timer.h
index 922b0c5598..f8016ea7cc 100644
--- a/libavutil/aarch64/timer.h
+++ b/libavutil/aarch64/timer.h
@@ -33,7 +33,7 @@ static inline uint64_t read_time(void)
     uint64_t cycle_counter;
     __asm__ volatile(
         "isb \t\n"
-#if defined(__ANDROID__) || defined(__APPLE__)
+#if defined(__ANDROID__) || defined(__APPLE__) || 1
         // cntvct_el0 has lower resolution than pmccntr_el0, but is usually
         // accessible from user space by default.
         "mrs %0, cntvct_el0 "
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index c9d2b5faf1..dac5c79464 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -626,8 +626,7 @@ static void print_benchs(CheckasmFunc *f)
     if (f) {
         print_benchs(f->child[0]);
 
-        /* Only print functions with at least one assembly version */
-        if (f->versions.cpu || f->versions.next) {
+        if (1) {
             CheckasmFuncVersion *v = &f->versions;
             const CheckasmPerf *p = &v->perf;
             const double baseline = avg_cycles_per_call(p);
(This is to make it use cntvct_el0 rather than pmccntr_el0, as for dav1d, and to make it print benchmarks for functions that don't have any corresponding assembly implementation.)
I ran benchmarks with a command like ./tests/checkasm/checkasm --bench --runs=9 > bench-clang-novect.txt. (Here, the number of benchmark iterations is settable with a command line parameter.)
For testing individual functions, you can do e.g. ./tests/checkasm/checkasm --bench=shuffle_bytes*. (I.e. similar to the dav1d case above, but with slightly different command line parameters, currently.)
The measurements from this run are available at https://martin.st/temp/ffmpeg-autovectorize/.