Performance Regression in LLVM - A SPEC CPU 2017 Study

Hi,

While conducting a performance evaluation for a project using the SPEC CPU 2017 benchmark suite, we measured an execution-time regression in recent clang versions on several of the benchmark targets.

We ran the 8 benchmarks used in the Propeller [1] paper (as we are conducting parallel studies on the efficacy of profile-guided optimizations) using 7 different clang versions at optimization levels O1, O2, and O3, starting from 20.1.9 and going back to 8.0.1-9 (details on the versions are given below).

All benchmarks were compiled without profiles. For each benchmark we did 5 runs and took the median value, as per SPEC practice. We measured negligible standard deviation for all benchmarks except 505.mcf_r (due to 1-2 slower runs in each combination of level and version, though the trends remain consistent).
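As an aside, the per-benchmark aggregation is simply the median of the 5 runs, with the standard deviation used as a noise check; a minimal sketch with placeholder run times:

```python
# Per-benchmark aggregation: median of 5 runs plus the standard deviation used
# as a noise check. The run times below are placeholders, not measured data.
import statistics

runs = [712.4, 713.0, 713.6, 718.9, 719.3]  # hypothetical run times in seconds
print("median:", statistics.median(runs))
print("stdev:", round(statistics.stdev(runs), 2))
```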

The experiments ran on a server equipped with an Intel Xeon E5-2699 v4 CPU, 256 GB of RAM, and Linux kernel 5.4.0 (OpenNebula3), providing 20 physical cores to SPEC's runcpu tool, which runs concurrent copies of each benchmark.

Results Overview

The tables below compare the execution time (in seconds) measured after compiling with the latest version of clang against the best result measured for each benchmark across all tested versions. In short, only at O2 is the most recent version of clang the one that most often gives the best performance.
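For clarity, Δ (%) in the tables is the relative slowdown of the latest clang with respect to the best-performing version; a minimal sketch of the computation, using the 500.perlbench_r numbers at O1:

```python
def delta_percent(latest: float, best: float) -> float:
    """Relative slowdown (in %) of the latest clang vs. the best-performing version."""
    return (latest - best) / best * 100

# 500.perlbench_r at O1: latest (clang 20) = 560 s, best (clang 12) = 546 s
print(round(delta_percent(560, 546), 4))  # 2.5641
```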

At O1, the latest version of the compiler is the best on 2 benchmarks (505.mcf_r and 523.xalancbmk_r), while on the other targets the best results are spread across versions 12 to 18, going back as far as version 12 for only one of them (500.perlbench_r).

At O2, the latest version of the compiler is the best on 6 benchmarks, while on the remaining targets (500.perlbench_r and 531.deepsjeng_r) the best results are at version 14, with 500.perlbench_r performing equally well with versions 14 and 10.

At O3, the latest version of the compiler is the best on 4 benchmarks (505.mcf_r, 523.xalancbmk_r, 525.x264_r, and 557.xz_r), while on the remaining targets the best results are at versions 8, 10, 12, and 16, going back as far as version 8 for one benchmark (500.perlbench_r).

Execution times at O1 (seconds)

| benchmark | exec. time, latest version | exec. time, best version | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 560 | 546 | 12 | 2.5641 |
| 502.gcc_r | 472 | 462 | 14 | 2.1645 |
| 505.mcf_r | 713 | - | 20 | - |
| 523.xalancbmk_r | 594 | - | 20 | - |
| 525.x264_r | 601 | 573 | 16/14 | 4.8866 |
| 531.deepsjeng_r | 409 | 405 | 14 | 0.9877 |
| 541.leela_r | 613 | 612 | 18 | 0.1634 |
| 557.xz_r | 556 | 553 | 18 | 0.5425 |

Execution times at O2 (seconds)

| benchmark | exec. time, latest version | exec. time, best version | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 542 | 532 | 14/10 | 1.8797 |
| 502.gcc_r | 453 | - | 20 | - |
| 505.mcf_r | 747 | - | 20 | - |
| 523.xalancbmk_r | 584 | - | 20 | - |
| 525.x264_r | 225 | - | 20 | - |
| 531.deepsjeng_r | 402 | 394 | 14 | 2.0305 |
| 541.leela_r | 574 | - | 20 | - |
| 557.xz_r | 543 | - | 20 | - |

Execution times at O3 (seconds)

| benchmark | exec. time, latest version | exec. time, best version | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 540 | 531 | 8 | 1.6949 |
| 502.gcc_r | 443 | 442 | 10 | 0.2262 |
| 505.mcf_r | 737 | - | 20 | - |
| 523.xalancbmk_r | 586 | - | 20 | - |
| 525.x264_r | 226 | - | 20 | - |
| 531.deepsjeng_r | 402 | 384 | 12 | 4.6875 |
| 541.leela_r | 559 | 558 | 16 | 0.1792 |
| 557.xz_r | 542 | - | 20 | - |

We noticed that performance only started to get consistently better with clang 20 (apart from the regressions above). Initially, for reasons related to our project, we had stopped at clang 18 and noticed that, worryingly, clang 18 was the best at O1 for only 3 benchmarks, the best at O2 for 3 benchmarks, and never the best at O3.

The next tables show the speedup of each compiler version against the oldest one tested (8.0.1-9). Highlighted in bold are the best speedup values among all versions, while in italic are the best speedup values among all versions ignoring version 20. We can see that, prior to the latest version, the issue is more evident, as the regression is present in most of the benchmarks.
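For reference, the speedups are the clang 8 execution time divided by the given version's execution time, so values above 1 mean the newer version is faster; a minimal sketch (the clang 8 time here is back-computed from the reported speedup, not a separately listed measurement):

```python
def speedup(time_clang8: float, time_version: float) -> float:
    """Speedup of a given clang version relative to clang 8 (>1 means faster)."""
    return time_clang8 / time_version

# 502.gcc_r at O1: clang 8 takes ~574 s (back-computed), clang 20 takes 472 s.
print(round(speedup(574, 472), 4))  # 1.2161
```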

For O1, we notice generally worse performance with clang 10 and 12 (with the exception of 500.perlbench_r), while clang 14 gives solid performance, often superior to what we measured for clang 18 and even 20.

Speedup/slowdown of each version against clang 8 at O1

| benchmark | 20 | 18 | 16 | 14 | 12 | 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 1.0089 | 1.0018 | 1.0107 | 1.0310 | 1.0348 | 1.0125 |
| 502.gcc_r | 1.2161 | 1.2397 | 1.2371 | 1.2424 | 0.9897 | 0.9863 |
| 505.mcf_r | 1.0281 | 1.0138 | 1.0000 | 1.0014 | 0.9839 | 0.9959 |
| 523.xalancbmk_r | 2.1599 | 2.1277 | 2.1348 | 2.1033 | 0.9469 | 0.9610 |
| 525.x264_r | 1.0566 | 1.0781 | 1.1082 | 1.1082 | 1.0275 | 1.0079 |
| 531.deepsjeng_r | 1.0196 | 1.0221 | 1.0196 | 1.0296 | 0.9766 | 0.9835 |
| 541.leela_r | 2.4274 | 2.4314 | 2.3923 | 2.4274 | 0.9809 | 1.0020 |
| 557.xz_r | 1.1133 | 1.1193 | 1.1073 | 1.1054 | 0.9825 | 0.9794 |

For O2, we notice generally better performance on clang 20. For prior versions, we observe worse performance on clang 10 (with the exception of 500.perlbench_r) and 12, while clang 14 and 16 give solid performance on most of the benchmarks.

Speedup/slowdown of each version against clang 8 at O2

| benchmark | 20 | 18 | 16 | 14 | 12 | 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 1.0240 | 1.0183 | 1.0054 | 1.0432 | 1.0000 | 1.0432 |
| 502.gcc_r | 1.1634 | 1.1582 | 1.1608 | 1.1608 | 1.0498 | 1.1608 |
| 505.mcf_r | 1.0321 | 1.0105 | 0.9961 | 0.9710 | 0.9723 | 0.9735 |
| 523.xalancbmk_r | 1.0257 | 1.0067 | 1.0101 | 1.0101 | 1.0204 | 1.0118 |
| 525.x264_r | 1.4667 | 1.2132 | 1.2088 | 1.0345 | 1.0855 | 1.0123 |
| 531.deepsjeng_r | 1.0000 | 0.9975 | 0.9975 | 1.0203 | 1.0075 | 0.9926 |
| 541.leela_r | 1.0679 | 1.0642 | 1.0606 | 1.0624 | 1.0217 | 1.0132 |
| 557.xz_r | 1.1823 | 1.1673 | 1.1780 | 1.1630 | 1.0473 | 1.1630 |

For O3, we notice generally better performance on clang 20. For prior versions, we observe overall good performance on clang 16 and clang 12. Looking at 500.perlbench_r in detail, clang 8 gave the best performance, so every other version shows a slowdown.

Speedup/slowdown of each version against clang 8 at O3

| benchmark | 20 | 18 | 16 | 14 | 12 | 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 0.9833 | 0.9299 | 0.9672 | 0.9981 | 0.9888 | 0.9907 |
| 502.gcc_r | 1.1354 | 1.1128 | 1.1303 | 1.0182 | 1.1031 | 1.1380 |
| 505.mcf_r | 1.0244 | 1.0107 | 0.9755 | 0.9805 | 1.0121 | 1.0080 |
| 523.xalancbmk_r | 1.0205 | 1.0017 | 1.0118 | 1.0050 | 1.0101 | 1.0118 |
| 525.x264_r | 1.5841 | 1.2786 | 1.3876 | 0.9917 | 0.9944 | 1.0199 |
| 531.deepsjeng_r | 1.0050 | 0.9573 | 1.0228 | 0.9975 | 1.0521 | 0.9902 |
| 541.leela_r | 1.0716 | 1.0436 | 1.0735 | 1.0716 | 1.0239 | 1.0135 |
| 557.xz_r | 1.1900 | 1.1601 | 1.1857 | 1.0271 | 1.1685 | 1.1664 |

Preliminary investigation

We conducted an analysis to assess whether clang 20 may have introduced new optimizations that improve general performance and mask the enduring presence of a regression. Comparing the optimization pipelines of version 18 and version 20, we note that Merge disjoint stack slots and CoroAnnotationElidePass are enabled only in version 20, and TLS Variable Hoist only in version 18. Since the benchmarks do not use modern C++ coroutines, we assume CoroAnnotationElidePass is irrelevant here.
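For anyone who wants to repeat the comparison, a rough sketch of how the middle-end pipelines can be diffed, assuming both binaries accept `-mllvm -print-pipeline-passes` (available in recent releases) and with hypothetical install paths; note that backend passes such as Merge disjoint stack slots are not part of this output and need a separate comparison:

```python
# Rough sketch: diff the middle-end (IR) pass pipelines scheduled by two clang
# binaries via -mllvm -print-pipeline-passes. Install paths are hypothetical.
import os
import subprocess
import tempfile

def pipeline_passes(clang: str, level: str = "-O3") -> set[str]:
    """Return a crudely tokenized set of the IR passes clang schedules."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write("int main(void) { return 0; }\n")
        src = f.name
    try:
        proc = subprocess.run(
            [clang, level, "-S", "-emit-llvm", "-o", os.devnull,
             "-mllvm", "-print-pipeline-passes", src],
            capture_output=True, text=True, check=True)
    finally:
        os.unlink(src)
    # The pipeline is printed as a nested, comma-separated pass string; flatten
    # it crudely for a first-order diff of pass names.
    tokens = (proc.stdout + proc.stderr).replace("(", ",").replace(")", ",").split(",")
    return {t.strip() for t in tokens if t.strip()}

old = pipeline_passes("/usr/lib/llvm-18/bin/clang")  # hypothetical path
new = pipeline_passes("/usr/lib/llvm-20/bin/clang")  # hypothetical path
print("only in clang 20:", sorted(new - old))
print("only in clang 18:", sorted(old - new))
```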

We would love to hear the opinion of the community on this. As we work in academia, SPEC CPU 2017 is a de facto standard for our evaluations, but what we observed does not necessarily generalize (and we guess it probably does not, at least not for the benchmarks developers use).

We welcome your insights and hope they can help us get to the bottom of this phenomenon.

Tested LLVM Versions

  • LLVM 20: version 20.1.9 (commit id: 0e240b8, from branch release/20.x)
  • LLVM 18: version 18.1.8 (commit id: 3b5b5c1, from branch release/18.x)
  • LLVM 16: version 16.0.6 (commit id: 7cbf1a2, from branch release/16.x)
  • LLVM 14: version 14.0.6 (commit id: f28c006, from branch release/14.x)
  • LLVM 12: version 12.0.1 (commit id: fed4134, from branch release/12.x)
  • LLVM 10: version 10.0.0-4ubuntu1
  • LLVM 8: version 8.0.1-9 (tags/RELEASE_801/final)

Employed Compilation Flags

The SPEC CPU 2017 benchmarks require some compilation flags to ensure compatibility of the tested software with the target platform. All benchmarks share the following: “-g -std=c99 -m64” for C sources and “-g -std=c++03 -m64” for C++ sources. Some benchmarks require additional specific flags:

  • 502.gcc_r: -fgnu89-inline -fwrapv
  • 523.xalancbmk_r: -fdelayed-template-parsing
  • 525.x264_r: -fcommon

References

[1] H. Shen, K. Pszeniczny, R. Lavaee, S. Kumar, S. Tallam, and X. D. Li. 2023. Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications. In Proc. of ASPLOS 2023.


Thanks for sharing the data. It’s interesting to see non-monotonic performance for O2 across released LLVM versions and several significant regressions.

Since the LLVM release documentation mentions benchmarking, we are curious whether we should enhance that step with SPEC benchmarks. Is the llvm-test-suite used in the benchmarking step mentioned in the document?

cc: @tstellar @rnk

(Full disclosure, this investigation is part of research work in collaboration with Google compiler opt folks to improve sample-based profiling accuracy. These numbers are some intermediate results that surprised us.)


Any kind of testing (e.g. llvm-test-suite, benchmarks, etc.) beyond ninja check-all is done by volunteers. There is no official list of tests that get run either, so I don’t know if any of the testers run the llvm-test-suite or not.

If you or someone else is interested in tracking SPEC performance in LLVM, I would suggest setting up some automated testing that periodically tests the main branch and then filing issues for any regressions. I know SPEC is important, but be aware that just because a change regresses performance in SPEC doesn’t mean it will automatically be fixed, especially if fixing it would end up causing regressions in other workloads.


These results are definitely pretty interesting. Continuous performance testing on these sorts of benchmarks would be nice to have if someone is willing to maintain it.

Nikita has been maintaining the LLVM compile-time tracker for quite a while now (5+ years?), and while it is mainly focused on compile-time differences caused by code changes in LLVM/clang, it can provide some insight into how optimization changes impact performance through its multi-stage builds; the most recent example I remember seeing was differing inlining decisions causing reasonably big performance regressions (1-2% from what I remember). I would think that significant regressions would be caught through tooling like this, or on internal benchmarks maintained by Google, Meta, etc.

I’m not familiar enough with SPEC to know how close these benchmarks are to common server binaries/clang, but I know others often mention that SPEC is not very representative of their workloads. That might somewhat explain why other benchmarks that people do track would not have regressed while SPEC did in some circumstances, although that could also be partially explained by different compilation setups, with most of those users utilizing various flavors of profile-guided optimization.

Tracking SPEC performance and ensuring we do not have significant regressions would be nice, but as Tom mentions, regressions in SPEC will probably not be fixed if fixing them would negatively impact other real-world workloads. People also need to be interested in fixing issues surfaced by SPEC or any other automated performance testing (maybe FleetBench?) in order to fully close the loop.


Hi,

I’m happy to report that our (Sony) test-team has got a bot doing hourly SPEC2017 runs, from which we’re planning on regularly publishing data. It’ll most likely end up being a “push experiments to a public GitHub repo” situation like Nikita’s compile-time-tracker bot; sadly, we’re not in a position to get a website up to interpret the results. The bot is only doing the “train” portion of a SPEC run rather than the full multi-reference-run config, as we’re aiming for throughput at this stage.

As mentioned above, the SPEC workloads aren’t a perfect representation of what all users of LLVM are concerned with. Having continuous measurements on a decent set of workloads is valuable for detecting regressions, and I expect this could highlight changes (inadvertent or otherwise) when they occur.


Do you have more information on how this is set up? Are you running SPEC through the LLVM test-suite’s external projects support or on its own?

It’s running through the LLVM test-suite external project config. The CPU is an off-the-shelf AMD 4700S; the exact config will be in the experiments repo when we get it online.

I’m happy to report that our (Sony) test-team has got a bot doing hourly SPEC2017 runs, from which we’re planning on regularly publishing data

Thanks for sharing information about the SPEC regression testing setup! Looking forward to hearing from you when it’s publicly available.

As mentioned above, the SPEC workloads aren’t a perfect representation of what all users of LLVM are concerned with.

Yes, we (at Google) also continuously track performance with micro and macro benchmarks. However, this is primarily for configurations we are interested in, e.g. using PGO. Furthermore, these workloads are a moving target themselves and may hide regressions. There is value in keeping the measured workload static to find opportunities. While SPEC is often criticized for not being representative of larger datacenter workloads, it still serves as a starting point for CPU qualification efforts.

We at Igalia are also performing SPEC2017 train runs for a handful of RISC-V configurations, albeit only nightly. It’s running on a Banana Pi BPI-F3 which isn’t exactly the fastest thing ever, so runs can take ~8 hours with multisampling. The results can also be quite noisy in places but it’s enough to detect large regressions/performance gains.

We’re publishing the results onto a small personal LNT instance for now, but it’s available publicly here:

It also collects profiling data so you can see where the cycles are spent for each benchmark etc.

The plan is to eventually start publishing these results to https://lnt.llvm.org


It would be interesting to see if these results still hold up after applying LLD’s measurement bias control feature. I noted that many results are within 0.5% of each other, which in my experience is within the range explainable by this form of measurement bias.


Thank you for the responses! We decided to go ahead using clang itself as the benchmark target.

We took the LLVM versions from the SPEC CPU 2017 experiments and used them to compile clang-12. We tested 3 different configurations:

  • Release: performance in compiling clang-12 using the different versions of clang selected
  • IR PGO: like above, but with a PGO-optimized compiler using profiles obtained from IR instrumentation
  • AutoFDO: like above, with profiles obtained using AutoFDO

For the PGO-optimized binaries, we collected profiles by running the first 100 compilation commands of the clang build, sticking to what was done in the Propeller paper’s evaluation.
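For context, a minimal sketch of how such a replay can be driven from the build's compile_commands.json; the instrumented-toolchain and build paths are placeholders and this is an illustration rather than our exact script:

```python
# Minimal sketch: replay the first N compile commands of the clang-12 build with
# an instrumented toolchain and merge the raw profiles for IR PGO.
import glob
import json
import os
import shlex
import subprocess

INSTRUMENTED_BIN = "/opt/llvm-instrumented/bin"   # hypothetical toolchain path
BUILD_DIR = "/path/to/clang-12-build"             # hypothetical build directory
N_COMMANDS = 100

PROFILE_DIR = os.path.abspath("profiles")
os.makedirs(PROFILE_DIR, exist_ok=True)
env = dict(os.environ,
           LLVM_PROFILE_FILE=os.path.join(PROFILE_DIR, "clang-%p.profraw"))

with open(os.path.join(BUILD_DIR, "compile_commands.json")) as f:
    commands = json.load(f)[:N_COMMANDS]

for entry in commands:
    # Swap the recorded compiler for its instrumented counterpart, keep the rest.
    args = shlex.split(entry["command"])
    args[0] = os.path.join(INSTRUMENTED_BIN, os.path.basename(args[0]))
    subprocess.run(args, cwd=entry["directory"], env=env, check=True)

# Merge the raw profiles into a single file usable with -fprofile-use=.
subprocess.run(["llvm-profdata", "merge", "-output=clang.profdata"]
               + glob.glob(os.path.join(PROFILE_DIR, "*.profraw")), check=True)
```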

Each performance experiment used 20 physical cores and was repeated 10 times. We attach the box plots of the measurements to show their statistical validity and to ease visual comparisons.

The tables below summarize the median execution time (in seconds) of each LLVM version in all 3 configurations and the percentage variation computed using clang-12 as the baseline. We highlighted the best results in bold.

Release

| llvm version | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| median exec. time (s) | 479.8591 | 478.0542 | 474.1651 | 481.5587 | 476.0653 | 474.3586 | **471.583** |
| Δ (%) over clang-12 Release | -1.1866 | -0.8135 | 0.0000 | -1.5353 | -0.3991 | -0.0408 | 0.5475 |

AutoFDO

| llvm version | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| median exec. time (s) | 434.9479 | 435.6348 | **433.1651** | 446.5027 | 445.2327 | 448.1797 | 440.5234 |
| Δ (%) over clang-12 AutoFDO | -0.4099 | -0.5669 | 0.0000 | -2.9871 | -2.7104 | -3.3501 | -1.6704 |

IR PGO

| llvm version | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| median exec. time (s) | 440.5082 | 418.9543 | 416.4505 | 417.4447 | 417.4101 | 411.8662 | **410.7991** |
| Δ (%) over clang-12 IR PGO | -5.4614 | -0.5976 | 0.0000 | -0.2382 | -0.2299 | 1.1131 | 1.3757 |

In the Release configuration, there is a slight performance regression at LLVM 14, but from there each newer version performs better than the previous one, with LLVM 20 being the best overall.

For AutoFDO, by contrast, the best version overall is LLVM 12, and performance degraded with later versions.

IR PGO, similarly to Release, shows LLVM 20 as the best version and LLVM 8 as the worst. Since IR PGO performed worse than AutoFDO with LLVM 8, we suspect something was wrong with the instrumentation back then.

Regarding the percentage variation, we measured it using clang-12 (the best version with AutoFDO) as the baseline. For AutoFDO, the degradation goes as low as -3.35% with LLVM 18, with improvements in LLVM 20 bringing it back to -1.67% in the most recent version. The first signs of regression are visible right after version 12.

From the IR PGO results, the trend is consistent with the Release results.

The table below shows the speedup/slowdown obtained by applying IR PGO and AutoFDO.

| speedup/slowdown (vs. the Release counterpart) | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| IR PGO | 1.0893 | 1.1411 | 1.1386 | 1.1536 | 1.1405 | 1.1517 | 1.1480 |
| AutoFDO | 1.1033 | 1.0974 | 1.0947 | 1.0785 | 1.0693 | 1.0584 | 1.0705 |
| AutoFDO vs IR PGO difference | 0.0139 | -0.0437 | -0.0439 | -0.0751 | -0.0713 | -0.0933 | -0.0775 |

By diffing the AutoFDO and IR PGO speedups, we can quantify the AutoFDO loss. We can see that after LLVM 12 the gap between AutoFDO and IR PGO has worsened, reaching a negative peak of -9.3% at version 18. We suspect this is caused by a degradation in debug information availability, specifically affecting the line information relevant for AutoFDO profile generation. We are currently studying this phenomenon.
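To make the computation explicit, here is a worked check using the LLVM 18 medians from the tables above, assuming the speedups are simply the Release time divided by the optimized binary's time (which matches the reported values):

```python
# Worked check with the LLVM 18 medians reported above (seconds).
release, autofdo, ir_pgo = 474.3586, 448.1797, 411.8662

ir_pgo_speedup = release / ir_pgo     # ~1.1517
autofdo_speedup = release / autofdo   # ~1.0584

# A negative difference means AutoFDO recovers less of the IR PGO gain.
print(round(autofdo_speedup - ir_pgo_speedup, 4))  # -0.0933
```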




Thanks for the data.

Regarding AutoFDO, can you experiment with the pseudo-probe-based flavor (and compare it with the debug-line-based one)? That could indicate whether debug-info maintenance across passes plays a role in the regressions.
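(For reference, the two flavors differ roughly as follows at the compile-flag level; the profile file names below are placeholders:)

```python
# Rough sketch of the compile flags for the two AutoFDO flavors.
# Profile file names are placeholders.
common = ["-O3", "-gline-tables-only"]

# Debug-line-based AutoFDO: rely on (accurate) line info for profile matching.
debug_line_autofdo = common + [
    "-fdebug-info-for-profiling",
    "-fprofile-sample-use=clang.line.afdo",
]

# Pseudo-probe-based AutoFDO: rely on inserted probes instead of line numbers.
pseudo_probe_autofdo = common + [
    "-fpseudo-probe-for-profiling",
    "-fprofile-sample-use=clang.probe.afdo",
]
```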

David
