Performance Regression in LLVM - A SPEC CPU 2017 Study

Hi,

While conducting a performance evaluation for a project using the SPEC CPU 2017 benchmark suite, we measured an execution-time regression in recent clang versions on several of the benchmark targets.

We ran the 8 benchmarks used in the Propeller [1] paper (as we are conducting parallel studies on the efficacy of profile-guided optimizations) using 7 different clang versions at optimization levels O1, O2, and O3, starting from 20.1.9 and going back to 8.0.1-9 (details on the versions are given below).

All benchmarks were compiled without profiles. For each benchmark we did 5 runs and took the median value, as per SPEC practice. We measured negligible standard deviation for all benchmarks except 505.mcf_r (due to 1-2 slower runs in each combination of level and version, though the trends remain consistent).
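As an aside, the per-benchmark aggregation is simply the median of the 5 runs, with the standard deviation used as a noise check; a minimal sketch with placeholder run times:

```python
# Per-benchmark aggregation: median of 5 runs plus the standard deviation used
# as a noise check. The run times below are placeholders, not measured data.
import statistics

runs = [712.4, 713.0, 713.6, 718.9, 719.3]  # hypothetical run times in seconds
print("median:", statistics.median(runs))
print("stdev:", round(statistics.stdev(runs), 2))
```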

The experiments ran on a server equipped with an Intel Xeon E5-2699 v4 CPU, 256 GB of RAM, and Linux kernel 5.4.0 (OpenNebula3), providing 20 physical cores to SPEC's runcpu tool, which runs concurrent copies of each benchmark.

Results Overview

The tables below compare the execution time (in seconds) measured after compiling with the latest version of clang against the best result measured for each benchmark across all tested versions. In short, only at O2 is the most recent version of clang the one that most often gives the best performance.
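For clarity, Δ (%) in the tables is the relative slowdown of the latest clang with respect to the best-performing version; a minimal sketch of the computation, using the 500.perlbench_r numbers at O1:

```python
def delta_percent(latest: float, best: float) -> float:
    """Relative slowdown (in %) of the latest clang vs. the best-performing version."""
    return (latest - best) / best * 100

# 500.perlbench_r at O1: latest (clang 20) = 560 s, best (clang 12) = 546 s
print(round(delta_percent(560, 546), 4))  # 2.5641
```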

At O1, the latest version of the compiler is the best on 2 benchmarks (505.mcf_r and 523.xalancbmk_r), while on the other targets the best results are spread across versions 12 to 18, going back as far as version 12 for only one of them (500.perlbench_r).

At O2, the latest version of the compiler is the best on 6 benchmarks, while on the remaining targets (500.perlbench_r and 531.deepsjeng_r) the best results are at version 14, with 500.perlbench_r performing equally well with versions 14 and 10.

At O3, the latest version of the compiler is the best on 4 benchmarks (505.mcf_r, 523.xalancbmk_r, 525.x264_r, and 557.xz_r), while on the remaining targets the best results are at versions 8, 10, 12, and 16, going back as far as version 8 for one benchmark (500.perlbench_r).

Execution times at O1 (seconds)

| benchmark | exec. time, latest version | exec. time, best version | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 560 | 546 | 12 | 2.5641 |
| 502.gcc_r | 472 | 462 | 14 | 2.1645 |
| 505.mcf_r | 713 | - | 20 | - |
| 523.xalancbmk_r | 594 | - | 20 | - |
| 525.x264_r | 601 | 573 | 16/14 | 4.8866 |
| 531.deepsjeng_r | 409 | 405 | 14 | 0.9877 |
| 541.leela_r | 613 | 612 | 18 | 0.1634 |
| 557.xz_r | 556 | 553 | 18 | 0.5425 |

Execution times at O2 (seconds)

| benchmark | exec. time, latest version | exec. time, best version | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 542 | 532 | 14/10 | 1.8797 |
| 502.gcc_r | 453 | - | 20 | - |
| 505.mcf_r | 747 | - | 20 | - |
| 523.xalancbmk_r | 584 | - | 20 | - |
| 525.x264_r | 225 | - | 20 | - |
| 531.deepsjeng_r | 402 | 394 | 14 | 2.0305 |
| 541.leela_r | 574 | - | 20 | - |
| 557.xz_r | 543 | - | 20 | - |

Execution times at O3 (seconds)

| benchmark | exec. time, latest version | exec. time, best version | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 540 | 531 | 8 | 1.6949 |
| 502.gcc_r | 443 | 442 | 10 | 0.2262 |
| 505.mcf_r | 737 | - | 20 | - |
| 523.xalancbmk_r | 586 | - | 20 | - |
| 525.x264_r | 226 | - | 20 | - |
| 531.deepsjeng_r | 402 | 384 | 12 | 4.6875 |
| 541.leela_r | 559 | 558 | 16 | 0.1792 |
| 557.xz_r | 542 | - | 20 | - |

We noticed that performance only started to get consistently better with clang 20 (apart from the regressions above). Initially, for reasons related to our project, we had stopped at clang 18 and noticed that, worryingly, clang 18 was the best at O1 for only 3 benchmarks, the best at O2 for 3 benchmarks, and never the best at O3.

The next tables show the speedup of each compiler version against the oldest one tested (8.0.1-9). Highlighted in bold are the best speedup values among all versions, while in italic are the best speedup values among all versions ignoring version 20. We can see that, prior to the latest version, the issue is more evident, as the regression is present in most of the benchmarks.
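For reference, the speedups are the clang 8 execution time divided by the given version's execution time, so values above 1 mean the newer version is faster; a minimal sketch (the clang 8 time here is back-computed from the reported speedup, not a separately listed measurement):

```python
def speedup(time_clang8: float, time_version: float) -> float:
    """Speedup of a given clang version relative to clang 8 (>1 means faster)."""
    return time_clang8 / time_version

# 502.gcc_r at O1: clang 8 takes ~574 s (back-computed), clang 20 takes 472 s.
print(round(speedup(574, 472), 4))  # 1.2161
```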

For O1, we notice generally worse performance with clang 10 and 12 (with the exception of 500.perlbench_r), while clang 14 gives solid performance, often superior to what we measured for clang 18 and even 20.

Speedup/slowdown of each version against clang 8 at O1

| benchmark | 20 | 18 | 16 | 14 | 12 | 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 1.0089 | 1.0018 | 1.0107 | 1.0310 | 1.0348 | 1.0125 |
| 502.gcc_r | 1.2161 | 1.2397 | 1.2371 | 1.2424 | 0.9897 | 0.9863 |
| 505.mcf_r | 1.0281 | 1.0138 | 1.0000 | 1.0014 | 0.9839 | 0.9959 |
| 523.xalancbmk_r | 2.1599 | 2.1277 | 2.1348 | 2.1033 | 0.9469 | 0.9610 |
| 525.x264_r | 1.0566 | 1.0781 | 1.1082 | 1.1082 | 1.0275 | 1.0079 |
| 531.deepsjeng_r | 1.0196 | 1.0221 | 1.0196 | 1.0296 | 0.9766 | 0.9835 |
| 541.leela_r | 2.4274 | 2.4314 | 2.3923 | 2.4274 | 0.9809 | 1.0020 |
| 557.xz_r | 1.1133 | 1.1193 | 1.1073 | 1.1054 | 0.9825 | 0.9794 |

For O2, we notice generally better performance on clang 20. For prior versions, we observe worse performance on clang 10 (with the exception of 500.perlbench_r) and 12, while clang 14 and 16 give solid performance on most of the benchmarks.

Speedup/slowdown of each version against clang 8 at O2

| benchmark | 20 | 18 | 16 | 14 | 12 | 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 1.0240 | 1.0183 | 1.0054 | 1.0432 | 1.0000 | 1.0432 |
| 502.gcc_r | 1.1634 | 1.1582 | 1.1608 | 1.1608 | 1.0498 | 1.1608 |
| 505.mcf_r | 1.0321 | 1.0105 | 0.9961 | 0.9710 | 0.9723 | 0.9735 |
| 523.xalancbmk_r | 1.0257 | 1.0067 | 1.0101 | 1.0101 | 1.0204 | 1.0118 |
| 525.x264_r | 1.4667 | 1.2132 | 1.2088 | 1.0345 | 1.0855 | 1.0123 |
| 531.deepsjeng_r | 1.0000 | 0.9975 | 0.9975 | 1.0203 | 1.0075 | 0.9926 |
| 541.leela_r | 1.0679 | 1.0642 | 1.0606 | 1.0624 | 1.0217 | 1.0132 |
| 557.xz_r | 1.1823 | 1.1673 | 1.1780 | 1.1630 | 1.0473 | 1.1630 |

For O3, we notice generally better performance on clang 20. For prior versions, we observe overall good performance on clang 16 and clang 12. Looking at 500.perlbench_r in detail, clang 8 gave the best performance, so every other version shows a slowdown.

Speedup/slowdown of each version against clang 8 at O3

| benchmark | 20 | 18 | 16 | 14 | 12 | 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 0.9833 | 0.9299 | 0.9672 | 0.9981 | 0.9888 | 0.9907 |
| 502.gcc_r | 1.1354 | 1.1128 | 1.1303 | 1.0182 | 1.1031 | 1.1380 |
| 505.mcf_r | 1.0244 | 1.0107 | 0.9755 | 0.9805 | 1.0121 | 1.0080 |
| 523.xalancbmk_r | 1.0205 | 1.0017 | 1.0118 | 1.0050 | 1.0101 | 1.0118 |
| 525.x264_r | 1.5841 | 1.2786 | 1.3876 | 0.9917 | 0.9944 | 1.0199 |
| 531.deepsjeng_r | 1.0050 | 0.9573 | 1.0228 | 0.9975 | 1.0521 | 0.9902 |
| 541.leela_r | 1.0716 | 1.0436 | 1.0735 | 1.0716 | 1.0239 | 1.0135 |
| 557.xz_r | 1.1900 | 1.1601 | 1.1857 | 1.0271 | 1.1685 | 1.1664 |

Preliminary investigation

We conducted an analysis to assess whether clang 20 may have introduced new optimizations that improve general performance and mask the enduring presence of a regression. Comparing the optimization pipelines of version 18 and version 20, we note that Merge disjoint stack slots and CoroAnnotationElidePass are enabled only in version 20, and TLS Variable Hoist only in version 18. Since the benchmarks do not use modern C++ coroutines, we assume CoroAnnotationElidePass is irrelevant here.
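For anyone who wants to repeat the comparison, a rough sketch of how the middle-end pipelines can be diffed, assuming both binaries accept `-mllvm -print-pipeline-passes` (available in recent releases) and with hypothetical install paths; note that backend passes such as Merge disjoint stack slots are not part of this output and need a separate comparison:

```python
# Rough sketch: diff the middle-end (IR) pass pipelines scheduled by two clang
# binaries via -mllvm -print-pipeline-passes. Install paths are hypothetical.
import os
import subprocess
import tempfile

def pipeline_passes(clang: str, level: str = "-O3") -> set[str]:
    """Return a crudely tokenized set of the IR passes clang schedules."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write("int main(void) { return 0; }\n")
        src = f.name
    try:
        proc = subprocess.run(
            [clang, level, "-S", "-emit-llvm", "-o", os.devnull,
             "-mllvm", "-print-pipeline-passes", src],
            capture_output=True, text=True, check=True)
    finally:
        os.unlink(src)
    # The pipeline is printed as a nested, comma-separated pass string; flatten
    # it crudely for a first-order diff of pass names.
    tokens = (proc.stdout + proc.stderr).replace("(", ",").replace(")", ",").split(",")
    return {t.strip() for t in tokens if t.strip()}

old = pipeline_passes("/usr/lib/llvm-18/bin/clang")  # hypothetical path
new = pipeline_passes("/usr/lib/llvm-20/bin/clang")  # hypothetical path
print("only in clang 20:", sorted(new - old))
print("only in clang 18:", sorted(old - new))
```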

We would love to hear the opinion of the community on this. As we work in academia, SPEC CPU 2017 is a de facto standard for our evaluations, but what we observed does not necessarily generalize (and we guess it probably does not, at least not for the benchmarks developers use).

We welcome your insights and hope they can help us get to the bottom of this phenomenon.

Tested LLVM Versions

  • LLVM 20: version 20.1.9 (commit id: 0e240b8, from branch release/20.x)
  • LLVM 18: version 18.1.8 (commit id: 3b5b5c1, from branch release/18.x)
  • LLVM 16: version 16.0.6 (commit id: 7cbf1a2, from branch release/16.x)
  • LLVM 14: version 14.0.6 (commit id: f28c006, from branch release/14.x)
  • LLVM 12: version 12.0.1 (commit id: fed4134, from branch release/12.x)
  • LLVM 10: version 10.0.0-4ubuntu1
  • LLVM 8: version 8.0.1-9 (tags/RELEASE_801/final)

Employed Compilation Flags

The SPEC CPU 2017 benchmarks require some compilation flags to ensure compatibility of the tested software with the target platform. All benchmarks share the following: “-g -std=c99 -m64” for C sources and “-g -std=c++03 -m64” for C++ sources. Some benchmarks require additional specific flags:

  • 502.gcc_r: -fgnu89-inline -fwrapv
  • 523.xalancbmk_r: -fdelayed-template-parsing
  • 525.x264_r: -fcommon

References

[1] H. Shen, K. Pszeniczny, R. Lavaee, S. Kumar, S. Tallam, and X. D. Li. 2023. Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications. In Proc. of ASPLOS 2023.


Thanks for sharing the data. It’s interesting to see non-monotonic performance for O2 across released LLVM versions and several significant regressions.

Since the LLVM release documentation mentions benchmarking, we are curious whether we should enhance that step with SPEC benchmarks. Is the llvm-test-suite used in the benchmarking step mentioned in the document?

cc: @tstellar @rnk

(Full disclosure, this investigation is part of research work in collaboration with Google compiler opt folks to improve sample-based profiling accuracy. These numbers are some intermediate results that surprised us.)


Any kind of testing (e.g. llvm-test-suite, benchmarks, etc.) beyond ninja check-all is done by volunteers. There is no official list of tests that get run either, so I don’t know if any of the testers run the llvm-test-suite or not.

If you or someone else is interested in tracking SPEC performance in LLVM, I would suggest setting up some automated testing that periodically tests the main branch and then filing issues for any regressions. I know SPEC is important, but be aware that just because a change regresses performance in SPEC doesn’t mean it will automatically be fixed, especially if fixing it would end up causing regressions in other workloads.


These results are definitely pretty interesting. Continuous performance testing on these sorts of benchmarks would be nice to have if someone is willing to maintain it.

Nikita has been maintaining the LLVM compile-time tracker for quite a while now (5+ years?), and while it is mainly focused on compile-time differences caused by code changes in LLVM/clang, it can provide some insight into how optimization changes impact performance through its multi-stage builds; the most recent example I remember seeing was differing inlining decisions causing reasonably big performance regressions (1-2% from what I remember). I would think that significant regressions would be caught through tooling like this, or on internal benchmarks maintained by Google, Meta, etc.

I’m not familiar enough with SPEC to know how close these benchmarks are to common server binaries/clang, but I know others often mention that SPEC is not very representative of their workloads. That might somewhat explain why other benchmarks that people do track would not have regressed while SPEC did in some circumstances, although that could also be partially explained by different compilation setups, with most of those users utilizing various flavors of profile-guided optimization.

Tracking SPEC performance and ensuring we do not have significant regressions would be nice, but as Tom mentions, regressions in SPEC will probably not be fixed if fixing them would negatively impact other real-world workloads. People also need to be interested in fixing issues surfaced by SPEC or any other automated performance testing (maybe FleetBench?) in order to fully close the loop.


Hi,

I’m happy to report that our (Sony) test-team has got a bot doing hourly SPEC2017 runs, from which we’re planning on regularly publishing data. It’ll most likely end up being a “push experiments to a public GitHub repo” situation like Nikita’s compile-time-tracker bot; sadly, we’re not in a position to get a website up to interpret the results. The bot is only doing the “train” portion of a SPEC run rather than the full multi-reference-run config, as we’re aiming for throughput at this stage.

As mentioned above, the SPEC workloads aren’t a perfect representation of what all users of LLVM are concerned with. Having continuous measurements on a decent set of workloads is valuable for detecting regressions, and I expect this could highlight changes (inadvertent or otherwise) when they occur.


Do you have more information on how this is set up? Are you running SPEC through the LLVM test-suite’s external projects support or on its own?

It’s running through the LLVM test-suite external project config. The CPU is an off-the-shelf AMD 4700S; the exact config will be in the experiments repo when we get it online.

I’m happy to report that our (Sony) test-team has got a bot doing hourly SPEC2017 runs, from which we’re planning on regularly publishing data

Thanks for sharing information about the SPEC regression testing setup! Looking forward to hearing from you when it’s publicly available.

As mentioned above, the SPEC workloads aren’t a perfect representation of what all users of LLVM are concerned with.

Yes, we (at Google) also continuously track performance with micro and macro benchmarks. However, this is primarily for configurations we are interested in, e.g. using PGO. Furthermore, these workloads are a moving target themselves and may hide regressions. There is value in keeping the measured workload static to find opportunities. While SPEC is often criticized for not being representative of larger datacenter workloads, it still serves as a starting point for CPU qualification efforts.

We at Igalia are also performing SPEC2017 train runs for a handful of RISC-V configurations, albeit only nightly. It’s running on a Banana Pi BPI-F3 which isn’t exactly the fastest thing ever, so runs can take ~8 hours with multisampling. The results can also be quite noisy in places but it’s enough to detect large regressions/performance gains.

We’re publishing the results onto a small personal LNT instance for now, but it’s available publicly here:

It also collects profiling data so you can see where the cycles are spent for each benchmark etc.

The plan is to eventually start publishing these results to https://lnt.llvm.org


It would be interesting to see if these results still hold up after applying LLD’s measurement bias control feature. I noted that many results are within 0.5% of each other, which in my experience is within the range explainable by this form of measurement bias.


Thank you for the responses! We decided to go ahead using clang itself as the benchmark target.

We took the LLVM versions from the SPEC CPU 2017 experiments and used them to compile clang-12. We tested 3 different configurations:

  • Release: performance in compiling clang-12 using the different versions of clang selected
  • IR PGO: like above, but with a PGO-optimized compiler using profiles obtained from IR instrumentation
  • AutoFDO: like above, with profiles obtained using AutoFDO

For the PGO-optimized binaries, we collected profiles by running the first 100 compilation commands of the clang build, sticking to what was done in the Propeller paper’s evaluation.
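For context, a minimal sketch of how such a replay can be driven from the build's compile_commands.json; the instrumented-toolchain and build paths are placeholders and this is an illustration rather than our exact script:

```python
# Minimal sketch: replay the first N compile commands of the clang-12 build with
# an instrumented toolchain and merge the raw profiles for IR PGO.
import glob
import json
import os
import shlex
import subprocess

INSTRUMENTED_BIN = "/opt/llvm-instrumented/bin"   # hypothetical toolchain path
BUILD_DIR = "/path/to/clang-12-build"             # hypothetical build directory
N_COMMANDS = 100

PROFILE_DIR = os.path.abspath("profiles")
os.makedirs(PROFILE_DIR, exist_ok=True)
env = dict(os.environ,
           LLVM_PROFILE_FILE=os.path.join(PROFILE_DIR, "clang-%p.profraw"))

with open(os.path.join(BUILD_DIR, "compile_commands.json")) as f:
    commands = json.load(f)[:N_COMMANDS]

for entry in commands:
    # Swap the recorded compiler for its instrumented counterpart, keep the rest.
    args = shlex.split(entry["command"])
    args[0] = os.path.join(INSTRUMENTED_BIN, os.path.basename(args[0]))
    subprocess.run(args, cwd=entry["directory"], env=env, check=True)

# Merge the raw profiles into a single file usable with -fprofile-use=.
subprocess.run(["llvm-profdata", "merge", "-output=clang.profdata"]
               + glob.glob(os.path.join(PROFILE_DIR, "*.profraw")), check=True)
```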

Each performance experiment used 20 physical cores and was repeated 10 times. We attach the box plots of the measurements to show their statistical validity and to ease visual comparisons.

The tables below summarize the median execution time (in seconds) of each LLVM version in all 3 configurations and the percentage variation computed using clang-12 as the baseline. We highlighted the best results in bold.

Release

| llvm version | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| median exec. time (s) | 479.8591 | 478.0542 | 474.1651 | 481.5587 | 476.0653 | 474.3586 | **471.583** |
| Δ (%) over clang-12 Release | -1.1866 | -0.8135 | 0.0000 | -1.5353 | -0.3991 | -0.0408 | 0.5475 |

AutoFDO

| llvm version | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| median exec. time (s) | 434.9479 | 435.6348 | **433.1651** | 446.5027 | 445.2327 | 448.1797 | 440.5234 |
| Δ (%) over clang-12 AutoFDO | -0.4099 | -0.5669 | 0.0000 | -2.9871 | -2.7104 | -3.3501 | -1.6704 |

IR PGO

| llvm version | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| median exec. time (s) | 440.5082 | 418.9543 | 416.4505 | 417.4447 | 417.4101 | 411.8662 | **410.7991** |
| Δ (%) over clang-12 IR PGO | -5.4614 | -0.5976 | 0.0000 | -0.2382 | -0.2299 | 1.1131 | 1.3757 |

In the Release configuration, there is a slight performance regression at LLVM 14, but from there each newer version performs better than the previous one, with LLVM 20 being the best overall.

For AutoFDO, by contrast, the best version overall is LLVM 12, and performance degraded with later versions.

IR PGO, similarly to Release, shows LLVM 20 as the best version and LLVM 8 as the worst. Since IR PGO performed worse than AutoFDO with LLVM 8, we suspect something was wrong with the instrumentation back then.

Regarding the percentage variation, we measured it using clang-12 (the best version with AutoFDO) as the baseline. For AutoFDO, the degradation goes as low as -3.35% with LLVM 18, with improvements in LLVM 20 bringing it back to -1.67% in the most recent version. The first signs of regression are visible right after version 12.

From the IR PGO results, the trend is consistent with the Release results.

The table below shows the speedup/slowdown obtained by applying IR PGO and AutoFDO.

| speedup/slowdown (vs. the Release counterpart) | LLVM 8 | LLVM 10 | LLVM 12 | LLVM 14 | LLVM 16 | LLVM 18 | LLVM 20 |
|---|---|---|---|---|---|---|---|
| IR PGO | 1.0893 | 1.1411 | 1.1386 | 1.1536 | 1.1405 | 1.1517 | 1.1480 |
| AutoFDO | 1.1033 | 1.0974 | 1.0947 | 1.0785 | 1.0693 | 1.0584 | 1.0705 |
| AutoFDO vs IR PGO difference | 0.0139 | -0.0437 | -0.0439 | -0.0751 | -0.0713 | -0.0933 | -0.0775 |

By diffing the AutoFDO and IR PGO speedups, we can quantify the AutoFDO loss. We can see that after LLVM 12 the gap between AutoFDO and IR PGO has worsened, reaching a negative peak of -9.3% at version 18. We suspect this is caused by a degradation in debug information availability, specifically affecting the line information relevant for AutoFDO profile generation. We are currently studying this phenomenon.
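To make the computation explicit, here is a worked check using the LLVM 18 medians from the tables above, assuming the speedups are simply the Release time divided by the optimized binary's time (which matches the reported values):

```python
# Worked check with the LLVM 18 medians reported above (seconds).
release, autofdo, ir_pgo = 474.3586, 448.1797, 411.8662

ir_pgo_speedup = release / ir_pgo     # ~1.1517
autofdo_speedup = release / autofdo   # ~1.0584

# A negative difference means AutoFDO recovers less of the IR PGO gain.
print(round(autofdo_speedup - ir_pgo_speedup, 4))  # -0.0933
```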




Thanks for the data.

Regarding AutoFDO, can you experiment with the pseudo-probe-based flavor (and compare it with the debug-line-based one)? That could indicate whether debug-info maintenance across passes plays a role in the regressions.
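(For reference, the two flavors differ roughly as follows at the compile-flag level; the profile file names below are placeholders:)

```python
# Rough sketch of the compile flags for the two AutoFDO flavors.
# Profile file names are placeholders.
common = ["-O3", "-gline-tables-only"]

# Debug-line-based AutoFDO: rely on (accurate) line info for profile matching.
debug_line_autofdo = common + [
    "-fdebug-info-for-profiling",
    "-fprofile-sample-use=clang.line.afdo",
]

# Pseudo-probe-based AutoFDO: rely on inserted probes instead of line numbers.
pseudo_probe_autofdo = common + [
    "-fpseudo-probe-for-profiling",
    "-fprofile-sample-use=clang.probe.afdo",
]
```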

David
