Hi,
While conducting a performance evaluation for a project using the SPEC CPU 2017 benchmark suite, we measured a regression in the execution time of several benchmarks compiled with recent versions of clang.
We ran the 8 benchmarks used in the Propeller [1] paper (as we are conducting parallel studies on the efficacy of profile-guided optimizations) using 7 different clang versions with levels O1, O2 and O3, starting from 20.1.9 and going back to 8.0.1-9 (details on versions are given below).
All benchmarks were compiled without any profiles. For each benchmark, we did 5 runs and took the median value, as per SPEC practice. We measured negligible standard deviation for all benchmarks except 505.mcf_r (due to 1-2 slower runs in each configuration of levels and versions, but with consistent trends).
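As a minimal sketch of the aggregation described above, the snippet below takes the median of five run times and reports the sample standard deviation. The run times are made-up placeholders (not our measurements) and the function name is only illustrative.

```python
import statistics

def summarize_runs(times_s):
    """Return (median, sample standard deviation) of run times in seconds."""
    return statistics.median(times_s), statistics.stdev(times_s)

# Hypothetical run times for one benchmark/version/level configuration,
# with one slower outlier run.
runs = [560.0, 561.0, 559.0, 560.0, 575.0]
median_s, stdev_s = summarize_runs(runs)
print(f"median = {median_s:.1f} s, stdev = {stdev_s:.2f} s")
```

The median is not affected by a single slow run, which is why an occasional outlier such as those seen on 505.mcf_r does not change the reported value.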
The experiments ran on a server equipped with an Intel Xeon E5-2699 v4 CPU, 256 GB of RAM, and Linux OpenNebula3 kernel 5.4.0, providing 20 physical cores to the runcpu SPEC tool that runs concurrent instances for each benchmark.
Results Overview
The tables below compare the execution time (in seconds) measured after compiling with the latest version of clang against the best result measured for each benchmark among all the tested versions. In short, O2 is the only level at which the most recent version of clang most often gives the best performance.
At O1, the latest version of the compiler is the best on 2 benchmarks (505.mcf_r and 523.xalancbmk_r), while on the other targets the best results are spread across versions 12 to 18, going back to version 12 for only one of them (500.perlbench_r).
At O2, the latest version of the compiler is the best on 6 benchmarks, while on the remaining targets (500.perlbench_r and 531.deepsjeng_r) the best results are at version 14, with 500.perlbench_r performing equally well with version 14 and version 10.
At O3, the latest version of the compiler is the best on 4 benchmarks (505.mcf_r, 523.xalancbmk_r, 525.x264_r and 557.xz_r), while on the remaining targets the best results are at versions 8, 10, 12 and 16, going back to version 8 for one benchmark (500.perlbench_r).
| benchmark (O1) | exec. time, latest version (s) | exec. time, best version (s) | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 560 | 546 | 12 | 2.5641 |
| 502.gcc_r | 472 | 462 | 14 | 2.1645 |
| 505.mcf_r | 713 | - | 20 | - |
| 523.xalancbmk_r | 594 | - | 20 | - |
| 525.x264_r | 601 | 573 | 16/14 | 4.8866 |
| 531.deepsjeng_r | 409 | 405 | 14 | 0.9877 |
| 541.leela_r | 613 | 612 | 18 | 0.1634 |
| 557.xz_r | 556 | 553 | 18 | 0.5425 |
| benchmark (O2) | exec. time, latest version (s) | exec. time, best version (s) | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 542 | 532 | 14/10 | 1.8797 |
| 502.gcc_r | 453 | - | 20 | - |
| 505.mcf_r | 747 | - | 20 | - |
| 523.xalancbmk_r | 584 | - | 20 | - |
| 525.x264_r | 225 | - | 20 | - |
| 531.deepsjeng_r | 402 | 394 | 14 | 2.0305 |
| 541.leela_r | 574 | - | 20 | - |
| 557.xz_r | 543 | - | 20 | - |
| benchmark (O3) | exec. time, latest version (s) | exec. time, best version (s) | best version | Δ (%) |
|---|---|---|---|---|
| 500.perlbench_r | 540 | 531 | 8 | 1.6949 |
| 502.gcc_r | 443 | 442 | 10 | 0.2262 |
| 505.mcf_r | 737 | - | 20 | - |
| 523.xalancbmk_r | 586 | - | 20 | - |
| 525.x264_r | 226 | - | 20 | - |
| 531.deepsjeng_r | 402 | 384 | 12 | 4.6875 |
| 541.leela_r | 559 | 558 | 16 | 0.1792 |
| 557.xz_r | 542 | - | 20 | - |
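For readers who want to recompute the Δ (%) column, the short sketch below shows the formula the reported values are consistent with: the extra time taken by the latest version, relative to the best version. The function name is illustrative; the input times are taken from the tables above.

```python
def delta_percent(latest_s, best_s):
    """Slowdown of the latest version relative to the best version, in percent."""
    return (latest_s - best_s) / best_s * 100.0

# (latest, best) execution times in seconds, copied from the tables above.
cases = {
    ("500.perlbench_r", "O1"): (560, 546),
    ("525.x264_r", "O1"): (601, 573),
    ("531.deepsjeng_r", "O3"): (402, 384),
}

for (bench, level), (latest, best) in cases.items():
    print(f"{bench} at {level}: {delta_percent(latest, best):.4f} %")
```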
We noticed that performance started to improve consistently only with clang 20 (apart from the regressions above). Initially, for reasons related to our project, we stopped at clang 18 and noticed that, worryingly, clang 18 was the best at O1 for only 3 benchmarks, the best at O2 for 3 benchmarks, and never the best at O3.
The next tables show the speedup of each compiler version against the oldest one tested (8.0.1-9). Highlighted in bold we have the best speedup values among all versions, while in italic we have the best speedup values among all versions ignoring version 20. Prior to the latest version the issue is more evident, as the regression is present in most of the benchmarks.
For O1, we notice generally worse performance in clang 10 and 12 (with the exception of 500.perlbench_r), and clang 14 giving solid performance, most often superior to what we measured for clang 18 and even 20.
| benchmark (speedup/slowdown vs. clang 8 at O1) | clang 20 | clang 18 | clang 16 | clang 14 | clang 12 | clang 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 1.0089 | 1.0018 | 1.0107 | 1.0310 | 1.0348 | 1.0125 |
| 502.gcc_r | 1.2161 | 1.2397 | 1.2371 | 1.2424 | 0.9897 | 0.9863 |
| 505.mcf_r | 1.0281 | 1.0138 | 1.0000 | 1.0014 | 0.9839 | 0.9959 |
| 523.xalancbmk_r | 2.1599 | 2.1277 | 2.1348 | 2.1033 | 0.9469 | 0.9610 |
| 525.x264_r | 1.0566 | 1.0781 | 1.1082 | 1.1082 | 1.0275 | 1.0079 |
| 531.deepsjeng_r | 1.0196 | 1.0221 | 1.0196 | 1.0296 | 0.9766 | 0.9835 |
| 541.leela_r | 2.4274 | 2.4314 | 2.3923 | 2.4274 | 0.9809 | 1.0020 |
| 557.xz_r | 1.1133 | 1.1193 | 1.1073 | 1.1054 | 0.9825 | 0.9794 |
For O2, we notice generally better performance on clang 20. For prior versions, we observe worse performance on clang 10 (with the exception of 500.perlbench_r) and 12, and clang 14 and 16 giving solid performance on most of the benchmarks.
| benchmark (speedup/slowdown vs. clang 8 at O2) | clang 20 | clang 18 | clang 16 | clang 14 | clang 12 | clang 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 1.0240 | 1.0183 | 1.0054 | 1.0432 | 1.0000 | 1.0432 |
| 502.gcc_r | 1.1634 | 1.1582 | 1.1608 | 1.1608 | 1.0498 | 1.1608 |
| 505.mcf_r | 1.0321 | 1.0105 | 0.9961 | 0.9710 | 0.9723 | 0.9735 |
| 523.xalancbmk_r | 1.0257 | 1.0067 | 1.0101 | 1.0101 | 1.0204 | 1.0118 |
| 525.x264_r | 1.4667 | 1.2132 | 1.2088 | 1.0345 | 1.0855 | 1.0123 |
| 531.deepsjeng_r | 1.0000 | 0.9975 | 0.9975 | 1.0203 | 1.0075 | 0.9926 |
| 541.leela_r | 1.0679 | 1.0642 | 1.0606 | 1.0624 | 1.0217 | 1.0132 |
| 557.xz_r | 1.1823 | 1.1673 | 1.1780 | 1.1630 | 1.0473 | 1.1630 |
For O3, we notice generally better performance on clang 20. For prior versions, we observe overall good performance on clang 16 and clang 12. Looking at 500.perlbench_r in detail, clang 8 gave the best performance, so every other version shows a slowdown.
| benchmark (speedup/slowdown vs. clang 8 at O3) | clang 20 | clang 18 | clang 16 | clang 14 | clang 12 | clang 10 |
|---|---|---|---|---|---|---|
| 500.perlbench_r | 0.9833 | 0.9299 | 0.9672 | 0.9981 | 0.9888 | 0.9907 |
| 502.gcc_r | 1.1354 | 1.1128 | 1.1303 | 1.0182 | 1.1031 | 1.1380 |
| 505.mcf_r | 1.0244 | 1.0107 | 0.9755 | 0.9805 | 1.0121 | 1.0080 |
| 523.xalancbmk_r | 1.0205 | 1.0017 | 1.0118 | 1.0050 | 1.0101 | 1.0118 |
| 525.x264_r | 1.5841 | 1.2786 | 1.3876 | 0.9917 | 0.9944 | 1.0199 |
| 531.deepsjeng_r | 1.0050 | 0.9573 | 1.0228 | 0.9975 | 1.0521 | 0.9902 |
| 541.leela_r | 1.0716 | 1.0436 | 1.0735 | 1.0716 | 1.0239 | 1.0135 |
| 557.xz_r | 1.1900 | 1.1601 | 1.1857 | 1.0271 | 1.1685 | 1.1664 |
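The sketch below illustrates how the speedup tables can be read. It assumes that speedup is the clang 8 execution time divided by the execution time of the version under test (values above 1 are faster than clang 8), which is consistent with the reported numbers. The helper names are illustrative, and the rows are copied from the O3 table with clang 8 added as the 1.0 baseline.

```python
def speedup(baseline_s, version_s):
    """Speedup of a version against the baseline (> 1 means faster)."""
    return baseline_s / version_s

def best_version(speedups, exclude=()):
    """Return the version with the highest speedup, skipping excluded ones."""
    candidates = {v: s for v, s in speedups.items() if v not in exclude}
    return max(candidates, key=candidates.get)

# Rows from the O3 table; "8" is the baseline and is 1.0 by definition.
x264_o3 = {"20": 1.5841, "18": 1.2786, "16": 1.3876,
           "14": 0.9917, "12": 0.9944, "10": 1.0199, "8": 1.0}
perlbench_o3 = {"20": 0.9833, "18": 0.9299, "16": 0.9672,
                "14": 0.9981, "12": 0.9888, "10": 0.9907, "8": 1.0}

print(best_version(x264_o3))                  # best over all versions
print(best_version(x264_o3, exclude={"20"}))  # best when ignoring clang 20
print(best_version(perlbench_o3))             # baseline wins if all others < 1
print(round(speedup(531, 540), 4))            # 500.perlbench_r, clang 20 at O3
```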
Preliminary investigation
We conducted an analysis to assess whether clang 20 may have introduced new optimizations that improve the general performance and mask the enduring presence of a regression. By comparing the optimization pipelines of version 18 and version 20, we note that Merge disjoint stack slots and CoroAnnotationElidePass are enabled only in version 20, and TLS Variable Hoist only in version 18. Since the benchmarks do not make use of modern C++ coroutines, we assume CoroAnnotationElidePass to be irrelevant here.
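A comparison of this kind can be sketched as a set difference over pass names. The snippet below is only an illustration: it assumes each pipeline has already been dumped to text with one pass name per line, and the toy dumps (including the placeholder "Some Shared Pass") are hypothetical rather than real compiler output.

```python
def pass_names(dump_text):
    """Collect the non-empty, whitespace-stripped lines of a dump as a set."""
    return {line.strip() for line in dump_text.splitlines() if line.strip()}

def pipeline_diff(old_dump, new_dump):
    """Return (passes only in old, passes only in new), each sorted by name."""
    old, new = pass_names(old_dump), pass_names(new_dump)
    return sorted(old - new), sorted(new - old)

# Hypothetical, heavily shortened dumps; real pipelines contain many more passes.
clang18_dump = """
Some Shared Pass
TLS Variable Hoist
"""
clang20_dump = """
Some Shared Pass
Merge disjoint stack slots
CoroAnnotationElidePass
"""

only_18, only_20 = pipeline_diff(clang18_dump, clang20_dump)
print("only in 18:", only_18)
print("only in 20:", only_20)
```

A name-based diff like this only shows which passes are present; it says nothing about changes inside passes that exist in both versions.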
We would love to hear the opinion of the community on this. As we work in academia, SPEC CPU 2017 is a de-facto standard for our evaluations, but what we observed does not necessarily generalize (and we guess it probably does not, at least not for the benchmarks developers use).
We welcome your insights and hope they can help us explain this phenomenon.
Tested LLVM Versions
- LLVM 20: version 20.1.9 (commit id: 0e240b8, from branch release/20.x)
- LLVM 18: version 18.1.8 (commit id: 3b5b5c1, from branch release/18.x)
- LLVM 16: version 16.0.6 (commit id: 7cbf1a2, from branch release/16.x)
- LLVM 14: version 14.0.6 (commit id: f28c006, from branch release/14.x)
- LLVM 12: version 12.0.1 (commit id: fed4134, from branch release/12.x)
- LLVM 10: version 10.0.0-4ubuntu1
- LLVM 8: version 8.0.1-9 (tags/RELEASE_801/final)
Employed Compilation Flags
The SPEC CPU 2017 benchmarks require some compilation flags to be set to ensure the compatibility of the tested software with the target platform. All the benchmarks share the following flags: "-g -std=c99 -m64" if the source code language is C, and "-g -std=c++03 -m64" if the source code language is C++. Some benchmarks require additional specific flags:
- 502.gcc_r: -fgnu89-inline -fwrapv
- 523.xalancbmk_r: -fdelayed-template-parsing
- 525.x264_r: -fcommon
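To make the flag combinations explicit, the sketch below assembles the flag list for a given benchmark, source language and optimization level from the values listed above. The function and dictionary names are illustrative and not part of the SPEC tooling, and the ordering of the flags is an arbitrary choice.

```python
# Flags shared by all benchmarks, keyed by source language.
BASE_FLAGS = {
    "c": ["-g", "-std=c99", "-m64"],
    "c++": ["-g", "-std=c++03", "-m64"],
}

# Additional benchmark-specific flags.
EXTRA_FLAGS = {
    "502.gcc_r": ["-fgnu89-inline", "-fwrapv"],
    "523.xalancbmk_r": ["-fdelayed-template-parsing"],
    "525.x264_r": ["-fcommon"],
}

def flags_for(benchmark, language, opt_level):
    """Return the full flag list, e.g. for opt_level 'O2' and language 'c'."""
    return [f"-{opt_level}"] + BASE_FLAGS[language] + EXTRA_FLAGS.get(benchmark, [])

print(" ".join(flags_for("502.gcc_r", "c", "O2")))
print(" ".join(flags_for("523.xalancbmk_r", "c++", "O3")))
print(" ".join(flags_for("505.mcf_r", "c", "O1")))
```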
References
[1] H. Shen, K. Pszeniczny, R. Lavaee, S. Kumar, S. Tallam, and X. D. Li. 2023. Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications. In Proc. of ASPLOS 2023. DOI.


