I’ll go ahead and redo these patches.
New revisions:
https://reviews.llvm.org/D123803
https://reviews.llvm.org/D123804
https://reviews.llvm.org/D123969
https://reviews.llvm.org/D123971
Over the past few days I’ve been running performance tests with clang as my
test case. I focused on ThinLTO, since using Unified LTO introduces many
changes to the ThinLTO optimization pipeline. This first test measured how
long it takes to compile clang itself. This table compares unpatched clang
without Unified LTO to patched clang with Unified LTO enabled.
| # | Current (s) | Unified (s) |
|---|---|---|
| 1 | 3000.62 | 3036.92 |
| 2 | 3005.87 | 3036.31 |
| 3 | 3003.12 | 3025.09 |
| AVG Current (s) | AVG Unified (s) | % diff |
|---|---|---|
| 3003.20 | 3032.77 | 0.98% |
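As a quick check, the averages and percentage difference in the summary row can be recomputed from the three runs. This is a minimal sketch; the helper name `percent_diff` and the choice of the current compiler as the baseline are assumptions made for illustration.

```python
from statistics import mean

# Build times in seconds, copied from the table above.
current = [3000.62, 3005.87, 3003.12]
unified = [3036.92, 3036.31, 3025.09]


def percent_diff(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


avg_current = mean(current)
avg_unified = mean(unified)
slowdown = percent_diff(avg_current, avg_unified)

print(f"AVG current: {avg_current:.2f} s")
print(f"AVG unified: {avg_unified:.2f} s")
print(f"% diff:      {slowdown:.2f}%")
```

With only three runs per configuration, the result is best read as a rough estimate.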
As an aside, there is a configuration not shown here. That is: using a patched
clang with LTO enabled, but without enabling unified LTO. We didn’t include it
here because there is no measurable impact on the compile/link time. The
generated executables are unchanged in both FullLTO and ThinLTO so, by
definition, there is no impact on run time.
This next table compares two versions of clang, one built using ThinLTO without
Unified LTO and the other with Unified LTO. This test again uses clang as a
test case. I measured the time it takes to build clang in Release mode when
using each of these compilers.
| # | Built w/Current (s) | Built w/Unified (s) |
|---|---|---|
| 1 | 2279.84 | 2272.69 |
| 2 | 2277.96 | 2282.27 |
| 3 | 2278.11 | 2273.07 |
| AVG Built w/Current (s) | AVG Built w/Unified (s) | %diff |
|---|---|---|
| 2278.64 | 2276.01 | 0.12% |
We didn’t see any major differences between ThinLTO with Unified LTO enabled
and standard ThinLTO in these tests: roughly a 1% penalty in compile time
and a very small benefit (about 0.1%) in runtime.
That looks even better than the WebKit build. I don’t know if @mehdi_amini still remembers all the performance/size differences from when he brought up the ThinLTO pipeline in the first place, or what benchmark was used. Another set of numbers that might be helpful to collect is from the LLVM test suite (llvm/llvm-test-suite on GitHub).
While you’re at it, can you also paste the code size difference for clang?
Based on these two sets of numbers, I don’t really have any concerns about adding this as a new LTO pipeline, provided we can come up with a clear guideline for when to use UnifiedLTO vs. Full/ThinLTO.
I don’t have that on hand, but it would be good data to collect. I’ll go ahead and do that.
That would be useful. I’ll see if I can get that set up.
After a long delay, I’m finally ready to post some numbers. There was a crash
in the Full LTO pipeline that caused some delays. Anyway, here are a couple of
important things we noticed. First of all, after measuring the executable size
difference between current regular LTO and unified regular LTO, we noticed that
there was a consistent difference. While this is definitely unexpected, as the
pipelines are nearly identical, there are some differences in internalization
and symbol resolution that may be causing these changes. Either way, this is
not what we said in our initial post of performance numbers, so I’m very glad
this was caught. After seeing these differences we felt it would be good to
compare current and patched clang without unified LTO enabled. I’ve put the
numbers below. Overall, I think the differences here look like they’re in the
noise, as expected.
Current
| # | Full | Thin |
|---|---|---|
| 1 | 5142.28 | 3022.03 |
| 2 | 5134.68 | 3018.76 |
| 3 | 5132.87 | 3016.36 |
| AVG | 5136.61 | 3019.05 |
Patched (without unified LTO)
| # | Full | Thin |
|---|---|---|
| 1 | 5141.39 | 3012.91 |
| 2 | 5132.55 | 3008.45 |
| 3 | 5133.66 | 3006.32 |
| AVG | 5135.87 | 3009.23 |
| %diff | 0.20% | -0.33% |
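One rough way to gauge whether a difference is "in the noise" is to compare the gap between the averages with the run-to-run spread inside each configuration. The sketch below does that with the numbers from the two tables above. The `summarize` helper is hypothetical, and with three samples per configuration this is an illustration, not a significance test. Recomputed this way, the Full difference comes out near -0.01%, so the 0.20% listed above may have been derived differently.

```python
from statistics import mean

# Timings copied from the "Current" and "Patched" tables above.
runs = {
    "Full": {
        "current": [5142.28, 5134.68, 5132.87],
        "patched": [5141.39, 5132.55, 5133.66],
    },
    "Thin": {
        "current": [3022.03, 3018.76, 3016.36],
        "patched": [3012.91, 3008.45, 3006.32],
    },
}


def summarize(current, patched):
    """Return (gap between averages, largest run-to-run spread, gap in %)."""
    gap = mean(patched) - mean(current)
    spread = max(max(current) - min(current), max(patched) - min(patched))
    rel = gap / mean(current) * 100.0
    return gap, spread, rel


full_gap, full_spread, full_rel = summarize(**runs["Full"])
thin_gap, thin_spread, thin_rel = summarize(**runs["Thin"])

for name, (gap, spread, rel) in {
    "Full": (full_gap, full_spread, full_rel),
    "Thin": (thin_gap, thin_spread, thin_rel),
}.items():
    print(f"{name}: gap {gap:+.2f}, spread {spread:.2f}, diff {rel:+.2f}%")
```

For Full, the gap between the averages (under 1) is well inside the run-to-run spread (about 9). For Thin, the gap is about 10 against a spread of 6 to 7, so three runs alone may not settle whether that difference is noise.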
And finally, here is a table comparing the executable size of clang-15
generated by the various pipelines. Again, not 100% identical, but very minor.
| | Full | Thin |
|---|---|---|
| Current | 146642752 | 151187864 |
| Patched | 147783464 | 152192368 |
| %diff | -0.78% | -0.66% |
Can you provide the instructions to reproduce?
There really should not be any difference between FullLTO before and after the patch, right? Did you try setting ShouldPreserveUseListOrder on the bitcode module to see if that is what causes the diff? Other than that, I can’t see how FullLTO can be different before and after. It is important to understand the reason for the difference before claiming it is negligible.
Sure. For the performance tests or the binary size differences?
Yes, that’s what we expected.
I haven’t, but that’s definitely something to try.
Good point. Let me take a look at that and see what I can find.
Now I am puzzled about what you’re measuring and the point of it? I thought you’d show the difference between FullLTO and FullLTO with the new proposed unified pipeline?
The goal of the latest performance tests was to show that the patched compiler has an identical non-unified pipeline behavior to the current compiler, for both FullLTO and ThinLTO.
The differences in binary size appear to be caused by enabling split LTO units.
The increased sizes of .symtab and .strtab are the main contributors, along
with some codegen changes. Since split LTO units + unified LTO + full LTO is a
SIE-specific configuration (as discussed above), I’ve re-run the binary
comparison with split LTO units disabled, using compilers with identical
version strings. This setup produced identical binaries.
It’s been a few weeks since I last posted here, and I wanted to post some further performance numbers we’ve gathered from the LLVM test suite. These CTMark results show the difference between the ThinLTO frontend pipeline and the LTO pipeline more clearly. We see a maximum 34% increase in compile time, the vast majority of which is in the frontend. I think this is expected to a certain extent: running more passes is going to take more time. But I also think that larger test cases amortize the cost better. Particularly when more backend tasks are required, the compile time cost is much less overall. One interesting data point we’re seeing here is a run-time speedup for “CTMark/kimwitu++/kc.test” (13%) as well as a run-time hit for “CTMark/tramp3d-v4/tramp3d-v4.test” (21%). We’re looking into where those differences are coming from.
| test name | patched comp | current comp | diff | %diff | patched exec | current exec | diff | %diff |
|---|---|---|---|---|---|---|---|---|
| test-suite :: CTMark/kimwitu++/kc.test | 23.75 | 21.84 | 1.91 | 8% | 0.05 | 0.06 | -0.01 | -13% |
| test-suite :: CTMark/sqlite3/sqlite3.test | 14.89 | 12.67 | 2.22 | 16% | 2.64 | 2.64 | 0.00 | 0% |
| test-suite :: CTMark/consumer-typeset/consumer-typeset.test | 15.83 | 13.45 | 2.38 | 16% | 0.18 | 0.18 | 0.00 | 2% |
| test-suite :: CTMark/SPASS/SPASS.test | 23.23 | 19.91 | 3.32 | 15% | 7.68 | 7.71 | -0.03 | 0% |
| test-suite :: CTMark/mafft/pairlocalalign.test | 14.01 | 9.91 | 4.1 | 34% | 15.00 | 15.23 | -0.23 | -1% |
| test-suite :: CTMark/Bullet/bullet.test | 54.04 | 49.17 | 4.87 | 9% | 3.62 | 3.67 | -0.06 | -2% |
| test-suite :: CTMark/ClamAV/clamscan.test | 24.44 | 19.55 | 4.89 | 22% | 0.13 | 0.14 | 0.00 | -2% |
| test-suite :: CTMark/tramp3d-v4/tramp3d-v4.test | 30.23 | 24.93 | 5.3 | 19% | 0.28 | 0.22 | 0.05 | 21% |
| test-suite :: CTMark/lencod/lencod.test | 27.87 | 20.98 | 6.89 | 28% | 3.94 | 3.91 | 0.03 | 1% |
| test-suite :: CTMark/7zip/7zip-benchmark.test | 75.77 | 67.11 | 8.66 | 12% | 6.83 | 6.88 | -0.05 | -1% |
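The compile-time %diff column above appears consistent with a symmetric percent difference, that is, the difference divided by the mean of the two values, rather than a change relative to the current compiler alone. That reading is an inference from the numbers, not something stated in the post, and the function names below are hypothetical. A minimal sketch using three rows from the table:

```python
def symmetric_percent_diff(patched, current):
    """Difference divided by the mean of the two values, in percent."""
    return (patched - current) / ((patched + current) / 2.0) * 100.0


def baseline_percent_diff(patched, current):
    """Difference relative to the current (baseline) value, in percent."""
    return (patched - current) / current * 100.0


# (patched, current) compile times copied from the table above.
rows = {
    "CTMark/kimwitu++/kc.test": (23.75, 21.84),
    "CTMark/mafft/pairlocalalign.test": (14.01, 9.91),
    "CTMark/7zip/7zip-benchmark.test": (75.77, 67.11),
}

for name, (patched, current) in rows.items():
    sym = symmetric_percent_diff(patched, current)
    base = baseline_percent_diff(patched, current)
    print(f"{name}: symmetric {sym:.1f}%, relative to current {base:.1f}%")
```

Under this reading, the 34% figure for mafft matches the symmetric form; measured against the current compiler alone, the same row would be about 41%.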
We see a maximum 34% increase in compile time, the vast majority of which is in the frontend.
What do you mean by frontend and backend in this case? I don’t see why there should be a time difference in the clang frontend, since all the passes are in the backend.
In general, the compile time increase is expected and in line with what was originally thought (a double-digit percentage, to quote @mehdi_amini). Since this is an opt-in feature, as long as users understand the cost of this model, we can provide it.
One interesting data point we’re seeing here is a run-time speedup for “CTMark/kimwitu++/kc.test” (13%) as well as a run-time hit for “CTMark/tramp3d-v4/tramp3d-v4.test” (21%). We’re looking into where those differences are coming from.
Looking forward to seeing what happens in those cases.
I usually think of the LTO frontend as anything clang does to generate the build’s LTO bitcode files. The backend is all link-time optimization, plus any symbol management and bitcode I/O the linker needs to do.
Yes, if we are talking about clang’s frontend and backend, there shouldn’t be a difference. We don’t measure that here, though. CTMark measures clang’s total runtime, which in our case measures the amount of time it takes to generate the unified LTO bitcode files. The difference is rooted in the “double digit percentage” you referred to. However, these are very small test cases. When compiling a larger application, the cost is amortized better.
Agreed.
Sorry it’s been so long without an update here. It’s been pretty busy over here. As far as the questions remaining in this RFC, I’m now pretty convinced that the outliers in the dataset above are due to interference on the system. I’ve rerun the entire test suite with a larger number of runs (N=30) and got more consistent results. Note that large differences in compile time are expected in these small benchmarks, but are not seen when building real-world applications.
| test name | patched comp | current comp | diff | %diff | patched exec | current exec | diff | %diff |
|---|---|---|---|---|---|---|---|---|
| 7zip-benchmark | 97.434 | 67.024 | 30.41 | 37% | 6.131 | 6.096 | 0.035 | 1% |
| bullet | 69.95 | 49.237 | 20.713 | 34% | 3.065 | 3.1 | -0.035 | -1% |
| clamscan | 32.636 | 19.547 | 13.089 | 50% | 0.105 | 0.105 | 0.0 | 0% |
| consumer-typeset | 21.884 | 13.338 | 8.546 | 49% | 0.111 | 0.111 | 0.0 | 0% |
| kc | 31.139 | 21.73 | 9.409 | 36% | 0.043 | 0.04 | 0.003 | 7% |
| lencod | 32.568 | 20.913 | 11.655 | 44% | 3.482 | 3.44 | 0.042 | 1% |
| SPASS | 32.105 | 19.733 | 12.372 | 48% | 6.833 | 6.862 | -0.029 | -0% |
| sqlite3 | 20.277 | 12.777 | 7.5 | 45% | 2.199 | 2.235 | -0.036 | -2% |
| tramp3d-v4 | 41.48 | 25.234 | 16.246 | 49% | 0.208 | 0.204 | 0.004 | 2% |
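To illustrate why a larger number of runs gives more consistent averages: assuming the runs are independent, the standard error of the mean shrinks roughly with the square root of the number of runs. The timings below are hypothetical values chosen for the illustration, not measurements from the test suite, and `standard_error` is a helper defined only for this sketch.

```python
import math
import statistics


def standard_error(samples):
    """Sample standard deviation divided by sqrt(number of samples)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


# Hypothetical compile times in seconds (not real measurements).
three_runs = [21.5, 21.9, 21.7]
# The same run-to-run scatter repeated to simulate 30 invocations.
thirty_runs = three_runs * 10

se_3 = standard_error(three_runs)
se_30 = standard_error(thirty_runs)

print(f"standard error with N=3:  {se_3:.3f} s")
print(f"standard error with N=30: {se_30:.3f} s")
```

With the same underlying scatter, the average over 30 runs is several times more tightly determined than the average over 3, which makes one-off interference on the system less likely to show up as an outlier.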
At this point, we’d like to move forward with this. Is there anything else that needs to be looked at?
Since these reviews have been open for a while at this point, I’ll go ahead and rebase them.
What does that table represent, at least for the compile time?
Total accumulated time in milliseconds, over all 30 invocations?
I’m using the CTMark test suite here, so the compile time numbers represent average total user time in seconds, over 30 invocations.