[RFC] A Unified LTO Bitcode Frontend

ormris · April 18, 2022, 10:37pm

I’ll go ahead and redo these patches.

ormris · April 19, 2022, 12:50am

New revisions:
https://reviews.llvm.org/D123803
https://reviews.llvm.org/D123804
https://reviews.llvm.org/D123969
https://reviews.llvm.org/D123971

ormris · April 25, 2022, 4:39pm

Over the past few days I’ve been running performance tests with clang as my
test case. I focused on ThinLTO, since using Unified LTO introduces many
changes to the ThinLTO optimization pipeline. This first test measured how
long it takes to compile clang itself. This table compares unpatched clang
without Unified LTO to patched clang with Unified LTO enabled.

#	Current (s)	Unified (s)
1	3000.62	3036.92
2	3005.87	3036.31
3	3003.12	3025.09

AVG Current (s)	AVG Unified (s)	% diff
3003.20	3032.77	0.98%
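
As a quick check on the arithmetic, the averages and percent difference above can be recomputed from the three runs. This is only a sketch: the variable names are illustrative, and it assumes the percent difference is taken relative to the current (non-unified) average.

```python
from statistics import mean

# Build times in seconds, copied from the table above.
current = [3000.62, 3005.87, 3003.12]
unified = [3036.92, 3036.31, 3025.09]

avg_current = mean(current)
avg_unified = mean(unified)

# Assumption: the percent difference is measured against the current average.
pct_diff = (avg_unified - avg_current) / avg_current * 100

print(f"{avg_current:.2f} {avg_unified:.2f} {pct_diff:.2f}%")
```

With these inputs the script reproduces the 3003.20 s and 3032.77 s averages and the 0.98% difference.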

As an aside, one configuration is not shown here: a patched clang with LTO
enabled, but without enabling Unified LTO. We didn’t include it because there
is no measurable impact on the compile/link time. The generated executables
are unchanged in both FullLTO and ThinLTO so, by definition, there is no
impact on run time.

This next table compares two versions of clang, one built using ThinLTO without
Unified LTO and the other with Unified LTO. This test again uses clang as a
test case. I measured the time it takes to build clang in Release mode when
using each of these compilers.

#	Built w/Current (s)	Built w/Unified (s)
1	2279.84	2272.69
2	2277.96	2282.27
3	2278.11	2273.07

AVG Built w/Current (s)	AVG Built w/Unified (s)	%diff
2278.64	2276.01	0.12%

We didn’t see any major differences between ThinLTO with Unified LTO enabled
and standard ThinLTO in these tests: roughly a 1% penalty in compile time
and a very small benefit (about 0.1%) in run time.

cachemeifyoucan · April 25, 2022, 4:57pm

That looks even better than the WebKit build. I don’t know if @mehdi_amini still remembers the performance/size differences from when he brought up the ThinLTO pipeline in the first place, or what benchmark was used. Another set of numbers that might be helpful to collect is from llvm/llvm-test-suite on GitHub.

Also, while you’re at it, can you paste the code size difference for clang as well?

From these two numbers, I don’t really have any concerns about adding this as a new LTO pipeline, provided we can come up with a clear guideline for when to use UnifiedLTO vs. Full/ThinLTO.

ormris · April 25, 2022, 6:10pm

I don’t have that on hand, but it would be good data to collect. I’ll go ahead and do that.

That would be useful. I’ll see if I can get that set up.

ormris · May 24, 2022, 5:57pm

After a long delay, I’m finally ready to post some numbers. There was a crash
in the Full LTO pipeline that caused some delays. Anyway, here’s a couple of
important things we noticed. First of all, after measuring the executable size
difference between current regular LTO and unified regular LTO, we noticed that
there was a consistent difference. While this is definitely unexpected, as the
pipelines are nearly identical, there are some differences in internalization
and symbol resolution that may be causing these changes. Either way, this is
not what we said in our initial post of performance numbers, so I’m very glad
this was caught. After seeing these differences we felt it would be good to
compare current and patched clang without unified LTO enabled. I’ve put the
numbers below. Overall, I think the differences here look like they’re in the
noise, as expected.

Current

#	Full	Thin
1	5142.28	3022.03
2	5134.68	3018.76
3	5132.87	3016.36
AVG	5136.61	3019.05

Patched (without unified LTO)

#	Full	Thin
1	5141.39	3012.91
2	5132.55	3008.45
3	5133.66	3006.32
AVG	5135.87	3009.23


Full %diff	-0.01%
Thin %diff	-0.33%

And finally, here is a table comparing the executable size of clang-15
generated by the various pipelines. Again, not 100% identical, but very minor.

	Full	Thin
Current	146642752	151187864
Patched	147783464	152192368


Full %diff:	-0.78%
Thin %diff:	-0.66%

mehdi_amini · May 24, 2022, 6:29pm

Can you provide the instructions to reproduce?

cachemeifyoucan · May 24, 2022, 7:28pm

There really should not be any difference between FullLTO before and after the patch, right? Did you try setting ShouldPreserveUseListOrder on the bitcode module to see if that is what causes the diff? Other than that, I can’t see how FullLTO can be different before and after. It is important to understand the reason for the difference before claiming it is negligible.

ormris · May 24, 2022, 7:49pm

Sure. For the performance tests or the binary size differences?

ormris · May 24, 2022, 8:04pm

Yes, that’s what we expected.

I haven’t, but that’s definitely something to try.

Good point. Let me take a look at that and see what I can find.

mehdi_amini · May 24, 2022, 10:13pm

Now I am puzzled about what you’re measuring and the point of it? I thought you’d show the difference between FullLTO and FullLTO with the new proposed unified pipeline?

ormris · May 25, 2022, 9:02pm

The goal of the latest performance tests was to show that the patched compiler has an identical non-unified pipeline behavior to the current compiler, for both FullLTO and ThinLTO.

ormris · June 22, 2022, 10:41pm

The differences in binary size appear to be caused by enabling split LTO units.
The increased sizes of .symtab and .strtab are the main contributors, along
with some codegen changes. Since split LTO units + unified LTO + full LTO is a
SIE-specific configuration (as discussed above), I’ve re-run the binary
comparison with split LTO units disabled, using compilers with identical
version strings. This setup produced identical binaries.

ormris · July 22, 2022, 9:01pm

It’s been a few weeks since I last posted here, and I wanted to post some further performance numbers we’ve gathered from the LLVM test suite. These CTMark results show the difference between the ThinLTO frontend pipeline and the LTO pipeline more clearly. We see a maximum increase in compile time of 34%, the vast majority of which is in the frontend. I think this is expected at a certain level: running more passes is going to take more time. But I also think that larger test cases amortize the cost better. Particularly when more backend tasks are required, the compile-time cost is much less overall. One interesting data point we’re seeing here is a run-time speedup for “CTMark/kimwitu++/kc.test” (13%) as well as a run-time hit for “CTMark/tramp3d-v4/tramp3d-v4.test” (21%). We’re looking into where those differences are coming from.

test name	patched comp (s)	current comp (s)	diff	%diff	patched exec (s)	current exec (s)	diff	%diff
test-suite :: CTMark/kimwitu++/kc.test	23.75	21.84	1.91	8%	0.05	0.06	-0.01	-13%
test-suite :: CTMark/sqlite3/sqlite3.test	14.89	12.67	2.22	16%	2.64	2.64	0.00	0%
test-suite :: CTMark/consumer-typeset/consumer-typeset.test	15.83	13.45	2.38	16%	0.18	0.18	0.00	2%
test-suite :: CTMark/SPASS/SPASS.test	23.23	19.91	3.32	15%	7.68	7.71	-0.03	0%
test-suite :: CTMark/mafft/pairlocalalign.test	14.01	9.91	4.1	34%	15.00	15.23	-0.23	-1%
test-suite :: CTMark/Bullet/bullet.test	54.04	49.17	4.87	9%	3.62	3.67	-0.06	-2%
test-suite :: CTMark/ClamAV/clamscan.test	24.44	19.55	4.89	22%	0.13	0.14	0.00	-2%
test-suite :: CTMark/tramp3d-v4/tramp3d-v4.test	30.23	24.93	5.3	19%	0.28	0.22	0.05	21%
test-suite :: CTMark/lencod/lencod.test	27.87	20.98	6.89	28%	3.94	3.91	0.03	1%
test-suite :: CTMark/7zip/7zip-benchmark.test	75.77	67.11	8.66	12%	6.83	6.88	-0.05	-1%
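
The compile-time %diff column above appears to be a symmetric percent difference, that is, the difference divided by the mean of the patched and current values rather than by the current value alone. This is inferred from the numbers, not stated anywhere in the thread, and the function name below is made up for illustration. A minimal sketch that reproduces a few rows under that assumption:

```python
def symmetric_pct_diff(patched: float, current: float) -> float:
    """Difference divided by the mean of the two values, in percent."""
    return (patched - current) / ((patched + current) / 2) * 100


# (patched, current) compile times in seconds, copied from the table above.
compile_times = {
    "kc": (23.75, 21.84),
    "sqlite3": (14.89, 12.67),
    "pairlocalalign": (14.01, 9.91),
    "lencod": (27.87, 20.98),
}

# Assumption: the table rounds to whole percents.
for name, (patched, current) in compile_times.items():
    print(f"{name}: {symmetric_pct_diff(patched, current):.0f}%")
```

Under this convention the pairlocalalign row comes out at 34%, matching the table. Measured against the current compile time alone, the same row would be about 41%, so the choice of denominator matters when reading these columns.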

cachemeifyoucan · July 26, 2022, 8:50pm

> We see a maximum increase in compile time of 34%, the vast majority of which is in the frontend.

What do you mean by frontend and backend in this case? I don’t see why there should be a time difference in the clang frontend, since all the passes are in the backend.

In general, the compile time increase is expected and in line with what was originally thought (a “double digit percentage”, to quote @mehdi_amini). Since this is an opt-in feature, as long as the users understand the cost of this model, we can provide that.

> One interesting data point we’re seeing here is a run-time speedup for “CTMark/kimwitu++/kc.test” (13%) as well as a run-time hit for “CTMark/tramp3d-v4/tramp3d-v4.test” (21%). We’re looking into where those differences are coming from.

Looking forward to seeing what happens in those cases.

ormris · July 26, 2022, 10:22pm

I usually think of the LTO frontend as anything clang does to generate the build’s LTO bitcode files. The backend is all link-time optimization, plus any symbol management and bitcode I/O the linker needs to do.

Yes, if we are talking about clang’s frontend and backend, there shouldn’t be a difference. We don’t measure that here, though. CTMark measures clang’s total runtime, which in our case measures the amount of time it takes to generate the unified LTO bitcode files. The difference is rooted in the “double digit percentage” you referred to. However, these are very small test cases. When compiling a larger application, the cost is amortized better.

Agreed.

ormris · January 5, 2023, 5:26pm

Sorry it’s been so long without an update here. It’s been pretty busy over here. As for the questions remaining in this RFC, I’m now pretty convinced that the outliers in the dataset above are due to interference on the system. I’ve rerun the entire test suite with a larger number of runs (N=30) and got more consistent results. Note that large differences in compile time are expected in these small benchmarks, but are not seen when building real-world applications.

test name	patched comp	current comp	diff	%diff	patched exec	current exec	diff	%diff
7zip-benchmark	97.434	67.024	30.41	37%	6.131	6.096	0.035	1%
bullet	69.95	49.237	20.713	34%	3.065	3.1	-0.035	-1%
clamscan	32.636	19.547	13.089	50%	0.105	0.105	0.0	0%
consumer-typeset	21.884	13.338	8.546	49%	0.111	0.111	0.0	0%
kc	31.139	21.73	9.409	36%	0.043	0.04	0.003	7%
lencod	32.568	20.913	11.655	44%	3.482	3.44	0.042	1%
SPASS	32.105	19.733	12.372	48%	6.833	6.862	-0.029	-0%
sqlite3	20.277	12.777	7.5	45%	2.199	2.235	-0.036	-2%
tramp3d-v4	41.48	25.234	16.246	49%	0.208	0.204	0.004	2%

At this point, we’d like to move forward with this. Is there anything else that needs to be looked at?

ormris · January 18, 2023, 1:37am

Since these reviews have been open for a while at this point, I’ll go ahead and rebase them.

LebedevRI · January 18, 2023, 1:49am

What does that table represent, at least for the compile time?
Total accumulated time in milliseconds, over all 30 invocations?

ormris · January 18, 2023, 7:35pm

I’m using the CTMark test suite here, so the compile time numbers represent average total user time in seconds, over 30 invocations.
