[RFC] A Unified LTO Bitcode Frontend

:+1: I’ll go ahead and redo these patches.

New revisions:
https://reviews.llvm.org/D123803
https://reviews.llvm.org/D123804
https://reviews.llvm.org/D123969
https://reviews.llvm.org/D123971

Over the past few days I’ve been running performance tests with clang as my
test case. I focused on ThinLTO, since using Unified LTO introduces many
changes to the ThinLTO optimization pipeline. This first test measured how
long it takes to compile clang itself. This table compares unpatched clang
without Unified LTO to patched clang with Unified LTO enabled.

| #   | Current (s) | Unified (s) |
|-----|-------------|-------------|
| 1   | 3000.62     | 3036.92     |
| 2   | 3005.87     | 3036.31     |
| 3   | 3003.12     | 3025.09     |
| AVG | 3003.20     | 3032.77     |

% diff: 0.98%
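For context, here is a minimal sketch of how the two configurations above are invoked. The `-funified-lto` driver flag is the one proposed in this RFC, so its exact spelling may differ from the final patches; `foo.c` is a placeholder.

```shell
# Baseline: unpatched clang, plain ThinLTO.
clang -flto=thin -O2 -c foo.c -o foo.o
clang -flto=thin -fuse-ld=lld foo.o -o foo

# Patched clang with Unified LTO enabled (flag name per this RFC;
# the bitcode produced is usable by both the full and thin backends).
clang -flto=thin -funified-lto -O2 -c foo.c -o foo.o
clang -flto=thin -funified-lto -fuse-ld=lld foo.o -o foo
```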

As an aside, there is a configuration not shown here: a patched clang with LTO
enabled, but without Unified LTO enabled. We didn’t include it here because
there is no measurable impact on the compile/link time. The generated
executables are unchanged in both FullLTO and ThinLTO, so, by definition,
there is no run-time impact.

This next table compares two versions of clang, one built using ThinLTO without
Unified LTO and the other with Unified LTO. This test again uses clang as a
test case. I measured the time it takes to build clang in Release mode when
using each of these compilers.

| #   | Built w/Current (s) | Built w/Unified (s) |
|-----|---------------------|---------------------|
| 1   | 2279.84             | 2272.69             |
| 2   | 2277.96             | 2282.27             |
| 3   | 2278.11             | 2273.07             |
| AVG | 2278.64             | 2276.01             |

% diff: 0.12%

We didn’t see any major differences between ThinLTO with Unified LTO enabled
and standard ThinLTO in these tests: roughly a 1% penalty in compile time and
a very small benefit in run time.

That looks even better than the WebKit build. I don’t know if @mehdi_amini still remembers all the performance/size differences from when he brought up the ThinLTO pipeline in the first place, or what benchmark was used. Another set of numbers that might be helpful to collect is the LLVM test suite: GitHub - llvm/llvm-test-suite

Also while you have it, can you also paste the code size difference for clang as well?

From these two numbers I don’t really have any concerns about adding this as a new LTO pipeline, provided we can come up with a clear guideline for when you should use Unified LTO vs. Full/ThinLTO.

I don’t have that on hand, but it would be good data to collect. I’ll go ahead and do that.

That would be useful. I’ll see if I can get that set up.

After a long delay, I’m finally ready to post some numbers. A crash in the
Full LTO pipeline caused some of the delay. Anyway, here are a couple of
important things we noticed. First, after measuring the executable size
difference between current regular LTO and unified regular LTO, we noticed a
consistent difference. This is definitely unexpected, since the pipelines are
nearly identical, but there are some differences in internalization and symbol
resolution that may be causing these changes. Either way, this contradicts our
initial post of performance numbers, so I’m very glad it was caught. After
seeing these differences, we felt it would be good to compare current and
patched clang without unified LTO enabled. I’ve put the numbers below.
Overall, I think the differences here look like they’re in the noise, as
expected.

Current

| #   | Full (s) | Thin (s) |
|-----|----------|----------|
| 1   | 5142.28  | 3022.03  |
| 2   | 5134.68  | 3018.76  |
| 3   | 5132.87  | 3016.36  |
| AVG | 5136.61  | 3019.05  |

Patched (without unified LTO)

| #   | Full (s) | Thin (s) |
|-----|----------|----------|
| 1   | 5141.39  | 3012.91  |
| 2   | 5132.55  | 3008.45  |
| 3   | 5133.66  | 3006.32  |
| AVG | 5135.87  | 3009.23  |

Full %diff: 0.20%
Thin %diff: -0.33%

And finally, here is a table comparing the executable size (in bytes) of
clang-15 generated by the various pipelines. Again, the binaries are not 100%
identical, but the differences are very minor.

|         | Full (bytes) | Thin (bytes) |
|---------|--------------|--------------|
| Current | 146642752    | 151187864    |
| Patched | 147783464    | 152192368    |

Full %diff: -0.78%
Thin %diff: -0.66%
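For clarity, the %diff rows follow the convention (current − patched) / current, so a negative value means the patched build is larger. A quick sanity check of the arithmetic:

```shell
# %diff convention used in the table: (current - patched) / current * 100,
# so a negative value means the patched binary is larger.
full=$(awk 'BEGIN { printf "%.2f", (146642752 - 147783464) / 146642752 * 100 }')
thin=$(awk 'BEGIN { printf "%.2f", (151187864 - 152192368) / 151187864 * 100 }')
echo "Full %diff: ${full}%"   # prints Full %diff: -0.78%
echo "Thin %diff: ${thin}%"   # prints Thin %diff: -0.66%
```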

Can you provide the instructions to reproduce?

There really should not be any difference in FullLTO before and after the patch, right? Did you try setting ShouldPreserveUseListOrder on the bitcode module to see if that is what causes the diff? Other than that, I can’t see how FullLTO could differ before and after. It is important to understand the reason for the difference before claiming it is negligible.
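For anyone trying this, my understanding is that ShouldPreserveUseListOrder can be flipped from the driver via the cc1 flag `-emit-llvm-uselists` (worth double-checking against your clang version); `foo.c` is a placeholder:

```shell
# Emit ThinLTO bitcode with use-list order preserved, then compare against
# the default. -emit-llvm-uselists is a cc1 option, hence the -Xclang prefix.
clang -flto=thin -O2 -Xclang -emit-llvm-uselists -c foo.c -o foo.uselists.o
clang -flto=thin -O2 -c foo.c -o foo.o
cmp -s foo.uselists.o foo.o || echo "use-list order changes the bitcode"
```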


Sure. For the performance tests or the binary size differences?

Yes, that’s what we expected.

I haven’t, but that’s definitely something to try.

Good point. Let me take a look at that and see what I can find.

Now I am puzzled about what you’re measuring and the point of it. I thought you’d show the difference between FullLTO and FullLTO with the new proposed unified pipeline?

The goal of the latest performance tests was to show that the patched compiler has an identical non-unified pipeline behavior to the current compiler, for both FullLTO and ThinLTO.

The differences in binary size appear to be caused by enabling split LTO units.
The increased size of .symtab and .strtab are the main contributors, along
with some codegen changes. Since split LTO units + unified LTO + full LTO is a
SIE-specific configuration (as discussed above), I’ve re-run the binary
comparison with split LTO units disabled, using compilers with identical
version strings. This setup produced identical binaries.
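For completeness, a sketch of how the split-LTO-unit configurations can be reproduced and the section growth inspected. The flag names are the standard clang driver options; file names are placeholders:

```shell
# Split LTO units are controlled by -f(no-)split-lto-unit; they are also
# implied by features such as -fwhole-program-vtables and CFI.
clang -flto -fsplit-lto-unit -O2 -c foo.c -o foo.split.o
clang -flto -fno-split-lto-unit -O2 -c foo.c -o foo.nosplit.o

# After linking each variant, compare the .symtab/.strtab section sizes:
llvm-readelf -S clang.split | grep -E '\.symtab|\.strtab'
llvm-readelf -S clang.nosplit | grep -E '\.symtab|\.strtab'
```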