Clang/LLD ThinLTO footprint and run-time performance outperformed by GCC/ld

Hi,

I am working on a project that currently compiles an Arm application for an Arm Cortex-A9 target. We use GCC 13 and ld set up with LTO. Our optimization level is -O2 for release builds.
Our GCC flash footprint of a stripped application (--strip-all) is 29402 kB.
Our ld link time is ~10 min.
In general both the GCC and Clang builds run on Windows 11.
See details of the Clang compiler build at the bottom of this post.

As an experiment to save link time, we have tried to compile our application with Clang 18.1.3 and lld using -flto=thin -Wl,--thinlto-jobs=0 -Wl,--thinlto-cache-policy=cache_size_bytes=5g.
Our Clang flash footprint of a stripped application (--strip-all) is 33670 kB.
Our lld link time is ~2 min, and we get incremental linking in the range of ~15 s, great 🙂

But as can be seen, we also get a footprint that has increased:
GCC 29402 kB → Clang 33670 kB, a difference of 4268 kB.

4268 kB is quite a large increase in footprint, and it could be a dealbreaker for us.
The increase makes me wonder if we are missing something when we run ThinLTO.
Hence I have also done a compile using -fno-lto.

The Clang -fno-lto build yields a footprint of 33802 kB.
So from a footprint perspective, Clang ThinLTO saves: 33802 kB - 33670 kB = 132 kB.
In comparison, if I compile without LTO using GCC and ld, I get a footprint of 32494 kB.
GCC footprint savings using LTO: 32494 kB - 29402 kB = 3092 kB.

I did not expect Clang ThinLTO's code-size optimizations to be so meager (132 kB); I expected something in the same order of magnitude as GCC's (3092 kB).

Since the above shows that the footprint of our application is 4268 kB larger than with our current GCC builds, I have also tried an -Os build, optimizing more for size.
Clang ThinLTO -Os and GCC LTO -Os:

bin-size:
Clang: 32190 kB
GCC: 26690 kB

Idle thread:
Clang: 64.5 %
GCC: 70.4 %

Both footprint and run-time performance are worse for Clang when running -Os. I have not been able to run Clang -O2 builds yet due to run-time errors.

My question to the forum is: what can I check to figure out what goes wrong? It is quite opaque which optimizations ThinLTO actually applies. Is it possible to see what it does and does not do, to narrow down whether something is wrong with our Clang builds? Any ideas on what I can check?

To detail what Clang compiler I am using:
My build is based on Clang 18.1.3 and newlib 4.3.0.
I am using the ARM-software/LLVM-embedded-toolchain-for-Arm project (https://github.com/ARM-software/LLVM-embedded-toolchain-for-Arm).
Runtimes are built on Ubuntu for Arm and Thumb mode: armv7a_hard_neon_exn_rtti and armv7a_thumb_hard_neon_exn_rtti.
The bulk of our application runs in Thumb mode; only our board support package (BSP) runs in Arm mode.
Our Clang compiler toolchain is built on Windows 11.

I hope someone in here can help with some clues on what to check/investigate to get to the bottom of what we are doing wrong, or whether Clang is simply outperformed by GCC and ld for our setup. I can provide additional details on our builds on request.

🙂

This is just a drive-by comment. There are others who know this area far better than I do.

Unless your application is gigantic (for 32-bit Arm I'm assuming it probably isn't), it may be worth trying full LTO. While I don't have any direct experience comparing it to ThinLTO, there have been a few recent Dev Meeting talks suggesting that full LTO performs better for size optimization. An example talk: https://www.youtube.com/watch?v=8Uiv2RsPim4
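Roughly, that means swapping -flto=thin for -flto on both the compile and the link step and dropping the ThinLTO-specific linker options; a sketch with placeholder file names, keeping all your other flags unchanged (expect link time to go back up):

clang++ <usual flags> -O2 -flto -c foo.cpp
clang++ <usual flags> -flto -fuse-ld=lld -Wl,--gc-sections -o app.elf foo.o bar.o

Note that --thinlto-jobs and the --thinlto-cache-* options only apply to ThinLTO, so they would go away.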

One high-level way to get more details is Clang optimization remarks. The documentation implies that these can be enabled for LTO as well (see "LTO Remarks" in the LLVM remarks documentation).
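For example, something along these lines should write YAML remarks out of the LTO link step (a sketch; the option names are lld's ELF driver options, and remarks.yaml is a placeholder):

-Wl,--opt-remarks-filename=remarks.yaml
-Wl,--opt-remarks-format=yaml
-Wl,--opt-remarks-passes=inline

For non-LTO compiles, -fsave-optimization-record emits a per-TU .opt.yaml file instead.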

One other thing worth noting is that Clang -Os isn't as aggressive at size optimization as GCC -Os. Clang does have an -Oz option that may do more size optimization at the expense of performance.

If the program is C++ you may get some additional benefit out of -fwhole-program-vtables, which can be used with LTO.

Finally, LLD (I'm assuming --gc-sections is enabled) can do better but slower string merging with -O2, and can do identical code folding with --icf=safe, although full LTO may limit the ICF opportunities.
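Concretely, that would be something like the following linker options (a sketch; --print-icf-sections just logs every section that gets folded, so you can see what ICF actually finds):

-Wl,--icf=safe
-Wl,--print-icf-sections
-Wl,-O2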

Hi,

Thank you for your clues on what we can try; drive-by comments are also appreciated 🙂
I will definitely look into trying your suggestions.

To clarify your assumptions:

  • Target is 32-bit Arm.
  • The program is C++; we are currently on C++20.
  • As you can see from our compiler/linker flags below, we do use -fwhole-program-vtables and --gc-sections.

For clarity here are our current compiler flags:
-Wall
-Wno-overloaded-virtual
-std=c++20
-fexceptions
-frtti
-Wno-psabi
-mthumb
-mfloat-abi=hard
-mcpu=cortex-a9
-mfpu=neon
-fdata-sections
-ffunction-sections
-DNDEBUG
-O2
-flto=thin
-fno-fat-lto-objects
-fwhole-program-vtables
-fwrapv
-fstrict-aliasing
-fstrict-enums
-fstrict-vtable-pointers
-fstrict-return
-fstrict-float-cast-overflow
-fstrict-flex-arrays=3
-fno-unique-section-names

Linker flags:
-TC:/project/lib/scripts/cmakemodules/…/cmakemodules/platform/UCOS_cortex-a9_0_cl.ld
-mlittle-endian
-Wl,-stats,--warn-common
-Wl,--gc-sections -BC:/tools/llvm-project/18.1.3.B2.03052024/bin
-z norelro
-z nognustack
-Wl,--target2=rel
-Wl,--thinlto-jobs=0
-Wl,--thinlto-cache-dir=C:/project/lib/cc/build/ninjaclang/thinlto_cache
-Wl,--thinlto-cache-policy=cache_size_bytes=5g

If something looks off to you in the compiler/linker options I provided, please let me know 🙂

Hi,

As an FYI, today I tried playing with ICF as suggested above; it gave the following results.
icf=all:
Link time ~3 min
Binary stripped 28346 kB

icf=safe:
Link time ~3 min
Binary stripped 29198 kB

Note that GCC O2 with full LTO was:
Link time ~14 min
Binary stripped 29402 kB

So basically ICF seems to give the footprint savings we need 🙂

However, the time-critical parts of our application now have scheduler overruns, which basically means that the application's run-time behavior is worse than with GCC and ld full LTO.

I wanted to try out optimization remarks as you suggested, but I have not managed to do so today; maybe I will manage tomorrow 🙂
I am thinking maybe optimization remarks can yield some clue as to why our run-time behavior has gotten worse.

Since we do have gc-sections enabled, is the only choice we have to accept the slower string merging with -O2, or is it possible in some way to tune this with more speed in mind?

From the LLD source code, -O2 does two things:

  • It enables tail merging when merging strings, which prevents the algorithm from running in parallel. With the default of -O1, identical strings are merged, which can be done in parallel. With -O0, no string merging is done.
  • If zlib compression is used to compress debug sections, -O2 increases the compression level.

I suggest trying without -O2 to see if the tail merging is worth it.
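That is, based on the behavior described above, something like:

-Wl,-O1   (the default: identical strings merged, done in parallel)
-Wl,-O0   (no string merging at all, fastest)

and then compare binary size and link time against -Wl,-O2.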

Have you tried full LTO with clang and lld? What gcc does is essentially full LTO (i.e. optimization on the full IR).

ThinLTO gets its scalability (reflected in your much smaller link times) from doing purely summary-based whole-program analysis. Generally this still gives the biggest LTO performance wins (which typically come from cross-module inlining), but it is not (yet) able to do some optimizations that can be done with whole-program IR, especially some optimizations that reduce code size.
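As a toy illustration of the importing that drives those wins (a hypothetical two-file example, not from your application): with -flto=thin, the summaries let the link step import the IR of get_sample into main.cpp's module and inline it, even though the two files were compiled separately.

// helper.cpp
int get_sample() { return 42; }

// main.cpp
int get_sample();                      // defined in helper.cpp
int main() { return get_sample(); }    // importable and inlinable under ThinLTO

clang++ -O2 -flto=thin -c helper.cpp main.cpp
clang++ -flto=thin -fuse-ld=lld helper.o main.o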

I'm curious how much worse it is in run-time performance, as I've generally seen it pretty close to full LTO, but there are going to be cases where it is worse. There may be cases where gcc is doing a better job that we should look at more closely.

Hi,

As for teresajohnson's questions:

We have done a compile of our project with clang O2 full LTO and icf=safe, but got the following linker error:

ld.lld: error: ld-temp.o :2:2: out of range pc-relative fixup value
LDR r8, =0xF8F00000 /* MPCORE base address */
^
ld.lld: error: ld-temp.o :16:2: out of range pc-relative fixup value
LDR r12, =do_remap + 1
^
clang++: error: ld.lld command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

It relates to a piece of our code that handles on-chip memory.
For reference I have attached OnchipMemoryCodeSnippet.txt (2.6 KB).

We are a bit surprised that we have to make the fixes marked with TODO, but for now that at least allows us to link with clang lld full LTO.
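(The attachment is not reproduced here, but for this class of error, an LDR pseudo-instruction whose literal pool ends up out of PC-relative range, one common ARMv7 workaround is to materialize the constant with a MOVW/MOVT pair instead, which has no such range limit. A sketch using the MPCORE base address from the error message; this is illustrative, the actual TODO fixes are in the attachment:

movw r8, #0x0000    @ low 16 bits of 0xF8F00000
movt r8, #0xF8F0    @ high 16 bits; r8 now holds 0xF8F00000

and similarly movw/movt with :lower16:/:upper16: for symbol addresses like do_remap.)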

Clang O2 LTO full, icf=safe: ~28935 kB
Clang O2 LTO thin, icf=safe: ~29198 kB

Unfortunately our application crashes on startup with clang O2 LTO full icf=safe, so we have not been able to do run-time performance measurements yet. If we manage to fix this, I can get back with this information next week.

We have also looked into what smithp35 suggested:

Clang O1 LTO thin, icf=safe: 28854 kB
Clang O2 LTO thin, icf=safe: 29198 kB

Hence the footprint is actually smaller with O1.

Below is how the idle task is running.

Thread Name | Priority | Max CPU Usage | Current CPU Used | Average CPU Used | CtxSwitchIn/s

Clang O2 LTO thin, icf=safe:
uC/OS-III Idle Task | 63 | 57.86% | 34.21% | 39.17% | 1205
Clang O1 LTO thin, icf=safe:
uC/OS-III Idle Task | 63 | 57.59% | 35.88% | 37.27% | 1298
GCC O2 LTO full:
uC/OS-III Idle Task | 63 | 72.05% | 56.7% | 55.29% | 1780

There is no significant difference between the clang O2 and O1 builds in the above idle-task readings (~2% in average CPU used).
So far it seems like icf=safe gives us the footprint we need, but run-time performance is ~20% worse with clang compared to GCC.

ICF seems quite potent with regard to code size, but so far we have not been able to determine what it means for us with regard to run-time performance. When we do clang O2 LTO thin icf=none, our application crashes, hence I have no performance measurements for this scenario.

The best we have right now is the clang Os LTO thin icf=none build mentioned at the start of this thread. This does not crash. The builds seem to suggest that icf=none costs us ~7% when we compare clang Os LTO thin icf=none to clang O2 LTO thin icf=safe.
But then again we are comparing Os with O2, so the comparison might not be fair.

We are looking into whether optimization remarks can give us some hints about which optimizations work or where we need to do more work.

Any suggestions for identifying and improving performance would be much appreciated 🙂

Tuning import-instr-limit might be useful, e.g. -fuse-ld=lld -Wl,-mllvm=-import-instr-limit=10.

The default compression level for zlib is now independent of the linker optimization level (llvm-project PR #90567, "[ELF] Adjust --compress-sections to support compression level": https://github.com/llvm/llvm-project/pull/90567).
--compress-debug-sections=zstd should be slightly smaller than using zlib.
The default compression level is low; you can try a higher level like --compress-sections=.debug_*=zstd:13 with the latest lld.
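Put together, for a build that actually has debug sections, the linker options would be something like (a sketch, assuming a recent enough lld):

-Wl,--compress-debug-sections=zstd              (simple form, default low level)
-Wl,--compress-sections=.debug_*=zstd:13        (per-section form with an explicit level)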

There was a talk in 2022 about inlining for size, but the work does not seem to have been upstreamed.

Hi

I am not sure what import-instr-limit does.
I found the following files in the llvm repo:

  • llvm-project\compiler-rt\test\profile\instrprof-thinlto-indirect-call-promotion.cpp
  • llvm-project\llvm\lib\Transforms\IPO\FunctionImport.cpp

I interpret import-instr-limit as a parameter that controls when ThinLTO considers a function for inlining. Is that correctly understood, or can you provide a better explanation? 🙂

We apply LTO only to our release builds, mainly because at the time of writing we use GCC ld and full LTO, which is quite build-time consuming. Developers mainly use the faster-building debug builds for their incremental development cycles and switch to release builds only when the required run-time performance is needed.
Hence we have no debug sections to compress in release builds, which is why compression will not give us any footprint savings there.
However, if it turns out we can use clang lld and ThinLTO, this might change due to the shorter build times, and then we would consider using ThinLTO in our debug builds 🙂

import-instr-limit is an internal option (not user-facing, but commonly used) that controls the instruction-count threshold for importing a function. It seems that mobile users often set this to a small value like 5, 10, or 40 to minimize the size bloat due to importing+inlining (note: inlining a very short function can also decrease code size).

For example, this Linux kernel commit sets a small import-instr-limit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22d429e75f24d