State of the art for reducing executable size with a heavily optimized program

Hi! I am looking for a little input. I am trying to reduce the size of our arm64 (ELF) executable. We can’t afford to lower the optimisation level due to performance targets, but we already do the following to keep executable sizes down:

  • visibility=hidden
  • We use a PGO profile
  • ThinLTO
  • data/function sections and gc-sections
  • We don’t use RTTI/exceptions
  • Enabled ICF in lld

Is there something I am missing? I have searched through the forums and there has been talk about a machine outliner that can work with the PGO profile to selectively outline cold functions. Is that enabled by default?

I also found something about a machine splitter that could help. Is that also enabled by default?

Profiling the binary with bloaty shows that the text section is the largest section, followed by rela and eh_frame.

I can disable eh_frame with an option, but I think we might need it for our crash reporting tool. That is something I still need to investigate.

Thanks for any help in advance!

You can take a look at ML guided inlining for size. It has resulted in 3-7% reductions in binary sizes at Google (more information in this blog post).

cc: @mtrofin @boomanaiden154-1

I also found something about a machine splitter that could help. Is that also enabled by default?

If this is referring to -fsplit-machine-functions then it will not help. In fact, it will increase overall binary size by increasing the eh_frame size. It is not enabled by default since more work remains to be done on COFF and MachO.

Can you elaborate on what is the motivation behind reducing the executable size?

Thanks! I’ll have a look, but at first glance it seems like it might require quite a bit of effort.

The platform we are working with has some strict constraints for both size on disk and VM size.

Others are working on reducing the size with code changes; I am mainly focused on the toolchain side to see what we can do there. So far PGO and ICF have been the biggest wins.

I’m not sure if it will help on top of techniques you already used, but using something like coldcc for exported functions that are consumed only by the code under your control might help.

Yeah, MLGO will require quite a bit of effort. Trying one of the existing models would be pretty trivial, assuming you’re doing the build on Linux. Doing it on Windows probably will take a ton of effort, at least until we broadly deploy EmitC, which we’re hoping to do soon.

Most of our training loops also don’t take into account linker optimizations. Things should still work, but they might not be as effective.

We did achieve good results on Chrome, though, by running the for-size inliner only on cold functions. There were significant size wins without performance regressions. If you’re interested in trying it out I can dig up the flags.

I don’t recall if it’s enabled by default, but given that rela is your second largest section type, I imagine you’ve got a lot of R_*_RELATIVE dynamic relocations? If so, enabling RELR relocations could considerably help that size. LLD has the option --pack-dyn-relocs=relr to do this. It would require the loader to understand them though, which isn’t a guarantee, I expect.

Oh wow. That shrunk the rela.dyn section from 28MB to 238kb. That’s a massive savings. I am not sure we can use this yet, we need to verify it on the platform, but if this is something we can use it’s a massive win. Thanks!

I checked chromium build.gn and found the flags. It seems to me that I need to be able to rebuild LLVM in order to use the ML inliner. Unfortunately I can’t do that on this platform at this stage, but I am very interested in testing this on our other platforms where we have more control over the toolchain. I am also super interested when this comes to Windows/COFF.

The experimental relative vtables option (-fexperimental-relative-c++-abi-vtables; see the Clang command line argument reference) has a chance of reducing size, depending on how many vtables there are in the program (+1 instruction per virtual call, but each vtable entry is 32 bits rather than 64 bits). It also means that no R_AARCH64_RELATIVE relocations are needed for the vtable entries.

The downside is that all parts of the program accessing the vtable need to agree, so I expect that this could be difficult to deploy unless the C++ library is statically linked.

Last time I tried, -fvisibility=hidden was not as effective as it could be, as it seems to only affect the visibility of definitions, so code that only references a symbol is unaffected. Some improvements can be made by putting __attribute__((visibility("hidden"))) on declarations, but this would be non-trivial on a large code-base.

In some cases, we use PGO to only outline cold code. We have some logic to optimistically or conservatively outline in the case where a function is missing PGO data, usually when it is stale. If you’re interested, I could upstream that work.

Are you using the machine outliner too? It is enabled with -mllvm -enable-machine-outliner. Full LTO will give you more from this, but you should get a couple of % from module-level outlining.

Hi! I am not using the outliner yet. How does that interact with PGO? Will hot functions still be inlined?

Thanks! I have seen this and it seems like we need a lot of testing to deploy this. But it’s something I will keep in mind for the future.

The machine outliner is a late pass, so it does not affect PGO inlining.

This sounds useful. It doesn’t help me right now, since I can’t rebuild the toolchain at this stage, but it sounds like it could be good for the future!

If link times are not a concern and the application is not excessively large, I’d recommend using regular LTO over ThinLTO. In our experience, LTO does a much better job at reducing size, and it would also likely improve performance.

There are also some experimental options you could try. We’ve seen size reduction from enabling GVN sink/hoist with -mllvm -enable-gvn-sink=1 and -mllvm -enable-gvn-hoist=1 but your mileage may vary. If you have control over the C++ ABI (i.e. you don’t need to interface with prebuilt C++ libraries), you can also try -Xclang -fexperimental-omit-vtable-rtti.

I’ve just published a pull request, "[MachineOutliner] Add profile guided outlining" (#154437 in llvm/llvm-project). Let me know what you think!

Thank you everyone! Just to give an update, we managed to get sizes down quite a bit with some of the suggestions here. Here is what we do to reduce the size of the application without losing performance, so that future searchers might find it:

Start by using bloaty to profile your application. It can give you a good indication where space is used.

  • -fdata-sections -ffunction-sections to clang and -Wl,-gc-sections to lld is very basic but saves a lot of space. This should probably be enabled by most applications. This reduces the size of the .text section.
  • -Wl,-icf=safe or -Wl,-icf=all enables identical code folding and saved about 10% of space in our application (YMMV). Be aware that the all flag makes your application non-conforming to the C++ standard, because distinct functions can end up with the same address, and you might run into issues. We have already used all on other platforms and it worked for us here. This reduced the .text section size.
  • -fvisibility=hidden - we build a statically linked self-contained application so this makes sense for us. We saved another 10% with this, but only size on disk, not VM size. It almost removed the .strtab section for us. If you can’t use this, use -fvisibility-inlines-hidden, which only hides the inlines.
  • -Wl,--pack-dyn-relocs=relr worked for us as well! I understand it depends on the deployment target’s loader, but it passed all our testing and removed a massive 12% of our application size by reducing the rela.dyn section down to almost nothing.
  • We tried the machine-outliner as well. It reduced the size by 5%, but unfortunately it decreased performance by around 2-3%, which we were unable to accept. Looking forward to trying @ellishg’s outliner based on PGO when it lands in our toolchain.
  • Experimental options like -fexperimental-relative-c++-abi-vtables and -fexperimental-omit-vtable-rtti were deemed too risky for where we are in our project currently, so while they seem very good for several reasons, we need some more time to trial them before we enable them.
  • I have not tested the gvn sink/hoist - I was not sure exactly what they did and we more or less hit our size target with the relr change for now.

Thanks everyone for contributing, and hopefully this will help someone else in the future who wants to reduce the size of their binary.

There is a Global Machine Outliner, which you can try if you can’t use -flto=full. It is enabled by default at -Oz; at other optimization levels you can enable it with the clang option -moutline.