Default compression level for --compress-debug-sections=zlib?

Folks,

I’d like to get experts’ opinions on which compression level is suitable for lld’s --compress-debug-sections=zlib option, which lets the linker compress .debug_* sections using zlib.

Currently, lld uses compression level 9, which produces the smallest output in exchange for a longer link time. My question is: is this what people actually want? We didn’t consciously choose compression level 9; that was simply the default for the zlib::compress function.

As an experiment, I created a patch to use compression level 1 instead of 9 and linked clang using that modified lld. By default, lld takes 1m4s to link clang with --compress-debug-sections=zlib. With that patch, it took only 31s.

Here is a comparison of clang executable size with various configurations:

no debug sections: 275 MB
level 9 compression: 855 MB
level 1 compression: 922 MB
no compression: 2044 MB
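(Subtracting the 275 MB of non-debug content, the .debug_* payload itself shrinks from roughly 1769 MB uncompressed to about 647 MB at level 1 and 580 MB at level 9.)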

Given that the best compression takes significantly longer than the fastest compression, we should probably change the default to level 1. Any objections?

I wonder what the best compression level is when -O2 is passed to lld. We could use level 9 when -O2 is passed, but is there any reason to compress debug sections that hard in the first place?
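(For anyone who wants to reproduce this kind of measurement, here is a minimal standalone sketch using raw zlib's compress2 directly, not lld's actual code path; feed it the bytes of a dumped .debug_* section:)

#include <zlib.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Compare wall time and output size of each zlib level on one buffer.
static void benchLevels(const std::vector<unsigned char> &input) {
  for (int level = 1; level <= 9; ++level) {
    uLongf outLen = compressBound(input.size()); // worst-case output size
    std::vector<unsigned char> out(outLen);
    auto t0 = std::chrono::steady_clock::now();
    int rc = compress2(out.data(), &outLen, input.data(), input.size(), level);
    auto t1 = std::chrono::steady_clock::now();
    if (rc == Z_OK)
      std::printf("level %d: %lu bytes, %.3f s\n", level,
                  (unsigned long)outLen,
                  std::chrono::duration<double>(t1 - t0).count());
  }
}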

I don't claim to be an expert, but I did some zlib compression
benchmarks in the past. IIRC, my conclusion from that was that the
"DEFAULT" zlib level (6) is indeed a very good default for a lot of
cases -- it does not generate much larger outputs, while being
significantly faster than the max level. This all depends on the data
set and what you intend to do with the resulting data, of course, but
I guess my point is you don't have to choose only between 1 and 9. I
think it would be interesting to at least get the data for the default
level before making a choice.

cheers,
pl

Also not an expert, but would it make sense for this to be configurable at a finer-grained level, perhaps with another option or an extension to the --compress-debug-sections switch interface? That way users who care about the finer details can configure it themselves, and we should still pick sensible defaults.

James

More data on different compression levels would be good. In this case we’re compressing fairly consistent-looking input data (DWARF sections), so I think we stand a good chance of being able to pick a very reasonable level.

I cringe at the thought of yet another user-facing knob, though.

--paulr

Hi Rui,

What’s the intended advantage of compressing the debug sections? (i) Improved link time through smaller IO, (ii) improved load/startup time for the debugger, or (iii) a smaller exe with debug info for distribution/disk space?

For (i) and (ii), how much this is worth depends on the balance of storage bandwidth to compression (i) / decompression (ii) bandwidth. For spinning drives it might be a win, but for SATA and especially PCIe/NVMe SSDs it could be a CPU bottleneck. Though we should also bear in mind that compression can be pipelined with writes in (i), and debug-info loading could be lazy in (ii).

(e.g. for highly compressible data we’ve generally seen ~10MiB/s output bandwidth, single-threaded on an i7 @ 3.2GHz, memory to memory, for zlib level 9 with a 32KiB window; that doesn’t stack up well against modern IO)

How is the compression implemented in lld? Is it chunked, and therefore parallelizable (and able to be pipelined with IO), or more serial?

I think the intention is (i), so we’d be happy to link a few of our game titles with varying compression levels vs storage types and let you know the results. Might be a couple of weeks…

I wonder what the best compression level is when -O2 is passed to lld.

Just to check: if the default is changed to compress at -O2, will we still be able to override and disable compression with --compress-debug-sections=none?

Thanks,

Simon

Here are the results of linking clang (with debug info) with various compression levels. Apparently, the current compression level of 9 is overkill.

Level  Time       Size
0      0m17.128s  2045081496  Z_NO_COMPRESSION
1      0m31.471s   922618584  Z_BEST_SPEED
2      0m32.659s   903642376
3      0m36.749s   890805856
4      0m41.532s   876697184
5      0m48.383s   862778576
6      1m3.176s    855251640  Z_DEFAULT_COMPRESSION
7      1m15.335s   853755920
8      2m0.561s    852497560
9      2m33.972s   852397408  Z_BEST_COMPRESSION

Augmenting Rui’s table with percentages relative to no compression.

Level  Time       Size        Cost    Gain
0      0m17.128s  2045081496      0%     0%  Z_NO_COMPRESSION
1      0m31.471s   922618584   83.7%  54.9%  Z_BEST_SPEED
2      0m32.659s   903642376   90.7%  55.8%
3      0m36.749s   890805856  114.6%  56.4%
4      0m41.532s   876697184  142.5%  57.1%
5      0m48.383s   862778576  182.5%  57.8%
6      1m3.176s    855251640  268.9%  58.2%  Z_DEFAULT_COMPRESSION
7      1m15.335s   853755920  339.8%  58.3%
8      2m0.561s    852497560  603.9%  58.3%
9      2m33.972s   852397408  798.9%  58.3%  Z_BEST_COMPRESSION
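(Cost is the extra link time relative to level 0 and Gain the size reduction relative to level 0; e.g. for level 1, 31.471/17.128 - 1 ≈ 83.7% and 1 - 922618584/2045081496 ≈ 54.9%.)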

SimonW’s questions are pertinent too. Assuming lld -O2 does other things useful for runtime performance, I think debug-info compression ought to remain under a separate option.

--paulr

Hi Rui,

What’s the intended advantage of compressing the debug sections? (i) Improved link time through smaller IO, (ii) improved load/startup time for the debugger, or (iii) a smaller exe with debug info for distribution/disk space?

I think (i) is definitely the case, and that’s also true for a distributed build system in which a lot of object files are copied between machines.

I doubt (ii) is true. Does compressing debug sections improve debugger load time? Of course, as you mentioned, it depends on the ratio of CPU speed to IO speed, but since linked debug info isn’t as large as the total of the input files, I think it is at least less important than (i).

As to (iii), in most cases I believe it is rare to distribute executables with debug info widely. Only developers need debug info.

For (i) and (ii), how much this is worth depends on the balance of storage bandwidth to compression (i) / decompression (ii) bandwidth. For spinning drives it might be a win, but for SATA and especially PCIe/NVMe SSDs it could be a CPU bottleneck. Though we should also bear in mind that compression can be pipelined with writes in (i), and debug-info loading could be lazy in (ii).

(e.g. for highly compressible data we’ve generally seen ~10MiB/s output bandwidth, single-threaded on an i7 @ 3.2GHz, memory to memory, for zlib level 9 with a 32KiB window; that doesn’t stack up well against modern IO)

How is the compression implemented in lld? Is it chunked, and therefore parallelizable (and able to be pipelined with IO), or more serial?

I think the intention is (i), so we’d be happy to link a few of our game titles with varying compression levels vs storage types and let you know the results. Might be a couple of weeks…

I wonder what the best compression level is when -O2 is passed to lld.

Just to check: if the default is changed to compress at -O2, will we still be able to override and disable compression with --compress-debug-sections=none?

My suggestion was to use compression level 9 when both -O2 and --compress-debug-sections=zlib are specified.

As to (iii), in most cases I believe it is rare to distribute executables with debug info widely
I think it is at least less important than (i).

Agreed.

I think (i) is definitely the case, and that’s also true for a distributed build system in which a lot of object files are copied between machines.
My suggestion was to use compression level 9 when both -O2 and --compress-debug-sections=zlib are specified.

Ok great, I’m less concerned if it still requires an explicit --compress-debug-sections=zlib even with -O2 (I thought you were proposing to add it to -O2).

Still, for informational/advisory purposes, it would be good for us to produce link time vs compression level vs total exe size, ideally with a couple of different storage types (at least PCIe SSD vs spinning disk) and CPUs.

Thanks,

Simon

Hi Rui,

My suggestion was to use compression level 9 when both -O2 and --compress-debug-sections=zlib are specified.

The data clearly shows that beyond level 6 there is a huge increase in compute time (roughly 2.4×, from 1m3s to 2m34s) for nearly no benefit (an incremental ~0.1% size reduction). If we’re going to do anything more than level 1, then I think 6 is the right place to stop.

In my experience, building for release really should not be a different process from the one a team normally uses, so we cannot rely on the argument that “building for release is rare, so it’s okay to make it incredibly expensive.” This reinforces the idea of using compression level 6 instead of level 9.

As to (3), in most cases, I believe it is rare to distribute executables with debug info widely

Agreed; however, debug executables are (or should be!) archived, and compressed debug info looks like it would more than double the number of versions one can archive on the same media. Long-term storage may be cheap, but it’s not free.

--paulr

As to (iii), in most cases I believe it is rare to distribute executables with debug info widely
I think it is at least less important than (i).

Agreed.

I think (i) is definitely the case, and that’s also true for a distributed build system in which a lot of object files are copied between machines.
My suggestion was to use compression level 9 when both -O2 and --compress-debug-sections=zlib are specified.

Ok great, I’m less concerned if it still requires an explicit --compress-debug-sections=zlib even with -O2 (I thought you were proposing to add it to -O2).

Still, for informational/advisory purposes, it would be good for us to produce link time vs compression level vs total exe size, ideally with a couple of different storage types (at least PCIe SSD vs spinning disk) and CPUs.

Debug sections are compressed using zlib, so I think such a benchmark would just be testing the performance of zlib itself under various conditions.

Not really: as well as some sensitivity to the input data, the overall performance of the link with compression will depend on how this is implemented in lld. How is it parallelized? How is it chunked? Is it effectively pipelined with IO?

Put another way, I wouldn’t feel comfortable making a recommendation to our end-users on whether to use this option based only on my existing extensive benchmarking of zlib in isolation. It’s necessary to test in real conditions.

Thanks,

Simon

For the record, GNU linkers that implement --compress-debug-sections=

  ld.bfd (bfd/compress.c) uses Z_DEFAULT_COMPRESSION (6)
  gold (gold/compressed_output.cc) uses 1 at -O0 (the default), 9 at -O1 or higher

-O2 in LLD enables other optimizations (currently just string tail
merging). It may be better to use a separate knob for the compression
level if we ever want to tune this.

lld/ELF/OutputSections.cpp delegates to zlib::compress in
lib/Support/Compression.cpp, which specifies compression level 6 by
default. Note that it has two other users: InstrProf.cpp (best size)
and ASTWriter (default).
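(For illustration, a caller-side sketch of that wrapper. The exact
signature of zlib::compress has changed over the years, so treat the
names below as assumptions and check llvm/Support/Compression.h for the
real interface.)

#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compression.h" // assumed location; API has evolved
#include "llvm/Support/Error.h"
#include <utility>

// Hypothetical call site: compress a section's contents at an explicit
// level instead of relying on the wrapper's default.
static void compressSection(llvm::StringRef Contents,
                            llvm::SmallVectorImpl<char> &Out) {
  if (llvm::Error E = llvm::zlib::compress(Contents, Out, /*Level=*/6))
    llvm::consumeError(std::move(E)); // real code would report the error
}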

zlib does not support parallelized compression, and switching to
another library might not be acceptable.

Another source of parallelism is to compress multiple debug sections
simultaneously (sort the debug sections by size and compress them in
parallel, e.g. .debug_info and .debug_ranges). But this could increase
memory usage dramatically: for each debug section, a compression buffer
of compressBound(sourceLen) bytes has to be allocated first. zlib's
compressBound is a theoretical maximum, larger than sourceLen, so it
can increase memory usage a lot, even though usually only a small
portion of the buffer is actually used.
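(A sketch of what that per-section parallelism could look like,
hypothetical rather than lld's actual code, using raw zlib; note that
each task allocates the full compressBound buffer up front, which is
exactly the memory concern above:)

#include <zlib.h>
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Compress one section into a worst-case-sized buffer, then shrink it.
static std::vector<unsigned char>
compressOne(const std::vector<unsigned char> &data, int level) {
  uLongf destLen = compressBound(data.size()); // theoretical max, > input size
  std::vector<unsigned char> out(destLen);     // this is the memory cost
  int rc = compress2(out.data(), &destLen, data.data(), data.size(), level);
  assert(rc == Z_OK);
  (void)rc;
  out.resize(destLen); // usually far smaller than the bound
  return out;
}

// One async task per section; peak memory is the sum of all the bounds.
static std::vector<std::vector<unsigned char>>
compressAll(const std::vector<std::vector<unsigned char>> &sections,
            int level) {
  std::vector<std::future<std::vector<unsigned char>>> futures;
  for (const auto &sec : sections)
    futures.push_back(
        std::async(std::launch::async, compressOne, std::cref(sec), level));
  std::vector<std::vector<unsigned char>> results;
  for (auto &f : futures)
    results.push_back(f.get());
  return results;
}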

zlib supports a streaming interface. (LLVM currently doesn't use it, but it's not hard.)
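(For concreteness, a sketch of that streaming interface in raw zlib
terms, draining output through a fixed-size chunk buffer so no
worst-case compressBound allocation is needed:)

#include <zlib.h>
#include <cstring>
#include <vector>

// Compress src[0..len) with zlib's streaming API; output is drained
// through a fixed 64 KiB chunk buffer instead of one big allocation.
static std::vector<unsigned char>
deflateStreaming(const unsigned char *src, size_t len, int level) {
  z_stream strm;
  std::memset(&strm, 0, sizeof(strm)); // zalloc/zfree/opaque = Z_NULL
  if (deflateInit(&strm, level) != Z_OK)
    return {};
  strm.next_in = const_cast<unsigned char *>(src);
  strm.avail_in = static_cast<uInt>(len); // >4 GiB inputs would chunk input too
  std::vector<unsigned char> out;
  unsigned char chunk[64 * 1024];
  int rc;
  do {
    strm.next_out = chunk;
    strm.avail_out = sizeof(chunk);
    rc = deflate(&strm, Z_FINISH); // Z_FINISH: all input already provided
    out.insert(out.end(), chunk, chunk + (sizeof(chunk) - strm.avail_out));
  } while (rc == Z_OK);
  deflateEnd(&strm);
  return rc == Z_STREAM_END ? out : std::vector<unsigned char>();
}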

-Eli

Not really: as well as some sensitivity to the input data, the overall performance of the link with compression will depend on how this is implemented in lld. How is it parallelized? How is it chunked? Is it effectively pipelined with IO?

In order to compress a section, lld creates an in-memory image of the section and then passes it to zlib, with no parallelization and no chunking.

Put another way, I wouldn’t feel comfortable making a recommendation to our end-users on whether to use this option based only on my existing extensive benchmarking of zlib in isolation. It’s necessary to test in real conditions.

Are you already using the option?