[RFC] Zstandard as a second compression method to LLVM

Hello all,

The LLVM project currently supports zlib as a compression algorithm. Its uses range from compressing ELF debug sections to serializing performance stats and AST data structures.

We would like to add Zstandard (a.k.a. zstd) as an alternative to zlib. It tends to achieve higher compression ratios while also being faster. Using it for internal tooling could yield speed improvements in places where we compress ASTs and similar data, without sacrificing compressed size.

In terms of implementation, here are some initial ideas:

  • Relocate the llvm::zlib namespace, as declared in lib/Support/Compression.cpp, to a namespace like llvm::compression::zlib.

  • Add a llvm::compression::zstd namespace.

  • Define a namespace alias llvm::compression::tooling that aliases either the zstd or the zlib namespace, based on LLVM CMake flags.

This makes it easy to spot code that needs to be updated, and simple to keep tools consistently using the same compression internally: they use llvm::compression::tooling instead of llvm::compression::zlib or llvm::compression::zstd where possible.
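As a rough illustration of the alias idea above, here is a minimal sketch. It is not the POC code: the `sketch` namespace, the `SKETCH_TOOLING_USE_ZSTD` macro, and the `name()` helpers are all hypothetical stand-ins, and a real build would presumably derive the switch from a CMake option.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical switch for this sketch; a real build would set it from CMake.
#ifndef SKETCH_TOOLING_USE_ZSTD
#define SKETCH_TOOLING_USE_ZSTD 1
#endif

namespace sketch {
namespace compression {

namespace zlib {
constexpr std::string_view name() { return "zlib"; }
} // namespace zlib

namespace zstd {
constexpr std::string_view name() { return "zstd"; }
} // namespace zstd

// Internal tools refer to `tooling` and never name a backend directly.
#if SKETCH_TOOLING_USE_ZSTD
namespace tooling = zstd;
#else
namespace tooling = zlib;
#endif

} // namespace compression
} // namespace sketch

// Example caller: compiles unchanged whichever backend the alias selects.
constexpr std::string_view internalFormatName() {
  return sketch::compression::tooling::name();
}

// Sanity check: the alias must resolve to one of the two known backends.
inline void checkToolingBackend() {
  assert(internalFormatName() == "zlib" || internalFormatName() == "zstd");
}
```

Because callers only ever spell `tooling`, any remaining direct use of the `zlib` or `zstd` namespace stands out as code that has made a deliberate format choice.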

I have a working POC that follows this pattern, but I would not be surprised if there are more appealing approaches.

Some important questions:

  • Are there other implementation details/external projects that we would also need to account for?

– Cole Kissane


I know @resistor looked into zstd measurements for DWARF compression, so tagging him here in case he’s got ideas.

At least for DWARF compression, I believe this wouldn’t be a compile-time swappable implementation, as the compression scheme is user-selectable and encoded in the ELF file format (so clients know how to decompress it).

It’s probably worth doing something similar for other compression use cases in LLVM (except in cases that are locked to the exact version of LLVM, like Clang serialized modules), so that compatible compilers can still be used together without problems, or so that incompatibilities can be detected (e.g. if clang encodes with zstd but you link with an lld that hasn’t been built with zstd support).
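One way to make such incompatibilities detectable is to store a small format tag in front of each compressed payload and have the reader check it against what it was built with. The following is only a sketch of that idea: the tag values, the `SKETCH_HAVE_ZSTD` macro, and `checkFormatTag` are hypothetical, and zlib is assumed to be always available.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical build switch: 0 models a tool built without zstd support.
#ifndef SKETCH_HAVE_ZSTD
#define SKETCH_HAVE_ZSTD 0
#endif

// Placeholder tag values for this sketch, stored in front of the payload.
enum class Format : std::uint8_t { Zlib = 1, Zstd = 2 };

// Assumption: zlib is always built in; zstd depends on the build switch.
constexpr bool isSupported(Format F) {
  switch (F) {
  case Format::Zlib:
    return true;
  case Format::Zstd:
    return SKETCH_HAVE_ZSTD != 0;
  }
  return false;
}

// Returns std::nullopt if the payload can be decoded, else a diagnostic.
std::optional<std::string> checkFormatTag(std::uint8_t Tag) {
  if (Tag != static_cast<std::uint8_t>(Format::Zlib) &&
      Tag != static_cast<std::uint8_t>(Format::Zstd))
    return "unknown compression format tag " + std::to_string(Tag);
  if (!isSupported(static_cast<Format>(Tag)))
    return std::string("input is zstd-compressed, but this tool was built "
                       "without zstd support");
  return std::nullopt;
}
```

With a check like this, a reader built without zstd reports a clear error up front instead of failing somewhere inside decompression.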


I reported an issue with poor compression of DWARF by zstd last year: "Significantly worse compression than ZLIB on DWARF debug info" (Issue #2832, facebook/zstd on GitHub).

I’m generally supportive of zstd as an alternative compression scheme to zlib for most things in LLVM.

@resistor I’m glad you’re generally supportive of zstd as an alternative compression scheme to zlib for most things in LLVM.
I would also like to note that the GitHub link in your reply sparked my curiosity. I experimented locally with adding zstd as a DWARF debug compression method and found the following on the named section of clang++ (zstd at level 7, zlib at level 6, the default for llvm-project):

File: ./bin/clang++.debug_str_offsets
  Size: 378000136
File: ./bin/clang++.debug_str_offsets-zlib
  Size: 336024112
File: ./bin/clang++.debug_str_offsets-zstd
  Size: 313555712 (93% size of zlib)

I produced these files using llvm-objcopy:

$ time ./bin/llvm-objcopy --compress-debug-sections=zstd ./bin/clang++.debug_str_offsets ./bin/clang++.debug_str_offsets-zstd
./bin/llvm-objcopy --compress-debug-sections=zstd    1.15s user 0.46s system 99% cpu 1.608 total
$ time ./bin/llvm-objcopy --compress-debug-sections=zlib ./bin/clang++.debug_str_offsets ./bin/clang++.debug_str_offsets-zlib
./bin/llvm-objcopy --compress-debug-sections=zlib    4.49s user 0.64s system 99% cpu 5.140 total

Note that although the zstd level 7 output was only 7% smaller than the zlib output, compression was 3.2x faster than with zlib (1.608 s vs 5.140 s total), so I’d venture to say that zstd could actually be quite applicable to DWARF debug info!
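For anyone who wants to recheck the arithmetic, here is a small sketch that recomputes the ratios from the sizes and wall-clock totals quoted above. The helper and constant names are made up for this example; the numbers are copied from the listing and the `time` output.

```cpp
#include <cassert>
#include <cstdio>

// Plain ratio helper; guards against a zero or negative baseline.
double ratio(double Numerator, double Denominator) {
  assert(Denominator > 0 && "baseline must be positive");
  return Numerator / Denominator;
}

// Sizes in bytes, as reported for clang++.debug_str_offsets.
constexpr double UncompressedBytes = 378000136.0;
constexpr double ZlibBytes = 336024112.0;
constexpr double ZstdBytes = 313555712.0;

// Wall-clock totals in seconds from the two llvm-objcopy runs.
constexpr double ZlibSeconds = 5.140;
constexpr double ZstdSeconds = 1.608;

void printSummary() {
  std::printf("zlib / uncompressed: %.3f\n", ratio(ZlibBytes, UncompressedBytes));
  std::printf("zstd / uncompressed: %.3f\n", ratio(ZstdBytes, UncompressedBytes));
  std::printf("zstd / zlib size:    %.3f\n", ratio(ZstdBytes, ZlibBytes));
  std::printf("zlib / zstd time:    %.2fx\n", ratio(ZlibSeconds, ZstdSeconds));
}
```

The last two lines correspond to the "93% size of zlib" and "3.2x faster" figures. This is a single run on one file, so the timing ratio should be read as indicative rather than as a benchmark.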