Changing bitstream format for smaller zeros

I’ve been looking at reducing the size of Clang serialized AST files (PCH and PCM)[1], and so digging into how it uses the bitstream format.
I see a way to reduce the size by ~9%, which is large compared to other changes I’ve managed, but it involves changing the bitstream container format (also for LLVM bitcode etc).
I’d like to know whether such a change would be acceptable, or what alternatives people can suggest.

Observations:

  • Clang serializes a lot of data in unabbreviated records, which is unfortunate but hard to change at scale[2].
  • This means values are encoded as VBR6 - some number of 6-bit chunks sufficient to fit the value. (This is hard-coded into the Bitstream format).
  • A lot of these values are zeros (for example, invalid SourceLocations). VBR6 encodes zero as 6 zero bits.

Proposal:

  • define a new encoding: “nullable VBR”/VBRZ. This is just VBR with an “is nonzero” bit at the front. Zero encodes as a single bit (0).
  • change the encoding of unabbreviated record values from VBR6 to VBRZ6. (This is the part that breaks the bitstream container format).
  • this is profitable for filesize if >1/6 of the affected values are zero.

Measurements:

  • Significant win for clang ASTs: -9% (clangd preamble for clang-tools-extra/clangd/AST.cpp 40.1 => 36.5 MB)
  • neutral-ish for LLVM IR[3]: -0.3% (clang -emit-llvm for clang-tools-extra/clangd/ParsedAST.cpp 93.6 => 93.3 KB)

Stability concerns:

  • LLVM bitcode and clang ASTs aren’t stable, so it seems OK to break them
  • if there are out-of-tree tools, breaking bitstream may be more disruptive than a “normal” format break
  • serialized diagnostics files (*.dia) use bitstream, and the test clang/test/Misc/serialized-diags-stable.c expects their format to be stable. I don’t know how critical or reasonable this is.
  • maybe we could make this opt-in and turn it on for clang ASTs only?

What do you think?

Prototype: ⚙ D126038 [Bitstream] Define VBRZ encoding, and use VBRZ6 for unabbreviated values

[1] My immediate motivation is that we deploy clangd in a datacenter and keep PCHes in RAM. But I would think smaller PCMs are pretty valuable as in C++20 they become common compiler inputs/outputs.
[2] Partly because record contents vary in ways that abbreviations don’t support well, partly because there are simply so many types of nodes that maintaining the abbreviations is significant complexity.
[3] After ⚙ D126029 [Bitcode] Add abbreviation for STRUCT_NAME when the name is not char6. Without that patch this change is a ~3% regression.

Bitcode has a guarantee that newer compilers can still read older bitcode files. The format has to be compatible to that extent. I don’t know about the other cases you listed.

Oops, you’re right of course. (Not sure why I was reading stackoverflow instead of the source of truth).

I suppose a fairly clean way to handle this would be to allow the default value encoding (for unabbreviated records) to be specified by a record in the BLOCKINFO, and to specify this as VBRZ6 for newly generated clang AST files.

Files that made use of this feature would not be readable by old binaries, but old files would still be readable (we’d default to VBR6).
And there’s probably no reason to set this at all for IR files, just clang ASTs.

Another option would be to use compression in the PCH/PCM files, even something very simple like LZ4. Bitcode tends to compress poorly because the data isn’t byte-aligned, but it’s possible to generate byte-aligned bitcode in a backwards-compatible way (implemented here).

Compression is tricky because PCH reads are random-access. It’s probably possible to come up with a reasonable compression scheme which can support that, but that seems like a large project.

Isn’t zstd really well optimised for reading? I know many tools around have switched to that.

For random access? A large fraction of PCH reads are of the form “read value X, jump to bit Offset+X, start reading records”.

Are you tied to the bitstream format for PCMs?

Well, I don’t have any plans to rewrite all the serialization code, or any particular insight into how to design that. If someone were to propose replacing it, I’d certainly be curious.