I’ve been looking at reducing the size of Clang serialized AST files (PCH and PCM), and so have been digging into how Clang uses the bitstream format.
I see a way to reduce the size by ~9%, which is large compared to other changes I’ve managed, but it involves changing the bitstream container format (which is also used for LLVM bitcode etc).
I’d like to know whether such a change would be acceptable, or what alternatives people can suggest.
- Clang serializes a lot of data in unabbreviated records, which is unfortunate but hard to change at scale.
- This means values are encoded as VBR6: some number of 6-bit chunks sufficient to fit the value, each carrying 5 payload bits plus a continuation bit. (This is hard-coded into the bitstream format.)
- A lot of these values are zeros (for example, invalid SourceLocations). VBR6 encodes zero as 6 zero bits.
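
To make the cost of a zero concrete, here is a minimal Python sketch of VBR6 as described above. The function names `vbr_chunks` and `vbr_bits` are made up for illustration and are not LLVM APIs; the chunk layout (5 payload bits plus a high continuation bit, least-significant chunk first) is my reading of the bitstream format.

```python
def vbr_chunks(value: int, width: int = 6) -> list[int]:
    """Split a non-negative value into VBR chunks of `width` bits each.

    Each chunk holds width-1 payload bits; the high bit is set when
    another chunk follows. Chunks are emitted least-significant first.
    """
    payload_bits = width - 1
    continuation = 1 << payload_bits   # high bit of a chunk
    mask = continuation - 1            # low payload bits
    chunks = []
    while value >= continuation:
        chunks.append((value & mask) | continuation)
        value >>= payload_bits
    chunks.append(value)               # final chunk, continuation bit clear
    return chunks


def vbr_bits(value: int, width: int = 6) -> int:
    """Number of bits the VBR encoding of `value` occupies."""
    return width * len(vbr_chunks(value, width))


print(vbr_chunks(0), vbr_bits(0))    # [0] 6   -> zero still costs a full chunk
print(vbr_chunks(32), vbr_bits(32))  # [32, 1] 12
```

So every zero in an unabbreviated record pays for a full 6-bit chunk, which is what the proposal targets.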
- define a new encoding: “nullable VBR”/VBRZ. This is just VBR with an “is nonzero” bit at the front. Zero encodes as a single bit (0).
- change the encoding of unabbreviated record values from VBR6 to VBRZ6. (This is the part that breaks the bitstream container format).
- this is profitable for filesize if >1/6 of the affected values are zero.
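
The 1/6 break-even point can be checked with a small sketch. This assumes a nonzero VBRZ6 value is encoded as a 1 bit followed by the plain VBR6 encoding of the value, which is how I read "VBR with an is-nonzero bit at the front"; the helper names are hypothetical and only bit counts are modelled, not the actual bit layout.

```python
def vbr_bits(value: int, width: int = 6) -> int:
    """Bits used by plain VBR: one `width`-bit chunk per width-1 payload bits."""
    payload_bits = width - 1
    chunks = 1
    while value >= (1 << payload_bits):
        value >>= payload_bits
        chunks += 1
    return chunks * width


def vbrz_bits(value: int, width: int = 6) -> int:
    """Bits used by the assumed VBRZ: 1 flag bit, then VBR only if nonzero."""
    return 1 if value == 0 else 1 + vbr_bits(value, width)


def total_bits(values, size_fn) -> int:
    """Total encoded size of a sequence of record values."""
    return sum(size_fn(v) for v in values)


# Exactly 1/6 zeros: each zero saves 5 bits, each nonzero costs 1 extra bit.
break_even = [0, 1, 1, 1, 1, 1]
print(total_bits(break_even, vbr_bits), total_bits(break_even, vbrz_bits))  # 36 36

# 2/6 zeros: VBRZ6 comes out ahead.
more_zeros = [0, 0, 1, 1, 1, 1]
print(total_bits(more_zeros, vbr_bits), total_bits(more_zeros, vbrz_bits))  # 36 30
```

With a fraction p of zeros, the saving is 5p bits per value and the cost is (1 - p) bits per value, so the change pays off when 5p > 1 - p, i.e. p > 1/6.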
- Significant win for clang ASTs: -9% (clangd preamble for clang-tools-extra/clangd/AST.cpp 40.1 => 36.5 MB)
- neutral-ish for LLVM IR: -0.3% (clang -emit-llvm for clang-tools-extra/clangd/ParsedAST.cpp 93.6 => 93.3 KB)
LLVM bitcode and clang ASTs aren’t stable, so it seems OK to break them, but:
- if there are out-of-tree tools, breaking bitstream may be more disruptive than a “normal” format break
- serialized diagnostics files (*.dia) use bitstream, and the test clang/test/Misc/serialized-diags-stable.c expects their format to be stable. I don’t know how critical or reasonable this is.
- maybe we could make this opt-in and turn it on for clang ASTs only?
What do you think?
 My immediate motivation is that we deploy clangd in a datacenter and keep PCHes in RAM. But I would think smaller PCMs are pretty valuable as in C++20 they become common compiler inputs/outputs.
 Abbreviating these records is hard to do at scale, partly because record contents vary in ways that abbreviations don’t support well, and partly because there are simply so many types of nodes that maintaining the abbreviations adds significant complexity.
 The LLVM IR figure is measured after D126029 ([Bitcode] Add abbreviation for STRUCT_NAME when the name is not char6). Without that patch this change is a ~3% regression.