[RFC] Compressing AST files by default?

Hi,

It seems that compressing AST files with a simple "gzip --fast" makes them 30-40% smaller.
So the questions are:
  1. Is the current AST serialization format really uncompressed (only abbreviations in the bitstream format)?
  2. Is it worthwhile to compress ASTs by default (with -emit-ast)?
  3. Will this break things like PCH?
  4. What's the current trade-off between PCH compile time and disk usage? If AST compression makes compilation a bit slower but reduces disk usage significantly, will that be acceptable for users?

LLVM already has support for compression (the compress/uncompress functions in include/llvm/Support/Compression.h).
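For reference, a minimal sketch of a round trip through that API, assuming the zlib wrapper roughly as it existed around LLVM 3.9 (the exact signatures have changed across LLVM versions):

    // Hedged sketch of llvm/Support/Compression.h usage, circa LLVM 3.9;
    // signatures differ in other LLVM versions.
    #include "llvm/ADT/SmallVector.h"
    #include "llvm/ADT/StringRef.h"
    #include "llvm/Support/Compression.h"

    using namespace llvm;

    static bool roundTrip(StringRef Input) {
      if (!zlib::isAvailable())   // LLVM can be built without zlib.
        return false;

      SmallVector<char, 0> Compressed;
      if (zlib::compress(Input, Compressed, zlib::BestSpeedCompression) !=
          zlib::StatusOK)         // BestSpeedCompression ~ "gzip --fast"
        return false;

      // Decompression needs the uncompressed size up front, and inflates the
      // whole buffer at once -- there is no random access into the stream.
      SmallVector<char, 0> Uncompressed;
      if (zlib::uncompress(StringRef(Compressed.data(), Compressed.size()),
                           Uncompressed, Input.size()) != zlib::StatusOK)
        return false;

      return StringRef(Uncompressed.data(), Uncompressed.size()) == Input;
    }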

Best regards,
Ilya Palachev

Is there a real need to reduce this disk usage? If the main use of AST files is C++ modules / PCH, what is a typical size for a module cache directory?
(Compression is expensive.)

For compression techniques like 'gzip --fast' or zlib, compression is almost free.


Hi Gabor!

Yes, your guess is right. This work is XTU-related.
We didn't see any performance improvements, however. But Android's ASTs shrank from 70 GB to 38 GB, which I think is a nice improvement.

On 21.10.2016 at 22:13, Gábor Horváth via cfe-dev wrote:

The current AST format is designed for lazy, partial loading from disk; we
make heavy use of file offsets to pull in only the small portions of AST
files that are actually used. In a compilation using hundreds or thousands
of AST files, it's essential that we don't load any more than we need to
(just the file headers) since we should need essentially nothing from
almost all loaded files.

Any approach that requires the entire file to be decompressed seems like a
non-starter. I would expect you could get something like the 30-40%
improvements you're seeing under gzip by making better use of abbreviations
and using smarter representations generally. There is some easy low-hanging
fruit here.
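To make the constraint concrete, here is a hypothetical sketch of offset-based lazy reading; the names are invented for illustration and are not Clang's actual reader API:

    // Hypothetical illustration only -- not Clang's actual reader.
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Read one record's bytes, given its offset and size from a file index.
    std::vector<char> readRecordAt(const std::string &Path, std::uint64_t Offset,
                                   std::uint64_t Size) {
      std::ifstream In(Path, std::ios::binary);
      In.seekg(static_cast<std::streamoff>(Offset));  // jump straight to the record
      std::vector<char> Bytes(Size);
      In.read(Bytes.data(), static_cast<std::streamsize>(Size));  // read only it
      return Bytes;
      // With whole-file compression this seek is impossible: everything up
      // to Offset would have to be decompressed first.
    }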

Hi!

Are you speculating or do you have numbers on the actual AST writer/reader?

Also a good starting point would be to consider not storing the AST as “blob” in the bitcode but using proper abbrev.

Sure, adding a compression layer on top for this particular application seems interesting, but you don't need it on by default to support your use case.
Having it always on would require, as a starting point, looking closely at the impact on memory and time when including modules, for example.

Hi!

> Hi,
>
> It seems that compressing AST files with a simple "gzip --fast" makes them 30-40% smaller.
> So the questions are:
> 1. Is the current AST serialization format really uncompressed (only abbreviations in the bitstream format)?
> 2. Is it worthwhile to compress ASTs by default (with -emit-ast)?
> 3. Will this break things like PCH?
> 4. What's the current trade-off between PCH compile time and disk usage? If AST compression makes compilation a bit slower but reduces disk usage significantly, will that be acceptable for users?

> Is there a real need to reduce this disk usage? If the main use of AST files
> is C++ modules / PCH, what is a typical size for a module cache directory?
> (Compression is expensive.)

In some cases compression can actually improve performance: when the
bottleneck is I/O, reading less data from disk combined with fast
decompression can be a net win for the overall compile time.
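To illustrate with invented numbers: if the disk delivers 100 MB/s and decompression runs at 400 MB/s, a 100 MB AST file takes about 1.0 s to read uncompressed, while a 40% smaller compressed copy takes about 0.6 s to read plus 0.25 s to inflate, roughly 0.85 s total. The advantage disappears once the file is already in the OS page cache, though.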

> Are you speculating or do you have numbers on the actual AST writer/reader?
>
> Also a good starting point would be to consider not storing the AST as
> “blob” in the bitcode but using proper abbrev.

I do not have numbers for this use case, but as far as I remember there were
benchmarks showing that executable packers like UPX can reduce the startup
time of some applications.

> I agree; I did see some low-hanging fruit in the serialized AST format. I
> did one measurement to see which parts of the ASTs contribute the most to
> the AST dumps' size. For the details, see the json attached to this mail:
> http://clang-developers.42468.n3.nabble.com/Two-pass-analysis-framework-AST-merging-approach-tp4051301p4052577.html

Cool! Is the tool you used to produce this available somewhere? (Are you
combining the results of llvm-bcanalyzer or inspecting the bitcode files
directly yourself?)

Some obvious targets for adding abbrevs (>1GB and completely unabbreviated):

                "DECL_CXX_METHOD": {
                    "count": 54583363,
                    "bits": "4.49 GB",
                    "abv": 0.0
                },

                "DECL_CXX_CONSTRUCTOR": {
                    "count": 17594183,
                    "bits": "1.47 GB",
                    "abv": 0.0
                },

                "DECL_CXX_RECORD": {
                    "count": 24180665,
                    "bits": "1.1 GB",
                    "abv": 0.0
                },

                "DECL_CLASS_TEMPLATE_SPECIALIZATION": {
                    "count": 17971702,
                    "bits": "1.77 GB",
                    "abv": 0.0
                },

A couple of other things I've been planning to improve AST file size (but
not got around to yet):

* We should allow each Decl kind to specify a list of abbreviations (from
most specific to most general) and use the first one that fits the data. We
should always use *some* abbreviation for every Decl, even if we only get
to abbreviate the base Decl fields and use an array of VBR6 for the rest.

* We store SourceLocations as absolute offsets right now, wasting a lot of
bits on redundant information. Instead, we should store SourceLocations as
a delta from the root location of the record we're reading (the location of
the Decl or the start of the Stmt tree).
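For reference, a rough sketch of what registering such an abbreviation looks like with the bitstream writer, using the 2016-era API (EmitAbbrev's ownership model changed in later LLVM versions); the record layout below is invented for illustration:

    #include "llvm/Bitcode/BitCodes.h"
    #include "llvm/Bitcode/BitstreamWriter.h"

    using namespace llvm;

    // Register an abbreviation for a hypothetical Decl record kind.
    unsigned addDeclAbbrev(BitstreamWriter &Stream, unsigned RecordCode) {
      BitCodeAbbrev *Abv = new BitCodeAbbrev();
      Abv->Add(BitCodeAbbrevOp(RecordCode));                 // literal record code
      Abv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::VBR, 6));    // e.g. a SourceLocation
                                                             // stored as a delta
      Abv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Fixed, 1));  // e.g. one flag bit
      Abv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Array));     // all remaining fields...
      Abv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::VBR, 6));    // ...as VBR6, the
                                                             // fallback suggested above
      return Stream.EmitAbbrev(Abv);  // the stream takes ownership (2016 API)
    }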

> Cool! Is the tool you used to produce this available somewhere? (Are you
> combining the results of llvm-bcanalyzer or inspecting the bitcode files
> directly yourself?)

I modified llvm-bcanalyzer to output the info in JSON format and used a
Python script to summarize the results across projects. (Before that, I ran
-emit-ast on every TU in the LLVM and Clang source trees.)
I attached the Python script I used to aggregate the output of the modified
bcanalyzer.

summarizeAsts.py (2.75 KB)

In OpenCL we have a 20K-line header file that the compiler includes automatically, so we make heavy use of PCH to avoid long parse times. Unfortunately, the size is indeed quite large (a couple of MB), especially for embedded implementations, for which OpenCL is also a popular use case.

We have done some investigation, and our main problem is that FunctionDecls are not fully abbreviated in the serialized format. The header consists mainly of function declarations, so adding at least generic abbreviations for those would help us.

Another thing we found we could take advantage of was removing source-location information, which should be acceptable for some internal compiler use cases. This could go under a flag, say 'clang -pch-no-source-info'. Compression was also one of the things that helped, but the growth in compilation time was too high with the algorithms we tried.

Overall, it would be really nice to see the PCH size issue addressed by the community!

Cheers,

Anastasia