Hello,
We have been looking into making optimization remarks more scalable.
We were looking for a format that satisfies the following requirements:
1. allows streaming to a file: we want to avoid keeping all the remarks in memory
2. allows string deduplication: most of the strings are repeated [1]
3. is fast to parse: building clang with remarks results in 24,205,892 remarks
4. is compact on disk: building clang with YAML remarks results in 17.6GB of remarks
5. supports some kind of key-value pairing: we need to have arbitrary remark “arguments” [2]
We took a look at a few formats:
* YAML: requirements 3 and 4 (parsing speed and size on disk) are very far from reasonable with this format.
* MessagePack [3]: having support for this in LLVM is an advantage for this format. It allowed us to make parsing 5.5x faster and remark files more than 2x smaller.
* clangd’s RIFF-based format [4]: requirements 1 and 5 (streaming and key-value pairing) are not satisfied here.
* .dia: parsing this format (using libclang) is not fast enough for us.
* custom format: we managed to make remarks 11x smaller, and parsing fast enough. The main concern with a custom format is the maintenance and versioning of the format.
* LLVM bitstream:
  1. by emitting a block per remark, we can stream to a file
  2. by using a string table that is stored separately, in the metadata, we can deduplicate strings
  3. llvm-bcanalyzer runs in 20s over all the remark files for clang
  4. total size of remarks for clang is 1.3GB, which is 13.4x smaller than the YAML output
  5. we can have an arbitrary number of records and describe them using abbreviations to provide a key-value-like pairing
We decided to go ahead with LLVM bitstream since it satisfies all our requirements and it is well-known by the community.
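To make points 1, 2 and 5 a bit more concrete, here is a minimal Python sketch of the idea. It is not the LLVM bitstream encoding and not the serializer under review: the record layout (a count followed by 32-bit string-table indices) and the names StringTable, write_remark and read_remarks are made up for illustration. It only shows that each remark can be written out as soon as it is produced, that repeated strings are stored once in a separate table, and that arguments can be kept as key-value pairs.

```python
import io
import struct


class StringTable:
    """Maps each distinct string to a small integer index (hypothetical helper)."""

    def __init__(self):
        self._index = {}
        self.strings = []

    def intern(self, s):
        idx = self._index.get(s)
        if idx is None:
            idx = len(self.strings)
            self._index[s] = idx
            self.strings.append(s)
        return idx


def write_remark(out, strtab, pass_name, remark_name, function, args):
    """Write one record to `out` right away, so no remark has to stay in memory."""
    ids = [strtab.intern(pass_name), strtab.intern(remark_name), strtab.intern(function)]
    for key, value in args:
        # Arguments are stored as (key index, value index) pairs.
        ids.append(strtab.intern(key))
        ids.append(strtab.intern(value))
    # Toy layout (an assumption, not bitstream): u32 count, then `count` u32 indices.
    out.write(struct.pack("<I", len(ids)))
    out.write(struct.pack(f"<{len(ids)}I", *ids))


def read_remarks(data, strings):
    """Decode records one at a time, resolving indices through the string table."""
    offset = 0
    while offset < len(data):
        (count,) = struct.unpack_from("<I", data, offset)
        offset += 4
        ids = struct.unpack_from(f"<{count}I", data, offset)
        offset += 4 * count
        names = [strings[i] for i in ids]
        pass_name, remark_name, function = names[:3]
        args = list(zip(names[3::2], names[4::2]))
        yield pass_name, remark_name, function, args


# Two remarks that share most of their strings.
strtab = StringTable()
out = io.BytesIO()
write_remark(out, strtab, "inline", "Inlined", "main", [("Callee", "foo"), ("Caller", "main")])
write_remark(out, strtab, "inline", "Inlined", "main", [("Callee", "bar"), ("Caller", "main")])
for remark in read_remarks(out.getvalue(), strtab.strings):
    print(remark)
```

In this sketch the string table only needs to be written once, after all the records, which mirrors the idea of keeping it in separate metadata rather than next to each remark.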
The remark generation part of the format is available for review at D63466: [Remarks] Add an LLVM-bitstream-based remark serializer.
Another goal is to make it easy to find remarks for a given object file or binary. The way we want to do this on Darwin is to follow the debug info model: add a section to the object file, make the linker ignore it, let dsymutil pack it up and put the final result in the .dSYM bundle.
For that, I plan on making a few more changes:
* Emit the bitstream metadata in the __LLVM,__remarks/.remarks section
* Add the parsing logic to lib/Remarks/RemarksParser and make it usable through the C API
* Add a tool, llvm-remarkutil, to merge the remarks from the object files into a standalone remark file
* Add support to dsymutil to merge and generate a standalone remark file in the .dSYM bundle
* Add support to llvm-remarkutil to convert from YAML to bitstream, to extract metadata from sections, and other utilities
Please let me know what you think!
Thanks,