[RFC] Optimization remarks: LLVM bitstream format and future plans

Hello,

We have been looking into making optimization remarks more scalable.

We looked into a few formats that satisfy the following requirements:
  1. allows streaming to a file: we want to avoid keeping all the remarks in memory
  2. allows string deduplication: most of the strings are repeated [1]
  3. is fast to parse: building clang with remarks results in 24,205,892 remarks
  4. is compact to save on disk: building clang with YAML remarks results in 17.6GB of remarks
  5. supports some kind of key-value pairing: we need to have arbitrary remark “arguments” [2]

We took a look at a few formats:
  * YAML: requirements 3 and 4 (parsing speed and on-disk size) are very far from reasonable with this format.
  * MessagePack [3]: LLVM already has support for it, which is an advantage for this format. It allowed us to make parsing 5.5x faster and remark files more than 2x smaller.
  * clangd’s RIFF-based format [4]: requirements 1 and 5 (streaming and key-value pairing) are not satisfied here.
  * .dia: parsing this format (using libclang) is not fast enough for us.
  * custom format: we managed to make remarks 11x smaller, and parsing fast enough. The main concern with a custom format is the maintenance and versioning of the format.
  * LLVM bitstream:
    1. by emitting a block per remark, we can stream to a file
    2. by using a string table that is stored separately, in the metadata, we can deduplicate strings
    3. llvm-bcanalyzer runs in 20s over all the remark files for clang
    4. total size of remarks for clang is 1.3GB, which is 13.4x smaller than YAML
    5. we can have an arbitrary number of records and describe them using abbreviations to provide a key-value-like pairing
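
To make points 1, 2 and 5 more concrete, here is a minimal sketch of the idea in Python. It is not the bitstream encoding itself (the real format uses blocks, records and abbreviations), and every name in it (`StringTable`, `emit_remark`, the fixed-width layout) is hypothetical. It only shows the shape of the approach: each remark is written out as soon as it is produced, strings are replaced by indices into a shared table, and arguments are stored as key/value index pairs.

```python
import io
import struct


class StringTable:
    """Interns strings so that each distinct string is stored only once."""

    def __init__(self):
        self._ids = {}
        self.strings = []

    def intern(self, s):
        # Return the existing index, or assign the next free one.
        if s not in self._ids:
            self._ids[s] = len(self.strings)
            self.strings.append(s)
        return self._ids[s]


def emit_remark(out, strtab, pass_name, remark_name, function, args):
    """Write one remark right away; only string-table indices reach the stream."""
    fields = [
        strtab.intern(pass_name),
        strtab.intern(remark_name),
        strtab.intern(function),
    ]
    # Arbitrary key/value arguments, each stored as a pair of indices.
    for key, value in args:
        fields += [strtab.intern(key), strtab.intern(value)]
    # Hypothetical layout (NOT the LLVM bitstream encoding):
    # a 32-bit field count followed by 32-bit little-endian indices.
    out.write(struct.pack("<I", len(fields)))
    out.write(struct.pack(f"<{len(fields)}I", *fields))


out = io.BytesIO()  # stands in for the output file
strtab = StringTable()
emit_remark(out, strtab, "inline", "Inlined", "main",
            [("Callee", "foo"), ("Caller", "main")])
emit_remark(out, strtab, "inline", "Inlined", "main",
            [("Callee", "bar"), ("Caller", "main")])
```

In this sketch the second remark adds only one new string ("bar") to the table; everything else is already interned. The table itself would be written once, separately from the remarks.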

We decided to go ahead with LLVM bitstream since it satisfies all our requirements and is well known in the community.

The remark generation part of the format is available for review at D63466, "[Remarks] Add an LLVM-bitstream-based remark serializer".

Another goal is to make it easy to find remarks for a given object file or binary. The way we want to do this on Darwin is to follow the debug info model: add a section to the object file, make the linker ignore it, let dsymutil pack it up and put the final result in the .dSYM bundle.

For that, I plan on making a few more changes:
  * Emit the bitstream metadata in the __LLVM,__remarks/.remarks section
  * Add the parsing logic to lib/Remarks/RemarksParser and make it usable through the C API
  * Add a tool, llvm-remarkutil, to merge the remarks from the object files into a standalone remark file
  * Add support to dsymutil to merge and generate a standalone remark file in the .dSYM bundle
  * Add support to llvm-remarkutil to convert from YAML to bitstream, to extract metadata from sections, and other utilities
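
One consequence of the merging steps is worth sketching: if each object file carries its own string table, a merged file presumably needs a single table, with the string indices in each remark rewritten accordingly. The Python below is only an illustration of that remapping under this assumption. The function name `merge_string_tables` and the data layout are hypothetical and do not describe how llvm-remarkutil or dsymutil will actually do it.

```python
def merge_string_tables(tables):
    """Merge per-object string tables into one.

    Returns (merged, remaps), where remaps[i][old_index] is the index
    in the merged table of string old_index from tables[i].
    """
    merged = []
    index_of = {}
    remaps = []
    for table in tables:
        remap = []
        for s in table:
            # Add the string only the first time it is seen.
            if s not in index_of:
                index_of[s] = len(merged)
                merged.append(s)
            remap.append(index_of[s])
        remaps.append(remap)
    return merged, remaps


# Two hypothetical per-object string tables sharing the string "main".
a = ["inline", "main", "foo"]
b = ["licm", "main", "bar"]
merged, remaps = merge_string_tables([a, b])
```

A remark from the second object that referred to index 1 ("main") would keep index 1 after the merge, while index 0 ("licm") would become index 3.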

Please let me know what you think!

Thanks,