[RFC] Deprecating Compact Binary Sample Profile Format

Proposal

We are looking to deprecate the Compact Binary format for sample profiles used by FDO builds.
Related discussion: D76255 [SampleFDO] Port MD5 name table support to extbinary format.
Implementation here: D149400 [llvm-profdata] Deprecate Compact Binary Sample Profile Format

Motivation

Currently there are four sample profile formats: Text, Binary, Compact Binary, and Extensible Binary. The implementation of the profile reader classes is poorly maintained, with a lot of code repetition and intertwined function calls. Extensible Binary is currently the most commonly used format because of its forward compatibility, and it can represent any profile stored in the other three formats (but not necessarily the other way around). Compared to Extensible Binary, Compact Binary has several major disadvantages that cannot be fixed:

  1. The lower 64 bits of the MD5 hash of each function name are stored as variable-length integers (ULEB128), which take from 1 to 10 bytes each. On average it takes about 9.5 bytes to store a random uint64, which is worse than storing it unencoded (8 bytes), and the reader has to decode every value before it can read the function offset table.
  2. Unable to store function metadata.
  3. Not forward compatible. Unlike Extensible Binary format, there is no easy way to add a new (or customized) section with additional profiling information.
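
The size figure in item 1 can be checked with a short calculation. The following is a minimal Python sketch, not code from LLVM: the helper names `uleb128_len` and `expected_uleb128_len` are made up for illustration, and it assumes the truncated MD5 values behave like uniformly random 64-bit integers.

```python
def uleb128_len(value: int) -> int:
    """Number of bytes ULEB128 needs for a non-negative integer (7 payload bits per byte)."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def expected_uleb128_len(bits: int = 64) -> float:
    """Exact expected ULEB128 length of a uniformly random `bits`-bit value.

    Hypothetical helper: counts how many values need exactly k bytes,
    then takes the weighted average over all 2**bits values.
    """
    total = 0
    max_len = (bits + 6) // 7  # 10 bytes for 64-bit values
    for k in range(1, max_len + 1):
        lo = 0 if k == 1 else 1 << (7 * (k - 1))  # smallest value needing k bytes
        hi = min(1 << (7 * k), 1 << bits)         # one past the largest such value
        total += k * (hi - lo)
    return total / (1 << bits)


print(uleb128_len(2**64 - 1))            # 10
print(f"{expected_uleb128_len():.3f}")   # 9.496
```

Half of all 64-bit values have the top bit set and need the full 10 bytes, and almost all of the rest need 9, which is why the average ends up well above the 8 bytes of a fixed-width encoding.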

Furthermore, I am planning a series of refactorings to significantly speed up profile loading for industrial use. The refactoring affects the implementation of all profile reader classes, so having one fewer format to support will significantly reduce the maintenance workload.

Migrating from Compact Binary to Extensible Binary

[to be added]

Please comment if you are aware of any LLVM user still using Compact Binary sample profiles on non-trivial projects.

The Binary format (which is the default output format of llvm-profdata) is also inferior to the Extensible Binary format for similar reasons, so I am looking for comments on whether it should be deprecated as well. Since it is the default option, I expect many users are using the Binary format. I suggest that llvm-profdata change its default output format to Extensible Binary, and that the Binary format be removed after a sufficient period of time if nobody uses it anymore.

A minor note for any future such patches: I’d recommend using terminology other than “deprecated”, which can commonly mean support remains but a warning is emitted or something similar. It would be clearer to say that support was removed.
