[RFC] Faster Sample Profile Loading

Background

We’ve noticed that the sample profile loader in the compiler spends a lot of time loading the profile. In one compilation, the compiler spends about 11% of its time in:

llvm::sampleprof::SampleProfileReaderExtBinaryBase::readImpl

This is because the sample profile loader eagerly loads two sections, SecFuncOffsetTable and SecNameTable, in their entirety even though the compiler references only a tiny fraction of their contents.

Design

We will lazily load the two sections, SecFuncOffsetTable and SecNameTable. That is, we load the metadata first and then load the actual data only upon request.

Combining SecFuncOffsetTable and SecLBRProfile

We propose to combine SecFuncOffsetTable and SecLBRProfile into a single, combined section: SecLBRProfileHashTable.

Current Format

  • SecFuncOffsetTable (serialized map):
    • Key: An index, expressed in ULEB128, into the NameTable (an array of MD5 hashes)
    • Value: A byte offset, expressed in ULEB128, into the serialized FunctionSamples records in SecLBRProfile
    • Note: This is the map that we read into DenseMap at startup.
  • SecLBRProfile (serialized variable-length records):
    • Contains the actual FunctionSamples data. Because they are variable-sized, the offsets from SecFuncOffsetTable are required for random access (on-demand loading).
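To make the startup cost concrete, here is a minimal sketch of the eager load described above. It is not the actual reader code: readULEB128 and loadOffsetTable are hypothetical helpers, std::unordered_map stands in for DenseMap, and passing the entry count as a parameter is an assumption about how the section is framed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Decodes one ULEB128 value starting at Pos and advances Pos past it.
// Returns false on truncated or over-long input.
bool readULEB128(const std::vector<uint8_t> &Buf, size_t &Pos, uint64_t &Out) {
  assert(Pos <= Buf.size() && "cursor past end of buffer");
  Out = 0;
  for (unsigned Shift = 0; Shift < 64; Shift += 7) {
    if (Pos >= Buf.size())
      return false;
    uint8_t Byte = Buf[Pos++];
    Out |= static_cast<uint64_t>(Byte & 0x7f) << Shift;
    if ((Byte & 0x80) == 0)
      return true;
  }
  return false;
}

// Eagerly parses NumEntries (name index, byte offset) pairs into a map.
// Every entry is decoded up front, even if only a few are ever queried.
bool loadOffsetTable(const std::vector<uint8_t> &Buf, size_t NumEntries,
                     std::unordered_map<uint64_t, uint64_t> &Table) {
  size_t Pos = 0;
  for (size_t I = 0; I < NumEntries; ++I) {
    uint64_t NameIdx = 0, Offset = 0;
    if (!readULEB128(Buf, Pos, NameIdx) || !readULEB128(Buf, Pos, Offset))
      return false;
    Table[NameIdx] = Offset;
  }
  return true;
}
```

Because both the key and the value are variable-length, the table cannot be indexed without decoding it sequentially, which is why the whole section ends up in memory at startup.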

New Format

The proposed SecLBRProfileHashTable replaces the two sections above with a single section, implemented with OnDiskChainedHashTable.

  • Key: 32-bit index into the NameTable
  • Value: FunctionSamples

The sample profile loader continues to recognize the old format and read from it if present.

Expected Performance

  • Compile time: We expect to save 9% of time spent on AutoFDO ThinLTO pre-link and ThinLTO backend compilations when using very large, multi-target profiles (e.g., a 2 GB profile containing 10M+ name entries).
  • Profile file size: We expect to increase the AutoFDO profile file size by about 15 bytes per profiled function. The increase primarily comes from the overhead of the on-disk hash table, such as the index structure and hash values.
  • Heap size: We expect to save about 32 bytes of heap memory per entry by not constructing a DenseMap from SecFuncOffsetTable (16 bytes per entry divided by an average load factor of 0.5).

Lazy-Load SecNameTable

We propose to lazily load SecNameTable without changing the format.

Current Method

The SecNameTable section consists of an array of 8-byte MD5 hash values of symbol names, without any padding in between. In the fixed MD5 mode, SampleProfileReaderExtBinaryBase::readNameTableSec eagerly loads the entire section and constructs a 16-byte structure FunctionId for each MD5 hash value.

New Method

The reader will parse and retain only the section’s metadata, such as the file offset of this array. FunctionId instances are then materialized on demand.
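As a rough illustration of on-demand materialization, the sketch below keeps only a base pointer and an entry count, and decodes an 8-byte hash when asked. LazyNameTable and getMD5 are hypothetical names, and little-endian storage of the hash values is an assumption on my part; the real reader would return a FunctionId rather than a raw integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view over an in-memory (e.g. mmap'ed) SecNameTable. Only the
// start pointer and the entry count are retained; no per-entry objects exist.
class LazyNameTable {
  const uint8_t *Base;
  size_t NumEntries;

public:
  LazyNameTable(const uint8_t *Base, size_t NumEntries)
      : Base(Base), NumEntries(NumEntries) {}

  size_t size() const { return NumEntries; }

  // Decodes the 8-byte MD5 hash for entry Index on demand.
  // Assumes little-endian storage; assembling the value byte by byte avoids
  // unaligned 64-bit loads, since the array has no padding guarantees.
  uint64_t getMD5(size_t Index) const {
    assert(Index < NumEntries && "name table index out of range");
    const uint8_t *P = Base + Index * 8;
    uint64_t V = 0;
    for (int I = 7; I >= 0; --I)
      V = (V << 8) | P[I];
    return V;
  }
};
```

Since entries are fixed-size and unpadded, entry i lives at byte offset 8 * i, so no index structure is needed for random access.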

Expected Performance

  • Compile Time: We expect to save 1% of time spent on AutoFDO ThinLTO pre-link and ThinLTO backend compilations.
  • Profile file size: No change because the format stays the same.
  • Heap size: We expect to save 16 bytes per name table entry (≈ 16 MB per million entries).

@mingmingl-llvm @hokein @snehasish @davidxl

What is the compile time improvement for an AutoFDO compile with a non-shared profile?

The first part requires a format change. Can you describe your planned series of changes and the format migration plan (assuming the old format will be deprecated)?

I’m estimating the savings by using the parsing time as the proxy for the savings:

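| Section            | Non-shared profiles | Shared profiles |
|--------------------|---------------------|-----------------|
| SecFuncOffsetTable | 1.64%               | 8.88%           |
| SecLBRProfile      | 0.16%               | 0.80%           |
| SecNameTable       | 0.56%               | 2.78%           |
| Total Savings      | 2.37%               | 12.46%          |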

I’m thinking of something like this:

  1. Prepare OnDiskChainedHashTable PR200992
  2. Land the reader and writer with the writer being off by default.
  3. Do more internal testing.
  4. Turn on the writer by default.

I don’t think we can deprecate the old format quickly because the new format covers only one specific scenario:

!Remapper && !ProfileIsCS && useMD5()

As I’ve highlighted above, this project has two components – speeding up SecNameTable and SecFuncOffsetTable. The former is being addressed as:

I’d like to revise the proposal for the latter – SecFuncOffsetTable – as I’ve learned more details.

[RFC] Faster Sample Profile Loading (Update)

We continue to use the on-disk hash table for function offsets, but we won’t integrate SecLBRProfile.

Design & Compatibility

  • Format Version v104: To support the on-disk hash table without breaking backward compatibility for older compilers, we are introducing a new profile format version, v104.
  • Separation of Index and Data: We do not integrate the SecLBRProfile (the actual profile data) into the hash table. The hash table acts strictly as a lightweight index, mapping a 64-bit function name GUID to a fixed 32-bit byte offset pointing into the SecLBRProfile section. This avoids the complexity of dealing with the variable-length serialization of FunctionSamples inside the hash table payload.
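To illustrate the index-only idea, here is a deliberately simplified chained table serialized into a byte buffer. This is not the layout of OnDiskChainedHashTable and not the v104 format; the layout, buildIndex, and lookup are all made up for this sketch. It only shows that a fixed-size (64-bit GUID, 32-bit offset) payload lets a lookup touch one bucket instead of parsing the whole table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Appends the low Bytes bytes of V in little-endian order.
void writeLE(std::vector<uint8_t> &Out, uint64_t V, unsigned Bytes) {
  for (unsigned I = 0; I < Bytes; ++I)
    Out.push_back(static_cast<uint8_t>(V >> (8 * I)));
}

// Reads a little-endian integer of the given width.
uint64_t readLE(const uint8_t *P, unsigned Bytes) {
  uint64_t V = 0;
  for (unsigned I = 0; I < Bytes; ++I)
    V |= static_cast<uint64_t>(P[I]) << (8 * I);
  return V;
}

// Hypothetical layout:
//   u32 NumBuckets | u32 BucketStart[NumBuckets] | buckets...
// where each bucket is: u32 Count | Count x (u64 GUID, u32 Offset).
std::vector<uint8_t>
buildIndex(const std::vector<std::pair<uint64_t, uint32_t>> &Entries,
           uint32_t NumBuckets) {
  assert(NumBuckets > 0 && "need at least one bucket");
  std::vector<std::vector<std::pair<uint64_t, uint32_t>>> Buckets(NumBuckets);
  for (const auto &E : Entries)
    Buckets[E.first % NumBuckets].push_back(E); // GUID is already a hash

  std::vector<uint8_t> Out;
  writeLE(Out, NumBuckets, 4);
  size_t Next = 4 + 4 * static_cast<size_t>(NumBuckets);
  for (const auto &B : Buckets) { // bucket directory
    writeLE(Out, Next, 4);
    Next += 4 + 12 * B.size();
  }
  for (const auto &B : Buckets) { // bucket payloads
    writeLE(Out, B.size(), 4);
    for (const auto &E : B) {
      writeLE(Out, E.first, 8);
      writeLE(Out, E.second, 4);
    }
  }
  return Out;
}

// Finds the SecLBRProfile byte offset for GUID by scanning a single bucket.
std::optional<uint32_t> lookup(const std::vector<uint8_t> &Index,
                               uint64_t GUID) {
  const uint8_t *D = Index.data();
  uint32_t NumBuckets = static_cast<uint32_t>(readLE(D, 4));
  const uint8_t *B = D + readLE(D + 4 + 4 * (GUID % NumBuckets), 4);
  uint32_t Count = static_cast<uint32_t>(readLE(B, 4));
  B += 4;
  for (uint32_t I = 0; I < Count; ++I, B += 12)
    if (readLE(B, 8) == GUID)
      return static_cast<uint32_t>(readLE(B + 8, 4));
  return std::nullopt;
}
```

With a fixed 12-byte record, nothing in the index depends on how FunctionSamples is serialized, which is the point of keeping SecLBRProfile out of the table.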

Storage & Memory Impact

  • Storage Cost: The index structure introduces a minor storage overhead. For a 1.2GB profile (containing ~600k total symbols, with ~96k written to the flat offset table), the total file size increases by roughly 10MB (~0.8%).
  • Heap Savings: In the older format, the reader must parse the entire offset table at startup to construct a DenseMap in memory. For a profile with ~600k symbols, this map alone consumes 16MB of heap. For a flat profile of this size, v104 completely eliminates this 16MB allocation. For split CS profiles, we save the heap for the flat section index (~2MB for the ~96k flat symbols).
  • Startup Speedup: Based on our measurements, parsing the offset table takes up to 8.88% of the total compilation time for shared profiles (XFDO) and 1.64% for non-shared profiles. By querying the on-disk hash table directly, we expect to recover almost all of this overhead during compiler startup.
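The 16MB and ~2MB heap figures can be reproduced with a back-of-the-envelope estimate. The sketch below assumes a hash map with power-of-two bucket counts that grows once it is three-quarters full, plus 16-byte buckets (8-byte key and 8-byte value). Those are my assumptions about DenseMap's growth policy, and estimateBuckets and estimateHeapBytes are hypothetical helpers, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power-of-two bucket count that keeps the table under 3/4 full.
// The 3/4 threshold and the 64-bucket starting size are assumptions.
uint64_t estimateBuckets(uint64_t NumEntries) {
  uint64_t Buckets = 64;
  while (NumEntries * 4 >= Buckets * 3)
    Buckets *= 2;
  return Buckets;
}

// Heap consumed by the bucket array alone, ignoring allocator overhead.
uint64_t estimateHeapBytes(uint64_t NumEntries, uint64_t BytesPerBucket = 16) {
  assert(BytesPerBucket > 0 && "bucket size must be non-zero");
  return estimateBuckets(NumEntries) * BytesPerBucket;
}
```

Under these assumptions, 600,000 entries need 1,048,576 buckets (16 MiB) and 96,000 entries need 131,072 buckets (2 MiB), which lines up with the numbers quoted above.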

Version Handling

  • Default to v103: The writer will continue to default to v103 to ensure no impact on existing workflows unless explicitly requested.
  • Opt-in to v104: Generation of the new format is opt-in via a new hidden command-line flag (-sample-profile-format-version=104).
  • Transparent Reader: The reader supports both v103 and v104 transparently.
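The version policy above could be sketched roughly as follows. OffsetIndexKind, selectWriterVersion, and selectIndexKind are hypothetical names for illustration; the actual reader and writer may structure this quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Which function-offset index the reader should expect in the file.
enum class OffsetIndexKind { FlatOffsetTable, OnDiskHashTable };

// Writer side: stay on v103 unless a version is explicitly requested
// (for example through the hidden flag).
uint32_t selectWriterVersion(std::optional<uint32_t> Requested) {
  return Requested.value_or(103);
}

// Reader side: accept both versions and pick the matching index reader.
std::optional<OffsetIndexKind> selectIndexKind(uint32_t FileVersion) {
  switch (FileVersion) {
  case 103:
    return OffsetIndexKind::FlatOffsetTable;
  case 104:
    return OffsetIndexKind::OnDiskHashTable;
  default:
    return std::nullopt; // unknown version: report an error upstream
  }
}
```

Keeping the decision in one place on each side means flipping the default later is a one-line change to the writer, with no reader changes.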

Compression Restriction (for Mmap Efficiency)

To preserve zero-copy mmap loading, we disable compression on the hash table section even if global compression is enabled, and the reader rejects compressed hash tables as malformed. Compressing this section would defeat zero-copy mmap and would gain little, since MD5 GUIDs are essentially incompressible. The SecLBRProfile section itself remains compressible.

Phased Patch & Rollout Plan

To ensure a safe deployment and keep code reviews manageable, we plan to split the work into three sequential phases:

  • Phase 1: Hash Table Infrastructure (Off by Default)
    Introduce the serialization helper classes (traits) for the on-disk chained hash table, verified with isolated unit tests (~100 lines of library changes). This is a pure library addition with no functional changes to FDO yet.
  • Phase 2: Integration & Production Verification (Opt-in)
    Integrate the helpers into SampleProfReader and SampleProfWriter, implement compression restrictions, and enable the v104 format under a hidden flag (-sample-profile-format-version=104) (~150 lines). This allows testing and verifying the new format in large-scale production environments on an opt-in basis before enabling it globally.
  • Phase 3: Flip Default (Default On)
    Flip the default format version in SampleProfWriter.cpp from 103 to 104 (~5 lines), making the new format the default for all users after successful production verification.

Is the small file size increase measured with compression on?

No, the file size increase was measured with the default settings.