[RFC] Faster Sample Profile Loading

kazutakahirata · June 1, 2026, 11:02pm

Background

We’ve noticed that the sample profile loader in the compiler spends a lot of time loading the profile. In one compilation, the compiler spends about 11% of time on:

llvm::sampleprof::SampleProfileReaderExtBinaryBase::readImpl

This is because the sample profile loader eagerly loads two sections, SecFuncOffsetTable and SecNameTable, in their entirety even though the compiler references only a tiny fraction of their contents.

Design

We will lazily load both sections, SecFuncOffsetTable and SecNameTable. That is, we load the metadata first and then load the actual data only upon request.

Combining `SecFuncOffsetTable` and `SecLBRProfile`

We propose to combine SecFuncOffsetTable and SecLBRProfile into a single, combined section: SecLBRProfileHashTable.

Current Format

SecFuncOffsetTable (serialized map):
- Key: An index, expressed in ULEB128, into the NameTable (an array of MD5 hashes)
- Value: A byte offset, expressed in ULEB128, into the serialized FunctionSamples records in SecLBRProfile
- Note: This is the map that we read into DenseMap at startup.
SecLBRProfile (serialized variable-length records):
- Contains the actual FunctionSamples data. Because they are variable-sized, the offsets from SecFuncOffsetTable are required for random access (on-demand loading).
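
To illustrate why this layout forces a full scan, here is a minimal sketch of an eager parse. It is not the actual reader code: `readULEB128` and `parseFuncOffsetTable` are hypothetical names, `std::unordered_map` stands in for DenseMap, and the entry count is assumed to be known to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Decode one ULEB128 value starting at Data[Pos] and advance Pos past it.
// Hypothetical helper written for this sketch.
inline uint64_t readULEB128(const std::vector<uint8_t> &Data, size_t &Pos) {
  uint64_t Value = 0;
  unsigned Shift = 0;
  while (true) {
    assert(Pos < Data.size() && "truncated ULEB128");
    assert(Shift < 64 && "ULEB128 value too large");
    uint8_t Byte = Data[Pos++];
    Value |= static_cast<uint64_t>(Byte & 0x7f) << Shift;
    if ((Byte & 0x80) == 0)
      return Value;
    Shift += 7;
  }
}

// Eagerly parse NumEntries (name index, offset) pairs. The position of
// entry N depends on the encoded width of entries 0..N-1, so the reader
// cannot jump straight to a single entry.
inline std::unordered_map<uint64_t, uint64_t>
parseFuncOffsetTable(const std::vector<uint8_t> &Data, size_t NumEntries) {
  std::unordered_map<uint64_t, uint64_t> Table;
  size_t Pos = 0;
  for (size_t I = 0; I < NumEntries; ++I) {
    uint64_t NameIndex = readULEB128(Data, Pos);
    uint64_t Offset = readULEB128(Data, Pos);
    Table[NameIndex] = Offset;
  }
  return Table;
}
```

Every entry is decoded and inserted into the map even if the current module only ever asks for a handful of functions.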

New Format

The proposed SecLBRProfileHashTable will combine the two sections above into a single section, implemented with OnDiskChainedHashTable.

Key: 32-bit index into the NameTable
Value: FunctionSamples

The sample profile loader continues to recognize the old format and read from it if present.

Expected Performance

Compile time: We expect to save 9% of time spent on AutoFDO ThinLTO pre-link and ThinLTO backend compilations when using very large, multi-target profiles (e.g., a 2 GB profile containing 10M+ name entries).
Profile file size: We expect to increase the AutoFDO profile file size by about 15 bytes per profiled function. The increase primarily comes from the overhead of the on-disk hash table, such as the index structure and hash values.
Heap size: We expect to save about 32 bytes of heap memory per entry by not having to construct a DenseMap from SecFuncOffsetTable (16 bytes per entry divided by an average load factor of 0.5).

Lazy-Load SecNameTable

We propose to lazily load SecNameTable without changing the format.

Current Method

The SecNameTable section consists of an array of 8-byte MD5 hash values of symbol names, without any padding in between. In the fixed MD5 mode, SampleProfileReaderExtBinaryBase::readNameTableSec eagerly loads the entire section and constructs a 16-byte structure FunctionId for each MD5 hash value.

New Method

The reader will parse and retain only the section's metadata, such as the file offset of this array. FunctionId instances are then materialized on demand.
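
A minimal sketch of the idea, assuming the section is available as a memory-mapped byte buffer and that the hashes are stored little-endian. `LazyMD5NameTable` and `getMD5` are hypothetical names for illustration, not the reader's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical non-owning view over the MD5 array in a mapped profile.
// Only the base pointer and the entry count are kept; nothing is copied
// or constructed up front.
class LazyMD5NameTable {
  const uint8_t *Base; // start of the MD5 array inside the mapped file
  size_t NumEntries;

public:
  LazyMD5NameTable(const uint8_t *Base, size_t NumEntries)
      : Base(Base), NumEntries(NumEntries) {}

  size_t size() const { return NumEntries; }

  // Materialize one entry on demand. Decoding byte by byte assumes a
  // little-endian on-disk layout and avoids alignment requirements.
  uint64_t getMD5(size_t Index) const {
    assert(Index < NumEntries && "name table index out of range");
    const uint8_t *P = Base + Index * sizeof(uint64_t);
    uint64_t Value = 0;
    for (unsigned I = 0; I < 8; ++I)
      Value |= static_cast<uint64_t>(P[I]) << (8 * I);
    return Value;
  }
};
```

Because the entries are fixed-width, entry `Index` lives at byte offset `Index * 8`, which is what makes on-demand access possible without a format change.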

Expected Performance

Compile Time: We expect to save 1% of time spent on AutoFDO ThinLTO pre-link and ThinLTO backend compilations.
Profile file size: No change because the format stays the same.
Heap size: We expect to save 16 bytes per name table entry (≈ 16 MB per million entries).

@mingmingl-llvm @hokein @snehasish @davidxl

davidxl · June 2, 2026, 4:33am

What is the compile time improvement for autoFDO compile with non-shared profile?

The first part requires a format change. Can you describe your planned series of changes and format migration plan (assuming the old format will be deprecated)?

kazutakahirata · June 2, 2026, 8:44pm

I’m estimating the savings by using the parsing time as the proxy for the savings:

| Section | Non-shared profiles | Shared profiles |
|---|---|---|
| SecFuncOffsetTable | 1.64% | 8.88% |
| SecLBRProfile | 0.16% | 0.80% |
| SecNameTable | 0.56% | 2.78% |
| Total Savings | 2.37% | 12.46% |

I’m thinking of something like this:

1. Prepare OnDiskChainedHashTable PR200992
2. Land the reader and writer with the writer being off by default.
3. Do more internal testing.
4. Turn on the writer by default.

I don’t think we can deprecate the old format quickly because the new format only covers the specific scenario:

!Remapper && !ProfileIsCS && useMD5()

kazutakahirata · June 7, 2026, 5:15am

As I’ve highlighted above, this project has two components – speeding up SecNameTable and SecFuncOffsetTable. The former is being addressed as:

[SampleProfile] Switch getNameTable() to return iterator_range (NFC) by kazutakahirata · Pull Request #200995 · llvm/llvm-project · GitHub (merged)
[ProfileData] Lazy-load fixed-length MD5 name table by kazutakahirata · Pull Request #202014 · llvm/llvm-project · GitHub (pending review)

I’d like to revise the proposal for the latter – SecFuncOffsetTable – as I’ve learned more details.

[RFC] Faster Sample Profile Loading (Update)

We continue to use the on-disk hash table for function offsets, but we won’t integrate SecLBRProfile.

Design & Compatibility

Format Version v104: To support the on-disk hash table without breaking backward compatibility for older compilers, we are introducing a new profile format version, v104.
Separation of Index and Data: We do not integrate the SecLBRProfile (the actual profile data) into the hash table. The hash table acts strictly as a lightweight index, mapping a 64-bit function name GUID to a fixed 32-bit byte offset pointing into the SecLBRProfile section. This avoids the complexity of dealing with the variable-length serialization of FunctionSamples inside the hash table payload.

Storage & Memory Impact

Storage Cost: The index structure introduces a minor storage overhead. For a 1.2GB profile (containing ~600k total symbols, with ~96k written to the flat offset table), the total file size increases by roughly 10MB (~0.8%).
Heap Savings: In the older format, the reader must parse the entire offset table at startup to construct a DenseMap in memory. For a profile with ~600k symbols, this map alone consumes 16MB of heap. For a flat profile of this size, v104 completely eliminates this 16MB allocation. For split CS profiles, we save the heap for the flat section index (~2MB for the ~96k flat symbols).
Startup Speedup: Based on our measurements, parsing the offset table takes up to 8.88% of the total compilation time for shared profiles (XFDO) and 1.64% for non-shared profiles. By querying the on-disk hash table directly, we expect to recover almost all of this overhead during compiler startup.

Version Handling

Default to v103: The writer will continue to default to v103 to ensure no impact on existing workflows unless explicitly requested.
Opt-in to v104: Generation of the new format is opt-in via a new hidden command-line flag (-sample-profile-format-version=104).
Transparent Reader: The reader supports both v103 and v104 transparently.

Compression Restriction (for Mmap Efficiency)

To preserve zero-copy mmap loading, we disable compression on the hash table section (even if global compression is enabled). The reader will reject compressed hash tables as malformed. MD5 GUIDs are incompressible, and compression defeats zero-copy mmap. The SecLBRProfile section itself remains compressible.

Phased Patch & Rollout Plan

To ensure a safe deployment and keep code reviews manageable, we plan to split the work into three sequential phases:

Phase 1: Hash Table Infrastructure (Off by Default)
Introduce the serialization helper classes (traits) for the on-disk chained hash table, verified with isolated unit tests (~100 lines of library changes). This is a pure library addition with no functional changes to FDO yet.
Phase 2: Integration & Production Verification (Opt-in)
Integrate the helpers into SampleProfReader and SampleProfWriter, implement compression restrictions, and enable the v104 format under a hidden flag (-sample-profile-format-version=104) (~150 lines). This allows testing and verifying the new format in large-scale production environments on an opt-in basis before enabling it globally.
Phase 3: Flip Default (Default On)
Flip the default format version in SampleProfWriter.cpp from 103 to 104 (~5 lines), making the new format the default for all users after successful production verification.

davidxl · June 8, 2026, 3:48pm

Is the small file size increase measured with compression on?

kazutakahirata · June 17, 2026, 12:15pm

No, the file size increase was measured with the default settings.

kazutakahirata · July 11, 2026, 3:33am

Optimizing AutoFDO Profile Symbol List Loading with Eytzinger Binary Search

The RFC above was more about those AutoFDO profiles that use MD5 values in SecNameTable. Since we already have this topic, let me add a closely related topic to the same thread – Profile Symbol List.

Problem Statement

The compiler spends a significant amount of time loading and querying
SecProfileSymbolList in AutoFDO sample profiles:

- Decompressing and parsing the SecProfileSymbolList string archive.
- Constructing an in-memory DenseSet<StringRef> to answer cold symbol
  membership queries across C++ compilation and inlining passes.

Background

The compiler performs cold symbol membership queries via
ProfileSymbolList::contains(StringRef Name) to determine whether a function
symbol is recorded in the sample profile’s cold symbol table
(SecProfileSymbolList).

Solution

We format SecProfileSymbolList as an uncompressed array of 64-bit MD5 hashes
arranged in Eytzinger (breadth-first) order so the reader can perform an
efficient binary search directly on the mapped file. The resulting array is
written to SecProfileSymbolList and tagged with a section flag:
SecProfileSymbolListFlags::SecFlagEytzinger.
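
As a rough illustration of the layout, the sketch below builds an Eytzinger array from sorted hashes and searches it. It assumes a 0-based layout where the children of slot K sit at 2K+1 and 2K+2. The function names are hypothetical and the upstream EytzingerTableSpan may differ in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill Out by an in-order walk of the implicit tree, consuming Sorted from
// left to right. Recursion depth is about log2(N).
inline void fillEytzinger(const std::vector<uint64_t> &Sorted,
                          std::vector<uint64_t> &Out, size_t &Next, size_t K) {
  if (K >= Out.size())
    return;
  fillEytzinger(Sorted, Out, Next, 2 * K + 1);
  assert(Next < Sorted.size());
  Out[K] = Sorted[Next++];
  fillEytzinger(Sorted, Out, Next, 2 * K + 2);
}

// Convert an ascending array into Eytzinger (breadth-first) order.
inline std::vector<uint64_t> toEytzinger(const std::vector<uint64_t> &Sorted) {
  std::vector<uint64_t> Out(Sorted.size());
  size_t Next = 0;
  fillEytzinger(Sorted, Out, Next, 0);
  return Out;
}

// Membership query: walk down from the root, going right when the current
// slot is smaller than the key and left otherwise.
inline bool eytzingerContains(const uint64_t *Table, size_t Size,
                              uint64_t Key) {
  size_t K = 0;
  while (K < Size) {
    if (Table[K] == Key)
      return true;
    K = 2 * K + 1 + (Table[K] < Key ? 1 : 0);
  }
  return false;
}
```

For example, the sorted input {10, 20, 30, 40, 50, 60, 70} becomes {40, 20, 60, 10, 30, 50, 70}. The first few probes of every search touch the front of the array, which is why this layout tends to be cache-friendly compared with a binary search over a sorted array.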

For details on how to perform an Eytzinger binary search efficiently, see:

Current Status and Preliminary Results

Compilation Performance

Benchmarking an internal AutoFDO application with the Eytzinger Profile Symbol
List versus the compressed string list demonstrates compile-time speedups:

| Compilation Step | Speedup |
|---|---|
| ThinLTO Post-link Backend | 23.3% |
| ThinLTO Pre-link Compilation | 6.7% |

Profile Size on Disk

The same AutoFDO application above has 349,227 cold symbols in
SecProfileSymbolList. Replacing the compressed string list with an Eytzinger
array of 64-bit MD5 hashes saves 41% for the section:

- Before (Compressed String Archive): 4.8 MB zlib-compressed string
  block (47 MB decompressed).
- After (Eytzinger MD5 Array): 2.8 MB, stored uncompressed.

Impact on Heap Allocation

Before (Compressed SecProfileSymbolList): ~55 MB (47 MB string
buffer and 8 MB for DenseSet<StringRef>) to store 349,227 cold symbols
after decompressing 4.8 MB of zlib data.
After (Eytzinger SecProfileSymbolList): Zero heap allocation for
the cold symbol table. The reader requires only 16 bytes of metadata to
access the mmap disk buffer.

Upstreaming Plan

We plan to upstream the Eytzinger Profile Symbol List in three patches:

1. Core Data Structures (EytzingerTableSpan): Introduce the non-owning
   binary search span to llvm/include/llvm/ADT/Eytzinger.h.
2. Section Flag Definition: Add SecProfileSymbolListFlags::SecFlagEytzinger
   to uniquely identify Eytzinger-formatted cold symbol sections.
3. Writer & Reader Integration: Update SampleProfileWriterExtBinaryBase to
   emit 64-bit MD5 arrays in Eytzinger layout when SecFlagEytzinger is
   requested, and update SampleProfileReaderExtBinaryBase to query
   EytzingerTableSpan directly on the underlying memory buffer.

@davidxl

kazutakahirata · July 14, 2026, 6:18am

RFC Amendment: Eytzinger Name Table and Parallel Function Offsets

Problem Statement

The compiler spends a significant amount of time loading and parsing
SecFuncOffsetTable in large AutoFDO profiles. The problem lies in the
encoding of the section where we use pairs of variable-length ULEB128 integers.
Since we cannot perform random access on the mmap memory, we must eagerly load
the entire section regardless of how many entries the current module is
interested in retrieving.

Background

Here is how we use SecFuncOffsetTable:

1. The compiler invokes getSamplesFor for a function.
2. We look up the function's GUID in SecFuncOffsetTable (or its in-memory
   equivalent).
3. We find the corresponding file offset in SecLBRProfile (relative to the
   beginning of the section).
4. We retrieve FunctionSamples.

In other words, we are essentially operating a hash map, where
SecFuncOffsetTable acts as key-value pairs, with the value being a file
offset.

Solution

The solution below applies to those cases where the user wishes to generate an
MD5 profile. The raw string-based AutoFDO profile is out of scope.

We propose a space-efficient profile format that does not require eager loading.
The basic setup is parallel arrays:

SecNameTable (or SecCSNameTable): Continues to be an array of MD5 values
in fixed length uint64_t.
SecFuncOffsetTable: This section will hold an array of file offsets in
uint32_t to FunctionSamples in SecLBRProfile only. It will no longer
hold any keys (GUIDs).

`SecNameTable`

Now, we have three mutually exclusive sets of GUIDs:

CSKeys: Those GUIDs used as keys in Context-Sensitive SecLBRProfile
FlatKeys: Those GUIDs used as keys in Flat SecLBRProfile
Inlinees: Those GUIDs mentioned in SecLBRProfile but not as keys

The new SecNameTable will consist of:

Metadata
- the number of CSKeys
- the number of FlatKeys
- the number of Inlinees
Payload
- CSKeys in Eytzinger layout
- FlatKeys in Eytzinger layout
- Inlinees in Eytzinger layout

Notice that we have exactly the same number of keys as before, but they are
arranged differently into three groups, each of which can be easily searched
with a binary search.
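
A sketch of how a reader might use the three groups, assuming the payload holds them back to back in the order listed above and that each group uses a 0-based Eytzinger layout. `NameTableView`, `GUIDKind`, and `classify` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class GUIDKind { NotInProfile, CSKey, FlatKey, Inlinee };

// Hypothetical view: the three counts from the metadata plus a pointer to
// the payload, where the groups are assumed to sit back to back.
struct NameTableView {
  size_t NumCSKeys;
  size_t NumFlatKeys;
  size_t NumInlinees;
  const uint64_t *Payload;
};

// Search one Eytzinger-ordered group (children of slot K at 2K+1, 2K+2).
inline bool eytzingerContains(const uint64_t *Table, size_t Size,
                              uint64_t Key) {
  size_t K = 0;
  while (K < Size) {
    if (Table[K] == Key)
      return true;
    K = 2 * K + 1 + (Table[K] < Key ? 1 : 0);
  }
  return false;
}

// Report which group, if any, contains GUID. The groups are mutually
// exclusive, so the order of the checks does not affect the result.
inline GUIDKind classify(const NameTableView &V, uint64_t GUID) {
  assert(V.Payload || V.NumCSKeys + V.NumFlatKeys + V.NumInlinees == 0);
  const uint64_t *CSKeys = V.Payload;
  const uint64_t *FlatKeys = CSKeys + V.NumCSKeys;
  const uint64_t *Inlinees = FlatKeys + V.NumFlatKeys;
  if (eytzingerContains(CSKeys, V.NumCSKeys, GUID))
    return GUIDKind::CSKey;
  if (eytzingerContains(FlatKeys, V.NumFlatKeys, GUID))
    return GUIDKind::FlatKey;
  if (eytzingerContains(Inlinees, V.NumInlinees, GUID))
    return GUIDKind::Inlinee;
  return GUIDKind::NotInProfile;
}
```

The view itself is just three counts and a pointer, so nothing proportional to the number of names needs to be allocated.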

`SecFuncOffsetTable`

The SecFuncOffsetTable for the Context-Sensitive SecLBRProfile has as many
entries as CSKeys, each being fixed-length uint32_t. We form key-value pairs
as follows:

Key (GUID): The k-th entry of CSKeys in SecNameTable
Value (File offset): The k-th entry of SecFuncOffsetTable for
Context-Sensitive SecLBRProfile. This file offset is within
SecLBRProfile just as before.

The same setup applies to SecFuncOffsetTable for Flat SecLBRProfile.
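
The key-value pairing can be sketched as follows, assuming both arrays are directly addressable in the mapped file and the keys use a 0-based Eytzinger layout. `eytzingerFind` and `lookupOffset` are hypothetical helpers, not part of the proposal's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Find Key in an Eytzinger-ordered array (children of slot K at 2K+1 and
// 2K+2). On success, store the slot in Slot and return true.
inline bool eytzingerFind(const uint64_t *Keys, size_t Size, uint64_t Key,
                          size_t &Slot) {
  size_t K = 0;
  while (K < Size) {
    if (Keys[K] == Key) {
      Slot = K;
      return true;
    }
    K = 2 * K + 1 + (Keys[K] < Key ? 1 : 0);
  }
  return false;
}

// Parallel-array lookup: the k-th key pairs with the k-th offset, so the
// offset table needs no keys of its own.
inline bool lookupOffset(const uint64_t *Keys, const uint32_t *Offsets,
                         size_t Size, uint64_t GUID, uint32_t &Offset) {
  assert((Size == 0 || (Keys && Offsets)) && "missing table data");
  size_t Slot = 0;
  if (!eytzingerFind(Keys, Size, GUID, Slot))
    return false;
  Offset = Offsets[Slot];
  return true;
}
```

A lookup costs about log2(N) key comparisons plus one 4-byte read, and touches only the parts of the two arrays on the search path.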

`SecLBRProfile`

This section encodes GUIDs as indexes into the name table as ULEB128. The
encoding scheme stays the same as before, but since the name table is in a new
order, actual encodings of ULEB128 will change.
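
To make the last point concrete, here is a small sketch of how a name reference would be re-encoded after reordering. `OldToNew` is a hypothetical permutation table introduced for illustration, and `encodeULEB128` is written out here rather than taken from LLVM.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Standard ULEB128 encoding: 7 payload bits per byte, high bit set on every
// byte except the last.
inline std::vector<uint8_t> encodeULEB128(uint64_t Value) {
  std::vector<uint8_t> Out;
  do {
    uint8_t Byte = Value & 0x7f;
    Value >>= 7;
    if (Value != 0)
      Byte |= 0x80;
    Out.push_back(Byte);
  } while (Value != 0);
  return Out;
}

// Re-encode a name reference after the name table was reordered.
// OldToNew[I] is the new position of the name that used to sit at index I.
inline std::vector<uint8_t>
reencodeNameIndex(const std::vector<uint32_t> &OldToNew, uint32_t OldIndex) {
  assert(OldIndex < OldToNew.size() && "stale name index");
  return encodeULEB128(OldToNew[OldIndex]);
}
```

Since indexes below 128 take one byte and larger ones take two or more, moving a name within the table can change both the bytes and the length of every record that refers to it.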

Preliminary Results

Compilation Performance

Benchmarking an internal AutoFDO application using a large MD5-based profile
shows the following speedups with the new format:

| Compilation Step | Speedup |
|---|---|
| ThinLTO Pre-link Compilation | 34.5% |
| ThinLTO Post-link Backend | 79.0% |

Profile Size on Disk

The same XFDO application above has 794,393 top-level function symbols across
its split hot and cold SecFuncOffsetTable sections. Replacing ULEB128 pairs
with a parallel 4-byte (uint32_t) array reduces the section size on disk
by 53%:

Before (ULEB128 Pairs): 6.73 MB across both sections.
After (Parallel uint32_t Array): 3.18 MB across both sections.

Impact on Heap Allocation

Before (SecFuncOffsetTable): 17.8 MB for DenseMap to store
622,380 top-level entries from the larger hot/CS section after eagerly
decoding ULEB128 pairs.
After (Parallel uint32_t Array): 80 bytes for various metadata.
The reader operates directly over memory-mapped disk slices.

Auxiliary Benefits

Fast Global Membership Queries

The three Eytzinger spans in SampleProfileNameTable let
SampleProfileLoader::runOnFunction check symbol membership instantly on
demand, eliminating upfront DenseSet (GUIDsInProfile) construction.

Synergy with Profile Symbol List

As shown in:

I’m planning to encode Profile Symbol List in Eytzinger layout. We can use the
same encoder and decoder (binary search).

On-Disk Hash Table

Earlier, I was planning to use OnDiskChainedHashTable in SecFuncOffsetTable:

I have since shelved the idea for the following reasons:

- OnDiskChainedHashTable uses about 6 times as much disk space for
  SecFuncOffsetTable.
- OnDiskChainedHashTable is a one-off solution with no synergy with Profile
  Symbol List in Eytzinger layout.

Upstreaming

1. Finish upstreaming Eytzinger encoder. (The decoder has already landed.)
2. Support Eytzinger layout in SecNameTable (while leaving SecFuncOffsetTable
   as is in terms of the encoding scheme).
3. Use a uint32_t array in SecFuncOffsetTable.

@davidxl
