Hi folks,
In this RFC, we present profile-guided static data partitioning, to segregate data in binary ELF sections based on access frequency.
We looking forward to your feedback, during the holiday season or in the new year!
Authors: Mingming Liu (mingmingl@google.com), Snehasish Kumar (snehasishk@google.com), David Xinliang Li (davidxl@google.com)
Summary
Observing that it’s generally complex to capture indirect data accesses with code analysis and that not all compiler-generated data are symbolizable by design, this work uses a hybrid approach. It generates data access profiles to partition symbolizable data, and uses PGO profiles to infer hotness of module-internal data.
- The data access profiles are collected from production binaries and used to optimize binary in new releases. They are generated from symbolized memory accessing events and represented as a new payload section in memory profiles (MemProf). They are used at compile time to partition hot and cold data. This approach covers the data that are tracked by an executable’s symbol table.
- The rest of data are those with temporary, non-linker retained symbols and not symbolizable out of the box. This approach implements code analysis within an IR module to infer data hotness, with the goal to split cold data out with high confidence. Either instrumented or sampled PGO profiles will be used for code analysis.
A prototype over .rodata
section partitioning shows 0.3% throughput improvement for a large internal benchmark. Additional opportunities exist in other data sections like .data.rel.ro
.data
and .bss
. In addition, we are exploring fine-grained tuning to model the data access affinity to further improve locality.
This work will focus on ELF and DWARF for platform-specific implementations.
Terminology
-
Read-only data: refers to
.rodata
and.data.rel.ro
sections. -
Static data: include
.data
and.bss
in addition to read-only data sections.
Background
Binaries which are compiled with profile guided optimizations can place code in distinct text sections based on their observed hotness. When the application runs, we map the hot code onto huge pages, improving locality (particularly for the iTLB) while minimizing the number of huge pages required to do so. We apply the same idea to static data contained in the binary.
We prototype the effectiveness of partitioning hot and cold data using .rodata
section for a large internal workload using hardware performance counter events. Specifically, profiling memory loads and stores via hardware events (like Intel PEBS, AMD IBS, Arm SPE) allows us to quantify static data hotness using the Linux perf tool. The charts below show the access pattern of static data before and after partitioning. This prototype uses the MEM_INST_RETIRED.ALL_LOADS PEBS event to profile data accesses by collecting data addresses and mapping them to known symbols in the binary. For the rest of unmapped data objects (symbols are not preserved in the executable’s symbol table), their hotness is inferred from the FDO profiles of the surrounding code. We reorder the symbols in the .rodata section by adding a suffix to the symbol name and using a linker script to cluster the hot data. Profiling the binary again shows us the chart on the right. With this improved layout we measured a throughput improvement of 0.3% on a large internal Google workload.
Overview
What does the optimized data layout look like?
In the optimized data layout, the cold data will be kept in the original .<section>
, and a new .<section>.hot
section will cluster the rest, keeping a high recall of hot data (i.e., the hot-suffixed section keeps a high percentage of all truly hot data).
Profile source and where they are applied
The table is a summary of profile sources and what data types they are applied to. Multiple sources of profiles may exist for a piece of data (say local-linkage global variables are symbolizable and amenable for code analysis). In this case, a piece of data is placed into the hot-suffixed section if any source shows it is hot.
Profile Source | Where they are applied |
---|---|
Instrumented or sample PGO | Data whose direct accesses come solely from objects in the same module, like jump tables, constant pools, local or private linkage global variables |
Symbol name from symbolized data access profiles | Data with stable mangled names across binary releases, like vtables, RTTI, global variables with source code counterparts |
Symbol source location, by joining symbolized data access profiles and enhanced DWARF | Symbolizable data without stable names, most notably LTO-externalized string literals (i.e., .str.<N>.llvm.<hash> ) |
Detailed Design
Data tracked by the symbol table (e.g., global variables with external or local linkage) are termed ‘symbolizable’, and data whose direct accesses come solely from objects in the module (e.g., global variables with private or local linkage, or those with private prefixes in its name) are termed ‘module-internal’ data. Based on how symbols are selected into .o file’s symbol tables as well as how the executable’s symbol table forms, we believe handling the two types of data will cover the majority of cases. Moreover, data that are not understood by the two approaches will be kept in the .hot
-suffixed section for higher recall.
We propose to implement the partitioning for module-internal data first since it requires no new additional profile format changes. Augmenting MemProf profiles with hot symbols and extending debug information for better tracking will improve the precision of the hot section.
Module-Internal Data
From our analysis, un-symbolizable data (a subset of module-internal data) account for ~30% accesses out of all read-only data sections. Assuming representative profiles are provided, the module IR analysis could tell absolutely cold module-internal data apart from the rest, making it feasible to split out cold module-internal data with high confidence. Jump tables and constant pools are not represented in the IR; their partitioning is described in the sections following global variables.
Module-Internal Global Variables
To find out cold global variables in the module, we aggregate its references from functions and other global variables, and place the global variables in .section
(the cold section) if code analysis can determine the aggregated access is cold. The analysis feasibility also applies to local-linkage global variables, which are tracked by symbol table.
The address of module-internal data can escape out of the module through an external global variable or function. As a result, indirect accesses can still happen from outside the module. The trade-off is to place the data in the .hot-suffixed section if code analysis relying on iFDO or AFDO has limited information and cannot categorize the data as cold with high confidence. In practice we find that this occurs rarely.
Jump Table
From LLVM IR perspective, source code control flow changes like C++ switch statement could be represented by the switch
IR instruction, and if-else might be also converted to switch
. Switch lowering can lower switch
instructions into jump tables.
When the assembler emits a machine function, a jump table is tied to the specific function that uses it. A machine function keeps track of its list of jump tables, and each jump table tracks the list of destination basic blocks. A jump table is not mergeable among functions or among different branches within a function.
A standalone pass will be implemented to annotate jump table hotness based on basic block hotness. The assembler will place the jump table according to annotation. The heuristic is to place a jump table into .rodata
if both the source basic block and the destination basic blocks are cold.
Constant Pool
Similar to the jump table, a machine function keeps track of its constant pool entries. A constant pool keeps track of constants referenced by a function that must be spilled to memory. This typically includes floating point and large integer constants. Different from jump tables, constant pool data is placed in a mergeable section and lld can merge identical content.
A standalone pass will be implemented to annotate constant pools based on basic block hotness. The heuristic is to de-duplicate identical constant pool entries within an IR module, place a constant pool entry in .rodata
if the aggregated count of all machine instructions that references the constant pool entry is cold, and place it in .rodata.hot
otherwise.
Symbolizable Data
Profile Construction
To profile data hotness, we sample MEM_INST_RETIRED.ALL_LOADS PEBS
events from production jobs. Each sample contains the precise instruction and data address of an access. Symbolizing the data address from perf.data
and aggregating access counts by symbols gives a list of symbols and their counts. Note lld can merge data with identical content when the address is not significant (as a demo program shows) , so we need to find all symbols for a given data address by joining the ELF’s symbol table information.
In the de-serialized (in-memory) format, DataAccess describes a symbol and a list of its possible locations. A symbol is represented by its canonical name, and a location is represented by a module’s SourceFileName and line number. The location information is optional for global variables with stable names, and crucial for global variables without stable names (described in debug information enhancement in the subsequent section). The profile payload will store MD5 of the strings for space efficiency.
struct DataAccess {
// The MD5 of canonical symbol name
uint64_t HashOfSymbolCanonicalName ;
// A data symbol could have a couple of locations.
// For instance, lld merges identical string literals used in different modules.
// A global variable with linkonce_odr could also have multiple locations.
Location[] Loc;
};
struct Location {
// The MD5 of the module's SourceFileName
uint64_t ModuleHash;
uint32_t LineNumber;
};
Enhance Debug Information Generation
Some global variables (most notably string literals with a name pattern like .str.<N>.llvm.<hash>
) don’t have a stable mangled name across binary releases. Therefore, the names from a profiled binary are not useful to build the next binary for a new release. We propose to enhance debug information generation for such variables such that we can find out such global variables’ module and line number in dwarf.
Take string literals from Compiler Explorer as an example, Clang doesn’t give a name or linkage name for @.str.1
in its DIGlobalVariableExpression
, even though the file and line information are available (in the !9
metadata from godbolt example). Implementation-wise, we plan to update the global variable’s debug metadata with its LTO-externalized name and make sure the updated debug metadata is emitted as a part of the DWARF file. This way the module source file and line number are available by looking up the symbol in dwarf files (e.g., what llvm-dwarfdump --name=<name> binary.dwp
can do).
Profile representation as a new section in MemProf profiles
We propose to convey symbolized data access profiles using memprof. The memprof profile format extends the iFDO profile format by adding a new section. A memory profile may also be passed to clang separately using -fmemory-profile-use
. This way, both iFDO and AFDO-optimized binaries can adopt it by reading memory profiles.
Profile Consumption
Given the named symbol profiles, the AsmPrinter will iterate each global variable in a module and annotate it as hot if the data with the same canonical name or source location are hot according to the profiles.
The way to implement section prefix for global variables will be similar to how function section prefix is set and used; AsmPrinter will set the MD_section_prefix
metadata of a global variable, and target lowering library will add this section prefix when selecting sections.
Place hot data into a contiguous section
To preserve the cold section suffix, we’ll use a linker script. Alternatively, extending --keep-text-section-prefix
in lld to data sections would work if a linker script cannot be used.
Ongoing efforts and follow-ups
On top of read-only data hot-cold splitting, we plan to expand partition to other data sections like .data.rel.ro
, .bss
and .data
, and explore finer-grained data clustering like function-based clustering. The rationale to do finer-grained data clustering is that data access patterns can exhibit affinity on the granularity of a workload or function.