[RFC] Profile Guided Static Data Partitioning

Hi folks,

In this RFC, we present profile-guided static data partitioning, which segregates data in binary ELF sections based on access frequency.

We look forward to your feedback, whether during the holiday season or in the new year!

Authors: Mingming Liu (mingmingl@google.com), Snehasish Kumar (snehasishk@google.com), David Xinliang Li (davidxl@google.com)

Summary

Observing that it’s generally complex to capture indirect data accesses with code analysis, and that not all compiler-generated data are symbolizable by design, this work uses a hybrid approach. It generates data access profiles to partition symbolizable data, and uses PGO profiles to infer the hotness of module-internal data.

  1. The data access profiles are collected from production binaries and used to optimize binaries in new releases. They are generated from symbolized memory access events and represented as a new payload section in memory profiles (MemProf). They are used at compile time to partition hot and cold data. This approach covers the data tracked by an executable’s symbol table.
  2. The remaining data have temporary, non-linker-retained symbols and are not symbolizable out of the box. For these, this approach implements code analysis within an IR module to infer data hotness, with the goal of splitting cold data out with high confidence. Either instrumented or sampled PGO profiles will be used for the code analysis.

A prototype of .rodata section partitioning shows a 0.3% throughput improvement on a large internal benchmark. Additional opportunities exist in other data sections like .data.rel.ro, .data, and .bss. In addition, we are exploring fine-grained tuning to model data access affinity and further improve locality.

This work will focus on ELF and DWARF for platform-specific implementations.

Terminology

  • Read-only data: refers to .rodata and .data.rel.ro sections.

  • Static data: includes .data and .bss in addition to the read-only data sections.

Background

Binaries compiled with profile-guided optimizations can place code in distinct text sections based on observed hotness. When the application runs, we map the hot code onto huge pages, improving locality (particularly for the iTLB) while minimizing the number of huge pages required to do so. We apply the same idea to static data contained in the binary.

We prototyped hot/cold partitioning of the .rodata section for a large internal workload using hardware performance counter events. Specifically, profiling memory loads and stores via hardware events (like Intel PEBS, AMD IBS, Arm SPE) allows us to quantify static data hotness using the Linux perf tool. The heatmaps below show the access pattern of static data before and after partitioning. This prototype uses the MEM_INST_RETIRED.ALL_LOADS PEBS event to profile data accesses by collecting data addresses and mapping them to known symbols in the binary. For the remaining unmapped data objects (whose symbols are not preserved in the executable’s symbol table), hotness is inferred from the FDO profiles of the surrounding code. We reorder the symbols in the .rodata section by adding a suffix to each symbol name and using a linker script to cluster the hot data. Profiling the binary again produces the “after” heatmap. With this improved layout we measured a throughput improvement of 0.3% on a large internal Google workload.

[Figure: data access heatmaps of the .rodata section, before and after partitioning]

Overview

What does the optimized data layout look like?

In the optimized data layout, the cold data will be kept in the original .<section>, and a new .<section>.hot section will cluster the rest, keeping a high recall of hot data (i.e., the hot-suffixed section keeps a high percentage of all truly hot data).

Profile sources and where they are applied

The list below summarizes the profile sources and the kinds of data each applies to. Multiple sources of profiles may exist for a piece of data (say, local-linkage global variables are both symbolizable and amenable to code analysis). In this case, a piece of data is placed into the hot-suffixed section if any source shows it is hot.

  • Instrumented or sampled PGO → data whose direct accesses come solely from objects in the same module, like jump tables, constant pools, and local- or private-linkage global variables.

  • Symbol names from symbolized data access profiles → data with stable mangled names across binary releases, like vtables, RTTI, and global variables with source-code counterparts.

  • Symbol source locations, obtained by joining symbolized data access profiles with enhanced DWARF → symbolizable data without stable names, most notably LTO-externalized string literals (i.e., .str.<N>.llvm.<hash>).

Detailed Design

Data tracked by the symbol table (e.g., global variables with external or local linkage) are termed ‘symbolizable’, and data whose direct accesses come solely from objects in the module (e.g., global variables with private or local linkage, or those with private prefixes in their names) are termed ‘module-internal’ data. Based on how symbols are selected into a .o file’s symbol table as well as how the executable’s symbol table is formed, we believe handling these two types of data will cover the majority of cases. Moreover, data not understood by either approach will be kept in the .hot-suffixed section for higher recall.

We propose to implement the partitioning for module-internal data first, since it requires no profile format changes. Augmenting MemProf profiles with hot symbols and extending debug information for better tracking will then improve the precision of the hot section.

Module-Internal Data

From our analysis, un-symbolizable data (a subset of module-internal data) account for ~30% of all accesses to read-only data sections. Assuming representative profiles are provided, module IR analysis can tell absolutely cold module-internal data apart from the rest, making it feasible to split out cold module-internal data with high confidence. Jump tables and constant pools are not represented in the IR; their partitioning is described in the sections following global variables.

Module-Internal Global Variables

To find cold global variables in a module, we aggregate each variable’s references from functions and other global variables, and keep the variable in .<section> (the cold section) if code analysis can determine that the aggregated access is cold. The same analysis also applies to local-linkage global variables, which are tracked by the symbol table.

The address of module-internal data can escape the module through an external global variable or function, so indirect accesses can still happen from outside the module. The trade-off is to place the data in the .hot-suffixed section whenever code analysis relying on iFDO or AFDO has limited information and cannot categorize the data as cold with high confidence. In practice we find that this occurs rarely.
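To make the analysis concrete, here is a minimal sketch of the coldness check described above, assuming PGO counts are available through LLVM’s ProfileSummaryInfo and BlockFrequencyInfo; the helper name isGlobalVariableCold and the exact conservative checks are illustrative, not the actual upstream pass:

#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Hypothetical helper: a module-internal global variable is cold only when
// every direct access to it happens in a provably cold basic block.
static bool isGlobalVariableCold(
    const GlobalVariable &GV, ProfileSummaryInfo &PSI,
    function_ref<BlockFrequencyInfo &(Function &)> GetBFI) {
  // Without local linkage the address may escape the module, and indirect
  // accesses from other modules are invisible to this analysis.
  if (!GV.hasLocalLinkage())
    return false;

  for (const User *U : GV.users()) {
    const auto *I = dyn_cast<Instruction>(U);
    // References from constants/other globals (e.g., via ConstantExpr) are
    // not handled in this sketch; be conservative and treat the data as hot.
    if (!I)
      return false;
    auto &BFI = GetBFI(*const_cast<Function *>(I->getFunction()));
    // If any referencing block is not provably cold, keep the data hot.
    if (!PSI.isColdBlock(I->getParent(), &BFI))
      return false;
  }
  return true;
}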

Jump Table

From the LLVM IR perspective, source-level control flow such as a C++ switch statement can be represented by the switch IR instruction, and if-else chains may also be converted to switch. Switch lowering can then lower switch instructions into jump tables.

When the assembler emits a machine function, a jump table is tied to the specific function that uses it. A machine function keeps track of its list of jump tables, and each jump table tracks the list of destination basic blocks. A jump table is not mergeable among functions or among different branches within a function.

A standalone pass will be implemented to annotate jump table hotness based on basic block hotness, and the assembler will place each jump table according to the annotation. The heuristic is to place a jump table into .rodata only if both the source basic block and all destination basic blocks are cold, as sketched below.
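A minimal sketch of that heuristic, assuming the (hypothetical) annotation pass has MachineBlockFrequencyInfo and ProfileSummaryInfo at hand:

#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
#include "llvm/CodeGen/MachineJumpTableInfo.h"

using namespace llvm;

// A jump table may go to plain .rodata only when the dispatching block and
// every destination block are cold; otherwise it stays in .rodata.hot.
static bool isJumpTableCold(const MachineJumpTableEntry &JTE,
                            const MachineBasicBlock &SrcMBB,
                            ProfileSummaryInfo &PSI,
                            const MachineBlockFrequencyInfo &MBFI) {
  if (!PSI.isColdBlock(&SrcMBB, &MBFI))
    return false;
  for (const MachineBasicBlock *DestMBB : JTE.MBBs)
    if (!PSI.isColdBlock(DestMBB, &MBFI))
      return false;
  return true;
}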

Constant Pool

Similar to the jump table, a machine function keeps track of its constant pool entries. A constant pool keeps track of constants referenced by a function that must be spilled to memory. This typically includes floating point and large integer constants. Unlike jump tables, constant pool data is placed in a mergeable section, and lld can merge identical content.

A standalone pass will be implemented to annotate constant pools based on basic block hotness. The heuristic is to de-duplicate identical constant pool entries within an IR module, place a constant pool entry in .rodata if the aggregated count of all machine instructions that reference it is cold, and place it in .rodata.hot otherwise.
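A sketch of the aggregation step, again with a hypothetical helper name; it sums the profile counts of every machine instruction that references each constant pool entry, which a later step would compare against a coldness threshold:

#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include <cstdint>
#include <map>
#include <optional>

using namespace llvm;

// Maps each constant pool index to the total profile count of the machine
// instructions referencing it.
static std::map<int, uint64_t>
collectConstantPoolCounts(const MachineFunction &MF,
                          const MachineBlockFrequencyInfo &MBFI) {
  std::map<int, uint64_t> Counts;
  for (const MachineBasicBlock &MBB : MF) {
    std::optional<uint64_t> Count = MBFI.getBlockProfileCount(&MBB);
    if (!Count)
      continue;
    for (const MachineInstr &MI : MBB)
      for (const MachineOperand &MO : MI.operands())
        if (MO.isCPI()) // Operand refers to a constant pool entry.
          Counts[MO.getIndex()] += *Count;
  }
  return Counts;
}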

Symbolizable Data

Profile Construction

To profile data hotness, we sample MEM_INST_RETIRED.ALL_LOADS PEBS events from production jobs. Each sample contains the precise instruction and data address of an access. Symbolizing the data addresses from perf.data and aggregating access counts by symbol gives a list of symbols and their counts. Note that lld can merge data with identical content when the address is not significant (as a demo program shows), so we need to find all symbols for a given data address by joining the ELF’s symbol table information.
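A self-contained sketch of that aggregation step; the types and names are hypothetical, and overlapping symbols of different sizes are handled only approximately:

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct DataSymbol {
  uint64_t Addr; // st_value from the ELF symbol table.
  uint64_t Size; // st_size.
  std::string Name;
};

// Symbols must be sorted by Addr. Because lld can merge identical data,
// several symbols may cover the same sampled address; credit all of them.
std::map<std::string, uint64_t> aggregateAccessCounts(
    const std::vector<DataSymbol> &Symbols,
    const std::vector<std::pair<uint64_t, uint64_t>> &AddrCounts) {
  std::map<std::string, uint64_t> Counts;
  for (auto [Addr, Count] : AddrCounts) {
    auto It = std::upper_bound(
        Symbols.begin(), Symbols.end(), Addr,
        [](uint64_t A, const DataSymbol &S) { return A < S.Addr; });
    // Walk backwards over the symbols that start at or before Addr.
    while (It != Symbols.begin()) {
      --It;
      if (It->Addr + It->Size <= Addr)
        break; // Earlier symbols end before Addr (approximation).
      Counts[It->Name] += Count;
    }
  }
  return Counts;
}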

In the de-serialized (in-memory) format, DataAccess describes a symbol and a list of its possible locations. A symbol is represented by its canonical name, and a location is represented by a module’s SourceFileName and a line number. The location information is optional for global variables with stable names, and crucial for global variables without stable names (described under debug information enhancement in the subsequent section). The profile payload will store MD5 hashes of the strings for space efficiency.

#include <cstdint>
#include <vector>

struct Location {
  // The MD5 of the module's SourceFileName.
  uint64_t ModuleHash;
  uint32_t LineNumber;
};

struct DataAccess {
  // The MD5 of the canonical symbol name.
  uint64_t HashOfSymbolCanonicalName;

  // A data symbol can have several locations. For instance, lld merges
  // identical string literals used in different modules, and a global
  // variable with linkonce_odr linkage can also have multiple locations.
  std::vector<Location> Locs;
};
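The MD5 keys above could be produced with llvm::MD5Hash, which LLVM already uses for name-based hashing elsewhere; the exact hashing scheme of the final format may differ:

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MD5.h"
#include <cstdint>

// Hash a canonical symbol name (or a module's SourceFileName) down to the
// 64-bit key stored in the profile payload.
uint64_t hashProfileString(llvm::StringRef Str) {
  return llvm::MD5Hash(Str); // First 8 bytes of the MD5 digest.
}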
Enhance Debug Information Generation

Some global variables (most notably string literals with a name pattern like .str.<N>.llvm.<hash>) don’t have a stable mangled name across binary releases, so the names from a profiled binary are not useful for building the next binary of a new release. We propose to enhance debug information generation for such variables so that we can find their module and line number in DWARF.

Taking string literals on Compiler Explorer as an example, Clang gives neither a name nor a linkage name for @.str.1 in its DIGlobalVariableExpression, even though the file and line information is available (in the !9 metadata of the godbolt example). Implementation-wise, we plan to update the global variable’s debug metadata with its LTO-externalized name and make sure the updated debug metadata is emitted as part of the DWARF file. This way, the module source file and line number are available by looking up the symbol in DWARF files (e.g., what llvm-dwarfdump --name=<name> binary.dwp can do).

Profile representation as a new section in MemProf profiles

We propose to convey symbolized data access profiles using memprof. The memprof profile format extends the iFDO profile format by adding a new section. A memory profile may also be passed to clang separately using -fmemory-profile-use. This way, both iFDO- and AFDO-optimized binaries can adopt it by reading memory profiles.

Profile Consumption

Given the named symbol profiles, the AsmPrinter will iterate over each global variable in a module and annotate it as hot if data with the same canonical name or source location is hot according to the profiles.

Section prefixes for global variables will be implemented similarly to how function section prefixes are set and used: the AsmPrinter will set the MD_section_prefix metadata of a global variable, and the target lowering library will add this section prefix when selecting sections.
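A sketch of the annotation step, mirroring how function section prefixes are encoded as metadata today; the exact operand layout of the !section_prefix node is an assumption here, and the upstream patches may use a dedicated setSectionPrefix API instead:

#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// Attach a "hot" section prefix to a global variable; target lowering can
// then append the prefix (e.g., .rodata.hot) when selecting the section.
static void markHot(GlobalVariable &GV) {
  LLVMContext &Ctx = GV.getContext();
  Metadata *Ops[] = {MDString::get(Ctx, "section_prefix"),
                     MDString::get(Ctx, "hot")};
  GV.setMetadata(LLVMContext::MD_section_prefix, MDNode::get(Ctx, Ops));
}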

Place hot data into a contiguous section

To preserve the section suffixes through the final link, we’ll use a linker script. Alternatively, extending lld’s --keep-text-section-prefix to data sections would work where a linker script cannot be used.

Ongoing efforts and follow-ups

On top of read-only data hot/cold splitting, we plan to expand partitioning to other data sections like .data.rel.ro, .bss, and .data, and to explore finer-grained data clustering, like function-based clustering. The rationale for finer-grained data clustering is that data access patterns can exhibit affinity at the granularity of a workload or function.


Tagging @snehasish @davidxl and @WenleiHe @modimo @wlei

Also @tcreech-intel @williamweixiao since it is related to their HWPGO effort.

Hi @mingmingl-llvm, really exciting work!
We have also been working on similar memory-profile-guided data symbol ordering for MachO binaries, mainly focused on improving mobile app startup performance by reducing page faults with hot/cold splitting.

Specifically,
(1) We added memprof support for the Darwin platform, plus new functionality to dump addresses belonging to loaded binaries/modules whose counter is not 0. Note that we changed MEM_GRANULARITY to 8 bytes because, according to our analysis, most data symbol sizes fall into the 8-64 byte range in mobile apps.

(2) By running the memprof-instrumented mobile apps, we can get the binary access profile for app startup.

(3) Then we need to translate/map the runtime addresses to symbols, so we know which hot data symbols are accessed at app startup. We have also considered using DWARF debug info, but for now we are using linker maps for simplicity.

(4) We then order the regular mobile apps’ data symbols in lld based on the symbol order file generated in step 3. We use the existing -order_file to order the symbolizable data with stable names, mostly in __DATA_CONST,__const and __TEXT,__const in MachO. We added a similar -order_file_cstring for string literal ordering (will upstream soon) in lld, using the hash of each string literal to represent and match it in the order file. This mostly orders __TEXT,__cstring in MachO.

Our experiments on a large social app show that this work can reduce pre-main startup time by 11-12% and page faults by 5%.

We face the same issue of matching un-symbolizable data, such as linker-local symbols starting with l_, and symbolizable global variables with non-unique names. For the latter, we have also been thinking of identifying them with the aid of their source locations, but we are leaning toward appending hashes of their location info (e.g., file name, class name) to their symbol names.

I also chatted about this work with @teresajohnson at the LLVM Dev meeting earlier this year.

Are you planning to upstream the code any time soon? We are interested to see how the new symbolized data access profiles are integrated into the existing memprof profile format.
Thanks!
Sharon


Thanks for sharing the methodologies and results Sharon! It’s great to hear about data layout optimizations in the mobile space. My responses are inline.

Are you planning to upstream the code any time soon?

I plan to upstream the patches that handle module-internal data in the coming days as a first step, and to extend memprof with data access profiles after that.

We are interested to see how the new symbolized data access profiles are integrated into the existing memprof profile format.

Does extending memprof with symbolized data access profiles fit somewhere in the work on your end? MemProf has seen many nice improvements recently (driven by @kazutakahirata), and we’d be happy to collaborate on any common pieces.

We added memprof support for the Darwin platform, plus new functionality to dump addresses belonging to loaded binaries/modules whose counter is not 0. Note that we changed MEM_GRANULARITY to 8 bytes because, according to our analysis, most data symbol sizes fall into the 8-64 byte range in mobile apps.

Are there plans to upstream the memprof data instrumentation work? While we are not focused on an instrumentation-based approach for data access, it’d be nice to see related work upstream.

We added a similar -order_file_cstring for string literal ordering (will upstream soon) in lld, using the hash of each string literal to represent and match it in the order file

I’d be interested to take a read when the upstream patches are ready.

We face the same issue of matching un-symbolizable data, such as linker-local symbols starting with l_, and symbolizable global variables with non-unique names. For the latter, we have also been thinking of identifying them with the aid of their source locations, but we are leaning toward appending hashes of their location info (e.g., file name, class name) to their symbol names.

It’s good to know that source locations are favored as anchors in both efforts :smiley:.

I wonder if the filenames recorded in the symbol table would be useful for the cases you describe.

  • One potential limitation is that only the base filename (without the full path) is recorded for ELF when I try it (and for MachO as well, from reading the code, given the relevant config option is false for both). Emitting the full path would also increase symbol table size.

Thanks for the detailed RFC! This is really exciting work and I think there is significant overlap with what @sharonxu and I have planned.

A while back I extended IRPGO to support “Temporal Profiling”.

The main motivation was to improve binary startup time (which is important in the mobile space) by reducing the number of .text section page faults via a function order. We do this by instrumenting function timestamps (which measure how early a function is first called) and running a new Balanced Partitioning algorithm during linking to come up with an optimal function order using our profile data. We believe we can expand on this approach to reorder data. (Based on @Colibrow’s recent work, I believe they would be very interested in this.)

Temporal PGO is similar to this work, but there are some key differences. Let me know if I get something wrong.

  • This work improves throughput during steady-state execution, I assume by reducing data cache misses. Temporal PGO reduces page faults during startup.
  • This work partitions hot data into a new section at compile time. Temporal PGO orders data in the linker, which has the effect of hot/cold splitting as well as colocating functions often used in sequence.
  • This work uses sampled profile data from production, while Temporal PGO uses IRPGO-instrumented binaries to collect profile data.

How do you plan to extend the MemProf profile format to support this? I still need to study this profile format, but it would be very helpful to us if your extension could include a “timestamp” for each symbol, similar to how we extended IRPGO to support function timestamps. Since I assume MEM_INST_RETIRED.ALL_LOADS does not collect timestamps of data loads, you would probably use a value of zero for your case. However, if you were to collect this info, I believe you could further improve data locality by colocating data that are often used close in time during execution.

Have you considered using --symbol-ordering-file (or -order_file in Mach-O) to cluster these sections in the linker? Or you could add a new symbol orderer in LLD, similar to --call-graph-ordering-file or --bp-startup-sort? This would give you much more control over where these symbols are laid out.


Has this feature landed? (We also talked about deprecating a previous implementation).

The original Temporal PGO proposal is about function order. Is the data ordering supported by the memprof based approach?

This work also relies on the linker to group data based on section names.

This work can use multiple sources of profile information, depending on availability.

This is possible but not essential. The compiler can enforce the relative ordering via other means with discovered access affinity information.

Yes. And yes we can deprecate -enable-order-file-instrumentation.

Not yet, but we are working on extending this work to support data ordering.

As @sharonxu mentioned, our preliminary experiments show good results.

Thanks for the feedback and questions so far. My responses are inline

This work improves throughput during steady-state execution, I assume by reducing data cache misses. Temporal PGO reduces page faults during startup.

Yes, hot/cold data splitting improves throughput at steady state. It improves dTLB usage (TLB usage is worth optimizing for servers), and can also reduce data cache misses for read-only data sections thanks to a reduced working set.

This work partitions hot data into a new section at compile time. Temporal PGO orders data in the linker, which has the effect of hot/cold splitting as well as colocating functions often used in sequence.

It’s correct that profiles are consumed at compile time and data sections get the .hot suffix (or none) when they are provided as linker input. There is no linker reordering involved now, but we are interested in reordering as a follow-up.

  • This work uses sampled profile data from production, while Temporal PGO uses IRPGO-instrumented binaries to collect profile data.

This is correct.

How do you plan to extend the MemProf profile format to support this? I still need to study this profile format, but it would be very helpful to us if your extension could include a “timestamp” for each symbol, similar to how we extended IRPGO to support function timestamps. Since I assume MEM_INST_RETIRED.ALL_LOADS does not collect timestamps of data loads, you would probably use a value of zero for your case. However, if you were to collect this info, I believe you could further improve data locality by colocating data that are often used close in time during execution.

Memprof has a raw format (generated at runtime) and an indexed format (i.e., the format consumed at compile time). Our work will skip the raw format, process Linux perf samples (in out-of-tree tools), and directly insert the symbolized profiles into the indexed format.

Presumably the timestamp information exists in the raw-format profile. It’d be lightweight to carry timestamps into the indexed format, and reasonable to do so if it helps.

Have you considered using --symbol-ordering-file (or -order_file in Mach-O) to cluster these sections in the linker? Or you could add a new symbol orderer in LLD, similar to --call-graph-ordering-file or --bp-startup-sort? This would give you much more control over where these symbols are laid out.

Yes, we tried --symbol-ordering-file to experiment with finer-grained ordering for symbolizable data only, and the experiment shows finer-grained ordering is worth pursuing.

To place together both symbolized data and un-symbolized data that are frequently accessed by the same function, presumably we’d need to change lld to do section-prefix (or suffix) based ordering.


Thanks for the RFC! This work looks great to me. As @ellishg mentioned, I’m currently working on porting the IRPGO and reordering tasks. So far, I’ve been focusing on function ordering, primarily to reduce code size. Later, I’ll shift my focus to startup performance, which aligns with the goals of this work. I’m looking forward to seeing the upstream patches!

Hi @mingmingl-llvm, thanks for the great RFC! We’re experimenting with a similar idea for x86 client workloads with HWPGO and have observed even bigger performance gains on some large real applications, which shows this is a promising direction. We look forward to your upstream patches and can help with code review. For traditional function reordering, we have Pettis-Hansen (PH), C^3, and other algorithms that reorder based on global call graph information. Will you do similar reordering for static data?

Hello

I am trying to learn more about this space. Can you go into more detail about what needs to be done on the instrumented/sampled PGO side?

Thanks

Thanks for all the feedback and code reviews so far! It’s great to know that hot/cold data splitting and ordering is of interest for both startup and steady-state performance, and to hear the promising results from the community.

I have started upstreaming the code changes. The first one was merged yesterday, and the second is in review. I’ll keep this RFC regularly updated with major progress, and code reviews are always welcome.

It’s great to know there is interest in, and potential for, applying data partitioning to optimize binary start-up time.

Server start-up time is usually dominated by user-space activities (e.g., reading from storage or setting up network connections) to initialize server state. From this perspective, I don’t expect the typical server-side metrics to be accurate enough to fine-tune binary start-up time on our side, but we can follow up as the code changes land.

Thanks for sharing the insight! The idea is to model data affinity by the code (functions) that references the data, and to reorder it for better cache locality (and less false sharing for read-write data). I don’t know which algorithms are going to work best yet. Hopefully I’ll collect data and share it once I have a better idea.

Sure. There will be no additional instrumentation inserted into the iFDO-instrumented binary, and the compiler and linker changes will apply to both iFDO- and SPGO-optimized binaries. To provide data access profiles to compilers, we plan to extend MemProf, which has the same format for {iFDO, SamplePGO}-optimized binaries.

Let me know if you have follow-up questions!
