Summary
Prior work implemented more efficient indirect call promotion using vtable profiles for Instrumented PGO. This RFC proposes extending the SamplePGO profile format to include vtable type information, allowing the compiler to annotate vtable types on IR instructions and perform vtable-based indirect call promotion for SampleFDO. The proposal consists of two main changes:
- Profile Format: PR 148002 implements the change for the extensible binary format (the default binary format of SamplePGO) and text format. The profile format change for ext-binary is backward compatible.
- llvm-profgen: A prototype demonstrates how llvm-profgen could be extended to process perf data with Intel MEM_INST_RETIRED.ALL_LOADS samples and produce sample profiles with vtable information. For feature parity across different hardware vendors, future work could add support for AMD Instruction-Based Sampling (IBS) and ARM Statistical Profiling Extension (SPE).
We welcome your feedback and suggestions on this proposal and the accompanying pull requests.
Thanks,
Mingming, Teresa, Snehasish, David
Motivation
Similar to type profiling for Instrumented PGO, the motivating use case for vtable type information is to compare vtables for indirect call promotion in SampleFDO, moving the vtable load off the critical path and into the fallback indirect-call handling. The before-vs-after transformation is illustrated with pseudo code below.
Before | After
---|---
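Roughly, the transformation replaces a comparison on the loaded function pointer with a comparison on the vtable pointer itself, so the function-pointer load only happens on the fallback path. The following is an illustrative pseudo-code sketch (names are placeholders, not exact IR):

// Before: promote on the virtual function address; both the vtable load and
// the function-pointer load sit on the hot path.
vptr = obj->vptr;                  // load vtable pointer
fp = vptr[slot];                   // load function pointer from the vtable
if (fp == &Derived1::func)
  Derived1::func(obj, args);       // promoted direct (inlinable) call
else
  fp(obj, args);                   // fallback indirect call

// After: promote on the vtable address; the function-pointer load moves into
// the fallback indirect-call path.
vptr = obj->vptr;                  // load vtable pointer
if (vptr == &vtable_of_Derived1)
  Derived1::func(obj, args);       // promoted direct (inlinable) call
else {
  fp = vptr[slot];                 // load function pointer only on fallback
  fp(obj, args);                   // fallback indirect call
}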
What does the profile look like
The before-vs-after of a function's profile is illustrated in text format below. The highlighted lines show the vtable type profiles for both body samples and inlined callsite samples. The keyword vtables indicates that a line records vtable counts for virtual calls.
The vtable counters represent the raw counters converted from memory access profiles, and should therefore be reasoned about within one line location as the vtables' relative frequencies. In practice, the memory access events and LBR events may be collected independently of each other and with different sampling periods. When the profile is used to compile programs, an optimizer can scale the vtable counters by their relative frequencies to make them comparable with LBR-based counters; for example, if the LBR-based counter at a callsite is 1000 and the raw vtable counters are 30 and 10, the scaled vtable counters would be 750 and 250. This leaves more flexibility for compiler heuristic tuning.
Original Sample Profile | Sample Profile with vtable info
---|---
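As a rough illustration (the exact text-format syntax is defined in PR 148002, so treat this as an approximation), a virtual callsite's body line and the accompanying vtables line could look like:

3: 543499 _ZN8Derived14funcEii:132878 _ZN12_GLOBAL__N_18Derived24funcEii:410621
3: vtables: _ZTV8Derived1:1377 _ZTVN12_GLOBAL__N_18Derived2E:4250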
Description of Implementation
In-memory Representation of VTable Type Profiles
Currently, the class FunctionSamples keeps track of a function's profile in memory. It consists of the head sample count, the total sample count, and the profiles for each line location inside this function. Each line location represents the relative line offset and discriminator; the per-location samples are either a function's sample profile for an inlined callee, or a SampleRecord otherwise. To store vtable type profiles, a new field VirtualCallsiteTypeCounts is introduced in a function's in-memory profile. This field is a map keyed by line location, whose values are <vtable, counter> pairs. The class structure is pasted below, along with a small usage sketch.
// Key is the vtable, and value is the counter.
using TypeCountMap = std::map<FunctionId, uint64_t>;

class FunctionSamples {
  uint64_t TotalSamples;
  uint64_t TotalHeadSamples;
  std::map<LineLocation, SampleRecord> BodySamples;
  std::map<LineLocation, std::map<FunctionId, FunctionSamples>> CallsiteSamples;
  // Key is the location; value is the <vtable, counter> pairs.
  std::map<LineLocation, TypeCountMap> VirtualCallsiteTypeCounts;
};
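For illustration, merging one observed <vtable, count> sample into this structure amounts to a nested map update. A minimal sketch of a FunctionSamples helper (not the actual LLVM API) could be:

// Minimal sketch (not the actual LLVM API): merge a vtable access count
// observed at a line location into this function's type profile.
void addVirtualCallsiteTypeCount(const LineLocation &Loc, FunctionId VTable,
                                 uint64_t Count) {
  VirtualCallsiteTypeCounts[Loc][VTable] += Count;
}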
Extensible Binary Profile Format Change
The extensible-binary format is the default SamplePGO binary format. It organizes different data payloads into separate sections. Each section's header uses a bitmap (referred to as a flag in the actual implementation) to describe the section's features. In this format, a name table section stores de-duplicated function names and linearizes each name to an integer ID, and the function-profile section references function names by their integer IDs.
The following changes are proposed for the ext-binary profile format:
- To represent vtable names, the name table section will store and linearize virtual table names in addition to function names.
- The in-memory representation of vtables (the field VirtualCallsiteTypeCounts in class FunctionSamples above) is serialized as part of the FunctionSamples. Implementation-wise, the vtables are serialized after the inlined callsites within a FunctionSamples. The LineLocation class and counters are serialized as they currently are, and vtable strings are represented by their linearized IDs in the name table.
- For flag configurability, we introduce an LLVM boolean option (named extbinary-write-vtable-type-prof) that is off by default. The ext-binary profile writer reads the boolean option and decides whether to set the section flag bit SecFlagHasVTableTypeProf and serialize the class member VirtualCallsiteTypeCounts.
- For backward compatibility, we take one bit from SecProfSummaryFlags to indicate whether the SampleFDO profile has vtable information. The profile reader reads the section flag bit to know whether the binary profile has type profiles. The enum class definition is pasted below with existing and new fields, followed by a small sketch of the bit check.
enum class SecProfSummaryFlags : uint32_t {
SecFlagInValid = 0,
/// SecFlagPartial means the profile is for common/shared code.
/// The common profile is usually merged from profiles collected
/// from running other targets.
SecFlagPartial = (1 << 0),
/// SecFlagContext means this is context-sensitive flat profile for CSSPGO
SecFlagFullContext = (1 << 1),
/// SecFlagFSDiscriminator means this profile uses flow-sensitive discriminators.
SecFlagFSDiscriminator = (1 << 2),
/// SecFlagIsPreInlined means this profile contains ShouldBeInlined contexts thus this is CS preinliner computed.
SecFlagIsPreInlined = (1 << 4),
/// The new bit. SecFlagHasVTableTypeProf means this profile contains vtable type profiles.
SecFlagHasVTableTypeProf = (1 << 5),
};
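For illustration, the reader-side check on the new bit is a plain bitmask test against the flag value above (a minimal sketch; the helper name is an assumption, not the actual reader code):

// Minimal sketch: given the SecProfSummary section flags read from the
// profile header, test whether vtable type profiles are present.
bool hasVTableTypeProf(uint32_t SectionFlags) {
  return (SectionFlags &
          static_cast<uint32_t>(SecProfSummaryFlags::SecFlagHasVTableTypeProf)) != 0;
}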
Extend llvm-profgen to generate vtable profiles
To generate SampleFDO profiles from branch records, llvm-profgen processes Linux perf script output containing hardware samples and memory map (mmap) events, and uses the mmap events to symbolize the instruction pointers, obtaining each IP's inline stack and location anchor. It populates the in-memory representation of function samples and serializes them as profiles.
A sample record of a MEM_INST_RETIRED.ALL_LOADS event essentially gives the memory address accessed and the instruction pointer of the load instruction. As with the instruction pointers symbolized for SampleFDO, symbolizing the load instruction gives the inline stack of the instruction. If the data address is symbolizable and the symbol has the _ZTV prefix [1], the instruction loads from a vtable. llvm-profgen can parse a perf raw trace [2] of the memory access events into a list of <ip, data-symbol, count> tuples. Using the symbolized inline stack of the instruction as an anchor, llvm-profgen can further process the tuples into a list of vtable symbols and their counts for each <function, LineLocation>.
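For illustration, the aggregation step can be thought of as the following sketch (the data structures are illustrative assumptions, not llvm-profgen's actual classes):

#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Illustrative sketch: aggregate symbolized memory-access samples into
// per-<function, LineLocation> vtable counts.
struct LineLocation {
  uint32_t LineOffset = 0;
  uint32_t Discriminator = 0;
  bool operator<(const LineLocation &O) const {
    return std::tie(LineOffset, Discriminator) <
           std::tie(O.LineOffset, O.Discriminator);
  }
};

struct MemAccessSample {
  std::string Function;   // innermost frame of the symbolized inline stack
  LineLocation Loc;       // line offset and discriminator of the load
  std::string DataSymbol; // symbol of the accessed address, e.g. "_ZTV8Derived1"
  uint64_t Count = 0;
};

// <function, LineLocation> -> <vtable symbol, count>
using VTableCounts = std::map<std::pair<std::string, LineLocation>,
                              std::map<std::string, uint64_t>>;

VTableCounts aggregate(const std::vector<MemAccessSample> &Samples) {
  VTableCounts Out;
  for (const MemAccessSample &S : Samples) {
    // Only loads whose data symbol is a vtable (Itanium ABI "_ZTV" prefix)
    // contribute to the vtable type profile.
    if (S.DataSymbol.rfind("_ZTV", 0) != 0)
      continue;
    Out[{S.Function, S.Loc}][S.DataSymbol] += S.Count;
  }
  return Out;
}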
A prototype is implemented for non-context-sensitive SampleFDO. The source code and operations are attached in the Appendix section. Invoking llvm-profgen for SampleFDO profiles with this change gives the following entries:
# generate profile
$ ./bin/llvm-profgen --perfscript=path/to/lbr-perf.script --data-access-profile=path/to/dap-perf.txt --binary=path/to/main --format=text --pid=<pid> -ignore-stack-samples -use-dwarf-correlation -o main.afdo
# show profile
$ ./bin/llvm-profdata show --sample --function=_Z9loop_funciii src-pie/main.profgen.afdo
# The terminal output of 'llvm-profdata show'
Samples collected in the function's body {
0: 636241
1: 681458, calls: _Z10createTypei:681458
3: 543499, calls: _ZN12_GLOBAL__N_18Derived24funcEii:410621 _ZN8Derived14funcEii:132878
3: vtables: _ZTV8Derived1:1377 _ZTVN12_GLOBAL__N_18Derived2E:4250
6.1: 602201, calls: _ZN12_GLOBAL__N_18Derived2D0Ev:454635 _ZN8Derived1D0Ev:147566
5.1: vtables: _ZTV8Derived1:227 _ZTVN12_GLOBAL__N_18Derived2E:765
7: 511057
}
To interpret the data, we can examine the virtual function calls at line offset 3. In the source code, loop_func at line 3 calls Derived2::func and Derived1::func for roughly 3/4 and 1/4 of all calls. Both the indirect call profiles and the vtable profiles show a rough 3:1 ratio (410621:132878 and 4250:1377, respectively, both about 3.1:1). Similarly, a rough 3:1 ratio is expected in the profile of the second virtual call. For the demo program, which exercises a simple loop, it's observed that a perf.data with MEM_INST_RETIRED.ALL_LOADS collected over a short time window (e.g., 30 seconds) at a high sampling frequency could have biased vtable counters. In a real SampleFDO profile-and-compile set-up, profile counters are sampled from data center workloads at a much lower frequency and in a continuous manner, so presumably the bias is mitigated by the larger statistical sample. Besides, the compiler can statically analyze the vtable-to-virtual-function mapping in the whole program to make the numbers more usable.
Iterative Compilation
Unlike Instrumented PGO, SamplePGO is often used in an iterative compilation set-up, meaning the profiles are collected from SamplePGO-optimized binaries. When the profiled binary is optimized with vtable-based indirect call promotion and the vtable load instructions are thereby optimized away, the memory load profiles won't capture vtable load events.
Acknowledging the importance of stable performance in an iterative compilation set-up, this RFC and the patches aim to make the SamplePGO profile format change, the first step towards the end goal. One potential solution that we plan to pursue is to record the vtable-based transformations as metadata (e.g., for each machine instruction that loads a vtable from an object, the list of vtables and their counts) in non-alloc ELF sections, and to use this profile mapping metadata together with the memory access profiles to reconstruct the vtable profiles. This borrows the idea of recording profile mapping metadata in the binary from the Propeller basic block address map [3] and PC-keyed metadata, and the solution could be generalized for other passes that face iterative compilation challenges [4].
Appendix
Source Code
- lib.h
#include <stdio.h>
#include <stdlib.h>
class Base {
public:
virtual int func(int a, int b) = 0;
virtual ~Base() {};
};
class Derived1 : public Base {
public:
int func(int a, int b) override;
~Derived1() {}
};
__attribute__((noinline)) Base *createType(int a);
- lib.cpp
#include "lib.h"
namespace {
class Derived2 : public Base {
public:
int func(int a, int b) override { return a * (a - b); }
~Derived2() {}
};
} // namespace
int Derived1::func(int a, int b) { return a * (a - b); }
Base *createType(int a) {
Base *base = nullptr;
if (a % 4 == 0)
base = new Derived1();
else
base = new Derived2();
return base;
}
- main.cpp
#include "lib.h"
#include <iostream>
#include <chrono>
#include <thread>
__attribute__((noinline)) int loop_func(int i, int a, int b) {
Base *ptr = createType(i);
int sum = ptr->func(a, b);
delete ptr;
return sum;
}
int main(int argc, char **argv) {
int sum = 0;
auto startTime = std::chrono::steady_clock::now();
// run the program long enough to make manual sampling easier
std::chrono::minutes duration(7200);
while(std::chrono::steady_clock::now() - startTime < duration) {
for (int i = 0; i < 100000; ++i) {
sum += loop_func(i, i + 1, i + 2);
}
}
printf("total sum is %d\n", sum);
return 0;
}
Operations
- Compile and run source code
# Build a position dependent binary so runtime addr is the same as static virtual addr for parsing
# profiles.
./bin/clang++ -v -static -g -O2 -fdebug-info-for-profiling -fno-omit-frame-pointer -fpseudo-probe-for-profiling -fno-unique-internal-linkage-name -fuse-ld=lld lib.cpp main.cpp -o main
# run the program and get its process id
./main &
- Run Linux Perf to collect memory access perf.data, and generate its raw trace
perf5 record -a -c <prime-number-as-sampling-period> -p <pid> -d --pfm-events MEM_INST_RETIRED.ALL_LOADS:u:pinned:precise=3 -g -i -m 16 --buildid-mmap -N -o - --skip-hugepages sleep 30 | perf5 inject -b --buildid-all -i - -o dap-perf.data
perf5 report -D -i dap-perf.data >dap-perf.txt
- Run Linux Perf to collect LBR, and generate its perf script
perf5 record --pfm-events br_inst_retired:near_taken:u -c <prime-number-as-sampling-period> -b -p <pid> -o lbr-perf.data sleep 30
perf5 script -F ip,brstack -i lbr-perf.data --show-mmap-event &> lbr-perf.script
Running llvm-profgen [5] gives the function profile with vtable counts.
[1] assuming the Itanium C++ ABI is used
[2] a raw trace from perf report -D is used to address a tooling limitation: perf script doesn't print the leaf address in perf script -F ip for the memory access profiles
[3] explained in section 3.2 of Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
[4] explained in sections 4.1.4 and 5.2 of the paper AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
[5] ./bin/llvm-profgen --perfscript=path/to/lbr-perf.script --data-access-profile=path/to/dap-perf.txt --binary=path/to/main --format=text --pid= -ignore-stack-samples -use-dwarf-correlation -o main.afdo