Overview
We propose to enrich IRPGO profiles with dynamic type information and to leverage that information, with a cost-benefit analysis, to choose the best comparison sequence for speculative indirect call promotion. This removes a vtable load from the critical path. Our prototype (patches 1, 2 and 3) improves throughput by ~0.3% on some internal workloads. We’d like to share the design and results from our prototype and get feedback so we can upstream it.
Thanks,
Mingming, Teresa, David, Snehasish
Background
The profile-guided indirect call promotion pass improves performance by improving branch prediction and, more importantly, by exposing function inlining opportunities. Currently IRPGO has two kinds of value profiling: indirect call targets and mem* op sizes. Value profile intrinsics are inserted at value profiling sites, and the following information is collected for indirect call target profiling:
- For an indirect callsite, the list of runtime addresses of the called function and their counts.
- Per function profile information, including the MD5 hash of function names and runtime addresses of a function.
- (Compressed) Function names.
The MD5 hash bridges runtime addresses to symbol (function) names, which lets us discover the top 3 indirect call function targets of a given callsite for speculative indirect call promotion.
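For illustration, the bridging step above can be sketched in C++. All the container types and names here are hypothetical, chosen for clarity; they are not the actual LLVM or compiler-rt data structures:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch: resolve profiled runtime addresses to function names via
// their MD5 hashes, then keep the N hottest targets by count.
struct TargetCount {
  std::string Name;
  uint64_t Count;
};

std::vector<TargetCount>
topTargets(const std::map<uint64_t, uint64_t> &AddrToCount,  // runtime addr -> count
           const std::map<uint64_t, uint64_t> &AddrToMD5,    // runtime addr -> MD5(name)
           const std::map<uint64_t, std::string> &MD5ToName, // MD5(name) -> name
           size_t N) {
  std::vector<TargetCount> Targets;
  for (const auto &[Addr, Count] : AddrToCount) {
    auto HashIt = AddrToMD5.find(Addr);
    if (HashIt == AddrToMD5.end())
      continue; // address not seen in any per-function profile data
    auto NameIt = MD5ToName.find(HashIt->second);
    if (NameIt != MD5ToName.end())
      Targets.push_back({NameIt->second, Count});
  }
  // Sort by descending count and keep the hottest N.
  std::sort(Targets.begin(), Targets.end(),
            [](const TargetCount &A, const TargetCount &B) {
              return A.Count > B.Count;
            });
  if (Targets.size() > N)
    Targets.resize(N);
  return Targets;
}
```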
At the 2020 Meeting C++ conference, Teresa Johnson gave a talk about whole program devirtualization that mentions relevant optimizations on indirect call promotion and further devirtualization opportunities. This work implements the vtable-comparison optimization mentioned in the talk.
Motivating Use Case
The motivating use case of dynamic type information is to compare vtables for indirect-call-promotion.
Given a virtual call ptr->func(), the comparison sequence inserted by the current indirect call promotion pass is
vptr = ptr->_vptr
func_ptr = *(vptr + function-offset); // vtable load
if (func_ptr == HotType::func)
HotType::func(); // highly frequent path
else
call func_ptr(); // less likely path
The proposed vtable comparison gives the following sequence, moving the vtable load of func_ptr out of the critical path and into the fallback indirect-call path
vptr = ptr->_vptr;
if (vptr == &vtable_HotType)
HotType::func() // highly frequent path
else { // less likely path
func_ptr = *(vptr + function-offset) // vtable load
call func_ptr
}
Design
Dynamic type profile generation and annotation
Overall, the goal is to detect virtual table load instructions and profile the underlying virtual table types. The vtable type profiling will dump the following information into raw profiles.
- The vtable addresses used for virtual calls.
- For each virtual table object:
  - The beginning and end addresses.
  - The MD5 hash and name of the virtual table object.
By looking up an address in the (sorted) address ranges, a runtime address can be associated with the corresponding virtual table object and annotated as an MD5 hash with counters in value profile metadata.
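A minimal sketch of this range lookup, assuming sorted, non-overlapping ranges; the struct and field names are illustrative, not the actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: associate a profiled runtime address with a vtable object by
// binary-searching sorted, non-overlapping address ranges.
struct VTableRange {
  uint64_t Start; // start address of the vtable object
  uint64_t End;   // one past the last byte
  uint64_t MD5;   // MD5 hash of the vtable's name
};

// Ranges must be sorted by Start. Returns 0 (no symbol) for a false
// positive, i.e. an address outside every vtable object.
uint64_t lookupVTableHash(const std::vector<VTableRange> &Ranges,
                          uint64_t Addr) {
  auto It = std::upper_bound(
      Ranges.begin(), Ranges.end(), Addr,
      [](uint64_t A, const VTableRange &R) { return A < R.Start; });
  if (It == Ranges.begin())
    return 0;
  --It; // last range whose Start <= Addr
  return (Addr < It->End) ? It->MD5 : 0;
}
```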
Binary instrumentation
Pick instrumentation points and instrument runtime types
Naturally, vtable value profile intrinsics will be inserted by the pgo-instr-gen pass. The value profile intrinsic llvm.instrprof.value.profile is used to instrument runtime addresses of vtables. A new value type IPVK_VTableTarget will be introduced in ValueProfKind to indicate that the annotated value is the MD5 hash of a vtable object.
Since the vtable profile is just one kind of value profile with its own type, the existing value profiling infrastructure works out of the box to track and store its content in raw profiles, and indexed profiles store per-function value profiles of all types. No additional profile format changes are needed to store the actual values.
A heuristic is used to pick the instrumentation points: it looks at the address from which an indirect callee is loaded and strips the constant offsets, if any, to get a base address. This base address is the profiled address.
This heuristic profiles both real vtable addresses and non-vtable address-feeding instructions (false positives). We will record the address ranges of vtable definitions (elaborated in the next subsection). By looking up the profiled address in the address ranges, a false-positive address gets associated with no symbol [1], and the optimizer pass will drop false-positive addresses when using the profiles. In other words, false positives could increase the instrumentation overhead but don’t affect profile use. To eliminate the overhead, a possible refinement is to insert type intrinsics (e.g., llvm.type.test) under the clang option -fprofile-generate so that only type-tested addresses, which will be vtables, are instrumented.
Here is an example. The following C++ code snippet has two indirect calls, one callee is a virtual function and the other is a captured function pointer. We’ll show a snippet of IR with the value profile intrinsic added.
class Base {
public:
virtual int func();
};
typedef int(*FuncPtr)(int);
class FuncPtrWrapper {
public:
FuncPtr* funcptr;
};
int func(Base* b, FuncPtrWrapper* wrapper) {
return wrapper->funcptr[0](b->func());
}
The simplified IR [2] with vtable type profiling is shown below
define i32 @_Z4funcP4BaseP14FuncPtrWrapper(ptr %b, ptr %wrapper) {
entry:
  ; 1
  call void @llvm.instrprof.increment(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i32 1, i32 0)
  %0 = load ptr, ptr %wrapper
  ; usually a no-op
  %1 = ptrtoint ptr %0 to i64
  ; 2
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %1, i32 2, i32 1)
  %2 = load ptr, ptr %0
  %vtable = load ptr, ptr %b
  ; usually a no-op
  %3 = ptrtoint ptr %vtable to i64
  ; 3
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %3, i32 2, i32 0)
  %4 = load ptr, ptr %vtable
  ; usually a no-op
  %5 = ptrtoint ptr %4 to i64
  ; 4
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %5, i32 0, i32 0)
  %call = call i32 %4(ptr nonnull align 8 dereferenceable(8) %b)
  ; usually a no-op
  %6 = ptrtoint ptr %2 to i64
  ; 5
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %6, i32 0, i32 1)
  %call1 = call i32 %2(i32 %call)
  ret i32 %call1
}
In the IR, instructions with a comment on the previous line are inserted by IRPGO; the rest are original IR. 2 and 3 (along with the ptrtoint on the previous line) will be inserted by type profiling, and {1, 4, 5} are inserted by existing IRPGO. Note that 3 is a real profiled address and 2 is a false positive. Since the profiled value from 2 is not within the address range of any vtable object, the optimizer pass will find no vtable symbol and skip the transformation for it at profile-use time.
Collect vtable object address range, names and MD5 hash of names
We’ll collect the following information from the instrumented binary and runtime.
- The address range (i.e., start and end address) of each virtual table object, and the MD5 hash of the vtable’s global name.
- The global names of virtual tables.
To find virtual tables in a binary, we propose to insert LLVM type metadata for vtable objects under the clang option -fprofile-generate. If a variable is annotated with type metadata, we regard it as a virtual table definition and profile its address range.
In the current instrumented PGO, one object-file section is created for each type of profile data in the instrumented binary, and data instances are stored contiguously in this section.
For vtable profiling purposes, two new object-file sections are introduced in the instrumented binary [3]
- A section named __llvm_prf_vtabnames to record the compressed names of virtual table variables.
- A section named __llvm_prf_vtab to record the metadata of virtual tables. This metadata section for vtables is similar to what __llvm_prf_data does for functions.
The vtable profile data in C++ looks like this
struct VTableProfileInfo {
  // The start address is collected at runtime. Its width
  // (uint32_t or uint64_t) depends on the system.
  uint64_t VTableStartAddress;
  // The byte size of the vtable. The address range
  // [start, start + size) is used for lookup.
  uint32_t VTableByteSize;
  uint64_t MD5HashOfName;
};
We measured the section sizes on the prototyped workload. Using llvm-objdump -h output, the binary size increase from the two new sections is 0.4%.
Profile format changes for type profiling
The IRPGO profiles currently come in three forms.
- The raw profile format, generated by the profiler runtime.
- The indexed profile format, generated by llvm-profdata.
- The text profile format, for user inspection.
The proposed changes in profile format are described as follows.
Raw Profile Format
As mentioned above, the actual vtable value profiles from instrumented points are stored in the existing sections without additional format changes.
The IRPGO raw profile is a memory dump of the data; the raw profile header indicates the layout of the different sections and records the metadata needed to infer the byte offset of each section. To include type profiles, we will add two sections to raw profiles as illustrated in fig 1 and bump INSTR_PROF_RAW_VERSION by one.
+-----------------------+---+
| Magic | |
+-----------------------+ |
| Version | H
+-----------------------+ E
+--- | Other Existing | A
| | Header Fields | D
| +-----------------------+ E
| +--- | Count of | R
| | | VTableProfData | |
| | +-----------------------+ |
| | | Byte Size of | |
infer offset | | | VTable Names | |
| | +-----------------------+---+
| | | Existing sections |
| | +-----------------------+---+
+----> | VTableProfData | N
| +-----------------------+ E
+--> | VTable Names | W
+-----------------------+---+
fig 1: type profiles in raw profile format
Similar to the existing IRPGO, the raw profile writer would leverage linker-inserted section symbols in ELF [3] to find the vtable names and VTableProfData and create memory dumps of the C++ structs. The raw profile reader would infer the section byte offsets and read the section byte sizes directly from the profile header to deserialize the data.
Currently there is already a section storing compressed function names in the binary and raw profile. We propose to store compressed function names and compressed vtable names separately in the binary and raw profile. This way, raw profile readers can conveniently access each set of names by reading the corresponding section, and the implementation is straightforward. If function names and vtable names were stored together, the raw profile reader would need to join the MD5 hashes from per-function profile data with the decompressed set of names to infer which names are not functions. Looking at the joint data introduces additional RAM usage and map lookups in profile readers.
If a binary is compiled with -fPIE and runs with ASLR enabled, the profiled address of the same instruction (function or vtable) can vary from run to run. Additional information [4] is needed to merge value profiles of addresses correctly across settings (PIE or not, ASLR or not). To begin with, we propose to disallow merging raw profiles that contain value profiles as part of this effort. For the profile merging use case, converting individual raw profiles to the indexed format and merging the indexed profiles works.
Indexed Profile Format
As mentioned above, the vtable value profile is just one kind of value profile with its own type, so indexed profiles can store per-function value profiles without additional format changes.
When llvm-profdata reads raw profiles, the vtable runtime addresses will be mapped back to the MD5 hashes of virtual table names and recorded in the per-function, per-CFG-hash InstrProfRecord. This follows the current implementation, which maps function runtime addresses to function MD5 hashes. As a result, indexed-format profiles don’t need to store an equivalent of VTableProfileInfo.
To display human-readable virtual table types (e.g., to support llvm-profdata show --function=foo --show-vtables like the existing llvm-profdata show --function=foo --ic-target), we propose to add a new section to the indexed profile to store the compressed vtable names and bump INSTR_PROF_INDEX_VERSION by one.
The proposed indexed profile format is displayed in fig 2.
+-----------------------+---+
| Magic | |
+-----------------------+ H
| Version | E
+-----------------------+ A
| Other Existing | D
| Header Fields | E
+-----------------------+ R
+----- | VTable Names Offset | |
| +-----------------------+---+
| | Other Existing |
| | Payloads |
| +-----------------------+
+----> | Compressed |
| VTableNames |
+-----------------------+
fig 2: proposed changes in indexed profile format
Text Profile Format
For human inspection, text profiles organize and store the profiles by function and display human-readable function names for indirect callees. The profiles (including counters and value profiles) of one function are stored together, one function after another, as illustrated in fig 3.
+-----------------------+
| Text Header |
+-----------------------+
| Text profile for |
| function1 with cfg1 |
+-----------------------+
| Text profile for |
| function1 with cfg2 |
+-----------------------+
| Text profile for |
| function2 with cfg3 |
+-----------------------+
| Text profile for |
| function2 with cfg4 |
+-----------------------+
fig 3: text format
As such, we’ll store human-readable vtable names directly. This doesn’t require any format change in text profiles; we just need to teach the text reader/writer to understand the new value type IPVK_VTableTarget and vtable names.
Here is an example extending the previous C++ code snippet into a C++ executable [5]. The text profile is shown below. The first indirect callsite collects two candidate functions, _ZN4Base4funcEv and _ZN7Derived4funcEv, and the second indirect callsite collects _Z4echoi and _Z6squarei. The first profiled vtable site has two types, _ZTV4Base and _ZTV7Derived, and the second profiled site is a false positive, shown as external symbols.
:ir
main
# Func Hash:
783881049970552793
# Num Counters:
3
# Counter Values:
500
500
1
# Num Value Kinds:
2
# ValueKind = IPVK_IndirectCallTarget:
0
# NumValueSites:
2
2
_ZN4Base4funcEv:500
_ZN7Derived4funcEv:500
2
_Z4echoi:500
_Z6squarei:500
# ValueKind = IPVK_VTableTarget:
2
# NumValueSites:
2
2
_ZTV4Base:500
_ZTV7Derived:500
2
** External Symbol **:500
** External Symbol **:500
Type Profile Annotation in Optimized Builds
Thanks to existing value profile infrastructure in IRPGO, using a new kind of value type in an optimized build is straightforward.
At profile-use time of IRPGO, the PGOInstrumentationUse pass annotates counters and values from a per-function profile record; the pass applies profiles only if the CFG hash from the profile matches the one computed from the IR. As a result, the pass can precisely pick the instrumented instructions for each value type and annotate value profile metadata on them.
Using the previous C++ example, the annotated IR will be
define i32 @_Z4funcP4BaseP14FuncPtrWrapper(ptr %b, ptr %wrapper) {
entry:
  %0 = load ptr, ptr %wrapper
  %1 = load ptr, ptr %0
  %vtable = load ptr, ptr %b, !prof !0           ; 1
  %2 = tail call i1 @llvm.public.type.test(ptr %vtable, metadata !"_ZTS4Base")
  tail call void @llvm.assume(i1 %2)
  %3 = load ptr, ptr %vtable
  %call = tail call i32 %3(ptr %b), !prof !1     ; 2
  %call1 = tail call i32 %1(i32 %call), !prof !2 ; 3
  ret i32 %call1
}
declare i1 @llvm.public.type.test(ptr, metadata)
declare void @llvm.assume(i1)
!0 = !{!"VP", i32 2, i64 1600, i64 1960855528937986108, i64 1600}
!1 = !{!"VP", i32 0, i64 1600, i64 5459407273543877811, i64 1600}
!2 = !{!"VP", i32 0, i64 1600, i64 379032081350541304, i64 1600}
The two tail call instructions (commented as 2 and 3) with !prof metadata are existing indirect function call value profiles. The load instruction with !prof (commented as 1) is the new vtable value profile. Note that non-vtable loads won’t be annotated with value profiles.
Virtual Table Definition Import
This is relevant in the context of ThinLTO.
In the postlink optimizer pipeline, we need to know the virtual functions of each vtable, and we need to import a virtual table definition referenced by vtable value profile metadata if it doesn’t exist in the module.
Currently, ThinLTO builds a list of referenced global variables for each function and imports the referenced variables. To import vtable definitions, we will synthesize reference edges to the profiled vtable definitions as read-only variable references, and assign value IDs as we do for indirect calls, since the vtable declaration symbol may not exist in the module during prelink. The existing ThinLTO infrastructure handles the rest and actually imports the definitions.
It’s worth mentioning that IRLinker already makes sure the virtual functions referenced by imported virtual table definitions at least have declarations in the module, even if the virtual function definitions are not imported. So no additional work is needed here to avoid unknown functions being referenced in a vtable.
Cost Benefit Analysis
Comparing vtables does not always give the best comparison sequence. For instance, if there is one dominant target function reachable through five vtable candidates, comparing vtables could introduce too many compare instructions, whereas comparing functions costs one load and one compare. For the cost-benefit analysis, parameters are introduced to control the number of additional ALU operations (icmp or address calculation) allowed for the vtable-based comparison sequence. Our tuning shows that more than 4 additional ALU operations no longer helps performance.
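As a rough sketch of the decision involved, the following is not the actual cost model; the names and the exact formula are illustrative, and only the 4-operation threshold comes from the tuning described above:

```cpp
#include <cstddef>

// Sketch: decide between comparing function pointers and comparing
// vtable addresses at one promoted callsite.
// Function comparison costs one vtable load plus one compare per
// promoted target; vtable comparison costs one compare per candidate
// vtable but avoids the load on the hot path.
bool shouldCompareVTables(size_t NumPromotedTargets,
                          size_t NumVTableCandidates,
                          size_t MaxExtraALUOps = 4) {
  if (NumVTableCandidates <= NumPromotedTargets)
    return true; // no extra compares, and the hot-path load is saved
  size_t ExtraOps = NumVTableCandidates - NumPromotedTargets;
  return ExtraOps <= MaxExtraALUOps; // tuned threshold from above
}
```

In the real pass such a decision would also weigh branch probabilities and per-target cost parameters, which this sketch leaves out.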
In general, the parameters should be configurable and overridable per backend target. For instance, aarch64 needs two instructions to materialize large constants (vtable addresses), whereas x86 can typically encode an immediate in a single machine instruction, so the costs in instruction cache and register usage differ.
Note that vtable-based selection not only removes a load from the critical path but also exposes jump-threading opportunities.
To illustrate, the original code is
ptr->func1();
ptr->func2();
Comparing functions gives the following code snippet
vtbl_ptr = ptr->_vptr;
func_ptr = *(vtbl_ptr + offset_of_func1);
if (func_ptr == HotType::func1)
HotType::func1(); // hot path
else
func_ptr(); // cold path
func_ptr = *(vtbl_ptr + offset_of_func2);
if (func_ptr == HotType::func2)
HotType::func2(); // hot path
else
func_ptr(); // cold path
Comparing virtual tables gives the following code snippet
vtbl_ptr = ptr->_vptr;
if (vtbl_ptr == &HotTypeVTable)
{
HotType::func1();
HotType::func2();
}
else
{
func_ptr = *(vtbl_ptr + offset_of_func1);
func_ptr();
func_ptr = *(vtbl_ptr + offset_of_func2);
func_ptr();
}
The latter sequence not only has fewer branches but also exposes further devirtualization opportunities after inlining. For instance, if HotType::func1() is inlined, some virtual calls inside it could be statically devirtualized with class-hierarchy analysis.
Status
The prototype has been implemented, and a performance test was done on one internal workload. The baseline binary compares function addresses, whereas the test binary compares vtable addresses with a tuned cost-benefit analysis. The test binary shows a geomean QPS increase of 0.3% over the baseline on some internal workloads. The patches are either drafted or already in review.
Follow up Work
In order to get precise type profiles for vtable loads in sample-based PGO, we’ll need data access profiling. For instance, the MEM_INST_RETIRED.ALL_LOADS_PS precise event on Intel processors samples retired load instructions, recording the data address actually accessed. The smaps data is already recorded in the Sample PGO profile, so translation from runtime addresses to static addresses (and therefore to symbol names and sizes) can be done. The Sample PGO profile format needs to be extended to store value profiles.
With dynamic type information in the profiles, profile-guided interprocedural type propagation and type-profile-driven function cloning become feasible. Meanwhile, importing vtable definitions based on hotness could serve the goal of importing function declarations for better call-graph sorting with more complete call graph information. Another important use case of type profiling is vtable layout optimization based on vtable hotness, part of the RO data section splitting effort that Snehasish Kumar works on.
[1] More precisely, since a non-vtable profiled address is not within the address range of any vtable object, it’s stored as zero in indexed profiles. A pass that looks up a symbol with a zero hash will (almost) always find nullptr and skip the actual transformation (e.g., the comparison of symbols). So the performance overhead from non-vtable profiled addresses is negligible, if it exists at all. Comparing the loaded address with the symbol address guarantees correctness.
[2] A full example from C++ to IR without the proposed vtable type profiling is in Compiler Explorer
[3] Only ELF support is implemented for the two new object-file sections at this point. Support for other platforms could be added if needed.
[4] To support merging value profiles in raw profiles generally, one option is to let the compiler runtime expose the segment address ranges of binary sections (e.g., .text* sections and .data.rel.ro) and record the segment address ranges in each individual raw profile. This way, a recorded runtime address can be converted to a static virtual address (which remains unchanged for the same binary) for each individual raw profile and merged correctly.
[5] A full C++ executable is in Compiler Explorer