[RFC] Dynamic Type Profiling and Optimizations in LLVM

Overview

We propose to enrich IRPGO profiles with dynamic type information and to leverage that information, with cost-benefit analysis, to pick the best comparison sequence for speculative indirect call promotion. This removes a vtable load from the critical path. Our prototype (patch 1, 2 and 3) shows this improves throughput by ~0.3% on some internal workloads. We'd like to share the design and results from our prototype and get feedback so we can upstream it.

Thanks,
Mingming, Teresa, David, Snehasish

Background

The profile-guided indirect call promotion pass improves performance by improving branch prediction and, more importantly, by exposing function inlining opportunities. Currently IRPGO has two kinds of value profiling: indirect call targets and mem* op sizes. Value profile intrinsics are inserted at value profiling sites, and the following information is collected for indirect call target profiling:

  1. For an indirect callsite, the list of runtime addresses of the called functions and their counts.
  2. Per-function profile information, including the MD5 hash of the function name and the runtime address of the function.
  3. (Compressed) Function names.

The MD5 hash bridges runtime addresses to symbol (function) names, which lets the compiler discover the top 3 indirect call targets of a given callsite for speculative indirect call promotion.

At the 2020 Meeting C++ Conference, Teresa Johnson gave a talk about whole program devirtualization and mentioned relevant optimizations on indirect-call promotion and further devirtualization opportunities. This work implements the vtable-comparison optimization mentioned in the talk.

Motivating Use Case

The motivating use case of dynamic type information is to compare vtables for indirect-call-promotion.

Given a virtual call ptr->func(), the current indirect call promotion pass inserts the following comparison sequence:

vptr = ptr->_vptr
func_ptr = *(vptr + function-offset);  // vtable load

if (func_ptr == HotType::func) 
 HotType::func();  // highly frequent path
else
 call func_ptr();  // less likely path

The proposed vtable comparison gives the following sequence, moving the vtable load of func_ptr out of the critical path and into the fallback indirect-call handling:

vptr = ptr->_vptr;
if (vptr == &vtable_HotType)
  HotType::func()   // highly frequent path
else  {           // less likely path
  func_ptr = *(vptr + function-offset) // vtable load
  call func_ptr
}

Design

Dynamic type profile generation and annotation

Overall, the goal is to detect virtual table load instructions and profile the underlying virtual table types. The vtable type profiling will dump the following information in raw profiles.

  1. The vtable addresses used for virtual calls.
  2. For each virtual table object:
    • The beginning and end addresses.
    • The MD5 hash and the name of the virtual table object.

By looking up an address in the (sorted) address ranges, a runtime address can be associated with the corresponding virtual table object, and annotated as an MD5 hash with counters in value profile metadata.
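
To make the look-up concrete, here is a minimal sketch of the address-to-vtable mapping, assuming per-vtable records with a start address, byte size and MD5 hash of the name as listed above (the struct and function names are illustrative, not the actual implementation):

#include <algorithm>
#include <cstdint>
#include <vector>

struct VTableAddrRange {
  uint64_t Start;     // start address of the vtable object
  uint64_t ByteSize;  // byte size of the vtable object
  uint64_t NameMD5;   // MD5 hash of the vtable's global name
};

// Ranges must be sorted by Start and be non-overlapping.
uint64_t lookupVTableMD5(const std::vector<VTableAddrRange> &Ranges,
                         uint64_t ProfiledAddr) {
  // Find the first range starting after the profiled address, then step back.
  auto It = std::upper_bound(
      Ranges.begin(), Ranges.end(), ProfiledAddr,
      [](uint64_t Addr, const VTableAddrRange &R) { return Addr < R.Start; });
  if (It == Ranges.begin())
    return 0;  // not inside any vtable
  --It;
  if (ProfiledAddr < It->Start + It->ByteSize)
    return It->NameMD5;
  return 0;  // past the end of the nearest vtable
}

Addresses that fall outside every range correspond to the false positives discussed in the next subsection.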

Binary instrumentation

Pick instrumentation points and instrument runtime types

Naturally, vtable value profile intrinsics will be inserted by the pgo-instr-gen pass. The value profile intrinsic llvm.instrprof.value.profile is used to instrument runtime addresses of vtables. A new value kind IPVK_VTableTarget will be introduced in ValueProfKind to indicate that the annotated value is the MD5 hash of a vtable object.

Since the vtable profile is just one kind of value profile with its own kind, the existing value profiling infrastructure works out of the box to track and store its content into raw profiles, and indexed profiles store per-function value profiles of all kinds. No additional profile format changes are needed to store the actual values.

A heuristic is used to pick the instrumentation points. This heuristic looks at the address from which an indirect callee is loaded and strips the constant offsets, if any, to get a base address. This base address is the profiled value.
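
For illustration, a hedged sketch of this base-address stripping (not the actual pass code; it assumes the load of the indirect callee has already been identified, and the function name is made up) could look like:

#include "llvm/ADT/APInt.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Given the load that produces the indirect callee, walk back through any
// constant offsets (GEPs, casts) on its pointer operand; what remains is the
// base address that gets instrumented.
static Value *getProfiledBaseAddress(LoadInst *FuncPtrLoad,
                                     const DataLayout &DL) {
  Value *Addr = FuncPtrLoad->getPointerOperand();
  APInt Offset(DL.getIndexTypeSizeInBits(Addr->getType()), 0);
  return Addr->stripAndAccumulateConstantOffsets(DL, Offset,
                                                 /*AllowNonInbounds=*/true);
}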

This heuristic profiles both real vtable addresses and non-vtable addresses feeding such instructions (false positives). We will record the address ranges of vtable definitions (elaborated in the next subsection). By looking up the profiled address in those address ranges, a false-positive address gets associated with no symbol [1], and the optimizer pass will drop false-positive addresses when using the profiles. In other words, false positives can increase the instrumentation overhead but don't affect profile use. To eliminate the overhead, a possible refinement is to insert type intrinsics (e.g., llvm.type.test) under the clang option -fprofile-generate so that only type-tested addresses, which will be vtables, are instrumented.

Here is an example. The following C++ code snippet has two indirect calls, one callee is a virtual function and the other is a captured function pointer. We’ll show a snippet of IR with the value profile intrinsic added.

class Base {
public:
  virtual int func();
};

typedef int (*FuncPtr)(int);

class FuncPtrWrapper {
public:
  FuncPtr* funcptr;
};

int func(Base* b, FuncPtrWrapper* wrapper) {
  return wrapper->funcptr[0](b->func());
}

The simplified IR [2] with vtable type profiling is shown below

define i32 @_Z4funcP4BaseP14FuncPtrWrapper(ptr %b, ptr %wrapper) {
entry:
  ; 1
  call void @llvm.instrprof.increment(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i32 1, i32 0)
  %0 = load ptr, ptr %wrapper
  ; usually a no-op
  %1 = ptrtoint ptr %0 to i64
  ; 2
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %1, i32 2, i32 1)
  %2 = load ptr, ptr %0
  %vtable = load ptr, ptr %b
  ; usually a no-op
  %3 = ptrtoint ptr %vtable to i64
  ; 3
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %3, i32 2, i32 0)
  %4 = load ptr, ptr %vtable
  ; usually a no-op
  %5 = ptrtoint ptr %4 to i64
  ; 4
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %5, i32 0, i32 0)
  %call = call i32 %4(ptr nonnull align 8 dereferenceable(8) %b)
  ; usually a no-op
  %6 = ptrtoint ptr %2 to i64
  ; 5
  call void @llvm.instrprof.value.profile(ptr @__profn__Z4funcP4BaseP14FuncPtrWrapper, i64 567090795815895039, i64 %6, i32 0, i32 1)
  %call1 = call i32 %2(i32 %call)
  ret i32 %call1
}

In the IR, instructions with a comment on the previous line are inserted by IRPGO, and the rest are the original IR. 2 and 3 (along with the ptrtoint on the preceding line) are inserted by type profiling, and {1, 4, 5} are inserted by existing IRPGO. Note that 3 is the real profiled address and 2 is the false positive. Since the profiled value from 2 is not within the address range of any vtable object, the optimizer pass will find no vtable symbol and skip the transformation for it at profile-use time.

Collect vtable object address range, names and MD5 hash of names

We'll collect the following information from the instrumented binary and runtime:

  1. The address range (i.e., start and end address) of a virtual table object, and the MD5 hash of the vtable’s global name.
  2. The global names of virtual tables.

To find virtual tables in a binary, we propose to insert LLVM type metadata for vtable objects under clang option -fprofile-generate. If a variable is annotated with type metadata, we regard it as a virtual table definition and profile the address range.

In the current instrumented PGO, one object-file section is created for each type of profile data in the instrumented binary, and data instances are stored contiguously in this section.

For vtable profiling purpose, two new object file sections are introduced in the instrumented binary [3]

  1. A section named __llvm_prf_vtabnames to record the compressed names of virtual table variables.
  2. A section named __llvm_prf_vtab to record the metadata of virtual tables. The metadata section for vtables is similar to what __llvm_prf_data does for functions. The vtable profile data in C++ looks like this:
struct VTableProfileInfo {
  // The start address of the vtable, collected at runtime.
  // The address width could be 32-bit or 64-bit depending on the target,
  // so the field may be uint32_t or uint64_t; uint64_t is shown here.
  uint64_t VTableStartAddress;
  // The byte size of the vtable; together with the start address it gives
  // the address range used for look-up.
  uint32_t VTableByteSize;
  // The MD5 hash of the vtable's global name.
  uint64_t MD5HashOfName;
};

We measured the section sizes of the prototyped workload with llvm-objdump -h; the binary size increase from the two new sections is ~0.4%.

Profile format changes for type profiling

The IRPGO profiles currently come in three forms.

  1. The raw profile format, generated by profiler runtime.
  2. The indexed profile format, generated by llvm-profdata.
  3. The text profile format, for user inspection.

The proposed changes in profile format are described as follows.

Raw Profile Format

As mentioned above, the actual vtable value profiles from instrumented points are stored in the existing sections without additional format changes.

The IRPGO raw profile is a memory dump of the data, and the raw profile header indicates the layout of different sections and records the metadata to infer the byte offset of each section. To include type profiles, we will add two sections in raw profiles as illustrated in fig 1 and bump INSTR_PROF_RAW_VERSION by one.

                           +-----------------------+---+
                           |        Magic          |   |
                           +-----------------------+   |
                           |        Version        |   H
                           +-----------------------+   E
                    +---   |   Other Existing      |   A
                    |      |   Header Fields       |   D 
                    |      +-----------------------+   E
                    | +--- |        Count of       |   R
                    | |    |    VTableProfData     |   |
                    | |    +-----------------------+   | 
                    | |    |    Byte Size of       |   |
infer offset         | |    |    VTable Names       |   |
                    | |    +-----------------------+---+
                    | |    |  Existing sections    |   
                    | |    +-----------------------+---+    
                    +----> |  VTableProfData       |   N  
                      |    +-----------------------+   E  
                      +--> |  VTable Names         |   W
                           +-----------------------+---+
                   fig 1: type profiles in raw profile format

Similar to existing IRPGO, the raw profile writer would leverage linker-inserted section symbols in ELF [3] to find the vtable names and VTableProfData and create memory dumps of the C++ structs. The raw profile reader would infer each section's byte offset and read the section byte size directly from the profile header to deserialize the data.

Currently there is already a section that stores compressed function names in the binary and raw profile. We propose to store compressed function names and compressed vtable names separately in the binary and raw profile. This way, raw profile readers can conveniently access each set of names by reading the corresponding section, and the implementation is straightforward. If function names and vtable names were stored together, the raw profile reader would need to join the MD5 hashes from per-function profile data with the decompressed set of names to infer which names are not functions. Looking at the joined data introduces additional RAM usage and map look-up operations in profile readers.

If a binary is compiled with -fPIE and runs with ASLR enabled, the profiled address of the same entity (function or vtable) can vary from run to run. Additional information [4] is needed to merge value profiles of addresses correctly across settings (PIE or not, ASLR or not). To begin with, as a collateral restriction of this effort, we propose to disallow merging raw profiles that contain value profiles. For the profile merging use case, converting individual raw profiles to indexed format and merging the indexed profiles would work.
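
For intuition, the conversion sketched in [4] boils down to simple address arithmetic (the function and parameter names below are illustrative, not part of the proposed format):

#include <cstdint>

// With the runtime start address and the link-time (static) start address of
// the containing segment recorded, a profiled runtime address can be rewritten
// into a static virtual address that is stable across PIE/ASLR runs.
uint64_t toStaticVirtualAddress(uint64_t RuntimeAddr,
                                uint64_t SegmentRuntimeStart,
                                uint64_t SegmentStaticStart) {
  // Only the segment base moves; the offset within the segment is invariant.
  return RuntimeAddr - SegmentRuntimeStart + SegmentStaticStart;
}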

Indexed Profile Format

As mentioned above, the vtable value profile is just one kind of value profile with its own kind, so the indexed profile can store per-function value profiles without additional format changes.

When llvm-profdata reads raw profiles, the vtable runtime addresses will be mapped back to the MD5 hashes of the virtual table names and recorded in the per-function, per-CFG-hash InstrProfRecord. This follows the current implementation that maps function runtime addresses to function MD5 hashes. As a result, indexed-format profiles don't need to store an equivalent of VTableProfileInfo.

To display the human-readable virtual table types (e.g., to support llvm-profdata show --function=foo --show-vtables like what we do with llvm-profdata show --function=foo --ic-targets), we propose to add a new section in the indexed profile to store the compressed vtable names and to bump INSTR_PROF_INDEX_VERSION by 1.

The indexed profile format is displayed in fig 2.

                      +-----------------------+---+
                      |        Magic          |   |
                      +-----------------------+   H
                      |        Version        |   E
                      +-----------------------+   A
                      |    Other Existing     |   D
                      |    Header Fields      |   E
                      +-----------------------+   R
               +----- |   VTable Names Offset |   |
               |      +-----------------------+---+ 
               |      |    Other Existing     |   
               |      |    Payloads           |   
               |      +-----------------------+  
               +----> |    Compressed         |  
                      |    VTableNames        |      
                      +-----------------------+   
            fig 2: proposed changes in indexed profile format

Text Profile Format

For human inspection, text profiles organize and store the profiles by function and display human-readable function names for indirect callees. The profiles (including counters and value profiles) of one function are stored together, one function after another, as illustrated in fig 3.

                   +-----------------------+
                   |      Text Header      |
                   +-----------------------+
                   |  Text profile for     |
                   |  function1 with cfg1  |
                   +-----------------------+
                   |  Text profile for     |
                   |  function1 with cfg2  |
                   +-----------------------+
                   |  Text profile for     |
                   |  function2 with cfg3  |
                   +-----------------------+
                   |  Text profile for     |
                   |  function2 with cfg4  |
                   +-----------------------+ 
                       fig 3: text format

As such, we'll store human-readable vtable names directly. This doesn't require any format change in text profiles; we just need to teach the text reader/writer to understand the new value kind IPVK_VTableTarget and vtable names.

Here is an example, extending the previous C++ code snippet into a C++ executable [5]. The text profile is shown below. The first indirect callsite collects two candidate functions, _ZN4Base4funcEv and _ZN7Derived4funcEv, and the second indirect callsite collects _Z4echoi and _Z6squarei. The first profiled vtable site has two types, _ZTV4Base and _ZTV7Derived, and the second profiled site is a false positive, shown as external symbols.

:ir
main
# Func Hash:
783881049970552793
# Num Counters:
3
# Counter Values:
500
500
1
# Num Value Kinds:
2
# ValueKind = IPVK_IndirectCallTarget:
0
# NumValueSites:
2
2
_ZN4Base4funcEv:500
_ZN7Derived4funcEv:500
2
_Z4echoi:500
_Z6squarei:500
# ValueKind = IPVK_VTableTarget:
2
# NumValueSites:
2
2
_ZTV4Base:500
_ZTV7Derived:500
2
** External Symbol **:500
** External Symbol **:500

Type Profile Annotation in Optimized Builds

Thanks to the existing value profile infrastructure in IRPGO, using a new value kind in an optimized build is straightforward.

At IRPGO profile-use time, the PGOInstrumentationUse pass annotates counters and values from the per-function profile record; the pass only applies a profile if the CFG hash from the profile matches the one computed from the IR. As a result, the pass can precisely pick the instrumented instructions for each value kind and annotate value profile metadata on them.

Using the previous C++ example, the annotated IR will be

define i32 @_Z4funcP4BaseP14FuncPtrWrapper(ptr %b, ptr %wrapper) {
entry:
  %0 = load ptr, ptr %wrapper
  %1 = load ptr, ptr %0
  %vtable = load ptr, ptr %b, !prof !0   ; 1
  %2 = tail call i1 @llvm.public.type.test(ptr %vtable, metadata !"_ZTS4Base")
  tail call void @llvm.assume(i1 %2)
  %3 = load ptr, ptr %vtable
  %call = tail call i32 %3(ptr %b), !prof !1   ; 2
  %call1 = tail call i32 %1(i32 %call), !prof !2  ; 3
  ret i32 %call1
}

declare i1 @llvm.public.type.test(ptr, metadata) 
declare void @llvm.assume(i1)

!0 = !{!"VP", i32 2, i64 1600, i64 1960855528937986108, i64 1600}
!1 = !{!"VP", i32 0, i64 1600, i64 5459407273543877811, i64 1600}
!2 = !{!"VP", i32 0, i64 1600, i64 379032081350541304, i64 1600}

The two tail call instructions (commented as 2 and 3), along with their !prof metadata, are existing indirect function call value profiles. The load instruction with !prof (commented as 1) carries the new vtable value profile. Note that non-vtable loads won't be annotated with value profiles.

Virtual Table Definition Import

This is relevant in the context of ThinLTO.

In the postlink optimizer pipeline, we need to know the functions contained in each profiled vtable, and we need to import a virtual table definition referenced by vtable value profile metadata if it doesn't exist in the module.

Currently, ThinLTO builds a list of referenced global variables for each function and imports the referenced variables. To import vtable definitions, we will synthesize reference edges to the profiled vtables as read-only variable references, and assign value IDs like we do for indirect calls, since the vtable declaration symbol may not exist in the module during prelink. The existing ThinLTO infrastructure handles the rest to actually import the definitions.

It's worth mentioning that the IRLinker already makes sure the virtual functions referenced by imported virtual table definitions have at least declarations in the module, even if the virtual function definitions themselves are not imported. So no additional work is needed here to avoid unknown functions referenced from a vtable.

Cost Benefit Analysis

Comparing vtables does not always give the best comparison sequence. For instance, if there is one dominant target function and five candidate vtables, comparing vtables could introduce too many compare instructions, whereas comparing functions is one load plus one compare. For cost-benefit analysis, parameters are introduced to control the number of additional ALU operations (icmp or address calculation) allowed for the vtable-based comparison sequence; our tuning shows that allowing more than 4 additional ALU operations no longer helps performance.
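
To make the check concrete, here is a hedged sketch of the cost-benefit decision (the function name, parameter names and the default budget of 4 are illustrative of our tuning, not actual option names):

// Decide whether the vtable-based comparison sequence is worth it, given how
// many hot functions would be promoted and how many distinct vtables map to
// them (several vtables, e.g. sibling derived classes, may share one target).
bool preferVTableComparison(unsigned NumPromotedFunctions,
                            unsigned NumCandidateVTables,
                            unsigned MaxExtraALUOps = 4) {
  // Function-based ICP needs one vtable load plus one compare per promoted
  // target; vtable-based ICP drops the load but needs one compare per vtable.
  unsigned ExtraOps = NumCandidateVTables > NumPromotedFunctions
                          ? NumCandidateVTables - NumPromotedFunctions
                          : 0;
  return ExtraOps <= MaxExtraALUOps;
}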

In general, the parameters should be configurable and overridable for different backend targets. For instance, aarch64 needs to materialize large constants (vtable addresses) with two instructions, whereas x86 can typically encode the immediate value in a machine instruction, leading to cost differences in instruction cache and register usage.

Note that vtable-based comparison not only removes a load from the critical path, but also exposes jump threading opportunities.

To illustrate, the original code is

  ptr->func1();
  ptr->func2();

Comparing functions gives the following code snippet:

 vtbl_ptr = ptr->_vptr;
 func_ptr = *(vtbl_ptr + offset_of_func1);    

 if (func_ptr == HotType::func1)                      
   HotType::func1();   // hot path
 else
   func_ptr ();     // cold path       

 func_ptr = *(vtbl_ptr + offset_of_func2);        

 if (func_ptr == HotType::func2)                          
   HotType::func2();   // hot path
 else
   func_ptr ();     // cold path       

Comparing virtual tables gives the following code snippet:

vtbl_ptr = ptr->_vptr;
if (vtbl_ptr == &HotTypeVTable)
{
   HotType::func1();
   HotType::func2 ();                        
}
else
{
   func_ptr = *(vtbl_ptr + offset_of_func1);
   func_ptr ();
   func_ptr = *(vtbl_ptr + offset_of_func2);
   func_ptr ();
}

The latter sequence not only has fewer branches, but also exposes further devirtualization opportunities after inlining. For instance, if HotType::func1() is inlined, some virtual calls inside it could be statically devirtualized with class-hierarchy analysis.
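
As a hypothetical C++ illustration of that follow-on opportunity (the class and method names are made up):

struct Base {
  virtual int func1();
  virtual int helper();
};

struct HotType : Base {
  int func1() override { return helper() + 1; }  // nested virtual call
  int helper() override;
};

int run(Base *ptr) {
  // After vtable-based ICP, this call sits under a `vptr == &vtable for
  // HotType` guard on the hot path. Once HotType::func1() is inlined there,
  // the dynamic type on that path is known, so the nested helper() call could
  // be devirtualized to HotType::helper() (e.g., with class-hierarchy
  // analysis) and possibly inlined in turn.
  return ptr->func1();
}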

Status

The changes have been prototyped and tested, with a performance evaluation on one internal workload. The baseline binary compares function addresses, whereas the test binary compares vtable addresses with tuned cost-benefit analysis. The test binary shows a geomean QPS increase of ~0.3% over the baseline on some internal workloads. The patches are either drafted or already in review.

Follow up Work

In order to get precise type profiles for vtable loads in sample-based PGO, we'll need data access profiling. For instance, the MEM_INST_RETIRED.ALL_LOADS_PS precise event on Intel processors samples retired load instructions together with the address of the loaded data. The smaps data is already recorded in the sample PGO profile, so translation from runtime addresses to static addresses (and therefore symbol names and sizes) can be done. The sample PGO profile format needs to be extended to store value profiles.

With dynamic type information in the profiles, profile-guided interprocedural type propagation and type-profile-driven function cloning become feasible. Meanwhile, hotness-based vtable definition importing could serve the goal of importing function declarations for a better call-graph sort with more complete call-graph information. Another important use case of type profiling is vtable layout optimization based on vtable hotness, part of the RO data section splitting effort that Snehasish Kumar is working on.

[1] More accurately, since a non-vtable profiled address is not within the address range of any vtable object, it's stored as zero in indexed profiles. A pass that looks up a symbol with a zero hash will (almost) always find nullptr and skip the actual transformation (e.g., the comparison of symbols). So the performance overhead from non-vtable profiled addresses is negligible, if it exists at all. Comparing the loaded address with the symbol address guarantees correctness.
[2] A full example from C++ to IR without the proposed vtable type profiling is in Compiler Explorer
[3] Only ELF support is implemented for the two new object-file sections at this point; support for other platforms could be added if needed.
[4] To support merging of value profiles in raw profiles generally, one option is to let the compiler runtime expose the segment address ranges for binary sections (e.g., .text* sections and .data.rel.ro) and record those segment address ranges in each individual raw profile. This way, the recorded runtime addresses can be converted to static virtual addresses (which remain unchanged for the same binary) for each individual raw profile and merged together correctly.
[5] A full C++ executable is in Compiler Explorer


@teresajohnson @davidxl @snehasish, who have contributed to and advised this work.

@modimo @htyu @WenleiHe This is an RFC for the type profiling work. Questions and feedback are welcome.

@minglotus-6 Nice writeup! I’ve got a much clearer picture of what the patches are doing.

2 questions:

  1. For the ~0.3% improvement you’re seeing, has there been any analysis on where that gain is coming from? For instance, how many indirect call sites is this optimization firing on relative to original ICP and is this coming from a few hot callsites or more generally across a large number of callsites? I’m looking at related optimizations with WPD and eventually something similar here with sample PGO so insights on the performance characteristics would be good to know about to look out for.
  2. For associating profiled loads with vtables, I’m wondering if enough information exists in the binary already to make that happen without a separate section. Vtable symbols are recorded in the ELF .symtab with their starting address and size which seems to be everything in __llvm_prf_vtab.

Has there been any analysis on where that gain is coming from? For instance, how many indirect call sites is this optimization firing on relative to original ICP and is this coming from a few hot callsites or more generally across a large number of callsites?

I haven't collected aggregated stats, but the local change prints logs (probably worth gating behind LLVM_DEBUG). I could try getting this data and report back.

For associating profiled loads with vtables, I’m wondering if enough information exists in the binary already to make that happen without a separate section. Vtable symbols are recorded in the ELF .symtab with their starting address and size which seems to be everything in __llvm_prf_vtab.

This idea reminds me of the profile correlation work based on the binary or DWARF debug information.

To look up the profiled address collected at runtime using the address recorded in the binary, the raw profile also needs to record the runtime start address of the relevant segment (e.g. .data.rel.ro). This also means llvm-profdata needs to take the binary as input to process profiles (either to generate indexed profiles from raw profiles, or to show the profiled vtable information from a raw profile file).

To provide more context, instrFDO profile collection and processing are usually part of the prod binary release process. The release tools (which invoke llvm-profdata in a workflow) currently don't take the binary as input for instrFDO afaik. That is, making llvm-profdata take the binary as input may require more tooling and release-procedure changes.

For sampleFDO, create_llvm_prof does take the binary as input currently.

Having additional logging about optimizations is generally a plus for me. It also makes it easier to debug when something goes incorrectly and I’m not familiar with the pass. Thanks!

I see, that makes sense. We use sampleFDO for all but one application at Meta (and we’re planning on moving that off instrPGO) so it didn’t occur to me that the binary may not be used for profile generation. In that case, I think it makes sense to continue with the existing workstream.

On a tangential topic, checking out both your links, they mention that the __llvm_prf_* sections are currently intimately tied to the binary and can be quite large/loaded into memory. Are these things that matter for your scenarios, or not so much?

Having additional logging about optimizations is generally a plus for me. It also makes it easier to debug when something goes incorrectly

Sounds good. Just a quick update: I should be able to get back with stats today or tomorrow. It took longer than I thought, mainly because the original profile (corresponding to a particular version of the source code) got garbage-collected. I have now finished merging the pending LLVM changes and building a toolchain, and am about to start collecting new profiles for the newer source code.

By the way, how do the WPD performance numbers look on your side?

We use sampleFDO for all but one application at Meta (and we’re planning on moving that off instrPGO) so it didn’t occur to me that the binary may not be used for profile generation.

Got it. Out of curiosity, what's the main motivation for moving the instrPGO targets (to sampleFDO I suppose)?

On a tangential topic, checking out both your links, they mention that the __llvm_prf_* sections are currently intimately tied to the binary and can be quite large/loaded into memory. Are these things that matter for your scenarios, or not so much?

Previously, I measured the on-disk binary size increase for the workload as (sizeof(vtable-metadata object-file section) + sizeof(compressed-vtable-name object-file section)) / sum-of-all-object-file-sizes, where the sizes come from llvm-objdump -h. The increase is ~0.4% for the binary being measured.

I haven't measured the in-memory usage increase though. But I have extended the tested workloads after posting the RFC, and haven't encountered issues even without increasing the resource allocation.

As a proxy for the resource usage increase introduced by the two new binary sections, their byte size (from llvm-objdump -h) is ~7.7% of __llvm_prf_cnts, and ~1.1% of the sum of {.text.*, .data, .data.rel.ro} in the instrumented binary.

We’ve just enabled WPD in thinLTO without split LTO units (so only single implementation devirtualization) and it’s showing 0.4-0.6% gCPU improvement on three of our large Ads services. I’m currently working on scaling this out further throughout the fleet.

A big part is process homogeneity. Supporting 2 separate workflows is expensive and the onboarding process is much smoother for sampleFDO because sample collection is automatically enabled on every running service.

To clarify I don’t think the additions here impact this existing problem. I’m more asking if the overall cost for all of these sections are something that is interesting for scenarios on your side. If there’s some plans/thoughts on that then the additions here could take those into account.

For the ~0.3% improvement you’re seeing, has there been any analysis on where that gain is coming from? For instance, how many indirect call sites is this optimization firing on relative to original ICP and is this coming from a few hot callsites or more generally across a large number of callsites?

Here are the stats for one workload (this is with WPD enabled on both baseline and experiment)

  • Function-based ICP comparison applies to 32965 indirect callsites, and vtable-based ICP comparison applies to 22726.
  • Weighting each callsite by its block frequency, the weighted counter is 110449362793 for function comparisons and 285933039338 for vtable comparisons.
    • The weight of a callsite is the sum of the per-candidate counts of all promoted functions (or vtables).

From the stats, vtable comparisons have a smaller unweighted count but a higher weighted count.


Ack, thanks for sharing this!

To clarify I don’t think the additions here impact this existing problem. I’m more asking if the overall cost for all of these sections are something that is interesting for scenarios on your side. If there’s some plans/thoughts on that then the additions here could take those into account.

I see your original question better now. I concur that a design that introduces new object-file sections might push the binary size over the edge if the original executable is already large. Hopefully the measured metrics better justify the trade-off (the instrumented binary size increase and the design choice).

Thanks for the information! That's a considerable number of callsites that can be optimized to vtable-based ICP, and a good data point for incorporating this on our side as well. For reference, when I was digging through the hottest ICP sites in our ads workloads, I also found (by inspecting the source code) that many of them could be transformed to vtable comparisons.

I think the trade-off is definitely worth it here. From the other discussions you've linked, it looks like there are parties interested in decoupling the sections and/or re-using the DWARF data that already exists, which is something to consider for broader adoptability.


An update (cc @modimo @teresajohnson )

I have uploaded the end-to-end implementation to a branch of my LLVM fork. Besides (a squashed version of) the type instrumentation PR, the branch contains the ThinLTO import of vtables and the actual ICP transformations (commits 1 and 2). I'll send out the rest of the commits using stacked reviews (supported by spr). The first PR is not managed by spr and has already been reviewed, and I'm still figuring out whether I should use spr to create a duplicate of the first PR in order to send the rest of the patches for formal review as a stack. The duplicate would only be there to make the diffs of the remaining patches visible, not to solicit duplicate reviews.

Meanwhile, with safer WPD, we are planning to look into an optimization that removes the indirect call fallback [1] if there are no other implementations. Presumably, with profile-guided indirect-call promotion, the BB for the indirect fallback should be a cold block (and thereby split out of .text.hot by the machine function splitter). However, this would still be useful as a general optimization.

[1]

vptr = ptr->_vptr;
if (vptr == &vtable_HotType1)
  HotType1::func()   // hot path
else if (vptr == &vtable_HotType2)
  HotType2::func() // hot path
// If there are no other implementations with safer WPD, this fallback could be optimized away.
else  {           
  func_ptr = *(vptr + function-offset)
  call func_ptr
}