[RFC] VTable Type Profiling for SampleFDO

Summary

Prior work enables more efficient indirect call promotion using vtable profiles for Instrumented PGO. This RFC proposes extending the SamplePGO profile format to include vtable type information, allowing the compiler to annotate vtable types on IR instructions and perform vtable-based indirect call promotion for SampleFDO. The proposal consists of two main changes:

  1. Profile Format: PR 148002 implements the change for the extensible binary format (the default binary format of SamplePGO) and the text format. The profile format change for ext-binary is backward compatible.
  2. llvm-profgen: A prototype demonstrates how llvm-profgen could be extended to process perf data with Intel MEM_INST_RETIRED.ALL_LOADS samples and produce sample profiles with vtable information. For feature parity across different hardware, future work could incorporate support for AMD Instruction-Based Sampling (IBS) and ARM Statistical Profiling Extension (SPE).

We welcome your feedback and suggestions on this proposal and the accompanying pull requests.

Thanks,

Mingming, Teresa, Snehasish, David

Motivation

Similar to type profiling for Instrumented PGO, the motivating use case for vtable type information is to compare vtables for indirect call promotion in SampleFDO, moving the vtable load off the critical path and into the fallback indirect-branch handling. The before-vs-after transformation is illustrated with pseudocode below.

Before:

  vptr = ptr->_vptr;
  func_ptr = *(vptr + function-offset); // vtable load
  if (func_ptr == HotType::func)
    HotType::func();   // highly frequent path
  else
    call func_ptr();   // less likely path

After:

  vptr = ptr->_vptr;
  if (vptr == &vtable_HotType)
    HotType::func();   // highly frequent path
  else {               // less likely path
    func_ptr = *(vptr + function-offset); // vtable load
    call func_ptr;
  }
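For concreteness, a source-level sketch of the promoted dispatch is shown below, using a hypothetical HotType class and a HotVTableAddr placeholder for the hot vtable's address point; the actual transformation is performed on LLVM IR by the indirect-call promotion pass, so this is illustrative only.

// Compile-only sketch; HotType and HotVTableAddr are illustrative stand-ins.
struct Base {
  virtual int func(int a, int b) = 0;
  virtual ~Base() {}
};
struct HotType : Base {
  int func(int a, int b) override { return a + b; }
};

// Hypothetical address point of HotType's vtable, e.g. known from the profile;
// on IR the compiler materializes a reference to the _ZTV symbol plus offset.
extern void **HotVTableAddr;

int dispatch(Base *ptr, int a, int b) {
  void **vptr = *reinterpret_cast<void ***>(ptr); // single vptr load
  if (vptr == HotVTableAddr) {
    // Hot path: direct, inlinable call; no vtable load of the function
    // pointer on the critical path.
    return static_cast<HotType *>(ptr)->HotType::func(a, b);
  }
  // Cold path: regular virtual dispatch loads the function pointer from the
  // vtable.
  return ptr->func(a, b);
}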

What does the profile look like

The before-vs-after of a function’s profile is illustrated in text format below. The added lines show the vtable type profiles for both body samples and inlined callsite samples; the keyword vtables indicates that a line represents vtable counts for virtual calls.

The vtable counters represent the raw counters converted from memory access profiles, and should therefore be interpreted within one line location as the vtables’ relative frequencies. In practice, the memory access events and LBR events could be collected independently of each other and with different sampling periods. When the profile is used to compile programs, an optimizer can scale the vtable counters using the relative frequencies to make them comparable with LBR-based counters. This leaves more flexibility for compiler heuristic tuning.
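As a purely illustrative sketch of that scaling step (not the LLVM API): in the example below, location 9 has raw vtable counts of 1471 and 630 (raw total 2101) against an LBR-based count of 2064, so the rescaled counts would be roughly 1445 and 619.

#include <cstdint>
#include <map>
#include <string>

// Minimal sketch, not the LLVM API: rescale raw vtable counters at one line
// location so they become comparable with the LBR-derived count for that
// location, while preserving their relative frequencies.
std::map<std::string, uint64_t>
scaleVTableCounts(const std::map<std::string, uint64_t> &RawVTableCounts,
                  uint64_t LBRLocationCount) {
  uint64_t RawTotal = 0;
  for (const auto &[VTable, Count] : RawVTableCounts)
    RawTotal += Count;

  std::map<std::string, uint64_t> Scaled;
  if (RawTotal == 0)
    return Scaled;
  for (const auto &[VTable, Count] : RawVTableCounts)
    Scaled[VTable] = LBRLocationCount * Count / RawTotal;
  return Scaled;
}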

Original sample profile:

main:184019:0
 4: 534
 4.2: 534
 5: 1075
 5.1: 1075
 6: 2080
 7: 534
 9: 2064 _Z3bari:1471 _Z3fooi:631
 10: inline1:1000
  1: 1000
 10: inline2:2000
  1: 2000

Sample profile with vtable info:

main:184019:0
 4: 534
 4.2: 534
 5: 1075
 5.1: 1075
 6: 2080
 7: 534
 9: 2064 _Z3bari:1471 _Z3fooi:631
 9: vtables _ZVTbar:1471 _ZVTfoo:630
 10: inline1:1000
  1: 1000
 10: inline2:2000
  1: 2000
 10: vtables _ZVTinline1:1000 _ZVTinline2:2000

Description of Implementation

In-memory Representation of VTable Type Profiles

Currently, the class FunctionSamples keeps track of a function’s profile in memory. It consists of the head sample count, the total sample count, and the profiles for each line location inside this function. Each line location represents the relative line offset and discriminator; the per-location samples are either a nested map of FunctionSamples for inlined callees, or a SampleRecord otherwise. To store vtable type profiles, a new field VirtualCallsiteTypeCounts is introduced in a function’s in-memory profile. This field is a map keyed by line location, whose values are <vtable, counter> pairs. The class structure is pasted below.

// Key is vtable, and value is the counter.
using TypeCountMap = std::map<FunctionId, uint64_t>;

class FunctionSamples {
  uint64_t TotalSamples;
  uint64_t TotalHeadSamples;
  map<LineLocation, SampleRecord> BodySamples;
  map<LineLocation, map<FunctionId, FunctionSamples>> CallsiteSamples;
  // Key is location, value is the <vtable, counter> pairs.
  map<LineLocation, TypeCountMap> VirtualCallsiteTypeCounts;
};
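As a rough, standalone illustration of how this map could be populated (the accessor name and the simplified FunctionId/LineLocation types here are illustrative, not the exact API in PR 148002):

#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Standalone sketch mirroring the structure above: FunctionId is simplified to
// a string and LineLocation to a <line offset, discriminator> pair.
using FunctionId = std::string;
using LineLocation = std::pair<uint32_t, uint32_t>;
using TypeCountMap = std::map<FunctionId, uint64_t>;

struct FunctionSamplesSketch {
  std::map<LineLocation, TypeCountMap> VirtualCallsiteTypeCounts;

  // Accumulate one <vtable, count> sample observed at a virtual-call location.
  void addVTableTypeSamples(const LineLocation &Loc, const FunctionId &VTable,
                            uint64_t Count) {
    VirtualCallsiteTypeCounts[Loc][VTable] += Count;
  }
};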

Extensible Binary Profile Format Change

The extensible-binary format is the default SamplePGO binary format. It organizes different data payloads into separate sections. For each section, its section header uses a bitmap (referred to as a flag in the actual implementation) to describe the section’s features. In this format, a name table section contains de-duplicated function names and linearizes each function name to an integer ID, and the function-profile section references function names by that integer ID.

The following changes are proposed for the ext-binary profile format:

  1. To represent vtable names, the name table section will store and linearize virtual table names in addition to function names.
  2. The in-memory representation of vtables (the field VirtualCallsiteTypeCounts in class FunctionSamples above) is serialized as part of each FunctionSamples. Implementation-wise, the vtables are serialized after the inlined callsites within a FunctionSamples; a sketch of the resulting layout follows the enum below. The LineLocation and counters are serialized as they currently are, and vtable strings are represented by their linearized IDs in the name table.
  3. For configurability, we introduce an LLVM boolean option (named extbinary-write-vtable-type-prof), off by default. The ext-binary profile writer reads the option and decides whether to set the section flag bit SecFlagHasVTableTypeProf and serialize the class member VirtualCallsiteTypeCounts.
  4. For backward compatibility, we take one bit from SecProfSummaryFlags to indicate whether the SampleFDO profile has vtable information. The profile reader reads the section flag bit and thereby knows whether the binary profile contains type profiles. The enum class definition is pasted below with existing and new fields.
enum class SecProfSummaryFlags : uint32_t {
  SecFlagInValid = 0,

  /// SecFlagPartial means the profile is for common/shared code.
  /// The common profile is usually merged from profiles collected
  /// from running other targets.
  SecFlagPartial = (1 << 0),

  /// SecFlagContext means this is context-sensitive flat profile for CSSPGO
  SecFlagFullContext = (1 << 1),

  /// SecFlagFSDiscriminator means this profile uses flow-sensitive discriminators.
  SecFlagFSDiscriminator = (1 << 2),

  /// SecFlagIsPreInlined means this profile contains ShouldBeInlined contexts
  /// thus this is CS preinliner computed.
  SecFlagIsPreInlined = (1 << 4),

  /// The new bit. SecFlagHasVTableTypeProf means this profile contains vtable
  /// type profiles.
  SecFlagHasVTableTypeProf = (1 << 5),
};
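To make the layout and the flag gating concrete, here is a self-contained sketch. It is not the actual SampleProfWriterExtBinary code (which emits ULEB128-encoded values and uses the on-disk name table); the helper types are simplified stand-ins.

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: emit the per-callsite vtable counts of one
// FunctionSamples after its inlined callsites, gated by the section flag.
using TypeCountMap = std::map<std::string, uint64_t>;
using LineLocation = std::pair<uint32_t, uint32_t>; // <line offset, discriminator>

void writeVTableTypeCounts(
    std::vector<uint64_t> &Out,
    const std::map<LineLocation, TypeCountMap> &VTableCounts,
    const std::map<std::string, uint64_t> &NameTableIndex,
    bool HasVTableTypeProf /* mirrors SecFlagHasVTableTypeProf */) {
  if (!HasVTableTypeProf)
    return; // flag not set: nothing is emitted, so old readers are unaffected
  Out.push_back(VTableCounts.size());
  for (const auto &[Loc, TypeCounts] : VTableCounts) {
    Out.push_back(Loc.first);  // line offset, encoded as for body samples
    Out.push_back(Loc.second); // discriminator
    Out.push_back(TypeCounts.size());
    for (const auto &[VTable, Count] : TypeCounts) {
      Out.push_back(NameTableIndex.at(VTable)); // vtable name as name-table ID
      Out.push_back(Count);
    }
  }
}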

Extend llvm-profgen to generate vtable profiles

To generate SampleFDO profiles from branch records, llvm-profgen processes Linux Perf script output containing hardware samples and memory map (mmap) events, and uses the mmap events to "symbolize" the instruction pointers, obtaining each IP's inline stack and location anchor. It populates the in-memory representation of function samples and serializes them as profiles.

A sample record of a MEM_INST_RETIRED.ALL_LOADS event essentially gives the memory address accessed and the instruction pointer of the load instruction. Just as instruction pointers are symbolized for SampleFDO, symbolizing the load instruction gives its inline stack. If the data address is symbolizable and has a _ZTV prefix [1], the instruction loads from a vtable. llvm-profgen can parse a perf raw trace [2] of the memory access events into a list of <ip, data-symbol, count> tuples. Using the symbolized inline stack of the instruction as an anchor, llvm-profgen can further process the tuples into a list of vtable symbols and their counts for each <function, LineLocation>.
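A minimal sketch of that aggregation step, with illustrative names (the real logic lives in the llvm-profgen prototype and differs in detail), assuming a symbolizer callback that maps an IP to its inline-stack leaf:

#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One parsed tuple from the raw trace of memory access events.
struct DataAccessSample {
  uint64_t IP;            // address of the load instruction
  std::string DataSymbol; // symbolized data address, e.g. "_ZTV8Derived1"
  uint64_t Count;
};

// Inline-stack leaf used as the anchor for the counts.
struct FuncLoc {
  std::string Function;
  uint32_t LineOffset;
  uint32_t Discriminator;
  bool operator<(const FuncLoc &O) const {
    return std::tie(Function, LineOffset, Discriminator) <
           std::tie(O.Function, O.LineOffset, O.Discriminator);
  }
};

template <typename Symbolizer>
std::map<FuncLoc, std::map<std::string, uint64_t>>
aggregateVTableSamples(const std::vector<DataAccessSample> &Samples,
                       Symbolizer SymbolizeIP) {
  std::map<FuncLoc, std::map<std::string, uint64_t>> Result;
  for (const DataAccessSample &S : Samples) {
    // Only loads whose data address symbolizes to a vtable are of interest.
    if (S.DataSymbol.rfind("_ZTV", 0) != 0)
      continue;
    FuncLoc Loc = SymbolizeIP(S.IP); // inline-stack leaf as the anchor
    Result[Loc][S.DataSymbol] += S.Count;
  }
  return Result;
}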

A prototype is implemented for non-context-sensitive SampleFDO. The source code and operations are attached in the Appendix section. Invoking llvm-profgen with this change gives the following profile entries:

# generate profile
$ ./bin/llvm-profgen --perfscript=path/to/lbr-perf.script --data-access-profile=path/to/dap-perf.txt --binary=path/to/main --format=text --pid=<pid> -ignore-stack-samples -use-dwarf-correlation -o main.afdo

# show profile
$ ./bin/llvm-profdata show --sample --function=_Z9loop_funciii main.afdo

# The terminal output of 'llvm-profdata show'
Samples collected in the function's body {
0: 636241
1: 681458, calls: _Z10createTypei:681458
3: 543499, calls: _ZN12_GLOBAL__N_18Derived24funcEii:410621 _ZN8Derived14funcEii:132878
3: vtables: _ZTV8Derived1:1377 _ZTVN12_GLOBAL__N_18Derived2E:4250
6.1: 602201, calls: _ZN12_GLOBAL__N_18Derived2D0Ev:454635 _ZN8Derived1D0Ev:147566
5.1: vtables: _ZTV8Derived1:227 _ZTVN12_GLOBAL__N_18Derived2E:765
7: 511057
}

To interpret the data, we can examine the virtual function calls at line offset 3. In the source code, loop_func at this line calls Derived2::func and Derived1::func for roughly 3/4 and 1/4 of all calls. Both the indirect call profiles and the vtable profiles show a rough 3:1 ratio. Similarly, a rough 3:1 ratio is expected in the profile of the second virtual call. For the demo program, which exercises a simple loop, it’s observed that a perf.data with MEM_INST_RETIRED.ALL_LOADS collected over a short time window (e.g., 30 seconds) and at a high sampling frequency could have biased vtable counters. In a real SampleFDO profile-and-compile set-up, profile counters are sampled from data center workloads at a much lower frequency and in a continuous manner, so presumably the bias is mitigated by the larger statistical sample. Besides, the compiler can statically analyze the vtable-to-virtual-function mapping in the whole program to make the numbers more usable.

Iterative Compilation

Unlike Instrumented PGO, SamplePGO is often used in an iterative compilation set-up, meaning the profiles are collected from SamplePGO-optimized binaries. When the profiled binary is optimized with vtable-based indirect call promotion, which optimizes away the vtable load instructions on the hot path, the memory load profiles won’t capture vtable load events.

Acknowledging the importance of stable performance in an iterative compilation set-up, this RFC and the patches aim to make the SamplePGO profile format change, the first step towards the end goal. One potential solution that we plan to pursue is to record the vtable-based transformations as metadata (e.g., for each machine instruction that loads a vtable from an object, the list of vtables and their counts) in non-alloc ELF sections, and use that profile mapping metadata together with the memory access profiles to reconstruct the vtable profiles. This borrows the idea of recording profile mapping metadata in the binary from the Propeller basic block address map [3] and PC-keyed metadata, and the solution could be generalized for other passes that face iterative-compilation challenges [4].
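Purely to make that idea concrete (this encoding is not part of this RFC or its patches), one possible record layout for such a non-alloc metadata section might look like:

#include <cstdint>

// Illustrative only: for each machine instruction that loaded a vptr before
// vtable-based ICP removed the hot-path vtable load, record the instruction's
// address and the promoted vtables with their profiled counts, so a later
// llvm-profgen run can reconstruct vtable counts even when the load no longer
// executes.
struct VTableProfMapEntry {
  uint64_t InstructionAddress; // keyed by PC, as in PC-keyed metadata
  uint32_t NumVTables;
  // Followed by NumVTables records of:
  //   uint64_t VTableNameIndex; // e.g. index into a string table
  //   uint64_t Count;
};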

Appendix

Source Code

  • lib.h
#include <stdio.h>
#include <stdlib.h>

class Base {
public:
  virtual int func(int a, int b) = 0;
  virtual ~Base() {};
};

class Derived1 : public Base {
public:
  int func(int a, int b) override;
  ~Derived1() {}
};

__attribute__((noinline)) Base *createType(int a);
  • lib.cpp
#include "lib.h"

namespace {

class Derived2 : public Base {
public:
  int func(int a, int b) override { return a * (a - b); }
  ~Derived2() {}
};

} // namespace

int Derived1::func(int a, int b) { return a * (a - b); }

Base *createType(int a) {
  Base *base = nullptr;
  if (a % 4 == 0)
    base = new Derived1();
  else 
    base = new Derived2();
  return base;
}
  • main.cpp
#include "lib.h"
#include <iostream>
#include <chrono>
#include <thread>

__attribute__((noinline)) int loop_func(int i, int a, int b) {
  Base *ptr = createType(i);
  int sum = ptr->func(a, b);
  delete ptr;
  return sum;
}

int main(int argc, char **argv) {
  int sum = 0;

  auto startTime = std::chrono::steady_clock::now();
  // run the program long enough to make manual sampling easier
  std::chrono::minutes duration(7200);

  while(std::chrono::steady_clock::now() - startTime < duration) {
    for (int i = 0; i < 100000; ++i) {
      sum += loop_func(i, i + 1, i + 2);
    }
  }
  printf("total sum is %d\n", sum);
  return 0;
}

Operations

  1. Compile and run source code
# Build a position dependent binary so runtime addr is the same as static virtual addr for parsing 
# profiles. 
./bin/clang++ -v -static -g -O2 -fdebug-info-for-profiling -fno-omit-frame-pointer -fpseudo-probe-for-profiling -fno-unique-internal-linkage-name -fuse-ld=lld lib.cpp main.cpp -o main

# run the program and get its process id
./main &
  2. Run Linux Perf to collect memory access perf.data, and generate its raw trace
perf5 record -a -c <prime-number-as-sampling-period> -p <pid> -d --pfm-events MEM_INST_RETIRED.ALL_LOADS:u:pinned:precise=3 -g -i -m 16 --buildid-mmap -N -o - --skip-hugepages sleep 30 | perf5 inject -b --buildid-all -i - -o dap-perf.data

perf5 report -D -i dap-perf.data >dap-perf.txt
  3. Run Linux Perf to collect LBR, and generate its perf script
perf5 record --pfm-events br_inst_retired:near_taken:u -c <prime-number-as-sampling-period> -b -p <pid> -o lbr-perf.data sleep 30
perf5 script -F ip,brstack -i lbr-perf.data --show-mmap-event &> lbr-perf.script

Running llvm-profgen [5] gives the function profile with vtable counts.


  1. assuming the Itanium ABI is used

  2. a raw trace from perf report -D is used to address a tooling limitation: perf script doesn’t print the leaf address in perf script -F ip output for the memory access profiles

  3. explained in section 3.2 of Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications

  4. explained in sections 4.1.4 and 5.2 of the paper AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications

  5. ./bin/llvm-profgen --perfscript=path/to/lbr-perf.script --data-access-profile=path/to/dap-perf.txt --binary=path/to/main --format=text --pid= -ignore-stack-samples -use-dwarf-correlation -o main.afdo


For feature parity across different hardware, future work could incorporate support for AMD Instruction-Based Sampling (IBS) and ARM Statistical Profiling Extension (SPE).

cc: @ilinpv who mentioned they were interested in SPE support in llvm-profgen.


Regarding the iterative compilation, one solution is to make use of the edge profile in the following release: the compiler can look at the generated vtable-based ICP specialization guards (which still need to be annotated as proposed in this RFC) and infer the vtable access count in the unoptimized code.

Can you also elaborate on the profiling bias issue with a high sampling rate?

By making use of edge profile, do you mean that vtable counters can be inferred from the newly profiled binary? That’s indeed the plan for reconstructing vtable counts when the binary doesn’t optimize the loads away; edge profiles reflect the latest information about the binary, whereas recorded vtable counts become stale by at least one binary version. I agree that recording the instructions of interest is required, but it might be feasible to avoid recording the exact vtable counters as binary metadata.

Can you also elaborate on the profiling bias issue with a high sampling rate?

Sure. Using the llvm-profdata show output in [RFC] VTable Type Profiling for SampleFDO as an example, the vtable counters are in the range of 200 - 4500, while the counters inferred from LBR are in the range of 500,000 - 700,000. The memory-load raw counters and LBR-inferred counters differ by orders of magnitude [1], and the ratio of virtual function targets (from LBR) is often much closer to 3:1 than the ratio of vtables (from memory access events) if we repeat the experiment a couple of times. Presumably with continuous sampling (at a much lower sampling rate but across the entire fleet with more machines in a real-world setting), the bias is mitigated. I don’t have analysis results over real-world data points though. One way to do this analysis is to generate a SampleFDO profile with vtable counters from continuously sampled data and analyze the profile.


  1. the Linux Perf tool uses the same sampling period for last branch records and 'MEM_INST_RETIRED.ALL_LOADS' for the same interval (30 seconds) when this difference is observed

By edge profile, I mean the vtable counts can be inferred from the newly profiled binary. The vtable access count should be the same as that of the guard if (vptr == &vtable_A). Ideally, if profile processing can peek into the binary and do disassembly, no metadata annotation is needed, but it is more convenient to annotate the guard branch with vtable candidates.

For memory load counts, what matters is the ratio and total count. The total count can be determined from the block counts where the vcall resides.
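To make that inference concrete, a minimal sketch with illustrative names only (not tied to any LLVM API):

#include <cstdint>

// In the next release's edge profile, the count of the block guarded by
// `if (vptr == &vtable_HotType)` equals the hot vtable's access count, and the
// fall-through block's count covers the remaining vtables at this call site.
struct ICPGuardCounts {
  uint64_t GuardedBlockCount; // block executed when the vptr compare matched
  uint64_t FallthroughCount;  // block taking the generic indirect call
};

// The hot vtable's reconstructed count is simply the guarded block's count;
// the fall-through count bounds the total of all other vtables, which can be
// split by memory-load ratios if available.
uint64_t inferHotVTableCount(const ICPGuardCounts &C) {
  return C.GuardedBlockCount;
}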

Thanks for sharing the proposal.

One challenge with using PEBS for value profiling is the lack of filtering. With MEM_INST_RETIRED.ALL_LOADS, we’d be sampling all loads, with only a tiny portion relevant to vtable loads. Then, in order to get a sufficient amount of samples for the vtable value profile, we will likely have a huge amount of raw value profile data.

How do you deal with that? Have you tried this prototype on any large-scale application to gauge the scalability of the solution?

Thanks for the feedback Wenlei!

It’s true that vtable samples are sparse [1] among all MEM_INST_RETIRED.ALL_LOADS samples.

To get representative vtable profiles (and static data access profiles [2]), we rely on a continuous profiling infrastructure [3] which collects MEM_INST_RETIRED.ALL_LOADS samples from multiple production machines in the fleet (at low frequency to minimize the overhead), canonicalizes runtime addresses to the binary’s virtual addresses, and aggregates the samples for each data center application.

To estimate the virtual call overhead in real applications, we look at MEM_INST_RETIRED.ALL_LOADS and precise CPU cycle events. We use MEM_INST_RETIRED.ALL_LOADS events to locate the set of vtable load instructions in a binary (e.g., where the data access is symbolized to a _ZTV-prefixed symbol), and calculate the percentage of precise cycles [4] spent on vtable load instructions out of the binary’s total cycles. Analysis of the top 50 SampleFDO binaries shows a median of 0.76%, an average of 0.88%, and a maximum of 3.3%. There are indeed binaries with low virtual call overhead (e.g., where the majority of cycles are in C rather than C++ code).

As a supplementary data point, the percentage for vtable calls (also using the same precise CPU cycle events and MEM_INST_RETIRED.ALL_LOADS events from continuous production samples) drops by 0.4% to 0.5% after vtable-based ICP is applied for Instrumented PGO binaries, and this percentage is roughly the same as the throughput increase.


  1. Unsurprisingly, analyzing the aggregated production samples described below shows the majority of memory accesses are from the heap rather than the binary’s static data sections, and vtable accesses are a subset of static data accesses

  2. for [RFC] Profile Guided Static Data Partitioning

  3. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers

  4. relying on hardware’s precise cycle counters

Hey @mingmingl-llvm,

Thanks for the proposal.

On needing raw perf traces:

Can you please elaborate on the Linux Perf limitation? Are there any plans of dealing with that other than using -D?

If attaching raw events is also needed for AArch64, this would attach the Arm SPE’s native packets in textual format, making the text file bigger (by ~50-60% in a quick, rough test I did). We may be interested in improving the handling in Linux for that.

On the discussion of sampling bias:

the vtable counters are in the range of 200 - 4500, while the counters inferred from LBR are in the range of 500,000 - 700,000. The memory-load raw counters and LBR-inferred counters differ by orders of magnitude, and the ratio of virtual function targets (from LBR) is often much closer to 3:1 than the ratio of vtables (from memory access events) if we repeat the experiment a couple of times. Presumably with continuous sampling (at a much lower sampling rate but across the entire fleet with more machines in a real-world setting), the bias is mitigated. I don’t have analysis results over real-world data points though. One way to do this analysis is to generate a SampleFDO profile with vtable counters from continuously sampled data and analyze the profile.

That is interesting. Of course, this profiling is done in separate steps, and as mentioned, the sampling rate could be configured differently.

From the memory profiling, we are primarily interested in identifying vtable loads, but we also obtain a ratio to compare against the edge profile. In this isolated example, the edge profile appears to be more accurate.

Is the plan to use the memory-profiling ratio as a partial verification, or do you think matching ratios would be required?


Some clarifications / naive questions:

I haven’t followed the instrumentation-based implementation (your prior work); does code emission support multiple target types for ICP?

drops by 0.4 ~ 0.5% after the vtable-based ICP is applied for instrumented PGO binaries,

How should the above be interpreted? Does this vtable improvement apply on top of binaries previously optimized with instrumentation-based PGO (where this was not used)?

Build a position dependent binary so runtime addr is the same as static virtual addr for parsing profiles.

Is this a limitation in supporting PIE/PIC code? If so, would that be part of #148013? I’ve left a comment on the patch.

Thanks again for your work!

Paschalis

Hi all, thanks for the proposal and for your work on SampleFDO and the addition of this new profiling type. Please correct me if I am wrong, but as I understand the proposal and patches, LBR is not strictly required for vtable SampleFDO. Do you have any plans to update llvm-profgen to support non-LBR profiles? Such support would open the path to enabling this on existing AArch64 hardware.
Thanks,
Pavel

Thanks for the follow up and code reviews @paschalis.mpeis ! This reply fell through the cracks. Sorry about that. I replied inline, and happy to follow up in the discourse or PR reviews!

As the reviews for https://github.com/llvm/llvm-project/pull/148002 and https://github.com/llvm/llvm-project/pull/148013 have been going on for a while and prior comments are addressed, I plan to land them in the next two weeks if there are no further questions. But let me know if there are more PR review comments that you’d like to discuss.

Basically, the perf script -F ip perf.data command prints the frame pointers but not the leaf instruction pointer. There is an internal feature request to print the leaf address in the perf script command; once that’s supported, llvm-profgen can be enhanced to make use of perf script output instead of raw perf traces (the latter is what https://github.com/llvm/llvm-project/pull/148013 implements).

If attaching raw events is also needed for AArch64, this would attach the Arm SPE’s native packets in textual format, making the text file bigger (~50-60% in some quick, rough test I did). We may be interested in improving the handling in Linux for that.

This makes sense. Let me know if an example perf.data and an external feature request to the Linux perf tool would be useful; if so, I’m happy to create a small example reproducing this and file it externally to explain what’s going on.

Of course, this profiling is done in separate steps, and as mentioned, the sampling rate could be configured differently.

This is correct!

From the memory profiling, we are primarily interested in identifying vtable loads, but we also obtain a ratio to compare against the edge profile. In this isolated example, the edge profile appear to be more accurate.

Yes, the edge profiles give a more accurate distribution over the two vtables than memory load profiling.

Is the plan to use the memory-profiling ratio as a partial verification, or do you think matching rations would be required?

If we are talking about the compiler’s verification before making use of vtable load profiles, I think some heuristics (and, with additional work, whole-program vtable analysis) are needed to reconcile the ratios between virtual call targets and virtual call types, striking a balance between performing more vtable-based ICP and dropping inaccurate profiles.

I haven’t followed the instrumentation-based implementation (your prior work); does code emission support multiple target types for ICP?

The optimizer is implemented to allow at most 2 vtable targets for the last virtual call target and require one vtable target (with equivalent counters) for the rest. The cost benefit analysis is implemented inside IndirectCallPromoter::isProfitableToCompareVTables.

How the above should this be interpreted? Does this vtable improvement apply on top of binaries previously optimized with instrumentation-based PGO? (where this was not used)

This is correct.

Is this a limitation in supporting PIE/PIC code? If so, would that be part of #148013? I’ve left a comment on the patch.

The build command in the Operations section builds a position dependent binary for illustration purposes: it’s easier to match perf.data addresses with binary addresses without a conversion.

The llvm-profgen change handles address conversion, and you might find the PIE test case already :slight_smile:


LBR is not strictly required for vtable SampleFDO.

Essentially, memory profiling gives the <instruction-pointer, vtable-address> pairs, and LBR profiles are used to associate the vtable counters with source code locations. From this perspective, LBR is not a strict requirement.

Do you have any plans to update llvm-profgen to support non-LBR profiles? Such support would open the path to enabling this on existing AArch64 hardware

@dhoekwater expanded native AFDO support by making use of AArch64 SPE. I’d defer to Daniel to share the next steps in this area.

Do you have any plans to update llvm-profgen to support non-LBR profiles?

@ilinpv Also just to clarify, is this more about supporting memory profiling or more about branch profiling for AArch64? The native AFDO support mentioned above is about branch profiling.

Ideally, we should support both on the AArch64 systems where branch stack record profiling is not available: enabling the use of non-LBR profiles to map vtable counters to source code locations, and providing support in llvm-profgen for branch profiles through SPE as well as generic ones using PMU events (currently handled by AFDO create_llvm_prof).

Hey @mingmingl-llvm,

Thanks for addressing my comments in the PR and your detailed answer. :slight_smile:

Let me know if an example perf.data and an external feature request to the Linux perf tool would be useful; if so, I’m happy to create a small example reproducing this and file it externally to explain what’s going on.

Yes please, a small reproducer would be great. I’ll forward this information to colleagues on Arm Linux team so they can take a look.

The build command in the Operations section builds a position dependent binary for illustration purposes: it’s easier to match perf.data addresses with binary addresses without a conversion.

The llvm-profgen change handles address conversion, and you might find the PIE test case already :slight_smile:

Great, thanks for clarifying and the added test!

I haven’t thought too deeply about the specifics of adding support for SPE branch profiles to llvm-profgen, but porting over the create_llvm_prof implementation would require a significant amount of work due to its use of the perf_data_converter and protobuf libraries. Although I have informally committed to port over SPE support in the event that we decide to bring llvm-profgen fully in sync with create_llvm_prof, there isn’t currently a timeline for doing so.

I’ll look into what it would take to add SPE support to llvm-profgen. Since a lot of perf tools already support SPE, it may not be as tricky as I think.

Here are the steps from which I concluded (and confirmed with a partner team expert) that perf script -F ip doesn’t print the leaf (i.e., instruction pointer) address even when the information exists in the perf.data file. This isn’t a blocking issue for us, so there’s no immediate urgency.

  1. Compile (with the options in Operations step 1) the example source code into an executable. The executable loops over loop_func, which exercises virtual call dispatches, for long enough (2 hours) to make manual sampling easier.
  2. Run the executable in the background and get its process ID, then invoke the perf command in [1] to generate the perf.data file.
  3. Let’s say the <instr-addr, vtable-data-addr> pair is supposed to be <0x260862, 0x3b3fb0> in a memory load event [1]
    1. Inspecting the raw trace [2] gives groups of output showing that the <0x260862, 0x3b3fb0> pair exists in the binary perf.data file (otherwise perf report would have nowhere to find it)
    2. On the other hand, there is no appearance of 260862 (the instruction that accesses the data address) in the perf script output, as shown by the command in [3].

[1] perf record -a -c <sampling-period> -p <pid> -d --pfm-events MEM_INST_RETIRED.ALL_LOADS:u:pinned:precise=3,br_inst_retired:near_taken:u -b -g -i --buildid-mmap -N -o - --skip-hugepages sleep 15 | perf inject -b --buildid-all -i - -o perf.data

[2] perf report -D -i perf.data | grep -A 12 "0x260862 period: <sampling-period> addr: 0x3b3fb0"

[3] perf script -i perf.data -F ip,addr | grep "260862"


    1. for the non-PIE binary, which won’t have ASLR