[RFC] VTable Type Profiling for SampleFDO

Hey @mingmingl-llvm,

Thanks for the proposal.

On needing raw perf traces:

Can you please elaborate on the Linux Perf limitation? Are there any plans to deal with it other than using -D?

If attaching raw events is also needed for AArch64, this would attach the Arm SPE’s native packets in textual format, making the text file bigger (roughly 50-60% in a quick, rough test I ran). We may be interested in improving the handling in Linux for that.

On the discussion of sampling bias:

the vtable counters are in the range of 200 - 4500, while the counters inferred from LBR are in the range of 500,000 - 700,000. The memory-load raw counters and LBR-inferred counters differ by orders of magnitude , and the ratio of virtual function target (from LBR) is often much closer to 3:1 than the ratio of vtables (from memory access events) if we repeat the experiment a couple of times. Presumably with continuous sampling (at much lower sampling rate but across the entire fleet with more machines in a real world setting), the bias is mitigated. I don’t have analysis result over real-world data points though. One way to do this analysis is to have a SampleFDO profile generated with vtable counters from continuous sampled data and analyze the profile.

That is interesting. Of course, this profiling is done in separate steps, and as mentioned, the sampling rate could be configured differently.

From the memory profiling, we are primarily interested in identifying vtable loads, but we also obtain a ratio to compare against the edge profile. In this isolated example, the edge profile appears to be more accurate.

Is the plan to use the memory-profiling ratio as a partial verification, or do you think matching ratios would be required?


Some clarifications / naive questions:

I haven’t followed the instrumentation-based implementation (your prior work); does code emission support multiple target types for ICP?

drops by 0.4 ~ 0.5% after the vtable-based ICP is applied for instrumented PGO binaries,

How should the above be interpreted? Does this vtable improvement apply on top of binaries previously optimized with instrumentation-based PGO (where vtable-based ICP was not used)?

Build a position dependent binary so runtime addr is the same as static virtual addr for parsing profiles.

Is this a limitation in supporting PIE/PIC code? If so, would that be part of #148013? I’ve left a comment on the patch.
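
To make the PIE/PIC question concrete, here is a rough sketch of the address translation I have in mind. It is not taken from the proposal or from any existing tool: the `Mapping` and `Segment` types and the `runtime_to_static` helper are hypothetical names, and I am assuming the load mapping (for example from a perf mmap event) and the matching ELF PT_LOAD segment are both available to the profile parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical view of one file-backed mapping of the binary at runtime."""
    start: int  # runtime start address of the mapping
    size: int   # length of the mapping in bytes
    pgoff: int  # file offset at which the mapping starts


@dataclass(frozen=True)
class Segment:
    """Hypothetical view of the ELF PT_LOAD segment backing that mapping."""
    vaddr: int   # p_vaddr: static virtual address of the segment
    offset: int  # p_offset: file offset of the segment


def runtime_to_static(addr: int, m: Mapping, seg: Segment) -> int:
    """Translate a sampled runtime address to a static virtual address."""
    if not (m.start <= addr < m.start + m.size):
        raise ValueError("address is outside the given mapping")
    # Step 1: runtime address -> offset within the ELF file.
    file_offset = addr - m.start + m.pgoff
    # Step 2: file offset -> static virtual address via the segment header.
    return file_offset - seg.offset + seg.vaddr


# Position-dependent case: the translation is the identity.
non_pie = runtime_to_static(
    0x401234,
    Mapping(start=0x400000, size=0x2000, pgoff=0x0),
    Segment(vaddr=0x400000, offset=0x0),
)

# PIE case (made-up load address): the sampled data address maps back
# to the static address that the symbol table uses.
pie = runtime_to_static(
    0x555555556010,
    Mapping(start=0x555555556000, size=0x1000, pgoff=0x2000),
    Segment(vaddr=0x3000, offset=0x2000),
)
```

With a position-dependent binary the first case applies and no translation is needed, which I assume is why the RFC asks for one. The second case is what I would expect PIE support to need for the sampled data addresses.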

Thanks again for your work!

Paschalis