[RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Thanks for the feedback and discussions! We will be sending out patches soon. The patches will be organized into three categories: 1) Context-sensitive sample PGO, 2) Pseudo Instrumentation, and 3) a new profile generation tool that #1 and #2 depend on.

Thanks for sharing the detailed description of the pseudo probes. It sheds light on the fact that pseudo probes are used not just for mapping addresses back to IR, but also for building the “context-sensitive” profile of each function.
One question (although I think David asked this previously): how precise is the CODE_ADDRESS, specifically when basic blocks are duplicated or merged by machine passes?

In addition to an IR block ID or probe ID, we’ll also need to know the inline context of a probe if it comes from an inlinee. The current pseudo probe encoding is based on a DFS walk of the inline tree. A MIR BB may contain probes from different inlinees, so we may need to extend the BB-info format to encode the inline contexts there. I’m happy to work with you on an encoding format that can be used for both Propeller and pseudo probes.
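
To make the DFS walk concrete, here is a minimal C++ sketch of walking an inline tree pre-order so that each probe is paired with its full inline context. The `InlineTreeNode` and `ContextualProbe` types, the `collectProbes` function, and the `caller:callsite @ callee` context string are hypothetical illustrations, not the actual LLVM data structures; function names stand in for GUIDs, and `CallSite` stands in for the inline-site value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory inline tree: each node is one function body, either
// the top-level uninlined function or an inlinee at a particular call site.
struct InlineTreeNode {
  std::string Name;                     // stands in for the function GUID
  uint64_t CallSite = 0;                // call site in the parent (0 for the root)
  std::vector<uint64_t> ProbeIds;       // probes originating from this body
  std::vector<InlineTreeNode> Inlinees; // first-level inlinees
};

// A probe id paired with its inline context, outermost caller first,
// e.g. "main:3 @ foo:7 @ bar".
struct ContextualProbe {
  std::string Context;
  uint64_t ProbeId;
};

// Pre-order DFS: emit this body's own probes, then recurse into each inlinee,
// extending the context with the call site the inlinee was inlined at.
void collectProbes(const InlineTreeNode &Node, const std::string &CallerCtx,
                   std::vector<ContextualProbe> &Out) {
  assert(!Node.Name.empty() && "every inline tree node should name its function");
  std::string Ctx = CallerCtx.empty() ? Node.Name : CallerCtx + " @ " + Node.Name;
  for (uint64_t Id : Node.ProbeIds)
    Out.push_back({Ctx, Id});
  for (const InlineTreeNode &Child : Node.Inlinees)
    collectProbes(Child, Ctx + ":" + std::to_string(Child.CallSite), Out);
}
```

Because the walk visits a function's own probes before recursing into its inlinees, a reader consuming records in the same order should be able to rebuild each probe's context with a single stack of call sites.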

This is our current encoding format:

// FUNCTION BODY (one for each uninlined function present in the text section)
//   GUID (uint64)
//     GUID of the function
//   NPROBES (ULEB128)
//     Number of probes originating from this function.
//   NUM_INLINED_FUNCTIONS (ULEB128)
//     Number of callees inlined into this function, aka number of
//     first-level inlinees
//   PROBE RECORDS
//     A list of NPROBES entries. Each entry contains:
//       INDEX (ULEB128)
//       TYPE (uint4)
//         0 - block probe, 1 - indirect call, 2 - direct call
//       ATTRIBUTE (uint3)
//         1 - internal linkage, 2 - dangling
//       ADDRESS_TYPE (uint1)
//         0 - code address, 1 - address delta
//       CODE_ADDRESS (uint64 or ULEB128)
//         code address or address delta, depending on ADDRESS_TYPE
//   INLINED FUNCTION RECORDS
//     A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
//     callees. Each record contains:
//       INLINE SITE
//         GUID of the inlinee (uint64)
//         Line number | Discriminator (ULEB128)
//       FUNCTION BODY
//         A FUNCTION BODY entry describing the inlined function.
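
The layout above could be decoded with a recursive reader along the following lines. This is a C++ sketch under assumptions the format description does not state: fixed-width uint64 fields are little-endian; TYPE, ATTRIBUTE, and ADDRESS_TYPE share one byte with TYPE in bits 0-3, ATTRIBUTE in bits 4-6, and ADDRESS_TYPE in bit 7; and an address delta is relative to the previously decoded probe, even across inlinee bodies. `ProbeDecoder` and `DecodedProbe` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// One probe record, flattened out of the nested FUNCTION BODY layout.
struct DecodedProbe {
  uint64_t Guid;     // GUID of the function the probe originates from
  uint64_t Index;    // INDEX (probe id within that function)
  uint8_t Type;      // 0 - block probe, 1 - indirect call, 2 - direct call
  uint8_t Attribute; // ATTRIBUTE bits
  uint64_t Address;  // absolute code address after resolving deltas
  unsigned Depth;    // inline depth: 0 for the uninlined (outermost) function
};

// Hypothetical reader for the encoding described above.
class ProbeDecoder {
public:
  explicit ProbeDecoder(const std::vector<uint8_t> &Data) : Buf(Data) {}

  // Decodes one FUNCTION BODY, recursing into its INLINED FUNCTION RECORDS.
  void decodeFunctionBody(std::vector<DecodedProbe> &Out, unsigned Depth = 0) {
    uint64_t Guid = readU64();
    uint64_t NumProbes = readULEB128();
    uint64_t NumInlinees = readULEB128();
    for (uint64_t I = 0; I < NumProbes; ++I) {
      uint64_t Index = readULEB128();
      // Assumed packing: TYPE in bits 0-3, ATTRIBUTE in bits 4-6,
      // ADDRESS_TYPE in bit 7.
      uint8_t Packed = readByte();
      uint8_t Type = Packed & 0xF;
      uint8_t Attr = (Packed >> 4) & 0x7;
      bool IsDelta = (Packed >> 7) & 0x1;
      if (IsDelta) {
        // Assumed: the delta is relative to the previously decoded probe.
        assert(HasLastAddr && "address delta with no preceding probe");
        LastAddr += readULEB128();
      } else {
        LastAddr = readU64();
      }
      HasLastAddr = true;
      Out.push_back({Guid, Index, Type, Attr, LastAddr, Depth});
    }
    for (uint64_t I = 0; I < NumInlinees; ++I) {
      readU64();     // INLINE SITE: GUID of the inlinee
      readULEB128(); // INLINE SITE: Line number | Discriminator
      decodeFunctionBody(Out, Depth + 1);
    }
  }

  bool atEnd() const { return Pos == Buf.size(); }

private:
  uint8_t readByte() {
    if (Pos >= Buf.size())
      throw std::runtime_error("truncated pseudo probe data");
    return Buf[Pos++];
  }

  // Fixed-width uint64, assumed little-endian.
  uint64_t readU64() {
    uint64_t V = 0;
    for (unsigned I = 0; I < 8; ++I)
      V |= uint64_t(readByte()) << (8 * I);
    return V;
  }

  uint64_t readULEB128() {
    uint64_t V = 0;
    for (unsigned Shift = 0;; Shift += 7) {
      uint8_t B = readByte();
      if (Shift >= 64)
        throw std::runtime_error("ULEB128 value does not fit in 64 bits");
      V |= uint64_t(B & 0x7F) << Shift;
      if (!(B & 0x80))
        return V;
    }
  }

  const std::vector<uint8_t> &Buf;
  std::size_t Pos = 0;
  uint64_t LastAddr = 0;
  bool HasLastAddr = false;
};
```

Recursing on each INLINED FUNCTION RECORD mirrors the DFS walk of the inline tree, so the `Depth` field (or a stack of inline sites, if the reader kept one) is what carries each probe's inline context.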

Thanks a lot for the detailed description.

That’s a good question. During offline count processing, the samples collected on the first physical instruction following a probe are counted towards the probe. This is mostly accurate unless that physical instruction is not on the same control-flow path as the probe (e.g., when a label sits in between). The accuracy comes from the semantics of a block probe, which require the probe to be virtually executed exactly the same number of times before and after an optimization. We rely on a sophisticated counts inference tool to deal with corner cases and hardware noise.
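
The attribution rule can be sketched as a binary search over instruction addresses. This assumes, hypothetically, a sorted list of physical instruction addresses and a per-address sample map; `probeCount` is an illustrative name, and the sketch does not handle the case where a label separates the probe from the instruction, which the text leaves to the counts inference tool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Returns the count attributed to a probe recorded at ProbeAddr: the samples
// that landed on the first physical instruction at or after that address.
// InstAddrs must be sorted; Samples maps instruction address -> sample count.
uint64_t probeCount(uint64_t ProbeAddr, const std::vector<uint64_t> &InstAddrs,
                    const std::map<uint64_t, uint64_t> &Samples) {
  assert(std::is_sorted(InstAddrs.begin(), InstAddrs.end()));
  auto It = std::lower_bound(InstAddrs.begin(), InstAddrs.end(), ProbeAddr);
  if (It == InstAddrs.end())
    return 0; // no physical instruction follows the probe
  auto S = Samples.find(*It);
  return S == Samples.end() ? 0 : S->second;
}
```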

Regarding duplicated blocks, the probes are naturally distributed to the newly created blocks, and the counts collected on the duplicated probes are accumulated into the original probe. Pseudo probes prevent block merging: each probe is an intrinsic call, and blocks carrying different probes differ in those calls' arguments, so they no longer look identical. However, pseudo probes don’t prevent instruction merging.
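
Folding duplicated probes back into the original probe amounts to summing counts by probe identity. In this C++ sketch, a probe is assumed to be identified by its inline context, the GUID of its originating function, and its INDEX; `ProbeKey`, `ProbeSample`, and `accumulateCounts` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Assumed identity of an original probe: inline context, GUID, and INDEX.
struct ProbeKey {
  std::string Context;
  uint64_t Guid;
  uint64_t Index;
  bool operator<(const ProbeKey &O) const {
    return std::tie(Context, Guid, Index) < std::tie(O.Context, O.Guid, O.Index);
  }
};

// One decoded copy of a probe together with the samples attributed to it.
struct ProbeSample {
  ProbeKey Key;
  uint64_t Count;
};

// Copies created by block duplication carry the same key, so summing by key
// folds their counts back into the original probe.
std::map<ProbeKey, uint64_t> accumulateCounts(const std::vector<ProbeSample> &Copies) {
  std::map<ProbeKey, uint64_t> Totals;
  for (const ProbeSample &C : Copies) {
    uint64_t &Total = Totals[C.Key];
    assert(Total + C.Count >= Total && "probe count overflow");
    Total += C.Count;
  }
  return Totals;
}
```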

Thanks,

Hongtao