RFC - Making SamplePGO a module pass

Dehao and I have been discussing changes we need to make to SamplePGO to make it more effective.

Currently, SamplePGO is a scalar pass that limits itself to adding branch weight annotations. It runs pretty early in the pipeline, so this is fine for other scalar passes that want to use profile data (block layout and regalloc).

However, it does nothing to help module passes, notably the inliner. What Dehao has found in his experience with GCC is that in order to help the inliner, SamplePGO needs to become a module pass.

Mainly, it needs to be able to affect inlining decisions. If a branch into a call site has many samples, we want to tell the inliner about it so it increases the inlining score for that call site.
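
To make the idea concrete, here is a minimal sketch of how a sample count could feed into the inliner's decision for one call site. Everything in it is hypothetical: the struct, the function name and the numeric values are made up for illustration and are not LLVM APIs or actual defaults. It only shows the shape of "many samples raises the score".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical knobs; none of these names or values come from LLVM.
struct InlineParams {
  int BaseThreshold = 225;         // assumed cost threshold for any call site
  int HotBonus = 100;              // assumed extra allowance for hot call sites
  uint64_t HotSampleCount = 1000;  // assumed cutoff for "many samples"
};

// Returns the cost threshold the inliner would compare against for a call
// site whose enclosing block collected `Samples` samples in the profile.
int thresholdForCallSite(uint64_t Samples, const InlineParams &P) {
  assert(P.BaseThreshold >= 0 && P.HotBonus >= 0);
  return Samples >= P.HotSampleCount ? P.BaseThreshold + P.HotBonus
                                     : P.BaseThreshold;
}
```

A real implementation would more likely scale the bonus with the sample count than use a single cutoff; the step function just keeps the sketch short.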

Additionally, SamplePGO may need to actually perform some inlining before the inliner runs. This is needed to better match the samples obtained from optimized binaries. For example, suppose the binary had 3 functions A(), B() and C() all calling function foo(). When the code is executed, assume that A() has many samples (i.e., it’s hot) while B() and C() have no samples.

Also assume that foo() was originally inlined in A(), B() and C(). When SamplePGO is analyzing function A(), it will find samples for the inlined copy of foo().

At that point, SamplePGO may want to inline foo() at A()'s call site so that it can better match the samples it gets from the profile. At the same time, since B() and C() had few or no samples, it wants to mark the respective call sites cold so the inliner doesn’t bother with them.
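
The A()/B()/C() example could be sketched roughly as below. This is only an illustration under stated assumptions: the profile layout, the type and function names, and the two cutoffs are all hypothetical, and inlined copies are keyed by callee name alone, whereas a real profile would have to identify the specific call site.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

enum class CallSiteAction { InlineEarly, MarkCold, LeaveToInliner };

// Samples attributed to one function, including samples that landed in
// copies of callees that were inlined into it in the profiled binary.
struct FunctionProfile {
  uint64_t BodySamples = 0;
  std::map<std::string, uint64_t> InlinedCalleeSamples;  // callee -> samples
};

using Profile = std::map<std::string, FunctionProfile>;

// Decide what to do with the call from Caller to Callee before the regular
// inliner runs. HotCutoff and ColdCutoff are assumed tuning parameters.
CallSiteAction classifyCallSite(const Profile &Prof, const std::string &Caller,
                                const std::string &Callee, uint64_t HotCutoff,
                                uint64_t ColdCutoff) {
  auto CallerIt = Prof.find(Caller);
  if (CallerIt == Prof.end())
    return CallSiteAction::MarkCold;  // caller never showed up in the profile

  const FunctionProfile &FP = CallerIt->second;
  auto CalleeIt = FP.InlinedCalleeSamples.find(Callee);
  uint64_t InlinedSamples =
      CalleeIt == FP.InlinedCalleeSamples.end() ? 0 : CalleeIt->second;

  // The profile has a hot inlined copy of the callee: replay that inlining
  // so the samples can be matched against the inlined body.
  if (InlinedSamples >= HotCutoff)
    return CallSiteAction::InlineEarly;

  // Few or no samples anywhere in the caller: tell the inliner not to bother.
  if (FP.BodySamples + InlinedSamples <= ColdCutoff)
    return CallSiteAction::MarkCold;

  return CallSiteAction::LeaveToInliner;
}
```

With a profile where A() has many samples in its inlined copy of foo() and B() and C() have none, this returns InlineEarly for A()'s call site and MarkCold for the other two.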

Chandler, is this something we can realistically do? I believe the first step would be to make SamplePGO a module pass, make sure it runs before the inliner and then we can see how we can implement the above behaviour, or some variant of it that provides the same benefit (e.g., cloning).

Something similar will be needed for devirtualization and indirect calls. Sampling exposes actual devirtualization and indirect call opportunities.

Thanks. Diego.

Dehao and I have been discussing changes we need to make to SamplePGO to
make it more effective.

Currently, SamplePGO is a scalar pass that limits itself to adding branch
weight annotations. It runs pretty early in the pipeline, so this is fine
for other scalar passes that want to use profile data (block layout and
regalloc).

However, it does nothing to help module passes, notably the inliner. What
Dehao has found in his experience with GCC is that in order to help the
inliner, SamplePGO needs to become a module pass.

Mainly, it needs to be able to affect inlining decisions. If a branch
into a call site has many samples, we want to tell the inliner about it so
it increases the inlining score for that call site.

Additionally, SamplePGO may need to actually perform some inlining before
the inliner runs. This is needed to better match the samples obtained from
optimized binaries. For example, suppose the binary had 3 functions A(),
B() and C() all calling function foo(). When the code is executed, assume
that A() has many samples (i.e., it's hot) while B() and C() have no
samples.

Also assume that foo() was originally inlined in A(), B() and C(). When
SamplePGO is analyzing function A(), it will find samples for the inlined
copy of foo().

At that point, SamplePGO may want to inline foo() at A()'s call site so
that it can better match the samples it gets from the profile. At the same
time, since B() and C() had few or no samples, it wants to mark the
respective call sites cold so the inliner doesn't bother with them.

Chandler, is this something we can realistically do? I believe the first
step would be to make SamplePGO a module pass, make sure it runs before the
inliner and then we can see how we can implement the above behaviour, or
some variant of it that provides the same benefit (e.g., cloning).

Context-sensitive profile matching is one of the most important features
for SamplePGO performance. If we don't make use of the information from
inline instance profiles, SamplePGO will not have a chance to match
instrumentation-based PGO performance, period. On the other hand, if the
information is used, SamplePGO has more advantages.

So the question is not whether this needs to be done, but whether using a
ModulePass is the right way to do it (it looks like it is).

Something similar will be needed for devirtualization and indirect calls.
Sampling exposes actual devirtualization and indirect call opportunities.

yes.

thanks,

David