Dehao and I have been discussing changes we need to make to SamplePGO to make it more effective.
Currently, SamplePGO is a scalar pass that limits itself to adding branch weight annotations. It runs pretty early in the pipeline, so this is fine for other scalar passes that want to use profile data (block layout and regalloc).
However, it does nothing to help module passes, notably the inliner. What Dehao has found in his experience with GCC is that in order to help the inliner, SamplePGO needs to become a module pass.
Mainly, it needs to be able to affect inlining decisions. If a branch into a call site has many samples, we want to tell the inliner about it so that it increases the inlining score for that call site.
Additionally, SamplePGO may need to actually perform some inlining before the inliner runs. This is needed to better match the samples obtained from optimized binaries. For example, suppose the binary had three functions, A(), B() and C(), all calling function foo(). When the code is executed, assume that A() has many samples (i.e., it’s hot) while B() and C() have no samples.
Also assume that foo() was originally inlined in A(), B() and C(). When SamplePGO is analyzing function A(), it will find samples for the inlined copy of foo().
At that point, SamplePGO may want to inline foo() into A()'s call site so that it can better match the samples it gets from the profile. At the same time, since B() and C() had few or no samples, it wants to mark their call sites cold so the inliner doesn’t bother with them.
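To make the A()/B()/C() example concrete, here is a rough sketch of the per-call-site decision described above. It is a toy model and not LLVM code: the names CallSiteAction, CallSite, InlinedSamples and classifyCallSite, and the two thresholds, are all hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical outcome for one call site; not an LLVM type.
enum class CallSiteAction { InlineEarly, MarkCold, LeaveToInliner };

// Hypothetical call site, identified by caller and callee names.
struct CallSite {
  std::string caller;
  std::string callee;
};

// Toy profile: samples attributed to the inlined copy of a callee,
// keyed by (caller, callee). Missing entries mean "no samples".
using InlinedSamples =
    std::map<std::pair<std::string, std::string>, std::uint64_t>;

// Decide what to do with a call site before the inliner runs.
// coldThreshold and hotThreshold are assumed tuning knobs.
CallSiteAction classifyCallSite(const CallSite &cs,
                                const InlinedSamples &profile,
                                std::uint64_t coldThreshold,
                                std::uint64_t hotThreshold) {
  assert(coldThreshold < hotThreshold && "thresholds must not overlap");
  auto it = profile.find(std::make_pair(cs.caller, cs.callee));
  std::uint64_t samples = (it == profile.end()) ? 0 : it->second;
  if (samples >= hotThreshold)
    return CallSiteAction::InlineEarly;   // e.g. foo() inlined into A()
  if (samples <= coldThreshold)
    return CallSiteAction::MarkCold;      // e.g. foo() called from B(), C()
  return CallSiteAction::LeaveToInliner;  // in between: no hint either way
}
```

With a profile that records 1000 samples for foo() inlined into A() and few or none for B() and C(), this would pick early inlining for A()'s call site and mark the other two cold.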
Chandler, is this something we can realistically do? I believe the first step would be to make SamplePGO a module pass, make sure it runs before the inliner and then we can see how we can implement the above behaviour, or some variant of it that provides the same benefit (e.g., cloning).
Something similar will be needed for devirtualization and indirect calls. Sampling exposes actual devirtualization and indirect call opportunities.
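For the indirect call case, one possible shape of the decision is sketched below: given the samples seen per target at an indirect call site, pick a target only if it clearly dominates. Again this is a toy model under my own assumptions; TargetSamples, pickPromotionTarget and minPercent are hypothetical names, not existing LLVM interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy profile for one indirect call site: samples per observed target.
using TargetSamples = std::map<std::string, std::uint64_t>;

// Returns a pointer to the name of the dominant target if it accounts for
// at least minPercent of all samples at this call site, or nullptr otherwise.
// The pointer refers into 'targets' and is valid as long as the map is.
// Overflow of the 64-bit products is ignored in this sketch.
const std::string *pickPromotionTarget(const TargetSamples &targets,
                                       unsigned minPercent) {
  assert(minPercent <= 100 && "minPercent is a percentage");
  std::uint64_t total = 0;
  std::uint64_t best = 0;
  const std::string *bestName = nullptr;
  for (const auto &entry : targets) {
    total += entry.second;
    if (entry.second > best) {
      best = entry.second;
      bestName = &entry.first;
    }
  }
  if (!bestName)
    return nullptr; // no samples at all
  // best / total >= minPercent / 100, written without division.
  if (best * 100 < total * minPercent)
    return nullptr; // no single target dominates
  return bestName;
}
```

A call site that sees 900 samples for one target and 100 for another would yield the first target at an 80% cutoff and nothing at a 95% cutoff.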