Hi, we've implemented an indirect call promotion llvm pass. The design
notes including examples are shown below. This pass complements the
indirect call profile infrastructure
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html
There are issues with the profiling infrastructure proposal which will
addressed separately.
This part looks sane in general to me. See my rely inline.
We've implemented two heuristics from this paper [1].
a. Hot call site heuristic: only consider for promotion a call site which
contribution (profile count) is more than 0.1% of the total count of all
indirect calls executed during the profile runs.
b. Hot target heuristic: promote the most frequent target at a call site
if it gets at least 40% of all receivers at the call site.
Is the heuristics a || b, or a && b ?
Only the hottest target from a call site is possibly promoted, similarly
to the approach taken in the paper.
In addition to Aigner & Hölzle-based heuristics, we add an inline hint to
the promoted direct call/invoke instruction if it is the single receiver
at the call site according to the profile data or the number of times the
target is executed at the call site is at least 4% of the total count of
all indirect calls. Once the function entry profile counts become
available we will use them to tune the above inline-related heuristic.
I don't think indirectly promoted callsites should be treated any
differently from original direct callsites -- after promotion, the
direct callsites have call count info and the same inline heuristics
should apply.
if (ptr->foo == A::foo)
to
if (ptr->_vptr == A::_vtable)
You can do that if you know class A is final. In general, you need
type or vtable profiling to get it.
This will sink one load from the original block into the less frequently
executed if.false block. This opportunity was found by Balaram Makam.
4. New enhancement patch
-------------------------
Currently our implementation has the following shortcomings:
a. Our heuristics do not depend on the global information on function
counts. It could be that none of the indirect call sites are contributing
highly to the overall calls. Because our current implementation is
deciding what to inline based on the indirect call site sum only, it could
be inlining functions that are in essence cold when all functions in the
source base are considered. This situation will be improved when the
function entry profile counts become available in llvm IR.
Our plan is to add program level summary data for PGO. Any global
decisions need to made based on that because only relative global
hotness matters.
b. Current implementation only transforms the first hot target, the rest
of the targets are never considered even if they are relatively hot.
This is probably a good thing. Going beyond 2 can have negative effect.
We are evaluating a new solution which depends on the
presence/availability of functions counts in clang. We form a sorted
multiset of all functions counts. A given indirect target is considered
for inlining if the target’s count at the call site falls within one of
the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted
multiset. We’ve added checks which become stricter as the target count
falls farther away from the top most called 10%, 20% or 30% of all
functions respectively.
Targets that are classified as making calls to one of the top most called
30% of the functions receive inline hints. Inline hints are communicated
from clang down to LLVM in metadata. Then, on the LLVM side the
transformation pass uses the metadata field for the hint to add an inline
hint at the transformed call site.
Again, there is no need to invent indirect call (promoted) specific
inline heuristics.
thanks,
David