RFC: Indirect Call Promotion LLVM Pass

Hi, we've implemented an indirect call promotion llvm pass. The design
notes including examples are shown below. This pass complements the
indirect call profile infrastructure
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html

Your feedback and comments will be highly appreciated.

Thanks,
Ivan

============================================================================RFC:
Indirect Call Promotion LLVM Pass
Betul Buyukkurt and Ivan Baev

1. Introduction
Indirect call promotion (ICP) replaces an indirect call instruction to a
set of target addresses with a sequence of tests guarding direct calls to
selected targets, plus a fall through branch containing the original
indirect call. The ICP optimization is found to be the second most
profitable (after inlining) profile-based optimization in a recent study
[2].

We've implemented an ICP LLVM pass that iterates over all indirect call
sites in the module and selectively (under heuristics) performs the
promotion. Here is one example of the transformation.

--------------------ico.ll--------------------------------------------------
define void @foo(i32 %a) {
entry:
  %a1 = add i32 %a, 1
  ret void
}

define void @bar(i32 %a) {
entry:
  %a2 = add i32 %a, 2
  ret void
}

define void @main(void (i32)* %fun) {
entry:
  call void %fun(i32 10), !prof !1
  ret void
}

!1 = !{!"indirect_call_targets", i64 6000, !"foo", i64 5000, !"bar", i64 100}

Hi, we've implemented an indirect call promotion llvm pass. The design
notes including examples are shown below. This pass complements the
indirect call profile infrastructure
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html

There are issues with the profiling infrastructure proposal which will
addressed separately.

This part looks sane in general to me. See my rely inline.

We've implemented two heuristics from this paper [1].

a. Hot call site heuristic: only consider for promotion a call site which
contribution (profile count) is more than 0.1% of the total count of all
indirect calls executed during the profile runs.

b. Hot target heuristic: promote the most frequent target at a call site
if it gets at least 40% of all receivers at the call site.

Is the heuristics a || b, or a && b ?

Only the hottest target from a call site is possibly promoted, similarly
to the approach taken in the paper.

In addition to Aigner & Hölzle-based heuristics, we add an inline hint to
the promoted direct call/invoke instruction if it is the single receiver
at the call site according to the profile data or the number of times the
target is executed at the call site is at least 4% of the total count of
all indirect calls. Once the function entry profile counts become
available we will use them to tune the above inline-related heuristic.

I don't think indirectly promoted callsites should be treated any
differently from original direct callsites -- after promotion, the
direct callsites have call count info and the same inline heuristics
should apply.

  if (ptr->foo == A::foo)
to
  if (ptr->_vptr == A::_vtable)

You can do that if you know class A is final. In general, you need
type or vtable profiling to get it.

This will sink one load from the original block into the less frequently
executed if.false block. This opportunity was found by Balaram Makam.

4. New enhancement patch
-------------------------
Currently our implementation has the following shortcomings:
a. Our heuristics do not depend on the global information on function
counts. It could be that none of the indirect call sites are contributing
highly to the overall calls. Because our current implementation is
deciding what to inline based on the indirect call site sum only, it could
be inlining functions that are in essence cold when all functions in the
source base are considered. This situation will be improved when the
function entry profile counts become available in llvm IR.

Our plan is to add program level summary data for PGO. Any global
decisions need to made based on that because only relative global
hotness matters.

b. Current implementation only transforms the first hot target, the rest
of the targets are never considered even if they are relatively hot.

This is probably a good thing. Going beyond 2 can have negative effect.

We are evaluating a new solution which depends on the
presence/availability of functions counts in clang. We form a sorted
multiset of all functions counts. A given indirect target is considered
for inlining if the target’s count at the call site falls within one of
the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted
multiset. We’ve added checks which become stricter as the target count
falls farther away from the top most called 10%, 20% or 30% of all
functions respectively.

Targets that are classified as making calls to one of the top most called
30% of the functions receive inline hints. Inline hints are communicated
from clang down to LLVM in metadata. Then, on the LLVM side the
transformation pass uses the metadata field for the hint to add an inline
hint at the transformed call site.

Again, there is no need to invent indirect call (promoted) specific
inline heuristics.

thanks,

David

Hi, we've implemented an indirect call promotion llvm pass. The design
notes including examples are shown below. This pass complements the
indirect call profile infrastructure
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html

There are issues with the profiling infrastructure proposal which will
addressed separately.

-- Please send these to Betul and me.

This part looks sane in general to me. See my rely inline.

We've implemented two heuristics from this paper [1].

a. Hot call site heuristic: only consider for promotion a call site
which
contribution (profile count) is more than 0.1% of the total count of all
indirect calls executed during the profile runs.

b. Hot target heuristic: promote the most frequent target at a call site
if it gets at least 40% of all receivers at the call site.

Is the heuristics a || b, or a && b ?

-- a && b

Only the hottest target from a call site is possibly promoted, similarly
to the approach taken in the paper.

In addition to Aigner & Hölzle-based heuristics, we add an inline hint
to
the promoted direct call/invoke instruction if it is the single receiver
at the call site according to the profile data or the number of times
the
target is executed at the call site is at least 4% of the total count of
all indirect calls. Once the function entry profile counts become
available we will use them to tune the above inline-related heuristic.

I don't think indirectly promoted callsites should be treated any
differently from original direct callsites -- after promotion, the
direct callsites have call count info and the same inline heuristics
should apply.

--- Agree in general, we should live this decision to inliner, especially
when it becomes a profile-based inliner.
At ICP pass we currenty have the profile counts for indirect call sites
and their receivers (targets) and it is tempting to pass some of this
information to the inliner.

  if (ptr->foo == A::foo)
to
  if (ptr->_vptr == A::_vtable)

You can do that if you know class A is final. In general, you need
type or vtable profiling to get it.

-- It is a future enhancement. Could you please provide some more details,
in particular is it valid for C++ programs?

This will sink one load from the original block into the less frequently
executed if.false block. This opportunity was found by Balaram Makam.

4. New enhancement patch
-------------------------
Currently our implementation has the following shortcomings:
a. Our heuristics do not depend on the global information on function
counts. It could be that none of the indirect call sites are
contributing
highly to the overall calls. Because our current implementation is
deciding what to inline based on the indirect call site sum only, it
could
be inlining functions that are in essence cold when all functions in the
source base are considered. This situation will be improved when the
function entry profile counts become available in llvm IR.

Our plan is to add program level summary data for PGO. Any global
decisions need to made based on that because only relative global
hotness matters.

b. Current implementation only transforms the first hot target, the rest
of the targets are never considered even if they are relatively hot.

This is probably a good thing. Going beyond 2 can have negative effect.

-- With 2 we're getting incremental improvements, and we plan to further
tune it.

Thanks for the feedback, David.
Ivan

Hi, we've implemented an indirect call promotion llvm pass. The design
notes including examples are shown below. This pass complements the
indirect call profile infrastructure
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html

There are issues with the profiling infrastructure proposal which will
addressed separately.

This part looks sane in general to me. See my rely inline.

We've implemented two heuristics from this paper [1].

a. Hot call site heuristic: only consider for promotion a call site
which
contribution (profile count) is more than 0.1% of the total count of all
indirect calls executed during the profile runs.

b. Hot target heuristic: promote the most frequent target at a call site
if it gets at least 40% of all receivers at the call site.

Is the heuristics a || b, or a && b ?

Only the hottest target from a call site is possibly promoted, similarly
to the approach taken in the paper.

In addition to Aigner & Hölzle-based heuristics, we add an inline hint
to
the promoted direct call/invoke instruction if it is the single receiver
at the call site according to the profile data or the number of times
the
target is executed at the call site is at least 4% of the total count of
all indirect calls. Once the function entry profile counts become
available we will use them to tune the above inline-related heuristic.

I don't think indirectly promoted callsites should be treated any
differently from original direct callsites -- after promotion, the
direct callsites have call count info and the same inline heuristics
should apply.

  if (ptr->foo == A::foo)
to
  if (ptr->_vptr == A::_vtable)

You can do that if you know class A is final. In general, you need
type or vtable profiling to get it.

This will sink one load from the original block into the less frequently
executed if.false block. This opportunity was found by Balaram Makam.

4. New enhancement patch
-------------------------
Currently our implementation has the following shortcomings:
a. Our heuristics do not depend on the global information on function
counts. It could be that none of the indirect call sites are
contributing
highly to the overall calls. Because our current implementation is
deciding what to inline based on the indirect call site sum only, it
could
be inlining functions that are in essence cold when all functions in the
source base are considered. This situation will be improved when the
function entry profile counts become available in llvm IR.

Our plan is to add program level summary data for PGO. Any global
decisions need to made based on that because only relative global
hotness matters.

b. Current implementation only transforms the first hot target, the rest
of the targets are never considered even if they are relatively hot.

This is probably a good thing. Going beyond 2 can have negative effect.

We are evaluating a new solution which depends on the
presence/availability of functions counts in clang. We form a sorted
multiset of all functions counts. A given indirect target is considered
for inlining if the target’s count at the call site falls within one
of
the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted
multiset. We’ve added checks which become stricter as the target
count
falls farther away from the top most called 10%, 20% or 30% of all
functions respectively.

Targets that are classified as making calls to one of the top most
called
30% of the functions receive inline hints. Inline hints are
communicated
from clang down to LLVM in metadata. Then, on the LLVM side the
transformation pass uses the metadata field for the hint to add an
inline
hint at the transformed call site.

Again, there is no need to invent indirect call (promoted) specific
inline heuristics.

These heuristics are there to decide which indirect call targets to
promote at an indirect call site. Not necessarily an additional inline
heuristic for the promoted targets. However, we do add an inline hint
currently at the transformed call sites to help trigger the inlines at the
new call site.

>> Hi, we've implemented an indirect call promotion llvm pass. The design
>> notes including examples are shown below. This pass complements the
>> indirect call profile infrastructure
>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html
>
> There are issues with the profiling infrastructure proposal which will
> addressed separately.

-- Please send these to Betul and me.

yes.

>
> This part looks sane in general to me. See my rely inline.
>
>
>> We've implemented two heuristics from this paper [1].
>>
>> a. Hot call site heuristic: only consider for promotion a call site
>> which
>> contribution (profile count) is more than 0.1% of the total count of all
>> indirect calls executed during the profile runs.
>>
>> b. Hot target heuristic: promote the most frequent target at a call site
>> if it gets at least 40% of all receivers at the call site.
>
> Is the heuristics a || b, or a && b ?

-- a && b

>
>>
>> Only the hottest target from a call site is possibly promoted, similarly
>> to the approach taken in the paper.
>>
>> In addition to Aigner & Hölzle-based heuristics, we add an inline hint
>> to
>> the promoted direct call/invoke instruction if it is the single receiver
>> at the call site according to the profile data or the number of times
>> the
>> target is executed at the call site is at least 4% of the total count of
>> all indirect calls. Once the function entry profile counts become
>> available we will use them to tune the above inline-related heuristic.
>>
>>
>
> I don't think indirectly promoted callsites should be treated any
> differently from original direct callsites -- after promotion, the
> direct callsites have call count info and the same inline heuristics
> should apply.

--- Agree in general, we should live this decision to inliner, especially
when it becomes a profile-based inliner.
At ICP pass we currenty have the profile counts for indirect call sites
and their receivers (targets) and it is tempting to pass some of this
information to the inliner.

The inliner will get the profile information in the very near future, so
any other way to pass the info to inliner is a hack and not recommended.
Note that Value profile transformation can easily compute the branch
probability for the newly generated branch with target count and site total
count.

>>
>> if (ptr->foo == A::foo)
>> to
>> if (ptr->_vptr == A::_vtable)
>
> You can do that if you know class A is final. In general, you need
> type or vtable profiling to get it.

-- It is a future enhancement. Could you please provide some more details,
in particular is it valid for C++ programs?

It is valid. HP compiler does that. Comparing vtable profiling with call
target profiling: while it has its own downsides too. It tends to increase
the number of target values to track because of the complicated class
hierarchy. For instance, B has a member function 'foo' and is inherited by
D1, D2, D3 (not overriding), and D4 (overriding). The number of values to
track increases from 2 to 4. The transformation may also become more
complicated: if (vptr == D1_vtb || vptr == D2_vtb ....) invoke B::foo(...)
...

>
>>
>> This will sink one load from the original block into the less frequently
>> executed if.false block. This opportunity was found by Balaram Makam.
>>
>>
>> 4. New enhancement patch
>> -------------------------
>> Currently our implementation has the following shortcomings:
>> a. Our heuristics do not depend on the global information on function
>> counts. It could be that none of the indirect call sites are
>> contributing
>> highly to the overall calls. Because our current implementation is
>> deciding what to inline based on the indirect call site sum only, it
>> could
>> be inlining functions that are in essence cold when all functions in the
>> source base are considered. This situation will be improved when the
>> function entry profile counts become available in llvm IR.
>
> Our plan is to add program level summary data for PGO. Any global
> decisions need to made based on that because only relative global
> hotness matters.
>
>>
>> b. Current implementation only transforms the first hot target, the rest
>> of the targets are never considered even if they are relatively hot.
>
> This is probably a good thing. Going beyond 2 can have negative effect.
>

-- With 2 we're getting incremental improvements, and we plan to further
tune it.

David