Path forward on profile guided inlining?

I would like to start prototyping changes towards profile guided inlining. Before doing so, I wanted to get feedback from the community on the appropriate approach. I'm going to lay out a strawman proposal, but I'm open to other ideas on how to approach the problem.

Depending on what approach we settle on, I *may* be able to commit resources to actually implement this in the near term. I can't commit to too much work here, so if we settle on something which will need a lot of enabling infrastructure work first, I'm likely going to need to just hack this within my local tree. I'm hoping to avoid that, but I want to be up front about the practicalities here.

Now, on to the strawman proposal...

This proposal is intended to not be dependent on the ongoing pass manager changes by Chandler. It's intended to be compatible with that work, but not reliant on it.

Background
One of the problems we have today is that we can't ask for information about the frequency of the call site we're examining within the inlining iteration loop. We can get at this information from normal transform passes run within the outer inlining loop (CallGraphSCCPass), but the inner loop does not have the ability to keep function analysis up to date while inlining.

We have an existing piece of metadata on functions which records the function entry count. This is not currently kept up to date in any way; it records the entry count of the function without regard to inlining. This metadata was only recently added and doesn't affect optimization today.

Objective
I'm looking to be able to implement two key inlining heuristics.

The first is to recognize when we have the last remaining *actual* call to a callee. If we can prove there's a single caller today, we essentially just assume it's profitable to inline. I want to extend this logic to the case where we have a single call site which dominates the caller profile for a given callee. We might need to still keep around a copy of the callee (external, or rare callers), but the remaining callee definition will be a cold function, can be optimized for size, and can be placed far from hot code.
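A hypothetical sketch of such a dominance test, purely for illustration; the 0.9 cutoff, the names, and the signature are all assumptions, not part of the proposal:

```cpp
#include <cassert>
#include <cstdint>

// Does a single call site account for "most" of the callee's profile?
// The 90% cutoff is an illustrative assumption, not a settled heuristic.
bool dominatesCallerProfile(uint64_t SiteCount, uint64_t CalleeEntryCount,
                            double Ratio = 0.9) {
  return CalleeEntryCount > 0 &&
         (double)SiteCount / CalleeEntryCount >= Ratio;
}
```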

Second, I would like to use the frequency of the call site to adjust the inline threshold within the inliner. This basically means that we become more willing to inline hot functions (and probably less willing to inline cold ones).
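As a minimal sketch of this second heuristic (the cutoffs, scaling factors, and names below are illustrative assumptions only, since the text explicitly defers the actual heuristics):

```cpp
#include <cassert>
#include <cstdint>

// Nudge the inline threshold up for hot call sites and down for cold
// ones. All constants here are placeholders for tuning.
int adjustedInlineThreshold(int BaseThreshold, uint64_t SiteCount,
                            uint64_t HotCount, uint64_t ColdCount) {
  if (SiteCount >= HotCount)
    return BaseThreshold * 3 / 2; // hot: more willing to inline
  if (SiteCount <= ColdCount)
    return BaseThreshold / 2;     // cold: less willing to inline
  return BaseThreshold;
}
```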

At least to start with, both of these optimizations would be off by default under a flag. I figure there's a lot to be discussed here, but I'd prefer to defer that a bit. We need to get the enabling parts in place before worrying too much about the heuristics.

The Ideal Answer
If we had the pass manager changes done, we'd be able to just ask BFI to compute an estimated execution count for the call site. Both of the inliner heuristics I'm interested in could be implemented using that information combined with an up to date estimate of the function entry count.

If we had the pass manager changes, the only thing we'd need to do is update InlineFunction to subtract the estimated call count from the entry count of the remaining callee definition. This would result in the entry count of B reflecting the estimated entry count from the remaining callers of B. (Given the call site count is an estimate, this implies that entry counts are also now approximate. Given typical profiling mechanisms are lossy, that's not a big deal, but is worth noting explicitly.)
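The entry count update described above amounts to a saturating subtraction; a sketch, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>

// After inlining a call site of the callee, subtract its estimated
// count from the callee's entry count. Both numbers are estimates, so
// saturate at zero rather than underflowing.
uint64_t remainingEntryCount(uint64_t CalleeEntryCount,
                             uint64_t InlinedSiteCount) {
  return InlinedSiteCount >= CalleeEntryCount
             ? 0
             : CalleeEntryCount - InlinedSiteCount;
}
```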

(See also my note at the bottom about iteration order. I consider that somewhat out of scope for this proposal, but it affects both the ideal and practical sections herein.)

The Practical Answer
Essentially, the remainder of this is an approximation of the pass manager based solution which lets us start experimenting with the inliner changes while waiting for the pass manager. At least in theory, when we have the pass manager in place, we can simply drop most of this.

The basic idea is to use metadata to effectively cache the estimated frequency for a given call site. We can use BFI to generate/update this information once per iteration of the outer inlining loop. As a result, we only need to keep it reasonably up to date within the inner inlining loop. (See note below about alternate approaches.)

The metadata on the call site would look something like:
  call void @foo(), !prof !1
  !1 = !{!"estimated_execution_count", i64 2000}
(where 2000 is the estimated count)

We'd have a new FunctionPass which removes any existing metadata and adds new metadata which basically comes down to a product of the function's entry count and the estimated block frequency (provided by BFI) of the block containing the call site. This would run as the last FunctionPass within the outer inlining loop. Assuming that the entry counts are kept reasonably up to date, the resulting estimated call site counts should be reasonable.
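The arithmetic the pass would perform per call site can be sketched as plain code (this deliberately avoids the real BFI API; names and signature are assumptions for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Estimated call site count: the function's entry count scaled by the
// containing block's frequency relative to the entry block.
uint64_t estimatedCallSiteCount(uint64_t FnEntryCount,
                                uint64_t BlockFreq,
                                uint64_t EntryBlockFreq) {
  assert(EntryBlockFreq > 0 && "entry block frequency must be nonzero");
  return (uint64_t)((double)BlockFreq / EntryBlockFreq * FnEntryCount);
}
```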

Within the inliner itself, we'd need to update InlineFunction to keep the estimated counts in sync:
- When inlining B into A, split the estimated call counts on the calls within B between the remaining instance of B and the newly created call sites within A. I plan to split the estimated count in roughly the ratio of the entries into B. As a result, a given call site BC (originally from B into C) would be split into two sites, AC and BC, with estimated counts (AB.count/B.entry_count * OrigBC.count) and ((1 - AB.count/B.entry_count) * OrigBC.count). This does require updating the body of the callee, not just the caller, when inlining.
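The split formulas above can be sketched directly (all names are illustrative; this just mirrors the two expressions in the text):

```cpp
#include <cassert>
#include <cstdint>

struct SplitCounts {
  uint64_t NewAC; // count of the cloned site now inside A
  uint64_t NewBC; // count left on the site in the remaining copy of B
};

// Split the original B->C count in the ratio of entries into B that
// came through the inlined A->B site.
SplitCounts splitCallSiteCount(uint64_t ABCount, uint64_t BEntryCount,
                               uint64_t OrigBCCount) {
  assert(BEntryCount > 0 && ABCount <= BEntryCount);
  uint64_t NewAC =
      (uint64_t)((double)ABCount / BEntryCount * OrigBCCount);
  return {NewAC, OrigBCCount - NewAC};
}
```

With the numbers used later in the thread (AB count 200, B entry count 210, BC count 200), this yields roughly 190 for the new AC site and 10 for the remaining BC site.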

It's important to note that the resulting estimated call counts can be just flat out wrong. As the easiest example, consider a case where B called C, but only when a boolean parameter was set to true. If we split the count of BC into AC, BC and then drop the call site AC, we've essentially lost information about the frequency of the remaining BC w.r.t. any other callers of C. We could try to adjust for this by only updating calls which don't get pruned away by constant propagation within InlineFunction, but I'm not sure how worthwhile it is trying to be smart here given the information will be roughly restored once we're out of the inner loop again.

What we can assume (and thus make inlining decisions on) is that a) a given call site's count is a reasonable approximation of its actual execution count and b) the sum of the call site counts for a given callee is less than or equal to the callee's entry count. What we can't do is compare two call sites' counts for the same callee within two different functions and decide with high confidence which is more frequent.
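Invariant (b) is cheap to state as a consistency check; a sketch with illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// The known call site counts for a callee should sum to no more than
// the callee's entry count.
bool calleeCountsConsistent(const std::vector<uint64_t> &SiteCounts,
                            uint64_t CalleeEntryCount) {
  uint64_t Sum = std::accumulate(SiteCounts.begin(), SiteCounts.end(),
                                 uint64_t{0});
  return Sum <= CalleeEntryCount;
}
```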

Alternate Approaches:
1) We could just recompute BFI explicitly in the inliner right before passing the result to ICA for the purposes of prototyping. If this was off by default, this might be a reasonable scheme for investigation. This could likely never be enabled for real uses.
2) We could pre-compute BFI once per function within a given SCC and then try to keep it up to date during inlining. If we cached the call frequencies for the initial call sites, we could adjust the visit order to minimize the number of times we need to recompute a given function's block frequencies. (e.g. we can look at all the original call sites within a function before looking at newly inlined ones)

For the record, it's not clear that the bottom up inlining approach works anywhere near as well once you start using profiling information about the caller/callee pair. You would seem to end up with cases where inlining BC makes it more likely you'll inline DC, but if you visited them in the other order you wouldn't inline DC. Consider the case where BC is a hot call site, and DC is the only other caller of C. I have no plans to actually tackle this problem.

Philip, thanks for looking into this.

I would like to start prototyping changes towards profile guided inlining.
Before doing so, I wanted to get feedback from the community on the
appropriate approach. I'm going to lay out a strawman proposal, but I'm open
to other ideas on how to approach the problem.

Depending on what approach we settle on, I *may* be able to commit resources
to actually implement this in the near term. I can't commit to too much
work here, so if we settle on something which will need a lot of enabling
infrastructure work first, I'm likely going to need to just hack this within
my local tree. I'm hoping to avoid that, but I want to be up front about
the practicalities here.

Now, on to the strawman proposal...

This proposal is intended to not be dependent on the ongoing pass manager
changes by Chandler. It's intended to be compatible with that work, but not
reliant on it.

I assume what you are proposing is only for the short term, before the pass
manager work is ready?

Longer term, we do need function analysis results (BB level, Loop
level) to be available during IPA/Inliner. For instance to estimate
icache footprint using weighted size etc.

I have some preliminary data that shows the memory overhead of keeping
those data for all functions (and maintained with incremental update)
is small.

Background
One of the problems we have today is that we can't ask for information about
the frequency of the call site we're examining within the inlining iteration
loop. We can get at this information from normal transform passes run
within the outer inlining loop (CallGraphSCCPass), but the inner loop does
not have the ability to keep function analysis up to date while inlining.

We have an existing piece of metadata on functions which records the function
entry count. This is not currently kept up to date in any way; it records
the entry count of the function without regard to inlining. This metadata
was only recently added and doesn't affect optimization today.

Objective
I'm looking to be able to implement two key inlining heuristics.

The first is to recognize when we have the last remaining *actual* call to a
callee. If we can prove there's a single caller today, we essentially just
assume it's profitable to inline. I want to extend this logic to the case
where we have a single call site which dominates the caller profile for a
given callee. We might need to still keep around a copy of the callee
(external, or rare callers), but the remaining callee definition will be a
cold function, can be optimized for size, and can be placed far from hot
code.

Second, I would like to use the frequency of the call site to adjust the
inline threshold within the inliner. This basically means that we become
more willing to inline hot functions (and probably less willing to inline
cold ones).

Yes. We plan to do that too.

At least to start with, both of these optimizations would be off by default
under a flag. I figure there's a lot to be discussed here, but I'd prefer
to defer that a bit. We need to get the enabling parts in place before
worrying too much about the heuristics.

The Ideal Answer
If we had the pass manager changes done, we'd be able to just ask BFI to
compute an estimated execution count for the call site. Both of the inliner
heuristics I'm interested in could be implemented using that information
combined with an up to date estimate of the function entry count.

If we had the pass manager changes, the only thing we'd need to do is update
InlineFunction to subtract the estimated call count from the entry count of
the remaining callee definition. This would result in the entry count of B
reflecting the estimated entry count from the remaining callers of B.
(Given the call site count is an estimate, this implies that entry counts
are also now approximate. Given typical profiling mechanisms are lossy,
that's not a big deal, but is worth noting explicitly.)

Yes. BB count is computed from entry_count and BB freq; call count is
obtained from BB count. Two things need to be updated during inlining:

1) the entry count of the remaining out-of-line instance of the callee
2) the BB frequency of the inline instance of the callee

see also http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083943.html

(See also my note at the bottom about iteration order. I consider that
somewhat out of scope for this proposal, but it affects both the ideal and
practical sections herein.)

The Practical Answer
Essentially, the remainder of this is an approximation of the pass manager
based solution which lets us start experimenting with the inliner changes
while waiting for the pass manager. At least in theory, when we have the
pass manager in place, we can simply drop most of this.

The basic idea is to use metadata to effectively cache the estimated
frequency for a given call site. We can use BFI to generate/update this
information once per iteration of the outer inlining loop. As a result, we
only need to keep it reasonably up to date within the inner inlining loop.
(See note below about alternate approaches.)

The metadata on the call site would look something like:
  call void @foo(), !prof !1
  !1 = !{!"estimated_execution_count", i64 2000}
(where 2000 is the estimated count)

Can you use the call graph to keep this information instead?

We'd have a new FunctionPass which removes any existing metadata and adds
new metadata which basically comes down to a product of the function's entry
count and the estimated block frequency (provided by BFI) of the block
containing the call site. This would run as the last FunctionPass within
the outer inlining loop. Assuming that the entry counts are kept reasonably
up to date, the resulting estimated call site counts should be reasonable.

Within the inliner itself, we'd need to update InlineFunction to keep the
estimated counts in sync:
- When inlining B into A, split the estimated call counts on the calls
within B between the remaining instance of B and the newly created call
sites within A. I plan to split the estimated count in roughly the ratio of
the entries into B. As a result, a given call site BC (originally from B
into C) would be split into two sites, AC and BC, with estimated counts
(AB.count/B.entry_count * OrigBC.count) and ((1 - AB.count/B.entry_count) *
OrigBC.count). This does require updating the body of the callee, not just
the caller, when inlining.

This scaling scheme assumes context-insensitivity. See your example
below which it does not apply. However this is no better way.

It's important to note that the resulting estimated call counts can be just
flat out wrong. As the easiest example, consider a case where B called C,
but only when a boolean parameter was set to true. If we split the count of
BC into AC, BC and then drop the call site AC, we've essentially lost
information about the frequency of the remaining BC w.r.t. any other callers
of C. We could try to adjust for this by only updating calls which don't
get pruned away by constant propagation within InlineFunction, but I'm not
sure how worthwhile it is trying to be smart here given the information will
be roughly restored once we're out of the inner loop again.

Why is it wrong? Shouldn't it make the result more precise? Basically C is
never called from B if B is called from A in this case. In such a
case, edge BC's count does not need to be adjusted (though B's entry
count needs to be adjusted).

What we can assume (and thus make inlining decisions on) is that a) a given
call site's count is a reasonable approximation of its actual execution
count and b) the sum of the call site counts for a given callee is less
than or equal to the callee's entry count. What we can't do is compare two
call sites' counts for the same callee within two different functions and
decide with high confidence which is more frequent.

Alternate Approaches:
1) We could just recompute BFI explicitly in the inliner right before
passing the result to ICA for the purposes of prototyping. If this was off
by default, this might be a reasonable scheme for investigation. This could
likely never be enabled for real uses.
2) We could pre-compute BFI once per function within a given SCC and then
try to keep it up to date during inlining. If we cached the call
frequencies for the initial call sites, we could adjust the visit order to
minimize the number of times we need to recompute a given function's block
frequencies. (e.g. we can look at all the original call sites within a
function before looking at newly inlined ones)

For the record, it's not clear that the bottom up inlining approach works
anywhere near as well once you start using profiling information about
the caller/callee pair. You would seem to end up with cases where inlining
BC makes it more likely you'll inline DC, but if you visited them in the
other order you wouldn't inline DC. Consider the case where BC is a hot call
site, and DC is the only other caller of C. I have no plans to actually
tackle this problem.

The inliner can be taught to delay inlining and use a top-down scheme. For
instance, if B has lots of incoming calls but there is only one
dominating call edge from A, it is likely better to delay inlining
decisions for call sites in B that may have large overhead until B gets
inlined.

thanks,

David

Philip, thanks for looking into this.

I would like to start prototyping changes towards profile guided inlining.
Before doing so, I wanted to get feedback from the community on the
appropriate approach. I'm going to lay out a strawman proposal, but I'm open
to other ideas on how to approach the problem.

Depending on what approach we settle on, I *may* be able to commit resources
to actually implement this in the near term. I can't commit to too much
work here, so if we settle on something which will need a lot of enabling
infrastructure work first, I'm likely going to need to just hack this within
my local tree. I'm hoping to avoid that, but I want to be up front about
the practicalities here.

Now, on to the strawman proposal...

This proposal is intended to not be dependent on the ongoing pass manager
changes by Chandler. It's intended to be compatible with that work, but not
reliant on it.

I assume what you are proposing is only for the short term, before the pass
manager work is ready?

Longer term, we do need function analysis results (BB level, Loop
level) to be available during IPA/Inliner. For instance to estimate
icache footprint using weighted size etc.

I addressed this specifically in my original email. The use of the extra metadata-based caching scheme is temporary. The inlining heuristics and updating of entry counts would not be. They'd just be driven by BFI directly.

I have some preliminary data that shows the memory overhead of keeping
those data for all functions (and maintained with incremental update)
is small.

The memory size part doesn't surprise me for at least some analysis passes. What's the runtime cost of the incremental update?

Background
One of the problems we have today is that we can't ask for information about
the frequency of the call site we're examining within the inlining iteration
loop. We can get at this information from normal transform passes run
within the outer inlining loop (CallGraphSCCPass), but the inner loop does
not have the ability to keep function analysis up to date while inlining.

We have an existing piece of metadata on functions which records the function
entry count. This is not currently kept up to date in any way; it records
the entry count of the function without regard to inlining. This metadata
was only recently added and doesn't affect optimization today.

Objective
I'm looking to be able to implement two key inlining heuristics.

The first is to recognize when we have the last remaining *actual* call to a
callee. If we can prove there's a single caller today, we essentially just
assume it's profitable to inline. I want to extend this logic to the case
where we have a single call site which dominates the caller profile for a
given callee. We might need to still keep around a copy of the callee
(external, or rare callers), but the remaining callee definition will be a
cold function, can be optimized for size, and can be placed far from hot
code.

Second, I would like to use the frequency of the call site to adjust the
inline threshold within the inliner. This basically means that we become
more willing to inline hot functions (and probably less willing to inline
cold ones).

Yes. We plan to do that too.

Good, let's collaborate. It'll get done sooner. :)

At least to start with, both of these optimizations would be off by default
under a flag. I figure there's a lot to be discussed here, but I'd prefer
to defer that a bit. We need to get the enabling parts in place before
worrying too much about the heuristics.

The Ideal Answer
If we had the pass manager changes done, we'd be able to just ask BFI to
compute an estimated execution count for the call site. Both of the inliner
heuristics I'm interested in could be implemented using that information
combined with an up to date estimate of the function entry count.

If we had the pass manager changes, the only thing we'd need to do is update
InlineFunction to subtract the estimated call count from the entry count of
the remaining callee definition. This would result in the entry count of B
reflecting the estimated entry count from the remaining callers of B.
(Given the call site count is an estimate, this implies that entry counts
are also now approximate. Given typical profiling mechanisms are lossy,
that's not a big deal, but is worth noting explicitly.)

Yes. BB count is computed from entry_count and BB freq; call count is
obtained from BB count. Two things need to be updated during inlining:

1) the entry count of the remaining out-of-line instance of the callee
2) the BB frequency of the inline instance of the callee

see also http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083943.html

This is exactly correct. I was assuming that the BFI function pass knew how to update itself (i.e. lazy rebuild), but if you wanted the inliner to explicitly keep the analysis up to date, that would work too.

(See also my note at the bottom about iteration order. I consider that
somewhat out of scope for this proposal, but it affects both the ideal and
practical sections herein.)

The Practical Answer
Essentially, the remainder of this is an approximation of the pass manager
based solution which lets us start experimenting with the inliner changes
while waiting for the pass manager. At least in theory, when we have the
pass manager in place, we can simply drop most of this.

The basic idea is to use metadata to effectively cache the estimated
frequency for a given call site. We can use BFI to generate/update this
information once per iteration of the outer inlining loop. As a result, we
only need to keep it reasonably up to date within the inner inlining loop.
(See note below about alternate approaches.)

The metadata on the call site would look something like:
  call void @foo(), !prof !1
  !1 = !{!"estimated_execution_count", i64 2000}
(where 2000 is the estimated count)

Can you use the call graph to keep this information instead?

That would be another option. Not a bad one actually. It would mean having to explicitly populate the information into the call graph by calling per-function analysis passes. Not sure how easy that is to do.

I'm also leery of relying on the call graph since I know that's going to radically change with the new pass manager. I really, really, really do not want to be blocked behind Chandler's progress.

We'd have a new FunctionPass which removes any existing metadata and adds
new metadata which basically comes down to a product of the function's entry
count and the estimated block frequency (provided by BFI) of the block
containing the call site. This would run as the last FunctionPass within
the outer inlining loop. Assuming that the entry counts are kept reasonably
up to date, the resulting estimated call site counts should be reasonable.

Within the inliner itself, we'd need to update InlineFunction to keep the
estimated counts in sync:
- When inlining B into A, split the estimated call counts on the calls
within B between the remaining instance of B and the newly created call
sites within A. I plan to split the estimated count in roughly the ratio of
the entries into B. As a result, a given call site BC (originally from B
into C) would be split into two sites, AC and BC, with estimated counts
(AB.count/B.entry_count * OrigBC.count) and ((1 - AB.count/B.entry_count) *
OrigBC.count). This does require updating the body of the callee, not just
the caller, when inlining.

This scaling scheme assumes context-insensitivity. See your example
below which it does not apply. However this is no better way.

I assume you meant "there is no better way". If so, agreed. Assuming that you're not recomputing the call count from another source at least.

It's important to note that the resulting estimated call counts can be just
flat out wrong. As the easiest example, consider a case where B called C,
but only when a boolean parameter was set to true. If we split the count of
BC into AC, BC and then drop the call site AC, we've essentially lost
information about the frequency of the remaining BC w.r.t. any other callers
of C. We could try to adjust for this by only updating calls which don't
get pruned away by constant propagation within InlineFunction, but I'm not
sure how worthwhile it is trying to be smart here given the information will
be roughly restored once we're out of the inner loop again.

Why is it wrong? Shouldn't it make the result more precise? Basically C is
never called from B if B is called from A in this case. In such a
case, edge BC's count does not need to be adjusted (though B's entry
count needs to be adjusted).

To clarify: "wrong" was meant to imply inaccurate and confusing, not "wrong" in the sense of a miscompile.

The problem is not the count at call site AC or even BC. The problem is that we can have some other call site XC.

We start with:
A calls B with call site count 200
B calls C with call site count 200 (but never when called from A)
X calls C with call site count 50
A's entry count is 200
B's entry count is 210
C's entry count is 250

After inlining AB, we get:
A calls C with call site count 0 (because we proved away the call site)
B calls C with call site count ~10 (i.e. 200 * 10/210, a small fraction of the original)
X calls C with call site count 50
A's entry count is 200
B's entry count is 10
C's entry count is 250

The information for BC is wildly inaccurate and we've misstated the ratio between call sites BC and XC. Any decision we make off this data is likely to be questionable. This isn't a correctness issue, but it definitely makes writing optimization heuristics harder.

(Note that the ability to rescale by block frequency within B solves this problem quite nicely. Thus, we want the pass manager changes!)

What we can assume (and thus make inlining decisions on) is that a) a given
call site's count is a reasonable approximation of its actual execution
count and b) the sum of the call site counts for a given callee is less
than or equal to the callee's entry count. What we can't do is compare two
call sites' counts for the same callee within two different functions and
decide with high confidence which is more frequent.

Alternate Approaches:
1) We could just recompute BFI explicitly in the inliner right before
passing the result to ICA for the purposes of prototyping. If this was off
by default, this might be a reasonable scheme for investigation. This could
likely never be enabled for real uses.
2) We could pre-compute BFI once per function within a given SCC and then
try to keep it up to date during inlining. If we cached the call
frequencies for the initial call sites, we could adjust the visit order to
minimize the number of times we need to recompute a given function's block
frequencies. (e.g. we can look at all the original call sites within a
function before looking at newly inlined ones)

For the record, it's not clear that the bottom up inlining approach works
anywhere near as well once you start using profiling information about
the caller/callee pair. You would seem to end up with cases where inlining
BC makes it more likely you'll inline DC, but if you visited them in the
other order you wouldn't inline DC. Consider the case where BC is a hot call
site, and DC is the only other caller of C. I have no plans to actually
tackle this problem.

The inliner can be taught to delay inlining and use a top-down scheme. For
instance, if B has lots of incoming calls but there is only one
dominating call edge from A, it is likely better to delay inlining
decisions for call sites in B that may have large overhead until B gets
inlined.

It can be done; agreed. Just out of scope for the moment.

The memory size part doesn't surprise me for at least some analysis passes.
What's the runtime cost of the incremental update?

I don't have such data for LLVM, but it is done in other compilers and
it is not expensive to do.

We start with:
A calls B with call site count 200
B calls C with call site count 200 (but never when called from A)
X calls C with call site count 50
A's entry count is 200
B's entry count is 210
C's entry count is 250

After inlining AB, we get:
A calls C with call site count 0 (because we proved away the call site)
B calls C with call site count ~10 (i.e. 200 * 10/210, a small fraction of the original)
X calls C with call site count 50
A's entry count is 200
B's entry count is 10
C's entry count is 250

The information for BC is wildly inaccurate and we've misstated the ratio
between call sites BC and XC. Any decision we make off this data is likely
to be questionable. This isn't a correctness issue, but it definitely
makes writing optimization heuristics harder.

That is what I tried to say. The hotness of the call site B->C depends
on the context: when B is called from A, C is rarely called, but when B
is called by X, C will be called, even though B is mostly called by A.
As in:

A()
{
  for (i = 0; i < 200; i++)
    B(0);
}
X()
{
  for (i = 0; i < 10; i++)
    B(20);
}
B(int n)
{
  for (i = 0; i < n; i++)
    C();
}
int main() { A(); X(); }

The simple scaling cannot handle this. The solution in this case is
that if the compiler knows the exact count of A->C after inlining B,
then instead of doing simple scaling to compute the remaining count of
B->C as 200*10/210, it can do a more precise update by subtracting:
count(B->C) = original_count(B->C) - count(A->C)
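The two updates David contrasts can be sketched side by side with the thread's numbers (names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Context-insensitive scaling: the remaining B->C keeps the share of
// B's entries that did not come from the inlined A->B edge.
uint64_t scaledRemainingBC(uint64_t OrigBC, uint64_t BEntry,
                           uint64_t ABCount) {
  return OrigBC * (BEntry - ABCount) / BEntry;
}

// Precise update, usable when the exact post-inlining count of the new
// A->C site is known: subtract it from the original B->C count.
uint64_t preciseRemainingBC(uint64_t OrigBC, uint64_t ExactACCount) {
  return OrigBC - ExactACCount;
}
```

With OrigBC = 200, BEntry = 210, ABCount = 200, scaling leaves about 9 on B->C, while the subtraction-based update (the A->C call was proven away, so its exact count is 0) correctly leaves 200.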

(Note that the ability to rescale by block frequency within B solves this
problem quite nicely. Thus, we want the pass manager changes!)

How can that help? If BB frequency is not properly updated, we will
have the same problem.

thanks,

David

Hi Chandler, I'd like to hear your thoughts about this interim solution
to make global profile information available to the inliner.

thanks,

David

(Resending after removing llvmdev@cs.uiuc.edu and using llvm-dev@lists.llvm.org)

This makes great sense to me. I am looking forward to the patch --
this is critical for PGO performance tuning.

thanks,

David

Given I didn’t get any response to my original query, I chose not to invest time in this at the time. I am unlikely to get time for this in the near future.

The problem here is that the old pass manager does not have a place/way to cache the function level analysis. You could do this within the inliner itself, but the function passes run by the SCC pass manager will invalidate everything. That’s the problem I was trying to solve. This sounds like you’re essentially planning to extend the old pass manager with function level analysis support in the SCC pass manager. I advise against this given the fact that Chandler is actively working on the new one and getting patches reviewed for the old one is “challenging” at best. If you take this approach, make sure Chandler - who is pretty much the only active reviewer in this area - is on-board with your approach before you start.

Given I didn't get any response to my original query, I chose not to invest
time in this at the time. I am unlikely to get time for this in the near
future.


Hi Philip,

Is there any update on this? I've been sending patches to get rid of the
callee hotness based inline hints from the frontend and move the logic to
the inliner. The next step is to use the callsite hotness instead. I also
want to focus on the infrastructure to enable this and what I've been
experimenting with is similar to your two alternative approaches:

Alternate Approaches:
1) We could just recompute BFI explicitly in the inliner right before
passing the result to ICA for the purposes of prototyping. If this was off
by default, this might be a reasonable scheme for investigation. This could
likely never be enabled for real uses.
2) We could pre-compute BFI once per function within a given SCC and then
try to keep it up to date during inlining. If we cached the call
frequencies for the initial call sites, we could adjust the visit order to
minimize the number of times we need to recompute a given function's block
frequencies. (e.g. we can look at all the original call sites within a
function before looking at newly inlined ones)

My proposal is very similar (perhaps identical) to your option 2 above. I
don't understand the part where you talk about adjusting the visit order to
minimize BFI computation.

BFI computation: BFI for a function is computed on demand and cached.

The problem here is that the old pass manager does not have a place/way to
cache the function level analysis. You could do this within the inliner
itself, but the function passes run by the SCC pass manager will invalidate
everything. That's the problem I was trying to solve.

Easwaran's proposal does not rely on the analysis pass framework of
the legacy pass manager -- that is the key idea. BFI info is designed
in a way such that it can be computed on-demand.

BFI for the caller will be incrementally updated during inlining, but
after inlining is done for a node, the BFI info will be explicitly
invalidated for that node due to the function passes run later by SCC
pass manager. This is expected.

Update: When 'bar' gets inlined into 'foo', the BFI for 'foo' is updated.
Suppose OldBB in 'bar' gets cloned as NewBB in 'foo'. NewBB's block frequency
can be incrementally computed from OldBB's block frequency, the entry block
frequency of 'bar', and the frequency of the block containing the 'foo' ->
'bar' callsite. Even when the new CGSCC-level BFI analysis is in place, this
incremental update is useful to minimize computation.

Invalidation: Once inlining is completed in an SCC (at the end of
runOnSCC), the entries for functions in that SCC are invalidated since other
passes run by the CGSCC pass manager (including those run by the function
pass manager run under CGSCC pass manager) might affect the computed BFI for
the functions in the SCC.

This sounds like you're essentially planning to extend the old pass manager
with function level analysis support in the SCC pass manager. I advise
against this given the fact that Chandler is actively working on the new one
and getting patches reviewed for the old one is "challenging" at best. If
you take this approach, make sure Chandler - who is pretty much the only
active reviewer in this area - is on-board with your approach before you
start.

There is no plan to extend the old pass manager for this capability. The
proposed solution is intended to work with the new pass manager too,
so there is no conflict here. The BFI book-keeping by the inliner can be
removed once the pass manager change is in place.

David


Just to make sure, you're aware this will cost at least one re-computation per iteration in CGPassManager::runOnModule right? If that's expected and acceptable, then yes, the approach you seem to be describing should be workable.

Particularly as an intermediate step to parallelize work with the new pass manager, this sounds very much worthwhile.


Thanks for the clarification.


Just to make sure, you're aware this will cost at least one re-computation
per iteration in CGPassManager::runOnModule right? If that's expected and
acceptable, then yes, the approach you seem to be describing should be
workable.

For each leaf node, the number of BFI computation is <= 1. For other
nodes, it is <=2 (one as caller, one as callee) -- in other words, it
is linear to the number of functions.

Particularly as an intermediate step to parallelize work with the new pass
manager, this sounds very much worthwhile.

Yes, that is exactly the intention -- it can buy pass manager rework
more time while unlock lots of pending performance tuning work we plan
to do.

thanks,

David

Clearly I haven’t done a good job explaining my proposal. I don’t propose extending the pass manager to provide function level analysis. I proposed to do something like this:

/// \brief Get BlockFrequencyInfo for a function.
BlockFrequencyInfo *BlockFrequencyAnalysis::getBlockFrequencyInfo(Function *F) {
  // BFM is a map from Function * to BlockFrequencyInfo * and caches
  // previously computed results.
  auto Iter = BFM.find(F);
  if (Iter != BFM.end())
    return Iter->second;

  // We need to create a BlockFrequencyInfo object for F and store it.
  // DominatorTree, LoopInfo and BranchProbabilityInfo can be thrown
  // away after BFI is computed.

  // First create a dominator tree.
  DominatorTree DT;
  DT.recalculate(*F);

  // Then get a LoopInfo object.
  LoopInfo LI;
  LI.analyze(DT);

  // Then compute BranchProbabilityInfo.
  BranchProbabilityInfo BPI;
  BPI.calculate(*F, LI);

  // Finally compute BlockFrequencyInfo.
  BlockFrequencyInfo *BFI = new BlockFrequencyInfo;
  BFI->calculate(*F, BPI, LI);
  BFM[F] = BFI;
  return BFI;
}

This recomputes all analyses needed to compute BFI and doesn't depend on the analysis framework. This is likely more computation (there is no benefit when a dependent analysis is preserved), and it is used only by the inliner rather than shared with other passes like a true analysis. The invalidation needs to be done only once per function, after it ceases to be a caller in inlining. Within the inliner itself, the BFI data is incrementally updated.

Thanks,
Easwaran


With PGO, the cost can be greatly reduced. For instance, if the
function entry count is zero (in practice, a large percentage of
functions fall into this category), do not bother to do frequency
analysis -- such functions will be optimized for size.

David


For each leaf node, the number of BFI computation is <= 1. For other
nodes, it is <=2 (one as caller, one as callee) -- in other words, it
is linear to the number of functions.

This is not true.

In CGPassManager::runOnModule, we will iterate on the *same* SCC multiple times if we're making progress (i.e. devirtualizing call sites) on each iteration. The BFI will be invalidated on each iteration. This is capped at 4 iterations, so it's still semi-linear, but only because there's a cap.

Examples where this might kick in:
typedef void (*func_ptr)();
void bar() {}
func_ptr foo() { return &bar; }
void caller() {
   func_ptr f = foo();
   f();
}

We will visit caller twice so that we can inline both foo and bar.

I think this iteration is okay for what you're proposing - i.e. off by default, migration step, etc - but it's important to understand the implications of the planned changes.

From: "Easwaran Raman via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Philip Reames" <listmail@philipreames.com>
Cc: llvm-dev@lists.llvm.org, "Xinliang David Li" <davidxl@google.com>
Sent: Thursday, December 10, 2015 7:03:50 PM
Subject: Re: [llvm-dev] [LLVMdev] Path forward on profile guided inlining?


I think this makes sense, as an incremental step until the new pass manager arrives (and we can actually use these things in the proper way). Given that I suggested this myself in D15289 earlier, I suppose I'd better be okay with it :wink:

-Hal

From: "Philip Reames via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Xinliang David Li" <davidxl@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, December 10, 2015 7:12:50 PM
Subject: Re: [llvm-dev] [LLVMdev] Path forward on profile guided inlining?


I think it will be interesting to see if this causes a net increase or decrease in compile time. Turning off inlining in regions known (suspected) to be cold should decrease compile time overall, and given that the CodeGen parts involved for large functions are really pretty expensive, I think it is still an open question how this will turn out.

-Hal


Yes we will monitor this. One thing we know for sure is that PGO can
shrink overall size a lot -- this can translate into a compile-time win
(given smaller average function bodies).

David


Right. I would guess on average only a small portion of all SCCs can
trigger such iterative compilation.

David

I think you would need a far more complex example to trigger the revisit: GVN or another pass would need to actually remove a call; the inliner doing its job won't trigger a reiteration at this level.