[RFC] Target-specific parametrization of function inliner

Hi,

I propose to make function inliner parameters adjustable for a specific target.

Currently, the function inlining pass appears to be target-agnostic, with the various constants used for calculating call cost hardcoded. While this works reasonably well for general-purpose CPUs, some quirkier targets like NVPTX would benefit from target-specific tuning.

It appears that two things need to be done:

  • add Inliner preferences to TargetTransformInfo in a way similar to how we customize loop unrolling, and use them to provide the inliner with target-specific thresholds and other parameters (see the sketch below).
  • augment the Inliner pass to use the existing TargetTransformInfo API to figure out the cost of a particular call on a given target. TargetTransformInfo already has getCallCost(), though nothing appears to use it.
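
For the first item, a rough sketch of the shape I have in mind, mirroring how TTI::UnrollingPreferences works today (all names below are hypothetical; nothing like this exists in TargetTransformInfo yet):

  // Hypothetical sketch only -- modeled on TTI::UnrollingPreferences.
  struct InliningPreferences {
    int Threshold;         // base inline threshold for this target
    int CallPenalty;       // extra cost of leaving a call in place
    bool AllowHugeGrowth;  // e.g. GPUs may tolerate large code growth
  };

  // Targets would override this to tweak the defaults, analogous to
  // getUnrollingPreferences(Loop *, UnrollingPreferences &).
  void getInliningPreferences(CallSite CS, InliningPreferences &IP) const;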

Comments? Concerns? Suggestions?

Thanks,

From: "Artem Belevich via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Tuesday, March 1, 2016 6:31:06 PM
Subject: [llvm-dev] [RFC] Target-specific parametrization of function inliner

Hi Art,

I've long thought that we should have a more principled way of doing inline profitability. There is obviously some cost to executing a function body, some call site overhead, and some cost reduction associated with any post-inlining simplifications. If inlining reduces the overall call site cost by more than some factor, say 1% (this should probably depend on the optimization level), then we should inline. With profiling information, we might even use global speedup instead of local speedup.
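
As a sketch of that criterion (purely illustrative numbers and names, not actual LLVM code):

  // Inline when the estimated local speedup exceeds some factor,
  // e.g. 1%, with the factor itself tied to the optimization level.
  bool shouldInline(double CallSiteOverhead, double CalleeBodyCost,
                    double SimplificationSavings, double MinSpeedup) {
    double CostNotInlined = CallSiteOverhead + CalleeBodyCost;
    double CostInlined = CalleeBodyCost - SimplificationSavings;
    double LocalSpeedup = (CostNotInlined - CostInlined) / CostNotInlined;
    return LocalSpeedup > MinSpeedup; // e.g. MinSpeedup = 0.01
  }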

Whether we need a target customization of this threshold, or just a way for a target to supplement the fine inlining decision, is unclear to me. It is also true that the result of a bunch of locally-optimal decisions might be far from the global optimum. Maybe the target has something to say about that?

In short, I'm fine with what you're proposing, but to the extent possible, I want the numbers provided by the target to mean something. Replacing a global set of somewhat-arbitrary magic numbers with target-specific sets of somewhat-arbitrary magic numbers should be our last choice.

Thanks again,
Hal

IMO, the appropriate thing for TTI to inform the inliner about is how costly the actual act of a “call” is likely to be. I would hope that this would only be used on targets where there is some really dramatic overhead of actually doing a function call such that the code size cost incurred by inlining is completely dwarfed by the improvements. GPUs are one of the few platforms that exhibit this kind of behavior, although I don’t think they’re truly unique, just a common example.

This isn’t quite the same thing as the cost of the call instruction, which has much more to do with the size. Instead, it has to do with the expected consequences of actually leaving a call edge in the program.

To me, this pretty accurately reflects the TTI hook we have for customizing loop unrolling where the cost of having a cyclic CFG is modeled to help indicate that on some targets (also GPUs) it is worth a very large amount of code size growth to simplify the control flow in a particular way.

Does that make sense to you, Hal? Based on that, it would really just be a scaling factor for the inline heuristics. I'm not sure how to express this construct more scientifically.
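
Concretely, that scaling factor could be as small a hook as something like this (hypothetical; no such TTI hook exists today):

  // Hypothetical TTI hook: purely a scaling factor applied to the
  // existing inline heuristics, defaulting to "no change".
  unsigned getInliningThresholdMultiplier() const { return 1; }

  // A target like NVPTX, where leaving a call edge is very expensive,
  // would return a larger value, and the inline cost analysis would do:
  //   Threshold *= TTI.getInliningThresholdMultiplier();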

-Chandler

IMO, a good inliner with a precise cost/benefit model will eventually need what Art is proposing here.

Take function call overhead as an example. It depends on several factors: 1) call/return instruction latency; 2) function epilogue/prologue; 3) calling convention (argument passing, whether registers are used, which register classes, etc.). All these factors depend on target information. If we want to go deeper, we know certain microarchitectures use a stack of call/return pairs to help branch prediction of ret instructions -- such a stack has a target-specific limit, which can be exceeded when a callsite is deep in the call chain. Register file size and the register pressure increase due to inlining are another example.

Another relevant example is the icache/itlb sizes. A more precise analysis of the cost to 'speed' due to increased icache/itlb pressure requires target information, profile information, as well as some global analysis. Easwaran has done some research in this area in the past and can share the analysis design when other things are ready.

From 10,000 feet, the LLVM inliner implements a size-based heuristic: if
the inline instance's size*/cost after simplification via propagating the
call context (actually the relative size -- the callsite cost is
subtracted from it) is smaller than a threshold (adjusted from a base
value), then the callsite is considered an inline candidate. In most
cases, the decision is made locally due to the bottom-up order (there are
tweaks to bypass it). The size/cost can be remotely tied to, and serve as
a proxy for, the real runtime cost due to icache/itlb effects, but it
seems the size/threshold scheme is mainly used to model the runtime
speedup vs. compile time/binary size tradeoffs.
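
In pseudocode, that decision is roughly (a paraphrase of the scheme
described above, not the actual InlineCost.cpp logic; names invented):

  // "Relative size": callee cost after call-context simplification,
  // minus the cost of the call site itself.
  int Cost = simplifiedCalleeCost(Callee, CS) - callSiteCost(CS);
  // Threshold adjusted from a base value by hints, opt level, etc.
  int Threshold = adjustThreshold(BaseThreshold, CS);
  bool InlineCandidate = Cost < Threshold;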

Setting aside what we need longer term for the inliner, the GPU-specific
problems can be addressed by
1) if the call overhead is really large, define a target-specific
getCallCost and subtract it from the initial Cost when analyzing a
callsite (this will help all targets with high call costs)
2) if not, but GPU users can instead tolerate large code growth, then it
is better to do this by adjusting the threshold -- perhaps have a
user-level option -finline-limit=?
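
A rough sketch of option 1), assuming a target-specific call cost is
available through the existing getCallCost() interface (helper names and
argument details below are illustrative):

  // Start the callsite analysis "in credit" by the target's call cost,
  // then accumulate the usual size/cost of the callee body.
  int Cost = -TTI.getCallCost(Callee, CS.arg_size());
  Cost += analyzeCalleeBody(Callee, CS);   // illustrative helper
  // A GPU target reporting a very large call cost is then willing to
  // inline a correspondingly larger body at every callsite.
  bool InlineCandidate = Cost < Threshold;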

thanks,

David

* some target-dependent info may be used: TTI.getUserCost

From: "Xinliang David Li" <davidxl@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Artem Belevich" <tra@google.com>, "llvm-dev"
<llvm-dev@lists.llvm.org>, "chandlerc" <chandlerc@gmail.com>,
"Easwaran Raman" <eraman@google.com>
Sent: Thursday, March 10, 2016 11:00:30 AM
Subject: Re: [llvm-dev] [RFC] Target-specific parametrization of
function inliner

IMO, a good inliner with a precise cost/benefit model will eventually
need what Art is proposing here.

Giving the function call overhead as an example. It depends on a
couple of factors: 1) call/return instruction latency; 2) function
epilogue/prologue; 3) calling convention (argument parsing, using
registers or not, what register classes etc). All these factors
depend on target information. If we want go deeper, we know certain
micro architectures uses a stack of call/return pairs to help branch
prediction of ret instructions -- such stack has a target specific
limit which can be triggered when a callsite is deep in the
callchain. Register file size and register pressure increase due to
inline comes as another example.

Another relevant example is the icache/itlb sizes. To do a more
precise analysis of the cost to 'speed' due to icache/itlb pressure
increase requires target information, profile information as well as
some global analysis. Easwaran has done some research in this area
in the past and can share the analysis design when other things are
ready.

I don't know what you mean by "when other things are ready", but what you say above sounds exactly right. I'm certainly curious what Easwaran has found.

Generally, there seem to be two categories here:

1. Locally decidable issues, for which there are (or can be) good static heuristics (call latencies, costs associated with parameter passing, stack spilling, etc.)
2. Globally decidable issues, like reducing the number of pages consumed by temporally-correlated hot code regions - profiling data is likely necessary for good decision-making (although it might be possible to make a reasonable function-local threshold based on page size without it)

and then there are things like icache/itlb effects due to multiple applications running simultaneously, for which profiling might help, but are also policy-level decisions over which users may need more-direct control.

> Hi Art,

> I've long thought that we should have a more principled way of
> doing
> inline profitability. There is obviously some cost to executing a
> function body, some call site overhead, and some cost reduction
> associated with any post-inlining simplifications. If inlining
> reduces the overall call site cost by more than some factor, say 1%
> (this should probably depend on the optimization level), then we
> should inline. With profiling information, we might even use global
> speedup instead of local speedup.

Yes -- with target-specific cost information, global speedup analysis
can be more precise. :)

> Whether we need a target customization of this threshold, or just a
> way for a target to supplement the fine inlining decision, is
> unclear to me. It is also true that a the result of a bunch of
> locally-optimal decisions might be far from the global optimum.
> Maybe the target has something to say about that?

The concept of a threshold can be a topic for another discussion. In the
current design, I think the threshold should remain target-independent.
It is the cost that is target-specific.

That's fine, but the units are important here. Having a target-independent threshold in terms of, roughly, instruction count makes little sense. How instruction count correlates with either performance or code size is highly target-specific (although it is certainly closer for code size). That, however, is roughly what our TTI.getUserCost gives us. Having target-independent thresholds like % speedup (e.g. inlining should be done when the speedup is > some %) or code-size thresholds (e.g. functions spanning more than 4 KB are bad) makes sense.

-Hal

From: "Xinliang David Li" <davidxl@google.com>
To: "Chandler Carruth" <chandlerc@google.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Artem Belevich"
<tra@google.com>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, March 10, 2016 12:34:07 PM
Subject: Re: [llvm-dev] [RFC] Target-specific parametrization of
function inliner

> From 10,000 feet, the LLVM inliner implements a size-based heuristic:
> if the inline instance's size*/cost after simplification via
> propagating the call context (actually the relative size -- the
> callsite cost is subtracted from it) is smaller than a threshold
> (adjusted from a base value), then the callsite is considered an inline
> candidate. In most cases, the decision is made locally due to the
> bottom-up order (there are tweaks to bypass it). The size/cost can be
> remotely tied to, and serve as a proxy for, the real runtime cost due
> to icache/itlb effects, but it seems the size/threshold scheme is
> mainly used to model the runtime speedup vs. compile time/binary size
> tradeoffs.

Yes, it kind of gets this. But sometimes not very well.

Part of the problem is that we try to make local decisions to control global code size. I understand why we do this, but we shouldn't. The local metric should purely be based on local speedup. That metric itself can be modulated by a global heuristic to control costs from itlb/icache misses, etc. - and there's not too much we can do here without profiling data with temporal correlation information.

> Setting aside what we need longer term for the inliner, the
> GPU-specific problems can be addressed by
> 1) if the call overhead is really large, define a target-specific
> getCallCost and subtract it from the initial Cost when analyzing a
> callsite (this will help all targets with high call costs)

Yes, this makes sense.

> 2) if not, but GPU users can instead tolerate large code growth, then
> it is better to do this by adjusting the threshold -- perhaps have a
> user-level option -finline-limit=?

Providing a user option makes sense, but as I said in the other e-mail, we should phrase it in terms of something not in arbitrary units. I think that % speedup makes the most sense. As it turns out, inlining a large function will probably produce a low speedup (especially if we have partial inlining where we only inline the hot regions), so this ends up correlated with code size too.

-Hal

Other than the call cost itself, I've been surprised that the TTI is not more involved when it comes to this tradeoff: instructions don't have the same cost depending on the platform (oh, this operation is not legal on this type and will be expanded into multiple instructions in SDAG, too bad...).
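
For example (illustrative only, not how the inline cost analysis works
today), the per-instruction accounting could ask TTI to price each
instruction instead of charging a flat constant, so that an operation the
target will have to expand could count for more:

  // Let TTI price each instruction rather than using the inliner's flat
  // per-instruction constant (InlineConstants::InstrCost).
  for (Instruction &I : instructions(*Callee))
    Cost += TTI.getUserCost(&I);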

From: "Mehdi Amini via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Xinliang David Li" <davidxl@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, April 1, 2016 2:26:27 PM
Subject: Re: [llvm-dev] [RFC] Target-specific parametrization of
function inliner

> Other than the call cost itself, I've been surprised that the TTI is
> not more involved when it comes to this tradeoff: instructions don't
> have the same cost depending on the platform (oh, this operation is not
> legal on this type and will be expanded into multiple instructions in
> SDAG, too bad...).

I think that doing this was intended, but we've not done it yet (as we did for the throughput model used for vectorization). I think we should (I also think we should combine the cost models so that we have a single model that returns multiple kinds of costs (throughput, size, latency, etc.)).
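
As a sketch of what such a combined model might return (a hypothetical
API, not something TTI exposes today):

  // Hypothetical combined cost query; today these are separate notions
  // spread across different TTI entry points.
  struct Costs {
    unsigned Throughput; // reciprocal throughput (the vectorizer's view)
    unsigned CodeSize;   // size after lowering (the inliner's/-Os view)
    unsigned Latency;    // critical-path latency
  };
  Costs getInstructionCosts(const Instruction *I) const;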

-Hal

------------------------------

*From: *"Xinliang David Li" <davidxl@google.com>
*To: *"Hal Finkel" <hfinkel@anl.gov>
*Cc: *"Artem Belevich" <tra@google.com>, "llvm-dev" <
llvm-dev@lists.llvm.org>, "chandlerc" <chandlerc@gmail.com>, "Easwaran
Raman" <eraman@google.com>
*Sent: *Thursday, March 10, 2016 11:00:30 AM
*Subject: *Re: [llvm-dev] [RFC] Target-specific parametrization of
function inliner

IMO, a good inliner with a precise cost/benefit model will eventually need
what Art is proposing here.

Giving the function call overhead as an example. It depends on a couple of
factors: 1) call/return instruction latency; 2) function epilogue/prologue;
3) calling convention (argument parsing, using registers or not, what
register classes etc). All these factors depend on target information. If
we want go deeper, we know certain micro architectures uses a stack of
call/return pairs to help branch prediction of ret instructions -- such
stack has a target specific limit which can be triggered when a callsite is
deep in the callchain. Register file size and register pressure increase
due to inline comes as another example.

Another relevant example is the icache/itlb sizes. To do a more precise
analysis of the cost to 'speed' due to icache/itlb pressure increase
requires target information, profile information as well as some global
analysis. Easwaran has done some research in this area in the past and can
share the analysis design when other things are ready.

> I don't know what you mean by "when other things are ready", but what
> you say above sounds exactly right. I'm certainly curious what Easwaran
> has found.

By readiness, I mean the basic infrastructure support (such as making
profile data available to the inliner) and related tuning based on simple
heuristics. The former will be revisited very soon. That work will be
useful to set up a good baseline before we start engaging in more
sophisticated analysis.

> Generally, there seem to be two categories here:
>
> 1. Locally decidable issues, for which there are (or can be) good
> static heuristics (call latencies, costs associated with parameter
> passing, stack spilling, etc.)
> 2. Globally decidable issues, like reducing the number of pages
> consumed by temporally-correlated hot code regions - profiling data is
> likely necessary for good decision-making (although it might be
> possible to make a reasonable function-local threshold based on page
> size without it)

Program-level static analysis needs to be combined with profile data to
form independent or nested hot regions (with cache/tlb reuse). For global
inlining decisions, there won't be a single budget; the decision will
depend heavily on the region nest the callsite sits in.

Another side effect is that we may need a more flexible inlining order
(based on the net benefit of inlining a callsite) than the current
bottom-up scheme.

> and then there are things like icache/itlb effects due to multiple
> applications running simultaneously, for which profiling might help,
> but are also policy-level decisions over which users may need
> more-direct control.

Some thread-level profiling may also be useful in guiding the decision --
i.e., using information about the shared instruction footprint across
different threads, e.g. programs running heterogeneous threads vs.
programs with homogeneous worker threads running identical code.

thanks,

David

------------------------------

*From: *"Mehdi Amini via llvm-dev" <llvm-dev@lists.llvm.org>
*To: *"Xinliang David Li" <davidxl@google.com>
*Cc: *"llvm-dev" <llvm-dev@lists.llvm.org>
*Sent: *Friday, April 1, 2016 2:26:27 PM
*Subject: *Re: [llvm-dev] [RFC] Target-specific parametrization of
function inliner

IMO, the appropriate thing for TTI to inform the inliner about is how
costly the actual act of a "call" is likely to be. I would hope that this
would only be used on targets where there is some really dramatic overhead
of actually doing a function call such that the code size cost incurred by
inlining is completely dwarfed by the improvements. GPUs are one of the few
platforms that exhibit this kind of behavior, although I don't think
they're truly unique, just a common example.

This isn't quite the same thing as the cost of the call instruction,
which has much more to do with the size. Instead, it has to do with the
expected consequences of actually leaving a call edge in the program.

To me, this pretty accurately reflects the TTI hook we have for
customizing loop unrolling where the cost of having a cyclic CFG is modeled
to help indicate that on some targets (also GPUs) it is worth a very large
amount of code size growth to simplify the control flow in a particular way.

From 10000 foot, the LLVM inliner implements a size based heuristic : if
the inline instance's size*/cost after simplification via propagating the
call context (actually the relative size -- the callsite cost is subtracted
from it), is smaller than a threshold (adjusted from a base value), then
the callsite is considered an inline candidate. In most cases, the decision
is made locally due to the bottom-up order (there are tweaks to bypass it).
  The size/cost can be remotely tied and serves a proxy to represent the
real runtime cost due to icache/itlb effect, but it seems the
size/threshold scheme is mainly used to model the runtime speedup vs
compile time/binary size tradeoffs.

Other than the call cost itself, I've been surprised that the TTI is not
more involved when it comes to this tradeoff: instructions don't have the
same tradeoff depending on the platform (oh this operation is not legal on
this type and will be expanded in multiple instructions in SDAG, too bad..).

> I think that doing this was intended, but we've not done it yet (as we
> did for the throughput model used for vectorization). I think we should
> (I also think we should combine the cost models so that we have a
> single model that returns multiple kinds of costs (throughput, size,
> latency, etc.)).

Yes -- the time/speedup estimate should be independent of the
size-increase estimate.

David