[RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hello,

My team and I are looking to do some enhancements in the partial inliner in opt. Would appreciate any feedback that folks might have.

Partial Inlining in LLVM opt

Summary

Background

Currently, the partial inliner searches the first few blocks of the callee and looks for a branch to the return block (i.e. an early return). If found, it attempts to outline the rest of the slow (or heavy) code so that the inliner can inline the fast (or light) code. If no early return is found, the partial inliner gives up. As far as I can tell, BlockFrequency and BranchProbability information is only used when attempting to inline the early-return code, and is not used to determine whether to outline the slow code.
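
For illustration (names made up), this is the kind of callee shape the current heuristic handles:

int callee(int *P) {
  if (P == nullptr)       // early return found in the first few blocks;
    return -1;            // this "fast" path gets inlined at call sites
  // ... large amount of "heavy" code, outlined into its own function ...
  return doHeavyWork(P);  // made-up helper
}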

Proposed changes

In addition to looking for early returns, we should utilize profile information to outline blocks that are considered cold. If we can sufficiently reduce the size of the original function via this type of outlining, the inliner should be able to inline the remaining hot code.

Details

With profile information present, we have a view of what code is infrequently executed and can make better decisions about what to outline. Early-return blocks that are infrequently executed should still be included as candidates for outlining, but will be treated just like any other cold blocks. Without profiling information, however, we should remain conservative and only partially inline in the presence of an early return in the first few blocks of a function (i.e. peel the early return out of the function).
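
For example (helper names made up), with profile data or an annotation such as __builtin_expect, the error path below would become a candidate for outlining even though it is not an early return:

void process(int *Data, int N) {
  // ... hot setup code ...
  if (__builtin_expect(Data == nullptr, 0)) {
    // Cold: executed on a tiny fraction of profiled runs.
    reportError("null input");   // made-up helper
    return;
  }
  // ... hot processing code we would like the inliner to see ...
  consume(Data, N);              // made-up helper
}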

To find cold regions to outline, we will traverse the CFG looking for edges deemed ‘cold’ and examine the blocks dominated by the cold edge’s successor. If the successor has more than one predecessor, we will skip this candidate, as there should be a node dominating this successor that has a single entry point. The last node in the dominance vector should also have a single successor. If it does not, then further investigation of the CFG is necessary to see when/how this situation occurs.
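
A rough sketch of that search in LLVM-style C++ (isColdEdge is a placeholder predicate; a possible implementation is sketched under ‘Tuning’ below):

SmallVector<SmallVector<BasicBlock *, 8>, 4>
findColdRegions(Function &F, DominatorTree &DT,
                BranchProbabilityInfo &BPI, BlockFrequencyInfo &BFI) {
  SmallVector<SmallVector<BasicBlock *, 8>, 4> Regions;
  for (BasicBlock &BB : F) {
    for (BasicBlock *Succ : successors(&BB)) {
      if (!isColdEdge(&BB, Succ, BPI, BFI))  // placeholder, see Tuning
        continue;
      // Only single-entry candidates: the cold successor must be reachable
      // through this edge alone.
      if (!Succ->getSinglePredecessor())
        continue;
      // The candidate region is everything dominated by the cold successor.
      SmallVector<BasicBlock *, 8> Region;
      for (BasicBlock &Cand : F)
        if (DT.dominates(Succ, &Cand))
          Region.push_back(&Cand);
      // (The single-successor check on the region's last block is omitted.)
      Regions.push_back(std::move(Region));
    }
  }
  return Regions;
}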

We will need several heuristics to make sure we only outline in cases where we are confident it will result in a performance gain. Things such as the threshold at which a branch is considered cold, the minimum number of times the predecessor node has to be executed for an edge to be considered (a confidence factor), and the minimum size of the region to be outlined (we can use the inlining cost analysis as we currently do) will require some level of tuning.
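
Concretely, these knobs could surface as internal options along these lines (names and defaults are hypothetical, purely for illustration):

// Hypothetical tuning knobs (names and defaults made up).
static cl::opt<unsigned> ColdBranchRatio(
    "partial-inline-cold-branch-ratio", cl::init(10), cl::Hidden,
    cl::desc("Branches taken at most this percent of the time are cold"));

static cl::opt<unsigned> MinPredecessorCount(
    "partial-inline-min-pred-count", cl::init(100), cl::Hidden,
    cl::desc("Minimum execution count of the predecessor block before we "
             "trust the edge profile (confidence factor)"));

static cl::opt<unsigned> MinOutlineRegionCost(
    "partial-inline-min-region-cost", cl::init(20), cl::Hidden,
    cl::desc("Minimum estimated cost of a region to be worth outlining"));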

Similar to the current implementation, we will attempt to inline the leftover (hot) parts of the code, and if for some reason we cannot then we discard the modified function and its outlined code.

Code changes

The current partial inlining code first clones the function of interest and looks for a single set of blocks to outline. It then creates a new function from that set of blocks, and saves the outlined function and outline call-site information as part of the function cloning container. In order to outline multiple regions of the function, we will need to change these containers to keep track of a list of regions to outline. We will also need to update the cost analysis to take into account multiple outlined functions.
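
As a rough sketch, the bookkeeping might grow from a single outlined function plus call site into something like the following (these structs and field names are hypothetical, not the current code):

// Hypothetical bookkeeping for multiple outlined regions per cloned function.
struct OutlinedRegionInfo {
  SmallVector<BasicBlock *, 8> Blocks; // blocks of the cold region in the clone
  Function *OutlinedFn = nullptr;      // function produced by the extractor
  CallInst *OutlinedCall = nullptr;    // call to OutlinedFn left in the clone
  int OutliningCost = 0;               // estimated size moved out of the clone
};

struct FunctionCloneInfo {
  Function *OrigFn = nullptr;
  Function *ClonedFn = nullptr;
  SmallVector<OutlinedRegionInfo, 4> Regions; // was: a single outlined region
};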

When a ProfileSummary is available, we should skip the code that looks for early returns and go into the new code that looks for cold regions to outline. When a ProfileSummary is not available, we can fall back to the existing code and look for early returns only.

Tuning

  • The outlining heuristics will need to determine if a set of cold blocks is large enough to warrant the overhead of a function call. We also don’t want the inliner to attempt to inline the outlined code later.
  • The threshold for determining whether a block is cold will also need to be tuned. In the case that profiling information is not accurate, we will pay the price of the additional call overhead for executing cold code.
  • The confidence factor, which can be viewed as the minimum number of times the predecessor has to be executed in order for an edge to be considered cold, should also be taken into account to avoid outlining code paths we have little information on. A rough sketch of such a check follows this list.
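
For illustration, the cold-edge test could combine the threshold and the confidence factor roughly as follows (the helper name is hypothetical, reusing the options sketched above):

// Sketch of the cold-edge test (assumed helper, not existing code).
static bool isColdEdge(BasicBlock *Pred, BasicBlock *Succ,
                       BranchProbabilityInfo &BPI, BlockFrequencyInfo &BFI) {
  // Threshold: the edge is taken at most ColdBranchRatio percent of the time.
  auto ColdProb =
      BranchProbability::getBranchProbability(ColdBranchRatio, 100);
  if (BPI.getEdgeProbability(Pred, Succ) > ColdProb)
    return false;

  // Confidence factor: only trust the edge if the predecessor block was
  // actually executed enough times in the profile.
  Optional<uint64_t> Count = BFI.getBlockProfileCount(Pred);
  return Count && *Count >= MinPredecessorCount;
}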

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu@ca.ibm.com

Hi Graham,

Have you looked at using the existing outliner framework to implement partial inlining? In the past, people have combined outlining + inlining to get partial inlining “for free” (see “Function outlining and partial inlining” by Zhao and Amaral).

There is the MachineOutliner. River Riddle has been working on an IR-level outliner as well. The MachineOutliner doesn’t currently consider the heat of a block, but might be worth experimenting with. It might be good to experiment with River’s work and the existing outlining pass to see if this functionality can effectively fall out of what we have already.

Here’s the recent IR-level outlining thread for reference wrt River’s work: http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html

- Jessica

Hi Jessica,

Thanks for the feedback.

I believe the existing partial inliner pass does use some common utilities like the code extractor to do outlining. Is that what you’re referring to? I don’t recall seeing any other passes that do outlining other than the MachineOutliner, but I may have missed something.

I briefly looked at River’s RFC and it seems he’s mainly utilizing outlining as a tool to reduce code size while maintaining code performance. Our proposal is to improve on the existing outlining scheme of the partial inliner to give us more opportunities to inline hot code. We will only do this type of general outlining in the presence of profiling or user data (such as __builtin_expect). With that said, I am interested in how River plans to handle live ranges and in the heuristics he will use when profiling data is available.

I haven’t looked at the machine outliner yet, but my understanding is it also prioritizes code size reduction over performance gains. Is my assumption correct? Do you think there’s some tuning we can do in the machine outliner to tailor for performance improvements?

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu@ca.ibm.com


Hi Graham,

Making partial inlining more general is something worth doing. Regarding your implementation plan, I have some suggestions here:

*) Function outlining introduces additional runtime cost: passing of live-in values, returning of live-out values (via memory), glue code in the caller to handle regions without a single exit block, etc. The cost analysis needs to factor those in carefully (see the sketch after this list).
*) Remove the limitation that there is only one outlined routine. Instead, the algorithm can compute multiple single-entry/single-exit or single-entry/multiple-exit (cold) regions in the routine, and outline each region into its own function. The benefits include:

  1. the design and implementation stay simple, and most of the existing code can be reused;
  2. more flexibility, allowing the most effective outlining;
  3. reduced runtime overhead of making calls to the outlined functions.
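
To make the first point concrete: outlining a region that defines a value used afterwards changes the caller roughly like this (purely illustrative; all names are made up):

int rarelyTaken(int N);       // made-up cold condition
int expensiveScan(int N);     // made-up heavy computation
int foo_cold_region(int N);   // made-up name for the outlined function

// Before outlining: Sum is defined on the cold path but used on the hot
// join path, so it is live out of the region.
int foo_before(int N) {
  int Sum = 0;
  if (rarelyTaken(N))
    Sum = expensiveScan(N);   // defined in the cold region...
  return Sum + N;             // ...used after it (live out)
}

// After outlining (shape only): the live-out has to come back from the
// outlined call (or through memory if there are several), and the caller
// grows glue code around the call.
int foo_after(int N) {
  int Sum = 0;
  if (rarelyTaken(N))
    Sum = foo_cold_region(N); // live-ins become arguments, live-outs become
                              // return values or stores through pointers
  return Sum + N;
}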

thanks,

David

Hey Graham,
I worked on pretty much this exact thing last year. I did something similar to what you described: I traversed the CFG and built potentially profitable regions from any given valid start node. At that point there were several roadblocks that prevented it from really going anywhere (like the inliner not preserving PGO data), but those problems should be gone by now.
Some thoughts:

: In terms of cost modeling, you will want to consider the interaction with loops:

  • What’s the impact of outlining from inside of a loop?
  • What if the hot region does not include the entry/exit, i.e. a hot loop?

: Something that would also benefit, but would require more tuning, is the relationship between size and “hotness”. If, for example, you have a function that looks like this:

foo(bool B) {
  ... // Code that prevents early return.

  if (B) {        // RegionA: taken 50% of the time.
    ...           // Very large amount of code.
  } else {        // RegionB: taken 50% of the time.
    ...           // Very small amount of code.
  }
}

In the above you have two regions that aren’t exactly cold. Outlining RegionA will make foo profitable to inline, but only if you take into account its large size. Using profile data as the sole cutoff will miss this opportunity.

: You may also want to look into hot switch statement splitting, though I don’t know if it should really go into the partial inliner. This is splitting a switch statement into multiple switch statements if there are only a few cases that are actually hot.

SWITCH1:
switch(val) {
  HOT: ...
  COLD1: ...
  COLD2: ...
  ...
  COLDN: ...
  default: ...
}

would get transformed into:

SWITCH1:
switch(val) {
  HOT: ...
  default:
    br SWITCH2
}

SWITCH2:
switch(val) {
  COLD1: ...
  COLD2: ...
  ...
  COLDN: ...
  default: ...
}


"In the above you have two regions that aren't exactly cold. Outlining RegionA will make foo profitable to inline, but only if you take into account its large size. Using profile data as the sole cutoff will miss this opportunity."

Modelling the cost/benefit for this case can be very tricky, though.

"You may also want to look into hot switch statement splitting, though I don't know if it should really go into the partial inliner. This is splitting a switch statement into multiple switch statements if there are only a few cases that are actually hot."

This is profile-based switch peeling, which is a separate optimization from partial inlining.

David


Most definitely, but it’s something to think about when moving forward.

Ah thanks, I wasn’t sure of the name.
River Riddle


Hey David,

Yes, we’ll need to consider the effect on live ranges for the regions we want to outline. In my experience, outlining live-exit regions seems to cause the most harm, as we lose the chance to keep data in registers, as you were alluding to. It’s unclear, however, what the exact effect of outlining regions with live entries would be.

I’ll probably try to avoid regions that are not single entry & single exit at least initially, to simplify the transformation and analysis. Are multi-exit regions common in your experience?

And of course, I agree, we should reuse as much of the current partial inlining infrastructure as possible. I’ll likely run some ideas by you as I begin to make changes.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu@ca.ibm.com


Hey River,

It’s good to know that you have been down this path before, hopefully I can share with you any roadblocks I’ll encounter along the way.

Your first example illustrates a very good point, where ideally we should take into account a coldness-to-size ratio. However, this type of case requires a lot of additional analysis to get right, and may not give you the most ‘bang-for-your-buck’. With that said, I think it’s worthwhile to build in some flexibility as I begin implementation in case we want to tackle something like this in the future.

The second example is probably outside the scope of what I’m trying to do, at least in the immediate term, but it is interesting nonetheless.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu@ca.ibm.com


Hi Graham,

"I haven’t looked at the machine outliner yet, but my understanding is it also prioritizes code size reduction over performance gains. Is my assumption correct? Do you think there’s some tuning we can do in the machine outliner to tailor for performance improvements?"

Yes, it’s only concerned with code size reduction. It currently doesn’t consider performance gains in the cost model at all. It also doesn’t use any profiling information. (I also don’t know if that information is preserved until the point where the MachineOutliner runs, but if it is, it would be possible to hook it in.)

If the profiling information is available, it would be possible to ignore instructions in, say, hot blocks during the preprocessing step in the MachineOutliner. Before even considering candidates for outlining, the MachineOutliner walks the module to assign instructions integer values. If it’s something we can’t outline, then we effectively ignore it. The heat of an overall block could be considered during this step as well. I’m not sure if this would be sufficient for your use-case though.
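
Very roughly, the mapping walk could take hotness into account like this (just a sketch, not the actual mapper code; all of the helpers here are stand-ins):

// Sketch only: while assigning integers to instructions, treat anything in a
// hot block the same way as an instruction that is illegal to outline, so it
// can never end up inside an outlining candidate.
void mapInstructions(MachineFunction &MF) {
  for (MachineBasicBlock &MBB : MF) {
    bool Hot = isHotBlock(MBB);           // stand-in for a BFI/PSI-based query
    for (MachineInstr &MI : MBB) {
      if (Hot || !isSafeToOutline(MI))    // stand-in for the target legality hook
        mapToIllegal(MI);                 // never becomes part of a candidate
      else
        mapToLegal(MI);                   // gets an integer in the mapped string
    }
  }
}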

- Jessica

(How) does your proposal affect the debug info for the transformed code?

-- adrian

Hi Adrian,

That’s a good question. I haven’t looked into the effect on debug information, but I imagine it won’t be any different from what the partial inliner does today (David Li might be able to answer that question better than I can). There are existing utilities that do the code extraction for outlining, so I’ll need to look into that code to see whether it also updates debug information.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu@ca.ibm.com


The impact of function outlining on debugability should be no different from other code cloning optimizations (function cloning, inlining, multi-versioning, etc).

David


Hi David,

So I’ve begun doing some implementation work on the outlining portion of the code. Currently, I have the partial inliner outlining cold regions (single entry, single exit) of the code, based solely on the existence of ProfileSummaryInfo (i.e. profiling data). However, I have some concerns about how this will co-exist with the existing code that peels early returns.

The control flow looks something like this:

// New code: find cold regions to outline
if (!computeOutliningInfoForColdRegions()) {
  // If we can't find any cold regions, then fall back to early-return peeling
  if (!computeOutliningInfo()) {
    return nullptr;
  }
}
// Try to outline the identified regions
// Then try to inline the cloned function

My concern is that during inlining, if we fail to inline the cloned function, we give up and discard all cloned and outlined functions. But with the two types of outlining we’re doing, it’s possible to attempt to inline the clone that has its cold regions outlined and, if we cannot do so, try to inline a different clone that has its early returns peeled (i.e. the way we have it today). This would require us to clone the original function twice, modifying one based on cold-region outlining and the other based on early-return peeling, with the latter being our fall-back if we fail to inline the first clone.
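
In other words, something like this (sketch only; all helper names are made up):

// Try the cold-region clone first, and fall back to the early-return-peeled
// clone if nothing gets inlined.
if (Function *ColdClone = cloneAndOutlineColdRegions(F)) {
  if (tryPartialInline(ColdClone))
    return ColdClone;
  discardCloneAndOutlinedFunctions(ColdClone);
}
if (Function *PeelClone = cloneAndPeelEarlyReturn(F)) {  // existing behaviour
  if (tryPartialInline(PeelClone))
    return PeelClone;
  discardCloneAndOutlinedFunctions(PeelClone);
}
return nullptr;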

What are your thoughts?

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu@ca.ibm.com



I expect computeOutliningInfoForColdRegions can produce a superset of the outlinable regions found by the current 'pattern matching' approach. In other words, most of the cases currently caught by 'computeOutliningInfo' should be caught by the new algorithm, so why not ditch the current 'computeOutliningInfo' completely?

My suggestion was to enhance the pass to 1) support outlining multiple regions; and 2) add a mode to do function outlining only (not the inlining part). The second is important: it can be used before the regular inliner pass. With the new pass manager and profile-aware inlining, the inliner won't undo the outlining decision, but in the meantime it becomes more powerful due to the reduced hot function size.

David


Hi David,

The only reason I can see to use the ‘pattern matching’ part as a fall-back is in case we cannot inline the (what I’m assuming would be) much bigger hot-path-only cloned function for whatever reason. What I’m assuming here is that after cold-region outlining we may still have a large portion of the original function body to attempt to inline, whereas the clone produced by the pattern-matching method will only contain a few basic blocks, giving a better chance of inlining something.

For your (2) point, I think we’ll have to be careful here. Without a sense of how ‘likely’ we’re going to inline the new function, we’ll have to make sure our outlining of cold regions will not degrade the performance of the function in 99.xx% of the cases, as it’s unclear how much performance we’ll gain from just outlining (without inlining to increase the odds of some performance gain). My initial thought was to ditch the new function and its outlined children if we cannot immediately inline it.

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu@ca.ibm.com


"The only reason I can see to use the 'pattern matching' part as a fall-back is in case we cannot inline the (what I'm assuming would be) much bigger hot-path-only cloned function for whatever reason."

With profile data, the overhead of outlining a cold region can be estimated more accurately. (With the new PM), the threshold for inlining a hot callsite is also much higher. Without profile data, the pattern-matching method won't work too well in general, even though it can enable more inlining, because the call overhead introduced to call the outlined function may outweigh the benefit of inlining the caller.

Whatever region can be found by the pattern-matching method should be identified by the new method as well. If there are multiple (but mutually exclusive) candidate regions found, the cost-analysis heuristic should pick the best candidate region for outlining.

"Without a sense of how 'likely' we're going to inline the new function, we'll have to make sure our outlining of cold regions will not degrade the performance of the function in 99.xx% of the cases."

The outlining-only mode is useful to enable more aggressive inlining in the regular inlining pass. Slightly different heuristics can be applied here. For instance, it can prefer the largest candidate region (to maximize the chance of inlining the caller). The outlined region does not need to be super cold; we can leave it to the inliner to do deeper analysis and decide to inline it right back in.

David


I second the idea that a way to outline specific function regions independently of the partial inliner sounds very useful. I am not sure, however, whether we would want a mode within the PartialInliner or something completely independent.

As a general question, does anybody have a clear idea of the constraints on the regions CodeExtractor is currently able to handle? Going through the code, it looks like the only requirement is for the header to dominate all the BBs in the region.
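
For reference, driving it seems to look roughly like this as far as I can tell (signatures from memory, so they may be slightly off):

// Blocks: the candidate region, with its single-entry header first.
CodeExtractor Extractor(Blocks, &DT, /*AggregateArgs=*/false, &BFI, &BPI);
Function *Outlined = nullptr;
if (Extractor.isEligible())
  Outlined = Extractor.extractCodeRegion();
// On success, Outlined contains the region and the original function now
// calls it instead.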


Hi David,

“Without profile, the pattern matching method won’t work too well in general even though it can enable more inlining, because the call overhead introduced to call the outlined function may outweigh the benefit of inlining the caller”

  • I’m not sure I understand why or how there’d be any additional call overhead if we partially inlined one or more early returns. The call overhead of not inlining the function at all vs. the call overhead of the outlined function should be similar, right?
  • I.e., if we partially inlined bar into foo, we’d have some early-return checks and then a function call to bar_outlined, but if we didn’t inline bar into foo at all, we’d still have the call overhead of calling bar.

“The outlining-only mode is useful to enable more aggressive inlining in the regular inlining pass. Slightly different heuristics can be applied here. For instance it can prefer the largest candidate region (to maximize the chance of inlining the caller). The outlined region does not need to be super cold; we can leave it to the inliner to do deeper analysis and decide to inline it right back in.”

  • Interesting point, I never thought of that. The partial inliner would then be a ‘function splitter’ rather than a partial inliner at that point. Maybe worthwhile to create a separate pass for this so we don’t have the partial inliner trying to do too many things.
  • So I did some digging, and it seems like the ‘regular’ inlining pass in opt is run before the partial inliner, which makes sense to me since we only want the candidates left over that the inliner could not inline (maybe due to code size). Is there another inlining pass downstream that I may have missed? Or perhaps you’re referring to inlining with ThinLTO?

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu@ca.ibm.com


Hi Kader,

I agree with you: if we were going to only do outlining for some functions and not immediately attempt to inline, it should be an independent pass. The partial inliner should do what its name suggests and attempt to inline, or bail.

I haven’t looked through the CodeExtractor at all; I imagine I’ll have to go through it at some point. I’d also be interested in something that does an analysis before code extraction and tells me how many live ranges I’m going to be killing, or how many symbols I’m going to be taking the address of, by extracting a specific region of code. I’m not sure if that currently exists.
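
Something along these lines is roughly what I have in mind as a starting point (just a sketch; the helper is made up):

// Sketch: count values flowing into / out of a candidate region, so a cost
// model could reject regions that would kill too many live ranges.
// Note: counts uses rather than unique values, and ignores function
// arguments; good enough for a rough estimate.
static void countLiveValues(const SmallPtrSetImpl<BasicBlock *> &Region,
                            unsigned &NumLiveIns, unsigned &NumLiveOuts) {
  NumLiveIns = NumLiveOuts = 0;
  for (BasicBlock *BB : Region) {
    for (Instruction &I : *BB) {
      // Operands defined outside the region become live-ins (arguments).
      for (Value *Op : I.operands())
        if (auto *OpI = dyn_cast<Instruction>(Op))
          if (!Region.count(OpI->getParent()))
            ++NumLiveIns;
      // Values used outside the region become live-outs (returned via memory).
      for (User *U : I.users())
        if (auto *UI = dyn_cast<Instruction>(U))
          if (!Region.count(UI->getParent())) {
            ++NumLiveOuts;
            break;
          }
    }
  }
}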

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu@ca.ibm.com
