Add a function splitting pass to LLVM which extracts cold regions into their own functions

Tobias,

Thanks for taking the time to summarize all this. It's a great writeup. I'm moving the thread to llvm-dev. My responses below.

First of all some information about the RegionInfo pass:

=======================================================================
The very first paper I was looking at, when implementing it was "The Program Structure Tree: Computing Control Regions in Linear Time (1994)". Nadav already mentioned this and it is also mentioned in the "RegionInfo.h" file.

The algorithm described in this paper is very easy to implement. However, it only calculates SESE regions, where SESE means a region with a single entry and a single exit _edge_. The word _edge_ is here very important, as the algorithm misses a lot of regions that could easily be transformed to a SESE _edge_ region, but that are no SESE _edge_ regions.

Have a look at:

opt -view-regions-only -only-simple-regions test/Analysis/RegionInfo/condition_complicated.ll

All regions in this examples would not be detected by the first algorithm, because they have both multiple entry and multiple exit edges. However, transforming them to regions with a single entry and a single exit _edge_ is simple. We call regions that are not single entry single exit edge regions, but can easily transformed to such regions 'refined' regions. Simple single entry singe exit _edge_ regions are called 'simple'.

For my work on Polly, I wanted to detect both 'simple' and 'refined' regions. The paper 'The Refined Process Structure Tree - Jussi Vanhatalo, Hagen Voelyer, Jana Koehler - 2009' describes an algorithm how to detect such regions in linear time. However, unfortunately, the proposed algorithm is complex to implement (and I am not convinced it is fast in the common cases).

At that point I started an IRC discussion about what to do and Chris
suggested to look for an dominance information based algorithm. I did what he said and came up with the existing algorithm, which is, due to taking advantage of dominance information, rather simple. Unfortunately, I think I misunderstood Chris, as I later realized he did not want me to use the DominanceFrontier analysis, but just the normal DominatorTree.

So why did I not change the algorithm and remove the use of DominanceFrontier? There are two reasons:

1) It was and is not obvious for me how to do so

I did some analysis on SPEC and polyhedron.com and there are 3-8 times more 'refined' regions than 'simple' regions [1]. For the use in Polly, I wanted to retain the ability to optimize 'refined' regions.

Unfortunately, until today I was not aware of an algorithm that has linear complexity, is in practice fast and can calculate 'refined' regions.

2) For many use cases the existing solution works perfectly fine

The RegionInfo analysis is obviously non-linear as it uses the DominanceFrontier analysis. However, for me it works surprisingly well.
When I committed it I tested it on the LLVM test suite, polyhedron.com and the SPEC 2006 benchmarks [1]. I got the following timings (in seconds):

Name DomTree PostDomTree DomFrontier RegionTree
SPEC 2006 1.109 0.911 0.525 0.662
Polyhedron.com 0.034 0.029 0.016 0.022

On these examples DomFrontier and RegionTree calculation is very fast. Also, I used RegionInfo for over two years in my daily Polly work. And I know people who test Polly on their larger internal test suites. I got _never_ any complains about RegionInfo speed issues. So RegionInfo is for now good enough for my work.

I also heard that RegionInfo is used in some commercial compilers, but did not hear of any performance problems there.

Reason 2) does not mean I am not in favor of improving RegionInfo. Very much in contrast, I support any work that makes RegionInfo useable on the LLVM machine cfgs or that allows it to be used as a clang default analysis pass. I just raised that point to explain why I did not spend a large amount of time on 1)

That's interesting. It looks like PostDomTree is the only thing we're really concerned with pulling in. If RegionInfo can be further improved, that's great, but I'm confident that any algorithm that does what you're doing will require PostDomTree.

As a side note, in a perfectly efficient compiler, we should only have to pay for either LoopInfo or Dominators, not both, since one can be efficiently computed from the other. However, I think we're always stuck computing PostDomTree from scratch whenever we need to determine control dependence, or regions based on it.

Andrew:
> I'm dismayed that you depend on RegionInfo, but it is understandable. > We could probably improve RegionInfo to be more efficient and
> self-contained (no DomFrontier), but I think we'd be much better off
> dropping RegionInfo and handling SEME regions. AFAIK CodeExtractor
> can already handle that. Is there any particular reason you need
> RegionInfo other than finding a connected set of blocks dominated by > a cold block? Can FunctionSplitter just find its own regions?

Andrew, can you define what you mean by SEME region? As written above, RegionInfo already detects SEME regions with a single entry block and multiple exit edges that terminate at the same block. What is the _exact_ kind of SEME region you are talking about? Does it require single entry edge or do you allow various entry edges if they terminate at the same basic block?

Also, is there a good algorithm available to detect such regions. I am very much in favor of replacing the existing RegionInfo algorithm with something faster _and_ more generic.

I was calling your regions SESE, otherwise I need a new acronym!

I'm not considering the number of entry or exit edges, because restricting them to end in the same block gives you the same properties as a single block entry/exit AFAICT.

I'm not aware of a better algorithm for building RegionInfo when one of the goals is to represent nested control dependence.

I don't know if we need a standalone SEME RegionInfo pass. Beyond LoopInfo, I don't know what criteria would be used to pick the regions, but I haven't read the papers you mention below.

Some passes, like FunctionSplitting, may have ad-hoc profile-driven region formation logic. For them, the Region API is just a way to package a SEME-block set of connected blocks.

Also, supporting SEME regions with one entry block would be great
as this would make the LoopInfo tree a subtree of the RegionTree. So
a LoopTree would be a simple RegionTree that is already available by default. Also, the code extractor

Yes, exactly.

Chandler:
> I'm dismayed that RegionInfo is still in the tree and not being
> improved. ;] I think we should fix RegionInfo to be efficient and
> reasonably well implemented. It as *already* handling SEME regions
> except for the "ME" case of function returns (something that is
> trivial to fix).

I explained above, why RegionInfo is good enough for our cases and why improving it is not straightforward. Advice how to make it better is highly welcome.

I would have to do a lot more homework before I could suggest improving RegionInfo. DomFrontiers seemed unnecessary, but if you say it's an efficient and easy to reason about way of handling CFGs that aren't already neatly nested, then I believe you.

The problem I was pointing out is that FunctionSplitting does not need to be informed by control dependence. So it's pulling in a costly analysis (PostDominators), and the only thing it gets in return is an unnecessary limitation. FunctionSplitting should be able to work on any region that LoopInfo can detect.

Andrew:

There are two issues then, Region API vs. Region discovery.

A convenient API for visiting and traversing regions is nice, though by
design it's never needed in LLVM. A problem with the existing RegionInfo
API is that it is superficially limited to SESE--though you say
otherwise.

Not exactly SESE, but yes. I don't see a reason to widen the API to a more generic SEME region, in case there is an algorithm to detect such regions.

Andrew:

That may be right for some clients, but then those clients
likely don't need the region discovery that it provides. They can
probably get by with finding control equivalent regions, which is
trivial if you have postdominators--walk down in the dom tree and up in
the posdom tree at the same time.

Are you referring to the SESE regions as defined in that paper:
"The Program Structure Tree: Computing Control Regions in Linear Time"

I'm not referring to a specific paper. I was suggesting that a pass could probably directly query control dependence without building RegionInfo. But if the only expensive thing in RegionInfo is PostDominators, then it just comes down to the interface that's most natural.

As explained above, they have a very low resolution.

Andrew:

Maybe you just want a neat Region iterator API utility that could be
used by anyone doing region discovery, including RegionInfo. But you
wouldn't need to pull in RegionInfo analysis.

Sounds like a good idea.

Chandler:
> It would be better to have such clarification in documentation,
> ideally in the header files of these passes. This is a very "meta"
> point, but it is frustrating to contributors to have arbitrary
> restrictions materialize only after designing an optimization. I
> would rather that people are aware of the constraints they need to
> work within if they want to implement a particular optimization pass.
> Essentially, if there are analyses or passes which are known to be
> unacceptable for the normal compilation pipeline, I think that would
> be an important thing to mention in the high-level comments for the
> pass. (It is possible that I missed such comments, or that I never
> looked in the right place, but I feel like the fact that dom-frontier
> and region-info is essentially on the chopping block should be more
> clear than it currently is... ;]

True. We should clearly document which passes are acceptable in the default optimization chain.

One (hopefully obvious) rule of thumb. When looking at the cost/benefit of a new transformation, the compile time must include time spent in analysis passes that now get pulled in at a new point in the standard pass order. If you're pulling in a new Analysis, you need to be able to show that it solves an important problem for you in the most reasonably efficient way.

There may be some analysis passes that are currently in use but are out-of-favor and we would like to remove. This is a collective "we", so my strategy is to look at commit logs and ask around. There are many cases in which I would have benefitted from more documentation. Stale docs hurt though.

-Andy

Tobias,

Thanks for taking the time to summarize all this. It's a great writeup. I'm moving the thread to llvm-dev. My responses below.

[..]

Name DomTree PostDomTree DomFrontier RegionTree
SPEC 2006 1.109 0.911 0.525 0.662
Polyhedron.com 0.034 0.029 0.016 0.022

On these examples DomFrontier and RegionTree calculation is very fast. Also, I used RegionInfo for over two years in my daily Polly work. And I know people who test Polly on their larger internal test suites. I got _never_ any complains about RegionInfo speed issues. So RegionInfo is for now good enough for my work.

I also heard that RegionInfo is used in some commercial compilers, but did not hear of any performance problems there.

Reason 2) does not mean I am not in favor of improving RegionInfo. Very much in contrast, I support any work that makes RegionInfo useable on the LLVM machine cfgs or that allows it to be used as a clang default analysis pass. I just raised that point to explain why I did not spend a large amount of time on 1)

That's interesting. It looks like PostDomTree is the only thing we're really concerned with pulling in. If RegionInfo can be further improved, that's great, but I'm confident that any algorithm that does what you're doing will require PostDomTree.

I thought the reason DomFrontier is a problem was the worst case quadratic complexity. I have never seen a case triggering that behavior, but I must admit I did not specifically look for it.

As a side note, in a perfectly efficient compiler, we should only have to pay for either LoopInfo or Dominators, not both, since one can be efficiently computed from the other. However, I think we're always stuck computing PostDomTree from scratch whenever we need to determine control dependence, or regions based on it.

Yes, probably the PostDomtree or something equally expensive is necessary. I believe in case a transformation needs detailed control regions, we need to see if the benefits it gives are worth the added analysis overhead. In case we can get around calculating detailed control regions, we should definitely try to go this way.

Andrew:

I'm dismayed that you depend on RegionInfo, but it is understandable.> We could probably improve RegionInfo to be more efficient and
self-contained (no DomFrontier), but I think we'd be much better off
dropping RegionInfo and handling SEME regions. AFAIK CodeExtractor
can already handle that. Is there any particular reason you need
RegionInfo other than finding a connected set of blocks dominated by> a cold block? Can FunctionSplitter just find its own regions?

Andrew, can you define what you mean by SEME region? As written above, RegionInfo already detects SEME regions with a single entry block and multiple exit edges that terminate at the same block. What is the _exact_ kind of SEME region you are talking about? Does it require single entry edge or do you allow various entry edges if they terminate at the same basic block?

Also, is there a good algorithm available to detect such regions. I am very much in favor of replacing the existing RegionInfo algorithm with something faster _and_ more generic.

I was calling your regions SESE, otherwise I need a new acronym!

I'm not considering the number of entry or exit edges, because restricting them to end in the same block gives you the same properties as a single block entry/exit AFAICT.

Yes. It is trivial to transform between them, but it is a lot easier to detect SESE regions if they have just a single entry and exit edge.

I'm not aware of a better algorithm for building RegionInfo when one of the goals is to represent nested control dependence.

Even without. I did not find a way to check for two given basic blocks, if they form a SESE region without using DomFrontier (just using DomTree and PostDomTree). It is trivial if there is a single entry and a single exit edge, but for multiple edges (ending at the same block) I could not find a good solution.

(Solving this problem, would remove the requirement for DomFrontier in RegionInfo)

I don't know if we need a standalone SEME RegionInfo pass. Beyond LoopInfo, I don't know what criteria would be used to pick the regions, but I haven't read the papers you mention below.
Some passes, like FunctionSplitting, may have ad-hoc profile-driven region formation logic. For them, the Region API is just a way to package a SEME-block set of connected blocks.

Also, supporting SEME regions with one entry block would be great
as this would make the LoopInfo tree a subtree of the RegionTree. So
a LoopTree would be a simple RegionTree that is already available by default. Also, the code extractor

Yes, exactly.

Chandler:

I'm dismayed that RegionInfo is still in the tree and not being
improved. ;] I think we should fix RegionInfo to be efficient and
reasonably well implemented. It as *already* handling SEME regions
except for the "ME" case of function returns (something that is
trivial to fix).

I explained above, why RegionInfo is good enough for our cases and why improving it is not straightforward. Advice how to make it better is highly welcome.

I would have to do a lot more homework before I could suggest improving RegionInfo. DomFrontiers seemed unnecessary, but if you say it's an efficient and easy to reason about way of handling CFGs that aren't already neatly nested, then I believe you.

I said it's good enough for my uses. Different tools may have different requirements.

The problem I was pointing out is that FunctionSplitting does not need to be informed by control dependence. So it's pulling in a costly analysis (PostDominators), and the only thing it gets in return is an unnecessary limitation. FunctionSplitting should be able to work on any region that LoopInfo can detect.

Sure. If FunctionSplitting can be written without using the RegionInfo analysis, avoiding the (re)calculation of it is the way to go. To me it seems RegionInfo gives more details. It can e.g. give separate regions for the if/else parts of a condition. If this level of detail is needed, should to be evaluated. For this it seems to be good to define generic regions that can either be calculated based on LoopInfo or on RegionInfo. Like this the same pass can be used with different analysis depending on how much details are required.

As explained above, they have a very low resolution.

Andrew:

Maybe you just want a neat Region iterator API utility that could be
used by anyone doing region discovery, including RegionInfo. But you
wouldn't need to pull in RegionInfo analysis.

Sounds like a good idea.

Chandler:

It would be better to have such clarification in documentation,
ideally in the header files of these passes. This is a very "meta"
point, but it is frustrating to contributors to have arbitrary
restrictions materialize only after designing an optimization. I
would rather that people are aware of the constraints they need to
work within if they want to implement a particular optimization pass.
Essentially, if there are analyses or passes which are known to be
unacceptable for the normal compilation pipeline, I think that would
be an important thing to mention in the high-level comments for the
pass. (It is possible that I missed such comments, or that I never
looked in the right place, but I feel like the fact that dom-frontier
and region-info is essentially on the chopping block should be more
clear than it currently is... ;]

True. We should clearly document which passes are acceptable in the default optimization chain.

One (hopefully obvious) rule of thumb. When looking at the cost/benefit of a new transformation, the compile time must include time spent in analysis passes that now get pulled in at a new point in the standard pass order. If you're pulling in a new Analysis, you need to be able to show that it solves an important problem for you in the most reasonably efficient way.

Yes, that's why I want to the RegionInfo to give high detail results. They are costly, but allow us to measure the benefit of a high resolution. When pulling passes in the default optimization chain, it obviously needs to be understood if the high resolution is necessary or if a faster, lower resolution analysis is the better option.

There may be some analysis passes that are currently in use but are out-of-favor and we would like to remove. This is a collective "we", so my strategy is to look at commit logs and ask around. There are many cases in which I would have benefitted from more documentation. Stale docs hurt though.

Maybe we should just add your 'rule of thumb' to some documentation and mention the PostDom or RegionInfo pass as an example.

Tobi