proof of concept for a loop fusion pass

Hi,

We are proposing a loop fusion pass that proactively fuses loops across function call boundaries and arbitrary control flow.

http://reviews.llvm.org/D7008

With this pass, we get 103 loop fusions in SPEC CPU2006 INT 462.libquantum, with rate performance improving by close to 2.5x on x86 (results from an AMD A10-6700).

I took some liberties in patching up some of the code in ScalarEvolution, DependenceAnalysis, GlobalsModRef, and the Clang options, and also adjusted the IPO/LTO pass managers. I need to do a better job there, but I would like to put this out as a WIP for high-level review. At this time, standard loop fusion test cases may not work (cases where the control dependences are the same, where the loops are truly adjacent, etc.). I will be addressing that as well.

Appreciate suggestions and other help.

The pass initially attempts to inline calls which have loops in them. Following inlining, a separate scalar “main” pass tries to fuse loops. It is not necessary for the loops to be adjacent or to have the same control dependence. Specifically, a control dependence closure is computed, and loops that share a control dependence closure prefix are taken as candidates for fusion, provided there are no other loops on a path between them. A loop graph is built with edges between loops which share the control dependence closure prefix, and fusion is applied recursively over the graph “bottom-up”.
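A made-up illustration (not taken from libquantum or the patch) of the cross-call-boundary case the pass targets: after the calls are inlined, the two loops become fusion candidates in the caller.

```cpp
#include <cstddef>

// Each callee contains a single loop over the same range.
static void scale(int *a, const int *c, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) a[i] = c[i] * 2;
}
static void shift(int *b, const int *c, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) b[i] = c[i] + 1;
}

// Before: the loops are hidden behind call boundaries, so a purely
// intraprocedural fusion pass never sees them together.
void caller(int *a, int *b, const int *c, std::size_t n) {
  scale(a, c, n);
  shift(b, c, n);
}

// After inlining, the fusion pass can merge the two loops into one.
void callerAfterInlineAndFuse(int *a, int *b, const int *c, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) {
    a[i] = c[i] * 2;
    b[i] = c[i] + 1;
  }
}
```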

During fusion, program slices are taken based on the control-dependence closures of the two loops being fused. Example: suppose two loops A and B are going to be fused. They share the same control dependence prefix, but their suffixes differ. The suffixes are used to duplicate the control-flow paths that pass through both loops: specifically, the paths from the first block in the control-dependence closure suffix of the first loop, through the first loop’s exit, into the suffix of the second loop, through the second loop’s exit, and on to the common post-dominator of the two loops’ exits. The number of such paths grows exponentially, so at this time some heuristics are used when exploring paths. Using profile-based heuristics would be a better idea.

Also function unswitching helps by cleaning up control flow.
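A minimal illustration of the unswitching point (the code is hypothetical, not from the benchmark): hoisting a loop-invariant test out of the loop leaves straight-line control flow inside, which makes the loop a much cleaner fusion candidate.

```cpp
#include <cstddef>

// Before unswitching: the loop-invariant test (t & 0xF) is re-evaluated
// on every iteration, burying the loop body under extra control flow.
void beforeUnswitch(int *A, std::size_t size, int t) {
  for (std::size_t i = 0; i < size; i++) {
    if (t & 0xF)            // loop-invariant condition
      A[i] |= (1 << t);
  }
}

// After unswitching: the invariant test guards the loop from outside,
// and the loop vanishes entirely when the test is false.
void afterUnswitch(int *A, std::size_t size, int t) {
  if (t & 0xF) {
    for (std::size_t i = 0; i < size; i++)
      A[i] |= (1 << t);
  }
}
```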

Example:

if (a & b) {
  if (a & c) {
    if (t & 0xF) {
      for (i = 0; i < size; i++) {
        if (A[i] & d) {
          A[i] |= (1 << t);
        }
      }
    }
  }
}

if (b & d) {
  if (a & d) {
    if (t & 0xA) {
      for (i = 0; i < size; i++) {
        if (A[i] & d) {
          A[i] |= (1 << t);
        }
      }
    }
  }
}

After fusion we will have:

if (a&b && a&c && t&0xF && b&d && a&d && t&0xA) {
  for (i = 0; i < size; i++) {
    if (A[i] & d) {
      A[i] |= (1 << t);
    }
    if (A[i] & d) {
      A[i] |= (1 << t);
    }
  }
} else {
  // original code
  if (a & b) {
    if (a & c) {
      if (t & 0xF) {
        for (i = 0; i < size; i++) {
          if (A[i] & d) {
            A[i] |= (1 << t);
          }
        }
      }
    }
  }
  if (b & d) {
    if (a & d) {
      if (t & 0xA) {
        for (i = 0; i < size; i++) {
          if (A[i] & d) {
            A[i] |= (1 << t);
          }
        }
      }
    }
  }
}
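As a quick sanity check of the transformation above, the original and fused versions can be run side by side and compared. This harness is my own illustration, not part of the patch:

```cpp
#include <cstddef>

// Original form: two independently guarded loops over A, as in the
// "before" code above.
void originalVersion(int *A, std::size_t size, int a, int b, int c, int d, int t) {
  if ((a & b) && (a & c) && (t & 0xF))
    for (std::size_t i = 0; i < size; i++)
      if (A[i] & d) A[i] |= (1 << t);
  if ((b & d) && (a & d) && (t & 0xA))
    for (std::size_t i = 0; i < size; i++)
      if (A[i] & d) A[i] |= (1 << t);
}

// Fused form: when every guard holds, one loop runs both bodies;
// otherwise fall back to the original code. Running the bodies
// back-to-back is safe here because each iteration only touches A[i],
// so the second body sees the same value it would have seen in the
// second loop of the original.
void fusedVersion(int *A, std::size_t size, int a, int b, int c, int d, int t) {
  if ((a & b) && (a & c) && (t & 0xF) && (b & d) && (a & d) && (t & 0xA)) {
    for (std::size_t i = 0; i < size; i++) {
      if (A[i] & d) A[i] |= (1 << t);
      if (A[i] & d) A[i] |= (1 << t);
    }
  } else {
    originalVersion(A, size, a, b, c, d, t);
  }
}
```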

Thanks,

Ramshankar

Hi Ramshankar,

This looks really interesting.

Interestingly, a patch for loop distribution has been posted for review recently.
On a conceptual level, loop fusion and loop distribution are each other’s inverses, so I would assume that
they should ideally somehow be combined and use a single cost function to decide which, out of all legal
fusions/distributions, is most profitable.

Does that make sense, or is there some reason why these fundamentally couldn’t be combined into a single
loop fuse-and-or-distribute transformation?

I haven’t looked closely at either the code of the distribution or the fusion patches, but would it be hard to
combine them into a single transformation – assuming my assumptions above do make sense?

If it would prove sensible to somehow combine the distribution and fusion transformations – could this
be the beginning of a more general high-level loop transformation framework in LLVM?

Thanks,

Kristof

At the very high level, I don't think the analysis passes should be
combined. Outside of SPEC tuning, how would you decide when the cost of
doing one vs. the other is appropriate? Without target-level information,
how could you estimate the cost or impact? If we think very, very long term,
I wonder how something like backtracking could be useful here; by this I
mean running multiple potentially conflicting or opposing transformations
in parallel, potentially all the way to code generation, and then in the
end picking the path which produces the best output. I realize the
infrastructure to do something like this is almost a dream, but I thought
it was worth mentioning.

If my comments aren't coherent, I can potentially dig up the slides from
Fred Chow, who originally proposed the idea/concept (for another compiler);
ping me off-list.

Hi Kristof,

Thanks! We have to have a cost function, but unfortunately I haven’t given it more than a fleeting thought yet. I think that a reasonable locality-of-reference model would be able to identify the candidates we are fusing right now. In Adam’s work, he is distributing a loop so that the result has one vectorizable loop for the statements not in a dependence cycle and one scalar loop for the rest of the statements. That is not locality-of-reference-based tuning. That is still a minor detail, and one expects that a common cost function, based on a cache model, whether the transformation helps vectorization, etc., would fit well into both of these works. IMO both works are standard, textbook-like. All suggestions/help sought :) as we are far from a strong loop optimization framework. There are other challenges, like phase ordering (inlining may help fusion if targeted; analyzing at LTO for real-life cases; it looks to be better to delay vectorization until after fusion).
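My understanding of that distribution scheme, sketched on a made-up loop (the code and names are illustrative, not from Adam's patch): the statement carrying the dependence cycle stays in a scalar loop, while the cycle-free statement is split into its own vectorizable loop.

```cpp
#include <cstddef>

// Before distribution: S1 participates in a dependence cycle (it reads
// a[i-1], written on the previous iteration); S2 does not. The cycle
// blocks vectorization of the whole loop.
void beforeDistribute(int *a, int *b, const int *c, std::size_t n) {
  for (std::size_t i = 1; i < n; i++) {
    a[i] = a[i - 1] + c[i];  // S1: loop-carried dependence
    b[i] = c[i] * 2;         // S2: independent
  }
}

// After distribution: S2 gets its own loop, which a vectorizer can
// handle, and S1 remains in a scalar loop.
void afterDistribute(int *a, int *b, const int *c, std::size_t n) {
  for (std::size_t i = 1; i < n; i++)
    b[i] = c[i] * 2;         // vectorizable loop
  for (std::size_t i = 1; i < n; i++)
    a[i] = a[i - 1] + c[i];  // scalar loop with the dependence cycle
}
```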

Best regards,

Ramshankar

I would strongly suggest we not try to be overly general at this time. Let’s focus on getting a clearly profitable use case for each, and then worry about generalizing once we have more practical experience and code has actually landed in tree. General loop optimization is a hard problem.

As a near term goal, having the profitability heuristics well separated from the actual transforms seems like a reasonable goal. Even this should happen after we have tests and working examples in tree though.

Philip


Hi Ramshankar,

This is interesting, indeed. Thanks for soliciting input.

My main comment has to do with the fact that in order to get this gain you need many pieces to work together. My suggestion is to proceed in steps and develop it incrementally on trunk.

For example it would be good to first add a simple loop fusion pass that only merges adjacent loops and then build it up from there. Can you break down your approach into smaller, incremental steps like this? What would these steps be?

My other comment is about the choice of DependenceAnalysis framework. No pass currently uses this framework and I am not sure how complete the implementation is and whether it has adequate test coverage. This was my main reason for reusing the dependence analysis from the LoopVectorizer. Have you considered the same?

Adam

From: "Adam Nemet" <anemet@apple.com>
To: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan@amd.com>
Cc: llvmdev@cs.uiuc.edu
Sent: Saturday, January 17, 2015 12:20:55 AM
Subject: Re: [LLVMdev] proof of concept for a loop fusion pass

Hi,

We are proposing a loop fusion pass that proactively fuses
loops across function call boundaries and arbitrary control flow.

http://reviews.llvm.org/D7008

This link contains the Clang patch. Did you intend to post the LLVM patch as well?

With this pass, we get 103 loop fusions in SPECCPU INT 2006
462.libquantum with rate performance improving close to 2.5X in x86
(results from AMD A10-6700).

I took some liberties in patching up some of the code in
ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and
also adjusted the IPO/LTO pass managers. I would need to do a better
job there but I would like to put this out as WIP for high level
reviews. At this time standard loop fusion test cases may not work
(cases where control dep are same or loops are truly adjacent etc).
I would be doing that as well.

Appreciate suggestions and other help.

The pass initially attempts to inline calls which have loops in them.
Following inlining, a separate scalar "main" pass tries to fuse
loops. It is not necessary for loops to be adjacent or have the same
control dependence. Specifically a control dependence closure is
computed and loops that share a control dep closure prefix are taken
as candidates for fusion, in case there are no other loops in a
certain path between them. A loop graph is built with edges between
loops which share the control dependence closure prefix. A recursive
application of fusion follows on the graph from "bottom-up".

During fusion, program slices are taken based on the control flow
closures of the two loops being fused. Example: Suppose two loops A
and B are going to be fused. They share the same control dependence
prefix, but their suffixes vary. The suffixes are used to duplicate
control-flow paths that pass through both loops. Specifically, paths
from the first block in the control-dep closure suffix of the first
loop, through the first loop's exit and following into the suffix of
the second loop through the second loop's exit on to the common
post-dominator of either loop's exit. The number of paths grows
exponentially. At this time some heuristics are used when exploring
for paths. Using profile based heuristics is a better idea.

On this last point, would it be possible to encode your static heuristics as improvements to our static branch probability heuristics, and then naturally adapt to profile-based branch probabilities when those are available?

Hi Ramshankar,

This is interesting, indeed. Thanks for soliciting input.

My main comment has to do with the fact that in order to get this
gain you need many pieces to work together. My suggestion is to
proceed in steps and develop it incrementally on trunk.

I second this. We definitely should have loop fusion in LLVM, and incremental in-trunk development is the best way to make that happen.

For example it would be good to first add a simple loop fusion pass
that only merges adjacent loops and then build it up from there. Can
you break down your approach into smaller, incremental steps like
this? What would these steps be?

My other comment is about the choice of DependenceAnalysis framework.
No pass currently uses this framework and I am not sure how complete
the implementation is

DependenceAnalysis has known issues -- as I recall, the way it pulls apart GEP indices without verifying the boundary conditions is not really sound, but this could be replaced by our delinearization functionality now (what Polly is using) -- and if this work provides the impetus for cleaning that up, so much the better. That having been said, care is required. DependenceAnalysis is definitely more powerful than the vectorizer's dependence analysis, and conceptually cleaner. However, using what the vectorizer uses should provide a conservatively-correct base.

and whether it has adequate test coverage.

DependenceAnalysis's test coverage actually isn't bad, at least for loop bodies in traditional form.

This was my main reason for reusing the dependence analysis from the
LoopVectorizer. Have you considered the same?

Finally, it might be a good idea to discuss the overall optimization plan for fuseable loops. One possibility that I had discussed with Chandler a couple of months ago was: aggressively fuse early, expose optimization opportunities, and then split late (after vectorization) based on target-specific modeling (accounting for register pressure, hardware prefetching resources, etc.). There are obviously also other possible approaches, and I'd like to know what you think would be an optimal strategy.

Thanks again,
Hal


We probably want another splitting before the vectorizer to allow partial vectorization. But other than that, this sounds like a good initial plan to me.

Adam

That looks like the best way to go and will split the general part from the heroics.

The libquantum optimization is an example of a heroic optimization that will fire for one benchmark only. But the compiler capability that recognizes the opportunity has value beyond that.

There are also a few patents from AMD and Intel in that space that the oversight team will have to be aware of.

-Gerolf

Hi Hal,

There are two diffs under revision update history:

The clang patch is: http://reviews.llvm.org/differential/diff/18250/
The llvm patch is: http://reviews.llvm.org/differential/diff/18249/

The loop fusion pass is in Scalar opts. One of the challenges in the legality checks was aliasing with extern variables. However, I derived some metadata from GlobalsModRef and used it in the scalar opt to get what I wanted. Do you have thoughts on how we can propagate GlobalsModRef analysis into the scalar opts? That could be one of the first sub-patches. Going by the various suggestions, I will put out a diff just for loop fusion (along with control dependences/closures) with some synthetic test cases.

This phase ordering question is of interest: will inlining (and other IPO analyses/opts, including globals mod/ref and function unswitching) facilitate better loop fusion? I experimented with an IPO analysis that spans call graphs and checks "adjacent" code for loops that can be fused. However, this made the compiler very slow, and I ended up inlining all functions that have a singly nested loop. I could check that there is a path from entry to the loop header that does not have "bad" basic blocks (blocks that have calls, for example). Alternatively, I could use profile or static block frequencies. We need to prune the search space somehow, and we have to do it across call boundaries.

I think that vectorization should come later than fusion and distribution in the optimization flow. Vectorization adds loop versioning, which creates additional loops and probably a more challenging search space for loop fusion. Interestingly, I tried vectorization by hand for 462.libquantum and it did not help: there is an interplay with if-converting stores (from what I have seen so far). Perhaps it is an architecture thing.
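The versioning Ram refers to can be pictured with a hand-written analogue (illustrative only; the real vectorizer emits this at the IR level): a runtime alias check selects between a copy that is safe to vectorize and a conservative scalar copy, so the fuser afterwards sees two loops where there used to be one.

```cpp
#include <cstddef>
#include <cstdint>

// Hand-sketched analogue of vectorizer loop versioning. The "vector-safe"
// copy is marked only by a comment here; a vectorizer would emit SIMD
// code for it.
void versioned(int *a, const int *b, std::size_t n) {
  std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(a);
  std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(b);
  bool noOverlap = pa + n * sizeof(int) <= pb || pb + n * sizeof(int) <= pa;
  if (noOverlap) {
    // vector-safe version: runs when the runtime check proves no aliasing
    for (std::size_t i = 0; i < n; i++)
      a[i] = b[i] + 1;
  } else {
    // fallback scalar version for potentially aliasing pointers
    for (std::size_t i = 0; i < n; i++)
      a[i] = b[i] + 1;
  }
}
```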

Thanks,
Ram

From: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan@amd.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Adam Nemet" <anemet@apple.com>
Cc: llvmdev@cs.uiuc.edu, "chandlerc" <chandlerc@gmail.com>
Sent: Monday, January 19, 2015 7:57:42 AM
Subject: RE: [LLVMdev] proof of concept for a loop fusion pass

Hi Hal,

There are two diffs under revision update history:

The clang patch is: http://reviews.llvm.org/differential/diff/18250/
The llvm patch is: http://reviews.llvm.org/differential/diff/18249/

Okay, please don't do that. Please open one differential per patch. It is not at all obvious that the different updates refer to completely different patches (plus, it will mess up the incremental patch differencing capability).

The loop fusion pass is in Scalar opts.

That seems like the right place.

One of the challenges in
legality checks was aliases with extern variables. However I derived
some metadata from GlobalsModRef and used that in the scalar opt to
get what I wanted. Do you have thoughts on how we can propagate
GlobalsModRef analysis into the scalar opt?

So it seems, specifically, that you're attempting to propagate whether a global does not have its address taken. For basic blocks, we stick this information into their value subclass data; perhaps we could do the same here? We could also have an AddressTaken member variable, as we do for MachineBasicBlock in CodeGen. Regardless, the metadata does not seem unreasonable.

That could be one of the
first sub-patches. Going by the various suggestions I will put out a
diff just for loop fusion (along with control dependences/closures)
with some synthetic test cases.

Great!

This phase ordering question is of interest: Will inlining (and other
IPO analysis/opt incl globals mod/ref, function unswitch) facilitate
better loop fusion?

I'm sure they will, especially inlining, but I see no reason that loop fusion should occur outside of the inliner-driven CGSCC pass manager (so that it occurs iteratively with inlining). We might want to make ICA (the inliner's cost analysis) smarter about loops that might be fused, but that's a separable improvement.

I experimented with an IPO analysis that spans
call graphs and checks "adjacent" code for loops that can be fused.
However this made the compiler very slow and I ended up with
inlining all functions that has a singly nested loop. I could check
that there is a path from entry to the loop header that does not
have "bad" basicblocks: blocks that have calls for example.

I see no reason to avoid fusing loops with calls (so long, I suppose, as we can split later based on register pressure considerations). Why are you doing that?

Alternatively I could use profile or static block frequencies. We
need to prune the search space somehow and we got to do it across
call boundaries.

Better use of BPI within ICA is something that the new pass-manager infrastructure should allow.

I think that vectorization should come later than fusion and
distribution in the optimization flow. Vectorization adds loop
versioning. That creates additional loops and probably a more
challenging search space for loop fusion.

Good point. As I mentioned, I think that fusing early and splitting late makes sense. Loop fusion potentially exposes many other simplification opportunities, and predicting them ahead of time is unlikely to be very successful.

Interestingly, I tried
vectorization by hand for 462.libquantum and it did not help: there
is an interplay with if-converting stores (what I saw so far).
Perhaps it is an architecture thing.

Interesting.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan@amd.com>
Cc: llvmdev@cs.uiuc.edu, "chandlerc" <chandlerc@gmail.com>, "Adam Nemet" <anemet@apple.com>
Sent: Sunday, January 25, 2015 9:16:26 PM
Subject: Re: [LLVMdev] proof of concept for a loop fusion pass


Also, please post your patches with full context (see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface for instructions), and make sure they're not inverted (your current diff makes it looks like your new files are being deleted, etc.).

-Hal