[PATCH/RFC] Modifying reassociate for improved CSE: fairly large perf gains

When playing around with reassociate I noticed a seemingly obvious optimization that was not getting done anywhere in llvm… nor in gcc or ICC.

Consider the following trivial function:

void foo(int a, int b, int c, int d, int e, int *res) {
res[0] = (e * a) * d;
res[1] = (e * b) * d;
res[2] = (e * c) * d;
}

This function can be optimized down to 4 multiplies instead of 6 by reassociating such that (e*d) is the common subexpresion. However, no compiler I’ve tested does this. I wrote a slightly hacky heuristic algorithm to augment reassociate to do this and tested it.

First, before the details, the results: on a large offline test suite of graphics shaders it cut down total instruction count by ~0.9% (!) and float math instruction count by 1.5% (!). I ran it on SPEC out of curiosity and the one standout was that with fast-math, it cut down 470.lbm by ~6% runtime (see Note 1 at the bottom for the probable reason why). Not coincidentally, this is probably the SPEC program that looks most like a graphics shader.

Here’s how it works:

  1. Do reassociate once as normal. Keep track of the linear chains of operations that are saved to the final IR for later.
  2. Create a “pair map” consisting of a mapping from <Instr, Instr> to . Have one pair map for each type of BinaryOperation. This map represents how common a given operand pair occurs in the source code for a given BinaryOperation. But in addition to counting each actual instruction, we also count each possible O(N^2) pair of each linear operand chain. So for example, if the operand chain is this:

abc

we do:

PairMap[Multiply][{a, b}]++;
PairMap[Multiply][{b, c}]++;
PairMap[Multiply][{a, c}]++;

Obviously if runtime is an issue we can prohibit this on extremely long chains or such that are unlikely to occur in real programs to avoid asymptotically quadratic behavior.
3. Run reassociate again. All the information is saved from the first time around so hopefully this won’t be very expensive except for the changes we actually make. But this time, whenever emitting a linear operand chain, pick the operand pair that’s most common in the source code (using PairMap) and make that one the first operation. Thus, for example:

(((a*b)*c)*d)*e

if “b*e” is the most common, this becomes:

(((b*e)*a)*c)*d

Now b*e can be CSE’d later! Magic!

This could probably be enhanced to allow for multiple operations to be CSE’d, e.g. by doing something like ((be)(a*c))*d, but that would change the canonicalization regime and make it a more invasive change, I think.

Also, as a tiebreaker, the current one I’m using is the “pair which has the lowest max rank of the two operands”, which makes sense because in this example, “a*b” is the first operation in the chain, so we want to pick the duplicates which are also higher up in the program vs closer to the leaf. No other tiebreaker I tried seemed to work as well.

This is not the cleanest implementation or algorithm, but it’s the most straightforward I could come up with and it gives us some unreasonably large perf gains, and of course, optimizes “foo” correctly down to 4 multiplies.

Any thoughts? The gains here are… well, more than I expected, to say the least :wink:

—escha

Note 1: see how (1.0 - u2) is a common factor here.
DST_C ( dstGrid ) = (1.0-OMEGA)SRC_C ( srcGrid ) + DFL1OMEGArho(1.0 - u2);
DST_N ( dstGrid ) = (1.0-OMEGA)SRC_N ( srcGrid ) + DFL2OMEGArho(1.0 + uy*(4.5uy + 3.0) - u2);
DST_S ( dstGrid ) = (1.0-OMEGA)SRC_S ( srcGrid ) + DFL2OMEGA
rho*(1.0 + uy*(4.5uy - 3.0) - u2);
DST_E ( dstGrid ) = (1.0-OMEGA)SRC_E ( srcGrid ) + DFL2OMEGA
rho*(1.0 + ux*(4.5*ux + 3.0) - u2);
(etc)

reassociate_cse.diff (7.72 KB)

Hi,

When playing around with reassociate I noticed a seemingly obvious optimization that was not getting done anywhere in llvm… nor in gcc or ICC.

Consider the following trivial function:

void foo(int a, int b, int c, int d, int e, int *res) {
res[0] = (e * a) * d;
res[1] = (e * b) * d;
res[2] = (e * c) * d;
}

This function can be optimized down to 4 multiplies instead of 6 by reassociating such that (e*d) is the common subexpresion. However, no compiler I’ve tested does this. I wrote a slightly hacky heuristic algorithm to augment reassociate to do this and tested it.

*First, before the details, the results: on a large offline test suite of graphics shaders it cut down total instruction count by ~0.9% (!) and float math instruction count by 1.5% (!). *I ran it on SPEC out of curiosity and the one standout was that with fast-math, it cut down 470.lbm by ~6% runtime (see Note 1 at the bottom for the probable reason why). Not coincidentally, this is probably the SPEC program that looks most like a graphics shader.

Here’s how it works:

1. Do reassociate once as normal. Keep track of the linear chains of operations that are saved to the final IR for later.
2. Create a “pair map” consisting of a mapping from <Instr, Instr> to <unsigned>. Have one pair map for each type of BinaryOperation. This map represents how common a given operand pair occurs in the source code for a given BinaryOperation. But in addition to counting each actual instruction, we also count each possible O(N^2) pair of each linear operand chain. So for example, if the operand chain is this:

a*b*c

we do:

PairMap[Multiply][{a, b}]++;
PairMap[Multiply][{b, c}]++;
PairMap[Multiply][{a, c}]++;

Obviously if runtime is an issue we can prohibit this on extremely long chains or such that are unlikely to occur in real programs to avoid asymptotically quadratic behavior.
3. Run reassociate again. All the information is saved from the first time around so hopefully this won’t be very expensive except for the changes we actually make. But this time, whenever emitting a linear operand chain, pick the operand pair that’s *most common* in the source code (using PairMap) and make that one the first operation. Thus, for example:

(((a*b)*c)*d)*e

if “b*e” is the most common, this becomes:

(((b*e)*a)*c)*d

Now b*e can be CSE’d later! Magic!

This could probably be enhanced to allow for multiple operations to be CSE’d, e.g. by doing something like ((b*e)*(a*c))*d, but that would change the canonicalization regime and make it a more invasive change, I think.

Also, as a tiebreaker, the current one I’m using is the “pair which has the lowest max rank of the two operands”, which makes sense because in this example, “a*b” is the first operation in the chain, so we want to pick the duplicates which are also higher up in the program vs closer to the leaf. No other tiebreaker I tried seemed to work as well.

This is not the cleanest implementation or algorithm, but it’s the most straightforward I could come up with and it gives us some unreasonably large perf gains, and of course, optimizes “foo” correctly down to 4 multiplies.

Any thoughts? The gains here are… well, more than I expected, to say the least :wink:

Interesting, I think guiding Reassociate to reorder to maximize CSE later on sounds like a good idea, all things being equal.

As you have a proof-of-concept patch already (and the proposed change is an isolated improvement to the Reassociate pass, with no impact to anything else), I think you could just create a review on Phabricator, as this would make it easier for people to comment on the actual code.

One concern will be compile time, but it should be possible to measure the impact and we can consider restricting the set of candidates by either limiting the length of chains and/or limiting the scope we look for candidates for.

Cheers,
Florian

Done!

—escha