Potential SimplifyCFG optimization: hammock to diamond transformation

All,
I have some code that looks like the following:

{
  double a, b, c;
  for (...) {
    ...
    a = lots of FP math;
    b = lots of FP math;
    c = lots of FP math;
    if (cond) {
      a = 0.0;
      b = 0.1;
      c = 0.2;
    }
    ...
  }
}

Could we not convert the hammock into a diamond and move the initial computations of a, b, and c into the else block? Something like this:

{
  double a, b, c;
  for (...) {
    ...
    if (cond) {
      a = 0.0;
      b = 0.1;
      c = 0.2;
    } else {
      a = lots of FP math;
      b = lots of FP math;
      c = lots of FP math;
    }
    ...
  }
}

Does a similar optimization exist? If not, would the SimplifyCFG pass be an appropriate home for such an optimization?

Chad

Hi,

I imagine this optimization is for FP math stuff that doesn’t involve functions which might set errno (or when we’ve specified we don’t care about that)?

Cheers,
Dave

The specific code I’m looking at does not have any function calls. The “lots of FP math” is just a series of FP addition, subtraction, and division.

Hi Chad,

I am not sure if something similar exists in LLVM today. The optimization you are looking for is called partial dead code elimination (PDE).

I do not think SimplifyCFG would be a great place to put this as that is primarily about cleaning up control flow. I think it deserves its own pass.

The most general formulation would sink instructions - including loads, and possibly some other side-effecting instructions when it is safe to do so - and would potentially duplicate code when the results are dead on some paths below the definition, but not all.

It seems like you could do a simpler formulation that just splits critical edges and then moves instructions that have no side effects and have only a single use to just before the use (in this case, just prior to the phi in the new block created when splitting edges). You would also need to be careful not to sink code into loops - you would instead want to sink it below code that guards the loop, but not actually into the loop.

Here is one paper I am aware of that appears to have an SSA-based formulation of PDE. I have not read this as it is behind a paywall:
  http://rd.springer.com/chapter/10.1007%2F3-540-48294-6_12

Mark

Thanks, Mark. I will give the paper a look.


Hi Chad,

I haven’t read the paper Mark linked because I’m too cheap to pay for it. However, there is a very simple global code motion (GCM) algorithm that handles this case. All you need is a DomTree, no fancy value graph. It makes two passes over the SSA data-flow graph to find valid blocks.

http://dl.acm.org/citation.cfm?id=207154

The difficult compiler-specific part is inferring control and memory dependencies. Those are not explicit in LLVM SSA. Our GVN pass has the infrastructure to deal with that.

Next you have to consider profitability. You could aggressively sink everything at the IR level within a loop, but you would need to split a lot of critical edges. There probably needs to be a cost heuristic to avoid splitting edges just to sink cheap operations. That really complicates the problem, though.

For even better heuristics, you could potentially use BlockFrequency to ignore cold branches. If you care about latency hiding/critical path and register pressure, then it needs to be an MI-level pass. It would be nice to have such an MI pass to compensate for the lack of a global scheduler.

-Andy