Register Spill Caused by the Reassociation pass

Hi Sanjay,

I observed some extra register spills when applying the reassociation pass to the SPEC2006 benchmarks, and I would like to ask for your advice.

For example, the function get_new_point_on_quad() in tria_boundary.cc of spec2006/dealII has a sequence of code like this:

X = a + b
Y = X + c
Z = Y + d

There are many other instructions between these float adds. The reassociation pass first swaps a and c when checking the second add, and then swaps a and d when checking the third add. The transformed code looks like

X = c + b
Y = X + d
Z = Y + a

a is pushed all the way down to the bottom, and its live range is now much longer.
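The live-range growth can be put in numbers with a small sketch. The instruction indices here are hypothetical stand-ins for the many instructions between the adds, not positions from the actual function:

```python
# Sketch of how the swaps lengthen a's live range. Indices 0/10/20/30
# are hypothetical instruction positions, not from the real code.

def live_range(def_idx, last_use_idx):
    """Live-range length of a value, in instructions."""
    return last_use_idx - def_idx

# Before: 'a' is defined at 0 and last used by X = a + b at 10.
before = live_range(0, 10)

# After the swaps, 'a' is last used by Z = Y + a at 30.
after = live_range(0, 30)

assert after > before  # 'a' now competes for a register 3x as long
```

A longer live range means 'a' overlaps more other values, which is exactly the kind of pressure increase that can tip the allocator into spilling.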

Best,

Haicheng

Hi Haicheng,

We need to prevent the transform if it causes spilling, but I’m not sure yet what mechanism/heuristic we can use to do that.

Can you file a bug report with a reduced test case?

Thanks!

This conflict exists with many optimizations, including copy propagation, coalescing, hoisting, etc. Each of them can increase register pressure, with similar impact. Attempts to control register pressure locally (within an optimization pass) tend to be hard to tune and maintain. Would a better way be to describe, e.g. in metadata, how to undo an optimization? Optimizations that attempt to reduce pressure, like live-range splitting or rematerialization, could then hook in and call an undo routine based on a cost model.

I think there is time to do something longer-term. This particular instance can only be an issue under -ffast-math.

Cheers
Gerolf

The test case in the bug report exposes at least one problem, but it’s not the presumed problem of spilling.

Reduced example based on the PR attachment:

define double @foo_calls_bar_4_times_and_sums_the_results() {
%a = call double @bar()
%b = call double @bar()
%t0 = fadd double %a, %b
%c = call double @bar()
%t1 = fadd double %t0, %c
%d = call double @bar()
%t2 = fadd double %t1, %d
ret double %t2
}

I don’t think we’re ever going to induce any extra spilling in a case like this. The default (any?) x86-64 ABI requires spilling because no SSE registers are preserved across function calls. So we get 3 spills regardless of any reassociation of the adds:

$ ./llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -o - 25016.ll

callq bar
vmovsd %xmm0, (%rsp) # 8-byte Spill
callq bar
vaddsd (%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
vmovsd %xmm0, (%rsp) # 8-byte Spill
callq bar
vaddsd (%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
vmovsd %xmm0, (%rsp) # 8-byte Spill
callq bar
vaddsd (%rsp), %xmm0, %xmm0 # 8-byte Folded Reload

If we enable reassociation via -enable-unsafe-fp-math, we still have 3 spills:

callq bar
vmovsd %xmm0, 16(%rsp) # 8-byte Spill
callq bar
vmovsd %xmm0, 8(%rsp) # 8-byte Spill
callq bar
vaddsd 8(%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
vmovsd %xmm0, 8(%rsp) # 8-byte Spill
callq bar
vaddsd 8(%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
vaddsd 16(%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
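A hedged sketch of why both schedules need the same three spills: with no SSE registers preserved across calls, every FP value live across a call must sit in memory at that point. The op encoding below is my own simplification (single-use values, one basic block), not anything from LLVM:

```python
# Count FP values live across at least one call. Assumption: no SSE
# register survives a call (x86-64 SysV), so each such value spills.

def spill_count(schedule):
    live, spilled = set(), set()
    for op in schedule:
        if op[0] == 'call':
            spilled |= live          # everything live here needs a stack slot
            live.add(op[1])          # the call's result
        else:                        # ('add', dst, (src1, src2))
            _, dst, srcs = op
            live -= set(srcs)        # single-use values die at their use
            live.add(dst)
    return len(spilled)

chain = [('call', 'a'), ('call', 'b'), ('add', 't0', ('a', 'b')),
         ('call', 'c'), ('add', 't1', ('t0', 'c')),
         ('call', 'd'), ('add', 't2', ('t1', 'd'))]

balanced = [('call', 'a'), ('call', 'b'), ('call', 'c'), ('call', 'd'),
            ('add', 't0', ('a', 'b')), ('add', 't1', ('c', 'd')),
            ('add', 't2', ('t0', 't1'))]

assert spill_count(chain) == 3       # a, t0, t1 each cross a call
assert spill_count(balanced) == 3    # a, b, c each cross a call
```

So the spill count is pinned by the ABI here; only which values spill (and the dependence structure of the reloads) changes.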

This looks like what is described in the original problem: the adds got reassociated for no benefit (and possibly some harm, although it may be out-of-scope for the MachineCombiner pass).

We wanted to add the results of the first 2 function calls, add the results of the last 2 function calls, and then add those 2 results to reduce the critical path. Instead, we got:

((b + c) + d) + a
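The critical-path difference between the two shapes is just tree depth over the add dependence graph; a small sketch of that calculation (not MachineTraceMetrics itself):

```python
# Dependence-chain depth of the two association orders. A leaf (a call
# result) has depth 0; an fadd's depth is 1 + the deeper of its operands.

def depth(expr):
    if isinstance(expr, str):        # leaf: a call result
        return 0
    lhs, rhs = expr                  # an fadd of two subtrees
    return 1 + max(depth(lhs), depth(rhs))

serial   = ((('b', 'c'), 'd'), 'a')      # ((b + c) + d) + a
balanced = (('a', 'b'), ('c', 'd'))      # (a + b) + (c + d)

assert depth(serial) == 3   # three dependent adds in a row
assert depth(balanced) == 2 # first two adds are independent
```

The balanced tree is what the combiner is supposed to produce; here it produced the serial chain instead.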

This shows that either the cost calculation in the MachineCombiner is wrong or the results coming back from MachineTraceMetrics are wrong. Or maybe MachineCombiner should be bailing out of a situation like this in the first place - are we even allowed to move instructions around those function calls?

Here’s where it gets worse - if the adds are already arranged to reduce the critical path:

define double @foo4_reassociated() {
%a = call double @bar()
%b = call double @bar()
%c = call double @bar()
%d = call double @bar()
%t0 = fadd double %a, %b
%t1 = fadd double %c, %d
%t2 = fadd double %t0, %t1
ret double %t2
}

The MachineCombiner is increasing the critical path by reassociating the operands:

callq bar
vmovsd %xmm0, 16(%rsp) # 8-byte Spill
callq bar
vmovsd %xmm0, 8(%rsp) # 8-byte Spill
callq bar
vmovsd %xmm0, (%rsp) # 8-byte Spill
callq bar
vaddsd (%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
vaddsd 8(%rsp), %xmm0, %xmm0 # 8-byte Folded Reload
vaddsd 16(%rsp), %xmm0, %xmm0 # 8-byte Folded Reload

(a + b) + (c + d) → ((d + c) + b) + a

I think this is a problem calculating and/or using the “instruction slack” in MachineTraceMetrics.

The machine combiner does not see spills. Perhaps there is a phase ordering issue. From the analysis here I don’t see an explanation for a performance loss (the potential increase in register pressure did make sense to me, though).

-Gerolf