RFC: New loop distribution pass for software pipelining

I am considering implementing loop distribution to improve the efficiency of software pipelining, and I would like advice and comments: for example, on other approaches to raising ILP, on whether this is the best way to improve software pipelining, on the feasibility of the implementation, and so on.

Background

A hot spot in a simulation program is often a loop that has long latencies and contains a large number of instructions.
Such loops require instruction scheduling, such as software pipelining, because hardware out-of-order execution cannot adequately hide the instruction latencies.
The effect of scheduling is greater when there is no SMT, as is the case with many Arm processors.

Software pipelining schedules multiple iterations so that they overlap, but restrictions on the number of registers often make it impossible to overlap them sufficiently.
Splitting such loops to increase the locality of register usage could make pipelining more effective.

Algorithm

Evaluation of a distribution

A given loop is evaluated by the estimated minimum II (Initiation Interval) at which there is no shortage of registers.
II is a software pipelining term: it is the number of cycles until the next iteration begins, and therefore the number of cycles per iteration.
A distribution with a smaller total II is better.

Since the exact II cannot be determined without actually scheduling the loop, it is estimated by the following calculation.

RequiredParallel = TotalLatency/MII
PossibleParallel = NumRegisters/NumRegisterConsumptionPerIter
if RequiredParallel < 1
  II = MII
elif RequiredParallel < PossibleParallel
  II = TotalLatency/RequiredParallel
else
  II = TotalLatency/PossibleParallel

Here, II is regarded as the total latency per iteration divided by the number of overlapping iterations.
The number of iterations that can overlap is conservatively underestimated as the number of registers in the architecture divided by the maximum register consumption per iteration.

MII is the minimum II considering only resource and recurrence constraints.
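
The estimate above can be sketched as a small function. This is only an illustrative sketch, not the prototype's actual API; the name est_ii and its parameter names are assumptions, and doubles are used so the divisions behave like the formula's real-valued arithmetic.

```c
/* Sketch of the II estimate described above (illustrative names). */
static double est_ii(double total_latency, double mii,
                     double num_registers, double regs_per_iter) {
  double required_parallel = total_latency / mii;
  double possible_parallel = num_registers / regs_per_iter;
  if (required_parallel < 1.0)
    return mii;
  if (required_parallel < possible_parallel)
    return total_latency / required_parallel; /* algebraically equals MII */
  /* Register pressure limits the overlap, so II grows beyond MII. */
  return total_latency / possible_parallel;
}
```

Note that the first two branches both reduce to MII; the estimate exceeds MII only when the register budget limits the number of overlapping iterations.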

If there are register references that cross the distributed loops, the values are saved in temporary arrays.
The additional load/store instructions must be reflected in the MII calculation.

Search algorithm

I think local search is acceptable as a method for searching for partitions, but I have not yet studied the details.

Alternative method

An alternative method would be to spill and fill registers whose live ranges, from definition to reference, span sufficiently many cycles during pipelining.
I think loop distribution is more flexible and can take other requirements into account, e.g., increasing the locality of memory accesses.

Implementation

I submitted a prototype to Phabricator: D147491 [PoC][WIP] New loop distribution pass for software pipelining
The purpose of this implementation is to confirm that the actual pipelining results after distribution match the estimates.

Unlike the existing LoopDistribute pass, it is implemented in the back end because it requires MIR-level splitting.

Although it is implemented only for AArch64, that is just for ease of implementation; the algorithm is considered generic.

Example result

I confirmed that the II is reduced from 24 to 16 by distribution on the following artificial test program.
The target processor is Neoverse N2, whose latencies for floating-point addition, multiplication, and FMA are 2, 3, and 4 cycles, respectively.

void f(double *restrict a, double *b, int n) {
  double c0 = a[0];
  double c1 = a[1];
  // The aim is to prevent TMP00-05 from occupying the registers for too long by splitting it.
  for (int i=0; i<n; i++) {
    double TMP00 = c0 + c0*b[i];
    double TMP01 = TMP00 + TMP00*b[i];
    double TMP02 = TMP01 + TMP01*b[i];
    double TMP03 = TMP02 + TMP02*b[i];
    double TMP04 = TMP03 + TMP03*b[i];
    double TMP05 = TMP04 + TMP04*b[i];
    double tmp06 = TMP05 + TMP05*b[i];
    double tmp07 = tmp06 + tmp06*b[i];
    double tmp08 = tmp07 + tmp07*b[i];
    double tmp09 = tmp08 + tmp08*b[i];
    double tmp10 = tmp09 + tmp09*b[i];
    double tmp11 = tmp10 + tmp10*b[i];
    double tmp12 = tmp11 + tmp11*b[i];
    double tmp13 = tmp12 + tmp12*b[i];
    double tmp14 = tmp13 + tmp13*b[i];
    double tmp15 = tmp14 + tmp14*b[i];
    double tmp16 = tmp15 + tmp15*b[i];
    double tmp17 = tmp16 + TMP00*b[i];
    double tmp18 = tmp17 + TMP01*b[i];
    double tmp19 = tmp18 + TMP02*b[i];
    // distributed point. TMP03-05, tmp19, and b[i] are stored in temporary arrays.
    // TMP00-02 are not actually stored to reduce the number of temporary arrays.
    double tmp20 = tmp19 + TMP03*b[i];
    double tmp21 = tmp20 + TMP04*b[i];
    double tmp22 = tmp21 + TMP05*c1;
    a[i] = tmp22;
  }
}
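
For illustration, the distributed form described by the comments above could look roughly like the following hand-written C sketch. The pass actually performs this transformation at the MIR level; f_dist, MAX_N, and the temporary-array names (t03 through tb) are assumptions made for this sketch, not output of the prototype.

```c
#define MAX_N 256 /* assumed bound for the temporary arrays; n <= MAX_N */

void f_dist(double *restrict a, double *b, int n) {
  double c0 = a[0];
  double c1 = a[1];
  /* Temporary arrays carry TMP03-05, tmp19, and b[i] across the split. */
  double t03[MAX_N], t04[MAX_N], t05[MAX_N], t19[MAX_N], tb[MAX_N];
  /* First loop: TMP00-02 are consumed before the split, so they need
   * no temporary arrays. */
  for (int i = 0; i < n; i++) {
    double TMP00 = c0 + c0*b[i];
    double TMP01 = TMP00 + TMP00*b[i];
    double TMP02 = TMP01 + TMP01*b[i];
    double TMP03 = TMP02 + TMP02*b[i];
    double TMP04 = TMP03 + TMP03*b[i];
    double TMP05 = TMP04 + TMP04*b[i];
    double tmp06 = TMP05 + TMP05*b[i];
    double tmp07 = tmp06 + tmp06*b[i];
    double tmp08 = tmp07 + tmp07*b[i];
    double tmp09 = tmp08 + tmp08*b[i];
    double tmp10 = tmp09 + tmp09*b[i];
    double tmp11 = tmp10 + tmp10*b[i];
    double tmp12 = tmp11 + tmp11*b[i];
    double tmp13 = tmp12 + tmp12*b[i];
    double tmp14 = tmp13 + tmp13*b[i];
    double tmp15 = tmp14 + tmp14*b[i];
    double tmp16 = tmp15 + tmp15*b[i];
    double tmp17 = tmp16 + TMP00*b[i];
    double tmp18 = tmp17 + TMP01*b[i];
    t19[i] = tmp18 + TMP02*b[i];
    t03[i] = TMP03;
    t04[i] = TMP04;
    t05[i] = TMP05;
    tb[i] = b[i];
  }
  /* Second loop: the carried values now have short live ranges. */
  for (int i = 0; i < n; i++) {
    double tmp20 = t19[i] + t03[i]*tb[i];
    double tmp21 = tmp20 + t04[i]*tb[i];
    a[i] = tmp21 + t05[i]*c1;
  }
}
```

In each loop, every value is defined and used within a few statements, which is the increased locality of register utilization that the pass aims for.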

In the future, I think it will be necessary to confirm the effectiveness on real code and to measure execution times.

Details of the example

II is estimated as follows by the prototype for the code above:

  • Original II: 26
  • Original ResMII: 12
  • Total II after distribution: 13 + 2 = 15 (split into two loops, with the II of each)
  • Total MII after distribution: 13 + 2 = 15

Optimization flags are as follows:

-O3 -fno-tree-vectorize -fno-unroll-loops -mcpu=neoverse-n2 -mllvm --aarch64-enable-loop-dist=1 -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -enable-misched=0 -mllvm -enable-post-misched=0 -mllvm -debug-only=aarch64-loop-distribute,pipeliner

Actual pipelined results are as follows:

  • Original II: 24
  • Total II after distribution: 13 + 3 = 16

By adjusting the option -mllvm -pipeliner-force-ii=, which forces the specified II to be used, I confirmed that the original II is the minimum value at which register spills do not occur.

This method overestimates the register pressure during pipelining, so a smaller II is expected to be achievable in practice.

The II of the second split loop is larger than estimated, presumably because the estimate does not account for added instructions such as address computation.

Disclaimer: I usually work in the mid-end, rarely on MachineIR.

AFAICS, the differences between AArch64LoopDistribute and LoopDistribute are:

  • Works on MachineIR
  • POC implementation only for AArch64
  • LoopDistribute tries to maximally distribute a loop, whereas AArch64LoopDistribute uses a heuristic
  • LoopDistribute is intended for vectorization, AArch64LoopDistribute for software pipelining
  • AArch64LoopDistribute uses hardware-specific metrics (e.g. available registers)

LoopDistribute is disabled by default because it does not have a profitability heuristic that knows whether a loop will actually be vectorized. AArch64LoopDistribute seems to address that, although it’s still separate from MachinePipeliner so there seems still no guarantee that MachinePipeliner will actually profit from loop distribution.

Generally, I would avoid having multiple passes for the same transformation, but being a MachineIR pass seems to justify that.

Since AArch64LoopDistribute is tied to MachinePipeliner, I think I would put the distribution logic into the MachinePipeliner pass itself, so it can reuse its II calculation from ResourceManager and remain in sync and available for all targets.

The implementation of AArch64LoopDistribute splits at most at one position (findDistPoint). I think it should be able to split loops into as many parts as necessary, possibly by just applying the split repeatedly.

findDistPoint is implemented by brute-force checking of all possible instructions, leading to quadratic complexity (assuming estII is linear). Is there any possibility of limiting that?

For performance-improving changes we would expect some data showing how much they improve application performance for the motivating use cases, without regressing other applications and compilation time too much. Typically we use the llvm-test-suite for that. I would also be interested in a comparison with splitting everything into minimal chunks without a heuristic, as LoopDistribute does. If that yields similar performance, maybe we don't need a MachineIR distribute pass.

Sorry for the late reply. Thanks for the summary and advice.

I am working on improving some features of MachinePipeliner and implementing the AArch64 back end support needed to gather performance data.
I will then consider an algorithm for findDistPoint that balances performance and compile time.