RFC: New loop distribution pass for software pipelining
I am considering implementing loop distribution to improve the efficiency of software pipelining, and I would like advice and comments: for example, on other approaches to raising ILP, on whether this is the best way to improve software pipelining, on feasibility of implementation, etc.
Background
A hot spot in a simulation program is often a loop that has long latency and contains a large number of instructions.
Such loops require instruction scheduling, such as software pipelining, because the instruction latency cannot be adequately hidden by hardware out-of-order execution.
The effect of scheduling is greater when there is no SMT, as is the case with many Arm processors.
Software pipelining schedules multiple iterations so that they overlap, but restrictions on the number of registers often make it impossible to overlap them sufficiently.
Splitting loops and increasing the locality of register utilization could increase the efficiency of pipelining.
Algorithm
Evaluation of a distribution
A given loop is evaluated by the estimated minimum II (Initiation Interval) at which no register shortage occurs.
II is a software-pipelining term for the number of cycles until the next iteration begins; it is therefore the number of cycles per iteration.
A distribution with a smaller total II is superior.
Since the exact II cannot be determined without an actual schedule, the following calculation method is used to estimate it.
```
RequiredParallel = TotalLatency / MII
PossibleParallel = NumRegisters / NumRegisterConsumptionPerIter
if RequiredParallel < 1:
    II = MII
elif RequiredParallel < PossibleParallel:
    II = TotalLatency / RequiredParallel
else:
    II = TotalLatency / PossibleParallel
```
Here, II is regarded as the total latency per iteration divided by the number of overlapping iterations.
The number of iterations that can overlap is conservatively estimated, as a lower bound, by the number of architectural registers divided by the maximum register consumption per iteration.
MII is the minimum II considering only resource and recurrence constraints.
If there are register references that cross the distributed loops, the values are saved in temporary arrays.
The additional load/store instructions must be reflected in the MII calculation.
Search algorithm
I think local search is an acceptable method for searching for partitions, but I have not studied the details yet.
Alternative method
An alternative method would be to spill and fill registers whose live ranges (cycles from definition to reference) are sufficiently long during pipelining.
I think loop distribution is more flexible and can take other requirements into account, e.g., increasing the locality of memory accesses.
Implementation
I submitted a prototype to Phabricator: D147491 [PoC][WIP] New loop distribution pass for software pipelining
The purpose of this implementation is to confirm that the actual pipelining results after distribution match the estimates.
Unlike the existing LoopDistribute pass, it is implemented on the back end because it requires MIR-level splitting.
Although it is implemented for AArch64, that is only for ease of implementation; the algorithm is considered to be generic.
Example result
I confirmed that II is reduced from 24 to 16 by distribution on the following artificial test program.
The target processor is Neoverse N2, whose latencies for floating-point addition, multiplication, and FMA are 2, 3, and 4 cycles, respectively.
```c
void f(double *restrict a, double *b, int n) {
  double c0 = a[0];
  double c1 = a[1];
  // The aim is to prevent TMP00-05 from occupying registers for too long
  // by splitting the loop.
  for (int i=0; i<n; i++) {
    double TMP00 = c0 + c0*b[i];
    double TMP01 = TMP00 + TMP00*b[i];
    double TMP02 = TMP01 + TMP01*b[i];
    double TMP03 = TMP02 + TMP02*b[i];
    double TMP04 = TMP03 + TMP03*b[i];
    double TMP05 = TMP04 + TMP04*b[i];
    double tmp06 = TMP05 + TMP05*b[i];
    double tmp07 = tmp06 + tmp06*b[i];
    double tmp08 = tmp07 + tmp07*b[i];
    double tmp09 = tmp08 + tmp08*b[i];
    double tmp10 = tmp09 + tmp09*b[i];
    double tmp11 = tmp10 + tmp10*b[i];
    double tmp12 = tmp11 + tmp11*b[i];
    double tmp13 = tmp12 + tmp12*b[i];
    double tmp14 = tmp13 + tmp13*b[i];
    double tmp15 = tmp14 + tmp14*b[i];
    double tmp16 = tmp15 + tmp15*b[i];
    double tmp17 = tmp16 + TMP00*b[i];
    double tmp18 = tmp17 + TMP01*b[i];
    double tmp19 = tmp18 + TMP02*b[i];
    // Distribution point: TMP03-05, tmp19, and b[i] are stored in
    // temporary arrays. TMP00-02 are not actually stored, to reduce
    // the number of temporary arrays.
    double tmp20 = tmp19 + TMP03*b[i];
    double tmp21 = tmp20 + TMP04*b[i];
    double tmp22 = tmp21 + TMP05*c1;
    a[i] = tmp22;
  }
}
```
In the future, I think it will be necessary to confirm the effectiveness on real code and to measure execution times.
Details of the example
II is estimated as follows by the prototype for the code above:
- Original II: 26
- Original ResMII: 12
- Total II after distribution: 13 + 2 = 15 (split into two loops, with their respective IIs)
- Total MII after distribution: 13 + 2 = 15
Optimization flags are as follows:
-O3 -fno-tree-vectorize -fno-unroll-loops -mcpu=neoverse-n2 -mllvm --aarch64-enable-loop-dist=1 -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -enable-misched=0 -mllvm -enable-post-misched=0 -mllvm -debug-only=aarch64-loop-distribute,pipeliner
Actual pipelined results are as follows:
- Original II: 24
- Total II after distribution: 13 + 3 = 16
By adjusting the option -mllvm -pipeliner-force-ii=, which forces the pipeliner to use the specified II, I confirmed that the original II is the minimum value at which no register spills occur.
This method over-estimates the register pressure during pipelining, so a smaller II than estimated may actually be achievable.
The II of the second split loop is larger than estimated, presumably because the estimate does not take into account additional instructions such as address computation.