I am considering implementing modulo variable expansion for MachinePipeliner.
In the current implementation, for processors with ordinary (non-rotating) registers such as Arm and x86, values that live across multiple stages are renamed with move instructions to prevent them from being overwritten.
This method may result in a very large number of move instructions, which can cause performance problems.
Modulo variable expansion (MVE) makes the move instructions unnecessary by duplicating the kernel as many times as the number of renames required.
On the other hand, there are drawbacks such as increased code size and the need for remainder iterations, depending on the expansion factor.
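To illustrate what I mean, here is a hand-written source-level sketch (a simplified two-stage example; the names and code are mine, not actual pipeliner output). With move-based renaming, every kernel iteration needs a copy; MVE instead replicates the kernel so each live copy gets its own name:

/* Move-based renaming: one kernel body plus an explicit copy per
   iteration, because the value loaded in an early stage is used two
   iterations later. Assumes n is large enough; epilogue omitted. */
void kernel_moves(double *restrict a, const double *restrict b, int n) {
  double t1 = b[0], t2 = b[1];   /* prologue fills the pipeline */
  for (int i = 0; i < n - 2; i++) {
    a[i] = t1 * 2.0;             /* use the value loaded 2 iterations ago */
    t1 = t2;                     /* the copy that MVE eliminates */
    t2 = b[i + 2];               /* early stage of iteration i+2 */
  }
}

/* MVE: the kernel is replicated twice so each live copy has its own
   name and the t1 = t2 move disappears. Remainder iterations omitted. */
void kernel_mve(double *restrict a, const double *restrict b, int n) {
  double t_even = b[0], t_odd = b[1];
  for (int i = 0; i + 3 < n; i += 2) {
    a[i] = t_even * 2.0;
    t_even = b[i + 2];
    a[i + 1] = t_odd * 2.0;
    t_odd = b[i + 3];
  }
}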
Do you have any advice on the effectiveness of MVE or points to keep in mind when implementing it?
Modulo Variable Expansion as described in a paper by Monica Lam is replicating the loop body to avoid lifetime overlaps of different iterations. This can also be achieved by unrolling, MVE determines the minimal required unroll factor.
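Roughly, in my paraphrase of Lam's formulation: a value $v$ that is live for $\mathrm{lifetime}(v)$ cycles under initiation interval $II$ needs $\lceil \mathrm{lifetime}(v)/II \rceil$ names, so the minimal replication factor is

$$u_{\min} = \max_{v} \left\lceil \frac{\mathrm{lifetime}(v)}{II} \right\rceil$$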
I am not an expert on LLVM's MachinePipeliner/software pipelining implementation. @avl-llvm, @James_Molloy1, @ThomasRaoux, @tpreudhomme, @kparzysz-quic, @sgundapa have worked on the pipeliner, maybe they could add some comments?
I haven't looked at it in detail, but as @Meinersbur mentioned, unrolling before pipelining solves the same problem. It also allows potentially better usage of the bundle slots if the total number of slots needed in the loop is not divisible by the number of slots per bundle.
That being said, the downside of unrolling before pipelining is that it tends to make it harder to find a good schedule when doing swing modulo scheduling (this might be less true with the slack algorithm). So having such an expander may be valuable.
I haven't looked at this for a while; @hgreving might have some better suggestions.
I think the difference between doing MVE (aka post-pipelining unrolling) and unrolling before pipelining is that with the latter there is no guarantee you get fewer long live ranges, IMO, since instructions of the later unrolled iterations could get hoisted. For large(r) loops, I also tend to see more spilling when the loop bodies get larger. As for swing vs. slack: while there are distinct differences, and in our case slack is a bit more prone to introducing copies, I think that if it mattered, tweaking the heuristics for slack (which is freer in where it places nodes) would always be possible.

But back to swing and MVE: I have not studied the behavior of software pipelining on superscalar out-of-order machines like x86 and ARM, but I would expect that to use their resources effectively you would probably unroll before pipelining anyway. As for how much copies matter on those machines (with renaming engines, forwarding in HW, and so forth), no idea; maybe if they become excessive? It's a bit surprising to hear that there can be an excessive copy problem on those machines, since they don't have any very long latency instructions. Interesting. On the machines that Thomas and I have worked on, there are such instructions.

I think my main issue with MVE would be that, since we have figured out how to support unknown and low trip counts (< number of stages) in the same loop, that probably wouldn't work with MVE, and you would have to version the loop(s).

My advice would also be to study the effectiveness of unrolling before pipelining, not necessarily to make the loop larger for fewer live-range intersections, but to saturate the resource II in relation to the number of potential copies in the loop, simply because I am a little doubtful that copies matter that much on those machines. Interested to hear about your conclusions!
Thanks for the great advice! I have investigated the effects of unrolling before pipelining on a test program.
An example test program is shown below. It is intended to test the impact of the number of values with long live ranges and the length of the dependency chain.
I would also like to test a kernel based on a real application.
// In this example the number of temporary values is 4 and the length of the dependency chain is 6.
// n is 1000, so that the arrays fit into the L1 cache.
void kernel(double *restrict a, double *b, int n) {
  for (int i = 0; i < n; i++) {
    double tmp0 = b[i];
    // Temporary values with long live ranges
    double tmp1 = tmp0 + tmp0*b[i];
    double tmp2 = tmp1 + tmp1*b[i];
    double tmp3 = tmp2 + tmp2*b[i];
    double tmp4 = tmp3 + tmp3*b[i];
    // Dependency chain
    double x = tmp4;
    x = x + x*b[i];
    x = x + x*b[i];
    x = x + x*b[i];
    x = x + x*b[i];
    x = x + x*b[i];
    x = x + x*b[i];
    x = x + x*tmp1;
    x = x + x*tmp2;
    x = x + x*tmp3;
    x = x + x*tmp4;
    a[i] = x;
  }
}
I used an Ampere Altra Max processor for the measurements. It is an AArch64 OoO processor with 2 FP arithmetic units and a 4-cycle FMA latency. The scheduling model used is the one for Neoverse N2 (-mcpu=neoverse-n2), which I chose by searching for a model with a matching FMA latency.
Since the AArch64 backend does not support MachinePipeliner, I created a simple prototype.
The first table below shows the performance without unrolling, along with the maximum number of value copies required. Sched II is the actual scheduled II, chosen as the minimum at which no register spills occur, so there is no performance degradation due to spills.
Observed II, converted from the measured time, is worse than Sched II. I suspect this is due to the move instructions inserted to preserve the copies.
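To make the columns precise: Observed II is the measured execution time converted to cycles per iteration, and Efficiency is Min II divided by Observed II, e.g. 3 / 7.5 = 0.40 in the first row.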
| Num tmps | Len chain | Min II | Sched II | Max copies | Observed II | Efficiency |
|---:|---:|---:|---:|---:|---:|---:|
| 2 | 2 | 3 | 7 | 3 | 7.5 | 0.40 |
| 2 | 4 | 4 | 7 | 4 | 10.9 | 0.37 |
| 2 | 6 | 5 | 7 | 5 | 14.3 | 0.35 |
| 2 | 8 | 6 | 7 | 6 | 16.5 | 0.36 |
| 2 | 10 | 7 | 8 | 7 | 18.8 | 0.37 |
| 4 | 2 | 5 | 7 | 4 | 16.0 | 0.31 |
| 4 | 4 | 6 | 7 | 5 | 19.5 | 0.31 |
| 4 | 6 | 7 | 9 | 5 | 20.2 | 0.35 |
| 4 | 8 | 8 | 11 | 5 | 21.7 | 0.37 |
| 4 | 10 | 9 | 12 | 5 | 24.0 | 0.38 |
| 6 | 2 | 7 | 10 | 4 | 21.0 | 0.33 |
| 6 | 4 | 8 | 13 | 4 | 21.5 | 0.37 |
| 6 | 6 | 9 | 15 | 4 | 24.0 | 0.38 |
| 6 | 8 | 10 | 17 | 4 | 26.0 | 0.38 |
| 6 | 10 | 11 | 19 | 4 | 28.0 | 0.39 |
| 8 | 2 | 9 | 16 | 3 | 24.4 | 0.37 |
| 8 | 4 | 10 | 20 | 3 | 23.6 | 0.42 |
| 8 | 6 | 11 | 23 | 3 | 26.6 | 0.41 |
| 8 | 8 | 12 | 25 | 3 | 34.9 | 0.34 |
| 8 | 10 | 13 | 28 | 3 | 30.0 | 0.43 |
| 10 | 2 | 11 | 23 | 3 | 26.5 | 0.42 |
| 10 | 4 | 12 | 27 | 3 | 28.6 | 0.42 |
| 10 | 6 | 13 | 31 | 3 | 30.6 | 0.42 |
| 10 | 8 | 14 | 35 | 3 | 33.4 | 0.42 |
| 10 | 10 | 15 | 39 | 3 | 38.0 | 0.39 |
The second table shows the results of unrolling before pipelining. The unroll factor is the Max copies value from the first table, and the listed Sched II is for the unrolled body, i.e. multiplied by the same factor (so Efficiency here is Min II × Num unroll divided by Observed II).
Observed II is almost always better than Sched II, though sometimes worse when Max copies is 2. Where the results are better than Sched II, this should be due to out-of-order execution.
For reference, I also measured the performance without pipelining: with pipelining, performance is better by 20% to 70%.
| Num tmps | Len chain | Num unroll | Sched II | Max copies | Observed II | Efficiency | Efficiency w/o pipelining |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 2 | 2 | 3 | 21 | 1 | 11.7 | 0.77 | 0.65 |
| 2 | 4 | 4 | 28 | 1 | 20.4 | 0.79 | 0.61 |
| 2 | 6 | 5 | 35 | 2 | 30.5 | 0.83 | 0.58 |
| 2 | 8 | 6 | 42 | 2 | 46.8 | 0.77 | 0.52 |
| 2 | 10 | 7 | 56 | 1 | 56.7 | 0.86 | 0.51 |
| 4 | 2 | 4 | 28 | 1 | 24.8 | 0.81 | 0.59 |
| 4 | 4 | 5 | 35 | 2 | 46.0 | 0.65 | 0.55 |
| 4 | 6 | 5 | 45 | 2 | 49.0 | 0.71 | 0.51 |
| 4 | 8 | 5 | 55 | 2 | 50.5 | 0.79 | 0.49 |
| 4 | 10 | 5 | 60 | 2 | 66.0 | 0.68 | 0.45 |
| 6 | 2 | 4 | 40 | 1 | 34.4 | 0.81 | 0.51 |
| 6 | 4 | 4 | 52 | 1 | 41.6 | 0.77 | 0.48 |
| 6 | 6 | 4 | 60 | 1 | 48.4 | 0.74 | 0.45 |
| 6 | 8 | 4 | 68 | 1 | 56.0 | 0.71 | 0.45 |
| 6 | 10 | 4 | 76 | 1 | 64.4 | 0.68 | 0.41 |
| 8 | 2 | 3 | 48 | 1 | 37.2 | 0.73 | 0.47 |
| 8 | 4 | 3 | 60 | 1 | 43.2 | 0.69 | 0.43 |
| 8 | 6 | 3 | 69 | 1 | 49.2 | 0.67 | 0.42 |
| 8 | 8 | 3 | 75 | 1 | 54.6 | 0.66 | 0.40 |
| 8 | 10 | 3 | 84 | 1 | 60.9 | 0.64 | 0.38 |
| 10 | 2 | 3 | 69 | 1 | 53.4 | 0.62 | 0.42 |
| 10 | 4 | 3 | 81 | 1 | 60.3 | 0.60 | 0.41 |
| 10 | 6 | 3 | 93 | 1 | 72.9 | 0.54 | 0.39 |
| 10 | 8 | 3 | 105 | 1 | 83.4 | 0.50 | 0.38 |
| 10 | 10 | 3 | 117 | 1 | 93.9 | 0.48 | 0.37 |
In summary, the results show that unrolling is necessary to achieve the expected performance, and that unrolling before pipelining is usually sufficient.
However, I have no idea how to determine the appropriate unroll factor in advance. If unrolling after pipelining can provide sufficient performance, I believe that method would be easier to implement.
In fact, our downstream implementation uses a heavily modified version of this commit for pipelining. It's very profitable for small loops whose operations do not saturate a target's functional units (ILP) or the required delay slots of the branch back to the top.
However, this is strictly an unroller, not MVE, so each iteration is still sequential. I believe an implementation could do pre-pipeliner MVE by identifying and splitting accumulation values, but we have not attempted this yet.
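To illustrate, a hand-written source-level sketch of splitting an accumulation value (the function names are mine; a compiler transform would work on MIR, and the reassociation is only legal under fast-math):

/* Original reduction: the loop-carried dependence on sum serializes
   the adds to one per add latency, no matter how far we unroll. */
double dot(const double *a, const double *b, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

/* Split accumulators: two independent dependence chains that can
   overlap in the pipeline; requires reassociation (fast-math). */
double dot_split(const double *a, const double *b, int n) {
  double sum0 = 0.0, sum1 = 0.0;
  int i;
  for (i = 0; i + 1 < n; i += 2) {
    sum0 += a[i] * b[i];
    sum1 += a[i + 1] * b[i + 1];
  }
  if (i < n)                /* remainder iteration */
    sum0 += a[i] * b[i];
  return sum0 + sum1;
}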
Thanks for the information.
I understood that pre-pipeline unrolling is also needed to fill slots (i.e., to reduce the ratio of loop-control instructions) and to split accumulations, neither of which can be achieved by post-pipeline unrolling.
I think these could be achieved by interleaving in the LoopVectorize pass, though some improvements may be necessary to ensure that the best factors are selected; see the sketch below.
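For experiments, the interleave count can already be forced per loop with the existing Clang loop pragma (a minimal sketch; the factor 4 is an arbitrary example):

void kernel(double *restrict a, double *b, int n) {
  /* Ask LoopVectorize to interleave this loop by 4; choosing this
     factor automatically is the improvement mentioned above. */
  #pragma clang loop interleave_count(4)
  for (int i = 0; i < n; i++)
    a[i] = b[i] + 1.0;
}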