Looking for some history on MachinePipeliner

I’ve been tinkering with the MachinePipeliner for my team’s downstream target, and have run into an issue with node ordering in the SMS algorithm. In some cases, I’m receiving an ‘Invalid node order’ warning in the debug output. I’ve tracked this down to how ‘computeNodeOrder’ is implemented.

computeNodeOrder’s algorithm seems to be relatively untouched from the initial commit:

https://reviews.llvm.org/D16829

However, the algorithm is very different from those found in the 3 papers upon which the implementation is based:

https://github.com/llvm/llvm-project/blob/f40bba48a593d4297a6d0b4d1534082858b1d0ac/llvm/include/llvm/CodeGen/MachinePipeliner.h#L18

In the implementation:

https://github.com/llvm/llvm-project/blob/f40bba48a593d4297a6d0b4d1534082858b1d0ac/llvm/lib/CodeGen/MachinePipeliner.cpp#L1877

The order seems to be:

  • If pred_L is a subset of the NodeSet, BottomUp and add subset to work list
  • If succ_L is a subset of the NodeSet, TopDown and add subset to work list
  • If intersect(succ_L, NodeSet) is non-empty, TopDown and add intersect to work list
  • If there is only 1 NodeSet, BottomUp and add nodes at the bottom of the schedule (Succs.size() == 0) to work list
  • Otherwise, BottomUp and add node with the highest ASAP to work list
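
For readers who want the cascade above in one place, here is a minimal sketch of how I read those five bullets. It is not the upstream code: `chooseStart`, `Start`, and the plain `std::set<int>` node ids are all names I made up for illustration, and the inputs (pred_L, succ_L, the sink nodes, the max-ASAP node) are assumed to be computed elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

using NodeSet = std::set<int>;

enum class Direction { BottomUp, TopDown };

// Direction chosen for a NodeSet plus the nodes that seed the work list.
struct Start {
  Direction Dir;
  NodeSet WorkList;
};

// True when A is non-empty and every element of A is also in B.
bool isNonEmptySubset(const NodeSet &A, const NodeSet &B) {
  return !A.empty() && std::includes(B.begin(), B.end(), A.begin(), A.end());
}

NodeSet intersect(const NodeSet &A, const NodeSet &B) {
  NodeSet R;
  std::set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                        std::inserter(R, R.end()));
  return R;
}

// Hypothetical helper that walks the five bullets in order.
// PredL / SuccL: not-yet-ordered predecessors / successors of ordered nodes.
// Sinks: nodes of S without successors. MaxASAP: node of S with highest ASAP.
Start chooseStart(const NodeSet &PredL, const NodeSet &SuccL, const NodeSet &S,
                  unsigned NumNodeSets, const NodeSet &Sinks, int MaxASAP) {
  assert(S.count(MaxASAP) && "the max-ASAP node should belong to S");
  if (isNonEmptySubset(PredL, S))
    return Start{Direction::BottomUp, PredL};
  if (isNonEmptySubset(SuccL, S))
    return Start{Direction::TopDown, SuccL};
  NodeSet Common = intersect(SuccL, S);
  if (!Common.empty())
    return Start{Direction::TopDown, Common};
  if (NumNodeSets == 1)
    return Start{Direction::BottomUp, Sinks};
  return Start{Direction::BottomUp, NodeSet{MaxASAP}};
}
```

With pred_L = {2, 9} and S = {1, 2, 3}, none of the first four tests fire (assuming more than one NodeSet and an empty succ_L), so this sketch falls through to the single max-ASAP node.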

In the papers (using Tanya Lattner’s master’s thesis as the example), the order is:

  • If intersect(pred_L, NodeSet) is non-empty, BottomUp and add the intersection to work list
  • If intersect(succ_L, NodeSet) is non-empty, TopDown and add the intersection to work list
  • Otherwise, BottomUp and add node with the highest ASAP to work list
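
A minimal sketch of the three-step variant as summarized in the bullets above, under the same caveats as any toy model: `chooseStartPaper` and `Start` are hypothetical names, nodes are plain ints, and pred_L / succ_L are assumed to be computed elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

using NodeSet = std::set<int>;

enum class Direction { BottomUp, TopDown };

struct Start {
  Direction Dir;
  NodeSet WorkList;
};

NodeSet intersect(const NodeSet &A, const NodeSet &B) {
  NodeSet R;
  std::set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                        std::inserter(R, R.end()));
  return R;
}

// Hypothetical helper: intersection tests only, then the max-ASAP fallback.
Start chooseStartPaper(const NodeSet &PredL, const NodeSet &SuccL,
                       const NodeSet &S, int MaxASAP) {
  assert(S.count(MaxASAP) && "the max-ASAP node should belong to S");
  NodeSet FromPreds = intersect(PredL, S);
  if (!FromPreds.empty())
    return Start{Direction::BottomUp, FromPreds};
  NodeSet FromSuccs = intersect(SuccL, S);
  if (!FromSuccs.empty())
    return Start{Direction::TopDown, FromSuccs};
  return Start{Direction::BottomUp, NodeSet{MaxASAP}};
}
```

With pred_L = {2, 9} and S = {1, 2, 3}, this version goes bottom-up starting from {2}, because the intersection is non-empty even though pred_L is not contained in S.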

Interestingly, when I implement the ordering from the paper, the node order for my loop is valid.

This difference in implementation is not mentioned or commented on in the review, so I’m wondering if there are any documents or emails I may have missed in my search that would shed some light. Certainly, this ordering is just a heuristic, so I can understand that the current incarnation just resulted in better performance for the targets utilizing it, but I’m curious if there was something more to the decision.

Regards,

J.B. Nagurne

Code Generation

Texas Instruments

[Public]

It’s been a while since I’ve looked at the pipeliner… The lack of discussion and history is because the pipeliner was developed downstream for a long period of time before it was upstreamed.

Three of the cases are from Figure 6 in the original paper by Llosa and others.

  • If pred_L is a subset of the NodeSet, BottomUp and add subset to work list
  • If succ_L is a subset of the NodeSet, TopDown and add subset to work list
  • Otherwise, BottomUp and add node with the highest ASAP to work list

I’m pretty sure, at various points, we tried intersection for pred_L and succ_L, but I don’t think that gave us very good results.

The two additional cases are:

  • If intersect(succ_L, NodeSet) is non-empty, TopDown and add intersect to work list
  • If there is only 1 NodeSet, BottomUp and add nodes at the bottom of the schedule (Succs.size() == 0) to work list

They were added to improve node ordering and the generated schedule, based on problems identified when analyzing benchmarks. For the first case, if that heuristic is removed, one of the tests, swp-conv3x3-nested.ll, fails. That was an important kernel to pipeline, and it doesn’t pipeline without that heuristic.

The SMS algorithm tries to limit the cases when a node has both successors and predecessors scheduled already. This is a concern for node-sets with recurrences (because there is one node that will have successors and predecessors scheduled already). In loops that contain many node-sets that have recurrences, finding a valid ordering may be difficult. The challenge with the original three cases is that the fallback is to do a bottom-up schedule, and it starts with a single node only. We noticed that it’s important to choose the correct direction and it’s important to identify the initial set of nodes used (to start going top-down or bottom-up).
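
To make the "both successors and predecessors scheduled already" condition concrete, here is a rough sketch that counts such nodes for a given order. It is a simplification and an assumption on my part: `countBothSidesPlaced` is a made-up name, edges are plain (from, to) pairs, and it makes no allowance for recurrences, which any real check would need.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>; // (from, to) dependence edge

// Counts nodes that, at the moment they are appended to the order, already
// have both a predecessor and a successor earlier in the order.
int countBothSidesPlaced(const std::vector<Edge> &Edges,
                         const std::vector<int> &Order) {
  std::set<int> Placed;
  int Count = 0;
  for (int N : Order) {
    bool PredPlaced = false, SuccPlaced = false;
    for (const Edge &E : Edges) {
      if (E.second == N && Placed.count(E.first))
        PredPlaced = true;
      if (E.first == N && Placed.count(E.second))
        SuccPlaced = true;
    }
    if (PredPlaced && SuccPlaced)
      ++Count;
    bool Inserted = Placed.insert(N).second;
    assert(Inserted && "each node should appear once in the order");
    (void)Inserted;
  }
  return Count;
}
```

For a chain 1 -> 2 -> 3, the order {1, 3, 2} yields a count of 1 (node 2 lands between two placed neighbours), while {1, 2, 3} yields 0.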

For the first case, with the intersection, we don’t want to go bottom-up with a single node. Instead, starting top-down is better because the current node-set contains successors of nodes scheduled already. Furthermore, we want to start with the set of nodes. If we start with a single node going top-down, then the algorithm ends up switching eventually to bottom-up and creating a case where a node has a successor and predecessor scheduled already.

In the second case, with only 1 node-set, all the nodes in the basic block belong to a single node-set. With the fallback, the algorithm goes bottom-up with a single node, and it may be difficult to choose the best one to start with. This heuristic chooses multiple nodes to start with so that the bottom-up traversal occurs as a wave over a set, rather than going up, then down, and so on. It helped to prime the traversal with multiple nodes rather than a single node. (Unfortunately, there is no test case that fails if this heuristic is removed.)
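
As a rough illustration of what "priming with multiple nodes" means in the single node-set case, the sketch below collects every node with no successor. The name `bottomNodes` and the edge-list representation are assumptions for the example only; it considers just the edges it is given.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>; // (from, to) dependence edge

// Returns the nodes of S that are not the source of any edge, i.e. the
// nodes at the bottom of the graph that would seed a bottom-up traversal.
std::set<int> bottomNodes(const std::vector<Edge> &Edges,
                          const std::set<int> &S) {
  std::set<int> Result = S;
  for (const Edge &E : Edges) {
    // In the single node-set case every node is expected to be in S.
    assert(S.count(E.first) && S.count(E.second));
    Result.erase(E.first); // E.first has at least one successor
  }
  return Result;
}
```

With edges 1 -> 2, 1 -> 3, 2 -> 4 over S = {1, 2, 3, 4}, the seed set is {3, 4}, so the traversal starts from both sinks instead of one arbitrarily chosen node.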

Let me know if you have any questions. In your case, are the two new heuristics the reason why your example doesn’t schedule?

Thanks,

Brendon

Hi Brendon, glad to see the original committer is still around :blush:

“Three of the cases are from Figure 6 in the original paper by Llosa and others.

  • If pred_L is a subset of the NodeSet, BottomUp and add subset to work list
  • If succ_L is a subset of the NodeSet, TopDown and add subset to work list
  • Otherwise, BottomUp and add node with the highest ASAP to work list”

I would tend to disagree that the first two bullets are from the original paper, unless there is a different revision of the Llosa et al. paper:

In my copy, the paper’s first condition is ‘if (Pred_L(O) ∩ S) ≠ ∅’, which is not equivalent to the statement ‘if (Pred_L(O) ⊆ S)’ in the implementation.

If your intent was to say that the cases are modified versions of those base cases, then I understand.

“The SMS algorithm tries to limit the cases when a node has both successors and predecessors scheduled already.”

This is the problem my particular loop runs into.

With NodeSets:

S0 = {1, 2, 3, 4}, maxASAP is node 4

S1 = {5, 6, 7, 8}, maxASAP is node 8 (tied with 7, tie-broken because id 8 > 7)

With a skeletal set of DAG edges

2 → 6 → 7

6 → 8

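computeNodeOrder looks something like:

1. S0 is added to the node order bottom-up (default), say NodeOrder = {4, 3, 2, 1}
2. Starting on S1: pred_L(NodeOrder) is non-empty, but not a subset of S1, so execution eventually drops into bottom-up (default), with node 8 added to the work list as maxASAP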

“I would tend to disagree that the first two bullets are from the original paper, unless there is a different
revision of the Llosa et al. paper: In my copy, the paper’s first condition is ‘if (Pred_L(O) ∩ S) ≠ ∅’,
which is not equivalent to the statement ‘if (Pred_L(O) ⊆ S)’ in the implementation”

An aside:
There are indeed two SMS papers. The first is the conference version:

J. Llosa, A. Gonzalez, E. Ayguade and M. Valero, “Swing module scheduling: a lifetime-sensitive approach,”
Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique, 1996, pp. 80-86,
doi: 10.1109/PACT.1996.554030.

The second is the journal version:
J. Llosa, E. Ayguade, A. Gonzalez, M. Valero and J. Eckhardt, “Lifetime-sensitive modulo scheduling in a
production environment,” in IEEE Transactions on Computers, vol. 50, no. 3, pp. 234-249, March 2001,
doi: 10.1109/12.910814.

The journal version is likely the one that should be followed for the ordering algorithm (‘if (Pred_L(O) ∩ S) ≠ ∅’).
Additionally (although I don’t remember the details 20+ years later), there were other changes to avoid some
of the arbitrary choices made in the node ordering to tune or otherwise avoid heuristic noise.

So there is! Thankful that you were lurking and took the time to respond, Jason, or else I probably would have kept going in circles until I happened upon the difference.

To summarize:

  • Tanya Lattner’s thesis and the 2001 journal version of the SMS paper both check for simple intersection
  • The 1996 version of the SMS paper utilizes the subset check

At this point, I don’t think I’ll be suggesting that the algorithm upstream be changed since it seems to work for those downstream (our contrived loop notwithstanding), but I may be interested in developing an infrastructure change to allow the target a say in node ordering should the difference become apparent in practice.

Thanks all for the responses.

JB

[Public]

I think it would be fine to propose changing the subset test to an intersection test and try to get some feedback on the impact. Although I think I've tried it, that would have been a long time ago, so I don't remember the outcome. I'm curious because looking at the code again, it uses the intersection test, but only for successors. The following is equivalent to the existing code (and replaces separate tests for subset then intersection):

    if (pred_L(NodeOrder, N) && llvm::set_is_subset(N, Nodes)) {
        /* bottom up */
    } else if (succ_L(NodeOrder, N) && isIntersect(N, Nodes, R)) {
        /* top down */
    }

So, really, it would just change the logic for the pred_L case.
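
A short note on why the successor-side subset test is redundant once the intersection test follows it. The notation is mine: $L$ stands for the set collected by succ_L and $S$ for the current NodeSet, and I am assuming the work list is seeded with $L$ in the subset case and with $L \cap S$ in the intersection case.

```latex
\[
L \neq \emptyset \;\wedge\; L \subseteq S
\quad\Longrightarrow\quad
L \cap S = L \neq \emptyset .
\]
```

So whenever the subset test would fire, the intersection test fires as well, picks the same direction, and produces the same work list. The converse does not hold: a non-empty $L \cap S$ says nothing about the elements of $L$ outside $S$, which is why switching pred_L from subset to intersection is a real behavioral change rather than a refactoring.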

One comment about the example provided earlier. Using the subset test generates a good node order for this example.

This is the problem my particular loop runs into.
With NodeSets:
S0 = {1, 2, 3, 4}, maxASAP is node 4
S1 = {5, 6, 7, 8}, maxASAP is node 8 (tied with 7, tie-broken because id 8 > 7)

With a skeletal set of DAG edges
2 -> 6 -> 7
6 -> 8

computeNodeOrder looks something like:
1. S0 is added to the node order bottom-up (default), say NodeOrder = {4, 3, 2, 1}
2. Starting on S1: pred_L(NodeOrder) is non-empty, but not a subset of S1, so execution eventually drops into bottom-up (default), with node 8 added to the work list as maxASAP

Step 2 with the subset test should be:
2. NodeOrder is {4, 3, 2, 1}, so succ_L(NodeOrder) is {6}. Since {6} is a subset of {5, 6, 7, 8}, the top-down order is chosen. Nodes 6, 7, and 8 are added to NodeOrder({4, 3, 2, 1, 6, 7, 8}). Then the algorithm switches to bottom-up and 5 is added to produce NodeOrder({4, 3, 2, 1, 6, 7, 8, 5}).

Now, if there is a 3rd NodeSet, S3 = {9}, and an edge from, say, 4 -> 9, then the subset test doesn't work because succ_L({4, 3, 2, 1}) is {6, 9}, which is not a subset of {5, 6, 7, 8}. So if there is no subsequent intersection test, then the bottom-up order is used on NodeSet S1.
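
The two situations above can be replayed with a small model. This is only a sketch: `succL`, `isNonEmptySubset`, and `intersect` here are stand-ins written for the example (plain int node ids, edges as pairs), not the pipeliner's own helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>; // (from, to) dependence edge
using NodeSet = std::set<int>;

// Successors of already-ordered nodes that are not themselves ordered yet.
NodeSet succL(const std::vector<Edge> &Edges, const NodeSet &Ordered) {
  NodeSet R;
  for (const Edge &E : Edges)
    if (Ordered.count(E.first) && !Ordered.count(E.second))
      R.insert(E.second);
  return R;
}

bool isNonEmptySubset(const NodeSet &A, const NodeSet &B) {
  return !A.empty() && std::includes(B.begin(), B.end(), A.begin(), A.end());
}

NodeSet intersect(const NodeSet &A, const NodeSet &B) {
  NodeSet R;
  std::set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                        std::inserter(R, R.end()));
  return R;
}

void threadExample() {
  const NodeSet Ordered = {1, 2, 3, 4}; // S0, already in NodeOrder
  const NodeSet S1 = {5, 6, 7, 8};

  // Without the extra NodeSet: succ_L is {6}, a subset of S1.
  std::vector<Edge> Edges = {{2, 6}, {6, 7}, {6, 8}};
  assert(isNonEmptySubset(succL(Edges, Ordered), S1));

  // With the edge 4 -> 9: succ_L is {6, 9}, so only the intersection fires.
  Edges.push_back({4, 9});
  const NodeSet L = succL(Edges, Ordered);
  assert(!isNonEmptySubset(L, S1));
  assert(intersect(L, S1) == NodeSet{6});
}
```

In both variants the top-down work list for S1 ends up as {6}; the difference is only which test gets it there.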

Thanks,
Brendon

I do apologize for the bad abstract example. There are more edges that I elided for brevity; the full DAG and NodeSets are more complicated. I only stated that pred_L wasn’t a subset of S1 without adding the nodes and edges that make it so.

If I’m able to replicate the ordering behavior on an upstream target, I’ll look into sharing the loop code (an in-house benchmark, so the rules may be … complicated).

JB