[RFC] Window Scheduling Algorithm for MachinePipeliner in LLVM

My current work focuses on optimizing code for our own VLIW accelerator using software pipelining. Based on the method described in Steven Muchnick's book "Advanced Compiler Design and Implementation", we have implemented a Window Scheduling algorithm and would like to contribute it to the community.

Motivation:
Our accelerator mainly runs AI operator code. Because the loops of these operators are unrolled multiple times (e.g. 4 or 8 times), the kernel loop becomes relatively large, usually containing about 80~150 instructions. In addition, as on other VLIW accelerators, there are many resource conflicts between vector instructions. In this situation the Swing Modulo Scheduling (SMS) algorithm is prone to failure and cannot find a valid II. By analyzing the hardware model, we believe there is still room to improve instruction-level parallelism. Therefore, we have implemented the Window Scheduling algorithm, expecting to achieve better performance.

Algorithm Overview:
This algorithm is also described in Venkatraman Govindaraju's report "Implementation of Software Pipelining Using Window Scheduling". The scheduling result is equivalent to Modulo Scheduling with a stage count of 2, but a valid schedule is easier to obtain.
The algorithm is mainly divided into two parts:

  1. Copy the core loop twice, then adjust the window position multiple times. After each adjustment, use the List scheduling algorithm to evaluate the performance of the instructions within the window.
  2. Select the window position with the best performance, calculate the corresponding instruction order, and generate the corresponding Kernel, Prolog and Epilog code. (A sketch of this search follows below.)
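Roughly, the search in step 1 can be sketched as follows. This is a minimal, framework-agnostic illustration; copyKernelTwice and listScheduleCycles are hypothetical placeholders of mine, not the actual LLVM interfaces:

```cpp
#include <cstddef>
#include <vector>

// Placeholder for a machine instruction; in the real implementation these
// would be MachineInstr pointers.
struct Instr {};

// Hypothetical helpers standing in for the real machinery: duplicate the
// kernel body, and return the schedule length (in cycles) produced by the
// existing list scheduler for a given instruction window.
std::vector<Instr> copyKernelTwice(const std::vector<Instr> &Kernel);
unsigned listScheduleCycles(const std::vector<Instr> &Window);

// Sketch of the window-scheduling search: slide a kernel-sized window over
// the duplicated loop body, list-schedule each candidate window, and keep
// the offset with the shortest schedule.
std::size_t findBestWindowOffset(const std::vector<Instr> &Kernel) {
  std::vector<Instr> Duplicated = copyKernelTwice(Kernel);
  std::size_t BestOffset = 0;
  unsigned BestCycles = ~0u;
  // The full algorithm tries every offset (n list-scheduling runs); the
  // engineering simplification mentioned below prunes this candidate set.
  for (std::size_t Off = 0; Off + Kernel.size() <= Duplicated.size(); ++Off) {
    std::vector<Instr> Window(Duplicated.begin() + Off,
                              Duplicated.begin() + Off + Kernel.size());
    unsigned Cycles = listScheduleCycles(Window);
    if (Cycles < BestCycles) {
      BestCycles = Cycles;
      BestOffset = Off;
    }
  }
  // The instruction order at BestOffset becomes the new kernel; code before
  // and after the window turns into the Prolog and Epilog.
  return BestOffset;
}
```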

The computational cost of this algorithm is roughly n times that of List scheduling, so it is mainly intended for scenarios where performance matters most. In our engineering implementation we made some simplifications to reduce this cost, such as changing how the window positions are chosen.

Integration with LLVM:
My initial plan is to make the Window Scheduling algorithm a supplement to the SMS algorithm and use it when SMS fails. I will create a new WindowSchedulerDAG that inherits from the ScheduleDAGInstrs class. This class implements the Window Scheduling algorithm and also calls the existing List scheduling algorithm multiple times to evaluate the schedule after each window move.
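As a rough sketch of that integration (the class body and constructor details are placeholders of mine, not final code; only ScheduleDAGInstrs and its virtual schedule() hook are existing LLVM pieces, and even their exact signatures are simplified here):

```cpp
#include "llvm/CodeGen/ScheduleDAGInstrs.h"

namespace llvm {

// Illustrative sketch only: the member list and constructor arguments are
// assumptions, not the code that was eventually merged.
class WindowSchedulerDAG : public ScheduleDAGInstrs {
public:
  WindowSchedulerDAG(MachineFunction &MF, const MachineLoopInfo *MLI)
      : ScheduleDAGInstrs(MF, MLI) {}

  // ScheduleDAGInstrs expects subclasses to provide schedule(); here it
  // would drive the window search, calling the existing list scheduler once
  // per candidate window offset.
  void schedule() override {
    // 1. Duplicate the loop body and build the dependence DAG over it.
    // 2. For each window offset, run list scheduling and record the cycles.
    // 3. Commit the instruction order of the best offset so that the
    //    MachinePipeliner can emit the Kernel, Prolog and Epilog.
  }
};

} // namespace llvm
```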

Performance Evaluation:
On our accelerator, compared with the commonly used List Scheduling algorithm, the loop-kernel performance of activation operators improves by more than 10% (in cases where the SMS algorithm fails). Note that this result depends on the microarchitecture of the target; the main influencing factor is the severity of resource conflicts between instructions.

Request for Comments:
I would like to know if the Window Scheduling algorithm is suitable for contributing to the community, and I am also happy to discuss the implementation details of this algorithm with everyone.


This would be good, I think we could use scheduling infrastructure improvements.

Thank you for your feedback. I will first refactor our local algorithm code and then submit a PR.


We have submitted the refactored Window Scheduler to the community for review:

In this implementation, we made two innovations:

  1. Copy the loop kernel three times and construct the corresponding DAG to identify dependencies between MIs;
  2. Use a heuristic algorithm to obtain a set of window offsets (sketched below).
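To illustrate the second point: one simple way to obtain such a set is to sample a bounded number of evenly spaced offsets instead of trying every position. The heuristic actually used in the PR is more elaborate, so treat this only as a sketch of the idea:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: pick a bounded number of evenly spaced window offsets
// within one kernel length of the tripled loop body, instead of searching
// every position exhaustively.
std::vector<std::size_t> pickWindowOffsets(std::size_t KernelSize,
                                           std::size_t NumCandidates) {
  std::vector<std::size_t> Offsets;
  if (KernelSize == 0 || NumCandidates == 0)
    return Offsets;
  std::size_t Step = std::max<std::size_t>(1, KernelSize / NumCandidates);
  for (std::size_t Off = 0; Off < KernelSize; Off += Step)
    Offsets.push_back(Off); // offset of the window within the copied region
  return Offsets;
}
```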

In the PR, we run the Window Scheduler before Register Allocation (RA), as initially envisioned. In our DSA, we have also implemented a post-RA Window Scheduler as a derived variant, which also achieved good results.

In your original motivation you write that the loop has been unrolled already. Are you software pipelining the unrolled loop? I am wondering what this does to code size expansion. We generally find that SWP should replace unrolling.

We would also be interested in PostRA experiences. For instance, did you tweak the modulo extraction to not generate PHI nodes, or did you change it completely?

In the AI field, we primarily execute operators such as convolution, ReLU, swish, etc. on the DSA. The code size of these operators is not large, but they require high execution performance. Typically, we perform loop unrolling (4 or 8 times) together with SWP to fully utilize the hardware functional units.
We do not consider unrolling and software pipelining to be mutually exclusive. The use of these techniques depends on the specific application and the target architecture.

Applying the window scheduling algorithm in the post-RA phase is actually simpler. At this stage there are no PHI nodes, which simplifies many things. However, there is a fair amount of infrastructure code we need to write ourselves, such as the logic for prologue and epilogue generation.

Additionally, we anticipate that the increase in code size will not be substantial if loop unrolling is performed only on hot functions in other compute-intensive applications.

Ok. Our situation is different I guess. We have very limited program memory. We have one or a few compute kernels and unrolling them is prohibitive. To be fair, the inner loops of the source code already have appropriate operation density.

I can imagine it’s simpler. I’m just curious how you plug in your prologue and epilogue generation into the MachinePipeliner.

We completely rewrote this logic from the perspective of the SWP algorithm, starting from the pipelined kernel. The crucial part is getting the arrangement of the code for each stage right (see the schematic below).
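For the two-stage pipeline that the window scheduler produces, the arrangement to reproduce is, schematically, the following. This little program only prints the layout; the real post-RA code emits machine instructions instead:

```cpp
#include <cstdio>

// Schematic only: print which (iteration, stage) pairs each region of a
// two-stage software pipeline executes for N loop iterations. This is the
// per-stage arrangement the prologue/epilogue generation has to reproduce.
void dumpTwoStagePipeline(int N) {
  std::printf("prolog: iter 0 / stage 0\n");
  for (int I = 0; I + 1 < N; ++I)
    std::printf("kernel: iter %d / stage 1, iter %d / stage 0\n", I, I + 1);
  std::printf("epilog: iter %d / stage 1\n", N - 1);
}

int main() { dumpTwoStagePipeline(4); }
```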

Thank you all for your attention. The window scheduler has been merged into the mainline. We welcome everyone to try applying this algorithm to your own DSA. My colleagues and I will actively respond to any feedback you may have.

We’ll definitely have a look. Great to see progress!

Thanks for your great work. Our current DSA core is VLIW-based and relies on Swing Modulo Scheduling; we will definitely try this new algorithm.

Thank you for your feedback! When the compiler models the VLIW core precisely, it may encounter severe resource conflicts (detected via the scoreboard) that cause the SMS algorithm to fail. In such cases, you can try the window algorithm. In terms of the algorithm itself, the window algorithm is equivalent to the SMS algorithm when the stage count is 2.

Additionally, although our submitted window algorithm is positioned before RA, if you have already performed precise modeling of the VLIW core, you can also try applying the window algorithm after RA. As mentioned earlier, it requires some coding effort, but it usually produces better scheduling results.