[RFC] Window Scheduling Algorithm for MachinePipeliner in LLVM

My current work focuses on optimizing code for our own VLIW accelerator using software pipelining. Based on the method described in Steven Muchnick's book "Advanced Compiler Design and Implementation", we have implemented a Window Scheduling algorithm and would like to contribute it to the community.

Motivation:
Our accelerator mainly runs AI operator code. Because the loops of these operators are unrolled multiple times (e.g. 4 or 8 times), the kernel loop becomes relatively large, usually containing about 80~150 instructions. In addition, as on other VLIW accelerators, there are many resource conflicts between vector instructions. In this situation the Swing Modulo Scheduling (SMS) algorithm is prone to failure and cannot find a valid II. By analyzing the hardware model, we believe there is still room to improve instruction-level parallelism. Therefore, we have implemented the Window Scheduling algorithm, expecting to achieve better performance.

Algorithm Overview:
This algorithm is also described in Venkatraman Govindaraju's report "Implementation of Software Pipelining Using Window Scheduling". The scheduling result is equivalent to Modulo Scheduling with a stage count of 2, but a valid schedule is easier to obtain.
The algorithm is mainly divided into two parts:

  1. Copy the core loop twice, then adjust the window position multiple times. After each adjustment, use the List scheduling algorithm to evaluate the performance of the instructions within the window.
  2. Select the window position with the best performance, calculate the corresponding instruction order, and generate the corresponding Kernel, Prolog and Epilog code. (A sketch of this search follows below.)
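Roughly, the search in step 1 can be sketched as follows. This is a minimal, framework-agnostic illustration; copyKernelTwice and listScheduleCycles are hypothetical placeholders of mine, not the actual LLVM interfaces:

```cpp
#include <cstddef>
#include <vector>

// Placeholder for a machine instruction; in the real implementation these
// would be MachineInstr pointers.
struct Instr {};

// Hypothetical helpers standing in for the real machinery: duplicate the
// kernel body, and return the schedule length (in cycles) produced by the
// existing list scheduler for a given instruction window.
std::vector<Instr> copyKernelTwice(const std::vector<Instr> &Kernel);
unsigned listScheduleCycles(const std::vector<Instr> &Window);

// Sketch of the window-scheduling search: slide a kernel-sized window over
// the duplicated loop body, list-schedule each candidate window, and keep
// the offset with the shortest schedule.
std::size_t findBestWindowOffset(const std::vector<Instr> &Kernel) {
  std::vector<Instr> Duplicated = copyKernelTwice(Kernel);
  std::size_t BestOffset = 0;
  unsigned BestCycles = ~0u;
  // The full algorithm tries every offset (n list-scheduling runs); the
  // engineering simplification mentioned below prunes this candidate set.
  for (std::size_t Off = 0; Off + Kernel.size() <= Duplicated.size(); ++Off) {
    std::vector<Instr> Window(Duplicated.begin() + Off,
                              Duplicated.begin() + Off + Kernel.size());
    unsigned Cycles = listScheduleCycles(Window);
    if (Cycles < BestCycles) {
      BestCycles = Cycles;
      BestOffset = Off;
    }
  }
  // The instruction order at BestOffset becomes the new kernel; code before
  // and after the window turns into the Prolog and Epilog.
  return BestOffset;
}
```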

The computational cost of this algorithm is roughly n times that of List scheduling, so it is mainly intended for scenarios where performance matters most. In our engineering implementation we made some simplifications to reduce this cost, such as changing how the window positions are chosen.

Integration with LLVM:
My initial plan is to make the Window Scheduling algorithm a supplement to the SMS algorithm and use it when SMS fails. I will create a new WindowSchedulerDAG that inherits from the ScheduleDAGInstrs class. This class implements the Window Scheduling algorithm and also calls the existing List scheduling algorithm multiple times to evaluate the schedule after each window move.
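As a rough sketch of that integration (the class body and constructor details are placeholders of mine, not final code; only ScheduleDAGInstrs and its virtual schedule() hook are existing LLVM pieces, and even their exact signatures are simplified here):

```cpp
#include "llvm/CodeGen/ScheduleDAGInstrs.h"

namespace llvm {

// Illustrative sketch only: the member list and constructor arguments are
// assumptions, not the code that was eventually merged.
class WindowSchedulerDAG : public ScheduleDAGInstrs {
public:
  WindowSchedulerDAG(MachineFunction &MF, const MachineLoopInfo *MLI)
      : ScheduleDAGInstrs(MF, MLI) {}

  // ScheduleDAGInstrs expects subclasses to provide schedule(); here it
  // would drive the window search, calling the existing list scheduler once
  // per candidate window offset.
  void schedule() override {
    // 1. Duplicate the loop body and build the dependence DAG over it.
    // 2. For each window offset, run list scheduling and record the cycles.
    // 3. Commit the instruction order of the best offset so that the
    //    MachinePipeliner can emit the Kernel, Prolog and Epilog.
  }
};

} // namespace llvm
```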

Performance Evaluation:
On our accelerator, compared with the commonly used List Scheduling algorithm, the loop-kernel performance of activation operators improves by more than 10% (in cases where the SMS algorithm fails). Note that this result depends on the microarchitecture of the target; the main influencing factor is the severity of resource conflicts between instructions.

Request for Comments:
I would like to know if the Window Scheduling algorithm is suitable for contributing to the community, and I am also happy to discuss the implementation details of this algorithm with everyone.


This would be good, I think we could use scheduling infrastructure improvements.

Thank you for your feedback. I will first refactor our local algorithm code and then submit a PR.


We have submitted the refactored Window Scheduler to the community for review:

In this implementation, we made two innovations:

  1. Copy the loop kernel three times and construct the corresponding DAG to identify dependencies between MIs;
  2. Use a heuristic algorithm to obtain a set of window offsets (sketched below).
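To illustrate the second point: one simple way to obtain such a set is to sample a bounded number of evenly spaced offsets instead of trying every position. The heuristic actually used in the PR is more elaborate, so treat this only as a sketch of the idea:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: pick a bounded number of evenly spaced window offsets
// within one kernel length of the tripled loop body, instead of searching
// every position exhaustively.
std::vector<std::size_t> pickWindowOffsets(std::size_t KernelSize,
                                           std::size_t NumCandidates) {
  std::vector<std::size_t> Offsets;
  if (KernelSize == 0 || NumCandidates == 0)
    return Offsets;
  std::size_t Step = std::max<std::size_t>(1, KernelSize / NumCandidates);
  for (std::size_t Off = 0; Off < KernelSize; Off += Step)
    Offsets.push_back(Off); // offset of the window within the copied region
  return Offsets;
}
```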

In the PR, we run the Window Scheduler before Register Allocation (RA), as initially envisioned. In our DSA, we have also implemented a post-RA Window Scheduler as a derived variant, which also achieved good results.

In your original motivation you write that the loop has been unrolled already. Are you software pipelining the unrolled loop? I am wondering what this does to code size expansion. We generally find that SWP should replace unrolling.

We would also be interested in PostRA experiences. For instance, did you tweak the modulo extraction to not generate PHI nodes, or did you change it completely?

In the AI field, we primarily execute operators such as convolution, ReLU, swish, etc. on the DSA. The code size of these operators is not large, but they require high execution performance. Typically, we perform loop unrolling (4 or 8 times) together with SWP to fully utilize the hardware functional units.
We do not consider unrolling and software pipelining to be mutually exclusive. The use of these techniques depends on the specific application and the target architecture.

Applying the window scheduling algorithm in the post-RA phase is actually simpler. At this stage there are no PHI nodes, which simplifies many things. However, there is a fair amount of infrastructure code we need to write ourselves, such as the logic for prologue and epilogue generation.

Additionally, we anticipate that the increase in code size will not be substantial if loop unrolling is performed only on hot functions in other compute-intensive applications.

Ok. Our situation is different I guess. We have very limited program memory. We have one or a few compute kernels and unrolling them is prohibitive. To be fair, the inner loops of the source code already have appropriate operation density.

I can imagine it’s simpler. I’m just curious how you plug in your prologue and epilogue generation into the MachinePipeliner.

We completely rewrote this logic from the perspective of the SWP algorithm, starting from the pipelined kernel. The crucial part is getting the arrangement of the code for each stage right (see the schematic below).
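For the two-stage pipeline that the window scheduler produces, the arrangement to reproduce is, schematically, the following. This little program only prints the layout; the real post-RA code emits machine instructions instead:

```cpp
#include <cstdio>

// Schematic only: print which (iteration, stage) pairs each region of a
// two-stage software pipeline executes for N loop iterations. This is the
// per-stage arrangement the prologue/epilogue generation has to reproduce.
void dumpTwoStagePipeline(int N) {
  std::printf("prolog: iter 0 / stage 0\n");
  for (int I = 0; I + 1 < N; ++I)
    std::printf("kernel: iter %d / stage 1, iter %d / stage 0\n", I, I + 1);
  std::printf("epilog: iter %d / stage 1\n", N - 1);
}

int main() { dumpTwoStagePipeline(4); }
```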

Thank you all for your attention. The window scheduler has been merged into the mainline. We welcome everyone to try applying this algorithm to your own DSA. My colleagues and I will actively respond to any feedback you may have.

We’ll definitely have a look. Great to see progress!

Thanks for your great work. Our current DSA core is VLIW-based and relies on Swing Modulo Scheduling; we will definitely try this new algorithm.

Thank you for your feedback! When the compiler models the VLIW core precisely, it may encounter severe resource conflicts (detected via the scoreboard) that cause the SMS algorithm to fail. In such cases, you can try the window algorithm. In terms of the algorithm itself, the window algorithm is equivalent to the SMS algorithm when the stage count is 2.

Additionally, although our submitted window algorithm is positioned before RA, if you have already performed precise modeling of the VLIW core, you can also try applying the window algorithm after RA. As mentioned earlier, it requires some coding effort, but it usually produces better scheduling results.