My current work focuses on optimizing our own VLIW accelerator code using software pipelining. Based on the method in Steven Muchnick’s book “Advanced Compiler Design And Implementation”, we have implemented a Window Scheduling algorithm and hope to contribute this algorithm to the community.
Motivation:
Our accelerator mainly runs AI operator code. Due to the loops of these operators being unrolled multiple times, such as 4, 8 times, etc., the kernel loop becomes relatively large, usually containing about 80~150 instructions. In addition, similar to other VLIW accelerators, there are many resource conflicts between vector instructions. At this time, the Swing Modulo Scheduling (SMS) algorithm is prone to failure, unable to search for a valid II. By analyzing the hardware model, we believe there is still an opportunity to improve instruction parallelism. Therefore, we have implemented the Window Scheduling algorithm, expecting to achieve better performance.
Algorithm Overview:
This algorithm is also described in Venkatraman Govindaraju’s report “Implementation of Software Pipelining Using Window Scheduling”. The scheduling result is equivalent to Modulo Scheduling with Stage 2, but it is easier to obtain valid results.
The algorithm is mainly divided into two parts:
- Copy the core loop twice, and then adjust the window position multiple times. After each adjustment, use the List scheduling algorithm to calculate the performance of the instructions within the window.
- Select the window position with the best performance, calculate the corresponding instruction order, and generate the corresponding Kernel, Prolog and Epilog code.
The computational complexity of this algorithm is n times that of List scheduling and is mainly used in scenarios where the performance is highly concerned. In engineering, we made some simplifications to reduce the computational complexity, such as changing the method of adjusting window positions.
Integration with LLVM:
I initially want to make the Window Scheduling algorithm a supplement to the SMS algorithm, and use it when SMS fails. I will create a new WindowSchedulerDAG, inheriting from the ScheduleDAGInstrs class. This class implements the Window Scheduling algorithm and also calls the existing List scheduling algorithm multiple times to calculate the scheduling performance after the window is moved.
Performance Evaluation:
On our accelerator, compared with the commonly used List Schedule algorithm, the loop kernel performance of Activation operators can be increased by more than 10% (when the SMS algorithm fails). Note that this result is related to the microarchitecture of the Target, and the main influencing factor is the severity of resource conflicts between instructions.
Request for Comments:
I would like to know if the Window Scheduling algorithm is suitable for contributing to the community, and I am also happy to discuss the implementation details of this algorithm with everyone.