Hi,
I have been studying LLVM infrastructure from past 1 month and trying out some hands on to get familiar with it. I am keen to propose loop reversal pass in LLVM as a part of GSoC 2015. Following are the details of my proposal.
A for loop can run in two ways – it can count up or it can count down.
-
for(i=0; i<10 ;i++) {// Do something}
-
for(i=9; i–
{// Do something}
Counting down to zero with the decrement operator (i–) is faster than counting up to a number of iterations with the increment operator (i++).
I have run a program with sufficient iteration for both types of loops (The program file Loop.cpp is attached with the mail). The average run time observed for both the loops over 100 runs
Loop counting up 0.15 ms
Loop counting down 0.08 ms
As observed, there is an execution time gain of almost 50%. The test case is rather simple though; need to check in wide variety of test cases.
It is a legal transformation if the resulting dependence vector remains lexicographically positive. Although trivial, it is a useful optimization since it may enable others such as loop interchange and can reduce the loop exit condition to a single branch-not-equal-to-zero instruction.
For example, the following code cannot be interchanged or have its inner loop parallelized because of (1,−1) dependencies.
do i = 1, n
do j = 1, n
a[i,j] = a[i-1,j+1] + 1
end do
end do
Reversing the inner loop yields (1, 1) dependencies. The loops can now be interchanged and/or the inner loop made parallel.
do i = 1, n
do j = n, 1, -1
a[i,j] = a[i-1,j+1] + 1
end do
end do
To see the difference in LLVM IR, I have taken a simpler test case:
int a[5], b[5], c[5], d[5];
void foo() {
int i;
for(i=0; i<5; i++)
a[i] = b[i] + c[i];
}
void foo2() {
int i;
for(i=4; i–
d[i] = a[i] + b[i];
}
IR for foo()
define void @foo() {
br label %1
; :1 ; preds = %13, %0
%i.0 = phi i32 [ 0, %0 ], [ %14, %13 ]
%2 = icmp slt i32 %i.0, 5
br i1 %2, label %3, label %15
; :3 ; preds = %1
%4 = sext i32 %i.0 to i64
%5 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %4
%6 = load i32, i32* %5, align 4
%7 = sext i32 %i.0 to i64
%8 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %7
%9 = load i32, i32* %8, align 4
%10 = add nsw i32 %6, %9
%11 = sext i32 %i.0 to i64
%12 = getelementptr inbounds [5 x i32], [5 x i32]* @a, i32 0, i64 %11
store i32 %10, i32* %12, align 4
br label %13
; :13 ; preds = %3
%14 = add nsw i32 %i.0, 1
br label %1
; :15 ; preds = %1
ret void
}
IR for foo2()
define void @foo2() #0 {
br label %1
; :1 ; preds = %4, %0
%i.0 = phi i32 [ 4, %0 ], [ %2, %4 ]
%2 = add nsw i32 %i.0, -1
%3 = icmp ne i32 %i.0, 0
br i1 %3, label %4, label %14
; :4 ; preds = %1
%5 = sext i32 %2 to i64
%6 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %5
%7 = load i32, i32* %6, align 4
%8 = sext i32 %2 to i64
%9 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %8
%10 = load i32, i32* %9, align 4
%11 = add nsw i32 %7, %10
%12 = sext i32 %2 to i64
%13 = getelementptr inbounds [5 x i32], [5 x i32]* @d, i32 0, i64 %12
store i32 %11, i32* %13, align 4
br label %1
; :14 ; preds = %1
ret void
}
As visible, one of the blocks is eliminated in case of loop reverse traversal, and induction variable updating instruction (i–) has been hoisted to the start block.
(This has constant inbounds of the loop, which in further optimizations eliminates all the blocks. The above IR is for demo purpose)
As of now I can think of 3 stages to achieve this :
-
Check legality if the loop can be reversed – check dependence analysis in the loop , various scenarios will come – exit condition being constant, variable, nested loops, etc.
-
Check profitability of the loop – check if somehow the loop exit condition can be trimmed down to comparison with zero ( More Inputs on this are required )
-
If steps 1 and 2 are true, then reverse the loop.
Inputs on 1st and 2nd steps are the critical for implementation.
I am pasting some of the links of the papers, where I found mention of loop traversal as one of the prominent loop optimization techniques.
http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf
http://perso.citi.insa-lyon.fr/trisset/papers/sympa08.pdf
https://www.complang.tuwien.ac.at/andi/papers/ijcsee_13.pdf
ftp://gcc.gnu.org/pub/gcc/summit/2004/High%20Level%20Loop%20Optimizations.pdf
Can anyone review my idea and suggest improvements? I will be glad if someone mentors me on this, especially on the first 2 steps, if idea is found to be useful.
Regards,
Vishal Sarda,
3rd Year Undergraduate,
Department of Computer Engineering,
College of Engineering, Pune
Loop.cpp (978 Bytes)