GSoC:Loop Reversal Transformation

Hi,

I have been studying the LLVM infrastructure for the past month and trying out some hands-on work to get familiar with it. I am keen to propose a loop reversal pass in LLVM as part of GSoC 2015. The details of my proposal follow.

A for loop can run in two ways – it can count up or it can count down.

  1. for(i=0; i<10 ;i++) {// Do something}

  2. for(i=9; i--;) {// Do something}

Counting down to zero with the decrement operator (i--) can be faster than counting up to the iteration count with the increment operator (i++), since the exit test becomes a comparison against zero.

I ran a program with a sufficient number of iterations for both types of loops (the program file Loop.cpp is attached to this mail). The average run times observed for both loops over 100 runs were:

Loop counting up: 0.15 ms

Loop counting down: 0.08 ms

As observed, there is an execution time gain of almost 50%. The test case is rather simple, though; this needs to be checked on a wide variety of test cases.

Loop reversal is a legal transformation if the resulting dependence vectors remain lexicographically positive. Although trivial, it is a useful optimization, since it may enable others such as loop interchange, and it can reduce the loop exit condition to a single branch-not-equal-to-zero instruction.

For example, the following code cannot be interchanged or have its inner loop parallelized because of (1,−1) dependencies.

do i = 1, n
  do j = 1, n
    a[i,j] = a[i-1,j+1] + 1
  end do
end do

Reversing the inner loop yields (1, 1) dependencies. The loops can now be interchanged and/or the inner loop made parallel.

do i = 1, n
  do j = n, 1, -1
    a[i,j] = a[i-1,j+1] + 1
  end do
end do
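
To make the legality criterion concrete, here is a small, self-contained C++ sketch (my own illustration, not existing LLVM code; the function names and the vector-of-int representation are assumptions): reversing the loop at level k negates the k-th entry of every dependence vector carried by the nest, and the reversal is legal only if every resulting vector stays lexicographically positive.

// Sketch of the legality check described above, independent of LLVM.
// Each dependence is a distance vector, one entry per loop level, outermost first.
#include <iostream>
#include <vector>

// A vector is lexicographically positive if its first non-zero entry is > 0.
static bool lexicographicallyPositive(const std::vector<int> &DV) {
  for (int D : DV) {
    if (D > 0) return true;
    if (D < 0) return false;
  }
  return false; // all-zero vector: loop-independent dependence
}

// Reversing the loop at 'Level' (0-based) negates that entry of every
// dependence vector; the reversal is legal only if every resulting vector
// is still lexicographically positive (or all zero).
static bool isReversalLegal(const std::vector<std::vector<int>> &Deps,
                            unsigned Level) {
  for (std::vector<int> DV : Deps) {
    DV[Level] = -DV[Level];
    bool AllZero = true;
    for (int D : DV)
      if (D != 0) { AllZero = false; break; }
    if (!AllZero && !lexicographicallyPositive(DV))
      return false;
  }
  return true;
}

int main() {
  // The example above: a[i,j] = a[i-1,j+1] + 1 carries a (1,-1) dependence.
  std::vector<std::vector<int>> Deps = {{1, -1}};
  std::cout << "reverse inner loop: "
            << (isReversalLegal(Deps, 1) ? "legal" : "illegal") << "\n"; // legal: becomes (1,1)
  std::cout << "reverse outer loop: "
            << (isReversalLegal(Deps, 0) ? "legal" : "illegal") << "\n"; // illegal: becomes (-1,-1)
  return 0;
}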

To see the difference in LLVM IR, I have taken a simpler test case:

int a[5], b[5], c[5], d[5];

void foo() {
  int i;
  for(i=0; i<5; i++)
    a[i] = b[i] + c[i];
}

void foo2() {
  int i;
  for(i=4; i--;)
    d[i] = a[i] + b[i];
}

IR for foo()

define void @foo() {
  br label %1

; <label>:1 ; preds = %13, %0
  %i.0 = phi i32 [ 0, %0 ], [ %14, %13 ]
  %2 = icmp slt i32 %i.0, 5
  br i1 %2, label %3, label %15

; <label>:3 ; preds = %1
  %4 = sext i32 %i.0 to i64
  %5 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %4
  %6 = load i32, i32* %5, align 4
  %7 = sext i32 %i.0 to i64
  %8 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %7
  %9 = load i32, i32* %8, align 4
  %10 = add nsw i32 %6, %9
  %11 = sext i32 %i.0 to i64
  %12 = getelementptr inbounds [5 x i32], [5 x i32]* @a, i32 0, i64 %11
  store i32 %10, i32* %12, align 4
  br label %13

; <label>:13 ; preds = %3
  %14 = add nsw i32 %i.0, 1
  br label %1

; <label>:15 ; preds = %1
  ret void
}

IR for foo2()

define void @foo2() #0 {
  br label %1

; <label>:1 ; preds = %4, %0
  %i.0 = phi i32 [ 4, %0 ], [ %2, %4 ]
  %2 = add nsw i32 %i.0, -1
  %3 = icmp ne i32 %i.0, 0
  br i1 %3, label %4, label %14

; <label>:4 ; preds = %1
  %5 = sext i32 %2 to i64
  %6 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %5
  %7 = load i32, i32* %6, align 4
  %8 = sext i32 %2 to i64
  %9 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %8
  %10 = load i32, i32* %9, align 4
  %11 = add nsw i32 %7, %10
  %12 = sext i32 %2 to i64
  %13 = getelementptr inbounds [5 x i32], [5 x i32]* @d, i32 0, i64 %12
  store i32 %11, i32* %13, align 4
  br label %1

; <label>:14 ; preds = %1
  ret void
}

As can be seen, one of the basic blocks is eliminated in the reverse-traversal case, and the induction variable update instruction (i--) has been hoisted into the loop header block.

(This example has constant loop bounds, so later optimizations would eliminate the loop entirely; the IR above is shown only for demonstration.)

As of now, I can think of three stages to achieve this:

  1. Check legality – determine, via dependence analysis on the loop, whether it can be reversed; various scenarios will come up – the exit condition being a constant or a variable, nested loops, etc.

  2. Check profitability – determine whether the loop exit condition can be reduced to a comparison against zero (more input on this is needed).

  3. If steps 1 and 2 succeed, reverse the loop.

Input on the first and second steps is critical for the implementation; a rough sketch of how such a pass might be structured is given below.
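
Below is a rough skeleton of how the three stages might be wired together as a loop pass under the current legacy pass manager. Only the LoopPass/analysis boilerplate is real LLVM API; isLegalToReverse, isProfitableToReverse and reverseLoop are hypothetical placeholders for stages 1-3, not existing LLVM functions.

// Skeleton only: the three helpers are placeholders for stages 1-3.
#include "llvm/Analysis/DependenceAnalysis.h"
#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/ScalarEvolution.h"

using namespace llvm;

namespace {
struct LoopReversal : public LoopPass {
  static char ID;
  LoopReversal() : LoopPass(ID) {}

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<DependenceAnalysis>(); // stage 1: dependence legality
    AU.addRequired<ScalarEvolution>();    // stage 2: trip count / bounds
  }

  bool runOnLoop(Loop *L, LPPassManager &) override {
    // Stage 1: every dependence carried by this loop must stay
    // lexicographically positive once its direction is flipped.
    if (!isLegalToReverse(L))
      return false;

    // Stage 2: only reverse when the exit test can be reduced to a
    // compare/branch against zero, or it enables interchange etc.
    if (!isProfitableToReverse(L))
      return false;

    // Stage 3: rewrite the induction variable, bounds and exit condition.
    return reverseLoop(L);
  }

  // Placeholders: conservatively do nothing for now.
  bool isLegalToReverse(Loop *L) { return false; }
  bool isProfitableToReverse(Loop *L) { return false; }
  bool reverseLoop(Loop *L) { return false; }
};
} // end anonymous namespace

char LoopReversal::ID = 0;
static RegisterPass<LoopReversal> X("loop-reversal", "Loop Reversal Pass");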

I am pasting links to some of the papers where I found loop reversal mentioned as one of the prominent loop optimization techniques.

http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

http://perso.citi.insa-lyon.fr/trisset/papers/sympa08.pdf

https://www.complang.tuwien.ac.at/andi/papers/ijcsee_13.pdf

ftp://gcc.gnu.org/pub/gcc/summit/2004/High%20Level%20Loop%20Optimizations.pdf

Can anyone review my idea and suggest improvements? I would be glad if someone could mentor me on this, especially on the first two steps, if the idea is found to be useful.

Regards,

Vishal Sarda,

3rd Year Undergraduate,

Department of Computer Engineering,

College of Engineering, Pune

Loop.cpp (978 Bytes)

Hi Matt, All,

Thanks for looking into this and your suggestions.

I compiled the program at the -O3 optimization level (clang -O3) on an x86_64 target.
Following your suggestion, I ran the program for 10000 runs and found the following average runtimes (the collected data is attached as well):

Forward loop traversal: 2.006 milliseconds
Reverse loop traversal: 1.531 milliseconds

Yes, I agree that this sample program may not be sufficient to claim that reverse loop traversal will always be faster; however, the difference in runtime is visible. And that is where the profitability calculation comes into the picture.

I found loop reversal mentioned in various papers (links in the previous mail), and hence thought of implementing it.

I am not sure, though, whether it is really helpful on its own. The papers above mention that it might not be beneficial in itself, but that it opens up opportunities for other optimizations.

Suggestions on this are most welcome. Waiting for others to pitch in too.

Regards,
Vishal Sarda,
3rd Year Undergraduate,
Department of Computer Engineering,
College of Engineering, Pune

Compare.xlsx (270 KB)

Hi Vishal,

I am still a little confused - can you share your new code? Be aware that you’re relying on undefined behaviour: the “unsigned int” you are adding to is overflowing, and clang will use this fact.

I was able to replicate your initial results but I found that if I ran the counting code twice, the second time both types of loop took much the same time:

$ make; and ./loop
make: Nothing to be done for `all’.
First run:
0:834
0:405
Second run:
0:381
0:384

(on a 3.5GHz Haswell).

I hacked together a very simplistic version using a memory barrier to defeat the optimizer instead of adding (which means the loop actually gets compiled: http://url.godbolt.org/hqwkx ). Additionally, I used the CPU cycle counter on x86 (again, very simplistically; there’s a huge art to this: I can share other resources if you’re interested). Counting to 2.4 billion takes an appreciable amount of time now :)
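
In rough outline, the harness looks something like the sketch below (my approximation here, not the exact attached source - see the attachment for the real thing): an empty asm statement with a "memory" clobber keeps the loop from being optimized away, and __rdtsc() reads the x86 time-stamp counter.

// Approximate sketch of the measurement approach; the attached source differs in detail.
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

static const uint32_t N = 2400000000u; // ~2.4 billion iterations

int main() {
  uint64_t Start = __rdtsc();
  for (uint32_t i = 0; i < N; ++i)
    asm volatile("" ::: "memory"); // barrier: keeps the loop body from being removed
  printf("Counting up:   %llu cycles\n",
         (unsigned long long)(__rdtsc() - Start));

  Start = __rdtsc();
  for (uint32_t i = N; i--; )
    asm volatile("" ::: "memory");
  printf("Counting down: %llu cycles\n",
         (unsigned long long)(__rdtsc() - Start));
  return 0;
}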

The results from that:

/t/lop $ make; and ./loop2
First run:
Counting up: 2528555165 cycles
Counting down: 2524361064 cycles
Second run:
Counting up: 2495025808 cycles
Counting down: 2511668857 cycles

Source and makefile attached.

If you look at the code clang produces in both cases (see the URL), it has already transformed both loops into effectively the same thing: adding and/or subtracting until hitting zero. At least in this case, where the loop counter value itself isn’t used.

Sorry I haven’t added much to the discussion, but hopefully this email is useful in gauging the relevance of the feature.

Best regards, Matt

loop.tar.gz (867 Bytes)

Hi Matt,

Thanks for the input.

I ran the same program attached earlier, Loop.cpp, with -O3. Using a shell script in the terminal, I ran it 10000 times and redirected the output to a file:

  1. clang++ -O3 Loop.cpp
  2. $ for i in {1..10000}; do ./a.out >> qwerty; done
  3. Sorted the results in an Excel sheet.

I have also looked at the generated assembly, and I agree with your point that there is not much difference in the instructions.
However, in my opinion, and as some of the papers linked in the previous mail suggest, this might open up opportunities for other optimizations to come into the picture.

For example, as shown earlier:

do i = 1, n
do j = 1, n
a[i,j] = a[i-1,j+1] + 1
end do
end do

For the above example, we cannot interchange the loops or parallelize the inner loop because of the (1, -1) dependence.

However, if we traverse the inner loop in reverse,

do i = 1, n
do j = n, 1, -1
a[i,j] = a[i-1,j+1] + 1
end do
end do

the resulting dependence is (1, 1), which now allows inner loop parallelization or loop interchange (which might improve locality of reference and result in fewer cache misses). A Loop Interchange pass is already in LLVM trunk.

What are your thoughts on this pass creating opportunities for other passes to kick in? I might be wrong, though.

Awaiting your valuable input. (Sorry for the late reply; it is due to the time zone difference.)

Regards,
Vishal Sarda,
3rd Year Undergraduate,
Department of Computer Engineering,
College of Engineering, Pune