I would like to know the motivation behind the current design of the acc.loop directive. Right now it simply denotes that the very next structured or unstructured loop should be considered for analysis and transformation by the OpenACC compiler. At the same time, the OpenMP dialect introduced wsloop, which also has an induction variable. So wsloop is in fact a loop operation, whereas acc.loop is more like a hint to the compiler. I see the following issues with the current acc.loop design:
- Other loop transformations may be applied to whatever is inside acc.loop, breaking the semantics of the original user program (e.g., tiling would introduce additional loops, as can loop unrolling).
- It is not clear how to apply collapse in this case. Consider the following code:
program sample
implicit none
integer :: i, j
!collapse(2)
!$acc parallel loop collapse(2)
do i = 1,1000
do j = 1,1000
!some workload…
end do
end do
!$acc end parallel loop
end program sample
Right now MLIR for this piece of code would look like:
acc.parallel {
acc.loop {
%c1_i32 = arith.constant 1 : i32
%4 = fir.convert %c1_i32 : (i32) -> index
%c1000_i32 = arith.constant 1000 : i32
%5 = fir.convert %c1000_i32 : (i32) -> index
%c1 = arith.constant 1 : index
%6 = fir.convert %4 : (index) -> i32
%7:2 = fir.do_loop %arg0 = %4 to %5 step %c1 iter_args(%arg1 = %6) -> (index, i32) {
fir.store %arg1 to %0 : !fir.ref<i32>
%c1_i32_0 = arith.constant 1 : i32
%8 = fir.convert %c1_i32_0 : (i32) -> index
%c1000_i32_1 = arith.constant 1000 : i32
%9 = fir.convert %c1000_i32_1 : (i32) -> index
%c1_2 = arith.constant 1 : index
%10 = fir.convert %8 : (index) -> i32
%11:2 = fir.do_loop %arg2 = %8 to %9 step %c1_2 iter_args(%arg3 = %10) -> (index, i32) {
}
}
}
}
Due to the rogue fir.store between the two nested loops, it is hard to tell whether they were perfectly nested in the original code. Of course, one can try to rely on a mem2reg-like optimization. But that brings new problems: there's no guarantee that such an optimization will succeed, and if it does, it will be hard to diagnose the case where the loops were not perfectly nested (of course, that will be handled by the Flang FE long before, but what about IR-level validation?).
On the other hand, introducing multiple induction variables to acc.loop would solve both problems:
acc.parallel {
acc.loop for (%arg0, %arg1) : i32 = (%c1_i32, %c1_i32) to (%c1000_i32, %c1000_i32) step (%c1_i32, %c1_i32) {
}
}
In this example, loop collapse was done during code generation. At the same time, various optimizations won’t be applied to the loops in question.
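To make the intended collapse semantics concrete, here is a minimal Python sketch of what a multi-induction-variable loop would iterate over. It is only an illustration, not anything from the OpenACC dialect: the function name `collapsed_indices` is hypothetical, and it assumes inclusive upper bounds (as in Fortran DO loops) and positive steps.

```python
def collapsed_indices(bounds):
    """Yield index tuples of a perfectly nested loop nest as one flat loop.

    bounds: list of (lower, upper, step), outermost loop first.
    Assumes inclusive upper bounds and positive steps (illustrative only).
    """
    # Trip count of each loop; an empty range yields zero iterations.
    trip_counts = [max(0, (ub - lb) // step + 1) for lb, ub, step in bounds]

    # Size of the collapsed iteration space.
    total = 1
    for n in trip_counts:
        total *= n

    for linear in range(total):
        rem = linear
        idx = []
        # Peel off the innermost loop first: it varies fastest.
        for (lb, _, step), n in zip(reversed(bounds), reversed(trip_counts)):
            idx.append(lb + (rem % n) * step)
            rem //= n
        yield tuple(reversed(idx))


if __name__ == "__main__":
    # Small stand-in for the 1000 x 1000 nest from the Fortran sample.
    for i, j in collapsed_indices([(1, 3, 1), (1, 2, 1)]):
        print(i, j)
```

The point is that the single flat loop visits exactly the same (i, j) pairs, in the same order, as the original nest, which is only well defined when nothing sits between the two loops.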
So, are there any plans to support induction variables for acc.loop? If not, is there any plan to resolve the above issues?
cc @clementval