`acc.loop` gaps with `omp.wsloop`

I would like to know the motivation behind the current design of the acc.loop operation. Right now it simply denotes that the very next structured or unstructured loop should be considered for analysis and transformation by the OpenACC compiler. At the same time, the OpenMP dialect introduced omp.wsloop, which also carries induction variables. So wsloop is in fact a loop operation, while acc.loop is more like a hint to the compiler. I see the following issues with the current acc.loop design:

  1. Other loop transformations may be applied to whatever is inside acc.loop, breaking the semantics of the original user program (e.g., tiling would introduce additional loops, and so can loop unrolling).
  2. It is not clear how to apply collapse in this case. Consider the following code:
program sample
implicit none
  integer :: i, j

  !collapse(2)
  !$acc parallel loop collapse(2)
  do i = 1,1000
    do j = 1,1000
        !some workload…
    end do
  end do
  !$acc end parallel loop

end program sample

Right now the MLIR for this piece of code looks like this:

acc.parallel {
  acc.loop {
    %c1_i32 = arith.constant 1 : i32
    %4 = fir.convert %c1_i32 : (i32) -> index
    %c1000_i32 = arith.constant 1000 : i32
    %5 = fir.convert %c1000_i32 : (i32) -> index
    %c1 = arith.constant 1 : index
    %6 = fir.convert %4 : (index) -> i32
    %7:2 = fir.do_loop %arg0 = %4 to %5 step %c1 iter_args(%arg1 = %6) -> (index, i32) {
      fir.store %arg1 to %0 : !fir.ref<i32>
      %c1_i32_0 = arith.constant 1 : i32
      %8 = fir.convert %c1_i32_0 : (i32) -> index
      %c1000_i32_1 = arith.constant 1000 : i32
      %9 = fir.convert %c1000_i32_1 : (i32) -> index
      %c1_2 = arith.constant 1 : index
      %10 = fir.convert %8 : (index) -> i32
      %11:2 = fir.do_loop %arg2 = %8 to %9 step %c1_2 iter_args(%arg3 = %10) -> (index, i32) {
      }
    }
  }
}

Due to the rogue fir.store between the two nested loops, it is hard to say whether they were perfectly nested in the original code. Of course, one can try to rely on a mem2reg-like optimization. But that brings new problems: there is no guarantee such an optimization would be successful, and even if it is, it will be hard to diagnose the case where the loops were not perfectly nested (of course, that will be handled by the Flang frontend long before, but what about IR-level validation?).
On the other hand, introducing multiple induction variables to acc.loop would solve both problems:

acc.parallel {
      acc.loop for  (%arg0, %arg1) : i32 = (%c1_i32, %c1_i32) to (%c1000_i32, %c1000_i32) step (%c1_i32, %c1_i32) {
     }
}

In this example, the loop collapse is done during code generation, and at the same time various optimizations will not be applied to the loops in question.
So, are there any plans to support induction variables for acc.loop? If not, is there any plan to resolve the above issues?

cc @clementval

I will leave it for @clementval to answer. But FWIU this was part of an initial prototype to generate code. So there can be further design changes.

Also, FYI: on the OpenMP side, there are plans to introduce an omp.canonical_loop operation. That could be another way to go about it.

Reference:

https://reviews.llvm.org/D147658


As @kiranchandramohan mentioned, acc.loop was added a while back, and my contributions to the OpenACC dialect have been on hold while I was working on upstreaming flang and implementing polymorphic entities.

We have just resumed work on OpenACC, and it is likely that some operations will get improvements or a complete redesign. We are not working on acc.loop at the moment, but I’ll share our proposed design when/if we have one. It is likely to be in the near future.


Thank you @kiranchandramohan @clementval for the replies.

I think I still fail to understand how this is going to help with #2 from my original post. The conversion from the AST to IR can have side effects, and even though the loops were perfectly nested in the user code, they may not appear so in the IR. And even if we apply some transformations to the IR, they can a) go against user instructions (like with -O0); b) be insufficient to recover the perfect nest. Overall, the whole construction just appears to be fragile.

If a nested loop to be collapsed can be represented as

omp.wsloop for (%arg0, %arg1) : i32 = (%c1_i32, %c1_i32) to (%c1000_i32, %c1000_i32) step (%c1_i32, %c1_i32) 

then I am reasonably sure that it can also be represented as

omp.workshare_loop collapse(2)
omp.canonical_loop %arg0 : i32 = %ci1 to %ci1000 step %c1
omp.canonical_loop %arg1 : i32 = %cj1 to %cj1000 step %c2

or

%outer = omp.canonical_loop for (%ci1) to (%ci1000) step (%c1) {
  %inner = omp.canonical_loop for (%cj1) to (%cj1000) step (%c2) {
    ..
  }
}
omp.workshare_loop collapse(%outer, %inner)

The additional instructions probably came in due to the required conversions. We can probably use an interface or the verifier to require that the loops remain nested. Also, why is perfect nesting required? I believe we can collapse loops even without perfect nesting.

The issue with the omp.wsloop scheme is that it is difficult to apply worksharing to a loop that is generated by other directives: for example, there can be directives to split a loop and then apply worksharing to the second loop of the split.
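As a rough illustration of that kind of composition (just a sketch, not the proposed design), OpenMP 5.1 already has loop transformation constructs such as tile. In the Fortran snippet below, the worksharing-loop construct is meant to bind to the tile loop produced by the transformation rather than to a loop written by the user; work and n are placeholders.

!$omp do
!$omp tile sizes(32)
do i = 1, n
  call work(i)   ! placeholder body
end do

An operation that must directly wrap a source-level loop has a hard time expressing this, whereas with first-class loop operations such as omp.canonical_loop the generated loop is just another value the worksharing operation can refer to.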

Conversions + memory SSA. It is probably possible to emit perfectly nested code from the AST during the lowering phase (is it legal for a nested omp loop to depend on the parent induction variable? If not, then the task should be trivial). The problem with collapsing loops without perfect nesting in the IR is that it makes it harder to verify that the collapse can actually be done. As long as we are talking about Flang or Clang, those have a different representation available, which can be used to identify legal transformations. But the omp dialect can be generated from other dialects as well. It is easy to allow loop invariants between the nested loops, though.

I also see that reductions are designed differently in the omp and acc dialects. @clementval do you also plan to work on those? I can see pros and cons for both designs, and I don’t have a strong preference for either, but I think having a unified design would be nice.

Reductions will certainly be redesigned as well. I cannot tell right now if we will go with the same design as the OpenMP dialect.


This is the idea. If there are additional instructions in %outer but before/after %inner, the OpenMP specification allows them to be moved into %inner. Depending on the directive, it could also do something else, such as executing them at most once per thread, but the default handling of omp.collapse would be to just sink the instructions, as that is always safe. That’s what OpenMPIRBuilder::collapseLoops does.
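As a rough source-level sketch of that default sinking (illustrative Fortran; work1 and work2 are placeholder routines): the call to work1 is intervening code whose result is not used by the inner loop, so it can be sunk into the j loop, where it then executes once per (i, j) iteration instead of once per i iteration, which the specification wording quoted later in this thread permits.

!$omp do collapse(2)
do i = 1, n
  call work1(i)        ! intervening code, result unused by the inner loop
  do j = 1, n
    call work2(i, j)
  end do
end do

! conceptually, after sinking the intervening code:
do i = 1, n
  do j = 1, n
    call work1(i)      ! now runs once per (i, j) iteration
    call work2(i, j)
  end do
end do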

It’s a bit different if %inner uses some of the results of the intervening code. In that case the code must conform to a specific structure that corresponds to how OpenMP defines a (rectangular or non-rectangular) loop nest. In LLVM, for instance, it implies that it can be represented as a SCEV expression. The instructions can still be sunk into the inner loop but will eventually be optimized away if their results are not used in the inner loop.
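A minimal Fortran sketch of that case (illustrative only; work is a placeholder, and whether a particular form is accepted depends on the canonical loop nest rules): the intervening statement computes a value that the inner loop’s bounds depend on, which makes the nest non-rectangular.

!$omp do collapse(2)
do i = 1, n
  m = i + 1           ! intervening code whose result feeds the inner loop
  do j = m, n         ! non-rectangular: the inner bounds depend on i
    call work(i, j)
  end do
end do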


I’m not very experienced with the OpenMP or OpenACC specs. @Meinersbur could you please help me find the place in the spec that implies this? My understanding of the following paragraphs was a bit different.


You can refer to the collapse clause in the loop construct or the worksharing-loop construct:
https://www.openmp.org/spec-html/5.0/openmpsu44.html#x67-1880002.9.5
https://www.openmp.org/spec-html/5.0/openmpsu41.html#x64-1290002.9.2


Help is welcome to fix issues in the omp reduction support. If there is interest, you could work on adding support for reductions in parallel regions (currently only wsloop support is there) when lowering from MLIR to LLVM IR.

@alexbatashev It is mentioned in “4.4.1 OpenMP Loop-Iteration Spaces and Vectors” directly before the collapse clause.

In OpenMP 5.0:

If more than one loop is associated with the worksharing-loop construct then the number of times that any intervening code between any two associated loops will be executed is unspecified but will be at least once per iteration of the loop enclosing the intervening code and at most once per iteration of the innermost loop associated with the construct. If the iteration count of any loop that is associated with the worksharing-loop construct is zero and that loop does not enclose the intervening code, the behavior is unspecified.

I don’t know how OpenACC handles it though.
