Background
MLIR modelling of basic loops
The fir dialect model both variants of do loops: increment and concurrent using a single MLIR operation: fir.do_loop. For example, a simple increment loop such as the following:
do i=1,10
end do
is emitted as:
%5:2 = fir.do_loop %arg0 = %2 to %3 step %c1 iter_args(%arg1 = %4) -> (index, i32) {
fir.store %arg1 to %1#1 : !fir.ref<i32>
%6 = arith.addi %arg0, %c1 overflow<nsw> : index
%7 = fir.convert %c1 : (index) -> i32
%8 = fir.load %1#1 : !fir.ref<i32>
%9 = arith.addi %8, %7 overflow<nsw> : i32
fir.result %6, %9 : index, i32
}
do concurrent loops look a bit different. For example, given the following single-range loop:
do concurrent (i=1:10)
end do
The MLIR op looks like:
fir.do_loop %arg0 = %4 to %5 step %c1 unordered {
%6 = fir.convert %arg0 : (index) -> i32
fir.store %6 to %1#1 : !fir.ref<i32>
}
The most important difference here is the unordered attribute attached to the concurrent loop. This models the fact that the loop iterations may be executed in any order. There are other differences between increment and concurrent loops which are detailed below.
MLIR modelling of multi-range loops
One major difference between increment and concurrent loops is that increment loops can only have a single ineration range. Concurrent loops, on the other hand, can have multiple iteration ranges. For example the following is invalid Fortran code:
! Invalid syntax
do i=1,10, j=1,10
end do
While the following is valid:
do concurrent (i=1:10, j=1:10)
end do
The fir.do_loop is currently restricted to single-range loops. To overcome this limitation, flang emits multi-range do concurrent loops as a nest of fir.do_loop ops:
1. fir.do_loop %arg0 = %8 to %9 step %c1 unordered {
2. %12 = fir.convert %arg0 : (index) -> i32
3. fir.store %12 to %3#1 : !fir.ref<i32>
4. fir.do_loop %arg1 = %10 to %11 step %c1_2 unordered {
5. %13 = fir.convert %arg1 : (index) -> i32
6. fir.store %13 to %1#1 : !fir.ref<i32>
7. }
8. }
Locality specifiers and data environment control
Another difference is that do concurrent loops need to somehow model locality of data used inside the loop so that the iterations of the loop can be executed out-of-order (or concurrently) without introducing any data races. The fir.do_loop op is now able to model these locality specifiers in case the fir.do_loop is marked with the unordered attribute.
Problem statement
There are few consequences for modeling both increment and concurrent loops using the same MLIR op. The most obvious one is that multi-range loops are not directly detectable. To detect a multi-range do concurrent loop, a pass would need to walk the fir.do_loop nest structure (see above example) and make sure that the only ops between 2 nested fir.do_loop ops are the ops needed to update the iteration variable of the inner loop (lines 2 & 3 in the above snippet). This is not ideal for a number of reasons:
- We cannot know for sure whether the loop nest originated from a
do concurrentloop or was emitted by flang as part of some other expression or op conversion. - The nest detection logic introduces complexity that can be alleviated if the multi-range is directly detectable on the MLIR level.
Another consequence of using the same MLIR op is that the fir.do_loop op is overloaded with the semantics of both increment and concurrent loops. While this can enable code sharing (e.g. we use the same functions to emit the MLIR ops), it also makes things more difficult in later stages of the compiler pipeline because we have to analyze the op to understand what it represents.
We recently came across these issues while working on a pass to map do concurrent loops to OpenMP constructs so that they can be automatically parallelized. See: [flang][OpenMP] Upstream do concurrent loop-nest detection and other PRs in the stack for more info.
Proposal
Introducing a separate MLIR op for do concurrent loops would help making the IR more representative of the original source at least during the earlier stages of the compiler pipeline. It would also help limiting the scope of the current fir.do_loop op to only model increment loops.
With the new op, the following loop:
do concurrent (i=1:10, j=1:10)
end do
would be emitted in MLIR as (using same SSA value numbering as the above snippet for comparison):
fir.do_concurrent (%arg0, %arg1) = (%8, %10) to (%9 to %11) step (%c1, %c1_2) {
%12 = fir.convert %arg0 : (index) -> i32
fir.store %12 to %3#1 : !fir.ref<i32>
%13 = fir.convert %arg1 : (index) -> i32
fir.store %13 to %1#1 : !fir.ref<i32>
}
The above op comes with changes to the current fir.do_loop op as well. In particular, we would:
- Remove the
unorderedattribute since there will no more need for it. - Move the locality specifier from
fir.do_looptofir.do_concurrent.
Considered alternatives
Loop nest detection
This is what is currently being implemented in the do concurrent to OpenMP conversion pass. However, as detailed above this is not ideal and therefore should be replaced by an explicit do concurrent MLIR op.
Adding a loop_nest(n) attribute to fir.do_loop
This would help us directly detect the depth of the nest and therefore avoid having to implement a loop nest detection algorithm. However, this is also not ideal since there is no guaranteee this attribute will be correctly maintained and carried forward by all passes of the pipeline.
Creating a shared dialect to model parallel/concurrent loops
The above proposal is very similar to what the scf.parallel op looks like. Possibly there are other similar ops downstream to model these kinds of loops. This is similar to the current effort to have a shared ptr dielact. This also falls inline with the road ahead proposed by a recent dev meeting talk: 2024 LLVM Dev Mtg - Making upstream MLIR more friendly to programming languages. I think this should be still on our radar for future improvement.
Final remarks
The RFC proposes a new op for do concurrent in the FIR dialect in the short and medium terms. In the longer term, I think it would be beneficial to have an MLIR control-flow dialect that unifies ops such as fir.do_concurrent and scf.concurrent in one generic op.