Problem
Consider an example:
program test
  integer, parameter :: N = 5
  real :: x(N)
  !$omp parallel workshare num_threads(5)
  x = get_thread_num()
  !$omp end parallel workshare
  print *, x
contains
  impure elemental integer function get_thread_num()
    use omp_lib
    get_thread_num = omp_get_thread_num()
    ! print *, get_thread_num
  end function
end program test
Output:
$ flang-new x.f90 -fopenmp && ./a.out
0. 0. 0. 0. 0.
$ gfortran x.f90 -fopenmp && ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000
As you can see, with flang-new the assignment happens entirely on a single thread, whereas gfortran distributes it across the five threads.
Also, it seems a temporary workaround for workshare has already been implemented here: [Flang][OpenMP] : Add a temporary lowering for workshare directive by kiranchandramohan · Pull Request #78268 · llvm/llvm-project · GitHub
FIR
[...]
omp.parallel {
  omp.single {
    [...]
    omp.terminator
  }
  omp.terminator
}
[...]
The workshare construct allows only the following:
- Array and scalar assignments
- FORALL statements and constructs
- WHERE statements and constructs
- parallel, critical, and atomic constructs
Solution
- Scalar assignments: enclose the scalar assignments in a single construct.
!$omp parallel workshare
x = x + 1
y = y + 1
[...] ! Other statements
z = z + 1
!$omp end parallel workshare
omp.parallel {
  omp.single {
    [...] // x = x + 1
    [...] // y = y + 1
    omp.terminator
  }
  [...] // Other constructs
  omp.single {
    [...] // z = z + 1
    omp.terminator
  }
  omp.terminator
}
- Array assignments: enclose the array assignment within a wsloop construct.
!$omp parallel workshare
x = x + 1
!$omp end parallel workshare
omp.parallel {
  omp.wsloop ... {
    [...] // x = x + 1
  }
  omp.terminator
}
- Forall statements: no modification.
- Where statements: no modification.
- Critical and atomic constructs: no modification (these already serialize, so they execute on one thread at a time).
For items 3, 4, and 5, I still need to implement this and check that everything works as expected.
I’m new to writing RFCs; if any other details should be added, please let me know.
Thank you
Thanks @thirumalai for writing this RFC. As mentioned in the patch, the existing lowering to single is a temporary one until the real lowering comes in. Ivan, who has been working with @jdoerfert, was working on a lowering for the coexecute construct (OpenMP 6.0), which is similar to the workshare construct. We have to check whether Ivan has plans to upstream this.
Regarding the use of omp.wsloop, the previous issue was that we always had the loop control along with omp.wsloop. This was relaxed recently, but it still requires an omp.loop_nest that has the loop control. For array assignments in HLFIR, an hlfir.assign operation will be created, and this will further be lowered to a _FortranAAssign runtime call or a loop.
hlfir.assign %2#0 to %1#0 : !fir.box<!fir.array<?xi32>>, !fir.box<!fir.array<?xi32>>
So omp.wsloop will have to be modified to work with hlfir.assign, and the lowerings to the runtime call/loop will have to be modified suitably.
I have posted an RFC for my workdistribute proof-of-concept implementation here.
I think the idea of using high-level transformations in hlfir and then adding a custom lowering for those to OpenMP operations should work here as well. However, the scope of the allowed statements inside a workshare is larger, so further work may be needed for this than for workdistribute.
Other than assignments, array intrinsics should also be considered, as they also lower to runtime calls such as _FortranAAssign.
I believe one option here is to provide OpenMP-enabled implementations of these functions.
Most of the complexity in my workdistribute implementation stems from the inability to synchronize code in an omp distribute and the need to alter the host/target boundary, both of which are not a problem here, so it should be simpler in that way.
@thirumalai Would you be interested in going through @ivanradanov’s RFC and discussing its suitability for workshare?
Separately, ifx seems to be doing something similar to the temporary lowering in flang-new: Compiler Explorer
I see that OpenMP 5.2 has a few more restrictions that makes this particular example non-conforming.
The construct must not contain any user-defined function calls unless either the function is pure and elemental or the function call is contained inside a parallel construct that is nested inside the workshare construct.
I think workdistribute has similar restrictions, so we can have the same implementation for both.
Yes, I used the above example to check the thread usage. After implementing workshare, I will add the required semantic checks to catch this. :)
Overall, it looks great. Thank you very much.
I’m new to the llvm-project, so I’m leaving it to experts to provide suggestions.