Problem:
Context
Using Ivan’s PR the intrinsic calls seem to be enclosed within the single construct.
Consider an example:
!$omp parallel workshare
s = sum(x)
!$omp end parallel workshare
Gets converted to:
omp.parallel {
omp.single nowait {
// [...]
%13 = fir.call @_FortranASumInteger4(%9, %10, %c5_i32, %11, %12)
}
}
Details
The documentation says:
OpenMP 5.2: 11.4 Workshare Construct
[…]
Evaluation of transformational array intrinsic functions may be freely subdivided into any number of units of work.
The transformational array intrinsic functions are MATMUL, DOT_PRODUCT, SUM, PRODUCT, MAXVAL, MINVAL, COUNT, ANY, ALL, SPREAD, PACK, UNPACK, RESHAPE, TRANSPOSE, EOSHIFT, CSHIFT, MINLOC, and MAXLOC.
So, the intrinsic operations have to be parallelized to get better performance.
Solution:
One solution is to add the intrinsic operation into the body replacing the intrinsic call using the WorkshareLowering pass. So, I decided to pick one and experiment by enclosing the operations into a wsloop node which helps us to share the operations among different threads.
Here: PR, I picked up SUM intrinsic and added a patch to generate the wsloop.
Consider the above example and the MLIR generated is:
omp.parallel {
// [...]
omp.wsloop {
omp.loop_nest {{.*}} {
// [...]
%[[SUM:.*]] = arith.addf {{.*}}
// [...]
}
}
}
Also, see: patch for the generated MLIR.
The same idea has to be performed to implement for all the other intrinsic mentioned in the documentation.
Community Replies in the PR
Idea suggestion:
From Ivan R. Ivanov,
I think there are two main approaches possible here, and they can coexist:
- Have alternative intrinsic implementations in their own runtime library
- Code gen the intrinsics at the IR level
See in details here, link
Reviewers comments:
From Tom Eccles,
I don’t have a strong opinion about whether this should be done in flang codegen or in a runtime library. Some drive-by thoughts:
[…]
Overall I don’t know what the right approach is. Maybe we should do both as Ivan said. I think if we do decide to implement several intrinsics this way, it should go in its own pass because the code could quickly become very long.
Comment link
From Michael Klemm,
Regarding the initial question: I’m ambivalent on the direction (runtime entry point vs. code-gen). I will say though that if we go with the runtime entry point, we will tie the Fortran runtime to the OpenMP runtime, which may not be desired in all cases.
Comment link
Reference compiler’s intrinsic handling analysis
Intel compiler (ifx)
Based on the analysis of the assembly file generated by the ifx, it seems all the operations are enclosed within the single construct. Also, it doesn’t contain any call to the runtime functions, all the operations seem to be performed in place.
GNU (gfortran)
Overall analysis
- If the return type is a scalar, then the whole operation is enclosed within the single construct
- If dim is specified, a runtime call is used, along with the resultant value being assigned using a for construct. Some runtime calls are inside a single construct, but others don’t use single constructs.
Keys:
Abbreviations | Description |
---|---|
Runtime call | Calls runtime function |
Assigns result (for loop) | Uses !$omp for to assign the resultant output |
Single construct | All the operations are enclosed inside the single construct |
No Single construct | No single construct is used, but the operations are performed directly under the parallel construct |
For construct | All the operations are enclosed inside the for construct |
Analysis
Intrinsics | Case 1 | Case 2 |
---|---|---|
MATMUL | For 1D output, Single construct, Runtime call |
For 2D output, No Single construct, Runtime call, Assigns result using for loop |
DOT_PRODUCT | For 1D, No Single construct, Assigns result using for loop |
- |
SUM & PRODUCT | For Scalar output, Single construct |
For Array output, with dim argument, For construct |
MAXVAL, MINVAL, ANY, ALL | For Scalar output, Single construct |
For Array output, Single construct, Runtime call |
COUNT | For Scalar output, Single construct |
For Array output, No Single construct, Runtime call Assigns result using for loop |
SPREAD | For all cases, Single construct, Runtime call |
- |
PACK | For 1D array, No Single construct, Runtime call, Assigns result using for loop |
For 2D array, Single construct, Runtime call |
UNPACK | For all cases, No Single construct, Runtime call, Assigns result using for loop |
- |
RESHAPE | For all cases, Single construct, Runtime call |
- |
TRANSPOSE | For all cases, Single construct |
- |
EOSHIFT & CSHIFT | For all cases, No Single construct, Runtime call, Assigns result using for loop |
- |
MINLOC & MAXLOC | For Scalar output, Single construct |
For Array output, No Single construct, Runtime call, Assigns result using for loop |
Any more ideas or suggestions would be really helpful for further implementation.
Thank you