This is a follow-up post to the SNAP Performance thread (SNAP Performance analysis, more detailed than the presentation).
It is specifically about the “inline SUM intrinsic”, although the purpose of this post is to agree on a strategy for inlining intrinsics that can be extended, and a pattern for this sort of work.
There is a link below to a Phabricator review of some code. I’m repeating the essential part of the commit message here:
This is not a finished implementation; it is mostly “to encourage discussion” - not intended to be reviewed in detail, but as a “this is how this could be done”.
I’d like to suggest that there are three candidates for solving this, each with pros and cons. Alternative solutions would be appreciated.
1. Inline during lowering to MLIR in IntrinsicCall.cpp - what I have implemented.
- Good: Easy to implement, everything needed is there, and it’s just a matter of generating the relevant code.
- Bad: Inlining happens early in lowering, before other potential optimisations have been done. The FIR isn’t “remaining high level for as long as possible”.
- Good: Can allow OpenMP Workshare [1] and similar to see the loop in the inlined code, which is otherwise hidden in a runtime call (because the loop isn’t in the FIR, it is in the runtime implementation).
- Good: Low overhead, we’re already in the right place, no need to iterate over the code to find places that need changing.
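To make the transformation concrete, here is a minimal Python sketch of the before/after semantics of inlining SUM. The runtime entry-point name is illustrative, not the real Flang runtime symbol:

```python
_FortranASum = sum  # stand-in for the Fortran runtime, so the sketch runs

def sum_via_runtime(a):
    # Today: lowering emits a call into the Fortran runtime.
    # The loop lives inside the runtime library, so FIR passes
    # (and OpenMP workshare) never see it.
    return _FortranASum(a)

def sum_inlined(a):
    # After inline expansion: the accumulation loop is visible
    # in the generated code and can be optimised/parallelised.
    acc = 0
    for x in a:
        acc += x
    return acc
```

Both forms compute the same value; the point is only where the loop is visible.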
2. A FIR pass that finds calls to SUM and inlines them when suitable.
- Good: Follows the principle of “FIR remaining as high level as possible for as long as possible”.
- Good: Least intrusive approach.
- Good: Easy to turn on/off (just add or don’t add the pass)
- Good: The code would be almost identical to the inline-during-MLIR-generation approach, so it can relatively easily be moved.
- Bad: Slightly more work to identify the call and figure out its arguments and suitability.
- Bad: Possible that OpenMP Workshare [1] and similar still don’t see it as a loop - this depends on where in the pass ordering this happens.
- Bad: Potential overhead to scan the FIR code to find places that need updating.
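As a toy model of what such a pass would do (the op representation and names here are entirely hypothetical; a real pass would pattern-match fir.call operations via MLIR’s rewrite framework in C++), the scan-identify-replace structure looks like:

```python
def inline_sum_pass(ops):
    """Walk the IR once; rewrite suitable SUM calls into loop ops."""
    out = []
    for op in ops:
        # "Suitability" here is a placeholder check; a real pass would
        # inspect the callee name, argument types and array rank.
        if (op.get("kind") == "call"
                and op.get("callee") == "SUM"
                and op.get("rank") == 1):
            # Replace the opaque call with an explicit loop op.
            out.append({"kind": "loop_sum", "operand": op["operand"]})
        else:
            out.append(op)
    return out
```

The scan overhead mentioned above amounts to one linear walk of this shape over the module, which should be cheap for typical compilation units.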
3. Linking with LLVM-IR (bitcode) in Flang. Runtime functions, or a selected subset of them, are compiled to LLVM-IR and added to the Flang compiler in some fashion, to be linked into the LLVM-IR before the object file is produced.
- Good: Relatively little change to the overall code lowering.
- Good: Works for many intrinsics, potentially with little effort.
- Good: Low overhead (assuming the LLVM IR-linking functionality isn’t very heavy).
- Bad: Complicates the overall build process. Some steps have to be taken to inject the LLVM-IR bitcode into the executable (I think so at least - that’s what another project I worked on was doing).
- Bad: Possible that OpenMP Workshare [1] and similar don’t see the loop.
- Bad: It is likely that LLVM optimisation won’t manage to “understand” the LLVM-IR well enough to reduce it to the basic loop, the way this proposal does. My feeling is that there are too many layers for LLVM to understand the consequences of the calls [2]. So we may still need some analysis to “select the right form of the function” or similar.
My weakly held belief is that a FIR pass is a reasonable solution. It is a little more work than what I’ve done so far, but most of what I have done should be reusable, with only some extra work needed to identify the calls and their suitability.
What I have done shows marginally better results than the simple Fortran SUM1D version, and a dramatic improvement over what fir-dev produces as of right now.
| |fir-dev|Inline SUM|Relative|Fortran SUM1D|Relative|
|---|---|---|---|---|---|
| Parallel Setup|0.007|0.007|1.00|0.007|1.00|
| Input|0.000|0.000|1.04|0.000|1.06|
| Setup|3.138|3.134|1.00|3.137|1.00|
| Solve|7.762|2.958|2.62|3.037|2.56|
| Parameter Setup|0.021|0.020|1.03|0.020|1.05|
| Outer Source|0.075|0.074|1.01|0.075|1.00|
| Inner Iterations|7.663|2.860|2.68|2.940|2.61|
| Inner Source|0.033|0.033|1.01|0.033|0.99|
| Transport Sweeps|7.618|2.817|2.70|2.896|2.63|
| Inner Misc Ops|0.012|0.010|1.13|0.011|1.11|
| Solution Misc Ops|0.003|0.003|1.12|0.003|1.11|
| Output|0.284|0.282|1.00|0.282|1.00|
| Total Execution time|11.196|6.386|1.75|6.468|1.73|
(Measurements on my desktop x86-64 machine, single thread but with OpenMP enabled)
Relative columns: the reference (1.0) is fir-dev; higher is better, so 1.75 means 1.75 times faster. Compilation was done using `flang-new -fc1 -emit-llvm`, and `clang -O1` to compile the resulting .ll file.
Phabricator review here: D123788 [flang][RFC][WIP] Inline sum - not ready to use
So, in summary, what I’m looking for:
- An agreement on a solution?
- Do we need to perform some other experiments?
[1] Consider this:

```fortran
!$omp workshare
x = SUM(big_array)
!$omp end workshare
```
[2] The SUM intrinsic implementation is here:
https://github.com/flang-compiler/f18-llvm-project/blob/fir-dev/flang/runtime/sum.cpp#L127
which ends up here:
https://github.com/flang-compiler/f18-llvm-project/blob/fir-dev/flang/runtime/reduction-templates.h#L77
then goes here:
https://github.com/flang-compiler/f18-llvm-project/blob/fir-dev/flang/runtime/reduction-templates.h#L42