RFC: How to inline Fortran intrinsics

This is a follow-up post to the SNAP Performance thread (SNAP Performance analysis, more detailed than the presentation).

This is specifically about “inline SUM intrinsic”, although the purpose of this post is to agree on a strategy for inlining intrinsics that can be extended, and a pattern for this sort of work.

There is a link below to a Phabricator review of some code. I’m repeating the essential part that is in the commit message:

This is not a finished implementation; it is mostly meant “to encourage discussion”. It is not intended to be reviewed in detail, but rather as a “this is how this could be done”.

I’d like to suggest that there are three candidates for solving this, each with some good and bad points. Alternative solutions would be appreciated.

  • Inline during lowering to MLIR in IntrinsicCall.cpp - what I have implemented.

    • Good: Easy to implement, everything needed is there, and it’s just a matter of generating the relevant code.
    • Bad: Inlining during lowering happens early, before other potential optimisations have been done. The FIR isn’t “remaining high level for as long as possible”.
    • Good: Can allow OpenMP Workshare [1] and similar to see the loop in the inlined code, which is otherwise hidden in a runtime call (because the loop isn’t in the FIR, it is in the runtime implementation).
    • Good: Low overhead, we’re already in the right place, no need to iterate over the code to find places that need changing.
  • FIR pass that finds calls to SUM and inlines it when suitable.

    • Good: Follows the “FIR remaining as high level as possible for as long as possible” principle.
    • Good: Least intrusive approach.
    • Good: Easy to turn on/off (just add or don’t add the pass)
    • Good: The code would be almost identical to inlining during MLIR generation, so it can relatively easily be moved.
    • Bad: Slightly more work to identify the call and figure out its arguments and suitability.
    • Bad: Possible that OpenMP Workshare [1] and similar still don’t see it as a loop - this depends on where in the order of things this happens.
    • Bad: Potential overhead to scan the FIR code to find places that need updating.
  • Linking with LLVM-IR (bitcode) in Flang. Runtime functions, or a selected subset of them, are compiled into LLVM-IR and added to the Flang compiler in some fashion, to be linked into the LLVM-IR before the object file is produced.

    • Good: Relatively little change to the overall code lowering.
    • Good: Works for many intrinsics, potentially with little effort.
    • Good: Low overhead (assuming the LLVM IR linking functionality isn’t very heavy).
    • Bad: Complicates the overall build process. Some steps have to be taken to inject the LLVM-IR bitcode into the executable (I think so, at least - that’s what another project I worked on was doing).
    • Bad: Possible that OpenMP Workshare [1] and similar don’t see the loop.
    • Bad: It is likely that LLVM optimisation won’t manage to “understand” the LLVM-IR well enough to reduce it into the basic loop, the way this proposal does. My feeling is that there are too many layers for LLVM to understand the consequences of the calls [2]. So we may still need some analysis and a “select the right form of the function” step or similar.
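
For concreteness: in the case this started from - SUM over a contiguous rank-1 array with no DIM and no MASK - all three options aim to end up with nothing more than a simple accumulation loop, roughly like this (a Fortran sketch of the intent, not the actual FIR):

s = 0.0
do i = 1, size(a)
   s = s + a(i)   ! plain accumulation; no descriptor, no subscript computation
end do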

My weakly held belief is that a FIR pass is a reasonable solution. It is a little more work than what I’ve done so far (but most of what I have done should be reusable, with only some extra work needed to identify the calls).

What I have done shows marginally better results than the simple SUM1D version, and a dramatic improvement over what fir-dev currently produces.

| |Fir-dev|Inline SUM|Relative|Fortran SUM1D|Relative|
|---|---|---|---|---|---|
| Parallel Setup|0.007|0.007|1.00|0.007|1.00|
| Input|0.000|0.000|1.04|0.000|1.06|
| Setup|3.138|3.134|1.00|3.137|1.00|
| Solve|7.762|2.958|2.62|3.037|2.56|
| Parameter Setup|0.021|0.020|1.03|0.020|1.05|
| Outer Source|0.075|0.074|1.01|0.075|1.00|
| Inner Iterations|7.663|2.860|2.68|2.940|2.61|
| Inner Source|0.033|0.033|1.01|0.033|0.99|
| Transport Sweeps|7.618|2.817|2.70|2.896|2.63|
| Inner Misc Ops|0.012|0.010|1.13|0.011|1.11|
| Solution Misc Ops|0.003|0.003|1.12|0.003|1.11|
| Output|0.284|0.282|1.00|0.282|1.00|
| Total Execution time|11.196|6.386|1.75|6.468|1.73|

(Measurements on my desktop x86-64 machine, single thread but with OpenMP enabled)

Relative columns: the reference (1.0) is fir-dev; higher is better, so 1.75 means 1.75 times faster. Compilation was done using flang-new -fc1 -emit-llvm and clang -O1 to compile the resulting .ll file.

Phabricator review here: D123788 [flang][RFC][WIP] Inline sum - not ready to use

So, as a summary, the questions are:

  • Can we agree on a solution?
  • Do we need to perform some other experiments?

[1] Consider this:

!$omp workshare
   x = SUM(big_array)
!$omp end workshare
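
If the summation loop is visible, the workshare construct can divide the reduction among threads; with a runtime call it only sees an opaque function call. Roughly speaking, the intent is to let it behave as if the source had been written like this (an illustrative sketch, not generated code):

x = 0                       ! done once
!$omp do reduction(+:x)
do i = 1, size(big_array)
   x = x + big_array(i)
end do
!$omp end do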

[2]
The SUM intrinsic implementation is here:
https://github.com/flang-compiler/f18-llvm-project/blob/fir-dev/flang/runtime/sum.cpp#L127
which ends up here:
https://github.com/flang-compiler/f18-llvm-project/blob/fir-dev/flang/runtime/reduction-templates.h#L77
then goes here:
https://github.com/flang-compiler/f18-llvm-project/blob/fir-dev/flang/runtime/reduction-templates.h#L42

Is there an MLIR inliner?

If so, perhaps you can get the same benefits of inline generation but with the simplification that you can write the routines in Fortran (or MLIR :slight_smile:). In lowering, you just pick the correct specialization and the inliner does the rest.

There is inliner functionality in MLIR, described here:

But it only does “inline this function, given that it’s a call to a function for which there is MLIR code”.

My main concern here is that it’s a fair bit of extra infrastructure needed to achieve this. I’m not aware of a library form of IR, but perhaps all the runtime code [that it is reasonable to assume needs to be inlined] could be made into a single large MLIR “blob” - but that doesn’t seem like a great idea to me.

At present, another complication is that the runtime code is in C++ - nothing wrong with that, except that generating MLIR from C++ isn’t supported in current Clang tools… :slight_smile: I’m not sure if there’s any good way to solve that part - is there a “LLVM-IR to MLIR” tool? Or is the intention to move existing runtime functions (where relevant) into Fortran?

Just so we’re clear, the way I view the suggestion of “Use MLIR inliner” would be:

  1. There is some sort of MLIR library of various functions (that we think may benefit from inlining - probably not the I/O functions that support WRITE, READ and PRINT, for example). Some of these functions are specialized versions of the more generic functions, such as SUM() without dimension and mask.
  2. When we lower calls to the functions in #1, the lowering code will pick the right variant (if available), and then just make a call to that function.
  3. At some point, the functions used from the MLIR library get copied into the MLIR produced for the translation unit (is that a C language term, or do we use something else in Fortran?). Functions are marked in such a way that duplicates in other translation units get collapsed into a single function, should the code not get inlined [similar to how inline functions from headers in C or C++ work].
  4. We run some MLIR optimisation passes, one of which uses the MLIR inliner functionality to determine if it can/should inline the code.
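
As an illustration of what one entry in such a library could look like (the name and the choice of specialization here are made up), a SUM specialization for a rank-1 REAL array with no DIM and no MASK might be:

function inline_sum_real4_1d(a) result(s)
  real(4), intent(in) :: a(:)
  real(4) :: s
  integer :: i
  s = 0.0
  do i = 1, size(a)
    s = s + a(i)
  end do
end function inline_sum_real4_1d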

Apologies if I’ve misunderstood the concept, and something else was intended.

is there a “LLVM-IR to MLIR” tool?

Yes there is one: mlir-translate -import-llvm filename.ll

Or is the intention to move existing runtime functions (where relevant) into Fortran?

I guess this can only be done when we have enough Fortran support in flang itself to compile part of the runtime that would be written in Fortran.

Thanks for the detailed analysis. FWIW I like the second option (FIR pass) most. My main reason for this choice is that it allows me to generate a non-optimized, non-inlined binary. The MLIR inliner approach sounds like it does this plus takes responsibility for deciding when to inline. +1 to this approach.

Flang has experimented with using the MLIR inliner (there is a pass in fir-dev).

The policy needs a lot more work (there isn’t one), but the mechanics of inlining are already effectively in place.

That’s the basic gist.

There can be a Fortran module that specifies an interface to some collection of library (we can call them intrinsic) functions.

These would be lowered into FIR as calls. Some as yet unwritten pass would scan the FIR and include small FIR functions (performance critical), say from a list provided in a .td file, on demand.
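
Such an interface module might look roughly like this (a sketch; the module and function names are invented):

module flang_inline_intrinsics
  interface
    pure function flang_sum_real4_1d(a) result(s)
      real(4), intent(in) :: a(:)
      real(4) :: s
    end function flang_sum_real4_1d
    ! ... further specializations for other types, kinds and ranks ...
  end interface
end module flang_inline_intrinsics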

At that point, the high-level optimizations can just kick in, doing the inlining, specialization, etc. to optimize the module.

MLIR provides an inliner, but flang will want to provide heuristics to guide that inlining process with its domain-specific knowledge (FIR), etc.


So, as I mentioned on the call today, I’m working on a solution that works in the FIR/MLIR layer.

I’m not yet sure if it can be done as a single pass, or whether we need a “simplify calls” pass first that replaces the runtime call with a call to an inlineable function.

This will still be a rough prototype for further discussion. I’m sure there will be some further discussion on “where do we get the source for the inlineable code from” and “how do we decide what should/shouldn’t be inlined”, etc.

Can LTO help you to inline across modules (Fortran and C++)?

LTO can probably be used, but that requires modification of the original build options, and I’m not sure we want to ALWAYS rely on this for the type of optimisation that this started out with.

And I’d also be concerned that the analysis in LTO isn’t capable of actually distinguishing SUM(arr) from SUM(arr, dim, mask), when both basically have the same internal call structure - it’s just that the former has “not present” content in the dim and mask fields. Compare the case where all you want is to simply add up 12 doubles (or some other small number of a basic numeric type) with the case of calculating a complicated scenario where the output is an array (dim present) and only some of the input values are used (mask is a mixture of true and false). I admit I haven’t tried; I probably should do that. Certainly in the current form, the biggest improvement comes from simply not calling through two levels of function calls and not doing a bunch of extra work in the generic form - the calls are part of it, the complexity of the loop is another.

And there are similar cases of “the full form of this runtime function is complicated, but many times it’s used only for a simpler form”.
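
To make the contrast concrete (example invented for this post):

real :: a(4,3), totals(3), total
logical :: m(4,3)
total  = SUM(a)                 ! simple form: just add up all the elements
totals = SUM(a, dim=1, mask=m)  ! full form: array-valued result, and only
                                ! elements where m is .true. contribute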

I am afraid that the mangled names of the two sum functions are different. So LTO will not be able to inline.

I don’t think that in itself is an issue; there is already some code to make the right names for the runtime functions from C++ to Fortran.

But currently there’s only one entry point for a given type (real or integer, and the size of the type), regardless of whether the Fortran code calls SUM with one, two or three arguments.

At least that’s what I believe, but I haven’t actually tested LTO - will do.

If the runtime implementations of SUM et al. checked for argument presence and executed simplified implementations for the important special cases (no DIM, no MASK), inline expansion followed by constant argument propagation would do the trick. Or most of it, anyway. If the runtime API were to be extended with special-case variants for contiguous data that did not require a descriptor, inline expansion of those implementations could have very good performance; but in those cases, lowering might as well just emit a loop anyway after doing all of the work necessary to “know” that a specialized runtime API could be called.

Don’t assume that the runtime can’t be made more specialized. Many arguments to template functions can be “moved” into template arguments for cases that matter. What’s there now is meant to be able to run codes correctly for all of the large sets of operand types that we have to support; it’s not the last word on performance by any means.

Thanks for your comment.

I’m fully aware the runtime is currently more for completeness than performance.

The idea here is mainly to figure out a general approach for, I think, two distinct matters:

  1. Simplified functions where possible.

  2. Inline functions, to reduce overhead from the call/argument passing.

I agree that once the whole “have we got the right arguments for a simple call” check has been done, generating the loop is very little extra work. It may be more complex for some other functions. Of course, it would be even less work to simply let something (not quite sure exactly how) use the existing function code and inline it by itself.

Let’s see what I can come up with. I’m still experimenting, and expect to learn more along the way.

Inlining of the existing DoTotalReduction() template function as it stands today won’t suffice, even if it could be done at the MLIR level, because it iterates over the argument array elementally by computing and applying the subscripts for each element. This handles all ranks and does not care about contiguity. But it would be straightforward to specialize DoTotalReduction() further by rank (always a known constant at compilation time), detect contiguity (cheap when rank is known), and branch to a fast-path loop that sweeps over contiguous data without subscripting.
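
In Fortran terms, the fast path described above would amount to something like this (a sketch; sum_generic is a hypothetical stand-in for the existing generic path):

if (is_contiguous(a)) then
  s = 0.0
  do i = 1, size(a)
    s = s + a(i)        ! linear sweep over contiguous data, no subscripting
  end do
else
  s = sum_generic(a)    ! hypothetical fall-back to the generic implementation
end if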

Again, thanks for confirming that. That is indeed how I have been looking at it all along - we have a generic piece of code that supports everything, and for a lot of cases, that’s good enough. For some simple cases, a more specialized piece of code can be used instead, with good speed improvement, possibly placing it directly inline or using the MLIR inliner to get it inlined.

My current plan is to introduce a new function in MLIR (when needed), and let the MLIR inliner place it inline, because that seems to be a reasonable solution as per above discussions.

At some further point, we may find a different way to store/introduce those functions, e.g. compiling from Fortran to MLIR, using TableGen files, using a library, or something like that.

Anyways. I’m going to write something that I think is what has been suggested above, and we can discuss further once I’ve got that working - hopefully today or tomorrow.

One possible flow to consider:

  1. There is a set of Fortran intrinsics that we would like to inline. We will maintain a Fortran file somewhere containing the implementations of all the intrinsics that we would want to inline when conditions favour it.
  2. While compiling the compiler, once the base compiler is built, we will convert this Fortran file containing the intrinsic functions to MLIR and store it in an MLIR module.
  3. A pass will be written that checks whether it is OK to perform the inlining of an intrinsic function. If it is, then the pass will take the corresponding function implementation from the stored MLIR module and inline it using the MLIR inliner.

Important points to check in this flow are,

  1. Is it possible to inline functions from a different module in MLIR?
  2. Is it OK to have intrinsic functions implemented in Fortran (Note the duplication because we will have the C++ runtime as well)? The Fortran version will only be for the simple cases.
  3. Do we need FIR operations for the intrinsics that we want to inline or are the runtime calls to the Flang runtime sufficient to detect in the inline pass?

This is obviously future work. Clang generates CIL/MLIR and you merge several MLIR modules into one.

This seems like an overly complicated implementation for the easy part of the problem, I’m afraid.

In the concrete case of SUM(A), the rank is known at compilation time; the contiguity may be known (with many false negatives), and the replacement is a simple loop that can be generated rather than inlined.

More generally, SUM(array expression), if implemented via inlining, would force allocation, population, and deallocation of an array temporary in order to be turned into a function call. Rewriting into a loop can do much better for argument array expressions that don’t otherwise need a temporary, e.g. SUM(A+B).

Thanks, Peter, for pointing out the issues.

Is the issue that the following allocation, population, and deallocation of tmpArray:

allocate tmpArray
tmpArray = A + B
result = call FortranASumInteger(tmpArray)
deallocate tmpArray

can be removed by the do-loop version like the following?

result = 0
do i = 1, size(A)
  result = result + A(i) + B(i)
end do

Also, just to check whether I understand what you are saying: are you recommending the following during lowering from parse-tree to MLIR? Assume a contiguous array of rank 1 is the only case where an inline implementation is better than the runtime call.

if (intrinsic == SUM) {
  if (rank == 1 and contiguous) {
    // generate fir.do_loop
  } else {
    // generate runtime call
  }
}