RFC: How to inline Fortran inrinsics

tschuett · April 28, 2022, 5:43pm

Is there a tool to merge two MLIR modules?

klausler · April 28, 2022, 5:53pm

I think the implementation that I have in mind is more like

if (intrinsic == SUM) {
  if (total reduction without mask) {
    if (argument is a variable, or an array expression not requiring a temp) {
      if (extents are known and small) { // SUM(A) with SIZE(A) < some limit
        // closed-form expression like A(1)+A(2)+A(3)
      } else if (rank==1 || known to be contiguous) { // this is OR, not AND
        // emit a single loop
      } else {
        // emit loop nest
      }
      return;
    }
    // call runtime: reduction is partial, or masked, or needs a temporary
  }

See flang/docs/ArrayComposition.md for more information about what I mean by temporary-free array expressions.

Masks could be accommodated as conditional MERGE(x,identity value 0,MASK(…)) in the loop nests but they’re rare in practice and unlikely to be worth the trouble.

A similar approach applies of course to other reduction transformational intrinsics as well as to special cases of MATMUL.

schweitz · May 4, 2022, 4:53pm

A high-performance compiler ought to optimize away such temporaries regardless of where they come from. Moreover, it’s a stronger investment to build a robust high-level optimizer than generating custom code templates for various subcases.

mehdi_amini · May 4, 2022, 4:58pm

The difficult in merging is about what to do on symbol collision.
If there are no collision, merging is trivial: module are containing a single block, which is a linked list of operations, so you just just splice the second list after the first one: it is quite cheap (actually it’ll traverse every operation in the list to update the parent pointer )

schweitz · May 4, 2022, 5:09pm

Flang already makes a strong attempt to reduce collisions by uniquing module level symbols internally. (There are ways to subvert that process so it still must be considered, but elimination/reduction of symbol collisions was a part of the compiler’s original design.)

banach-space · May 9, 2022, 12:39pm

Thanks for putting this together, @Leporacanthicus !

Let me quickly summarise the available options and the driver perspective

From the initial 3 options that you proposed, I feel that only the FIR pass is still on the table (based on the comments here). Two other alternatives have been proposed too:

Specialise runtime implementation (suggested by @klausler),
Use the MLIR inliner instead (suggested by @sscalpone ).

One aspect of all this that we should consider is the ability to turn the inlining “ON” and “OFF” (@shraiysh, is that what you meant in your comment?). If we choose Option 1. above, then this might be tricky to achieve (we might require some extra infrastructure to enable this). If, instead, we leverage MLIR (either in a form of a dedicated pass or through MLIR’s inliner as in Option 2. above) then that becomes almost a no-op for the driver (we’d leverage MLIR infra).

Having said that, it does seem that there are plenty of low hanging fruits waiting for us in the runtime implementation. I agree with Peter that it would be a shame to ignore them! The way that I look at all this is that an MLIR-based inliner (Option 2.) and optimisations in the runtime (Option 1.) should nicely complement each other.

@Leporacanthicus, based on your initial proposal, I assume that for now you want to focus on the MLIR infra? This leaves us with 2 alternative approaches:

an MLIR pass (i.e. your initial suggestion),
an MLIR inliner.

Are there any strong reasons to favour one over the other for now? MLIR experts, please help Perhaps this is just an implementation detail that could be re-fined at a later time?

-Andrzej

sscalpone · May 9, 2022, 5:01pm

Even with the inliner, one or more specialized versions of the runtime routines will be needed. Or a single version that has plenty of checks to help the optimizer pick the right path.

banach-space · May 9, 2022, 8:38pm

That’s very true.

I see two challenges here:

To bring up an infrastructure for inlining “interesting” function calls.
To define and to provide optimised versions of these “interesting” functions (to be used in 1.)

Once we have 1. in place, it will be easier to experiment with 2. Also, with 1. I feel that we agree that an MLIR pass or an inliner is the way to go. As for 2., it’s still a bit unclear.

-Andrzej

shraiysh · May 10, 2022, 3:27am

Yes, that is what I meant. Thanks for the detailed summary.

Leporacanthicus · May 11, 2022, 7:04pm

So, I have cobbled together something that I think is ROUGHLY what we’d like to do in the future:

https://reviews.llvm.org/D125407

This produces a simplified sum function for the case of DIM=0 and MASK=absent, and only for INTEGER*4 and REAL*8.

It’s quite a lot of code, compared to my original patch that did the work in the lowering-side. That’s because this function needs to reverse-engineer various arguemtns, and cast the box<none> arguments back to the corret type (array of integer or array of real).

We probably need to check the number of dimensions too, and only do this for 1 dim arrays [at least for now]. There may be other aspects that I’ve missed.

There’s definitely a bit of “I know how this argument comes to the function” that almost certainly isn’t generic.

However, I do believe this is a workable solution for at least a majority of simple cases of calls to SUM.

A similar mechanism can probably be used for a variety of other simple intrinsics.

I have not actually tried to use the MLIR inliner, even tho’ the file is called “InlineIntrinsics”. I trust that inlinging works. If I compile the generated MLIR with tco + clang -O2, the result is an inlined function call [the out-of-line code is still there, so probably need something to tell clang that “you don’t need to keep this function”].

Thoughts and comments welcome - thanks for all the earlier comments, it’s really appreciated.

Also thanks to Eric who helped me when I was a bit stuck on getting this code to do what it should!

banach-space · May 12, 2022, 6:03pm

An MLIR pass is a very suitable approach to begin with, IMHO. We can use SNAP to verify that it works as expected (on top of regular tests).

Once we start inlining more intrinsics, we will be better informed to decide whether this approach is scalable enough and whether our heuristics are sufficient.

I feel that we are ready to move this discussion to Phabricator and to focus on the fine implementation details there. WDYT?

-Andrzej

kiranchandramohan · May 12, 2022, 10:56pm

Wouldn’t a flag be sufficient with the first option for this requirement?

Thanks, I wasn’t sure whether there would be a temporary array here since there is no LHS.

Will this be an improved version of the Array Value Copy (AVC) Pass? Or AVC + Affine?

Two questions,
→ Why not inline directly into the call-site rather than generate a separate function and then inline?
→ Will it inline if there are multiple call-sites?

I feel that,
→ writing these inlineable versions in Fortran and generating FIR and subsequently using that for inlining is easier to write, review, and verify. [as opposed to handcoding FIR in the pass]
→ using an operation for the sum intrinsic during lowering and then converting that op to an inline version is better than lowering to a runtime call and then pattern matching the name in the pass and recovering the arguments from box<none>.

Having given my opinion, I understand that making progress is important. It seems you have support for this approach from Eric and Mehdi. So +1 to continue and make progress.

sscalpone · May 12, 2022, 11:22pm

As a proof of concept, this is a great testimony for specialization and inlining.

As a long term plan, I support the path that Kiran has described. As an additional benefit, the specialization would exist independent of inline, so they would be used and available if inline was disabled. (Useful for example when profiling or debugging. Also nice to have a single source instead of a library version and an direct-to-mlir version.)

As Kiran, Peter, Eric and others have pointed out, we can imagine a implementation where these operations exist in MLIR, are preserved through inlining, and then temporary optimization is done post-inlining and before the next lowering to calls or inline code.

Leporacanthicus · May 13, 2022, 11:46am

Thanks for the comments.

@sscalpone : yes, the idea here is to be able to produce a specialized version of a function without inlining. That was kind of the point of this version, in contrast to my very first suggestion, which just inlines the functionality immediately.

@kiranchandramohan writing the inlineable function in Fortran is indeed a good suggestion - and it was suggested earlier to have a form of library of MLIR-code that can be inlined [that is specialized forms of selected functions]. That is definitely a good suggestion for the future.

The whole point of this exercise was to find a way to a) retain the original MLIR as a call for as long as possible - retaining the highest possible level and b) find a way to introduce a specialized function in general. In the future, having functions written in either Fortran or FIR/MLIR as separate files would definitely be a good idea - but it does complicate the build-process, as you do need a working compiler (at least in part) to produce the MLIR to be included in the compiler… I’ve seen that in our OpenCL project, where the compiler is bult twice, once without the “built-in function library”, then compile the built-in functions with the first stage conmpiler, and once with those functions are compiled, build a second copy of the compiler. It’s not hard as such, it’s just a bit more work in the build process. [And if you get your dependencies right, you can probably avoid building the whole compiler twice, just the few bits that rely on the compiler building the library functions]

banach-space · May 13, 2022, 12:05pm

That’s going to be harder and more messy to implement. With a pass, you just decide whether to include it in in your pipelines for e.g. -O3 or -O0. The necessary infra is mostly there and this would be consistent with the overall design of LLVM. To achieve this in IntrinsicsCall.cpp, we would need to make sure that the optimisation level is forwarded exactly where it’s needed. It can be done, but to me that would be against the grain of LLVM.

These are good points - why not bring it up on Phab? (I’m just thinking it might be easier to discuss this with the actual code in front of us).

I agree, but do we have enough CodeGen/Lowering support in LLVM Flang to support this today?

-Andrzej

kiranchandramohan · May 13, 2022, 12:31pm

I think setting up an optimisation level like Ofast would pass a lot of flags to clang -cc1. So it might not be without precedent.

I can and will, but the first point is still in the realm of design. Also, please note that I said, the following.

For the following source,

function sum1d(a)
  integer:: a(:)
  integer :: sum1d
  sum1d = 0
  do i=1, size(a)
    sum1d = sum1d + a(i)
  end do
end function

I was able to generate FIR.

func.func @_QPsum1d(%arg0: !fir.box<!fir.array<?xi32>> {fir.bindc_name = "a"}) -> i32 {

  %0 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFsum1dEi"}
  %1 = fir.alloca i32 {bindc_name = "sum1d", uniq_name = "_QFsum1dEsum1d"}
  %c0_i32 = arith.constant 0 : i32
  fir.store %c0_i32 to %1 : !fir.ref<i32>
  %c1_i32 = arith.constant 1 : i32
  %2 = fir.convert %c1_i32 : (i32) -> index
  %c0 = arith.constant 0 : index
  %3:3 = fir.box_dims %arg0, %c0 : (!fir.box<!fir.array<?xi32>>, index) -> (index, index, index)
  %4 = fir.convert %3#1 : (index) -> i64
  %5 = fir.convert %4 : (i64) -> i32
  %6 = fir.convert %5 : (i32) -> index
  %c1 = arith.constant 1 : index
  %7 = fir.do_loop %arg1 = %2 to %6 step %c1 -> index {
    %10 = fir.convert %arg1 : (index) -> i32
    fir.store %10 to %0 : !fir.ref<i32>
    %11 = fir.load %1 : !fir.ref<i32>
    %12 = fir.load %0 : !fir.ref<i32>
    %13 = fir.convert %12 : (i32) -> i64
    %c1_i64 = arith.constant 1 : i64
    %14 = arith.subi %13, %c1_i64 : i64
    %15 = fir.coordinate_of %arg0, %14 : (!fir.box<!fir.array<?xi32>>, i64) -> !fir.ref<i32>
    %16 = fir.load %15 : !fir.ref<i32>
    %17 = arith.addi %11, %16 : i32
    fir.store %17 to %1 : !fir.ref<i32>
    %18 = arith.addi %arg1, %c1 : index
    fir.result %18 : index
  }
  %8 = fir.convert %7 : (index) -> i32
  fir.store %8 to %0 : !fir.ref<i32>
  %9 = fir.load %1 : !fir.ref<i32>
  return %9 : i32
}

We do this for modules already (see ieee*.mod, iso*.mod etc in build/include/flang). So portion of the flow is already there.

Leporacanthicus · May 13, 2022, 1:27pm

kiranchandramohan:

kiranchandramohan:

shraiysh:

Thanks for the detailed analysis. FWIW I like the second option most (FIR Pass). My main reason for this choice is that it allows me to generate a non-optimized non-inlined binary.
[/quote]

Wouldn’t a flag be sufficient with the first option for this requirement?

That’s going to be harder and more messy to implement. With a pass, you just decide whether to include it in in your pipelines for e.g. -O3 or -O0 . The necessary infra is mostly there and this would be consistent with the overall design of LLVM. To achieve this in IntrinsicsCall.cpp, we would need to make sure that the optimisation level is forwarded exactly where it’s needed. It can be done, but to me that would be against the grain of LLVM.

I think setting up an optimisation level like Ofast would pass a lot of flags to clang -cc1. So it might not be without precedent.

I think there, in the long term, will be flagns such as -f[no-]inline and -f[no-]simplify-library-funcitons [name of the second one is just something I made up right now - there is a flag, I think, in Clang for doing a similar thing, where it, for example, replaces printf("Hello, World\n"); with puts("Hello, World"); - I’m sure we can use the same flag-name as Clang, unless we REALLY need to distinguish the two types of operations for some reason.

As a first step, having -O0 as “don’t do any optimisation at all” and -O{1,2,3} to do SOME optimisations, one of which is inlining, I think is a good start.

banach-space:

kiranchandramohan:

Two questions,

→ Why not inline directly into the call-site rather than generate a separate function and then inline?
→ Will it inline if there are multiple call-sites?

Subject to code-size, inlining will inline all call-sites or not inline, depending on some heuristics in the compiler - there’s usually a way to configure “max number of calls” or “max number of bytes inlined by a function”, etc, but in my experience with LLVM, small functions almost always gets inlined regardless of number of calls. Large functions may or may not be, depending on “what the compiler feels like”.

kiranchandramohan:

These are good points - why not bring it up on Phab? (I’m just thinking it might be easier to discuss this with the actual code in front of us).

I can and will, but the first point is still in the realm of design. Also, please note that I said, the following.

For the following source,

function sum1d(a)
  integer:: a(:)
  integer :: sum1d
  sum1d = 0
  do i=1, size(a)
    sum1d = sum1d + a(i)
  end do
end function

I was able to generate FIR.

func.func @_QPsum1d(%arg0: !fir.box<!fir.array<?xi32>> {fir.bindc_name = "a"}) -> i32 {

  %0 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFsum1dEi"}
  %1 = fir.alloca i32 {bindc_name = "sum1d", uniq_name = "_QFsum1dEsum1d"}
  %c0_i32 = arith.constant 0 : i32
  fir.store %c0_i32 to %1 : !fir.ref<i32>
  %c1_i32 = arith.constant 1 : i32
  %2 = fir.convert %c1_i32 : (i32) -> index
  %c0 = arith.constant 0 : index
  %3:3 = fir.box_dims %arg0, %c0 : (!fir.box<!fir.array<?xi32>>, index) -> (index, index, index)
  %4 = fir.convert %3#1 : (index) -> i64
  %5 = fir.convert %4 : (i64) -> i32
  %6 = fir.convert %5 : (i32) -> index
  %c1 = arith.constant 1 : index
  %7 = fir.do_loop %arg1 = %2 to %6 step %c1 -> index {
    %10 = fir.convert %arg1 : (index) -> i32
    fir.store %10 to %0 : !fir.ref<i32>
    %11 = fir.load %1 : !fir.ref<i32>
    %12 = fir.load %0 : !fir.ref<i32>
    %13 = fir.convert %12 : (i32) -> i64
    %c1_i64 = arith.constant 1 : i64
    %14 = arith.subi %13, %c1_i64 : i64
    %15 = fir.coordinate_of %arg0, %14 : (!fir.box<!fir.array<?xi32>>, i64) -> !fir.ref<i32>
    %16 = fir.load %15 : !fir.ref<i32>
    %17 = arith.addi %11, %16 : i32
    fir.store %17 to %1 : !fir.ref<i32>
    %18 = arith.addi %arg1, %c1 : index
    fir.result %18 : index
  }
  %8 = fir.convert %7 : (index) -> i32
  fir.store %8 to %0 : !fir.ref<i32>
  %9 = fir.load %1 : !fir.ref<i32>
  return %9 : i32
}

We do this for modules already (see ieee*.mod, iso*.mod etc in build/include/flang). So portion of the flow is already there.
[/quote]

Yes. And if we have some form of file that is external to the compiler, we can certainly do that, it’s not particularly hard.

The example I gave, the generated LLVM-IR (in that case) is part of the binary of the compiler. In that case, you need to build the a boot-strapping compiler first, and then re-build the parts that depend on the compile code.

I’m not sure there is a file-format for FIR libraries, that can be loaded by the compiler - probably also requires a bit of metadata for knowing which functions are in the library and what functions they can replace - for specialisations, the particular restrictions such as, for SUM, argument 1 must have 1 dimension or argumeent 2 needs to be specific value, argument 3 must be not present. We can have that logic in the C++ code that “finds the right function and merges it into the generated code”, but it needs to exist somewhere.

If is a way to create a library of FIR functions, and somehow introduce them into the existing FIR, the rest isn’t particularly hard. But I don’t believe we have a way to achieve that at the moment.

klausler · May 13, 2022, 5:21pm

It still doesn’t make much sense to me to depend on a lot of new infrastructure to implement the easy part of an optimized SUM(). Inlining is important in its own right, of course, but a great SUM() doesn’t need it.

Leporacanthicus · June 23, 2022, 4:47pm

I have updated the review here: ⚙ D125407 Prepare for inlining of SUM intrinsic

New for this revision:

Fixed some, but not all, of the review comments. Thanks to those that commented.
Added shape / dimension to the restrictions, so only calls with 1D arrays are being replaced.
Fixed “off-by-one” on loop iterations (one too many).
Rebased on LLVM/main.

Sorry for the slow progress here. Been working on some other things…

Leporacanthicus · June 23, 2022, 4:49pm

If improving SUM was the only thing we wanted to do, then I agree. I strongly suspect that there are several other intrinsics that will benefit from a similar treatment, and then, with some refactoring, this will stand as a base for multiple intrinsic replacements.

Topic		Replies	Views
Status of Flang's Optimization Flang	11	1260	December 4, 2023
Performance comparison with Simplified and Inlined intrinsics Flang	1	257	August 4, 2022
[RFC] change lowering of Fortran math intrinsics Flang	12	673	March 1, 2023
Flang OpenMP Status & Developer Call Announcement Flang	0	92	July 21, 2020
Flang Technical Call : Summary of presentation on OpenMP for Flang Flang	12	100	September 27, 2019

RFC: How to inline Fortran inrinsics

Related Topics