[RFC] Inline hlfir.copy_in for trivial types

Context

When a non-contiguous array slice is passed as an argument to a function that takes an explicit-shape array, the Fortran standard requires that the argument be made contiguous. To achieve that, the original array slice is copied into a temporary. This operation is represented by hlfir.copy_in, which is lowered to a flang-rt call that handles both the allocation of the temporary array and the copy itself.

Problem

As currently implemented, the flang-rt entry point FortranACopyInAssign calls Fortran::runtime::Assign, which copies a non-contiguous array with a runtime loop that calls memmove separately on each element. While going through the runtime is advantageous for cases such as arrays of derived types, for trivial types it results in a lot of overhead compared to an ordinary copy loop that could take better advantage of hardware pipelines. In practice, this makes certain HPC applications, such as thornado-mini, perform much worse with llvm-flang than they do with Classic Flang.
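
For reference, the elemental path in Assign is essentially the following loop, with one memmove-style call per element and subscript bookkeeping for both descriptors (the same code appears in a patch quoted later in this thread); memmoveFct is the memmove-like callback passed to Assign:

// Elemental copy path in Assign: one memmove-style call per element, with
// per-element subscript bookkeeping on both descriptors.
for (std::size_t n{toElements}; n-- > 0;
     to.IncrementSubscripts(toAt), from.IncrementSubscripts(fromAt)) {
  memmoveFct(to.Element<char>(toAt), from.Element<const char>(fromAt),
      toElementBytes);
}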

Suggested solution

This problem can be solved by adding a new pattern to the InlineHLFIRAssign optimisation pass that, in certain circumstances, replaces hlfir.copy_in with a nested copy loop emitted directly at compile time instead of leaving the copy to the runtime. This is how Classic Flang handles this operation. The optimisation is only applied to trivial types. To start with, I suggest only applying it in cases where the array does not need to be copied out, e.g. when the dummy argument is declared intent(in).

In practice, the optimisation transforms a hlfir snippet such as the following (generated by upstream flang):

%16 = hlfir.designate %4#0 (%6, %c1:%7#1:%c1_1, %14)  shape %15 : (!fir.box<!fir.array<?x?x?xf64>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<?xf64>>
%c100_i32 = arith.constant 100 : i32
%17:2 = hlfir.copy_in %16 to %0 : (!fir.box<!fir.array<?xf64>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?xf64>>>>) -> (!fir.box<!fir.array<?xf64>>, i1)
%18 = fir.box_addr %17#0 : (!fir.box<!fir.array<?xf64>>) -> !fir.ref<!fir.array<?xf64>>
%19:3 = hlfir.associate %c100_i32 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
fir.call @_QFPsb(%18, %19#0) fastmath<contract> : (!fir.ref<!fir.array<?xf64>>, !fir.ref<i32>) -> ()
hlfir.copy_out %0, %17#1 : (!fir.ref<!fir.box<!fir.heap<!fir.array<?xf64>>>>, i1) -> ()

Into:

%12 = hlfir.designate %3#0 (%5, %c1:%6#1:%c1, %10)  shape %11 : (!fir.box<!fir.array<?x?x?xf64>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<?xf64>>
%13 = fir.is_contiguous_box %12 whole : (!fir.box<!fir.array<?xf64>>) -> i1
%14:2 = fir.if %13 -> (!fir.box<!fir.array<?xf64>>, i1) {
  fir.result %12, %false : !fir.box<!fir.array<?xf64>>, i1
} else {
  %17 = fir.allocmem !fir.array<?xf64>, %8 {bindc_name = ".tmp", uniq_name = ""}
  %18:2 = hlfir.declare %17(%11) {uniq_name = ".tmp"} : (!fir.heap<!fir.array<?xf64>>, !fir.shape<1>) -> (!fir.box<!fir.array<?xf64>>, !fir.heap<!fir.array<?xf64>>)
  fir.do_loop %arg3 = %c1 to %8 step %c1 unordered {
    %19 = hlfir.designate %12 (%arg3)  : (!fir.box<!fir.array<?xf64>>, index) -> !fir.ref<f64>
    %20 = fir.load %19 : !fir.ref<f64>
    %21 = hlfir.designate %18#0 (%arg3)  : (!fir.box<!fir.array<?xf64>>, index) -> !fir.ref<f64>
    hlfir.assign %20 to %21 : f64, !fir.ref<f64>
  }
  fir.result %18#0, %true : !fir.box<!fir.array<?xf64>>, i1
}
%15 = fir.box_addr %14#0 : (!fir.box<!fir.array<?xf64>>) -> !fir.ref<!fir.array<?xf64>>
%16:3 = hlfir.associate %c100_i32 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
fir.call @_QFPsb(%15, %16#0) fastmath<contract> : (!fir.ref<!fir.array<?xf64>>, !fir.ref<i32>) -> ()
fir.if %14#1 {
  %17 = fir.box_addr %14#0 : (!fir.box<!fir.array<?xf64>>) -> !fir.ref<!fir.array<?xf64>>
  %18 = fir.convert %17 : (!fir.ref<!fir.array<?xf64>>) -> !fir.heap<!fir.array<?xf64>>
  fir.freemem %18 : !fir.heap<!fir.array<?xf64>>
}

Results

With the new optimisation applied, the runtime for thornado-mini when compiled with llvm-flang decreases by roughly a third. No regressions in other applications or benchmarks have been found so far.

Future work

There is no particular reason why the optimisation could not be applied to cases where the array needs to be copied out. It would be possible to transform hlfir.copy_out in the same way. The scope was reduced to only handling the copy_in in order to simplify the discussion and the review process for the implementation.

A draft PR with the implementation can be found here:


Did you try optimizing the runtime instead?

I am thinking that the new ShallowCopy could be a better fit than Assign here: llvm-project/flang-rt/lib/runtime/tools.cpp at 32752913b12103431dc392242c3c808afb70bd15 · llvm/llvm-project · GitHub

The thing that bothers me with the inlined version is that it will generate loops that, by their very nature, cannot really be vectorized, since one of the inputs is not contiguous, and they bloat the IR/assembly and may cost a lot of extra optimization time for nothing.

I believe @szakharin previously inlined copy-in and it blew up compilation time on some apps for little gain in many cases (although I agree the runtime can likely be optimized a lot since Assign is quite expensive).

See: [flang] Use Assign() runtime for copy-in/copy-out. · llvm/llvm-project@2b60ed4 · GitHub

Did you try optimizing the runtime instead?

I started off by trying to replace the memcpy calls in the runtime with just a loop doing simple assignments, effectively *a = *b. It did not seem to move the needle on the performance regression at all. Admittedly I’m no expert here, so there might be some other way to do it in the runtime that I’ve not tried. When I ran the application in question through gdb, it was getting stuck inside Assign in a loop that looks just like the one in ShallowCopyDiscontiguousToContiguous, here: llvm-project/flang-rt/lib/runtime/assign.cpp at 32752913b12103431dc392242c3c808afb70bd15 · llvm/llvm-project · GitHub
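
For context, that loop looks roughly like this (a paraphrase based on the linked flang-rt sources, so details may differ slightly): the contiguous destination is walked with plain pointer arithmetic, the discontiguous source via per-element subscript increments, and every element is copied with its own memcpy call.

// Rough paraphrase of ShallowCopyDiscontiguousToContiguous (not verbatim):
// one memcpy of elementBytes per element, pointer arithmetic on the
// contiguous destination, subscript increments on the discontiguous source.
char *toAt{to.OffsetElement()};
SubscriptValue fromAt[maxRank];
from.GetLowerBounds(fromAt);
std::size_t elementBytes{to.ElementBytes()};
for (std::size_t n{to.Elements()}; n-- > 0;
     toAt += elementBytes, from.IncrementSubscripts(fromAt)) {
  std::memcpy(toAt, from.Element<char>(fromAt), elementBytes);
}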

Imo we definitely don’t want to inline every copy_in, for the reasons you mentioned, but in certain cases, like the one we’ve found, going through the runtime really tanks the performance of the application. Thus, we should probably find a balance and only inline a certain subset of the copy_ins.

Yes, there has to be a tradeoff, because the effect of these loops on compilation time may be significant, and it is an issue for CPU2017/521.wrf.

I think we first need to understand where the improvement comes from. I agree with Jean that ShallowCopy may provide better performance compared to Assign, so this is the initial change that we should try, and then see if the inlined loop is faster and why.

and it is an issue for CPU2017/521.wrf

Could you elaborate on what the issue you were seeing was? With the PR as it currently is on GitHub, I’m not seeing any substantial differences either in compile time or in actual performance. The runtime I got with the optimisation was 1s less; the compile-time comparison was:

baseline:
3485.27s user 461.54s system 366% cpu 17:57.64 total
hlfir-inline-copy-in:
3488.05s user 451.68s system 363% cpu 18:03.28 total

I think we first need to understand where the improvement comes from. I agree with Jean that ShallowCopy may provide better performance compared to Assign, so this is the initial change that we should try, and then see if the inlined loop is faster and why.

I swapped out the part of Assign that handles this copy for a plain call to ShallowCopy, with this patch:

diff --git a/flang-rt/lib/runtime/assign.cpp b/flang-rt/lib/runtime/assign.cpp
index 4a813cd48902..41608ffc79f3 100644
--- a/flang-rt/lib/runtime/assign.cpp
+++ b/flang-rt/lib/runtime/assign.cpp
@@ -470,10 +470,7 @@ RT_API_ATTRS void Assign(Descriptor &to, const Descriptor &from,
       }
     }
   } else { // intrinsic type, intrinsic assignment
-    if (isSimpleMemmove()) {
-      memmoveFct(to.raw().base_addr, from.raw().base_addr,
-          toElements * toElementBytes);
-    } else if (toElementBytes > fromElementBytes) { // blank padding
+    if (toElementBytes > fromElementBytes) { // blank padding
       switch (to.type().raw()) {
       case CFI_type_signed_char:
       case CFI_type_char:
@@ -492,12 +489,8 @@ RT_API_ATTRS void Assign(Descriptor &to, const Descriptor &from,
         terminator.Crash("unexpected type code %d in blank padded Assign()",
             to.type().raw());
       }
-    } else { // elemental copies, possibly with character truncation
-      for (std::size_t n{toElements}; n-- > 0;
-           to.IncrementSubscripts(toAt), from.IncrementSubscripts(fromAt)) {
-        memmoveFct(to.Element<char>(toAt), from.Element<const char>(fromAt),
-            toElementBytes);
-      }
+    } else {
+      ShallowCopy(to, from);
     }
   }
   if (deferDeallocation) {

It does not move the needle for the scenario I’m looking at. The program now spends a lot of time inside ShallowCopy doing that same loop. I think it’s pretty plain to see why: the logic for handling the loop indexing inside ShallowCopy takes a good few instructions to prepare the call to memcpy, which in turn takes a good few instructions on its own, plus the function call overhead. The inlined loop I’m generating, on the other hand, compiles down to just this:

cb28:       fd400120        ldr     d0, [x9]
cb2c:       f1000508        subs    x8, x8, #0x1
cb30:       8b170129        add     x9, x9, x23
cb34:       fc008540        str     d0, [x10], #8
cb38:       54ffff81        b.ne    cb28 <_QMwlinterpolationmodulePcomputetempfromintenergy_lookup+0x8bc>  // b.any

That being said, the implementation inside ShallowCopy is effectively just a slightly better version of one of the codepaths in Assign. I think it’s still a good idea to refactor Assign a little and have it call ShallowCopy for that scenario, as opposed to having the same code duplicated in both places. I’m happy to submit that as a separate PR, but it doesn’t address the problem this RFC is trying to solve.

AFAIR, the compilation time issues were related to CFG simplification and loop optimization passes in LLVM at higher optimization levels. Basically, the passes kept analyzing these contiguity-check branches and the compiler-generated loops without doing any meaningful transformations on them.

What options did you use to collect this data?

What options did you use to collect this data?

I’m using a7aca819d44b8d67f2cffd as the baseline for flang, building SPEC2017 with -g -Ofast -mcpu=native -fuse-ld=lld -flto. The numbers I pasted above came from time runcpu -c llvm-flang --rebuild -a build wrf_r.

I still think you may get better performance with the ShallowCopy. Here is why:

  • The subscript increments may not be very efficient, especially for multidimensional arrays (maybe not the case for your app). ShallowCopyDiscontiguousToContiguous uses straightforward pointer arithmetic for the destination array, so it should be faster than the generic subscript increments.
  • There is still some overhead in Assign even before we reach the copy loop, e.g. the redundant aliasing, derived-type, allocatable and conformance checks. The shallow-copy runtime does none of that, relying on the compiler to create the temporary properly.
  • Depending on the optimization level used to build flang-rt, and on the build compiler’s heuristics, the Assign runtime might be calling memmove through a function pointer, which may have some overhead compared to the direct call to memcpy in ShallowCopy. Moreover, memmove has more overhead than memcpy.

If using ShallowCopy does not help, I would like to understand why. We should investigate the following avenues:

  • Will it help to specialize the ShallowCopy by the element size and call the specialized versions from the compiler?
  • Are there compiler optimizations that actually happen on the inlined loops?
  • Does the size of the copied array matter?

Thanks! The compilation time looks a bit high, but this is probably due to LTO. I get around 600 seconds without LTO on Neoverse V2 with current Flang.

I will need to try your patch to double-check. I was using x86 when fixing D140446 [flang] Use Assign() runtime for copy-in/copy-out.

In general, I think it does not make sense to do the inline expansion if it is not optimizable, unless the overhead of the runtime call is significant for the particular copy-in/copy-out. On the other hand, it may be acceptable at higher optimization levels at the cost of compilation time.


Thanks for testing. One reason you may not be seeing a degradation here could be that the “problematic” copy-in we saw in WRF had a copy-out, and your patch does not inline such copy-ins.

I still think it would be worth trying to optimize the copy-in/out runtime more before inlining, and that maybe we need a dedicated ShallowCopy or CopyIn/CopyOut for trivial types with a loop.

Do you know how big are the array sections that are being copied-in in thornado-mini?

My suggestion was to call ShallowCopy directly from the CopyIn/Out runtime instead of Assign, here: llvm-project/flang-rt/lib/runtime/assign.cpp at 92d2e13b99ba1770e6307af7ed7ee877bfabde8c · llvm/llvm-project · GitHub

[edit: although I see now that you would miss the allocation of the temp. I am not sure what the cost of using the runtime vs inline code for the allocation is, but doing temp.Allocate(kNoAsyncId) before calling ShallowCopy directly should avoid going through a lot of branches in the Assign path that deal with Fortran assignment non-trivialities that do not matter for copy-in/out.]
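
Something along these lines, as a hedged sketch (the helper name and exact signatures are assumptions, not existing flang-rt code):

// Hedged sketch, not actual flang-rt code: a copy-in path for trivial types
// that allocates the temporary and then does a shallow element-wise copy,
// skipping Assign and its assignment-semantics checks entirely.
static RT_API_ATTRS void CopyInTrivial(
    Descriptor &temp, const Descriptor &var, Terminator &terminator) {
  if (temp.Allocate(kNoAsyncId) != CFI_SUCCESS) { // allocate the contiguous temp
    terminator.Crash("copy-in: could not allocate temporary");
  }
  ShallowCopy(temp, var); // element-wise copy; handles a discontiguous source
}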

My suggestion was to call ShallowCopy directly from the CopyIn/Out runtime

That won’t really make a difference when almost all the runtime overhead comes from the copying itself rather than anything around it. With the patch I pasted above the runtime distribution given by perf looks like this:

+   96.06%     4.01%  main             main                         [.] _QMwlcomputetempmodulePcomputetempfromintenergy_lookup
-   72.56%     0.41%  main             libmpi_usempif08.so.40.40.3  [.] _FortranACopyInAssign
   - 72.15% _FortranACopyInAssign
      - 68.74% Fortran::runtime::Assign(Fortran::runtime::Descriptor&, Fortran::runtime::Descriptor const&, Fortran::runtime::Terminator&, int, void* (*)(void*, void const*, unsigned long))
         + 61.80% Fortran::runtime::ShallowCopyDiscontiguousToContiguous(Fortran::runtime::Descriptor const&, Fortran::runtime::Descriptor const&)
         + 3.08% Fortran::runtime::Assign(Fortran::runtime::Descriptor&, Fortran::runtime::Descriptor const&, Fortran::runtime::Terminator&, int, void* (*)(void*, void const*, unsigned long))
         + 2.84% Fortran::runtime::Descriptor::Allocate(long)
      + 2.87% _FortranACopyInAssign

Almost all of the runtime of _FortranACopyInAssign is actually spent inside ShallowCopy. We could make small changes to the allocation logic and whatnot, and it’s still a useful idea, but you can see that as soon as I’ve outsourced the copying inside Assign to ShallowCopy, there is very little runtime left in Assign itself. That’s clearly not where the issue is.

could be that the “problematic” copy-in we saw in WRF had a copy-out

That is what I’m expecting is the case here, for sure.

Do you know how big are the array sections that are being copied-in in thornado-mini?

The specific case I’m testing here has an array of shape 185x81x30 and copies in a 1-D slice along the middle dimension, i.e. 81 elements, inside a hot loop.

maybe we need a dedicated ShallowCopy or CopyIn/CopyOut for trivial types with a loop

How would that work? We’d still need to do a bunch of runtime checks to determine what the type is and which specialised implementation to use, would we not? It seems to me that, whichever way we cut it, it’ll still come out slower than the 4-instruction loop I pasted above. Simply by virtue of being a runtime implementation, it cannot be as specialised as emitting exactly the right minimal instructions for the specific case at compile time.

I tried specialising it inside ShallowCopy as well, with the copy loop written like this:

    for (std::size_t n{to.Elements()}; n-- > 0;
        toAt += elementBytes, from.IncrementSubscripts(fromAt)) {
      *(double*)toAt = *from.Element<double>(fromAt);
    }

The runtime for my reproducer goes down from 4s to 3.6s, which is presumably the memcpy call overhead. The inlined runtime for that same reproducer is about 1.1s.
Looking at it in gdb, this copy loop, once compiled inside the runtime, becomes 28 instructions, including calls and conditional branches. The inlined one has 4 instructions, no calls and no branches.

FWIW, I’m looking into the performance of Assign, and my investigation shows that a lot of time is spent in the address calculation of array elements (Descriptor::Element). I’m thinking of creating a branch in Assign for assignment of arrays with the same shape so that the subscript recomputation can be avoided.

I’m thinking of creating a branch in Assign for assignment of arrays with the same shape

Sounds useful, certainly good to know :slight_smile: Though I think this is a different scenario from the one I’m trying to address here? Assign as used by copy_in is invoked when the arrays are of different shapes, or at least the underlying “real” arrays are: the issue described in this RFC arises when copying a non-contiguous array slice. So I don’t think our respective proposals overlap here, unless I’m missing something. Even more so since the optimisation I’m proposing is only supposed to apply to trivial types, not derived ones.

The compiler could call the right overload based on static type info at compile time, the goal just being to hide, from the compiler, a loop that makes little sense to spend time optimizing, and to reduce the code size.
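
As a purely hypothetical sketch of what such overloads could look like (none of these functions exist in flang-rt today; names and signatures are assumptions), the per-element memcpy of the generic shallow copy becomes a plain typed load/store:

// Hypothetical, element-type-specialized copy; not an existing flang-rt
// entry point. The compiler would pick the instantiation (e.g. double for
// f64) statically and call it through a thin extern "C" wrapper.
template <typename T>
static RT_API_ATTRS void ShallowCopyToContiguousTyped(
    const Descriptor &to, const Descriptor &from) {
  T *toAt{reinterpret_cast<T *>(to.OffsetElement())}; // contiguous destination
  SubscriptValue fromAt[maxRank];
  from.GetLowerBounds(fromAt);
  for (std::size_t n{to.Elements()}; n-- > 0;
       ++toAt, from.IncrementSubscripts(fromAt)) {
    *toAt = *from.Element<T>(fromAt); // typed store, no per-element memcpy
  }
}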

I agree the runtime implementation will likely always be somewhat slower than the inline one, although the inline version may have drawbacks such as register pressure, or the fact that further inlining is not performed because the code size of the function is deemed too big.

Copy-in/copy-out situations are very common in existing Fortran code (even if the copy is often not executed at runtime because the data is very often contiguous). So it is worth being sure inlining is the best path here. Hence what we want to measure is a good runtime implementation vs inline code, so that the cost of inlining (extra compile time, code size) can be weighed.

If inlining is the best decision, maybe there are ways to avoid increasing the compile time by telling LLVM that it is not worth vectorizing/specializing such loops.

If inlining is the best decision, maybe there are ways to avoid increasing the compile time by telling LLVM that it is not worth vectorizing/specializing such loops.

Given that the inlined loop only runs if the array is not contiguous, I don’t think we can ever vectorise it. How about we just attach !{!"llvm.loop.vectorize.enable", i1 false} to the loop metadata? Would that take care of the compile time concerns?