Context
When a non-contiguous array slice is passed as an actual argument to a procedure whose corresponding dummy argument is an explicit-shape array, the Fortran standard requires the dummy to be associated with contiguous data. To achieve that, the original array slice is copied into a contiguous temporary. This operation is represented by hlfir.copy_in, which is lowered to a flang-rt call that handles both the allocation of the temporary array and the copy itself.
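For illustration, a hypothetical call of the following shape triggers copy-in: the actual argument a(i, :, j) is a non-contiguous slice, while the dummy argument x of sb is an explicit-shape array (all names here are made up for illustration and are not the source of the IR shown later).

program copy_in_demo
  implicit none
  real(8), allocatable :: a(:, :, :)
  integer :: i, j
  allocate(a(10, 100, 10))
  a = 1.0d0
  i = 5
  j = 5
  ! a(i, :, j) is non-contiguous (consecutive elements are 10 reals apart),
  ! so a contiguous temporary must be created and passed to sb (copy-in).
  call sb(a(i, :, j), 100)
contains
  subroutine sb(x, n)
    integer, intent(in) :: n
    real(8), intent(in) :: x(n)   ! explicit-shape dummy argument
    print *, sum(x)
  end subroutine sb
end program copy_in_demo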
Problem
As currently implemented, the flang-rt entry point FortranACopyInAssign calls Fortran::runtime::Assign, which copies a non-contiguous array with a runtime loop that calls memmove separately for each element. While this is advantageous for cases such as arrays of derived types, for trivial types it results in a lot of overhead compared to an ordinary copy loop, which could take better advantage of hardware pipelines. In practice, this makes certain HPC applications, such as thornado-mini, perform much worse with llvm-flang than with Classic Flang.
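For a trivial element type, the copy-in is conceptually just a strided gather into a contiguous buffer. The sketch below (hypothetical names, written as Fortran source for clarity) shows the kind of plain copy loop meant here; the runtime currently performs the equivalent element copies through descriptor arithmetic and per-element memmove calls instead.

! Sketch of an ordinary copy loop for a trivial type: gather the
! non-contiguous slice a(i, 1:n, j) into a contiguous temporary.
subroutine gather_slice(a, tmp, i, j, n)
  integer, intent(in) :: i, j, n
  real(8), intent(in) :: a(:, :, :)
  real(8), intent(out) :: tmp(n)
  integer :: k
  do k = 1, n
    tmp(k) = a(i, k, j)
  end do
end subroutine gather_slice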
Suggested solution
This problem can be solved by adding a new pattern to the InlineHLFIRAssign optimisation pass that, in certain circumstances, replaces hlfir.copy_in with a nested copy loop emitted directly at compile time, as opposed to leaving it to the runtime. This is how Classic Flang handles the operation. The optimisation is only applied to trivial types. To start with, I suggest applying it only in cases where the array does not need to be copied out, e.g. when the dummy argument is declared intent(in).
In practice, the optimisation transforms an HLFIR snippet such as the following (generated by upstream flang):
%16 = hlfir.designate %4#0 (%6, %c1:%7#1:%c1_1, %14) shape %15 : (!fir.box<!fir.array<?x?x?xf64>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<?xf64>>
%c100_i32 = arith.constant 100 : i32
%17:2 = hlfir.copy_in %16 to %0 : (!fir.box<!fir.array<?xf64>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?xf64>>>>) -> (!fir.box<!fir.array<?xf64>>, i1)
%18 = fir.box_addr %17#0 : (!fir.box<!fir.array<?xf64>>) -> !fir.ref<!fir.array<?xf64>>
%19:3 = hlfir.associate %c100_i32 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
fir.call @_QFPsb(%18, %19#0) fastmath<contract> : (!fir.ref<!fir.array<?xf64>>, !fir.ref<i32>) -> ()
hlfir.copy_out %0, %17#1 : (!fir.ref<!fir.box<!fir.heap<!fir.array<?xf64>>>>, i1) -> ()
Into:
%12 = hlfir.designate %3#0 (%5, %c1:%6#1:%c1, %10) shape %11 : (!fir.box<!fir.array<?x?x?xf64>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<?xf64>>
%13 = fir.is_contiguous_box %12 whole : (!fir.box<!fir.array<?xf64>>) -> i1
%14:2 = fir.if %13 -> (!fir.box<!fir.array<?xf64>>, i1) {
  fir.result %12, %false : !fir.box<!fir.array<?xf64>>, i1
} else {
  %17 = fir.allocmem !fir.array<?xf64>, %8 {bindc_name = ".tmp", uniq_name = ""}
  %18:2 = hlfir.declare %17(%11) {uniq_name = ".tmp"} : (!fir.heap<!fir.array<?xf64>>, !fir.shape<1>) -> (!fir.box<!fir.array<?xf64>>, !fir.heap<!fir.array<?xf64>>)
  fir.do_loop %arg3 = %c1 to %8 step %c1 unordered {
    %19 = hlfir.designate %12 (%arg3) : (!fir.box<!fir.array<?xf64>>, index) -> !fir.ref<f64>
    %20 = fir.load %19 : !fir.ref<f64>
    %21 = hlfir.designate %18#0 (%arg3) : (!fir.box<!fir.array<?xf64>>, index) -> !fir.ref<f64>
    hlfir.assign %20 to %21 : f64, !fir.ref<f64>
  }
  fir.result %18#0, %true : !fir.box<!fir.array<?xf64>>, i1
}
%15 = fir.box_addr %14#0 : (!fir.box<!fir.array<?xf64>>) -> !fir.ref<!fir.array<?xf64>>
%16:3 = hlfir.associate %c100_i32 {adapt.valuebyref} : (i32) -> (!fir.ref<i32>, !fir.ref<i32>, i1)
fir.call @_QFPsb(%15, %16#0) fastmath<contract> : (!fir.ref<!fir.array<?xf64>>, !fir.ref<i32>) -> ()
fir.if %14#1 {
  %17 = fir.box_addr %14#0 : (!fir.box<!fir.array<?xf64>>) -> !fir.ref<!fir.array<?xf64>>
  %18 = fir.convert %17 : (!fir.ref<!fir.array<?xf64>>) -> !fir.heap<!fir.array<?xf64>>
  fir.freemem %18 : !fir.heap<!fir.array<?xf64>>
}
Results
With the new optimisation applied, the runtime of thornado-mini compiled with llvm-flang decreases by about a third. No regressions have been found so far in other applications or benchmarks.
Future work
There is no particular reason why the optimisation could not also be applied to cases where the array needs to be copied out; hlfir.copy_out could be transformed in the same way. The scope was limited to handling the copy_in in order to simplify the discussion and the review process for the implementation.
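For illustration (hypothetical names again), a call like the following would also need the copy-out half of the transformation: the callee modifies its explicit-shape intent(inout) dummy, so the temporary has to be copied back into the original non-contiguous slice after the call.

program copy_out_demo
  implicit none
  real(8), allocatable :: a(:, :, :)
  allocate(a(10, 100, 10))
  a = 1.0d0
  ! Copy-in before the call, copy-out back into a(5, :, 5) afterwards.
  call scale_slice(a(5, :, 5), 100)
  print *, a(5, 1, 5)   ! prints 2.0
contains
  subroutine scale_slice(x, n)
    integer, intent(in) :: n
    real(8), intent(inout) :: x(n)   ! explicit-shape dummy, modified in place
    x = 2.0d0 * x
  end subroutine scale_slice
end program copy_out_demo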
A draft PR with the implementation can be found here: