[RFC] Improving the performance of `Fortran::runtime::Assign` for derived types

Motivation

I’ve investigated performance issues with assignments and found nine of them. They are filed as “[Flang] Performance issues in assignments” (llvm/llvm-project#121125).
I’m looking into #121127, which concerns array assignments for derived types. I found that array assignments (Fortran::runtime::Assign) are slower than GFortran’s. Inlining array assignments as scalar assignments in loops could be one solution; however, Assign also seems to affect performance in some other cases, so I’m thinking of improving Assign first.

Solution

My solution is composed of two parts.

  1. Improving assignments for array elements
    • Assigns n-dim arrays in n-nested loops
  2. Improving assignments for components of derived types
    • Introduces “copy routines” like GFortran

For array elements

The address calculation of array elements (Descriptor::Element) is expensive: it reproduces the subscripts, calculates the offset, and returns the address. Runtime information (e.g., extents, strides) is accessed several times for each array element during those steps.
This is because the runtime does not know the exact shape of the arrays in advance. GFortran, by contrast, generates n nested loops for an n-dimensional array at compile time, so reproducing subscripts is unnecessary; the loop indices suffice for the address calculation. Therefore, I’m considering using n nested loops for n-dimensional arrays in Assign under the following conditions:

  • The shapes of the arrays are the same.
  • The rank of the arrays is less than 7.
    • This limit is somewhat arbitrary; I introduced it only to keep the runtime code simple.

In addition, some calls of Descriptor::Element can be replaced with calls to Descriptor::OffsetElement; Descriptor::ZeroBasedIndexedElement is an example.
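To illustrate the difference, here is a minimal C++ sketch. The `Desc`, `Element`, and `AssignRank2` names are simplified stand-ins for illustration, not flang’s actual runtime API: `Element` recomputes the byte offset from full subscripts on every call, while the specialized rank-2 loop nest reads the strides once and just bumps pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Minimal stand-in for a runtime array descriptor; the real
// Fortran::runtime::Descriptor is more involved.
struct Desc {
  char *base;
  int rank;
  std::size_t extent[2];     // extents per dimension (rank <= 2 here)
  std::ptrdiff_t stride[2];  // byte strides per dimension
  std::size_t elemBytes;

  // Element-style access: recompute the byte offset from the full
  // subscript vector on every call (the expensive path).
  char *Element(const std::size_t subscript[]) const {
    std::ptrdiff_t offset = 0;
    for (int d = 0; d < rank; ++d)
      offset += static_cast<std::ptrdiff_t>(subscript[d]) * stride[d];
    return base + offset;
  }
};

// Assignment for conformable rank-2 arrays, specialized into nested
// loops: runtime information is read once outside the loops, and the
// inner iteration only advances pointers by the stride.
void AssignRank2(Desc &to, const Desc &from) {
  for (std::size_t j = 0; j < to.extent[1]; ++j) {
    char *dst = to.base + static_cast<std::ptrdiff_t>(j) * to.stride[1];
    const char *src = from.base + static_cast<std::ptrdiff_t>(j) * from.stride[1];
    for (std::size_t i = 0; i < to.extent[0]; ++i) {
      std::memcpy(dst, src, to.elemBytes);
      dst += to.stride[0];
      src += from.stride[0];
    }
  }
}
```

The sketch copies trivially copyable elements with memcpy; for derived types with allocatable components or defined assignment, the per-element body would be more involved, but the loop structure is the same.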

For components

If the arrays are of derived types, the number of memory accesses for runtime information increases further. GFortran hard-codes this information into instruction operands in many cases; in the remaining cases, a “copy routine” bound to the vtable is used. We can see that GFortran generates a routine named like __d_MOD___copy_d_Mytype. (Note that copy routines are not defined in the Fortran standard.)
I tried introducing such a routine, and it improved performance in some cases. However, my idea is not fully developed at the moment, and I would appreciate your feedback.
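Conceptually, the runtime would store a per-type copy callback alongside the type information and invoke it per element, instead of interpreting component metadata generically on every element. The following C++ sketch is hypothetical (the `TypeInfo`, `CopyRoutine`, and `AssignElements` names are invented for illustration); in the real design the copy routine would be compiler-generated per derived type, hand-written here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-type "copy routine" signature, analogous to
// GFortran's __copy_* helpers bound to the vtable.
using CopyRoutine = void (*)(void *lhs, const void *rhs);

struct TypeInfo {
  std::size_t elemBytes;
  CopyRoutine copy;  // knows the component layout statically
};

// A user derived type; its copy routine would be generated by the
// compiler, but is hand-written here for illustration.
struct T1 { int x; };
static void CopyT1(void *lhs, const void *rhs) {
  *static_cast<T1 *>(lhs) = *static_cast<const T1 *>(rhs);
}

// Element loop inside Assign: instead of walking component metadata
// for every element, it invokes the type's copy callback directly.
void AssignElements(void *to, const void *from, std::size_t n,
                    const TypeInfo &ti) {
  auto *dst = static_cast<char *>(to);
  auto *src = static_cast<const char *>(from);
  for (std::size_t i = 0; i < n; ++i) {
    ti.copy(dst + i * ti.elemBytes, src + i * ti.elemBytes);
  }
}
```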

I’m thinking of implementing the routine as a special type-bound procedure as follows:

module m
  type t1
    integer x
  contains
    procedure, nopass :: copy => .cpy.t1
    generic :: copy_routine => copy
  end type
contains
  subroutine .cpy.t1(lhs,rhs) ! copy routine
    type(t1) :: rhs
    type(t1) :: lhs
    lhs = rhs
  end subroutine
end

The symbol of the routine is created in Semantics, and the routine is materialized in the initial lowering.
Although sequence types do not have type-bound procedures, their components are guaranteed to be contiguous in memory, so memmove can be used to assign each element.
If the derived type is declared in a module, it would be easy to generate the routine as a module procedure. Otherwise, the derived type would need an explicit interface for the type-bound procedure. I’m not familiar with derived types, but I believe such Fortran code violates the standard (see this Compiler Explorer example).
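For such contiguous element layouts, the assignment collapses to a single bytewise copy. A minimal sketch, assuming the elements are trivially copyable (the `SeqT` and `AssignContiguous` names are illustrative, not runtime API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for a SEQUENCE derived type whose components can be
// copied bytewise as a whole element.
struct SeqT { int a; double b; };

void AssignContiguous(SeqT *lhs, const SeqT *rhs, std::size_t n) {
  // memmove (rather than memcpy) stays correct even if the source
  // and destination overlap, e.g. for overlapping array sections.
  std::memmove(lhs, rhs, n * sizeof(SeqT));
}
```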

I’d like to ask whether copy routines could be allowed to violate the standard.
If not, is there any way to address this?


I’m creating a prototype. It is not complete, but it shows the concept. I used the following Fortran code for evaluation and found that my prototype made the program 7.6x faster. (Flang is still slower than GFortran, but I believe inlining the array assignment can close the gap.)

module m
  type :: t1
    integer :: x
  end type
end module

subroutine test(lhs,rhs)
  use m
  type(t1) :: rhs(:)
  type(t1) :: lhs(:)
  lhs = rhs
end subroutine
use m
implicit none
integer, parameter :: count = 4000
integer :: i, j
type(t1) :: l(100), r(100)
real :: start_t, end_t
interface
  subroutine test(lhs,rhs)
    use m
    type(t1) :: rhs(:)
    type(t1) :: lhs(:)
  end subroutine
end interface

call cpu_time(start_t)
do i=1,count
  do j=1,count
    call test(l,r)
  end do
end do
call cpu_time(end_t)
print *, end_t - start_t
end
| Compiler | Options | Execution time [s] |
| --- | --- | --- |
| GFortran 14.2 | -Ofast | 0.132725999 |
| GFortran 14.2 | -Ofast -fno-version-loops-for-strides* | 0.629696012 |
| GFortran 14.2 | -O0 | 1.33945203 |
| Flang 20.1.0-rc2 | -Ofast | 12.816779 |
| Flang Prototype (based on 20.1.0-rc2) | -Ofast | 1.6828934 |

* This option prevents GFortran from using memcpy for contiguous array assignments.

Thanks for this investigation, write-up, and proposal!

First of all, I fully agree that flang should inline copies of POD-like (plain old data) derived types, at least at O1 and above.

I agree there are uses of Assign that would remain (at least when it is already called from inside the runtime, or when dealing with a POD component of a non-trivial derived type). However, I feel we should benchmark with inlining of POD assignments before pursuing a callback approach.

The callback approach is quite interesting (nice to see you could prototype it in a rather simple way!), but opens a few doors:

  • What happens when Assign is used on the device? I imagine the callbacks would then have to be compiled for the device, and the derived type descriptor used on the device should point to them. While this is likely not a blocker, it is extra complexity we have to keep in mind.
  • The linkage should be carefully selected. Currently, flang does not require linking against the module file object of a module defining a POD type; all the required derived type data is generated as linkonce in the compilation units that use it. To preserve this, the functions would need to be generated with linkonce_odr linkage. I do not foresee trouble, but I do not think we have done that yet in flang.

I’d like to ask whether copy routines could be allowed to violate the standard.
If not, is there any way to address this?

Regarding this point about type scopes and accessibility, I do not think it is a blocker: the helpers generated in lowering do not have to be procedures that could be printed back as Fortran.

Flang Prototype (based on 20.1.0-rc2) -Ofast 1.6828934

Your RFC mentions two changes, but the benchmark only measured their combined impact. What part of the speed-up comes from the changes made without introducing the callback mechanism?

If inlining of simple assignments plus your other changes gets flang on par on applications and benchmarks, I would stay away from callbacks for now for the sake of simplicity. If not, we could look at the cases where this is not enough, to see whether callbacks are the best fit and how they would work in OpenMP/OpenACC/CUDA Fortran contexts.

Thank you for your feedback!

I feel like we should benchmark with inlining of POD assignments before pursuing a callback approach.

Your RFC mentions two changes, but the benchmark only measured the combined impact. What part of the speed-up is related to the changes done without introducing the callback mechanism?

I would like to share additional measurement results. (I fixed a bug in the prototype for the measurement and updated the patch, but this change should not affect previous results.)

| Compiler | Options | Execution time [s] |
| --- | --- | --- |
| Flang 20.1.0-rc2 | -Ofast | 12.816779 |
| Flang Prototype | -Ofast | 1.6828934 |
| Flang Prototype without callbacks | -Ofast | 9.065505 |
| Flang 20.1.0-rc2 + manual inlining | -Ofast | 0.5668884 |

The results illustrate that inlining is more effective than callbacks.
I am a little concerned about the downsides of inlining. Generally, whether a call gets inlined is decided by heuristics; if I understand correctly, this is because inlining can significantly increase code size or register pressure. I wonder whether we can disregard those downsides in this case.

The callback approach is quite interesting (nice to see you could prototype it in a rather simple way!), but opens a few doors:

Thanks for pointing these out. I was missing the offloading perspective; I’ll look into it.