Optimization of complex number division

s-watanabe314 · December 4, 2024, 7:29am

Motivation and problem

Measurements of the execution speed of benchmark 627.cam4_s in SPEC CPU 2017 revealed that a build using flang is slower than a build using gfortran.

Result (1 Core)

Machine	Compiler	Vector Length	Options	Execution Time [s]	Ratio (gcc/llvm)
graviton3	llvm 18.1.8	variable-length	-g -Ofast -mcpu=neoverse-v1	2155.23	0.60
	gcc 14.1.0	variable-length	-g -Ofast -mcpu=neoverse-v1	1295.25	-

The perf profiling indicated that the following functions were high-cost in the flang build:

Symbol	rate
__divdc3	17.63%
Fortran::runtime::Assign	12.35%
__GI___memcpy_simd	4.99%
_QMoptics_libPmieint	3.06%
pow@@GLIBC_2.29	4.50%
_QMmath_libPavint	2.59%

The __divdc3 function exhibited the highest cost. I believe one reason for this is that complex number division in flang is always converted into a call to a runtime function like __divdc3. This appears to inhibit optimizations such as vectorization and reciprocal approximation in Fortran programs involving complex number division. Therefore, I aim to inline __divdc3 to improve speed and promote other optimizations.

Clang and GNU behavior

In clang, specifying either the -ffast-math or -fcx-limited-range option passes -complex-range=basic to the frontend. The complex-range value affects how complex number division is calculated. Similarly, in gcc and gfortran, specifying -fcx-limited-range appears to select the calculations. From my investigation, at least for float and double, the complex number division is converted to calculation expressions. However, for quadruple precision, it results in a call to __divtc3.

Types of complex number division

Consider calculating z = x / y, where x = a + bi, y = c + di.

Algebraic division

Clang and GNU compilers utilize this method when -ffast-math or -fcx-limited-range is specified. This method requires only two divisions, but it may overflow and underflow during intermediate calculations.

den = c^2 + d^2
num_r = a * c + b * d
num_i = a * d - b * c
real = num_r / den
imag = num_i / den

Smith’s algorithm (Algorithm 116 in this paper)

This is the default method used by gfortran. While it requires three divisions, it’s less prone to overflow. when y = 0 + 0i, the result is NaN.

If |c| >= |d|

r = d / c
den = c + r * d
real = (a + b * r) / den
imag = (b - a * r) / den

else

r = c / d
den = d + r * c
real = (a * r + b) / den
imag = (b * r - a) / den

Suggestion

My understanding is that current flang’s implementation uses a runtime function call for complex number division to minimize the risk of overflow during intermediate calculations. However, under optimization flags like -ffast-math, which assume the absence of overflow, the algebraic division could be used. Therefore, I proposed the following implementation:

Similar to clang, implement -fcx-limited-range option in the flang driver, and -complex-range option in the flang frontend.
If -ffast-math or -fcx-limited-range is specified, pass -complex-range=basic to the frontend.
For the initial implementation, when -complex-range=basic is specified, generate algebraic division for float or double types. For other types retain a runtime function call.

The current plan is to modify the lowering stage to check the complex-range value and generate either a runtime function call or fir.divc accordingly. This implementation should result in faster complex number division when -ffast-math, -fcx-limited-range, or -complex-range=basic is specified. However, optimization for other complex number operations (like abs and mul) based on -complex-range=basic may need to be implemented.

Future plans

Implement additional optimization options for speeding up complex number division. This may include implementing reciprocal approximation for division (-mrecip).
Implement handling for other complex-range options (clang appears to have improved, promoted, etc). Since clang uses a runtime call, algebraic division, or Smith’s algorithm selectively for complex number division, flang also need to choose between these calculation methods. I have not considered the phase of this transformation (lowering, transform, or codeGen) yet.
In flang, complex number division is by default converted to Smith’s algorithm. It seems gfortran defaults to using Smith’s algorithm. While this algorithm may result in NaN, it’s less prone to overflow.

DavidTruby · December 4, 2024, 1:55pm

Sounds good to me in concept; I wonder if, rather than introducing lowering ourselves for fir.divc, we can just use the MLIR complex dialect, now that we use that for complex number types anyway? There’s an existing complex.div operation that might do the job and already has lowering to inline LLVM IR.
As I recall, I tried to do this before some time ago but didn’t have the time to finish it so we ended up only having the function call. I don’t think there was anything particularly blocking doing this when -ffast-math is on.

jeanPerier · December 4, 2024, 2:29pm

Sounds good to me, thanks for working on this.

Ideally, we would always lower to MLIR op + some fp flag attribute in lowering, and then gets operation codegen to be whatever is required/allowed for this flag.

This makes lowering/testing easier.

I also agree with David’s comment, since we now use the MLIR complex type, it would be good to see if complex.div can do what we need (that is, be precise and respect NaNs/be lowered to runtime, unless given some flag attribute (fastmath?)).

s-watanabe314 · December 6, 2024, 8:27am

Thank you for your replies. As you said, I will use MLIR’s complex.div instead of fir.divc.

I checked the conversion of complex.div, and it seems to be lowered to Smith’s algorithm in ComplexToStandard and to algebraic division in ComplexToLLVM. The conversion in ComplexToStandard seems to handle nan, inf, and zero division appropriately.

Therefore, in flang, how about an implementation that always lowers to complex.div instead of lowering to runtime, and selects either a properly handled Smith’s algorithm or algebraic division? In this case, since there doesn’t seem to be an option regarding the precision of intermediate calculations with complex numbers in the MLIR dialect, I’m considering adding a new attribute (e.g., complexRangeAttr ) to the arith dialect to check during the conversion of complex.div to other expressions.

DavidTruby · December 7, 2024, 3:38pm

If the algorithm in ComplexToStandard will have the same precision and correctness as in compiler-rt/libgcc then that sounds like a good approach to me. As I recall the reason we didn’t use the mlir complex lowering before is that there were issues with denormals, are those no longer present?

The issue we had with going through mlir complex.divc and trying to do library calls (ComplexToLibm) is that the complex ABI wasn’t handled correctly, I’m not sure if that’s been fixed since either.

s-watanabe314 · December 12, 2024, 9:28am

In libgcc, complex number division for half-precision and float seems to avoid underflow and overflow by casting to a larger precision type before calculation. However, complex.div doesn’t have the same implementation, leading to a likely precision difference in the current state.

With -ffast-math, I don’t anticipate any significant problems. so I’d like to prioritize implementing the -ffast-math case first.
Regarding ABI issues, I am considering resolving them by selecting between a library call or complex.div during the lowering stage. Currently, complex number division is entirely converted to a libraly call in IntrinsicCall.cpp, but I intend to lower to complex.div only if options like -ffast-math are specified.

DavidTruby · December 12, 2024, 4:41pm

That sounds good to me, thanks!
I don’t mind just choosing which of those lowerings to do at the flang level for now, but ideally we’d fix the ABI in complex.div so that it works for anyone else wanting to use it. I am not sure how difficult that is though, so I’d be happy for you to just put a TODO in for now reminding us that that’d be the ideal.

s-watanabe314 · December 13, 2024, 8:43am

Thank you for your reply. I’ll put a TODO in.

Regarding the current implementation, does anyone know why the complex number division algorithms differ between complexToLLVM and complexToStandard? complexToStandard uses Smith’s algorithm, while complexToLLVM uses the algebraic division. Should I modify both to use the same conversion?

thtsikas · January 20, 2025, 4:25pm

The program below is something I wrote up when reading the paper [1210.4539] A Robust Complex Division in Scilab. The “improved” and “robust” methods are implemented in Fortran, below. You may find them helpful.

Program robust
  Call s(Huge(0d0),2d0)
  Call s2(3.d0,0.d0)
Contains
  Subroutine s(a,b)
    Double Precision a,b
    Complex(Kind(0d0)) c,d
    c = a - (0,1)*a
    d = b + (0,1)*b
    Print 1,c,"numerator"
    Print 1,d,"denominator"
    Print 1,c/d, "/ operator"
    Print 1,compdiv_improved(c,d), "improved"
    Print 1,compdiv_robust(c,d), "robust"
    Print *

1   Format(2(1X,ES20.10),A30)
  End Subroutine
  Subroutine s2(a,b)
    Double Precision a,b
    Complex(Kind(0d0)) c,d
    c = (1.d0,0.d0)
    d = a + (0,1)*b
    Print 1,c,"numerator"
    Print 1,d,"denominator"
    Print 1,1/d, "/ operator"
    Print 1,compdiv_improved(c,d), "improved"
    Print 1,compdiv_robust(c,d), "robust"
    Print *
1   Format(2(1X,ES20.10),A30)
  End Subroutine S2

  Function compdiv_improved (x,y) Result(z)
    Complex(Kind(0d0)):: x,y,z
    Real(Kind(0d0)):: a,b,c,d,e,f
    a = x%re; b = x%im
    c = y%re; d = y%im
    If (Abs(d) <= Abs(c)) Then
       call improved_internal(a,b,c,d,e,f)
    Else
       call improved_internal(b,a,d,c,e,f)
    Endif
    z = Cmplx(e,f,Kind(0d0))
  End Function compdiv_improved
  Subroutine improved_internal(a,b,c,d,e,f)
    Real(Kind(0d0)), Intent(In):: a,b,c,d
    Real(Kind(0d0)), Intent(Out):: e,f
    Real(Kind(0d0)) :: r, t
    r = d/c
    t = 1.d0/(c+d*r)
    If (r /= 0.d0) Then
       e = (a+b*r)*t
       f = (b-a*r)*t
    Else
       e = (a+d*(b/c))*t
       f = (b-d*(a/c))*t
    Endif
  end Subroutine improved_internal


  Function compdiv_robust (x,y) Result(z)
    Complex(Kind(0d0)):: x,y,z
    Real(Kind(0d0)):: a,b,c,d,AB,CD,Bb,S,OV,UN,Be,eps
    a = x%re; b = x%im
    c = y%re; d = y%im
    AB=Max(Abs(a),Abs(b))
    CD=Max(Abs(c),Abs(d))
    Bb = 2
    S = 1
    OV = Huge(1.d0)
    UN = Tiny(1.d0)
    eps = Epsilon(1.d0)
    Be = Bb/eps**2
    If (AB >= OV/2) Then ! Scale down a,b
       x=x/2; S=S*2
    EndIf
    If (CD >= OV/2) Then !Scale down c,d
       y=y/2; S=S/2
    EndIf
    If (AB <= UN*Bb/eps) Then !Scale up a,b
       x=x*Be ; S=S/Be
    EndIf
    If (CD <= UN*Bb/eps) Then ! Scale up c,d
       y=y*be; S=S*Be
    EndIf
    z = robust_internal(x,y)
    z = z*S
  End Function compdiv_robust

  Function robust_internal(x,y) Result(z)
    Complex(Kind(0d0)) :: x,y,z
    Real(Kind(0d0)) :: a,b,c,d,e,f
    a=x%re; b=x%im
    c=y%re; d=y%im
    If (Abs(d)<=Abs(c)) Then
       call robust_subinternal(a,b,c,d,e,f)
    Else
       call robust_subinternal(b,a,d,c,e,f)
    EndIf
    z = Cmplx(e,f,Kind(0.d0))
  End Function robust_internal
  Subroutine robust_subinternal(a,b,c,d,e,f)
    Real(Kind(0d0)), Intent(In):: a,b,c,d
    Real(Kind(0d0)), Intent(Out):: e,f
    Real(Kind(0d0)) :: r, t
    r=d/c
    t=1.d0/(c+d*r)
    e=internal_compreal(a,b,c,d,r,t)
    f=internal_compreal(b,-a,c,d,r,t)
  End Subroutine robust_subinternal
  Function internal_compreal(a,b,c,d,r,t) Result(e)
    Real(Kind(0d0)):: a,b,c,d,r,t,e,br
    if (r/=0.d0) Then
       br=b*r
       if (br/=0.d0) Then
          e = (a+br)*t
       Else
          e = a*t+(b*t)*r
       EndIf
    Else
       e=(a+d*(b/c))*t
    EndIf
  End Function internal_compreal

End Program

s-watanabe314 · February 17, 2025, 2:55am

I posted a patch to MLIR that allows selecting the algorithm for complex number division via a pass option. The patch is here, and the discussion on the MLIR is here.

Here’s a summary of the MLIR patch:

Both the ComplexToLLVM and ComplexToStandard passes have been modified to implement two transformation algorithms, smith with NaN handling and algebraic, allowing for their selection.
Added a pass option complex-range=XXXXX to both passes.
- Possible values are improved, basic, none.
- By default, ComplexToStandard converts to smith , while ComplexToLLVM converts to algebraic (with a future plan to unify both to smith ).
- If the option value is improved, it converts to smith.
- If the value is none or basic, it converts to algebraic.

Based on this, I want to add options to the Flang frontend to control complex number division. But I have the following questions:

Is it necessary for Fortran to switch division algorithms per operation or within a translation unit, such as during inlining? If so, my current MLIR patch, which uses a pass option to specify the algorithm for all operations, cannot handle this.
GCC and Clang have the option -fcx-fortran-rules. What are the rules for complex number division that Fortran should satisfy? I suspect it refers to range reduction and NaN handling, but I don’t know the basis for this. My MLIR patch uses the smith algorithm with NaN handling via -complex-range=improved. Would this be a suitable algorithm when -fcx-fortran-rules is specified? It seems that Clang uses the smith algorithm without NaN handling when -fcx-fortran-rules is specified.

I believe Flang needs to implement -f[no-]cx-limited-range, -f[no-]cx-fortran-rules, and -fcomplex-arithmetic=[full|improved|basic].
Any feedback would be greatly appreciated.

clementval · March 13, 2025, 9:32pm

Thanks for the work in the MLIR complex dialect. I’m looking forward to have this in flang is at all possible.

I’m not sure if you covered this and sorry if you did but how does the new lowering compare with the current __divsc3 and __divdc3 that are currently called. Should we expect identical/similar results or is the method used too different?

@s-watanabe314

clementval · March 24, 2025, 3:57pm

@s-watanabe314 Ping. Any idea if we can use complex.div in flang lowering or if not, what are the blocker?

s-watanabe314 · March 25, 2025, 2:58am

I apologize for the late reply.

The GCC runtime functions like __divsc3 used by Flang generally calculates complex division using Smith’s algorithm, and in that case, I believe the result would be the same as complex.div. However, in some cases, different calculation methods are used, so I don’t think all calculation results will be exactly the same. It seems that the GCC runtime caluculates complex number division by, for example, converting float to double and then calculating using the algebraic method, or by performing preprocessing to avoid overflow before calculating using Smith’s algorithm.

If this difference is acceptable, I think we can lower to complex.div by default. If it’s not acceptable, we would need to add an option like -fcomplex-arithmetic= and only use complex.div when this option is specified.

I would like to confirm whether the calculation results of complex.div and the runtime functions are completely identical. If you know of any test suites related to the precision or overflow of complex number division, please let me know.

Topic		Replies	Views
The priority of -fno-fast-math regarding complex number calculations Clang Frontend	10	201	March 12, 2025
Clang ignoring --fast-math for complex division, serious performance hit Clang Frontend	17	143	November 10, 2017
RFC: The meaning of -Ofast Flang	17	2581	May 1, 2024
[RFC] Improve iteration of estimating divisions LLVM Dev List Archives	6	103	August 9, 2019
[RFC] Proposal for complex math intrinsic lowering through the MLIR complex dialect Flang	1	385	July 19, 2023