Motivation and problem
Measurements of the execution speed of benchmark 627.cam4_s
in SPEC CPU 2017
revealed that a build using flang is slower than a build using gfortran.
Result (1 Core)
Machine | Compiler | Vector Length | Options | Execution Time [s] | Ratio (gcc/llvm) |
---|---|---|---|---|---|
graviton3 | llvm 18.1.8 | variable-length | -g -Ofast -mcpu=neoverse-v1 | 2155.23 | 0.60 |
gcc 14.1.0 | variable-length | -g -Ofast -mcpu=neoverse-v1 | 1295.25 | - |
The perf profiling indicated that the following functions were high-cost in the flang build:
Symbol | rate |
---|---|
__divdc3 | 17.63% |
Fortran::runtime::Assign | 12.35% |
__GI___memcpy_simd | 4.99% |
_QMoptics_libPmieint | 3.06% |
pow@@GLIBC_2.29 | 4.50% |
_QMmath_libPavint | 2.59% |
The __divdc3
function exhibited the highest cost. I believe one reason for this is that complex number division in flang is always converted into a call to a runtime function like __divdc3
. This appears to inhibit optimizations such as vectorization and reciprocal approximation in Fortran programs involving complex number division. Therefore, I aim to inline __divdc3
to improve speed and promote other optimizations.
Clang and GNU behavior
In clang, specifying either the -ffast-math
or -fcx-limited-range
option passes -complex-range=basic
to the frontend. The complex-range
value affects how complex number division is calculated. Similarly, in gcc and gfortran, specifying -fcx-limited-range
appears to select the calculations. From my investigation, at least for float
and double
, the complex number division is converted to calculation expressions. However, for quadruple precision, it results in a call to __divtc3
.
Types of complex number division
Consider calculating z = x / y
, where x = a + bi, y = c + di
.
Algebraic division
Clang and GNU compilers utilize this method when -ffast-math
or -fcx-limited-range
is specified. This method requires only two divisions, but it may overflow and underflow during intermediate calculations.
den = c^2 + d^2
num_r = a * c + b * d
num_i = a * d - b * c
real = num_r / den
imag = num_i / den
Smith’s algorithm (Algorithm 116 in this paper)
This is the default method used by gfortran. While it requires three divisions, it’s less prone to overflow. when y = 0 + 0i
, the result is NaN
.
- If |c| >= |d|
r = d / c
den = c + r * d
real = (a + b * r) / den
imag = (b - a * r) / den
- else
r = c / d
den = d + r * c
real = (a * r + b) / den
imag = (b * r - a) / den
Suggestion
My understanding is that current flang’s implementation uses a runtime function call for complex number division to minimize the risk of overflow during intermediate calculations. However, under optimization flags like -ffast-math
, which assume the absence of overflow, the algebraic division could be used. Therefore, I proposed the following implementation:
- Similar to clang, implement
-fcx-limited-range
option in the flang driver, and-complex-range
option in the flang frontend. - If
-ffast-math
or-fcx-limited-range
is specified, pass-complex-range=basic
to the frontend. - For the initial implementation, when
-complex-range=basic
is specified, generate algebraic division forfloat
ordouble
types. For other types retain a runtime function call.
The current plan is to modify the lowering stage to check the complex-range
value and generate either a runtime function call or fir.divc
accordingly. This implementation should result in faster complex number division when -ffast-math
, -fcx-limited-range
, or -complex-range=basic
is specified. However, optimization for other complex number operations (like abs
and mul
) based on -complex-range=basic
may need to be implemented.
Future plans
- Implement additional optimization options for speeding up complex number division. This may include implementing reciprocal approximation for division (
-mrecip
). - Implement handling for other
complex-range
options (clang appears to haveimproved
,promoted
, etc). Since clang uses a runtime call, algebraic division, or Smith’s algorithm selectively for complex number division, flang also need to choose between these calculation methods. I have not considered the phase of this transformation (lowering, transform, or codeGen) yet. - In flang, complex number division is by default converted to Smith’s algorithm. It seems gfortran defaults to using Smith’s algorithm. While this algorithm may result in
NaN
, it’s less prone to overflow.