[RFC] change lowering of Fortran math intrinsics

Hello,

I want to give a heads-up and collect feedback on a change in how Fortran math intrinsics are lowered.

The general idea is to represent Fortran math intrinsics as MLIR operations of existing dialects (Math/Complex) as much as possible, in order to expose them to more MLIR optimizations. The current implementation lowers math intrinsics “early” into opaque calls to pgmath library APIs. Not only do these calls block MLIR optimizations, but further translation to LLVM IR leaves them as pgmath library calls that the LLVM backend does not recognize or optimize in any way (though there were attempts to add some pgmath support in lib/Analysis/TargetLibraryInfo.cpp). So the proposed change targets potential performance issues associated with the current lowering scheme.

The existing pgmath scheme looks up a pgmath library function by Fortran intrinsic name in one of the function tables, selected by the fast, relaxed or precise -mllvm -math-runtime configuration. The comprehensive list of currently supported pgmath functions can be found here. If the lookup fails, the intrinsic is looked up in the llvmIntrinsics table; if it is found there, an opaque call is generated to an “llvm-mangled” LLVM intrinsic name, e.g. llvm.fabs.f32. So, in addition to the already mentioned issue with opaque calls, the current implementation also relies on the LLVM backend later recognizing LLVM IR calls to llvm.fabs.f32 as LLVM intrinsic calls.
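The two-step lookup described above can be sketched roughly as follows. This is only an illustration in Python, not the Flang code: the table contents, the callee names (`pgmath_fast_sin_f32` and so on) and the function `lookup_early` are hypothetical placeholders, and the only name taken from the text is `llvm.fabs.f32`.

```python
# Hypothetical per-mode tables; the callee names are placeholders,
# not real pgmath symbols.
PGMATH_TABLES = {
    "fast": {("sin", "f32"): "pgmath_fast_sin_f32"},
    "relaxed": {("sin", "f32"): "pgmath_relaxed_sin_f32"},
    "precise": {("sin", "f32"): "pgmath_precise_sin_f32"},
}

# Hypothetical stand-in for the llvmIntrinsics table: Fortran intrinsic
# name -> LLVM intrinsic base name (the type suffix is appended later).
LLVM_INTRINSICS = {"abs": "llvm.fabs"}


def lookup_early(name, ty, runtime="fast"):
    """Return the callee name for an opaque call, or None if unsupported."""
    # Step 1: try the pgmath table selected by the math-runtime mode.
    callee = PGMATH_TABLES[runtime].get((name, ty))
    if callee is not None:
        return callee
    # Step 2: fall back to an "llvm-mangled" intrinsic name.
    base = LLVM_INTRINSICS.get(name)
    if base is not None:
        return f"{base}.{ty}"
    return None
```

Either way the result is just a callee name for an opaque call, which is the property the RFC is trying to move away from.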

The initial change for a “late” lowering of a subset of intrinsics is merged under an internal option in ⚙ D128385 [flang] Lower Fortran math intrinsic operations into MLIR ops or libm calls. It makes use of the Math MLIR dialect operations that happen to be available to represent Fortran math intrinsics in FIR/MLIR. For -math-runtime=precise it lowers Fortran intrinsics to opaque FIR calls instead, trying to mimic the behavior of clang’s -ffp-model=strict by disallowing optimizations that may, for example, alter the floating point exception semantics. Overall, I believe the floating point model is not represented in MLIR in any way at this point, so this change is not an improvement over the current pgmath scheme in that respect, but it does not seem to be a regression either.
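As a rough model of the decision described above, the sketch below picks either a Math dialect operation or an opaque libm call depending on the math-runtime mode. It is a Python illustration under stated assumptions: the table is a small illustrative subset, and `Lowered`, `MATH_OPS` and `lower_late` are hypothetical names, not Flang's data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lowered:
    kind: str  # "op" for an MLIR operation, "call" for an opaque call
    name: str


# Illustrative subset: intrinsic -> (Math dialect op, libm name per type).
MATH_OPS = {
    "sin": ("math.sin", {"f32": "sinf", "f64": "sin"}),
    "cos": ("math.cos", {"f32": "cosf", "f64": "cos"}),
    "sqrt": ("math.sqrt", {"f32": "sqrtf", "f64": "sqrt"}),
}


def lower_late(name, ty, runtime):
    """Return a Lowered result, or None when the caller must fall back."""
    entry = MATH_OPS.get(name)
    if entry is None or ty not in entry[1]:
        return None  # e.g. complex types: not covered by this sketch
    op, libm = entry
    if runtime == "precise":
        # Keep the call opaque so that nothing optimizes across it.
        return Lowered("call", libm[ty])
    # fast/relaxed: expose the operation to MLIR optimizations.
    return Lowered("op", op)
```

A `None` result stands for the cases where the lowering would have to fall back to the existing pgmath scheme.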

Moreover, the new scheme also uses some llvm-mangled opaque calls (e.g. llvm.round.*), but in general it tries to use standard libm names such as sin, sinf for the precise model. It should be possible to replace some of the llvm-mangled names with libm names, but for some operations, such as floating point exponentiation with an integer power argument, there are just no libm implementations.
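To illustrate what an implementation of floating point exponentiation with an integer power could look like when no libm routine exists, here is a minimal sketch using exponentiation by squaring. It is an assumption about one reasonable technique, not the implementation used by pgmath, Flang or MLIR, and the name `fpowi` is chosen only for illustration.

```python
def fpowi(base: float, exp: int) -> float:
    """Compute base**exp for an integer exp by repeated squaring."""
    n = abs(exp)
    result = 1.0
    factor = base
    while n:
        if n & 1:
            # The current bit of the exponent is set: fold in this power.
            result *= factor
        n >>= 1
        if n:
            # Square only while more bits remain to be processed.
            factor *= factor
    # A negative exponent is handled with one division at the end.
    # Note: in Python this raises ZeroDivisionError for base == 0.0;
    # a compiled IEEE implementation would produce an infinity instead.
    return 1.0 / result if exp < 0 else result
```

The loop performs a number of multiplications proportional to the number of bits in the exponent, rather than to the exponent itself. Results may differ in the last bits from a correctly rounded power function, which is one reason the choice of implementation matters for the precise mode.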

Moving forward, I think we need to represent the majority of Fortran math intrinsics with MLIR operations, but there are some exceptions to this direction. For example, if no optimizations are anticipated for some math operation (e.g. BESSEL), we may want to avoid adding an MLIR operation to represent it and instead lower it as an optimizable call (e.g. under fast it may have a narrow set of side effects, and use some FastMathFlags to enable more optimizations around it). If we decide to lower such an intrinsic into a FIR call, we will also have to define the name of the library implementation; e.g. if there is no libm implementation available, we may need to add an implementation to the Fortran runtime library. The details of how/whether such calls must be optimized (e.g. vectorized) by LLVM or any other backend may be discussed elsewhere.

As already mentioned, one of the problems arising with representing Fortran math intrinsics with MLIR operations is the lack of floating point model representation in MLIR. This is a related but different topic: neither the pgmath nor the new lowering scheme addresses it consistently, besides relying on the opaqueness of call operations. The same issue also exists with the usage of arith dialect FP operations that may be InstCombined, DCEd, etc. by the LLVM backend, altering floating point exception behavior.

Despite the existing problems and immaturity of the MLIR framework, I still think we should use MLIR operations as much as possible to represent Fortran math intrinsics. So I have started adding the Math dialect operations that are missing. Below you will find a list of related differentials; they cover, for example:

  • Adding Math FPowI and IPowI operations for the exponentiation operator.
  • Getting rid of the llvmIntrinsics table to avoid unnecessary work duplication when we have to add new intrinsics into both the llvmIntrinsics and mathOperations tables.
  • Switching to “late” lowering by default while keeping the pgmath fallback for cases that are currently not supported (e.g. complex data types are not covered at all in the new scheme).

The current differentials mostly target the fast and relaxed modes and aim to get reasonably performant code with the LLVM backend. I believe the precise/strict mode cannot be reliably supported without cross-dialect (arith/math/complex/etc.) modeling of floating point exception semantics, rounding mode, denormals handling, errno, etc.

A list of related differentials so far:

  1. Flang:
    [merged] ⚙ D128385 [flang] Lower Fortran math intrinsic operations into MLIR ops or libm calls.
    ⚙ D130048 [flang] Support late math lowering for intrinsics from the llvm table.
    ⚙ D130129 [flang] Try to lower math intrinsics to math operations first.
  2. MLIR:
    [merged] ⚙ D128454 [mlir][math] Lower atan to libm
    [merged] ⚙ D129539 [mlir][math] Added math::tan operation.
    ⚙ D130035 [flang] Run algebraic simplification optimization pass.
    ⚙ D129809 [mlir][math] Added basic support for IPowI operation.
    ⚙ D129810 [mlir][math] Added math::IPowI conversion to calls of outlined implementations.
    ⚙ D129811 [mlir][math] Added basic support for FPowI operation.
    ⚙ D129812 [WIP][mlir][math] Added math::FPowI conversion to LLVM dialect.

Some performance data for the initial change, comparing -lower-math-early=true (default) against -lower-math-early=false on Icelake (Gold 6338):
SPEC CPU 2000 speed-ups:

  • 177.mesa: 1.16x

Polyhedron:

  • doduc_11: 1.10x
  • fatigue2_11: 1.13x
  • gas_dyn2_11: 1.32x
  • mp_prop_design_11: 2.19x

To the best of my knowledge, sqrt, sin and cos are the main contributors (meaning usage of the native sqrt instruction and the sincos optimization in LLVM).

Thanks for posting your plan here @szakharin. Overall it makes sense to me based on the current status of MLIR.

Could you implement the Bessel intrinsic on top of the math dialect for O3 and for O0 make a library call?

@tschuett, do you mean lowering, e.g., Fortran BESSEL_J0 to a Math dialect bessel operation and then converting it into an inline implementation of the Bessel function at some higher optimization level?

This is possible, but the question is what kind of benefits this will bring. As I said, if there are no anticipated optimizations for an operation, adding a dialect operation for it does not seem to justify the effort. So I guess the decision of adding or not adding an operation may depend on the evolution of optimizations such as AlgebraicSimplification and PolynomialApproximation.

Thanks for the feedback, @clementval!

My bad. A BESSEL op is probably not helpful. What I meant was expressing the formula behind the Bessel intrinsic using the Math dialect. E.g., if FOO(X,Y) is actually X * Y, then a FOO op is not helpful, but I can express X * Y using the Math dialect instead.

I think FOO(X, Y) <=> X * Y is a reasonably good case for introducing a new operation. First, it makes lowering of language constructs less verbose. Second, the fact that there is an “expansion” for the FOO operation means that there is an optimization or canonicalization opportunity for this operation that may be internal to the Math dialect’s folding/canonicalization/optimization code. Maybe a better example is a BAR operation that may be expanded into 100 arith dialect operations. We really do not want to emit all those 100 operations during Flang lowering. We may think of FOO the same way and during Flang lowering just ignore the fact that it is a simple multiplication. Though, I agree that the value of adding such a FOO operation is quite debatable.
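As a toy illustration of keeping FOO as a single operation during lowering and expanding it later, the sketch below rewrites a nested-tuple expression tree. Everything here is hypothetical: the tuple encoding, the op names `"foo"` and `"mul"`, and the function `expand_foo` only stand in for a dialect-internal rewrite pattern.

```python
def expand_foo(expr):
    """Rewrite every ("foo", x, y) node into ("mul", x, y), bottom-up."""
    if not isinstance(expr, tuple):
        return expr  # a leaf, e.g. a variable name
    op, *args = expr
    # Expand the operands first so nested FOO nodes are handled too.
    args = [expand_foo(a) for a in args]
    if op == "foo" and len(args) == 2:
        return ("mul", *args)
    return (op, *args)


# Lowering emits one compact node; the expansion happens in a later pass.
lowered = ("add", ("foo", "x", "y"), "z")
expanded = expand_foo(lowered)
```

The front end only needs to know about `"foo"`; whether and when it becomes a multiplication is decided by the pass that owns the expansion.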

Agreed.

If BAR is a Fortran-only thing, there may be several expansions, i.e., 100 primitive ops, 20 well-known ops, or 10 really well-known ops. The really well-known ops are more abstract and support other canonicalizations than the primitive ops.

Thanks for sharing, this really helps to understand the overall goal!

Thanks @szakharin for submitting this RFC for lowering Fortran intrinsics.

I have two points for discussion.

  1. Usually different hardware vendors have their own proprietary libraries for Math intrinsics. Sometimes they have different versions for the CPU and the GPU. In these cases if all the intrinsics are modelled in MLIR, then the vendors can write a conversion pass from MathDialect to lower the Math intrinsics to call functions from their proprietary libraries. Wouldn’t this be more convenient?
  2. Although there is no full modelling of floating point support in MLIR, there is some progress in this area, like the addition of FastMath Flags (⚙ D126305 [mlir][arith] Initial support for fastmath flag attributes in the Arithmetic dialect (v2)). Since the most common use cases are fast and precise, would this be sufficient to always lower to the MLIR Math dialect, assuming the conversions and transformations in the dialect can be restricted or modified to match the behaviour in Flang?
  1. Exactly! Having MLIR operations will make it more convenient for different vendors.
  2. Yes, with the assumption that MLIR passes can be modified to behave correctly (functionally), we can always lower to MLIR Math operations. I kept the calls for precise just to avoid breaking something that works now for the users. Thank you for the link!