Attaching arith::FastMathAttr to operations in Flang

Hello,

I would like to discuss options for attributing MLIR operations produced by Flang with the arith::FastMathAttr attribute. As you might be aware, there are several related patches merged or in review ([1], [2], [3], [4]). The next step is to start attaching the attribute to arith, math, etc. operations in Flang.

Here is a list of options that I’ve collected so far:

  1. @jfurtek has a pass in review [5] that would attach the specified attribute value to all operations that support arith::FastMathAttr recursively.
  2. Have a special property in fir::FirOpBuilder that identifies the fastmath flags that need to be attached to operations created through this builder.
  3. Same as above, but do it in mlir::OpBuilder.

While (1) seems to be a convenient option, I think we may want to have places in Flang (e.g. at some point in lowering) where we diverge from the options specified by the user and override the fastmath behavior, either by turning off fastmath flags the user allowed or by turning on more flags than the user allowed. An example of the latter may be setting reassoc for expressions where it is allowed by the Fortran standard (10.1.5.2.4). For the former I do not have a good example, but something artificial would be disabling contract and reassoc under some conditions for operations inlined to implement DOT_PRODUCT (e.g. under some made-up option -fordered-reductions applied on top of -ffast-math). So having just a single pass does not give us much flexibility.

(2) and (3) just follow the path of llvm::IRBuilder::setFastMathFlags(), which defines the fastmath flags for subsequent instructions created through the builder. We may also support something like llvm::IRBuilder::FastMathFlagGuard to manage local overrides of the fastmath flags.

The builder’s create methods will attach the current fastmath flags after checking whether the created operation implements arith::ArithFastMathInterface.
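To make the intended builder API concrete, here is a minimal, self-contained C++ sketch of option (2). Everything here (the Builder class, the FastMathFlags bitmask, the supportsFastMath parameter) is a hypothetical stand-in for the real fir::FirOpBuilder, arith::FastMathFlags, and the arith::ArithFastMathInterface check, not actual MLIR API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the arith::FastMathFlags bitmask.
enum FastMathFlags : uint32_t {
  FMFNone = 0,
  FMFReassoc = 1u << 0,
  FMFContract = 1u << 1,
  FMFAfn = 1u << 2,
};

// Sketch of the proposed FirOpBuilder extension: the builder carries the
// current fastmath flags and stamps them onto each created operation.
class Builder {
  uint32_t fastMathFlags = FMFNone;

public:
  void setFastMathFlags(uint32_t flags) { fastMathFlags = flags; }
  uint32_t getFastMathFlags() const { return fastMathFlags; }

  struct Op {
    std::string name;
    uint32_t fmf;
  };

  // In the real builder, create<Op>() would check whether the operation
  // implements arith::ArithFastMathInterface before attaching the attribute;
  // here that check is modeled by the supportsFastMath parameter.
  Op create(const std::string &name, bool supportsFastMath) {
    return Op{name, supportsFastMath ? fastMathFlags : FMFNone};
  }
};

// RAII guard modeled after llvm::IRBuilder::FastMathFlagGuard: it restores
// the builder's flags when a local override goes out of scope.
class FastMathFlagGuard {
  Builder &builder;
  uint32_t saved;

public:
  explicit FastMathFlagGuard(Builder &b)
      : builder(b), saved(b.getFastMathFlags()) {}
  ~FastMathFlagGuard() { builder.setFastMathFlags(saved); }
};
```

The guard mirrors llvm::IRBuilder::FastMathFlagGuard: a local override inside a scope is automatically undone on exit, which is exactly what the lowering-time overrides discussed above would need.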

(2) seems like a good start to me because it is local to Flang (so it does not interfere with other parts of LLVM and can be implemented relatively quickly), and later, after figuring out all the caveats with the fir::FirOpBuilder prototype, we can propose the same change for mlir::OpBuilder. For example, I expect it to be a problem for MLIRIR (the component providing mlir::OpBuilder) to depend on MLIRArithDialect, while the FIRBuilder component already depends on all ${dialect_libs} (for whatever reason).

Please let me know if you want to consider other options or if you prefer any of the listed ones.

[1] ⚙ D126305 [mlir][arith] Initial support for fastmath flag attributes in the Arithmetic dialect (v2)
[2] ⚙ D136312 [mlir][math] Initial support for fastmath flag attributes for Math dialect.
[3] ⚙ D136080 [flang] Add -ffp-contract option processing
[4] ⚙ D137072 [flang] Add -f[no-]honor-infinities and -menable-no-infs
[5] ⚙ D137114 [mlir][arith] Add pass to globally set fastmath attributes for a module
I agree that Option 2 is the best way forward.

For transformation/conversion passes in Flang that operate on floating-point values, I guess it will be the responsibility of the pass to honour and propagate the fastmath attributes. I believe we do not have any of the former, but the conversion to LLVM creates or uses floating-point operations in several places (e.g. from CodeGen.cpp below):

    auto rrn = rewriter.create<mlir::LLVM::FAddOp>(loc, eleTy, xx, yy);
    auto rin = rewriter.create<mlir::LLVM::FSubOp>(loc, eleTy, yx, xy);

BTW, would you like us (@DavidTruby) to make similar changes in the complex Dialect and passes as is in the math dialect?

Thank you for the reply, Kiran! I will make the changes for FirOpBuilder and the Bridge (where we create the main builders). FWIW, I am going to propagate the LangOptions to the FirConverter via LoweringOptions.

Yes, the transformation passes should take into account the fastmath attribute of any existing operation that they transform and propagate it further. To be able to do this in CodeGen we need to support the fastmath attribute on FIR complex arithmetic operations (fir::AddcOp, fir::MulcOp, etc.).
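As an illustration of that propagation rule, here is a small self-contained C++ sketch (the Op struct and lowerAddc are made-up stand-ins, not FIR/MLIR API): a conversion that expands one FIR complex op into several scalar LLVM ops should copy the source op's fastmath flags onto every op it creates, not drop them.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an operation carrying a fastmath bitmask.
struct Op {
  std::string name;
  uint32_t fmf;
};

// Lowering a complex add (e.g. fir.addc) produces one scalar add for the
// real part and one for the imaginary part.
struct Lowered {
  Op re;
  Op im;
};

// Sketch of the CodeGen pattern: every op created while rewriting the
// matched op inherits that op's fastmath flags.
Lowered lowerAddc(const Op &addc) {
  return {Op{"llvm.fadd", addc.fmf}, Op{"llvm.fadd", addc.fmf}};
}
```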

Of course, your help is welcome! Since there are multiple things to do, I think we need to prioritize the work based on benchmark analysis.

Here is the data that I have so far (I made the analysis on x86):

  1. The biggest gainer from fastmath flags is CPU2006/454.calculix: it gets a ~2x speed-up from fastmath flags added to the loop nest at line 675 in the e_c3d routine. The FP arithmetic operations are produced during lowering, so my FirOpBuilder work should cover this.
    1.1. There should be slight speed-up in CPU2017/503.bwaves and CPU2006/410.bwaves as well.
    1.2. Polyhedron/induct2_11 should gain from marking mlir::math::SqrtOp with fastmath.
  2. Polyhedron/induct2_11,test_fpu2_11 both gain from adding fastmath flags to operations of the simplified DOT_PRODUCT code.
  3. I did not see hot cases where having fastmath for complex arithmetic provides improvement. I guess it may slightly improve performance here and there. I believe @jfurtek has already started adding fastmath support into complex dialect, so we need to make sure to coordinate the changes with him. On our side, we can add fastmath support for FIR complex arithmetic as noted above.

With this said, it would be great if you could work on (2) (of course, this is just my suggestion). Off the top of my head:

  • We may add arith::FastMathAttr support to fir.call operations such that we can mark Fortran runtime DOT_PRODUCT calls.
  • Then in SimplifyIntrinsicsPass we will be able to propagate the attribute to the inline implementations by configuring FirOpBuilder accordingly.
  • We will have to resolve an issue with separate Fortran modules containing DOT_PRODUCT calls and compiled with different fastmath options: if we give the simplified functions the same name and keep using linkonce_odr linkage, the linker will pick one of the versions. If it picks the “slow” version, we lose performance; if it picks the “fast” version, we may produce inaccurate results in the module compiled with stricter fastmath settings. I do not think we should resolve this with function naming, because there may be too many versions for all combinations of fastmath flags. We can try to just inline the simplified code instead of keeping it in functions, or there may be other options.
    • Note that with HLFIR lowering we are planning to actually inline the transformational operations (ref and ref), so starting to inline them now in SimplifyIntrinsicsPass agrees with the future handling.

Another thing that we can do is to see whether it is profitable to apply Math::PolynomialApproximation under -ffast-math (e.g. under afn in particular). If it is, then we will have to modify the pass to trigger rewrites only for operations with appropriate fastmath flag and add the pass into our optimization pipeline (e.g. under some aggressive optimization level).
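As a sketch of that gating logic (this is not the actual Math::PolynomialApproximation code; the flag value and the toy Taylor polynomial are made up for illustration), such a rewrite would fire only for operations carrying the afn flag and otherwise leave the precise call in place:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical fastmath bitmask; afn = "approximate functions".
enum FastMathFlags : uint32_t { FMFNone = 0, FMFAfn = 1u << 2 };

// Gate the approximation rewrite on the afn flag of the op being matched.
double evalExp(double x, uint32_t fmf) {
  if (fmf & FMFAfn) {
    // Low-order Taylor polynomial around 0 as a stand-in for the pass's
    // real polynomial approximation; accurate only near x == 0.
    return 1.0 + x + x * x / 2.0 + x * x * x / 6.0;
  }
  return std::exp(x); // precise path when afn is absent
}
```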
Thanks @szakharin for the detailed information.

I have seen that wrf (12%) and fotonik (9%) also benefit from fast-math in both the gfortran and classic-flang compilers. So I hope we will get the benefits with llvm/flang as well.

We can do the work for the SimplifyIntrinsicsPass. But @Leporacanthicus is away for a couple of weeks so it might have to wait till then.

Right, I missed fotonik! My estimation is 3.3% speed-up, but we may benefit more later, when aliasing information is present to enable vectorization.

I did not look at wrf yet.

With the latest changes wrf gained 7.99% and fotonik only 1.56% on x86. The hot fotonik loops are not vectorized, so fastmath gain seems to be limited.

I will make changes in SimplifyIntrinsicsPass that should help bwaves a little bit.