RISC-V multiply-add instructions (FMADD.*, etc), bit-exactness, and correctness of optimizations

RISC-V specification:

(11.6 Single-Precision Floating-Point Computational Instructions)

FADD.S and FMUL.S perform single-precision floating-point addition and multiplication respectively, between rs1 and rs2. FSUB.S performs the single-precision floating-point subtraction of rs2 from rs1.

FMADD.S multiplies the values in rs1 and rs2, adds the value in rs3, and writes the final result to rd. FMADD.S computes (rs1×rs2)+rs3.
FMSUB.S multiplies the values in rs1 and rs2, subtracts the value in rs3, and writes the final result to rd. FMSUB.S computes (rs1×rs2)-rs3.
FNMSUB.S multiplies the values in rs1 and rs2, negates the product, adds the value in rs3, and writes the final result to rd. FNMSUB.S computes -(rs1×rs2)+rs3.
FNMADD.S multiplies the values in rs1 and rs2, negates the product, subtracts the value in rs3, and writes the final result to rd. FNMADD.S computes -(rs1×rs2)-rs3.

(12.4 Double-Precision Floating-Point Computational Instructions)

The double-precision floating-point computational instructions are defined analogously to their single-precision counterparts, but operate on double-precision operands and produce double-precision results.

Hardware implementations of fused multiply-add instructions typically do not round the intermediate product; only the final result is rounded. So

fmul.d M, A, B
fadd.d X, M, C

is not bit-for-bit equivalent to

fmadd.d X, A, B, C

Currently Clang generates multiply-add instructions under -O1, and does so slightly differently from GCC (which generates multiply-add instructions under -O2). We are considering an optimization that would generate more multiply-add instructions.

Is there any consensus on the conditions (such as compiler flags) when such optimizations are “correct”?

This is specified by the C standard – contraction is permitted within a single expression when #pragma STDC FP_CONTRACT ON is in effect (vs OFF). The default can be set via the -ffp-contract command-line option, and defaults to on. Because the correctness of the transform depends on the operations being within one C expression, the Clang frontend creates the FMAs when possible – it cannot be done as a bitcode optimization.

However, as an additional option which violates the C standard, we also support -ffp-contract=fast. This does allow contraction across statements. In that case, Clang emits the LLVM IR fast-math flag contract on floating-point instructions, which tells the optimizer that it may perform such a fusion if desired.
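One way to see the difference is to compare the generated assembly under the two modes. A hypothetical invocation, assuming a clang with the riscv64 target available and a file mac.c containing a multiply and an add split across two statements:

```shell
# mac.c:
#   double mac(double a, double b, double c) { double m = a * b; return m + c; }

# Standard-conforming mode: the cross-statement mul/add must stay
# fmul.d followed by fadd.d.
clang --target=riscv64 -march=rv64gc -O2 -ffp-contract=on -S mac.c -o - | grep '\.d'

# Non-conforming fast mode: the same source may collapse into a
# single fmadd.d.
clang --target=riscv64 -march=rv64gc -O2 -ffp-contract=fast -S mac.c -o - | grep '\.d'
```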
