RISC-V specification:
(11.6 Single-Precision Floating-Point Computational Instructions)
FADD.S and FMUL.S perform single-precision floating-point addition and multiplication respectively, between rs1 and rs2. FSUB.S performs the single-precision floating-point subtraction of rs2 from rs1.
…
FMADD.S multiplies the values in rs1 and rs2, adds the value in rs3, and writes the final result to rd. FMADD.S computes (rs1×rs2)+rs3.
FMSUB.S multiplies the values in rs1 and rs2, subtracts the value in rs3, and writes the final result to rd. FMSUB.S computes (rs1×rs2)-rs3.
FNMSUB.S multiplies the values in rs1 and rs2, negates the product, adds the value in rs3, and writes the final result to rd. FNMSUB.S computes -(rs1×rs2)+rs3.
FNMADD.S multiplies the values in rs1 and rs2, negates the product, subtracts the value in rs3, and writes the final result to rd. FNMADD.S computes -(rs1×rs2)-rs3.
(12.4 Double-Precision Floating-Point Computational Instructions)
The double-precision floating-point computational instructions are defined analogously to their single-precision counterparts, but operate on double-precision operands and produce double-precision results.
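To make the mapping concrete, here is a sketch in plain C (the helper names are mine, not from the spec). fma() from <math.h> is specified to round only once, and negation is exact in IEEE 754, so each call matches the corresponding spec formula; whether a compiler actually selects the fused instruction still depends on the target and flags.

#include <math.h>

/* Hypothetical helpers mirroring the four fused forms. */
double madd (double a, double b, double c) { return fma( a, b,  c); } /*  (a*b)+c -> fmadd.d  */
double msub (double a, double b, double c) { return fma( a, b, -c); } /*  (a*b)-c -> fmsub.d  */
double nmsub(double a, double b, double c) { return fma(-a, b,  c); } /* -(a*b)+c -> fnmsub.d */
double nmadd(double a, double b, double c) { return fma(-a, b, -c); } /* -(a*b)-c -> fnmadd.d */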
Fused multiply-add instructions do not round the intermediate product: the whole multiply-add incurs a single rounding at the end (IEEE 754 fusedMultiplyAdd). So
fmul.d M, A, B
fadd.d X, M, C
is not bit-for-bit equivalent to
fmadd.d X, A, B, C
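The divergence is easy to reproduce in plain C, since fma() has the same single-rounding semantics as fmadd.d. A minimal sketch (build with -ffp-contract=off so the compiler does not itself fuse the "separate" expression):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* a*b is exactly 1 - DBL_EPSILON^2, which rounds to 1.0 in double,
     * so the separate multiply loses the -DBL_EPSILON^2 term. */
    double a = 1.0 + DBL_EPSILON;
    double b = 1.0 - DBL_EPSILON;
    double c = -1.0;

    double separate = a * b + c;    /* product rounded, then sum rounded: 0.0     */
    double fused    = fma(a, b, c); /* single rounding: -DBL_EPSILON*DBL_EPSILON  */

    printf("separate = %g\nfused    = %g\n", separate, fused);
    return 0;
}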
Currently Clang generates multiply-add instructions under -O1, and does so under slightly different conditions than GCC (which generates multiply-add instructions under -O2). We are considering an optimization that would generate more multiply-add instructions.
Is there any consensus on the conditions (such as compiler flags) under which such optimizations are “correct”?
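For reference, my understanding of the existing knobs, which is worth checking against the current compiler documentation: both Clang and GCC accept -ffp-contract={off,on,fast}, where "on" permits fusing only within a single source expression (the ISO C FP_CONTRACT semantics) and "fast" also permits fusing across statements; Clang currently defaults to "on", while GCC defaults to "fast" in its default GNU modes, though these defaults have shifted between releases. The standard pragma form looks like this:

#include <math.h>

/* Contraction can be controlled per block with the ISO C pragma.
 * Clang honors it; GCC's support has historically been limited,
 * so treat that part as an assumption to verify. */
double mac_contractible(double a, double b, double c) {
    #pragma STDC FP_CONTRACT ON
    return a * b + c;              /* compiler may emit fmadd.d          */
}

double mac_strict(double a, double b, double c) {
    #pragma STDC FP_CONTRACT OFF
    return a * b + c;              /* must round the product separately  */
}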