RISC-V specification:

(11.6 Single-Precision Floating-Point Computational Instructions)

FADD.S and FMUL.S perform single-precision floating-point addition and multiplication respectively, between rs1 and rs2. FSUB.S performs the single-precision floating-point subtraction of rs2 from rs1.

…

FMADD.S multiplies the values in rs1 and rs2, adds the value in rs3, and writes the final result to rd. FMADD.S computes (rs1×rs2)+rs3.

FMSUB.S multiplies the values in rs1 and rs2, subtracts the value in rs3, and writes the final result to rd. FMSUB.S computes (rs1×rs2)-rs3.

FNMSUB.S multiplies the values in rs1 and rs2, negates the product, adds the value in rs3, and writes the final result to rd. FNMSUB.S computes -(rs1×rs2)+rs3.

FNMADD.S multiplies the values in rs1 and rs2, negates the product, subtracts the value in rs3, and writes the final result to rd. FNMADD.S computes -(rs1×rs2)-rs3.

(12.4 Double-Precision Floating-Point Computational Instructions)

The double-precision floating-point computational instructions are defined analogously to their single-precision counterparts, but operate on double-precision operands and produce double-precision results.

Fused multiply-add instructions do not round the intermediate product: the whole expression (rs1×rs2)+rs3 is computed as if exactly and rounded only once. So

```
fmul.d M, A, B
fadd.d X, M, C
```

is not, in general, bit-for-bit equivalent to

```
fmadd.d X, A, B, C
```

Currently Clang generates multiply-add instructions under `-O1`, and does so slightly differently from GCC (which generates multiply-add instructions under `-O2`). We are considering an optimization that would generate more multiply-add instructions.
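Both GCC and Clang expose control over this through `-ffp-contract` (accepted values are `off`, `on`, and `fast`); the defaults differ between the two compilers and between `-std=` modes, so pinning the flag explicitly is the usual way to get reproducible behavior. A sketch of both settings:

```shell
# Forbid contraction: a*b + c compiles to separate fmul.d + fadd.d.
cc -O2 -ffp-contract=off example.c -o example

# Contract aggressively: a*b + c may compile to a single fmadd.d.
cc -O2 -ffp-contract=fast example.c -o example
```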

Is there any consensus on the conditions (such as compiler flags) under which such optimizations are considered “correct”?