```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

long double x_fmulsub(long double a, long double b, long double c) {
  fesetround(FE_UPWARD);
  return c - a * b;
}
```
But this is not correct, because we cannot convert c - a*b into ((-a) * b) + c here. At the same time, we do not have anything like an fmulsub operation. To solve this problem, there are perhaps two options:
1. Add fmulsub/fms (fused multiply-subtract). This requires a fair amount of backend work; in fact, some backends, such as AArch64, can already combine the fneg and fmuladd into an fmsub machine instruction, so if we add fmulsub/fms we will have to do some work for them too.
2. Add a new fast-math flag on @llvm.experimental.constrained.fmuladd.f64 to indicate that, together with the preceding fneg, it should be lowered as c - a*b.
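To make the intended semantics concrete, here is a minimal sketch in plain C (illustrative only; the function names are made up, and this is not the proposed intrinsic). A fused multiply-subtract of the c - a*b shape performs a single rounding, which fmal() can already express because negating a is exact, whereas the unfused form rounds the product and the sum separately:

```c
#include <math.h>

/* Fused: one rounding of the exact value c - a*b.  fmal(-a, b, c)
   computes (-a)*b + c with a single rounding, and negating a is exact,
   so this is round(c - a*b). */
long double fused_sub(long double a, long double b, long double c) {
  return fmal(-a, b, c);
}

/* Unfused: the product a*b is rounded first, then the subtraction is
   rounded again.  Under directed rounding this can differ from the
   fused result. */
long double unfused_sub(long double a, long double b, long double c) {
  return c - a * b;
}
```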
So the problem is actually transforming c - (a*b) into c + ((-a) * b): when the operations are not fused, the two are not the same under FE_UPWARD or FE_DOWNWARD. For example, in float:
FE_UPWARD
a = 1 + 2^-23
b = 1 + 2^-23
c = 1 + 2^-22
a * b = 1 + 2^-22 + 2^-23
(-a) * b = -1 - 2^-22
c - a * b = -2^-23
c + ((-a) * b) = 0
Note: writing it as (-a*b) + c would be understood as (-(a*b)) + c, which is the same as c - (a*b) and not what happens here.
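A small reproducer for the example above (a minimal sketch; it assumes float expressions are evaluated in single precision, i.e. FLT_EVAL_METHOD == 0, and that the compiler honours FENV_ACCESS and does not contract the expressions):

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
  fesetround(FE_UPWARD);
  volatile float a = 1.0f + 0x1p-23f;   /* 1 + 2^-23 */
  volatile float b = 1.0f + 0x1p-23f;   /* 1 + 2^-23 */
  volatile float c = 1.0f + 0x1p-22f;   /* 1 + 2^-22 */
  volatile float r1 = c - a * b;        /* expected: -2^-23 */
  volatile float r2 = c + (-a) * b;     /* expected: 0 */
  printf("c - a*b    = %a\n", r1);
  printf("c + (-a)*b = %a\n", r2);
  return 0;
}
```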
On the performance side, here is a small long double benchmark:

```c
#include <stdio.h>

volatile long double a = 1.0L;
volatile long double b = 2.0L;
volatile long double c = 3.0L;

int main() {
  long double r = 0.0L;
  for (int i = 0; i < 1e7; i++)
    r += (c - a * b);
  printf("%Le\n", r);
}
```
With the IR edited from fmuladd to fma, the time costs are 0m12.042s vs 0m1.476s.
I’m assuming this is the default generic x86 target, which nobody concerned about performance is going to be using. It should be the same with an -mfma target.
This can’t be an attribute, and would require a different intrinsic.
Do you actually have a compelling need for high performance strictfp in this situation, on a target without fast FMA? Your toy example above doesn’t justify the increase in complexity required to handle this. Without a strong, real-world use case justification, I am opposed to trying to provide an optimally performant solution for this right now. The current state of strictfp support is not very complete or correct, and adding a new operation increases the surface area that we will need to migrate from the experimental constrained intrinsics to floating-point bundles.
Do you actually have a compelling need for high performance strictfp in this situation, on a target without fast FMA?
I found this problem when I was trying to build glibc with Clang.
The failure is about long double, which doesn’t have fast FMA on almost all of the major platforms.
Right, but you can solve it by unconditionally using FMA.
I think we shouldn’t be translating expressions into fma – avoiding that deoptimization is the entire point of having a separate fmuladd operation.
It seems to me that the best answer (without adding an fmulsub operation) would be to disable this transform in strictfp mode – leaving the separate fmul/fsub operations the user wrote as-is.
Of course if a user wants to require an fma in a particular expression, they can always write an fma() call.
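For instance (a minimal sketch; the function name is invented for this example), an explicit call requires the single-rounding semantics regardless of contraction settings, at the cost of a slow soft-float fallback on targets without a hardware long double FMA:

```c
#include <math.h>

/* Explicitly requiring a fused operation in source: fmal() always computes
   a*b + c with a single rounding, independent of -ffp-contract. */
long double fused_step(long double a, long double b, long double c) {
  return fmal(a, b, c);
}
```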
That’s not true even on modern hardware for non-float/double types, as has already been mentioned on this thread. If you’re working with x86_fp80 or fp128 types, you likely don’t have FMA in hardware – and the soft-float version is going to be way slower than the separate multiply/add operations. I do not think it’s reasonable to always emit a fused operation, even on modern hardware.
So I think the immediate fix really should be to simply disable the emission of fmuladd operations where it’s not semantically correct.
However, after fixing the correctness bug, it’s reasonable to discuss next steps to potentially (correctly) allow contraction in additional cases. There are three reasonable options I see for how to proceed:
1. Add more optional-fma llvm.fmuladd-like operations. We probably ought to support all 4 variants, (a*b)+c, (a*b)-c, -(a*b)+c, and -(a*b)-c, which are all commonly available as hardware instructions. Perhaps we could add two i1 immediate arguments negate_mul and negate_addend, instead of adding additional intrinsics (see the sketch after this list). This seems like it’d be “relatively straightforward”, but yes, will require work, both in generic code and in target-specific code.
2. Deprecate and remove fmuladd, and make it the frontend’s responsibility to decide whether to emit a required-to-be-fused fma or separate fmul/fadd, based on its knowledge of particular targets and target options, and the data type being operated on. My weakly-held opinion is that we shouldn’t take this option, but I can see advantages, so I think it may be a useful exercise to further explore the pros/cons here. (I think this is what you’d advocate doing, arsenm?)
3. Do nothing further: continue to skip contraction in some cases where it would be possible and profitable.
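As a rough illustration of the four variants from option 1, here is a minimal C sketch (the function names are invented for this example). Each shape can be expressed through the standard fma() call with exact operand negations, so each performs exactly one rounding; the hypothetical negate_mul/negate_addend immediates would select among the same four behaviours:

```c
#include <math.h>

/* All four fused shapes, each a single rounding of the exact result.
   Negating an operand of fma() is exact, so these remain well-defined
   even under directed rounding modes. */
double mul_add   (double a, double b, double c) { return fma( a, b,  c); } /*  (a*b)+c */
double mul_sub   (double a, double b, double c) { return fma( a, b, -c); } /*  (a*b)-c */
double negmul_add(double a, double b, double c) { return fma(-a, b,  c); } /* -(a*b)+c */
double negmul_sub(double a, double b, double c) { return fma(-a, b, -c); } /* -(a*b)-c */
```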