RFC: c-a*b vs (-a*b)+c for strict mode

When working on code like:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON
long double x_fmulsub(long double a, long double b, long double c) {
        fesetround(FE_UPWARD);
        return c - a * b;
}

with the command

clang -O2 -S -emit-llvm -frounding-math q.c

the output is:

define dso_local fp128 @x_fmulsub(fp128 noundef %0, fp128 noundef %1, fp128 noundef %2) local_unnamed_addr #0 {
  %4 = tail call signext i32 @fesetround(i32 noundef signext 3) #3
  %5 = fneg fp128 %0
  %6 = tail call fp128 @llvm.experimental.constrained.fmuladd.f128(fp128 %5, fp128 %1, fp128 %2, metadata !"round.dynamic", metadata !"fpexcept.strict") #4
  ret fp128 %6
}

But this is not correct: we cannot convert c - a*b to ((-a) * b) + c here, and we do not have anything like an fmulsub intrinsic.
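
For reference, a semantically correct strict-mode lowering would simply keep the multiply and subtract separate, using the existing constrained intrinsics. A hand-written sketch (not actual compiler output):

```llvm
define fp128 @x_fmulsub(fp128 %a, fp128 %b, fp128 %c) #0 {
  %call = tail call i32 @fesetround(i32 3)
  %mul = call fp128 @llvm.experimental.constrained.fmul.f128(fp128 %a, fp128 %b, metadata !"round.dynamic", metadata !"fpexcept.strict")
  %sub = call fp128 @llvm.experimental.constrained.fsub.f128(fp128 %c, fp128 %mul, metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret fp128 %sub
}
```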

To solve this problem, there are perhaps two solutions:

  • Add fmulsub/fms (fused multiply-subtract).
    This requires much work in the backend. In fact, some backends, such as AArch64, can already combine the fneg and fmuladd into an fmsub machine instruction.
    If we add fmulsub/fms we will have to do similar work for it.

  • Add a new fast-math-flag on @llvm.experimental.constrained.fmuladd.f64, to note that, combined with the previous fneg, it represents c - a*b.

Any ideas?

I do not see why this is not correct

Because the code asks for upward rounding.

https://sourceware.org/pipermail/libc-alpha/2024-December/162499.html

With rounding upward or downward, you can’t even convert -(a*b) to -a*b. That’s a pretty harsh restriction.

So the problem is actually transforming c - (a*b) to c + ((-a) * b); when the operations are not fused, the two are not the same under FE_UPWARD or FE_DOWNWARD. For example, in float:

FE_UPWARD
a = 1 + 2^-23
b = 1 + 2^-23
c = 1 + 2^-22

a * b = 1 + 2^-22 + 2^-23
(-a) * b = -1 - 2^-22

c - a * b = -2^-23
c + ((-a) * b) = 0

Note: Writing as (-a*b) + c would be understood as (-(a*b)) + c which is the same as c - (a*b) and not what happened here.

Can we just unconditionally use FMA? I don’t think it’s worth the extra effort of supporting a new variant of fmuladd.

Sounds reasonable, and we would need a flag to mark certain cases:
nofma_required
fmadd
fmsub
fnmadd
fnmsub

No new flags; the point is to avoid adding anything new.

A fused operation may cost more if there is no hardware support. It seems reasonable to keep fmuladd.

My point is that optimally supporting antique targets without fast FMA for strictfp is not worth the effort. Minimally functional is good enough.

I have tested code like this:

#include <stdio.h>

volatile long double a = 1.0L;
volatile long double b = 2.0L;
volatile long double c = 3.0L;

int main(void) {
        long double r = 0.0L;
        for (int i = 0; i < 10000000; i++)
                r += (c - a * b);

        printf("%Le\n", r);
        return 0;
}

After editing the IR from fmuladd to fma, the time costs are
0m12.042s (fma) vs 0m1.476s (fmuladd).

I don’t think that it is acceptable.

I’m assuming this is the default generic x86 target, which nobody concerned about performance is going to be using. It should be the same with an -mfma target.

I worked out a patch: IR: Add attribute negated by wzssyqa · Pull Request #121027 · llvm/llvm-project · GitHub

Any ideas about it?

This can’t be an attribute, and would require a different intrinsic.

Do you actually have a compelling need for high performance strictfp in this situation, on a target without fast FMA? Your toy example above doesn’t justify the increase in complexity required to handle this. Without a strong, real world use case justification, I am opposed to trying to provide an optimally performant solution for this right now. The current state of strictfp support is not very complete or correct, and adding a new operation increases the surface area that we need to migrate to use from experimental constrained intrinsics to floating-point bundles.

Do you actually have a compelling need for high performance strictfp in this situation, on a target without fast FMA?

I found this problem while trying to build glibc with Clang.
The failure involves long double, which doesn’t have fast FMA on almost all of the major platforms.

Right, but you can solve it by unconditionally using FMA. I mean, do you have a case where you need the extra performance in slow-FMA situations?

Right, but you can solve it by unconditionally using FMA.

I think we shouldn’t be translating expressions into fma – avoiding that deoptimization is the entire point of having a separate fmuladd operation.

It seems to me that the best answer (without adding an fmulsub operation) would be to disable this transform in strictfp mode – leaving the separate fmul/fsub operations the user wrote as-is.

Of course if a user wants to require an fma in a particular expression, they can always write an fma() call.


It’s not a deoptimization on modern hardware. I do think we should stop assuming “FMA is slow” as the default behavior.

It’s not a deoptimization on modern hardware.

That’s not true even on modern hardware for non-float/double types, as has already been mentioned on this thread. If you’re working with x86_fp80 or fp128 types, you likely don’t have FMA in hardware – and the soft-float version is going to be way slower than the separate multiply/add operations. I do not think it’s reasonable to always emit a fused operation, even on modern hardware.

So I think the immediate fix really should be to simply disable the emission of fmuladd operations where it’s not semantically correct.

However, after fixing the correctness bug, it’s reasonable to discuss next steps to potentially (correctly) allow contraction in additional cases. There are three reasonable options I see for how to proceed:

  1. Add more optional-fma llvm.fmuladd-like operations. We probably ought to support all 4 variants, (a*b)+c, (a*b)-c, -(a*b)+c, and -(a*b)-c, which are all commonly available as hardware instructions. Perhaps we could add two i1 immediate arguments negate_mul and negate_addend, instead of adding additional intrinsics. This seems like it’d be “relatively straightforward”, but yes, will require work, both in generic code and in target-specific code.

  2. Deprecate and remove fmuladd, and make it the frontend’s responsibility to decide whether to emit a required-to-be-fused fma or separate fmul/fadd, based on its knowledge of particular targets and target options, and the data-type being operated on. My weakly-held opinion is that we shouldn’t take this option, but I can see advantages, so I think it may be a useful exercise to further explore the pros/cons here. (I think this is what you’d advocate doing, arsenm?)

  3. Do nothing further: continue to skip contraction in some cases where it would be possible and profitable.
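
Option 1 might be sketched as the following hypothetical declaration (neither this signature nor the immarg parameters exist today; this is only an illustration of the idea):

```llvm
; Hypothetical (does not exist): two i1 immediate arguments select among
; the fmadd / fmsub / fnmadd / fnmsub variants.
declare fp128 @llvm.experimental.constrained.fmuladd.f128(fp128 %a, fp128 %b, fp128 %c, i1 immarg %negate_mul, i1 immarg %negate_addend, metadata %rounding, metadata %except)
```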

Maybe we can have an option 4:
a) keep fma only in the frontend;
b) support new attributes for fma, such as negated / fused-is-not-required.