RFC: c-a*b vs (-a*b)+c for strict mode

When working on code like:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON
long double x_fmulsub(long double a, long double b, long double c) {
        fesetround(FE_UPWARD);
        return c - a * b;
}

with the command

clang -O2 -S -emit-llvm -frounding-math q.c

the output is:

define dso_local fp128 @x_fmulsub(fp128 noundef %0, fp128 noundef %1, fp128 noundef %2) local_unnamed_addr #0 {
  %4 = tail call signext i32 @fesetround(i32 noundef signext 3) #3
  %5 = fneg fp128 %0
  %6 = tail call fp128 @llvm.experimental.constrained.fmuladd.f128(fp128 %5, fp128 %1, fp128 %2, metadata !"round.dynamic", metadata !"fpexcept.strict") #4
  ret fp128 %6
}

But this is not correct: we cannot convert c - a*b to ((-a) * b) + c here, and we do not have anything like an fmulsub intrinsic.
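
For reference, a semantically correct strict-mode lowering would simply keep the multiply and subtract separate, using the existing constrained intrinsics. A hand-written sketch (not actual compiler output):

```llvm
define fp128 @x_fmulsub(fp128 %a, fp128 %b, fp128 %c) #0 {
  %call = tail call i32 @fesetround(i32 3)
  %mul = call fp128 @llvm.experimental.constrained.fmul.f128(fp128 %a, fp128 %b, metadata !"round.dynamic", metadata !"fpexcept.strict")
  %sub = call fp128 @llvm.experimental.constrained.fsub.f128(fp128 %c, fp128 %mul, metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret fp128 %sub
}
```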

To solve this problem, there are perhaps two solutions:

  • Add fmulsub/fms (fused multiply-subtract).
    This requires much work in the backend. In fact, some backends, such as AArch64, can already combine the fneg and fmuladd into an fmsub machine instruction.
    If we add fmulsub/fms we will have to do similar work for it.

  • Add a new fast-math-flag on @llvm.experimental.constrained.fmuladd.f64, to note that, combined with the previous fneg, it represents c - a*b.

Any ideas?

I do not see why this is not correct

Because the code asks for upward rounding.

https://sourceware.org/pipermail/libc-alpha/2024-December/162499.html

With rounding upward or downward, you can’t even convert -(a*b) to -a*b. That’s a pretty harsh restriction.

So the problem is actually transforming c - (a*b) to c + ((-a) * b); when the operations are not fused, the two are not the same under FE_UPWARD or FE_DOWNWARD. For example, in float:

FE_UPWARD
a = 1 + 2^-23
b = 1 + 2^-23
c = 1 + 2^-22

a * b = 1 + 2^-22 + 2^-23
(-a) * b = -1 - 2^-22

c - a * b = -2^-23
c + ((-a) * b) = 0

Note: Writing as (-a*b) + c would be understood as (-(a*b)) + c which is the same as c - (a*b) and not what happened here.

Can we just unconditionally use FMA? I don’t think it’s worth the extra effort of supporting a new variant of fmuladd.

Sounds reasonable, and we would need a flag to mark certain cases:
nofma_required
fmadd
fmsub
fnmadd
fnmsub

No new flags; the point is to avoid adding anything new.

A fused operation may cost more if there is no hardware support. It seems reasonable to keep fmuladd.

My point is that optimally supporting antique targets without fast FMA for strictfp is not worth the effort. Minimally functional is good enough.

I have tested code like this:

#include <stdio.h>

volatile long double a = 1.0L;
volatile long double b = 2.0L;
volatile long double c = 3.0L;

int main(void) {
        long double r = 0.0L;
        for (int i = 0; i < 10000000; i++)
                r += (c - a * b);

        printf("%Le\n", r);
        return 0;
}

After editing the IR from fmuladd to fma, the time costs are
0m12.042s (fma) vs 0m1.476s (fmuladd).

I don’t think that it is acceptable.

I’m assuming this is the default generic x86 target, which nobody concerned about performance is going to be using. It should be the same with an -mfma target.

I worked out a patch: IR: Add attribute negated by wzssyqa · Pull Request #121027 · llvm/llvm-project · GitHub

Any ideas about it?

This can’t be an attribute, and would require a different intrinsic.

Do you actually have a compelling need for high performance strictfp in this situation, on a target without fast FMA? Your toy example above doesn’t justify the increase in complexity required to handle this. Without a strong, real world use case justification, I am opposed to trying to provide an optimally performant solution for this right now. The current state of strictfp support is not very complete or correct, and adding a new operation increases the surface area that we need to migrate to use from experimental constrained intrinsics to floating-point bundles.

Do you actually have a compelling need for high performance strictfp in this situation, on a target without fast FMA?

I found this problem while trying to build glibc with Clang.
The failure involves long double, which doesn’t have fast FMA on almost all of the major platforms.

Right, but you can solve it by unconditionally using FMA. I mean, do you have a case where you need the extra performance in slow-FMA situations?

Right, but you can solve it by unconditionally using FMA.

I think we shouldn’t be translating expressions into fma – avoiding that deoptimization is the entire point of having a separate fmuladd operation.

It seems to me that the best answer (without adding an fmulsub operation) would be to disable this transform in strictfp mode – leaving the separate fmul/fsub operations the user wrote as-is.

Of course if a user wants to require an fma in a particular expression, they can always write an fma() call.


It’s not a deoptimization on modern hardware. I do think we should stop assuming “FMA is slow” as the default behavior.

It’s not a deoptimization on modern hardware.

That’s not true even on modern hardware for non-float/double types, as has already been mentioned on this thread. If you’re working with x86_fp80 or fp128 types, you likely don’t have FMA in hardware – and the soft-float version is going to be way slower than the separate multiply/add operations. I do not think it’s reasonable to always emit a fused operation, even on modern hardware.

So I think the immediate fix really should be to simply disable the emission of fmuladd operations where it’s not semantically correct.

However, after fixing the correctness bug, it’s reasonable to discuss next steps to potentially (correctly) allow contraction in additional cases. There are three reasonable options I see for how to proceed:

  1. Add more optional-fma llvm.fmuladd-like operations. We probably ought to support all 4 variants, (a*b)+c, (a*b)-c, -(a*b)+c, and -(a*b)-c, which are all commonly available as hardware instructions. Perhaps we could add two i1 immediate arguments negate_mul and negate_addend, instead of adding additional intrinsics. This seems like it’d be “relatively straightforward”, but yes, will require work, both in generic code and in target-specific code.

  2. Deprecate and remove fmuladd, and make it the frontend’s responsibility to decide whether to emit a required-to-be-fused fma or separate fmul/fadd, based on its knowledge of particular targets and target options, and the data-type being operated on. My weakly-held opinion is that we shouldn’t take this option, but I can see advantages, so I think it may be a useful exercise to further explore the pros/cons here. (I think this is what you’d advocate doing, arsenm?)

  3. Do nothing further: continue to skip contraction in some cases where it would be possible and profitable.
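
Option 1 might be sketched as the following hypothetical declaration (neither this signature nor the immarg parameters exist today; this is only an illustration of the idea):

```llvm
; Hypothetical (does not exist): two i1 immediate arguments select among
; the fmadd / fmsub / fnmadd / fnmsub variants.
declare fp128 @llvm.experimental.constrained.fmuladd.f128(fp128 %a, fp128 %b, fp128 %c, i1 immarg %negate_mul, i1 immarg %negate_addend, metadata %rounding, metadata %except)
```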

Maybe we can have an option 4:
a) keep fma only in the frontend;
b) support new attributes for fma, such as negated / fused-is-not-required.