Clarifying FMA-related TargetOptions

Hello everyone,

I’d like to propose the attached patch to form FMA intrinsics aggressively, but in order to do so I need some clarification on the intended semantics for the various FP precision-related TargetOptions. I’ve summarized the three relevant ones below:

UnsafeFPMath - Defaults to off, enables “less precise” results than permitted by IEEE754. Comments specifically reference using hardware FSIN/FCOS on X86.

NoExcessFPPrecision - Defaults to off (i.e. excess precision allowed), enables higher-precision implementations than specified by IEEE754. Comments reference FMA-like operations, and X87 without rounding all over the place.

LessPreciseFPMADOption - Defaults to off, enables “less precise” FP multiply-add.

My general sense is that aggressive FMA formation is beyond the realm of what UnsafeFPMath allows, but I’m unclear on the relationship between NoExcessFPPrecision and LessPreciseFPMADOption. My understanding is that fused multiply-add operations are “more precise” (i.e. closer to the numerically true value) than the baseline (which would round between the multiply and the add). By that reasoning, it seems like it should be covered by !NoExcessFPPrecision. However, that opens the question of what LessPreciseFPMADOption is intended to cover. Are there targets on which FMA is actually “less precise” than the baseline sequence? Or is the comment just poorly worded?
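
For concreteness, here is a small standalone C program (not part of the attached patch) that shows the single-rounding effect; it assumes a C99 libm with a correctly rounded fma(), and it should be built with contraction disabled (e.g. -ffp-contract=off) so that the "separate" expression is not itself fused:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
      double a = 1.0 + 0x1p-30;
      double b = 1.0 - 0x1p-30;
      double c = -1.0;
      /* The exact value of a*b + c is -2^-60. */
      double separate = a * b + c;    /* a*b rounds to 1.0 first, so this is 0.0 */
      double fused    = fma(a, b, c); /* rounds once and yields -2^-60 exactly   */
      printf("separate: %a\n", separate);
      printf("fused:    %a\n", fused);
      return 0;
    }

Here the fused result is the exact answer, while the separate multiply and add lose it entirely.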

A related concern is that, while NoExcessFPPrecision seems applicable, it is the only one of the above that defaults to the more-relaxed option. From testing my patch, I can say that it does change the behavior of a number of benchmarks in the LLVM test suite, and for that reason alone seems like it should not be enabled by default.
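
To make the question concrete, here is a hypothetical sketch of the gating decision I am asking about, with plain booleans standing in for the TargetOptions fields; shouldFormFMA and its policy are made up for illustration and are not code from the tree:

    #include <stdbool.h>
    #include <stdio.h>

    struct FPOpts {
      bool UnsafeFPMath;        /* default: false                                */
      bool NoExcessFPPrecision; /* default: false, i.e. excess precision allowed */
      bool LessPreciseFPMAD;    /* default: false                                */
    };

    /* One possible reading: fuse whenever excess precision is permitted,
       independent of the "less precise" FMAD knob. */
    static bool shouldFormFMA(const struct FPOpts *o) {
      return !o->NoExcessFPPrecision;
    }

    int main(void) {
      struct FPOpts defaults = { false, false, false };
      printf("form FMA under the defaults? %d\n", shouldFormFMA(&defaults));
      return 0;
    }

Under that reading, the defaults would turn aggressive FMA formation on, which is exactly the part that worries me.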

Anyone more knowledgeable about FP than me have any ideas?

–Owen

fma.diff (724 Bytes)

> Hello everyone,
>
> I'd like to propose the attached patch to form FMA intrinsics
> aggressively, but in order to do so I need some clarification on the
> intended semantics for the various FP precision-related
> TargetOptions. I've summarized the three relevant ones below:
>
> UnsafeFPMath - Defaults to off, enables "less precise" results than
> permitted by IEEE754. Comments specifically reference using hardware
> FSIN/FCOS on X86.
>
> NoExcessFPPrecision - Defaults to off (i.e. excess precision allowed),
> enables higher-precision implementations than specified by IEEE754.
> Comments reference FMA-like operations, and X87 without rounding all
> over the place.
>
> LessPreciseFPMADOption - Defaults to off, enables "less precise" FP
> multiply-add.
>
> My general sense is that aggressive FMA formation is beyond the realm
> of what UnsafeFPMath allows, but I'm unclear on the relationship
> between NoExcessFPPrecision and LessPreciseFPMADOption. My
> understanding is that fused multiply-add operations are "more
> precise" (i.e. closer to the numerically true value) than the baseline
> (which would round between the multiply and the add). By that
> reasoning, it seems like it should be covered by !NoExcessFPPrecision.

I agree, and this is what the PPC backend does.

> However, that opens the question of what LessPreciseFPMADOption is
> intended to cover. Are there targets on which FMA is actually "less
> precise" than the baseline sequence? Or is the comment just poorly
> worded?

> A related concern is that, while NoExcessFPPrecision seems applicable,
> it is the only one of the above that defaults to the more-relaxed
> option. From testing my patch, I can say that it does change the
> behavior of a number of benchmarks in the LLVM test suite, and for
> that reason alone seems like it should not be enabled by default.

This does not surprise me; however, care is required here. First, there
has been a previous thread on this recently, and I specifically
recommend that you read Stephen Canon's remarks:
http://permalink.gmane.org/gmane.comp.compilers.llvm.cvs/106578

In my experience, users of numerical codes expect that the compiler will
use FMA instructions where it can, unless specifically asked to avoid
doing so by the user. Even though this can sometimes produce a different
result (*almost* always a better one), the performance gain is too large
to be ignored by default. I highly recommend that we continue to enable
FMA instruction-generation by default (as is the current practice, not
only here, but in most vendor compilers with which I am familiar). We
should also implement the FP_CONTRACT pragma, but that is another
matter.

-Hal

Hi Owen,

Having looked into this due to Clang failing PlumHall with it recently, I can give an opinion...

I think !NoExcessFPPrecision covers FMA completely. There are indeed some algorithms which give incorrect results when FMA is enabled, for example those that do floating-point comparisons on expressions such as a * b + c - d. If c == d, it is still possible for that result not to equal a*b, because "+ c" will have been fused with the multiply whereas "- d" won't.
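
Not exactly the expression above, but here is a standard, self-contained illustration of the same class of failure: an exact-cancellation comparison that holds without contraction and breaks once the multiply is fused. It assumes a correctly rounded fma() and should be compiled with contraction off (e.g. -ffp-contract=off) so that only the explicit fma() call is fused:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
      double a = 1.0 + 0x1p-30;
      double b = 1.0 - 0x1p-30;
      double p = a * b;               /* true product is 1 - 2^-60; rounds to 1.0 */

      double unfused = a * b - p;     /* a*b rounds to 1.0 again: exactly 0.0     */
      double fused   = fma(a, b, -p); /* keeps the rounding error: -2^-60, not 0  */

      /* A test many programmers would expect to be true: */
      printf("unfused == 0.0 ? %d\n", unfused == 0.0); /* prints 1 */
      printf("fused   == 0.0 ? %d\n", fused   == 0.0); /* prints 0 */
      return 0;
    }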

Andy Trick (I think?!) gave a less contrived example a couple of weeks back.

Therefore, it shouldn't be enabled by default. I say that because the C standard defines a pragma to control it - #pragma STDC FP_CONTRACT - which is what Clang was failing with in PlumHall. This pragma delimits code regions where contraction may or may not be allowed. If we lack the ability to pass that information through from the frontend to the backend (which we currently do lack), we should not enable the optimisation by default.
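
For reference, this is what the standard spelling of the pragma looks like in source (the function names below are purely illustrative); it is precisely the kind of per-region control we cannot yet forward to the backend:

    double residual(double a, double b, double p) {
      /* p is assumed to be a previously rounded copy of a*b. Contraction
         must stay off so a*b rounds the same way here and the subtraction
         cancels exactly. */
      #pragma STDC FP_CONTRACT OFF
      return a * b - p;
    }

    double axpy(double a, double x, double y) {
      /* Contraction is welcome here; the implementation may emit an FMA. */
      #pragma STDC FP_CONTRACT ON
      return a * x + y;
    }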

That said, I think we should enhance the IR to allow this information to be passed from front to back ends. An attribute on fadd, fmul, fdiv, frem and fsub in the same vein as "nsw" would be my suggestion.

Cheers,

James

The caveat I would add to this is that, when I tried enabling FMA-by-default on an ARM target, I saw a large number of testcases in the LLVM test suite that either failed their output comparisons, crashed, or failed to terminate (!!!). That seems pretty scary to me.

--Owen

I agree that !NoExcessFPPrecision seems like it should cover FMA, but if that is the case, what does LessPreciseFPMADOption cover?

--Owen

On that, I'm afraid I have no clue. Hopefully someone more knowledgeable than me will chip in.

Cheers,

James

>> In my experience, users of numerical codes expect that the compiler will
>> use FMA instructions where it can, unless specifically asked to avoid
>> doing so by the user. Even though this can sometimes produce a different
>> result (*almost* always a better one), the performance gain is too large
>> to be ignored by default. I highly recommend that we continue to enable
>> FMA instruction-generation by default (as is the current practice, not
>> only here, but in most vendor compilers with which I am familiar). We
>> should also implement the FP_CONTRACT pragma, but that is another
>> matter.

> The caveat I would add to this is that, when I tried enabling FMA-by-default on an ARM target, I saw a large number of testcases in the LLVM test suite that either failed their output comparisons, crashed, or failed to terminate (!!!). That seems pretty scary to me.

That is quite scary. I could obviously be wrong, but I would suspect
that something else is going on here. Either there is a bug in the FMA
patch, or the patch is triggering a bug elsewhere. Although I've
certainly seen cases where enabling FMA changes some numerical results
(slightly), crashing and failing to terminate in a large number of cases
seems like another issue. Most code just does not depend on exact
floating-point cancelation in a way that impacts loop termination
conditions, memory indexing, etc.

-Hal

> Hi Owen,
>
> Having looked into this due to Clang failing PlumHall with it recently, I can give an opinion...
>
> I think !NoExcessFPPrecision covers FMA completely. There are indeed some algorithms which give incorrect results when FMA is enabled, for example those that do floating-point comparisons on expressions such as a * b + c - d. If c == d, it is still possible for that result not to equal a*b, because "+ c" will have been fused with the multiply whereas "- d" won't.
>
> Andy Trick (I think?!) gave a less contrived example a couple of weeks back.
>
> Therefore, it shouldn't be enabled by default. I say that because the C standard defines a pragma to control it - #pragma STDC FP_CONTRACT - which is what Clang was failing with in PlumHall. This pragma delimits code regions where contraction may or may not be allowed. If we lack the ability to pass that information through from the frontend to the backend (which we currently do lack), we should not enable the optimisation by default.

Fair enough.

> That said, I think we should enhance the IR to allow this information to be passed from front to back ends. An attribute on fadd, fmul, fdiv, frem and fsub in the same vein as "nsw" would be my suggestion.

I agree that this is a good idea. I think this will be easy to support
if we end up defining some patterns in tablegen like fmul_combinable
(I'm not actually recommending such a long name) and define any FMA-like
patterns in terms of those.

-Hal

Owen Anderson <resistor@mac.com> writes:

> A related concern is that, while NoExcessFPPrecision seems applicable,
> it is the only one of the above that defaults to the more-relaxed
> option. From testing my patch, I can say that it does change the
> behavior of a number of benchmarks in the LLVM test suite, and for
> that reason alone seems like it should not be enabled by default.
>
> Anyone more knowledgeable about FP than me have any ideas?

FWIW, we've found that having a switch to turn off FMA explicitly is
helpful for debugging. We don't expose the switch to users but it has
saved us a few times when trying to track down numerical differences.

Our FP switches are not so precisely named. We basically have fp0, fp1,
fp2 and fp3, analogous to O0, O1, O2, O3. The idea is that the higher
the number, the less guarantee you have that your results will be the
same as scalar code (or code w/o FMA) would give you. The tradeoff
being faster execution, of course. We don't say anything about
precision directly.

                             -Dave