fp-contract at -O0

Hi everyone,

Melanie Blower recently submitted a change that was intended to make the default set of floating point options in clang be consistent with the options that would be set by the -ffp-model=precise umbrella option. The only change needed was to make the default for fp-contract “on” instead of “off”. While not a trivial change, we thought this was reasonable, since fp-contract=on only allows contraction that is allowed by the language standard. Unfortunately, this change unleashed a surprising number of problems.

The most surprising problem, to me at least, was that this change caused FMA instructions to be generated at -O0.

There are a couple of things that need to be sorted out here, but I’d like to start with the -O0 behavior. Consider the following scenario, which was possible even before the recent change:

Why is that a problem? As long as it doesn't create build performance
regressions, it seems to be semantically valid to do?

Joerg

-O0 does not mean “do not optimize”. It means “Reduce compilation time and make debugging produce the expected results” (quoting the GCC manual, but it applies equally to clang).

– Steve

That’s certainly a reasonable position, but it isn’t without problems. For instance: https://godbolt.org/z/9JtoPt

In this case, “-O0 -ffp-contract=on -march=haswell” results in an FMA instruction but “-O0 -ffp-contract=fast -march=haswell” does not.

I’m not opposed to allowing the explicit use of -ffp-contract=on to lead clang to generate a call to llvm.fmuladd, but I don’t think that should happen by default at -O0.

-Andy

Why not? What situation are you trying to avoid?

I don’t see a problem with the godbolt link; is your concern simply that you think -ffp-contract=fast should fuse a super-set of what is done by =on, or is there something else?

If anything, preserving FMA formation at O0 _helps_ debuggability, because it means that numerical behavior is more likely to match what a user observed at Os, allowing them to debug the problem.

Why not? What situation are you trying to avoid?

It just seems unexpected. I write code with no explicit FMAs, I compile with no command line options, and I get FMA instructions. It’s not what I’d expect, and someone else specifically complained about this behavior after Melanie’s patch landed.

I don’t see a problem with the godbolt link; is your concern simply that you think -ffp-contract=fast should fuse a super-set of what is done by =on, or is there something else?

Yes, that is my concern. I think =fast should always produce at least as many FMAs as =on.

If anything, preserving FMA formation at O0 helps debuggability, because it means that numerical behavior is more likely to match what a user observed at Os, allowing them to debug the problem.

That’s an excellent point. I could definitely be persuaded by that argument.

-Andy

I can imagine a few ways to handle this, if we really want to do something about it:

1. A diagnostic when combining -ffp-contract=fast with -O0 that you aren’t going to get FMA formation.
2. Make -ffp-contract=fast decay to =on under -O0.
3. Make -ffp-contract=fast always imply =on as well (so the frontend would form fmuladd nodes in both modes, but =fast would additionally license forming fma out of mul+add pairs).

Option 1 is easy but silly. Option 2 is only slightly more invasive and definitely fixes the “problem”, but is maybe a little too clever. Option 3 may be the best, but I haven’t thought through all the details, and it would require some experimentation.

– Steve

3. Make -ffp-contract=fast always imply =on as well (so the frontend would form fmuladd nodes in both modes, but =fast would additionally license forming fma out of mul+add pairs).

This option could potentially impede optimizations that are currently performed. Having the contract flag set on FP operations instead of using the fmuladd intrinsic gives the backend freedom to mix and match operations from different source expressions. I’ve come across a case recently where this is beneficial.
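To make the "different source expressions" distinction concrete, here is a minimal sketch. The function names are made up for illustration, and whether any fusion actually happens depends on the target, the optimization level, and the compiler version.

```c
/* Multiply and add in ONE source expression. With -ffp-contract=on the
   front end is allowed to represent this as a single fused multiply-add
   (llvm.fmuladd in clang's case). */
double same_expression(double a, double b, double c) {
  return a * b + c;
}

/* Multiply and add in SEPARATE statements. Contraction under =on applies
   within an expression, so it does not cover this pair. With =fast the
   operations only carry a "may contract" flag, which leaves the backend
   free to fuse them anyway if it chooses to. */
double separate_statements(double a, double b, double c) {
  double product = a * b;
  return product + c;
}
```

Under =on only the first function is a candidate for an FMA; under =fast both are, but only if a later pass decides to combine the operations.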

Perhaps the problem is with my expectation. The option isn’t very well documented in clang (or gcc).

“Form fused FP ops (e.g. FMAs): fast (everywhere) | on (according to FP_CONTRACT pragma) | off (never fuse). Default is ‘fast’ for CUDA/HIP and ‘on’ otherwise.”

Obviously, we don’t form fused FP ops “everywhere.” What this probably should say is that we form fused ops potentially anywhere, at the discretion of the compiler. A more verbose explanation would be good. With the right wording this would reasonably explain why such ops aren’t fused at -O0.

Having given it more thought, I’d be OK with option 0 – leave things as they are (or recently have been/soon will be) with =on as the default and the front end forming fmuladd or setting the contract flag without regard to the optimization level.

BTW, I also noticed some time ago that the front end will form fmuladd with =fast if the code in question is subject to a pragma STDC FP_CONTRACT ON. That seemed wrong to me at the time but now seems reasonable and consistent.
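For reference, a small sketch of what code "subject to a pragma STDC FP_CONTRACT ON" looks like. The function names are hypothetical. Inside a compound statement the pragma has to come before any declarations or statements, and compilers differ in how fully they honor it, so treat the comments as describing what the pragma permits rather than what any given compiler will emit.

```c
/* Contraction of a * b + c is permitted inside this function. */
double mul_add_contract_on(double a, double b, double c) {
#pragma STDC FP_CONTRACT ON
  return a * b + c;
}

/* Contraction is disallowed here: the product is meant to be rounded
   before the addition, regardless of the command-line default. */
double mul_add_contract_off(double a, double b, double c) {
#pragma STDC FP_CONTRACT OFF
  return a * b + c;
}
```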

-Andy

"Kaylor, Andrew via cfe-dev" <cfe-dev@lists.llvm.org> writes:

--------
test.c
--------
double f(double a, double b, double c) {
  return a * b + c;
}
--------
clang -c -O0 -ffp-contract=on test.c
--------

Since clang 5.0 this has produced a call to llvm.fmuladd, which for
targets that support FMA will generally result in an FMA
instruction. Arguably this is what the user asked for, since they
explicitly enabled fp-contract. On the other hand, it is also an
optimization, which they said they did not want. As a point of
comparison, specifying -ffast-math will cause the front end to attach
the "fast" flag to math operations (which also allows contraction),
but will not lead to FMA formation.

What should we do with this? I see two possible solutions:

1. The driver should not pass the -ffp-contract=on flag by default at
-O0 (still allows fmuladd formation if the user specifies
-ffp-contract=on)

2. The front end should not form the llvm.fmuladd intrinsic at -O0

I prefer option #1. If the user explicitly adds -ffp-contract=on then
we should absolutely generate FMAs even at -O0. To me this is the
principle of least surprise. A "more specific" option
(-ffp-contract=on) overrides a "more general" option (-O0).

                       -David

"Kaylor, Andrew via cfe-dev" <cfe-dev@lists.llvm.org> writes:

I’m not opposed to allowing the explicit use of -ffp-contract=on to
lead clang to generate a call to llvm.fmuladd, but I don’t think that
should happen by default at -O0.

+1. This seems like the most reasonable behavior to me.

                    -David

Stephen Canon via cfe-dev <cfe-dev@lists.llvm.org> writes:

If anything, preserving FMA formation at O0 _helps_ debuggability,
because it means that numerical behavior is more likely to match what
a user observed at Os, allowing them to debug the problem.

The user can always pass -ffp-contract=on to do that.

There are many cases where FMA is not desired and most users don't
expect fused operations at -O0 unless they specifically ask for it.

                     -David

If FMA is not desired, -ffp-contract=off or the pragma should be used to disable it. -O0 is the wrong tool for that job.

– Steve

Stephen Canon <scanon@apple.com> writes:

Jumping in with my opinion, as this thread doesn’t seem to be dying of its own accord:

The -O0 level is supposed to be the compiler’s “default” optimization level — that is, the “simplest possible” optimization level, the fastest one, the one that just flows through the compiler without taking any unnecessary detours or side quests. -O0 is the level where you get the thing that just works, without applying any additional post-processing to it.

In fact, film “post-processing” is a good way to think about optimization. -O0 codegen is like the dailies straight from the camera. Optimization options, -ffp-contract=whatever, and so on, are all inputs (from the human “director-producer”) to the guy who does the post-processing, saying “take this raw footage, as it came from the camera, and— look for some extra FMAs, or lower the ones that basic codegen already put there, or whatever.”

The innards of the compiler always look basically like this:

do_some_codegen();
if (some_option) {
  postprocess_the_codegen_to_satisfy_a_whim_of_the_director();
}

The “-O0” path is by definition the path that does not take that if branch. I don’t care if the whim is “I want more FMAs” or “I want fewer FMAs” or “I want more spills to stack” or “I want fewer spills to stack” or whatever. The “-O0” path is by definition the path that does not cater to any whim except “I want to see the dailies as soon as possible.”

Which is to say, Stephen Canon and Joerg Sonnenberger are correct.

–Arthur