defaults for FP contraction [e.g. fused multiply-add]: suggestion and patch to be slightly more aggressive and to make Clang's optimization settings closer to having the same meaning as when they are given to GCC [at least for "-O3"]

Dear all,

In the process of investigating a performance difference between Clang & GCC when both compile the same non-toolchain program while using the "same"* compiler flags, I have found something that may be worth changing in Clang, developed a patch, and confirmed that the patch has its intended effect.

*: "same" in quotes b/c the essence of the problem is that the _meaning_ of "-O3" on Clang differs from that of "-O3" on GCC in at least one way.

The specific problem here relates to the default settings for FP contraction, e.g. fused multiply-add. At -O2 and higher, GCC defaults FP contraction to "fast", i.e. always on. I'm not suggesting that Clang/LLVM/both need to do the same, since Clang+LLVM has good support for "#pragma STDC FP_CONTRACT".

If we keep Clang's default for FP contraction at "on" [which really means "according to the pragma"] but change the default value of the _pragma_ [currently off] to on at -O3, then Clang will be more competitive with GCC at high optimization settings without resorting to the more-brutish "fast by default" at plain -O3 [as opposed to "-Ofast", "-O3 -ffast-math", etc.].
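To make the proposal concrete, here is a minimal sketch in C of what the pragma controls. The function names are hypothetical and are not part of the patch. Under the proposed default, a file containing no pragma at all would be treated as if it began with the first line below.

```c
/* Illustrative only: mul_add and mul_add_strict are hypothetical names. */

#pragma STDC FP_CONTRACT ON   /* file scope: contraction is permitted from here on */

double mul_add(double a, double b, double c) {
    /* a*b + c is a single expression, so the compiler is allowed
       (not required) to emit one fused multiply-add for it. */
    return a * b + c;
}

double mul_add_strict(double a, double b, double c) {
    /* Inside a compound statement the pragma must come first and
       applies until the end of that compound statement. */
    #pragma STDC FP_CONTRACT OFF
    return a * b + c;   /* product is rounded before the addition */
}
```

For inputs whose product is exactly representable, both functions return the same value; they can differ only when the intermediate product needs rounding.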

Since I don't know what Objective-C [and Objective-C++] have to say about FP operations, I have made my patch very selective based on language. Also, I noticed that the CUDA front-end seems to already have its own defaults for FP contraction, so there's no need to change this for every language.

I needed to change one test case because it made an assumption that FP contraction is off by default when compiling with "-O3" but without any additional optimization-related flags.

Patch relative to upstream code with Git ID b0768e805d1d33d730e5bd44ba578df043dfbc66

Gating this on -Owhatever is dangerous. We should simply default to the pragma “on” state universally.

– Steve

Why so? [honestly asking, not arguing]

My guess: b/c we don't want programs to give different results when compiled at different "-O<...>" settings with the exception of "-Ofast".

At any rate, the above change is trivial to apply to my recent proposed patch: just remove the "&& (Res.getCodeGenOpts().OptimizationLevel >= 3)" part of the condition.

Regards,

Abe

Pretty much. In particular, imagine a user trying to debug an unexpected floating point result caused by conversion of a*b + c into fma(a, b, c).

[Stephen Canon wrote:]

Gating this on -Owhatever is dangerous. We should simply default to the pragma “on” state universally.

[Abe wrote:]

Why so? [honestly asking, not arguing]
My guess: b/c we don't want programs to give different results when compiled at different "-O<...>" settings with the exception of "-Ofast".

[Steve Canon wrote:]

Pretty much. In particular, imagine a user trying to debug an unexpected floating point result caused by conversion of a*b + c into fma(a, b, c).

I strongly agree with that philosophy. I tried arguing that GCC should change its policy, but I was rebuffed ["RESOLVED INVALID"].

For reference: GCC bug 77515 – GCC fusing of multiply-add ["FMA"] occurring at "-O3" withOUT "-ffast-math" and withOUT "-ffp-contract=fast"

Regards,

Abe

I think that’s unavoidable, because of the way the optimization levels work. Even if fma contraction is on by default (something I’d like to see), at -O0, we wouldn’t be doing contraction for:

auto x = a*b;
auto y = x+c;

but we would do that at -O2 since we do mem2reg on x.

-Chris

In C, we don’t contract (the equivalent of) this unless we’re passed fp-contract=fast. The pragma only licenses contraction within a statement.
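The distinction can be sketched in C as follows. The function names are hypothetical; the comments restate the behavior described above rather than anything verified against a particular compiler version.

```c
/* Illustrative only: both function names are hypothetical. */

#pragma STDC FP_CONTRACT ON

/* Multiply and add appear in one expression: the pragma licenses
   contracting them into a single fused multiply-add. */
double within_statement(double a, double b, double c) {
    return a * b + c;
}

/* Multiply and add are in separate statements: per the description
   above, this form is not contracted under the pragma alone and
   would need fp-contract=fast. */
double across_statements(double a, double b, double c) {
    double x = a * b;
    double y = x + c;
    return y;
}
```

When the product is exactly representable, the two forms agree regardless of whether contraction happens; the difference only becomes observable for inputs where the intermediate product must be rounded.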

IIRC, the situation in C++ is somewhat different, and the standard allows contraction across statement boundaries, though I don’t think we take advantage of it at present.

You’re definitely correct that there will still be differences; e.g.:

x = a*b + c;
y = a*b;

It might be that at some optimization level we prove y is unused / constant / etc. When targeting a machine where fma is costlier than mul, we generate mul+add in one case and fma in the other. These cases are necessarily rarer than if we gate it on optimization level, however. (And we want the perf win for -O0 anyway).

TLDR: yeah, let’s do this.

In C, we don’t contract (the equivalent of) this unless we’re passed fp-contract=fast. The pragma only licenses contraction within a statement.

Ah ok. What’s GCC’s policy on this?

IIRC, the situation in C++ is somewhat different, and the standard allows contraction across statement boundaries, though I don’t think we take advantage of it at present.

Is language-standard pedantry what we want to base our policies on? It’s great to not violate the standard of course, but it would be suboptimal for switching a .c file to .cpp to change its behavior. I’m not sure which way this cuts on this topic though, or if the cost is worth bearing.

TLDR: yeah, let’s do this.

Nice :)

-Chris

The now-ungated-by-O3-or-higher patch passes with no new unexpected failures when run on Ubuntu 14.04.1 on a Xeon-based server in 64-bit mode. [No known unexpected failures when testing on any other platform.]

-- Abe

Oops. I did a "make check" when I _should_ have done a "make check-all". Some test cases _are_ broken. I will work on fixing them [as well as finishing my new test cases that will test in the future that the WIP improvement will not have regressed] and report again later.

-- Abe

Dear all,

I have added 4 test cases that all fail on the "vanilla" [i.e. unmodified] compiler and succeed with my patch applied. Please see below, presented for comments/feedback.

The only difference across the non-O0 files is the -O<something> flag; would other people prefer that I factor this out into one include file and 3 short stubs, if I can?

The only difference other than -O<something> between the O0 test and all the rest is that in the -O0 case I have removed the "CHECK-NEXT" for "ret" immediately following the "fmadd" b/c at -O0 the optimizer is not eliminating the boilerplate stack-related code that in this case is not truly needed.

Regards,

Abe

diff --git a/clang/test/CodeGen/fp-contract-pragma___on-by-default___-O0___aarch64-backend.c b/clang/test/CodeGen/fp-contract-pragma___on-by-default___-O0___aarch64-backend.c
new file mode 100644
index 0000000..fd4a979
--- /dev/null
+++ b/clang/test/CodeGen/fp-contract-pragma___on-by-default___-O0___aarch64-backend.c
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -triple aarch64 -O0 -S -o - %s | FileCheck %s
+// REQUIRES: aarch64-registered-target