Hi all,
In a mailing-list post last November:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
I raised some concerns that having the IR-level fast-math-flag ‘fast’ act as an
“umbrella” that implicitly turns on all the lower-level fast-math-flags causes
some fundamental problems. Those problems arise in situations where a user
wants to disable a portion of the fast-math behavior.
For example, to enable all the fast-math transformations except for the
reciprocal-math transformation, a command like the following is what a user
would expect to work:
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
But that is not what the compiler actually does.
I believe this is a serious problem, but I also want to avoid over-stating the
seriousness. To be explicit, the problems I’m describing here happen when
‘-ffast-math’ is used with one or more of the underlying fast-math-related
aspects disabled (like the ‘-fno-reciprocal-math’ example, above).
Conversely, when ‘-ffast-math’ is used “on its own”, the situation is fine.
For terminology here, I’ll refer to these underlying fast-math-related aspects
(like reciprocal-math, associative-math, math-errno, and others) as
“sub-fast-math” aspects.
I apologize for the length of this post. I’m putting the summary up front, so
that anyone interested in fast-math issues can quickly get the big-picture of
the issues I’m describing here.
In Summary:
- With the change of r297837, the driver now more cleanly handles
‘-ffast-math’ and the other sub-fast-math switches (like
‘-f[no-]reciprocal-math’, ‘-f[no-]math-errno’, and others).
- Prior to that change, the disabling of a sub-fast-math switch was often
ineffective. So as an example, the following two commands often resulted
in the same code-gen, even when reciprocal transformations were being
applied:
clang++ -O2 -ffast-math -c foo.cpp
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
- Since that change, the disabling of a sub-fast-math switch disables many
more sub-fast-math transformations than just the one specified. So now,
the following two commands often result in very similar (and sometimes
identical) code-gen:
clang++ -O2 -c foo.cpp
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
That is, disabling a single sub-fast-math transformation in some (many?)
cases now ends up disabling almost all the fast-math transformations.
This causes a performance hit for users who have been relying on such
command lines.
- To fix this, I think that additional fast-math-flags are likely needed in
the IR. Instead of the following set:
‘nnan’ + ‘ninf’ + ‘nsz’ + ‘arcp’ + ‘contract’
something like this:
‘reassoc’ + ‘libm’ + ‘nnan’ + ‘ninf’ + ‘nsz’ + ‘arcp’ + ‘contract’
would be more useful. Related to this, the current ‘fast’ flag which acts
as an umbrella (enabling ‘nnan’ + ‘ninf’ + ‘nsz’ + ‘arcp’ + ‘contract’) may
not be needed. A discussion on this point was raised last November on the
mailing list:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
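As a rough illustration of that last point, here is a minimal sketch of what
per-aspect flags buy us. It is only a model: the bit values and the helper
names (FMF, withoutFlag, hasFlag) are hypothetical and are not LLVM's actual
encoding. Only the flag names come from the lists above. With one bit per
aspect and no umbrella, disabling reciprocal-math clears exactly one bit and
leaves reassociation alone.

```cpp
// Hypothetical model of per-aspect fast-math-flags. The bit values and helper
// names are invented for illustration; they are not LLVM's actual encoding.
enum FMF : unsigned {
  NNan     = 1u << 0,
  NInf     = 1u << 1,
  NSz      = 1u << 2,
  Arcp     = 1u << 3,
  Contract = 1u << 4,
  Reassoc  = 1u << 5,  // proposed in this post
  Libm     = 1u << 6,  // proposed in this post
};

// Every aspect enabled: the state a plain '-ffast-math' would map to.
constexpr unsigned kAllFlags =
    NNan | NInf | NSz | Arcp | Contract | Reassoc | Libm;

// Disabling one aspect clears exactly one bit and nothing else.
constexpr unsigned withoutFlag(unsigned flags, FMF f) {
  return flags & ~static_cast<unsigned>(f);
}

constexpr bool hasFlag(unsigned flags, FMF f) {
  return (flags & static_cast<unsigned>(f)) != 0;
}

// Model of '-ffast-math -fno-reciprocal-math' under the proposed flag set.
constexpr unsigned kFastNoRecip = withoutFlag(kAllFlags, Arcp);

static_assert(hasFlag(kFastNoRecip, Reassoc), "reassociation stays enabled");
static_assert(!hasFlag(kFastNoRecip, Arcp), "only reciprocal-math is disabled");
```

In this model there is nothing for an umbrella flag to do, which is the sense
in which ‘fast’ may not be needed.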
TL;DR
More details are in that thread from November, but the problem in its entirety
involved both back-end LLVM issues, and front-end Clang (driver) issues. The
LLVM issues are related to the umbrella aspect of ‘fast’, along with other
fast-math-flags implementation details (described below). The front-end
aspects in Clang are related to the driver’s handling of ‘-ffast-math’ (which
also had an “umbrella” aspect). The driver code has been refactored since that
November post, fixing the umbrella aspect of the front-end. But I never got
around to working on the related back-end issues (nor has anyone else), and the
refactored front-end now results in the back-end issues manifesting
differently, and arguably in a worse way (details on the “worse” aspect,
below).
For reference, the refactored driver code was done in r297837:
[Driver] Restructure handling of -ffast-math and similar options
To be clear, I’m not at all suggesting that the above change was incorrect. I
think that refactoring of the driver code is the right thing to do. An aspect
of this refactoring is that prior to it, when a user passed ‘-ffast-math’ on
the command-line, it was also passed to the cc1 process, even if a
sub-fast-math component was disabled. With the refactoring, the driver only
passes ‘-ffast-math’ to cc1 when a specific set of sub-fast-math components are
enabled.
More specifically, when a user specifies just ‘-ffast-math’ on the
command-line, the following 7 sub-fast-math switches:
-fno-honor-infinities
-fno-honor-nans
-fno-math-errno
-fassociative-math
-freciprocal-math
-fno-signed-zeros
-fno-trapping-math
get passed to cc1 (this is true both with the old (pre r297837) and new (since
r297837) compilers). Furthermore, the “umbrella” ‘-ffast-math’ is also passed
to cc1 in this case of the user specifying just ‘-ffast-math’ on the
command-line (again, in both the old and new compilers).
The difference between the old and new behavior that matters here is that when
a user turns on fast-math but disables one (or more) of the sub-fast-math
switches, for example, as in:
clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp
then in the old mode ‘-ffast-math’ was still passed to cc1 (acting as an
umbrella, causing trouble), but in the new mode ‘-ffast-math’ is no longer
passed to cc1 in this case. (In both the old and new modes,
‘-freciprocal-math’ is not passed to cc1 with this command-line, as you’d
expect.)
What’s happening is that in the old mode, it was the user passing ‘-ffast-math’
on the command-line that resulted in passing the umbrella ‘-ffast-math’ to cc1
(even if all 7 of the sub-fast-math switches were disabled by the user).
Whereas in the new mode, the ‘-ffast-math’ switch is passed to cc1 iff all 7 of
the underlying sub-fast-math switches are enabled.
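The old and new driver rules for forwarding the umbrella can be sketched as
follows. This is a simplified model of the behavior described above, not the
driver's real code: the struct and function names are hypothetical, and the
only facts taken from this post are the seven switches and the two rules.

```cpp
// Hypothetical model of the driver decision; names are illustrative and are
// not Clang's actual code. Each field is true when that switch reaches cc1.
struct SubFastMath {
  bool noHonorInfinities;  // -fno-honor-infinities
  bool noHonorNans;        // -fno-honor-nans
  bool noMathErrno;        // -fno-math-errno
  bool associativeMath;    // -fassociative-math
  bool reciprocalMath;     // -freciprocal-math
  bool noSignedZeros;      // -fno-signed-zeros
  bool noTrappingMath;     // -fno-trapping-math
};

// The state produced by a plain '-ffast-math': all seven switches enabled.
constexpr SubFastMath allEnabled() {
  return SubFastMath{true, true, true, true, true, true, true};
}

constexpr bool allSevenEnabled(const SubFastMath &s) {
  return s.noHonorInfinities && s.noHonorNans && s.noMathErrno &&
         s.associativeMath && s.reciprocalMath && s.noSignedZeros &&
         s.noTrappingMath;
}

// Old rule (pre r297837): the umbrella is forwarded to cc1 whenever the user
// wrote '-ffast-math', regardless of which sub-switches were then disabled.
constexpr bool oldPassesUmbrella(bool userWroteFastMath, const SubFastMath &) {
  return userWroteFastMath;
}

// New rule (since r297837): the umbrella is forwarded iff all seven
// sub-fast-math switches are enabled.
constexpr bool newPassesUmbrella(const SubFastMath &s) {
  return allSevenEnabled(s);
}
```

For ‘-ffast-math -fno-reciprocal-math’, set reciprocalMath to false in the
result of allEnabled(): the old rule still forwards the umbrella, and the new
rule does not.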
I’d say that’s an improvement in the handling of the switches. Also on the
plus side, I think it makes the LLVM concerns I raised in November a little
clearer, and so more manageable in some sense. But on the negative side, since
the new behavior is arguably worse for users, fixing the back-end issues is now
a higher priority for my customers.
The behavior that is arguably worse is this: when a user enables fast-math but
attempts to disable one of the sub-fast-math aspects, the old behavior (pre
r297837) was that the aspect to be disabled often remained enabled. The new
behavior (since r297837) is that disabling a sub-fast-math aspect disables that
aspect plus many more (possibly the majority) of the fast-math transformations.
This results in a performance regression in these fast-math contexts when a
sub-fast-math aspect is disabled, which is why it is a fairly high priority for
us.
FTR, r297837 was made during LLVM 5.0 development, so the new behavior has the
effect of a performance regression in moving from 4.0 to 5.0. In describing
things here, I’ll compare LLVM 4.0 with LLVM 5.0 behavior. More precisely, the
comparison is between pre-r297837 and post-r297837 behavior.
Here is a tiny example, to illustrate it concretely:
$ cat assoc.cpp
//////////// “assoc.cpp” ////////////
float foo(float a, float x)
{
  return ((a + x) - x); // fastmath reassociation eliminates the arithmetic
}
/////////////////////////////////////
$
When -ffast-math is specified, the reassociation enabled by it allows us to
simply return the first argument (and that reassociation does happen with
‘-ffast-math’, with both the old and new compilers):
$ clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$
FTR, GCC also does the reassociation transformation here when ‘-ffast-math’ is
used, as expected.
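For anyone wondering why this rewrite needs a fast-math switch at all: in
strict single-precision arithmetic, (a + x) - x is not always equal to a. The
small sketch below shows this (the function name is hypothetical, and the
volatile temporaries are only there to force each intermediate to be rounded
to float and to keep the compiler from folding the expression).

```cpp
// Evaluates (a + x) - x with each intermediate rounded to float, i.e. without
// any reassociation. The function name is illustrative only.
float strictAddThenSub(float a, float x) {
  volatile float sum = a + x;     // rounded to single precision here
  volatile float diff = sum - x;  // and again here
  return diff;
}
```

With a = 1.0f and x = 1.0e8f, the sum rounds back to 1.0e8f (adjacent floats
near 1.0e8 are 8 apart), so the strict result is 0.0f, whereas the
reassociated form returns 1.0f. With small values such as a = 1.0f and
x = 2.0f, both forms give 1.0f.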
But when using ‘-ffast-math’ and disabling a sub-fast-math aspect of it (say
via ‘-fno-reciprocal-math’, ‘-fno-associative-math’, or ‘-fmath-errno’), both
the old and new compilers exhibit incorrect behavior in some cases. With the
old compiler, the behavior was that using any of these switches did not disable
the transformation. Those switches were mostly ineffective. (Only
‘-fno-associative-math’ should disable the transformation in this example, so
the fact that the other ones didn’t disable it is correct/desired.) Here is
the old behavior for the above test-case, when some example sub-fast-math
aspects are individually disabled:
$ old/bin/clang --version | grep version
clang version 4.0.0 (tags/RELEASE_400/final)
$ old/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$
So with the old compiler, the case marked ‘Error’ above is incorrect, in that
the reassociation should be suppressed in that case, but it isn’t.
Again FTR, GCC disables the reassociation in the case marked ‘Error’ above.
Moving on to the new compiler, instead of ‘-fno-associative-math’ being
ineffective, the problem is that when disabling other sub-fast-math aspects
(unrelated to reassociation), the transformation is suppressed, when it should
not be. Here is the new behavior with that same set of sub-fast-math aspects
individually disabled:
$ new/bin/clang --version | grep version
clang version 5.0.0 (tags/RELEASE_500/final)
$ new/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
0: c3 retq
$ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Good
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*: "
0: f3 0f 58 c1 addss %xmm1, %xmm0
4: f3 0f 5c c1 subss %xmm1, %xmm0
8: c3 retq
$
The two cases marked as ‘Error’ are incorrectly suppressing the re-association.
The case marked as ‘Good’ is now doing the right thing for this test-case.
Again FTR, GCC performs the reassociation in the cases marked ‘Error’ above.