Trouble when suppressing a portion of the fast-math transformations

Hi all,

In a mailing-list post last November:

http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

I raised some concerns that having the IR-level fast-math-flag 'fast' act as an "umbrella" that implicitly turns on all the lower-level fast-math-flags causes some fundamental problems. Those problems arise in situations where a user wants to disable a portion of the fast-math behavior.

For example, to enable all the fast-math transformations except for the reciprocal-math transformation, a command like the following is what a user would expect to work:

clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

But that isn't what currently happens.

I believe this is a serious problem, but I also want to avoid over-stating the seriousness. To be explicit, the problems I'm describing here happen when '-ffast-math' is used with one or more of the underlying fast-math-related aspects disabled (like the '-fno-reciprocal-math' example above). Conversely, when '-ffast-math' is used "on its own", the situation is fine. For terminology here, I'll refer to these underlying fast-math-related aspects (like reciprocal-math, associative-math, math-errno, and others) as "sub-fast-math" aspects.

I apologize for the length of this post. I'm putting the summary up front, so that anyone interested in fast-math issues can quickly get the big picture of the issues I'm describing here.

In Summary:

  1. With the change of r297837, the driver now more cleanly handles '-ffast-math' and the other sub-fast-math switches (like '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).

  2. Prior to that change, disabling a sub-fast-math switch was often ineffective. So, as an example, the following two commands often resulted in the same code-gen, even when fast-math reciprocal transformations were performed:

     clang++ -O2 -ffast-math -c foo.cpp
     clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

  3. Since that change, disabling a sub-fast-math switch disables many more sub-fast-math transformations than just the one specified. So now, the following two commands often result in very similar (and sometimes identical) code-gen:

     clang++ -O2 -c foo.cpp
     clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

     That is, disabling a single sub-fast-math transformation in some (many?) cases now ends up disabling almost all the fast-math transformations. This causes a performance hit for people who have been doing this.

  4. To fix this, I think that additional fast-math-flags are likely needed in the IR. Instead of the following set:

     'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

     something like this:

     'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

     would be more useful. Related to this, the current 'fast' flag, which acts as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'), may not be needed. A discussion on this point was raised last November on the mailing list:

     http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

That was the TL;DR; the details follow.

More details are in that thread from November, but the problem in its entirety involved both back-end LLVM issues and front-end Clang (driver) issues. The LLVM issues are related to the umbrella aspect of 'fast', along with other fast-math-flags implementation details (described below). The front-end aspects in Clang are related to the driver's handling of '-ffast-math' (which also had an "umbrella" aspect). The driver code has been refactored since that November post, fixing the umbrella aspect of the front-end. But I never got around to working on the related back-end issues (nor has anyone else), and the refactored front-end now results in the back-end issues manifesting differently, and arguably in a worse way (details on the "worse" aspect below).

For reference, the refactored driver code was done in r297837:

[Driver] Restructure handling of -ffast-math and similar options

To be clear, I'm not at all suggesting that the above change was incorrect. I think that refactoring of the driver code is the right thing to do. An aspect of this refactoring is that, prior to it, when a user passed '-ffast-math' on the command-line, it was also passed to the cc1 process, even if a sub-fast-math component was disabled. With the refactoring, the driver only passes '-ffast-math' to cc1 when a specific set of sub-fast-math components are enabled.

More specifically, when a user specifies just '-ffast-math' on the command-line, the following 7 sub-fast-math switches:

-fno-honor-infinities
-fno-honor-nans
-fno-math-errno
-fassociative-math
-freciprocal-math
-fno-signed-zeros
-fno-trapping-math

get passed to cc1 (this is true both with the old (pre-r297837) and new (since r297837) compilers). Furthermore, the "umbrella" '-ffast-math' is also passed to cc1 in this case of the user specifying just '-ffast-math' on the command-line (again, in both the old and new compilers).

The difference in the old/new behavior related to this issue is that when a user turns on fast-math but disables one (or more) of the sub-fast-math switches, for example:

clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

then in the old mode '-ffast-math' was still passed to cc1 (acting as an umbrella, causing trouble), but in the new mode '-ffast-math' is no longer passed to cc1 in this case. (In both the old and new modes, '-freciprocal-math' is not passed to cc1 with this command-line, as you'd expect.)

What's happening is that in the old mode, it was the user passing '-ffast-math' on the command-line that resulted in passing the umbrella '-ffast-math' to cc1 (even if all 7 of the sub-fast-math switches were disabled by the user). Whereas in the new mode, the '-ffast-math' switch is passed to cc1 iff all 7 of the underlying sub-fast-math switches are enabled.
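(An aside, not from the original post: a convenient way to see exactly which of these switches the driver forwards is clang's -### option, which prints the cc1 invocation without running it. For example:

clang++ -### -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

With the new driver, per the behavior described above, the printed cc1 command-line should contain the individual sub-fast-math switches that remain enabled, but not '-ffast-math'; with plain '-ffast-math', it should contain '-ffast-math' as well.)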

I'd say that's an improvement in the handling of the switches. Also on the plus side, I think it makes dealing with the concerns I raised in November a little clearer, and so more manageable in some sense. But on the negative side, since the new behavior in LLVM is arguably worse, fixing the back-end issues is now a higher priority for my customers.

The behavior that is arguably worse is this: when a user enables fast-math but attempts to disable one of the sub-fast-math aspects, the old behavior (pre-r297837) was that the sub-fast-math aspect to be disabled generally (often?) remained enabled. The new behavior (since r297837) is that when disabling a sub-fast-math aspect, that aspect plus many more (possibly often the majority) of the fast-math transformations are disabled. So this results in a performance regression in these fast-math contexts when a sub-fast-math aspect is disabled, which is why it is a fairly high priority for us.

FTR, r297837 was made during llvm 5.0 development, so the new behavior has the effect of a performance regression in moving from 4.0 to 5.0. In describing things here, I'll compare llvm 4.0 with llvm 5.0 behavior. But more precisely, it's pre-r297837 versus post-r297837 behavior.

Here is a tiny example, to illustrate it concretely:

$ cat assoc.cpp
//////////// "assoc.cpp" ////////////
float foo(float a, float x)
{
  return ((a + x) - x); // fastmath reassociation eliminates the arithmetic
}
/////////////////////////////////////
$

When -ffast-math is specified, the reassociation enabled by it allows us to simply return the first argument (and that reassociation does happen with '-ffast-math', with both the old and new compilers):

$ clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: f3 0f 58 c1   addss  %xmm1, %xmm0
       4: f3 0f 5c c1   subss  %xmm1, %xmm0
       8: c3            retq
$ clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: c3            retq
$
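(Aside, not in the original post: the same effect can be observed at the IR level, which is where the fast-math flags under discussion actually live. Compiling with something like

clang -S -emit-llvm -O0 -ffast-math -o - assoc.cpp

should show the fadd/fsub instructions carrying the 'fast' flag, and at -O2 the optimized IR folds the body down to simply returning the first argument.)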

FTR, GCC also does the reassociation transformation here when '-ffast-math' is used, as expected.

But when using '-ffast-math' and disabling a sub-fast-math aspect of it (say via '-fno-reciprocal-math', '-fno-associative-math', or '-fmath-errno'), both the old and new compilers exhibit incorrect behavior in some cases. With the old compiler, the behavior was that using any of these switches did not disable the transformation; those switches were mostly ineffective. (Only '-fno-associative-math' should disable the transformation in this example, so the fact that the other ones didn't disable it is correct/desired.) Here is the old behavior for the above test-case, when some example sub-fast-math aspects are individually disabled:

$ old/bin/clang --version | grep version
clang version 4.0.0 (tags/RELEASE_400/final)
$ old/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: f3 0f 58 c1   addss  %xmm1, %xmm0
       4: f3 0f 5c c1   subss  %xmm1, %xmm0
       8: c3            retq
$ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: c3            retq
$ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: c3            retq
$ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp   # Error
$ llvm-objdump -d x.o | grep "^ .*: "
       0: c3            retq
$ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: c3            retq
$

So with the old compiler, the case marked 'Error' above is incorrect, in that the reassociation should be suppressed in that case, but it isn't.

Again FTR, the GCC behavior disables the re-association in the case marked 'Error' above.

Moving on to the new compiler: instead of '-fno-associative-math' being ineffective, the problem is that when disabling other sub-fast-math aspects (unrelated to reassociation), the transformation is suppressed when it should not be. Here is the new behavior with that same set of sub-fast-math aspects individually disabled:

$ new/bin/clang --version | grep version
clang version 5.0.0 (tags/RELEASE_500/final)
$ new/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: f3 0f 58 c1   addss  %xmm1, %xmm0
       4: f3 0f 5c c1   subss  %xmm1, %xmm0
       8: c3            retq
$ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*: "
       0: c3            retq
$ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp   # Error
$ llvm-objdump -d x.o | grep "^ .*: "
       0: f3 0f 58 c1   addss  %xmm1, %xmm0
       4: f3 0f 5c c1   subss  %xmm1, %xmm0
       8: c3            retq
$ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp   # Good
$ llvm-objdump -d x.o | grep "^ .*: "
       0: f3 0f 58 c1   addss  %xmm1, %xmm0
       4: f3 0f 5c c1   subss  %xmm1, %xmm0
       8: c3            retq
$ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp   # Error
$ llvm-objdump -d x.o | grep "^ .*: "
       0: f3 0f 58 c1   addss  %xmm1, %xmm0
       4: f3 0f 5c c1   subss  %xmm1, %xmm0
       8: c3            retq
$

The two cases marked as 'Error' are incorrectly suppressing the re-association. The case marked as 'Good' is now doing the right thing for this test-case.

Again FTR, the GCC behavior allows the re-association to happen in the cases marked 'Error' above.

Hi, Warren,

Thanks for writing all of this up. In short, regarding your suggested solution:

> 4. To fix this, I think that additional fast-math-flags are likely needed in the IR. Instead of the following set:
>
> 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>
> something like this:
>
> 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>
> would be more useful. Related to this, the current 'fast' flag which acts as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may not be needed. A discussion on this point was raised last November on the mailing list:
>
> http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

I agree. I'm happy to help review the patches. It will be best to have only the finer-grained flags, with no "fast" flag that implies all of the others.

  -Hal

Hi Hal,

>> 4. To fix this, I think that additional fast-math-flags are likely needed in the IR. Instead of the following set:
>>
>> 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>>
>> something like this:
>>
>> 'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
>>
>> would be more useful. Related to this, the current 'fast' flag which acts as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may not be needed. A discussion on this point was raised last November on the mailing list:
>>
>> http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html
>
> I agree. I'm happy to help review the patches. It will be best to have only the finer-grained flags, with no "fast" flag that implies all of the others.

Thanks for the quick response, and for the willingness to review. I won't let
this languish so long, like the post from last November.

Happy to hear that you feel it's best not to have the umbrella "fast" flag.

Thanks again,
-Warren

Are we confident that we just need those 7 bits to represent all of the relaxed FP states that we need/want to support?

I'm asking because FMF in IR is currently mapped onto the SubclassOptionalData of Value... and we have exactly 7 bits there. :)

If we’re redoing the definitions, I’m wondering if we can share the struct with the backend’s SDNodeFlags, but that already has one extra bit for vector reduction. Should we give up on SubclassOptionalData for FMF? We have a MD_fpmath enum value for metadata, so we could move things over there?

I’m not aware of any additional bits needed. But putting us right at the edge leaves me uncomfortable. So an implementation that isn’t limited by the 7 bits in SubclassOptionalData seems sensible.

Thanks,

-Warren
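(For reference, and not part of the original exchange: the field being discussed is declared as a 7-bit bitfield in llvm/IR/Value.h, which is why exactly 7 flags fit there today. A paraphrase of the constraint, not the verbatim LLVM declaration:)

// Paraphrase only: the per-subclass optional data on Value is a 7-bit bitfield,
// so at most 7 independent fast-math flags can live there.
struct ValueOptionalDataSketch {
  unsigned char SubclassOptionalData : 7; // the 7 bits that FMF currently occupies
};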

Is there anything that means, in particular, “go fast, even if it means not all bits are significant”?

I'm currently working on an llvm-based compiler for a GPU that is optimised for OpenGL, where 16-bit FP may not be quite accurate enough (or may be, in some cases), but 32-bit FP is overkill. A lot of the fast built-in operations end up with a few junk bits at the end (not add/sub/mul, but divide is available only using reciprocal).

When implementing OpenCL, the specs and conformance tests require full IEEE accuracy. In some cases this requires a round of Newton-Raphson to clean up the accuracy, which is a significant though maybe not crippling performance penalty. But in other cases we need to do a lot of range reduction, some polynomial evaluation, and then map the result back out of the reduced range. This can be an order of magnitude or more slower than using the not-quite-accurate-enough built-in instruction.

The OpenCL spec defines a number of compile options controlling optimizations. Some seem to map well onto the flags already discussed here:

-cl-mad-enable
-cl-no-signed-zeros
-cl-finite-math-only

However, it looks to me like the following ones don't presently map well to LLVM:

-cl-unsafe-math-optimizations
Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid, (b) may violate the IEEE 754 standard and (c) may violate the OpenCL numerical compliance requirements as defined in the SPIR-V OpenCL environment specification for single precision and double precision floating-point, and edge case behavior in the SPIR-V OpenCL environment specification. This option includes the -cl-no-signed-zeros and -cl-mad-enable options.

-cl-fast-relaxed-math
Sets the optimization options -cl-finite-math-only and -cl-unsafe-math-optimizations. This allows optimizations for floating-point arithmetic that may violate the IEEE 754 standard and the OpenCL numerical compliance requirements for single precision and double precision floating-point, as well as floating point edge case behavior. This option also relaxes the precision of commonly used math functions. This option causes the preprocessor macro __FAST_RELAXED_MATH__ to be defined in the OpenCL program. The original and modified values are defined in the SPIR-V OpenCL environment specification.

I’d like to emphasise in the latter one: “This option also relaxes the precision of commonly used math functions.”

I agree that using SubclassOptionalData is going to be problematic when we run out of bits. As I recall, the reason that we didn't use metadata in the first place was because metadata is (generically) expensive. This case is very much like the case of debug info: in some modes, we add the debugging metadata to nearly every instruction. We use metadata for debug locations, syntactically, but we actually have a DebugLoc in each instruction that's used for the underlying representation. Here we'd have a similar problem: in some modes, we'd add this metadata to a large subset of all instructions. That could measurably slow down the optimizer. We may need to find some other place to put the data (e.g., an actual member variable of Instruction or more tail-allocated data in places).

-Hal

This is what arcp is for (implying that you can use the reciprocal estimate and not worry about getting the exact answer). Now there's a separate question about how many Newton iterations to use, and we have a separate flag for that (-mrecip=...). Check out the implementation of TargetLoweringBase::getRecipEstimateSqrtEnabled to see how it's set up in the backend. This is, however, per function, so we don't currently have a per-operation control on this. I think the idea is that this flag, like -funsafe-math-optimizations, gets mapped to an appropriate collection of finer-grained flags internally.

Isn't this the "libm" flag that is proposed in this thread?

-Hal
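(An illustrative aside, not from the original thread: -mrecip= is the existing command-line control mentioned above; its exact accepted syntax is documented in the clang manual, but an invocation along these lines requests reciprocal-square-root estimates refined with one Newton-Raphson step:

clang -O2 -ffast-math -mrecip=sqrtf:1 -c foo.cpp

As noted above, this is a per-function, command-line control, not a per-operation one.)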

I don't know. I didn't see any definition of it.

In my case I'm talking about allowing the use of lower precision but very
fast machine instructions instead of a slow sequence of inline
instructions. But I guess instruction vs library is equivalent.

>>> I'd like to emphasise in the latter one: "This option also relaxes the precision of commonly used math functions."
>>
>> Isn't this the "libm" flag that is proposed in this thread?
>
> I don't know. I didn't see any definition of it.
>
> In my case I'm talking about allowing the use of lower precision but very fast machine instructions instead of a slow sequence of inline instructions. But I guess instruction vs library is equivalent.

I haven't defined "libm" explicitly. The concepts of "reassoc" and "libm" are a result of the discussion last November. The "libm" aspect, in particular, came from:

http://lists.llvm.org/pipermail/llvm-dev/2016-November/107114.html

It was intended to mean actual library functions, which is what I thought you were referring to when you said “This option also relaxes the precision of commonly used math functions.” From your elaboration describing it as a sequence of inline instructions, I think whether “libm” is the right flag to control it would depend on what that sequence of instructions were doing. If they were implementing ‘pow()’ or ‘sqrt()’ or ‘sin()’ etc., then yes, I think “libm” would be the right flag. If something else (unrelated to functions generally in libm), then probably some other flag (or set of flags) would control whether a transformation was done.

More generally, my intended approach for this change (removing the "fast" umbrella flag, and adding the two new flags "reassoc" and "libm") is to audit all the places that currently enable an optimization based on whether 'hasUnsafeAlgebra()' is true (which essentially asks whether all the existing FastMathFlags are enabled), and see which of them can/should be controlled by one flag (or a subset) of the full set. So it's possible that a particular slow sequence of inline instructions you are transforming would be controlled by "libm", and a different slow sequence of inline instructions would be controlled by some other flag (or flags).

-Warren

> I agree that using SubclassOptionalData is going to be problematic when we run out of bits. ...

I don't have a good view of the big-picture here (in terms of the cost of size and speed of the metadata approach, vs a member of Instruction, vs something else).

We could tackle this problem now, or defer it, hoping that we're not going to want to add more flags for finer-grained control of fast-math transformations in the future. It seems the general sense is that we should tackle it now. Sanjay and Hal, is that your view?

-Warren

I think that it might be a good idea, but I don't have an opinion on ordering. They should be separate patches if possible (and, if it is not possible because we've run out of bits, then obviously the decision has been made for us).

-Hal

I agree. We should give the flags semantic meanings. Whether or not something is generally a function call is irrelevant. libm, if we use that name, refers to the functions in libm, and perhaps close extensions, regardless of how the target generally generates code for them. It might be clearer, instead of using 'libm', to use something like 'trans' (for transcendental functions).

-Hal

>> I agree that using SubclassOptionalData is going to be problematic when we run out of bits. ...
>
> I don't have a good view of the big-picture here (in terms of the cost of size and speed of the metadata approach, vs a member of Instruction, vs something else).
>
> We could tackle this problem now, or defer it, hoping that we're not going to want to add more flags for finer-grained control of fast-math transformations in the future. It seems the general sense is that we should tackle it now.
>
> Sanjay and Hal, is that your view?
>
> I think that it might be a good idea, but I don't have an opinion on ordering. They should be separate patches if possible (and, if it is not possible because we've run out of bits, then obviously the decision has been made for us).

Yes, it seems like we're going to hit that wall sooner or later, so I'd try
to get that part done first since that should have no visible effect on the
final codegen.

I have no sense of what the fastest implementation would be or how anyone
would know without just trying something and timing it.

> It might be clearer, instead of using 'libm', to use something like 'trans' (for transcendental functions).

That does seem clearer. ‘trans’ is definitely good with me.

-Warren

Hi Warren -

IIUC, we need to audit/fix the current transforms that use FMF to respond to the individual flags, and those fixes can occur in parallel. They might cosmetically conflict with the redefining/replumbing work on the flags struct, but that shouldn’t be too hard to fix?

Let me know what I can do to help without interfering with any patches you might have in progress. Also, if you’ll be at the conference next week, we could find some time for interested parties to meet to discuss/hack this more if needed.

Hi Sanjay,

> IIUC, we need to audit/fix the current transforms that use FMF to respond to the individual flags, and those fixes can occur in parallel. They might cosmetically conflict with the redefining/replumbing work on the flags struct, but that shouldn't be too hard to fix?

Yes, the overall goal is to audit/fix the current transforms that use FMF to use the updated set of flags, which includes not using the “umbrella” flag ‘fast’ (eliminating the concept of the umbrella).

I agree with you that this can proceed in parallel with the redefining/replumbing (to deal with the “umbrella” aspect of ‘fast’, and with the 7-bit limitation of the current implementation that you pointed out in the SubclassOptionalData of Value). But that said, it feels to me like it’s going to get confusing to deal with them in parallel.

I see three aspects of the overall task:

  1. Remove the "umbrella" 'fast' LLVM IR flag, and add two new flags ('reassoc' and 'trans'). In this portion, the 'UnsafeAlgebra' bit in FastMathFlags would be removed, and two new bits would be added (for 'reassoc' and 'trans'). Currently, 'FastMathFlags::setUnsafeAlgebra()' sets the 'UnsafeAlgebra' bit in FastMathFlags, along with the other 5 bits that already exist ('NoNaNs', 'NoInfs', 'NoSignedZeros', 'AllowReciprocal' and 'AllowContract'). With this change, those other 5 would still be set, and the two new ones for 'reassoc' and 'trans' would also be set. The method 'FastMathFlags::unsafeAlgebra()' would check that all the bits were on, rather than just the one 'UnsafeAlgebra' bit (which will no longer exist). A related change would be needed for the handling of 'Value::SubclassOptionalData' (to check for all 7 bits, rather than just the one no-longer-existing 'FastMathFlags::UnsafeAlgebra' bit). A rough sketch of this follows the list below.

  2. Refactor the implementation to remove the 7-bit limitation. That is, don’t use ‘Value::SubclassOptionalData’, and instead carry this information via some other approach. A new member variable of ‘Instruction’ seems sensible to me. But I don’t feel I have the experience to judge if that’s the right way (from an efficiency of implementation perspective).

  3. Audit all the uses of the UnsafeAlgebra concept, and change the ones that don’t require all the flags to be enabled to only check the status of the needed subset.
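Referring back to point 1: here is a minimal sketch of what the reworked flag set might look like. This is my own illustration, not actual LLVM code, and the names chosen for the two new bits (AllowReassoc and ApproxFunc below) are placeholders for the proposed 'reassoc' and 'trans' flags.

// Sketch only: a stand-in for llvm::FastMathFlags after removing the
// UnsafeAlgebra umbrella bit and adding the two proposed bits.
struct FastMathFlagsSketch {
  enum {
    AllowReassoc    = 1 << 0, // proposed 'reassoc'
    NoNaNs          = 1 << 1, // 'nnan'
    NoInfs          = 1 << 2, // 'ninf'
    NoSignedZeros   = 1 << 3, // 'nsz'
    AllowReciprocal = 1 << 4, // 'arcp'
    AllowContract   = 1 << 5, // 'contract'
    ApproxFunc      = 1 << 6  // proposed 'trans' (a.k.a. 'libm')
  };
  unsigned Flags = 0;

  // Would replace setUnsafeAlgebra(): turn on every individual flag.
  void setFast() { Flags = 0x7F; }
  // Would replace unsafeAlgebra(): true only when every individual flag is set.
  bool isFast() const { return Flags == 0x7F; }
  // Example of a finer-grained query a transform could use instead.
  bool allowReassoc() const { return (Flags & AllowReassoc) != 0; }
};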

Conceptually, points 1 and 2 are transparent to users. Point 1 is complicated (to me) since it changes the IR, and so would need some work to deal with that change from a compatibility perspective. Point 2 is somewhat messy (again, to me).

If (1) and (2) were done, then I think point 3 (which I'd call the actual fast-math-related fixes) could be done in an evolutionary way. That is, individual examples of incorrect behavior can be analyzed and fixed as separate bugs. For example, if -fno-reciprocal-math is suppressing a reassociation, that can be fixed by changing some of the checks for UnsafeAlgebra to check for something more fine-grained (something that doesn't include the reciprocal-math flag). There's no requirement from my POV that all the fixes along these lines are done together. (If instead, a thorough audit is desired to get as many issues as are found fixed in one commit, that's fine with me, but doing them in smaller groups is also fine, and seems straightforward.)
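To make the kind of change described in point 3 concrete, here is an illustrative before/after of one such call site, reusing the FastMathFlagsSketch from the sketch above. The helper name canReassociate is hypothetical; the real call sites in LLVM will look different, but the shape of the change is the same.

// Hypothetical call-site predicate, for illustration only.
bool canReassociate(const FastMathFlagsSketch &FMF) {
  // Before the proposed change, a site like this effectively required every
  // flag to be set (the old all-or-nothing hasUnsafeAlgebra() behavior):
  //   return FMF.isFast();
  // After the change, it asks only for the property reassociation needs, so
  // switches like -fno-reciprocal-math no longer suppress it:
  return FMF.allowReassoc();
}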

Regarding:

> Also, if you'll be at the conference next week, we could find some time for interested parties to meet to discuss/hack this more if needed.

Yes, I’ll be at the conference. And I’m definitely interested in participating in discussions and hacking on this. I’d feel I’d be mostly on the sidelines for points 1 and 2, but I’m still very interested in participating in that aspect. Others may see it differently than my 3-point approach, above. In that case, discussing it all face-to-face at the conference would be a good way to make progress.

Thanks,

-Warren