[RFC] Stronger guarantees for "denormal-fp-math"

Summary

I’d like to refine the semantics of the "denormal-fp-math" function attribute to provide stronger guarantees regarding what assumptions the optimizer can and cannot make in the presence of this attribute. The goal of this change would be to allow LLVM IR to describe various semantic modes to more closely model the execution-time behavior of target processors that support flushing denormal/subnormal values to zero.

Background

Floating-point environment

On some target architectures, flushing of denormal inputs or outputs can be enabled or disabled dynamically. For example, on x86-based targets there are bits in the MXCSR register to control whether denormal inputs are treated as zero (DAZ) and whether denormal results are flushed to zero (FTZ). For such architectures, the denormal flushing behavior is a de facto part of the floating-point environment, although there is no explicit mention of such behavior being part of the floating-point environment in standards documents, such as IEEE-754 or the C language standard.
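For illustration only (not part of the proposal), here is a minimal C sketch of how these MXCSR bits can be toggled at run time using the standard SSE intrinsic wrappers; the helper name is arbitrary:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON */

void enable_ftz_daz(void) {
  /* FTZ: flush denormal results of SSE operations to zero. */
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
  /* DAZ: treat denormal inputs to SSE operations as zero. */
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}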

By default, LLVM assumes IEEE-754 semantics for the handling of denormal values, but it is possible to describe some restrictions using the "denormal-fp-math" attribute.

Attribute semantics

The current LLVM Language Reference says the "denormal-fp-math" attribute “indicates the denormal (subnormal) handling that may be assumed for the default floating-point environment.” The attribute is associated with a comma-separated pair of string values, each of which may be "ieee", "preserve-sign", "positive-zero", or "dynamic". The first entry indicates the flushing mode for the results of floating-point operations. The second indicates the handling of denormal inputs to floating-point instructions.

The current definition states that if the output mode is "preserve-sign" or "positive-zero", denormal results may be flushed to zero but are not required to be. The result is that transformations like x * 1.0 -> x are permitted.

The Lang Ref definition states “If the mode is "dynamic", the behavior is derived from the dynamic state of the floating-point environment. Transformations which depend on the behavior of denormal values should not be performed.” However, there seems to be some ambiguity about the meaning of this last statement. In a previous discussion, @arsenm told me “the intention was that you cannot replace non-canonicalizing operations with canonicalizing operations without knowing the mode.” And that seems to be the way the attribute is currently being handled. The identity transformation mentioned above (x * 1.0 -> x) is not blocked by "denormal-fp-math"="dynamic,dynamic".

One place where the "denormal-fp-math" attribute is considered is in the value tracking and fpclass deduction associated with explicit comparisons with zero. If the "denormal-fp-math" attribute is not present or the input mode is set to "ieee", we will assume that an equality comparison with zero guarantees that a value is zero. If the input mode is "dynamic", "preserve-sign", or "positive-zero", we do not make this assumption.
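To make the value-tracking concern concrete, here is a small illustration of my own, assuming 32-bit IEEE floats and that DAZ has been enabled (for example, as in the MXCSR sketch above):

#include <stdio.h>
#include <string.h>

volatile float denorm = 0x1p-149f;  /* smallest positive denormal */

int main(void) {
  float d = denorm;                 /* run-time load of the denormal  */
  unsigned bits;                    /* assumes 32-bit unsigned/float  */
  memcpy(&bits, &d, sizeof bits);
  /* With DAZ enabled, the comparison treats the denormal input as zero,
     so this prints even though the bit pattern is 0x00000001. */
  if (d == 0.0f)
    printf("compares equal to zero, bits = 0x%08x\n", bits);
  return 0;
}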

Motivation

I would like to strengthen the definition of "denormal-fp-math" for two reasons:

  1. To provide consistent numeric results when users change the FTZ/DAZ modes while FENV_ACCESS is enabled.
  2. To allow users to rely on the compiler preserving numeric behavior in accordance with the denormal behavior described using the -fdenormal-fp-math command-line option currently provided by clang, or similar options in other front ends.

Proposal

I am proposing strengthening the definition of the "denormal-fp-math" attribute to say that when this attribute is present the optimizer is not permitted to perform any transformation that would change the numeric results of the generated program if it were executed with the denormal mode set as described by the attribute. If the input or output modes are set to "dynamic", the compiler is not permitted to perform any transformation that would change the numeric results under any denormal mode available on the target architecture.

This would primarily affect two types of transformation: (1) removal or introduction of canonicalizing operations, and (2) constant folding involving denormal values.

We would continue to use "ieee,ieee" as the default denormal mode, and so existing transformations that make this assumption would be permitted by default.

Canonicalizing operations

When the "denormal-fp-math" attribute is set to a non-IEEE mode, we would not be allowed to eliminate operations such as x = x * 1.0 which potentially flush input values to zero. This pattern is sometimes used in math libraries which are required to behave in a way that is consistent with the dynamic FTZ/DAZ modes. A function implementation may look like this:

float f(float x) {
  if (x != 0.0f) {
    // Handle the non-zero case
  } else {
    // We may get here as a result of a flushed denormal.
    // Return zero with the sign of the input value.
    return x * 1.0f;
  }
}

If the compiler eliminates the x * 1.0f operation, this function will return an incorrect result for denormal inputs when the DAZ flag is set.

Constant folding of denormals

When we perform constant folding involving denormal input values or a denormal result, the constant folding should honor the denormal mode described by the "denormal-fp-math" attribute. If the attribute is set to "dynamic,dynamic", we should not perform any constant folding involving denormal values. If the attribute is set to "ieee,ieee" (or is absent) we can perform constant folding as we currently do, handling denormal values and denormal results according to the IEEE standard. If the input or output modes are set to "preserve-sign" or "positive-zero", the constant folding should be performed with denormal values flushed in the way described by the attribute.

Note, the LLVM optimizer will currently perform constant folding even when constrained intrinsics are used if APFloat reports that performing the operation would not raise any floating-point exceptions. This can change the numeric results of the program in cases where the DAZ flag is set.
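As a concrete illustration of my own (assuming 32-bit IEEE floats and that DAZ/FTZ have been enabled as in the earlier sketch), the value produced by IEEE-style constant folding can differ from the value the hardware produces:

#include <stdio.h>

volatile float x = 0x1p-149f;        /* smallest positive denormal; volatile
                                        keeps the second multiply at run time */

int main(void) {
  float folded  = 0x1p-149f * 2.0f;  /* IEEE constant folding gives 0x1p-148f    */
  float runtime = x * 2.0f;          /* with DAZ or FTZ set, the hardware gives +0.0f */
  printf("folded = %a, runtime = %a\n", folded, runtime);
  return 0;
}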

Further discussion

This topic has already been discussed extensively here: Questions about llvm.canonicalize

I have also proposed this as a topic for discussion at the LLVM Floating-Point Working Group meeting this Wednesday at 10 AM Pacific/5 PM UTC. This instance of the meeting has been rescheduled due to a holiday conflict last week, so it isn’t on the LLVM calendar. The meeting link is https://meet.google.com/kxo-bayk-nnd

Owing to the light attendance of this week’s FP WG meeting where we discussed this, I’ll opt to repost some of my comments on this topic here.

> When the "denormal-fp-math" attribute is set to a non-IEEE mode, we would not be allowed to eliminate operations such as x = x * 1.0 which potentially flush input values to zero. This pattern is sometimes used in math libraries which are required to behave in a way that is consistent with the dynamic FTZ/DAZ modes. A function implementation may look like this: […]

The LangRef, at present, doesn’t have much to say about how noncanonical floating-point values are handled. Outside of @llvm.canonicalize, in fact, it’s mentioned only in the documentation for denormal-fp-math (where it’s mentioned that @llvm.canonicalize may be needed to implement the semantics of denormal-fp-math), in @llvm.minnum and @llvm.maxnum (to say @llvm.canonicalize may be needed to handle sNaN inputs correctly), and in @llvm.is.fpclass, where it’s explicitly mentioned that the input is not canonicalized.

The end result is that we have a pretty major gap in our FP semantics, with little effort to close the gap, perhaps mostly because the cases where it matters are cases that people would prefer not to think about. Our FP instructions are not implementations of IEEE 754, but we don’t even have fully coherent ideas of how they differ from IEEE 754, much less anything that can be realized in a formal model of FP semantics. The problem of non-canonical inputs is salient to me because I’m helping out with the decimal FP implementation effort, and that’s a scenario where not only is non-canonical more relevant (and more thorny because of BID versus DPD differences), but there are also decimal exponents that have similar identity-operations-have-observable-effects issues.

To my eyes, the only clear signpost we have is our treatment of sNaNs (and NaN payload handling in general), where we’ve agreed that, despite IEEE 754 semantics requiring (almost) all operations to quiet sNaNs, LLVM IR operations are not required to consistently quiet sNaN inputs. If you specifically want an sNaN to be quieted, then @llvm.canonicalize is guaranteed to do it. It seems to me that the only responsible way to extend this rule to noncanonical values is to say that floating-point operations may, but are not required to, canonicalize input values [1].

That analysis does assume that denormal flushing is best thought of as treating denormals as noncanonical versions of zero. The fact that DAZ and FTZ can typically be set independently of one another, and that there are fun corner cases in FTZ mode where an exact result between the largest subnormal number and the smallest normal number would round up to the smallest normal number but is nevertheless flushed to zero (see Compiler Explorer for an example), does suggest that this may not be the wisest way to handle this.

In any case, in the motivating example, my understanding is that the entire reason the developers of the math library are writing x * 1.0 is because they are specifically attempting to canonicalize the input value (by flushing possible denormals), and as far as I’m aware, everyone is in agreement that @llvm.canonicalize is guaranteed to flush denormals in this situation. It really seems to me that the correct answer in this case is that the users should perform the operation with the semantics they actually wanted in the first place (which is @llvm.canonicalize), rather than the operation that they’re spelling to try to effect those semantics. I’m not entirely convinced that it’s a good motivation to change the semantics of the operation to bless what everybody seems to agree is merely a workaround to achieve the semantics they wanted (as this library, AIUI, predates the availability of __builtin_canonicalize in C code).

> When we perform constant folding involving denormal input values or a denormal result, the constant folding should honor the denormal mode described by the "denormal-fp-math" attribute. If the attribute is set to "dynamic,dynamic", we should not perform any constant folding involving denormal values.

With some of the playing with FTZ I have done recently, I don’t think “denormal input values or a denormal result” is, strictly speaking, the correct condition; it needs to be altered slightly to also include cases where the exact result is smaller than the smallest normal number but may end up rounding to the smallest normal number.

> Note, the LLVM optimizer will currently perform constant folding even when constrained intrinsics are used if APFloat reports that performing the operation would not raise any floating-point exceptions. This can change the numeric results of the program in cases where the DAZ flag is set.

This I think is a bug. On a broader topic, I interpret constrained intrinsics semantics as generally meaning “evaluate this instruction strictly according to the current dynamic floating-point environment.” And the dynamic floating-point environment can include mode bits other than rounding mode and exception flags. Indeed, FTZ/DAZ bits are modestly common, and then you can have really fun mode bits like the x87 precision control bits, which means constrained intrinsics for a generic architecture can have generically unpredictable semantics. At the very least, for any case where we know that some mode bit can be set (and FTZ/DAZ are common in many arches!), we shouldn’t be constant-folding any operation that bit might affect.

[1] The phrase “floating-point operation” is doing a lot of load-bearing work in this definition, since there are operations on floating-point values for which this definition should not apply and which must instead treat noncanonical input values as noncanonical (e.g., @llvm.is.fpclass or a bitcast instruction). While this can and should be clarified, it’s rather orthogonal to the discussion here.

The expected semantics here with constrained intrinsics (without any fast-math flags) seems pretty clear: the result has to be as-if we performed the operations the user wrote. I doubt anyone considers that controversial.

The question here is, how far do we want to stretch non-strict-fp. Currently, we allow fp operations to produce non-canonical results. With IEEE rounding, this is basically just SNaNs, so nobody really cares: the only way to get an SNaN is to explicitly request one, anyway. With non-IEEE denormal handling, common operations produce them, complicating optimizations. For example, consider x * .5 * 2.0 with denormal-fp-math: can it be folded to x? Can it be folded to x * 1.0? What fast-math flags are required for each of those?

I’m generally concerned about the continuing split where some groups consider constrained-fp impractical due to performance concerns.

I don’t accept that as a good signpost because we don’t have an "fp-signaling-nan-math" attribute. The "denormal-fp-math" attribute was added to handle a few special cases, and I don’t see why it can’t be used to handle others.

There are a few reasons I can think of for the developers of the math library not to use @llvm.canonicalize. The first is that it’s not currently handled for x86 targets, but that should be easy enough to correct. The more important reason is that it isn’t portable. There is no __builtin_canonicalize available for gcc, for example.

As it happens, gcc doesn’t have a good implementation of denormal handling either, but the library developers I’m working with had written their code with the operation obfuscated in an attempt to hide it from the compiler. That worked with gcc, icc, and clang until recently, but then we made some change in clang that allowed it to decipher their code and see the operation for what it was. Obviously, that’s an indication that trying to trick the compiler is not a good long-term solution, but because it used to work, I’m motivated to provide a more robust solution that handles it correctly.

I actually do think a call to __builtin_canonicalize is better for the library implementers than asking the compiler to respect the denormal flushing mode, but I’m looking at this from the more general point of view of wanting to have a mode that says “don’t do anything that will break the assumptions I told you to make about denormals.”
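For concreteness, here is a sketch of what that might look like in the library, assuming clang’s float-typed __builtin_canonicalizef builtin (illustrative only; the non-zero branch is a placeholder, as in the earlier example):

float f(float x) {
  if (x != 0.0f) {
    // Handle the non-zero case (elided; placeholder return).
    return x;
  }
  // llvm.canonicalize flushes a denormal input to a correctly signed zero
  // when the denormal mode says to, without relying on x * 1.0f surviving
  // optimization.
  return __builtin_canonicalizef(x);
}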

That much may not be controversial, but the problem is that the current constrained intrinsics don’t provide a way to describe it. You can describe exception semantics, and you can describe rounding mode. We respect those, but if you have a constrained fdiv that divides the smallest normal number by 2.0, for example, the result is exact and the operation doesn’t raise an exception, so nothing in the constrained intrinsic says it can’t be done.

@lntue suggested that we could consider denormal flushing as a sort of rounding mode, but I don’t think that quite works for all cases (especially since DAZ and FTZ apply at different locations).

I would be open to adding a “denormal-mode” argument (or better, an operand bundle entry) to describe the denormal mode requirements for constrained intrinsics.

Even without “denormal-fp-math”, x * .5 * 2.0 --> x * 1.0 would require the reassoc flag. We don’t consider x * 1.0 --> x as requiring any fast-math flag, but what I’m proposing is that it should require "denormal-fp-math"="ieee,ieee" if the attribute is present. If you set "denormal-fp-math" and also enable reassociation, I think you should expect compiler-chosen results (which you get with reassociation by itself anyway).

I’m concerned about that too. I consider it a temporary situation, until the compiler learns to optimize the constrained intrinsics better. I know some work has been done on that, and I don’t know how large the gap is, but I would be surprised if there isn’t still a considerable gap.

As an incremental step, maybe we can just check the function-level attribute. Not as precise, but it would be sufficient most of the time.

Oh, right, not sure what I was thinking. I guess we could theoretically do it if we could prove something about the input range, but I don’t think LLVM implements that sort of analysis.

Would x * 1.0 -> x be legal under reassoc even if you specify denormal-fp-math, then? That would significantly reduce the scope of the changes, I think.

That’s what I’m suggesting, except that I am suggesting that it need not be limited to places where we are evaluating constrained intrinsics. The only IR construct currently telling us about the denormal mode is the “denormal-fp-math” attribute, so if that is set to a value other than “ieee,ieee” I am suggesting that should be sufficient to guard folding of denormals.

That’s a really good question. We are being pulled in two different directions here. The "denormal-fp-math" attribute tells us what denormal behavior we can assume, but the FTZ/DAZ controls themselves have value-changing impact and so they act more like a fast-math control. @arsenm has argued that people using FTZ/DAZ are explicitly forfeiting any right to numeric consistency, and there’s a sense in which that’s true, but I don’t think it follows that people using “-fdenormal-fp-math=dynamic” don’t care about numeric consistency. In fact, for the case I’m trying to support, the exact opposite is true.

Obviously, we would want to eliminate x * 1.0 in fast-math modes, but I don’t think the reassoc flag provides any justification for doing that. Perhaps something in the fast-math flags rewrite that @jcranmer is working on would clean up this case. I believe @arsenm has suggested an “ignore denormal behavior” flag somewhere, but we’ve also talked about some kind of flag to generally allow real-number algebraic/trigonometric reasoning.

-1. There are a few issues with this framing, especially for the non-strictfp case. First, these modes aren’t defined in a semantic way, and there’s a variety of target behaviors that aren’t necessarily internally consistent. If you use this attribute to prescribe a behavior to IR operations, you’ve turned every generic FP operation into a black box we can’t do much of anything to. It not only restricts optimizations like dropping a multiply by 1, but also requires introduction of new instructions in the backend in some cases.

As I mentioned in the other thread, this is morally equivalent to requiring signaling nan quieting. A more reasonable proposal would be to move towards being canonicalize clean (which would be quite a lot of work to fully implement). We could no longer drop canonicalizing operations, and would have to insert new replacement canonicalizing operations. I don’t think it makes any sense to focus on denormal flushing as its own semantic entity.

For strictfp functions, we do effectively have the requirement to preserve canonicalization, which (probably) has the effect of flushing if FTZ or DAZ is enabled, depending on the instruction and processor and time of day.

The reason we have the canonicalize intrinsic today is specifically so we do not have to have these types of restrictions. The canonicalize intrinsic is the only generic operation which directly observes the denormal mode.

The strictfp handling doesn’t consider the denormal mode or other weird target FP modes (AMDGPU has a few other exotic bits). If we care about this, the denormal mode probably should follow along with the rounding mode annotations. I thought a little bit about having a target-defined fp environment bundle for denormal/other types of controls, but given that strictfp is currently completely broken for target intrinsics, I think there are higher priorities than worrying about the denormal controls.

Ugh, I forgot about this completely. For some reason my calendar notifications never work for these.

That has been my understanding of the rules already.

+1.

Do we? It’s not in the LangRef and I have no idea what it would mean.

This was a consolidation of several parallel mechanisms with different purposes.

You’ve left the world of standards when you enable any denormal flushing, so it’s not surprising that different implementations behave differently here. Given the unfortunate non-portability of __builtin_canonicalize with gcc, one suggestion I have is to implement optimizations to recognize explicit denormal flushing / canonicalizing patterns in code (e.g., something like fpclassify(x) == FP_SUBNORMAL ? copysign(0, x) : x). If we know the mode is ieee or preserve-sign, we can fold this to just x, but the dynamic case will still work correctly. We could adjust this a bit for the signaling NaN case and fold it to a canonicalize intrinsic call, which gets us to the IR semantics. I have no idea whether GCC implements similar optimizations to reconstruct a multiply.
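Spelled out as a function, the suggested pattern might look like this (a sketch using the standard C99 fpclassify/copysignf; the helper name is arbitrary and this is not an existing LLVM optimization):

#include <math.h>

static float flush_if_subnormal(float x) {
  /* Explicit, compiler-visible denormal flush that preserves the sign. */
  return fpclassify(x) == FP_SUBNORMAL ? copysignf(0.0f, x) : x;
}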

You can achieve consistency by canonicalizing at the points you need to. For example I am relying on the current behavior of the dynamic mode + canonicalize intrinsic to insert a runtime check of what the denormal mode is in library code, such that it works as expected in either case. Under IPO, when we know the caller’s fp environment, these fold to a constant and the cost is 0.

That’s my fault in this case. I rescheduled the meeting because it originally fell on a holiday, but I didn’t know how to update anyone’s calendars.

No, I am saying that we don’t have such an attribute for NaNs, but we do have such an attribute for denormals, and as you have seen I have been struggling to understand what it means. As it is currently used, it seems to have meaning in some circumstances and not in others, and I am unable to see any sort of consistency in which it will be.

I like the way that it was used to guard the fpclass deduction after comparisons with zero. I would like to be able to apply it in other ways (the two specific cases I’ve described above – constant folding and identity elimination), but I don’t see any clear rules to guide that.

What would you think about this for the x * 1.0 case? If "denormal-fp-math" is absent or set to “ieee,ieee”, x * 1.0 --> x is allowed, otherwise the replacement is x * 1.0 --> llvm.canonicalize(x) .

This still leaves problems for the x * 2.0 * .5 case that Eli raised, but as I said above, I think we could handle that with some definition of fast-math flags.

Do you have objections to using the "denormal-fp-math" attribute to restrict constant-folding denormal values?

Can you tell me more about these “other exotic bits”? It might help me to form a more general mental model.

The denormal-fp-math attribute model is informative of what the default FP mode is going to be. It does not enable optimizations of canonicalizing operations, or act as a true fast-math annotation. More concretely, it states what llvm.canonicalize will do, and what any possibly canonicalizing floating-point operation may do, but has no obligation to do, for a denormal input/output. We use it in a few different ways.

  1. Direct combines on llvm.canonicalize, which includes constant folding.
  2. Conversion from non-canonicalizing operations into canonicalizing operations (i.e., turning llvm.is.fpclass into fcmp; I don’t think we have others yet).
  3. At least one place in codegen uses it to defend against an introduced divide by zero if DAZ is in effect (a DAZ mode is really more of a restriction, rather than an optimization hint).
  4. AMDGPU needs this to change the codegen for a few operators. This includes avoiding selecting certain instructions that do not respect the mode, and changing the mode for fdiv expansion.
  5. A few places in codegen are currently over-aggressively checking the denormal mode. They’re blocking transforms that are already performed in the IR, and should be removed.

So my rough characterization would be it lets you know when it’s permissible to introduce a canonicalization that wasn’t originally there.

I think the number of contexts that need to consider the denormal mode should be as restricted as possible, ideally only when touching llvm.canonicalize or the small set of relevant non-canonicalizing operations. I don’t think we should be trying to interpret it as a restriction for the canonicalizing-but-not-required-to case, which covers most IR operations.

In the general case, I would object. In the particular instance of canonicalize, the blessed intrinsic, we already do that. As I mentioned in a previous post, I’m relying on this behavior to perform a runtime check of the denormal mode with the dynamic mode by doing a canonicalize of the smallest denormal (0x1) and seeing if the bitvalue result is 0.
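A rough sketch of that run-time check as it might be written from C, assuming clang’s float-typed __builtin_canonicalizef as the way to reach llvm.canonicalize (the helper name is arbitrary; the actual library code may differ):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool denormals_are_flushed(void) {
  uint32_t in_bits = 0x1;                  /* smallest positive denormal */
  uint32_t out_bits;
  float smallest, canon;
  memcpy(&smallest, &in_bits, sizeof smallest);
  canon = __builtin_canonicalizef(smallest);
  memcpy(&out_bits, &canon, sizeof out_bits);
  return out_bits == 0;                    /* flushed to +0.0? */
}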

Restricting constant folding for the normal IR operations isn’t a consistent model. If the current IR model is that no canonicalization is required for non-strict ops, and DAZ/FTZ turns denormals into non-canonical encodings of zero, then it doesn’t make sense to require the denormal flush in the constant folding context. I would prefer the holistic model of requiring preservation of canonicalizing operations, rather than avoiding this one specific case.

The main one I’m thinking of is the “IEEE mode” (which we thankfully removed in the latest generation). For compute we just always enable it. This enables signaling nan quieting, and needlessly breaks a bunch of other instruction modifiers we then just can’t use for optimization. If it’s false (as is used by graphics), you do not get signaling nan quieting and the special modifiers work. This also makes fmin/fmax handling even more complicated. We have an IR attribute to control this, but it’s also assumed based on the calling convention.

The second would be “dx10_clamp”, which changes the nan behavior for the clamp modifier on most FP instructions. We have an IR attribute for this, but I’m not aware of any direct IR that would observe it; it just changes the rules for modifier folding in codegen, and I’m not sure anyone is setting the attribute.

Another bit (which I’m not aware of any users for) is “FP16_OVFL”: “If set, an overflowed FP16 VALU result is clamped to +/- MAX_FP16 regardless of round mode, while still preserving true INF values”. We don’t have any compiler controls for this.

I apologize if I’m being obtuse, but there are also uses of the attribute in KnownFPClass::isKnownNeverLogicalZero() and similar value tracking functions, and I want to be sure I understand how those uses fit your description here.

I think you’re saying that if KnownFPClass::isKnownNeverZero() is based on instructions that may have flushed inputs or outputs to zero, then we can’t assume that the value is never logically zero unless we also know it is never a denormal. So if we’ve seen x != 0.0, we can’t know that an operation consuming x will not flush it to zero, and we can’t assume that x is never logically zero. But the operation x * 1.0 isn’t required to flush to zero, so even though it “may” do so, we are allowed to eliminate it. Have I got that right?

I don’t like that, but I will admit that it is at least self-consistent.

FWIW, I also found that the constant folder already has some support for exactly what I was suggesting. It seems to have first been introduced here: [InstructionSimplify] handle denormal input for fcmp · llvm/llvm-project@758de0e (github.com)

The motivation there was very similar to what I’ve been discussing. There is a test for the desired denormal handling with instsimplify, but other passes still fold the constants by some other means, so it isn’t actually doing what the author intended in most cases.

I don’t think the author (dcandler) is still active. It looks like you added the dynamic test cases.

On the input/DAZ side, the possibility of input flushes is an optimization constraint. We have to broaden the possible values in the daz/dynamic cases, so this moves the interpretation into a more conservative direction. This is similar to point 3.

This is specifically the input condition. I believe all the isKnownNeverLogicalZero checks only apply when considering incoming values and not the result.

Yes

Thanks. I think I understand this now.

As I said, I don’t like the semantics that operations may flush but aren’t required to, but I understand your point that it would be a lot of work to make everything canonicalize clean.

What is your view on the existing constant folding treatment of denormals? We have some code that accounts for "denormal-fp-math" in constant folding and some that doesn’t. I suppose the semantics you’ve described allow for either behavior, but it seems like we should be consistent one way or the other.