Propagation of fpclass assumptions vis-à-vis fast-math flags

@arsenm has been working recently to enhance the optimizer’s ability to draw conclusions from fast-math flags and related parameter attributes about the possible fpclass of values in the IR. These changes are very good for the cases Matt is trying to optimize, and I believe they are perfectly legal in accordance with the documented semantics of LLVM IR, as well as consistent with a reasonable interpretation of the documentation for the clang command-line options that lead to the generation of these fast-math flags and attributes.

However, over the course of several reviews (and also from time to time in past years), questions have been raised over whether this is really what users want from the fast-math options or whether we are perhaps being too aggressive in our optimization with regard to the fast-math flags, particularly ‘nnan’ and ‘ninf’. I want to be clear that Matt’s recent changes aren’t the cause of the problems I’m discussing below. They’re just expanding the scope of where such issues are visible.

The approach that we’ve generally taken, in practice, is to say that when you invoke a compiler with fast-math enabled (and for the sake of discussion, let’s assume I mean that you’ve used ‘clang -O2 -ffast-math’), you are effectively giving the compiler permission to assume that no NaN or Inf values will be used as inputs or produced as outputs for any operation or function call, and that if such values are encountered the compiler may treat it as undefined behavior.

The clang documentation for -ffinite-math-only (the relevant component of -ffast-math) says, “Allow floating-point optimizations that assume arguments and results are not NaNs or ±inf.” Given this definition, the behavior above isn’t unreasonable.

But is that really what users want?

There’s no simple answer to this question, of course. I’m sure there are users who want exactly that. However, I’m equally certain that there are users who would like a milder interpretation along the lines of, “Don’t block optimizations that are unsafe in the presence of NaN or infinities, but still respect the basic logic of my code.” I’d like to discuss this second interpretation, because I’m not sure it’s ever been possible to implement such behavior reliably in LLVM, and the recent enhancements to value tracking are making it even more difficult.

Consider the following code:

 float a[4], b[4], c[4];
 
 void foo(float x, float y) {
   for (int i = 0; i < 4; ++i) {
     c[i] = 0.1f + (x * (float)i * y);
     if (c[i] == INFINITY || c[i] == -INFINITY)
       c[i] = 1.0f;
   }
 
   for (int i = 0; i < 4; ++i)
     a[i] = b[i] / c[i];
 }

I would like the compiler to generate an approximation for the division in the bottom loop using reciprocal approximation plus Newton-Raphson refinement. Such an approximation is only allowed if c[i] is never zero or infinite. The code guarantees that c[i] will never be zero, and it has special handling for infinities, so this should be fine, right? The problem is that when I use the fast-math option to enable the reciprocal approximation, my special handling for infinities gets optimized away.
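For reference, here is a rough sketch of the kind of code I’d like the compiler to be able to generate for the second loop. This is only illustrative (the intrinsics follow x86 SSE, and the single Newton-Raphson step is a stand-in for however much refinement the target’s accuracy requirements demand), not a claim about what clang emits today:

  #include <immintrin.h>

  /* a[i] = b[i] / c[i] via a reciprocal estimate plus one Newton-Raphson step. */
  void divide_by_c(float *a, const float *b, const float *c) {
    __m128 vb  = _mm_loadu_ps(b);
    __m128 vc  = _mm_loadu_ps(c);
    __m128 r   = _mm_rcp_ps(vc);                            /* ~12-bit reciprocal estimate */
    __m128 two = _mm_set1_ps(2.0f);
    r = _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(vc, r)));  /* r = r * (2 - c*r) */
    _mm_storeu_ps(a, _mm_mul_ps(vb, r));                    /* b * (1/c) */
  }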

Now suppose I’m trying to create a front-end option that enables fast-math on operations without setting the fast-math flags on explicit comparisons. After some basic simplification, the code above would give me IR like this:

define dso_local void @foo(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) #0 {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %i.0, 4
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  br label %for.cond13

for.body:                                         ; preds = %for.cond
  %conv = sitofp i32 %i.0 to float
  %mul = fmul fast float %x, %conv
  %mul1 = fmul fast float %mul, %y
  %idxprom = sext i32 %i.0 to i64
  %arrayidx = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom
  store float %mul1, ptr %arrayidx, align 4, !tbaa !6
  %idxprom2 = sext i32 %i.0 to i64
  %arrayidx3 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom2
  %0 = load float, ptr %arrayidx3, align 4, !tbaa !6
  %cmp4 = fcmp oeq float %0, 0x7FF0000000000000
  br i1 %cmp4, label %if.then, label %lor.lhs.false

lor.lhs.false:                                    ; preds = %for.body
  %idxprom6 = sext i32 %i.0 to i64
  %arrayidx7 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom6
  %1 = load float, ptr %arrayidx7, align 4, !tbaa !6
  %cmp8 = fcmp oeq float %1, 0xFFF0000000000000
  br i1 %cmp8, label %if.then, label %for.inc

if.then:                                          ; preds = %lor.lhs.false, %for.body
  %idxprom10 = sext i32 %i.0 to i64
  %arrayidx11 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom10
  store float 1.000000e+00, ptr %arrayidx11, align 4, !tbaa !6
  br label %for.inc

for.inc:                                          ; preds = %lor.lhs.false, %if.then
  %inc = add nsw i32 %i.0, 1
  br label %for.cond, !llvm.loop !10

for.cond13:                                       ; preds = %for.body17, %for.cond.cleanup
  %i12.0 = phi i32 [ 0, %for.cond.cleanup ], [ %inc25, %for.body17 ]
  %cmp14 = icmp slt i32 %i12.0, 4
  br i1 %cmp14, label %for.body17, label %for.cond.cleanup16

for.cond.cleanup16:                               ; preds = %for.cond13
  ret void

for.body17:                                       ; preds = %for.cond13
  %idxprom18 = sext i32 %i12.0 to i64
  %arrayidx19 = getelementptr inbounds [4 x float], ptr @b, i64 0, i64 %idxprom18
  %2 = load float, ptr %arrayidx19, align 4, !tbaa !6
  %idxprom20 = sext i32 %i12.0 to i64
  %arrayidx21 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom20
  %3 = load float, ptr %arrayidx21, align 4, !tbaa !6
  %div = fdiv fast float %2, %3
  %idxprom22 = sext i32 %i12.0 to i64
  %arrayidx23 = getelementptr inbounds [4 x float], ptr @a, i64 0, i64 %idxprom22
  store float %div, ptr %arrayidx23, align 4, !tbaa !6
  %inc25 = add nsw i32 %i12.0, 1
  br label %for.cond13, !llvm.loop !12
}

Notice that my infinity checks (%cmp4 and %cmp8) have no fast-math flags set. Unfortunately, early CSE still eliminates the comparisons (see Compiler Explorer). I’m not exactly sure which analyses are involved in the chain of reasoning, but the basic idea seems to be that something sees that the non-constant values in the comparison come from an operation that has the ‘ninf’ flag set, so the optimizer assumes the values can’t be infinities. This is consistent with the semantics described in the LLVM Language Reference, but it makes it difficult to support the behavior I’d like to see here.

We could discuss various options within the currently available IR constructs to fix the particular case I’ve just described, but I think they mostly amount to putting obstacles in the way of the optimizer to try to prevent it from making deductions. As such, I think they’d either be susceptible to future enhancements that teach the optimizer to get past those obstacles, or they would block optimizations more broadly than we would like.

I think what I’d like to see is something like the current ‘nnan’ and ‘ninf’ flags, but with semantics like “allow optimizations that are unsafe in the presence of ‘nan|inf’ but do not otherwise make assumptions about the value returned.” Or, alternatively, some way to globally control whether or not the optimizer considers fast-math flags for the purposes of value tracking, scalar evolution, etc.

Opinions?

See the discussion on D47963 ([LangRef] nnan and ninf produce poison). The LangRef text was changed to “poison” as part of the general effort to clarify the various meanings of “undefined” in the reference. I suggested something more conservative at the time… but I didn’t really have any examples where the “poison” behavior would cause problems. I think it makes sense to revisit if it’s causing practical issues for users.

The most obvious option for this is “freeze”, but there probably isn’t any way for clang to generate that without blocking a bunch of important optimizations.

Thanks for the link. @sanjoyd said there that fcmp is only affected by nnan and ninf if the fcmp instruction has those flags set, but as I demonstrated above, that’s not true. Your comment about the flags working on a per-file basis brings up other problems because the float_control pragma is supposed to allow local control, but that’s currently broken in a few places.

I think the problem I’m describing here is difficult to address while maintaining a well-defined IR definition. Stepping outside the framework of a well-defined IR language, what I want is something that says that the instruction will return whatever value is dictated by the semantics of the instruction without the fast-math flags if the instruction is left in place, BUT the optimizer is allowed to make transformations that violate those semantics in a specific way.

When you try to think about that in the context of a well-formed IR definition, it doesn’t feel right because the whole purpose of the IR definition is to define semantics that cannot be violated by transformations, but I would say that the whole purpose of fast-math flags is to allow semantics to be violated, so we’re left in this uncomfortable state.

Exploring a tangent for a moment, I realized recently that there is a bit of an unwritten rule with regard to ‘nnan’ or ‘ninf’ attached to an fcmp instruction. Suppose I have this instruction:

%cmp4 = fcmp ninf oeq float %0, 0x7FF0000000000000

We will always treat that as evaluating to ‘false’ (because we’re assuming no infinities, right?). But the IR definition says that it returns poison, regardless of the value of %0, because the second operand is an infinity. So, in theory, we could treat that as always true. It would be fairly hostile to the user to do so, but the IR definition says we could, right?

That’s a bit off-topic, but it’s related to what I’m getting at because regardless of our IR definitions, we ultimately don’t want to completely disregard the intent of the user’s source code.

Yes, I talked about this issue with @jcranmer a couple of weeks ago and he suggested freeze, but it turns out that early CSE gets rid of the freeze in the case I used here. I found another case where freeze blocked an optimization when we could deduce the fpclass from both fast-math flags and an assume intrinsic (Compiler Explorer), but that’s something that could be fixed. I expect that there are other cases where freeze would block things in ways that wouldn’t be so easily solved.

I think this is a big issue, and I am expecting the evaluate-to-false behavior we have today. We just need to fix the LangRef wording. I opened https://github.com/llvm/llvm-project/issues/45422

A half-baked thought I’ve been working on is to split the fast-math flags into separate input and output flags. I think that would put us in a better state. In some situations you do want to know that there are no nan/infinity inputs, and in others you care about the lack of overflow. In your example, we could refine this so that infinity outputs are assumed impossible while leaving infinity inputs defined.

I think you can formalize this as “if an input is nan/inf, or the output would be nan/inf, the result is equivalent to freeze poison”. Non-fast-math ops just see a regular value, so operations like non-fast-math floating-point comparisons behave consistently. But we don’t need to preserve the exact behavior for nan/inf cases, so it still allows many optimizations. The operation itself can be emitted ignoring the possibility of nan/inf. And if an nnan/ninf operation has a single use that’s also an nnan/ninf operation, the two operations can be emitted as a group ignoring the possibility of nan/inf. (Operations with multiple uses are more complicated to optimize because each use needs to see the same value.)

That said, it’s significantly harder to write optimizations against a rule like this. For example, under this proposal, isKnownNeverNaN() would almost always be false for “nnan” ops; we’d need a different predicate to allow optimizing pairs of nnan instructions.

I think we need something formalizable; something that isn’t is very difficult to maintain. We have new developers coming into LLVM all the time; we need to be able to point at a rule when someone proposes a new transform, not just depend on reviewers to have internalized heuristics for what’s likely to break user code.

Undefined behavior is undefined because once we start assuming the behavior in question never happens, there’s no way to provide reliable guarantees. We use a combination of diagnostics and sanitizers to make working with undefined behavior somewhat predictable. Saying that floating-point optimization flags shouldn’t use an “undefined behavior” ruleset is one thing; trying to compromise on the general principle of undefined behavior is just making the problems more obscure.

In light of the new Clang warning -Wnan-infinity-disabled, I’ve been fixing up a few uses of -ffast-math in our codebase which were using inf/nan values and calling isinf/isnan. (Thanks for the warning, btw!)

I can’t really say what the original author expected to happen, but some of these cases actually had benchmarks checked in showing the performance improvement they got from enabling -ffast-math. In those cases, I’ve not yet seen a situation where the code needed anything beyond -fassociative-math -fno-signed-zeros to preserve the (sometimes 3x!) performance gain.

So, personally, I’m definitely in favor of making -ffast-math have less UB in it. I don’t think anyone expects to get UB from enabling that flag, and I don’t think the performance benefits of creating UB are worth the cost.

IMO, it’s not even obvious that -ffinite-math-only, or its two clang-only subflags -fno-honor-infinities and -fno-honor-nans, should exist at all. I think it’d be just as well to remove support for them entirely. Instead, if some code would really like to use an assumption of finite results for some particular operation…just write such an explicit assumption in that code. (And we ought to be able to optimize on such explicit assumptions…though we don’t seem to do so today.)

Is the performance data openly available for your project? If so, can you provide a reference?

computeKnownFPClass does understand assumes with nan checks now, but this has essentially the same UB-as-defined properties as nofpclass/nnan. It also has all the other drawbacks of assume, and requires littering them everywhere.

I’m sure there are some cases where separating the fmf flags into inputs and outputs would help, but I don’t think it covers everything. In the case I posted, the reciprocal approximation plus Newton-Raphson results in NaN if the denominator is either zero or infinity, so I’d need both inputs and outputs with ‘ninf’ for this optimization.
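To spell out why (assuming the hardware estimate behaves like rcpps, where rcp(inf) = 0 and rcp(0) = inf), one refinement step r1 = r0 * (2 - c*r0) gives:

  c = inf:  r0 = 0,    c*r0 = inf*0 = NaN,   r1 = 0   * (2 - NaN) = NaN
  c = 0:    r0 = inf,  c*r0 = 0*inf = NaN,   r1 = inf * (2 - NaN) = NaN

so a NaN appears in the output even though neither input to the original division was NaN.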

I think there are two distinct use cases here. For what you’re working on, I think you really do need to be able to assume that there are no NaNs/Infs anywhere so that you can optimize away the special handlers for those cases. What I’m trying to achieve is a bit more vague, I think. I want to enable some optimizations that require the fast-math flags while not optimizing away code that’s necessary to handle the corner cases.

It wouldn’t be too far off to say that I want to be able to move in and out of fast-math mode without risk of the effects of fast-math bleeding over into places where it is turned off. In fact, that’s ultimately where I’d like to get with this. I have an idea for a utility to selectively disable fast-math-based optimizations locally, so that a user can broadly enable fast-math and then add constraints until the results meet whatever accuracy requirements they have. But I think being able to preserve comparisons while aggressively optimizing everything else is a useful first step.

BTW, I think you’re right that the fcmp with constant case could be easily codified to the desirable (current) behavior by a change to the definition.

The part about “operations can be emitted as a group ignoring the possibility of nan/inf” makes me a bit nervous with regard to comparisons specifically. Here’s a case that I got from a customer which convinced me that our current aggressive ‘nnan’ handling isn’t safe to use for most situations:

https://godbolt.org/z/184xa9Ydc

NaNs compare as equal to zero there. It’s because of an unfortunate quirk in the ‘ucomisd’ instruction, where comparisons with NaN set the ZF, PF, and CF flags. Because we told the code generator it could assume no NaNs, it doesn’t bother checking the parity flag and just sees the zero flag as indicating equality. The typical explanation is that this is the kind of thing you are signing up for when you set the ‘nnan’ flag, and that’s true. But that’s what led to my current line of inquiry: I stopped setting the ‘nnan’ and ‘ninf’ flags, and now I’m trying to find ways to recover the optimizations that are lost in the process.
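To make that concrete, here’s a hypothetical reduction of the pattern (the customer’s actual code is behind the godbolt link above):

  /* Compiled with 'nnan' on the comparison, the x86 backend assumes the
     operands are ordered, emits 'ucomisd' and tests only ZF (no parity
     check), so a NaN argument takes the "equal to zero" path and this
     returns 1. */
  int is_zero(double x) {
    return x == 0.0;
  }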

The first obvious thing to try was not setting the fast-math flags on comparisons, but as you see that isn’t sufficient to prevent fast-math assumptions from being made on comparisons.

I agree with you here, but I’m struggling with the precision required by the word “formalizable.” As you know, it entails more than just writing down the rules that we’re following. Those rules need to be rigorously consistent, and as I said I’m trying to come up with a way to enable a sort of behavior that isn’t entirely self-consistent. I want to be able to give one answer to questions of the form “may I perform this transformation?” and a different answer to questions of the form “may I assume that the result of this expression [is|is never] x?”

I don’t think what I want is a change to general principles on undefined behavior. What I want is to be able to stop assuming that the use of NaN or Inf values with an instruction that has ‘nnan’ or ‘ninf’ set is undefined behavior.

I recognize that this is a departure from the current semantics of those flags, and I recognize that there are valid use cases for which the current semantics are good and useful, so I don’t want to get rid of what we have. I want to introduce something new that has a significant number of shared properties with what we have.

Consider the ‘reassoc’ flag. Suppose I have IR like this:

%t1 = fadd reassoc float %x, %y
%t2 = fadd reassoc float %t1, %z

If there are no other uses of %t1 I can transform that to this:

%t1 = fadd reassoc float %x, %z
%t2 = fadd reassoc float %t1, %y

That is a break from the semantics of the original IR. It might produce entirely different results from the original IR. It’s allowed because that’s what the flag means. It doesn’t let me assume anything about the value of %t1 or %t2. It just lets me perform the transformation. It explicitly lets me perform a transformation that violates the original semantics.

That’s more-or-less what I want with ‘nnan’ and ‘ninf’ – I want a weak form of these flags that enables semantic-breaking transformations without introducing any other changes to the semantics of instructions on which they are present.

Right, to handle that, in addition to the new semantics for nnan/ninf, clang would need to stop emitting nnan/ninf on comparisons.

Basically, there are two problems in the given example: the fcmp itself, and the inputs to the fcmp.

If you want to ensure two different fcmp instructions produce consistent results, you can’t have any form of non-determinism (i.e. nnan/ninf flags) on fcmps: otherwise, there isn’t any way to prevent isnan(x) != isnan(x).

Then, on top of that, you need to ensure that the input to those instructions is not poison, which is where my suggested rule change comes in.

The way to reconcile “formalization” with an answer that isn’t actually predetermined is non-determinism: basically, we say that there isn’t one right answer, but there’s a constrained range of answers. This is how afn and nsz are currently defined: we say that there’s not one right answer, but a small range of correct answers.

Yes. With the current rules I don’t see anything that would stop someone from writing a transformation that sets the ‘nnan’ flag on an fcmp instruction if the ‘nnan’ flag is set for all of its non-constant operands, and even though we’re not doing exactly that, the constant improvements in tracking fpclass based on fast-math flags and related attributes are having the same effect.

I’m not quite sure what you mean by “small range of correct answers” here.

This is going off in yet another direction, but I have a problem with ‘afn’ in that we don’t say anything about just how close the approximation needs to be. Can I replace ‘sin(x)’ with zero? It’s an approximation. I think we’d all agree that it’s not a sufficiently accurate approximation, but I don’t know that we have any rule for how much error is allowed.

Going back to my original case with reciprocal approximation of division, we’re using Newton-Raphson refinement because just inserting the rcpps instruction isn’t close enough. The ‘arcp’ flag allows the use of a reciprocal, but an unwritten (I think) rule tells us that the answer needs refinement.

Also, in the case of reassociation, depending on the values involved, the “small range” of results can include zero and infinity if the intermediate results overflow or underflow. This gets at a key problem with fast-math. I think people using it intend to say, “I’m willing to tolerate a small amount of variation in my results,” but in practice the results might be completely incorrect, and there’s not really any way the compiler can tell the difference without knowing the range of inputs. That is to say, I don’t think we can define the behavior of some fast-math flags in terms of the numeric impact of the transformations they allow.
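As a concrete illustration (values chosen for demonstration, not from any particular report): with float values a = 3e38f, b = 3e38f, c = -3e38f,

  (a + b) + c  =  inf + (-3e38f)  =  inf      (a + b overflows to infinity)
  a + (b + c)  =  3e38f + 0.0f    =  3e38f

so reassociation doesn’t just perturb the last few bits; it can turn a finite answer into infinity.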

e.g. fmul nsz float 1.0, 0.0 is either 0.0 or -0.0. So there are two “correct” answers. This is formalizable (e.g. alive2 can reason about this sort of result).

Under my proposal, for fmul nnan float 1.0, nan, any floating-point number would be a correct answer.

We have !fpmath metadata, which I think is supposed to cover this?

“Reassociation” as a compiler optimization for floating-point ops is ill-defined, yes, and there’s no obvious way to fix it. I don’t think that should block us from making what guarantees we can for other fast-math flags. And dragging this discussion in too many directions makes it unlikely anyone will address any of it.

Do you mean your proposal to have it return freeze poison? When you say “any floating-point number” do you intend for that to include NaN?

Either way, that’s not really what I’m looking for. What I want is for fmul nnan float 1.0, nan to reliably return NaN, while also allowing the transformation x*1.0 --> x.

I have other problems with !fpmath, but my point was that we don’t say how much error is allowed with afn if you don’t use !fpmath. As far as I know, !fpmath is only ever used for the case where OpenCL allows 2.5 ulp error for single-precision divide and square root. Near as I can tell, it’s based on the assumption that without !fpmath divide and square root will return correctly-rounded results, though I don’t think that’s necessarily a good assumption for all targets.

I see your point, but the reason I keep bringing reassoc back into this is that it’s a good model for what I want to enable with a weak version of the nnan and ninf flags. I’m not looking for a way to “fix” the reassoc flag. I’m looking to use it as a model for other flags. Apart from questions about whether or not it also allows things like distributive optimizations, I don’t think there’s a lot of contention about the meaning of reassoc, and I think it’s easy enough to define its behavior – it enables transformations that are algebraically legal based on associative properties of operations. Similarly, with weak-nnan, I want to enable transformations that would be legal if we did not need to consider the possibility of NaN inputs or results.

“any floating-point number” would include NaN, I think. Otherwise I’m not sure how you’d define it in a universally applicable way.

Is x*1.0 --> x what you meant to write? That doesn’t require any fast-math flags.

Requiring that fmul nnan float 1.0, nan returns nan seems to defeat the whole purpose of nnan, which is to assume stuff isn’t nan. I mean, I don’t have an example off the top of my head where it matters for fmul specifically, but if we need to preserve NaN inputs, what optimizations are allowed by nnan? I mean, I guess you could very narrowly state that nnan means something like “if the inputs to an operation are finite, and the IEEE-defined result would be NaN, the operation may instead return zero”. But that seems so narrow it’s hard to imagine it’s useful at all.

computeKnownFPClass does understand assumes with nan checks now

The optimization doesn’t trigger for reciprocal approximation division with appropriate assumes; it looks like that is only checking the ‘ninf’ bit.

but this has essentially the same UB-as-defined properties as nofpclass/nnan. It also has all the other drawbacks of assume, and requires littering them everywhere.

IMO, it’s significantly less scary to potentially invoke UB from violating an explicitly-written-in-code __builtin_assume, vs from changing a compiler flag that affects everywhere in the entire TU (including all headers you include, which you may not even know the contents of). That is, IMO, the requirement to “litter them everywhere” would be a feature—because you should only be doing this when you know it’s actually appropriate, for a given performance-hot algorithm—not a downside.
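As a sketch of what that might look like for the division loop from the opening example (the condition and placement here are illustrative, not a recommendation for any particular codebase):

  void divide_all(float a[4], const float b[4], const float c[4]) {
    for (int i = 0; i < 4; ++i) {
      /* Assert, at the one spot that needs it, the property the reciprocal
         approximation relies on: the denominator is nonzero and finite. */
      __builtin_assume(c[i] != 0.0f && !__builtin_isinf(c[i]));
      a[i] = b[i] / c[i];
    }
  }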

You’re right. Sorry. I guess my brain had shut down by that point in the day. What I meant was that I want “fmul nnan float 0.0, nan” to be defined to return NaN while also allowing something like “x*0.0 → 0.0”.

I think this is where the disconnect is. I want to allow optimizations that would change the result if one of the values were NaN, but if such optimizations aren’t performed I want the semantics of the IR to be the same as they would have been if the fast-math flags weren’t present. And to be abundantly clear, I am not proposing changing the semantics of the current fast-math flags to have this behavior. I am saying I’d like new fast-math flags that have that behavior.

I think this is particularly important when we’re talking about constant operands. If we’re going to constant fold “fmul nnan float 1.0, nan”, it should always fold to nan. That is to say, “1.0*nan → 1.0” is not an optimization that should be enabled by the nnan flag.

BTW, this seems like a good time to drag up an old problem I had with the current semantics of instructions returning poison if the operands violate ninf/nnan flags.

https://discourse.llvm.org/t/nnan-ninf-and-poison/56506

The issue I had there was that in the presence of fast-math flags, we are introducing an infinity into IR that might not otherwise have had one, and then declaring the IR to have undefined behavior that can be optimized away.

define double @f(double %x) {
entry:
  %t1 = fadd fast double %x, 0xFFE5555555555555
  %t2 = fadd fast double %x, 0xFFE5555555555555
  %result = fadd fast double %t1, %t2
  ret double %result
}

Becomes:

define double @f(double %x) {
  ret double poison
}

You’re right that the optimization depends on the ‘ninf’ flag, but just FYI, assuming isfinite(denominator) wouldn’t be sufficient in this case because the transformation is also incorrect if the denominator is zero. Of course, the fpclass information could tell us that too, and maybe you had already accounted for that.

I don’t think requiring something like __builtin_assume() is good for the case that Matt is trying to handle. If you have a library and you want to be able to produce one version that handles NaN and Inf and another version that does not, having to modify the code to insert assume calls is unnecessarily burdensome.

There’s no way to define a coherent set of rules based on that. There’s no objective way to define whether an optimization is obvious. You’re basically saying that whether an optimization is legal depends on the author of the optimization.

I guess you could try to enumerate the exact set of optimizations that are allowed, but I’m not sure you can enumerate them in a way that actually prevents “wrong” transforms.