@arsenm has been working recently to enhance the optimizer’s ability to draw conclusions from fast-math flags and related parameter attributes about the possible fpclass of values in the IR. These changes are very good for the cases Matt is trying to optimize, and I believe they are perfectly legal in accordance with the documented semantics of LLVM IR, as well as consistent with a reasonable interpretation of the documentation for the clang command-line options that lead to the generation of these fast-math flags and attributes.
However, over the course of several reviews (and also from time to time in past years), questions have been raised over whether this is really what users want from the fast-math options or whether we are perhaps being too aggressive in our optimization with regard to the fast-math flags, particularly ‘nnan’ and ‘ninf’. I want to be clear that Matt’s recent changes aren’t the cause of the problems I’m discussing below. They’re just expanding the scope of where such issues are visible.
The approach that we’ve generally taken, in practice, is to say that when you invoke a compiler with fast-math enabled (and for the sake of discussion, let’s assume I mean that you’ve used ‘clang -O2 -ffast-math’), you are effectively giving the compiler permission to assume that no NaN or Inf values will be used as inputs or produced as outputs for any operation or function call, and that if such values are encountered the compiler may treat it as undefined behavior.
The clang documentation for -ffinite-math-only (the relevant component of -ffast-math) says, “Allow floating-point optimizations that assume arguments and results are not NaNs or ±inf.” Given this definition, the behavior above isn’t unreasonable.
But is that really what users want?
There’s no simple answer to this question, of course. I’m sure there are users who want exactly that. However, I’m equally certain that there are users who would like a milder interpretation along the lines of, “Don’t block optimizations that are unsafe in the presence of NaN or infinities, but still respect the basic logic of my code.” I’d like to discuss this second interpretation, because I’m not sure it’s ever been possible to implement such behavior reliably in LLVM, and the recent enhancements to the value tracking and such are making it even more difficult.
Consider the following code:
#include <math.h> /* for INFINITY */

float a[4], b[4], c[4];

void foo(float x, float y) {
  for (int i = 0; i < 4; ++i) {
    c[i] = 0.1f + (x * (float)i * y);
    if (c[i] == INFINITY || c[i] == -INFINITY)
      c[i] = 1.0f;
  }
  for (int i = 0; i < 4; ++i)
    a[i] = b[i] / c[i];
}
I would like the compiler to generate an approximation for the division in the bottom loop using a reciprocal approximation plus Newton-Raphson refinement. Such an approximation is only allowed if c[i] is never zero or infinite. The code guarantees that c[i] will never be zero, and it has special handling for infinities, so this should be fine, right? The problem is that when I use the fast-math option to enable the reciprocal approximation, my special handling for infinities gets optimized away.
Now suppose I’m trying to create a front-end option that enables fast-math on operations without setting the fast-math flags on explicit comparisons. After some basic simplification, the code above would give me IR like this:
define dso_local void @foo(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) #0 {
entry:
br label %for.cond
for.cond: ; preds = %for.inc, %entry
%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
%cmp = icmp slt i32 %i.0, 4
br i1 %cmp, label %for.body, label %for.cond.cleanup
for.cond.cleanup: ; preds = %for.cond
br label %for.cond13
for.body: ; preds = %for.cond
%conv = sitofp i32 %i.0 to float
%mul = fmul fast float %x, %conv
%mul1 = fmul fast float %mul, %y
%idxprom = sext i32 %i.0 to i64
%arrayidx = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom
store float %mul1, ptr %arrayidx, align 4, !tbaa !6
%idxprom2 = sext i32 %i.0 to i64
%arrayidx3 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom2
%0 = load float, ptr %arrayidx3, align 4, !tbaa !6
%cmp4 = fcmp oeq float %0, 0x7FF0000000000000
br i1 %cmp4, label %if.then, label %lor.lhs.false
lor.lhs.false: ; preds = %for.body
%idxprom6 = sext i32 %i.0 to i64
%arrayidx7 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom6
%1 = load float, ptr %arrayidx7, align 4, !tbaa !6
%cmp8 = fcmp oeq float %1, 0xFFF0000000000000
br i1 %cmp8, label %if.then, label %for.inc
if.then: ; preds = %lor.lhs.false, %for.body
%idxprom10 = sext i32 %i.0 to i64
%arrayidx11 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom10
store float 1.000000e+00, ptr %arrayidx11, align 4, !tbaa !6
br label %for.inc
for.inc: ; preds = %lor.lhs.false, %if.then
%inc = add nsw i32 %i.0, 1
br label %for.cond, !llvm.loop !10
for.cond13: ; preds = %for.body17, %for.cond.cleanup
%i12.0 = phi i32 [ 0, %for.cond.cleanup ], [ %inc25, %for.body17 ]
%cmp14 = icmp slt i32 %i12.0, 4
br i1 %cmp14, label %for.body17, label %for.cond.cleanup16
for.cond.cleanup16: ; preds = %for.cond13
ret void
for.body17: ; preds = %for.cond13
%idxprom18 = sext i32 %i12.0 to i64
%arrayidx19 = getelementptr inbounds [4 x float], ptr @b, i64 0, i64 %idxprom18
%2 = load float, ptr %arrayidx19, align 4, !tbaa !6
%idxprom20 = sext i32 %i12.0 to i64
%arrayidx21 = getelementptr inbounds [4 x float], ptr @c, i64 0, i64 %idxprom20
%3 = load float, ptr %arrayidx21, align 4, !tbaa !6
%div = fdiv fast float %2, %3
%idxprom22 = sext i32 %i12.0 to i64
%arrayidx23 = getelementptr inbounds [4 x float], ptr @a, i64 0, i64 %idxprom22
store float %div, ptr %arrayidx23, align 4, !tbaa !6
%inc25 = add nsw i32 %i12.0, 1
br label %for.cond13, !llvm.loop !12
}
Notice that my infinity checks (%cmp4 and %cmp8) have no fast-math flags set. Unfortunately, EarlyCSE still eliminates the comparisons (see Compiler Explorer). I’m not exactly sure which analyses are involved in the chain of reasoning it follows, but I’m fairly sure that the basic idea is that something sees that the non-constant values in the comparison come from an operation that has the ‘ninf’ flag set, so the optimizer assumes that those values can’t be infinities. This is consistent with the semantics described in the LLVM Language Reference, but it makes it difficult to support the behavior I’d like to see here.
We could discuss various options within the currently available IR constructs to fix the particular case I’ve just described, but I think they mostly amount to putting obstacles in the way of the optimizer to try to prevent it from making deductions, and as such I think they’d either be susceptible to future enhancements teaching the optimizer to get past those obstacles or they would present broader optimization blocks than we would like.
I think what I’d like to see is something like the current ‘nnan’ and ‘ninf’ flags, but with semantics along the lines of “allow optimizations that are unsafe in the presence of NaN or infinity values, but do not otherwise make assumptions about the value returned.” Alternatively, we could have some way to globally control whether or not the optimizer considers fast-math flags for the purposes of value tracking, scalar evolution, etc.
Opinions?