[RFC] Improving IR fast-math semantics

This is a proposal to fix (or at least make less broken) the specification of fast-math flags in LLVM IR and to better rationalize what optimizations are allowed to do to floating-point arithmetic with and without fast-math flags or other attributes. Given the length of this RFC, I have opted to provide, at the end, a summary of the questions whose answers I am particularly interested in reaching consensus on.

Regular FP semantics

The semantics of floating-point in LLVM IR are generally underdefined, and I’m not going to attempt to fix all the issues now. However, it is necessary to understand what the baseline semantics are before saying how fast-math flags affect these semantics. Nothing I say in this section is intended to be a change to existing semantics, merely a summary of my understanding of semantics.

In the absence of any other annotations, floating-point in LLVM IR is largely intended to match the semantics of a C compiler that has FENV_ACCESS disabled, FE_SNANS_ALWAYS_SIGNAL undefined, and FLT_EVAL_METHOD set to 0 [1]. Furthermore, it is intended that the semantics of the major floating-point types comport to IEEE 754 semantics, albeit with the rounding mode assumed to be FE_TONEAREST, the FP exception word always being in an unspecified state, and NaN payload handling rules that are out of scope of this RFC.

What this essentially means is that if we have something like an fdiv instruction, and we constrain values to be in the set {any valid finite FP value, ±∞, one NaN value}, the optimizer must exactly preserve the resulting value in that set. It is not legal to do a transformation which could change the numeric result: a / b is not the same as a * (1.0 / b). We also assume that we can introduce or remove arbitrary FP operations at our pleasure, without regard for the environment.

A final note is that for the various intrinsics (and math library functions, though there does seem to be an emerging consensus on moving from libm functions to intrinsics), we should assume that any intrinsic corresponding to an operation in IEEE 754 §9.2 (or, equivalently, the table in C2x §F.3¶20), such as sin or pow, has unknown accuracy (and thus shouldn’t be constant-folded), while the other intrinsics (such as sqrt or rint) should be regarded as always returning the correctly-rounded result (for the default rounding mode, of course).

[1] This means our behavior on i386 when using x87 and not SSE is buggy, but given the difficulty of the fix, this is unlikely to be fixed anytime soon.

Fast-math semantics

The user goal of specifying fast-math flags is to override the above semantics of floating-point, trading numerical consistency for speed. Clang is somewhat unusual among C compilers in allowing these flags to be specified on a per-block basis, although the C standard’s pragmas for affecting floating-point computation permit a similar per-block granularity, and there is a TS that extends the minimal C pragma set to include many more pragmas for optimization control. As a result of optimization, it is possible for floating-point expressions to contain operations with different sets of flags enabled, and this calls for greater specificity in our semantics than we have historically provided.

Currently, we have 7 flags, with an 8th flag (fast) being the conjunction of the other 7. Because fast, which is set by the most general -ffast-math option (and therefore indicates a desire to perform any optimization one might consider acceptable in such a mode), is defined as nothing more than the conjunction of the other flags, we have to shoehorn every optimization into one of those 7 flags. And there are optimizations which don’t fit well under any of the existing flags, which has led to the undesirable situation of either pushing one of the flags well beyond reasonable user expectations or checking specifically for the fast flag (both approaches are used today).

I believe the flags fall into one of two categories. The nnan, ninf, and nsz flags all fall into what I call Value FMF. nnan and ninf remove NaNs and infinities, respectively, from the set of valid values and replace them with poison. nsz has somewhat more complicated semantics, but can be interpreted along the lines of removing both -0.0 and +0.0 from the set of values and replacing them with a single 0.0 of indeterminate sign.

The second category is what I call Rewrite FMF and comprises contract, reassoc, arcp, and afn. For these flags, the semantics cannot be described purely by defining allowable values, and instead they are meant to describe allowable transformations [2]. For example, contract enables the rewrite of (a * b + c) into fma(a, b, c), while arcp enables the rewrite of a / b into a * (1.0 / b). The current definitions have severe gaps, both in explaining which rewrites are considered permissible for these flags as well as in defining which flags are necessary in more complex expression trees involving multiple operations.

There is one more issue worth mentioning here. The way fast-math flags are currently being represented in LLVM instructions, there is room for exactly 7 bits to store the flags. All 7 bits are currently being used for flags; adding a new flag requires adding a new bit, and I do not see any easy place where a new bit can be scrounged for fast-math flags.

[2] I am not all that experienced in formal semantics. Nevertheless, I do find operational semantics much easier to reason about, and so the idea of rewriting-based semantics is somewhat terrifying to me. In my attempts to find academic work on fast-math transformations, the only relevant example I can seem to find is Icing: Supporting Fast-Math Style Optimizations in a Verified Compiler, which similarly uses rewriting-based semantics. I would greatly appreciate feedback from those more knowledgeable about formal semantics.

Transformations without flags

Some transformations are legal in the absence of fast-math flags, even in partial violation of IEEE 754. The most notable such transformations are identity transformations, which convert expressions like x * 1.0 to x. These are justifiable in our semantics by our treatment of sNaNs as largely synonymous with qNaNs. Additionally, the commutative law for commutative operations (such as addition and multiplication) can be freely applied [3]. See C2x§F.9.2 for some more examples of allowed and disallowed transformations.
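
In IR terms, a fold of this sort needs no flags at all; a minimal illustrative sketch:

define double @times_one(double %x) {
  ; May be folded to "ret double %x" without any fast-math flags, since
  ; quieting an sNaN is not treated as an observable effect here.
  %y = fmul double %x, 1.000000e+00
  ret double %y
}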

An additional concern is what assumptions we can make about math library intrinsics. For example, are we allowed to assume that -sin(x) is equivalent to sin(-x) without any fast-math flags?

[3] Friendly reminder: IEEE 754 makes FP addition and multiplication commutative (i.e., x + y == y + x), because the effect of operand order on which NaN payload is returned is not considered an observable effect in IEEE 754. It is the associative property (i.e., (x + y) + z != x + (y + z)) which is not generally held by FP. Please keep your terminology straight!

Fixing value fast-math flags

It has been noted in the past that the existing definition of nnan and ninf can be unexpectedly strong for users—poison is a potent tool for exploiting undefined behavior. Despite this, their existing semantics are clear and I do not propose to change them now.

The nsz flag is more interesting. The current definition is “Allow optimizations to treat the sign of a zero argument or zero result as insignificant”—implying nondeterministic choice. But the available nondeterministic choice is unclear. Is it:

  • If an {argument, result} would be -0.0 per IEEE 754, replace it with nondeterministic choice of {+0.0, -0.0}
  • If an {argument, result} would be +0.0 or -0.0 per IEEE 754, replace it with nondeterministic choice of {+0.0, -0.0}

The intent of nsz is clearly to enable optimizations that won’t correctly preserve the sign of zeroes, and one such optimization is converting x < 0 ? -x : x to fabs(x): in the former expression, the result is -0.0 if x is -0.0, whereas fabs(x) returns +0.0. However, this does bring up a hole in the current definition: the key case we care about in this transformation is the false case, where the only operation applied to x is the select itself, so we can only do the transformation when the select itself has the nsz flag. But select, and its cousin phi, are data-movement instructions that are as likely as not to come from some transformation like SROA, which lacks information as to what fast-math flags it should set on the new instructions, unlike the frontend, which knows what fast-math flags apply to all instructions in a block. In other words, as it stands, nsz is too weak to really enable the optimizations it wants to enable.
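
To make the hole concrete, here is a minimal IR sketch of the pattern (names hypothetical); under the current wording, the select is the instruction that has to carry nsz for the fold to fabs to be justified:

define double @abs_candidate(double %x) {
  %isneg = fcmp olt double %x, 0.000000e+00
  %neg = fneg double %x
  ; Folding this to a call to @llvm.fabs.f64(double %x) turns a -0.0 result
  ; into +0.0 in the false arm, so the nsz on the select is what licenses it.
  %r = select nsz i1 %isneg, double %neg, double %x
  ret double %r
}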

However, I don’t have a good solution to the problem here. The flag is essentially asking to be a different type—a nsz float is one that has no semantic difference between +0.0 and -0.0, unlike a regular float—but at the same time, it’s clear that it doesn’t meet the bar for a new type in LLVM IR. The paths forward that currently seem worth contemplating are to allow these flags to be provided on any operation that produces a float value (including things like load instructions), or to rely on function-level attributes and infer the flags from those; neither feels like a great model to rely on.

Fixing rewrite fast-math flags

These flags are the ones whose definitions need the most work, and I specifically intend to give a more thorough and precise definition of these semantics with this RFC.

For the purposes of giving semantics here, I am defining an expression tree as a set of LLVM instructions corresponding to the interior nodes of the computation tree. An expression tree has a given rewrite fast-math flag if and only if that flag is set on all instructions in that expression tree. In other words, the rewrite fast-math flags of a tree are the intersection of the fast-math flags on the operations that comprise the tree [4]. For example:

define void @f(ptr %mem) {
  %a = load double, ptr %mem
  %sqrt = call contract arcp double @llvm.sqrt.f64(double %a)
  %val = fdiv arcp reassoc double 1.0, %sqrt
  store double %val, ptr %mem
}

I can describe an expression tree here as consisting of the sqrt and fdiv instructions, and the fast-math flags on this expression tree are just arcp (the only rewrite flag in common).

When rewriting an expression, all of the nodes in the rewritten expression will have the fast-math flags of the original expression tree (i.e., the intersection).

An expression is mathematically equivalent to a rewritten expression if the two expressions would produce the same result if arithmetic were done with infinite range and precision. This is almost the same as saying the two are equal if the numbers were real numbers and not IEEE 754 floating-point numbers, except that the expressions must also arrive at the same value in cases for which the mathematical expression is not defined (e.g., 1.0/0.0, which produces +∞). If value FMF are also present on instructions within an expression tree, the rewritten expression need not produce the same result in the cases those flags exclude. For example:

define double @g(double %a, double %b) {
  %sqa = call nnan nsz double @llvm.sqrt.f64(double %a)
  %sqb = call nnan nsz double @llvm.sqrt.f64(double %b)
  %sq = fmul double %sqa, %sqb
  ret double %sq
}

define double @h(double %a, double %b) {
  %mul = fmul double %a, %b
  %sq = call nsz double @llvm.sqrt.f64(double %mul)
  ret double %sq
}

It is mathematically equivalent to rewrite g into h, as the nnan and nsz on the sqrt calls allow us to infer that a and b are non-negative, and sqrt(a * b) = sqrt(a) * sqrt(b) when a and b are non-negative. But it is not mathematically equivalent to rewrite h into g, as the flags are not enough to infer that a and b are non-negative.

[4] Note that this only applies for rewrite FMF. nnan, ninf, and nsz all continue to apply and impart semantics to individual instructions within an expression tree, and this can be sensibly done because they are defined with value semantics. Indeed, being able to apply different rules to these two categories is part of the reason I define these categories.

contract

The contract FMF corresponds to the intent of STDC FP_CONTRACT ON. Unfortunately, the C standard is not especially clear about the intent, and the current LLVM LangRef definition is even worse. My proposal for clarification of intent is as follows:

A FP expression tree with contract can be rewritten into another expression tree that is mathematically equivalent, and where the resulting expression tree is implemented with full precision and no intermediate rounding. In general, the resulting expression tree will be a single operation, but multiple operations are permissible if no intermediate rounding is done (e.g., an fneg operation may be a second operation).

For example, and most importantly, this allows the transformation of (x * y) + z into fma(x, y, z). The transformation of (x * y) - z into fma(x, y, -z) is also legal, even though this creates two operations in the resulting tree.
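
For concreteness, a minimal sketch of that first rewrite in IR (function names are illustrative):

define double @muladd(double %x, double %y, double %z) {
  %mul = fmul contract double %x, %y
  %add = fadd contract double %mul, %z
  ret double %add
}

; ...may be rewritten, carrying over the intersected flags, as:

define double @muladd.fused(double %x, double %y, double %z) {
  %fma = call contract double @llvm.fma.f64(double %x, double %y, double %z)
  ret double %fma
}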

Another example of an allowable transformation under contract is that cr_sin(x)/cr_cos(x) may be transformed into cr_tan(x) [5]. The provision that the operations are correctly rounded is mandatory here; if, as is usually the case, we do not know that the intrinsics are correctly rounded, then it is not legal to perform this transformation with the contract flag.

[5] This transformation does make me feel a little uneasy. But this is based on my reading of C2x§6.5¶8 and another example of FP_CONTRACT allowing (float)sqrt(x) to be converted to fsqrt(x), which suggests that <math.h> functions should be considered as contractible “operations”. If people disagree, I do not object to changing the rules here.

arcp

The existing Clang command-line flag says “Allow division operations to be reassociated” (which, incidentally, isn’t actually correct); the LLVM language reference says “Allows optimizations to use the reciprocal of an argument rather than perform division” (and the gcc command-line flag uses similar language). In expression form, this specifically permits the expression x / y to be replaced with x * (1.0 / y).

It’s not clear which other rewrites are possible under this flag. However, in the spirit of the flag, I think it is also reasonable to allow a / (b / c) to be converted to a * (c / b). At present, LLVM requires both arcp and reassoc to make this transformation, but I feel that reassoc is not necessary. GCC, with just arcp, will convert a / (b / c) to (a / b) * c, but I’m not convinced that’s the correct non-reassociating rewrite.

Another note is that I think these rewrite possibilities should be bidirectional. On the face of it this may seem to be a bad idea, but allowing the reverse direction would allow arcp reassoc to permit reassociation of division (i.e., a * (b / c) <=> (a * b) / c) without the need for any additional rule to combine flags.

In short, the proposal I have for arcp is as follows:

A FP expression tree with arcp allows the following rewrites to be performed, in both directions (a small IR sketch follows the list):

  • x / y ⇔ x * (1.0 / y)
  • a / (b / c) ⇔ a * (c / b)
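
A minimal sketch of the first rewrite in IR (names illustrative):

define double @quotient(double %x, double %y) {
  %div = fdiv arcp double %x, %y
  ret double %div
}

; ...is interchangeable, in either direction, with:

define double @quotient.recip(double %x, double %y) {
  %recip = fdiv arcp double 1.000000e+00, %y
  %mul = fmul arcp double %x, %recip
  ret double %mul
}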

reassoc

The purpose of this flag is to permit the application of the associativity and distributivity laws to FP math. That it permits distributivity is not clear from the current definition, which only talks about “reassociation”, but it is performed by InstCombine already, and is generally agreed to be part of the definition.

Existing optimizations have, in my opinion, abused the reassoc flag to handle optimizations such as pow(x, y)/x => pow(x, y - 1), or sin(x)/cos(x) → tan(x), in essence arguing that the flag applies to the definition of the underlying elementary functions. While the former case does have some logic to it, I am hesitant to endorse optimizations on the not-correctly-rounded libm functions using only this flag. There is a further complication when you introduce correctly-rounded functions like ldexp or sqrt.

Is ldexp(ldexp(x, a), b) → ldexp(x, a + b) valid with reassoc? (Note that it is not always valid, since one of the ldexps may produce a denormal number, creating inaccuracy, and there are other assumptions about overflow that may apply.)

Or how about sqrt(a) * sqrt(b) → sqrt(a * b) if a and b are both non-negative (or the nnan and nsz flags both present)?

In my own opinion, reassoc should be limited to fewer operations. The core operations of addition, multiplication, and fma are clearly permitted. Division and subtraction (when converted to multiplication by a reciprocal and addition of a negation, respectively) are additionally permitted when other flags or knowledge permit the necessary conversions. I have not come to a firm decision for ldexp or sqrt: ldexp is an exact-but-for-underflow operation, which does feel like it should be allowed in cases similar to addition or multiplication reassociation. Meanwhile, I’m more uncomfortable about sqrt, but it is a core IEEE 754 operation, in the same section as the 4 basic math operations, fma, and integer conversions, and it is correctly-rounded.

afn

The existing definition for afn is “Allow substitution of approximate calculations for functions (sin, log, sqrt, etc)”. As a semantic definition, this absolutely terrifies me: how approximate is approximate? Is the optimizer permitted to approximate sin(x) ≈ x, the famous small sine approximation in physics?

I’ve identified the following optimizations enabled by the flag in the current optimizer code:

  • AMDGPU converts fdiv x, y to fmul x, (rcp y) instructions
  • AMDGPU will convert some libm functions to intrinsics
  • PPC will convert, e.g., acos to __x1_acos (added by D101759, [PowerPC] Scalar IBM MASS library conversion pass)
  • pow(x, y) → exp2(log2(x) * y) if x is positive finite non-zero
  • pow(x, 0.5) → sqrt(x) when either afn or reassoc is present (sketched in IR after this list)
  • pow(x, n) → powi(x, n) if n is an integer (also → powi(x, n) * sqrt(x) if n is an integer + 0.5)
  • 1.0 / sqrt(b) → rsqrt(b) when both afn and contract are present
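
For illustration, the pow(x, 0.5) case above looks roughly like this in IR (a sketch assuming the standard llvm.pow/llvm.sqrt intrinsics, not the exact transform the optimizer emits):

%p = call afn float @llvm.pow.f32(float %x, float 0.5)
; ...is turned into:
%s = call afn float @llvm.sqrt.f32(float %x)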

Overall, I see this flag being used with two slightly different meanings. In the first, it permits a function to be replaced with a lower-accuracy version (mostly in line with the stated definition in the LangRef); this mostly comes from targets that have partial-accuracy instructions that could be selected. The other is largely about converting pow to hopefully-cheaper functions that might do the same thing in certain special cases, which seems somewhat more generalizable into allowing transformations on libm functions that would be mathematically correct if floats were real numbers.

My belief is that a vague use-lower-accuracy definition is a poor definition for IR. I would prefer to see a proposal like Andy’s precision attributes to be able to choose between different implementations on the performance-accuracy optimization frontier. If such a feature is added, then we don’t need this flag anymore, allowing its bit to be reclaimed.

bikeshed

As alluded to in discussions of reassoc and afn, there is a need for another flag which does not currently exist, one that would allow generic mathematical rewrites that are not otherwise covered by other flags. I do not have a good name for this flag, so I title this section bikeshed instead. I would map this flag to the -funsafe-math-optimizations command-line option (which currently maps instead to fast).

Unlike contract, this new flag would permit the rewrite of an expression tree into another expression tree with more operations. It would also permit the optimization of non-correctly-rounded libm functions (such as sin or pow).

As for interactions with other flags, I cannot come up with any good semantics definition of bikeshed that would not make bikeshed imply the other rewrite flags. Rather than attempting to do so, I would simply make it a requirement that any instruction with this flag include all other rewrite flags. It is, however, possible to come up with semantics for bikeshed with and without the value flags: bikeshed nnan ninf nsz would require that the two expressions be equivalent for all real numbers, while bikeshed nnan nsz would require they be equivalent for all extended real numbers. Similarly, bikeshed without nsz would require that -0.0 and +0.0 work appropriately in both expressions, while bikeshed without nnan requires that the expressions be equivalent should NaNs be involved. This can even be extended to expression trees which do not consistently have flags, if the analysis is careful enough.

Questions for discussion

  • If we need more FMF, where do we get the bits from?
  • Are rewrite-based semantics for FMF acceptable? Are they avoidable? Can we develop something like Alive2 to validate FMF-fueled optimizations with rewrite-based semantics?
  • What assumptions can we make about libm intrinsics/libcalls?
    • Do even/odd properties apply?
    • Can we assume special cases?
    • Can we assume exact values that are not special cases?
    • Would correctly-rounded versions change any of these answers?
    • Which fast-math flags change these answers?
  • What operations other than fma(-like) are contractable with contract?
    • Specifically, does pow(x, 0.5) → sqrt(x) constitute a valid contraction?
  • Should arcp without reassoc permit a / (b / c) → a * (c / b)?
  • Which operations does reassoc apply to beyond addition/multiplication? sqrt? ldexp? pow?
  • What flag should be necessary to convert sqrt(fpext x to double) → fpext sqrtf(x) to double?
  • What flag should be used to enable generic algebraic transforms along the lines of gcc’s -funsafe-math-optimizations, e.g., sin(x)/cos(x) → tan(x) or log(exp(x)) => x? If a new flag, what should its name be?
  • Should afn be retired in favor of better precision control for functions?
  • What should be done to handle the nsz semantics weakness?
  • Should expression rewrites that might introduce overflows (such as sqrt(a) * sqrt(b) → sqrt(a * b)) be permissible?

It’s going to be hard to get any consensus in a wide-ranging discussion like this; we can’t change everything at once. I guess that’s fine as far as it goes, but please be prepared to split off separate RFCs for specific points.


Maybe we shove the FMF bits into metadata? If I’m remembering the structure of metadata correctly, this should be basically free… but if it isn’t, there are probably some tricks we can pull.

For reassoc, we intentionally allow transforms that completely destroy precision; I don’t know of any way to quantify which precision-destroying optimizations apply.

Alive2 can probably be coerced to validate that a given tree of instructions produces a result within a few ulps of another tree of instructions, if that’s useful.

This seems too aggressive; I mean, most reasonable libc implementations probably canonicalize even/odd for most ops, but I don’t think anyone promises this.

All special cases are explicitly defined in the standard; if your libc doesn’t follow them, it’s broken. That said, I think we saw an issue with mingw sqrt at one point.

In practice, libcs don’t produce exact values for some exact cases; in particular, it’s hard to make pow() produce exact values.

The fact that the source is a float doesn’t really seem relevant… sqrt(2) should produce the same result no matter how you represent “2”.

See Instcombine incorrectly transforms store i64 -> store double · Issue #44497 · llvm/llvm-project · GitHub for discussion of “classifications” of floats.

I’m not sure we can do much more than what we’re currently doing. We could maybe add metadata to “load” to indicate it’s allowed to mess with the sign of zeros.

This doesn’t seem worse than anything else we’re doing.

If pow() is correctly rounded, then these are equivalent, ignoring the zero special-case.

If pow() is not correctly rounded… then this is basically equivalent to asking whether we can constant-fold pow.

I haven’t thought it through completely, but I think Joshua’s concept of mathematical equivalence has some potential here. If you can promote the types in both the original expression and the transformed expression to a higher precision, you can eliminate things like overflow from multiplication. It may not be practical to evaluate transformations in a tool like Alive2 for types higher than single precision, but as a semantic rule, invoking the concept of infinite precision seems like it makes for a sound rule.

As you say, we allow transforms that completely destroy precision, but they wouldn’t completely destroy precision if evaluated at infinite precision, so you can reason about that. The reassoc flag really should be accessible and easy to reason about for anyone who passed high school algebra if we constrain it in the way Joshua has proposed.

That may be true for sqrt(2) because 2 can be represented exactly, but it isn’t true for sqrt(0.1) – Compiler Explorer.

There are reasons why the transformation Joshua described is useful. We already do that for sqrt(), but not for other functions like cos() – Compiler Explorer

The question here is specifically about whether contract allows this. I would say afn allows it. If we can assume that sqrt is correctly rounded (which is another question), I think a case can be made that contract allows this even if pow() is not correctly rounded. I’m a bit uncomfortable about that extension of the use of contract but it seems consistent with the direction the C standard is moving with regard to its FP_CONTRACT definition and within the bounds of what it has said since the FP_CONTRACT pragma was introduced.

Sorry, I don’t have much time right now to read your proposal and give meaningful feedback (not until the semester ends at least).
But just wanted to say that I’m usually happy with taking whatever semantics you guys want and implementing that in Alive2. Some things are really complicated to implement efficiently, like reassoc, and that’s why I haven’t implement it yet. But in general we’ve been able to implement most things in Alive2.
Just describe things in English. I read them, ask questions, and eventually we reach an agreement about the formal definition. Don’t worry too much with the details.

(ideally, Alive2 would read the semantics of LLVM IR from a file, say LangRef. That’s on the TODO list, but I need a new PhD student to work on that as it’s a big project :sweat_smile:)


It’s going to be hard to get any consensus in a wide-ranging
discussion like this; we can’t change everything at once. I guess that’s
fine as far as it goes, but please be prepared to split off separate
RFCs for specific points.

I’m aware this is a chonky RFC. But my hope is we can get some agreement
on how to think about fast-math flags first, and then do some work to actually get each flag’s semantics worked out individually. The tricky thing is that trying to push rules across the different flags’ semantics requires discussing all of the flags at once.

For reassoc, we intentionally allow transforms that completely destroy
precision; I don’t know of any way to quantify which
precision-destroying optimizations apply.

Equivalent “as if evaluated with infinite range and precision” is
something that I think is workable, although algebraic equivalence is a
pretty thoroughly undecidable problem, and I’m not a great connoisseur
of computer algebra system literature. That said, I am going to try
poking around with sympy over the next few days to see if I can get it
to catch existing transformations, and maybe something about LLVM IR.

This seems too aggressive; I mean, most reasonable libc
implementations probably canonicalize even/odd for most ops, but I
don’t think anyone promises this.

I forgot to mention, but C2x§F.10¶3 does read to me that even/odd
symmetry is guaranteed, and I’ve observed that gcc does do optimizations
assuming even/odd symmetry… even in -O0.

In practice, libcs don’t produce exact values for some exact cases; in
particular, it’s hard to make pow() produce exact values.

I’m well aware that pow is extremely difficult to get correct for exact
values; for most of the other functions, the set of exact values tends
to be an extremely small set. So something like sinpi(1/2)…? This is
mostly some

The fact that the source is a float doesn’t really seem relevant…
sqrt(2) should produce the same result no matter how you represent “2”.

sqrt(2) should be 0x1.6a09e667f3bcdp+0 as a double, and is 0x1.6a09e6p+0
as a float. So doing the sqrt with float-precision and then extending to
a double gives you different results from doing sqrt with double-precision.

(This is one of the transformations we actually do today, and it applies
if you use the fast flag on the method call.)

I’m not sure we can do much more than what we’re currently doing. We
could maybe add metadata to “load” to indicate it’s allowed to mess
with the sign of zeros.

It’s been discussed whether we can add fast-math flags to instructions like loads and stores. Another option that’s been discussed is having SROA use function-level attributes to infer the flags on new PHI instructions.

I mean… if you follow the rule of “all operations with float inputs should be transformed to float operations” you end up making Newton-Raphson iteration impossible. Doesn’t seem like a good rule to me; if someone explicitly converted from single-precision to double-precision, presumably they did it for a reason. (Unless we’re talking about C code where it’s easy to accidentally use the wrong function, but we have -Wdouble-promotion for that.)

Whether we can do this and whether we should do it are separate questions, of course. If you’re converting the result back to single precision, I think it makes sense. If not, that’s an awful lot of precision to lose on a fast-math optimization, and not just in a few edge cases.

ldexp is just a fancy multiply and I expect it to be treated the same as fmul (and we do turn ldexp by a constant into the equivalent fmul). I view part of the point of reassoc as permitting the elimination of overflow/underflow that would appear in intermediate operations.
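
For instance, an illustrative sketch (assuming the llvm.ldexp intrinsic):

%y = call double @llvm.ldexp.f64.i32(double %x, i32 3)
; ...is already folded today to the exactly-equivalent power-of-two multiply:
%z = fmul double %x, 8.000000e+00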

To give another example of something we currently do: AMDGPU has 1.0 / sqrt(x) → an expansion involving a hardware rsq instruction that improves ULP

We do already have !fpmath metadata, and AMDGPU makes use of it to switch sqrt/fdiv/rsq lowering decisions (although we kind of have to kludge it into codegen since it doesn’t survive to the DAG). Clang also only emits it specifically for fdiv and sqrt intrinsics for OpenCL

My interpretation for contract has been to permit any precision increasing transformations (compared to the real mathematically infinite result)

Another issue I have with nnan and ninf is they are too powerful, since they apply to both the inputs and the outputs. I would prefer it if we had separate flags for the inputs and outputs. When writing out the computeKnownFPClass handling for all the intrinsics, I ran into cases where really I wanted “nooverflow” or nnan_result, but did not need to break the handling of non-finite inputs.

Another wishlist flag I’ve had is to ignore denormal handling


The reason I proposed the new mechanism for specifying function accuracy in the IR is that the fpmath metadata only allows relaxing the required accuracy. It can’t constrain the accuracy because as metadata it can be dropped. For cases like OpenCL where it is attached to fdiv and sqrt to indicate that 2.5 ulp error is allowed this only works if the default behavior is at least as accurate as the accuracy described by the metadata.

The afn fast math flag has no clear semantics as far as I’m concerned. It basically just says it’s OK to change the numeric result of the function call without providing any guidance as to how much change is allowed. In cases where we don’t know anything about the accuracy of the underlying implementation, this is almost unavoidable.

Suppose we made a rule that afn allows me to introduce a 4 ulp error relative to the original implementation. OK, so now I’m replacing a call to llvm.sin() with some approximation sequence. How accurate does that need to be? I have no way to say without knowing how accurate the platform default implementation would have been.

Alternatively, we could describe the error that afn allows in terms of error relative to the correctly-rounded result for the function being described. That’s better, but we may still have a problem if the platform default implementation doesn’t meet that bound. For instance, let’s say we decide that afn approximations must be accurate to within 4 ulp relative to the correctly rounded result. Now I might have IR like this:

%0 = tail call afn float @llvm.powi.f32.i32(float %x, i32 %n)

If I’m targeting OpenCL, the OpenCL specification only requires 16 ulp accuracy for this function, but I have no way of knowing that from the IR.

Another issue I have with nnan and ninf is they are too powerful since
they imply the inputs and outputs. I would prefer it if we had
separate flags for the inputs and outputs. When writing out the
computeKnownFPClass handling for all the intrinsics, I ran into cases
where really I wanted “nooverflow” or nnan_result, but did not need to
break the handling of non-finite inputs

I can definitely see the wisdom of a nooverflow, whose semantics would
be along the lines of “if the operation would raise the division by
zero, overflow, or invalid exceptions (except by reason of sNaN), the
result is poison” which is I think a more succinct way of covering all
the various exception conditions than trying to say “produces infinity
or NaN except when they were an input”. This is definitely the kind of
flag I could see recommending over existing nnan/ninf flags, although I
think they still have their uses in some circumstances.

Another wishlist flag I’ve had is to ignore denormal handling

What do you mean here?

The afn fast math flag has no clear semantics as far as I’m
concerned. It basically just says it’s OK to change the numeric result
of the function call without providing any guidance as to how much
change is allowed. In cases where we don’t know anything about the
accuracy of the underlying implementation, this is almost unavoidable.

I’ve considered a definition along the lines of “use a hardware
instruction that provides a lower-accuracy implementation of the
function,” although that is still almost unusably vague–it’s
meaningless to any IR-level optimizations, and even for selecting
instructions, is something like VRSQRT14SD sufficiently accurate to
replace a call to rsqrt, or would you rather do some Newton-Raphson
using that instead? (As pointed out earlier, the definitions I give
don’t allow room for Newton-Raphson in any of the existing flags).

It also works because the operations are correctly rounded. Ultimately, assuming any undecorated implementation is correctly rounded would be good, and we’re sort of stuck where we are because system libraries are kind of garbage. In an ideal world we could just provide all the implementations and ignore that they exist.

AMDGPU has a handful of instructions that ignore the denormal mode (mostly for F32), and just flush. In particular one case I’ve been looking into recently is for atomicrmw fadd. The two interesting cases are fmac (the non-fused fmul+fmac) and the atomicrmw case. So some bit that just indicates if a denormal appears it’s OK to treat it as 0. Another interesting case is the exp2 instruction doesn’t handle denormals

This is something I’ve talked about with the SYCL folks. Should we really be assuming that we know anything about the results of any operation? For example, suppose we’re compiling for a device that doesn’t have a native correctly-rounded fdiv instruction. Does that mean that we should be generating emulation to get the right result for undecorated fdiv instructions in the IR? I don’t know if they had an actual target device in mind or if this was just a theoretical discussion for how to interpret the IR when the target device is unknown at compile time, but it seems like a valid question that is likely to have a real-world application sooner or later.

It seems like for such a hypothetical target, it would be better to say that undecorated fdiv means whatever makes most sense for the target and the optimizer isn’t allowed to do anything that would change the result. Of course, that would potentially mean that we wouldn’t be allowed to constant-fold fdiv so it’s probably not a good general behavior.

This is even relevant now (for all basic operations) if anyone cares about numerical consistency for 32-bit x86 targets with no SSE support since, as Joshua mentioned in footnote [1] of his original post, the x86 backend does in fact take a “whatever makes most sense for the target” approach in such cases. Strict semantics would require introducing intermediate instructions to truncate results all over the place, but we don’t do that because it would kill performance.

When I was proposing the intrinsics and attributes to specify allowed/required accuracy, I knew that it would be relevant to fdiv because of the OpenCL case, but I was surprised to get feedback from some FPGA people who were interested in also using it for things like fadd and fmul.

Emphatically yes. The IR has to have a specific, semantically defined meaning to be useful. It is not useful to have a wishy-washy meaning that tries to accommodate bad hardware. The non-broken IR constructs we have all move in the enable-optimizations direction (e.g. most attributes) with a conservative default.

The tradeoff here is now no target gets any optimizations because the definition was weakened to the point of uselessness. At this point you can just use a target intrinsic.

The solution here is to buy a computer from this millennium. Nobody is ever going to work on this and this problem will never be solved, and that’s OK.

I completely agree with you for the x87 case. I have no intention to advocate for a solution to that, including writing code to fix the current state of the backend to generate correctly rounded results.

I can definitely see your point on the assumption of correctly rounded basic operations for undecorated IR, and I’m not aware of any modern hardware for which it will be a problem. It just seems like we’re trending towards offload devices and data types that care more about speed than accuracy, so I think we need a good way to represent that. In the SYCL specification there are built-ins that represent “native” accuracy with essentially no constraints on what that means – it’s just whatever the target device has instructions for. I think that’s useful, but I wouldn’t know how to represent it in LLVM IR apart from an intrinsic.

I’d like to add something here that has come up in a couple of different contexts – what do fast-math flags mean when they are attached to intrinsic calls?

I was talking to @phoebe yesterday about Issue #82813. That issue is a problem with the X86 backend trying to apply a non-standard meaning to the reassoc flag when it is attached to the llvm.vector.reduce.fadd intrinsic. The definition of the llvm.vector.reduce.fadd intrinsic says, “If the intrinsic call has the ‘reassoc’ flag set, then the reduction will not preserve the associativity of an equivalent scalarized counterpart.” That may or may not be overloading the meaning of reassoc a bit (more on that below), but it’s not entirely unreasonable.

The problem is that the clang handling of the __builtin_ia32_reduce_fadd_ps512 is setting the reassoc flag unconditionally (that is, without a corresponding command-line option triggering it) and expecting that the X86 backend will understand that this means it should generate instructions that match the semantics described in the Intel Intrinsics Guide for the _mm512_reduce_add_ps intrinsic. This is necessary because the default semantics of llvm.vector.reduce.fadd don’t match the semantics of the Intel intrinsic. It works at the moment, but it’s wrong. The X86 backend has designed a sort of secret handshake where it will produce the expected order for the reduction when the reassoc flag is set, but the semantics of the reassoc flag actually indicate that any order can be used.
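
Roughly, the IR involved looks like this (a sketch; the start value shown is a placeholder):

%sum = call reassoc float @llvm.vector.reduce.fadd.v16f32(float %start, <16 x float> %v)
; Per the LangRef, reassoc here means any association may be used; the secret
; handshake is that the X86 backend nonetheless produces the particular tree
; order documented for the Intel intrinsic.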

I’m certain that we’re misusing the reassoc flag in this case, but it raises the question of what exactly the fast-math flags mean when they are attached to an intrinsic. Do the fast-math flags have a distributive property that lets them be transferred to the internal operations of intrinsics to which they are attached, or do the flags only apply to the intrinsic as an atomic operation?

This same question came up with less clarity in the code review for PR #90434. There, we have an llvm.fmuladd intrinsic with the contract flag set and we want to canonicalize it as separate fmul and fadd with the contract flag set. Should this be allowed?
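
The canonicalization in question is roughly (a sketch of the direction proposed there):

%r = call contract float @llvm.fmuladd.f32(float %a, float %b, float %c)
; ...canonicalized to:
%mul = fmul contract float %a, %b
%r2  = fadd contract float %mul, %c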

As @jcranmer suggested on that code review, this is a good topic for discussion at the LLVM Floating Point Working Group meeting scheduled for next Wednesday. I’d like to use that meeting as a place to discuss this entire RFC and hopefully break off manageable chunks and drive them to implementation, and this question about fast-math flags on intrinsics seems like a good place to start. For those who may not be aware, details of the LLVM Floating Point WG are available here.

I think this is clear. They cannot change the lowering of the intrinsic, and intrinsics are identical to ordinary instructions. The fast-math flags are context hints for the input and result; they are not additional operands to the intrinsic that imply anything about the intrinsic’s internals.

That would be my interpretation as well, but we seem to have taken a different approach with the llvm.fmuladd intrinsic. I think there are probably also cases where we’re taking similar liberties with other intrinsics that have obvious internal workings, such as pow.

Or suppose we had something like this:

%x = call reassoc float @llvm.tan.f32(float %a)
%y = call reassoc float @llvm.cos.f32(float %a)
%r = fmul reassoc float %x, %y

Could we transform that to

%r = call reassoc float @llvm.sin.f32(float %a)

?

Maybe we shove the FMF bits into metadata? If I’m remembering the
structure of metadata correctly, this should be basically free… but if
it isn’t, there are probably some tricks we can pull.

I’ve started doing some poking around, and I’m not entirely sure what
you mean here. Metadata is stored in the context with (effectively) a
DenseMap<Value *, SmallVector<MDAttachment, 1>>, and a quick perusal
of how that code works doesn’t suggest that there’s an easy fast-path
way to make querying FMF cheap (considering right now it pulls from
Value::SubclassOptionalData directly).

Within Value itself, there’s 7 bits for SubclassOptionalData, 16
bits for SubclassData, and 32 bits distributed between
NumUserOperands (currently at 27 bits) and various other flags. If I’m
accounting for all the uses of SubclassData correctly, bits 13 and 14
are currently unused by any subclass of Instruction, and bit 12 is only
used by AtomicRMWInst which can’t have FMF right now. The calling
convention field of CallBase consumes bits 2-11, so there’s no way to
squeeze more than 3 bits out of Instruction’s SubclassData, but I’m
not sure that’s worth trying to do anyways.

Another thought I had, staring at Value, is that we don’t have a clear
hard limit on the number of operands. If we shrunk NumUserOperands
down to 16 bits, we could reorganize the layout of Value so that it
looks roughly like this:

class Value {
   const unsigned char SubClassID;
   unsigned char HasValueHandle: 1;
   unsigned char IsUsedByMD : 1;
   unsigned char HasName : 1;
   unsigned char HasMetadata : 1;
   unsigned char HasHungOffUses : 1;
   unsigned char HasDescriptor : 1;
   // two free bits left in the second byte
   unsigned short SubclassOptionalData;
   unsigned short SubclassData;
   unsigned short NumUserOperands;
   Type *VTy;
   Use *UseList;
};

This gives us two free bits for more bitfields we might want, extends
SubclassOptionalData from 7 to 16 bits, and probably makes
SubclassOptionalData and NumUserOperands slightly cheaper to access
(by removing any need for bitfield masking). The downside is we limit
the number of operands to about 65,536, which might be getting too tight
(I can see switch instructions potentially having a very large number of
operands).

A large snip and my comments.

It seems to get lost with my [quote] and [/quote]s ?!?

I would not support pow(x, y)/x => pow(x, y - 1). I fear loss of significance in the argument y - 1.
I support sin(x)/cos(x) → tan(x) {both ways, in case one already has a sin(x) or cos(x) around.}

ldexp is precise in all circumstances except underflow and overflow. And in the example above ldexp would overflow or underflow the same.

I would support this, I doubt Kahan would.

If you know x < 2^-27, sure; otherwise, no.

Dangerous:: the intermediate stage where one calculates log2(x) * y needs around 32 bits (float) or 64 bits (double) of fraction so that the exponentiation does not lose precision.

In my own architecture and implementation, exp2(log2(x) * y) is less accurate and slower than pow(). So, if you are going to do this it should have its own flag.

agree with pow(x, y) → powi(x, (int)y) when y == (int)y

I am not so sure in the y == (int)y+0.5 case. In my implementation it would be both slower and less accurate.

Absolutely support this.

I think a better question is what assumptions we can make about Fortran intrinsics.
{Fortran has a much better thought-out numerical specification–and many of your
transformations and/or questions about same are handled with parentheses}

 SIN(x)/COS(x) -> TAN(x) is allowed,

whereas:
SIN(x)/(COS(x)) or (SIN(x))/COS(x) cannot be so transformed.

IEEE 754 would think so, but I agree that this is probably a bridge too far.


  • Can we assume special cases?
    • Can we assume exact values that are not special cases?
    • Would correctly-rounded versions change any of these answers?


Say an implementation intrinsic has an accuracy of 0.50002 ULP in RNE !!
Do you consider this correctly rounded ??
{It is really a question of where do you draw the line–and so is optimization}

Obey Fortran parentheses, and without them it’s all fair game.
The problem you are having is:: “what set of flags is equivalent to Fortran parentheses” !!

sqrt(x) is undoubtedly faster and more precise than 99.44% of libms.

I would say no, unless both flags are set.

ldexp() is precise (correctly rounded because it is an integer add to the exponent and some range checks.)

pow with integer exponent can be converted into multiplication sequences.

  • What flag should be necessary to convert sqrt(fpext x to double) → fpext sqrtf(x) to double?
  • What flag should be used to enable generic algebraic transforms along the lines of gcc’s -funsafe-math-optimizations, e.g., sin(x)/cos(x) → tan(x) or log(exp(x)) => x? If a new flag, what should its name be?

The equivalent to Fortran parentheses.

IEEE would have been a much better numeric system if, instead of having 2 zeros, it had 3.
-0 stands for underflow of a negative calculation
0 stands for true zero
+0 stands for underflow of a positive calculation

But that ship sailed long ago.

The equivalent to Fortran parentheses.

And yet, it has been happening since 072b03c.

This raises two questions for me.

  1. Is reassoc the correct flag to enable a transformation like this?
  2. Do we need some way to distinguish between common and aggressive optimizations?

Joshua is raising the first of these questions here, and I believe he has created a somewhat extensive catalog of optimizations we are currently performing and the flags we are checking for them.

I generally agree with what you’ve said here. I think fma, sqrt, and ldexp all hinge on the question of what it means when a fast-math flag is attached to an intrinsic. I raised this question above, and as I said there I’d like to discuss it at the Floating Point WG meeting that’s on the schedule for tomorrow. It seems like a simple question, but when you think about intrinsics like fma, sqrt, and ldexp that have clearly defined inner workings, there is a definite temptation to want to apply the flags to the operations implied by the intrinsic.

There are definitely some transformations we’d like to allow on intrinsics, but I tend to think afn is a necessary flag to enable those transformations.

We introduced the llvm.arithmetic.fence intrinsic to support Fortran-like parentheses in C/C++ with -fprotect-parens

https://godbolt.org/z/hx4xzh4GE

Unfortunately, flang doesn’t appear to use it.
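
For reference, a rough sketch of what that intrinsic looks like in IR (names and flags illustrative):

define double @paren_sum(double %a, double %b, double %c) {
  %ab = fadd reassoc double %a, %b
  ; The fence keeps fast-math rewrites from combining across the
  ; parenthesized (a + b).
  %p = call double @llvm.arithmetic.fence.f64(double %ab)
  %sum = fadd reassoc double %p, %c
  ret double %sum
}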