Questions about llvm.canonicalize

andykaylor · June 12, 2024, 8:08pm

The reason I would like to see some control on constant folding denormals is that it can lead to observable numeric differences if FTZ/DAZ is set in the execution environment. Here is an example where constant folding after inlining leads to different results than you’d get without inlining:

https://godbolt.org/z/hqP6ETdzr

Note that in this example the expression FLT_MIN/2.0f is initializing an object with static storage duration and is constant folded by the front end to a denormal value. I think this is consistent with the C standard. If the variable isn’t static, clang will insert a constrained fdiv here, but the current LLVM code constant folds it. That should not be happening since I’ve enabled FENV_ACCESS.

I used -ffp-model=strict in my example above to avoid questions about whether I am allowed to have set FTZ and DAZ in the code. If I’m compiling a library, those flags can have been set by the calling program, and some libraries will want to be able to handle that case without using -ffp-model=strict.

BTW, I am not aware of a compiler that doesn’t constant fold the example above, but I don’t think we should let that stop us from being the first to get it right.

What I’d like is for the constant-folder to be aware of the “denormal-fp-math” attribute. If that attribute is set to “ieee” we should constant fold denormals as we currently do. If that attribute is set to “preserve-sign,preserve-sign” or “postivie-zero,positive-zero” we should flush denormals to zero and fold accordingly. If it is set to “dynamic,dynamic” we shouldn’t constant fold denormals.

jyknight · June 12, 2024, 11:09pm

My inclination is that we should not do this. I think we should expose “dynamic” as a command-line option (which we don’t, currently), but that we should not make it the default.

Doing so would pessimize the performance of normal code, for the dubious benefit of getting more consistent results in code which has explicitly asked for inconsistent results via enabling “fast-math”.

andykaylor · June 12, 2024, 11:44pm

jcranmer:

There are, I think, two ways that you can look at the role of a
programming language like C or LLVM IR (very different languages, though
I think the trend is the same):

You can see the language as a way to express the behavior of the
assembly you’re describing, and if that behavior varies among platforms,
well, your code varies among platforms.

Or you can see the language as a way to express a single behavior
independent of the hardware details, and the process of lowering to
hardware assembly requires generating extra code to get that behavior if
necessary.

The dominant trend has definitely been to move in the direction of the
latter, although it is further behind for floating-point (and not helped
by the fact that there is a stronger desire for
speed-at-the-expense-of-correctness for FP than there is for other
operations).

I agree that there is a trend towards the latter, and I think that’s the world we would all like to live in, but users have a nasty habit of not always wanting the standard behavior. So we add command line options, pragmas, fast-math flags, and other things to let users have their off-label use cases supported. The degree to which this happens is one of the significant differences between clang and gcc, I think. There are more “extensions” in gcc, and getting support for such non-standard behavior seems to involve a lot more gritting of teeth and grumbling in the clang community. I understand that there are very good reasons for not wanting to diverge from established standards, but there are also very good reasons for wanting to accommodate real-world use cases.

That’s OK, except that constrained intrinsics don’t fix all the denormal handling issues. I was a bit surprised to discover this when constructing the example I linked to in my previous comment. However, after thinking about it I realized why.

We call APFloat to perform constant folding. If we’re folding a constrained operation, we check the exception flags and only fold if no exceptions are raised by the operation. The problem here is that constant folding to a denormal result doesn’t raise an exception according to APFloat, but performing the same operation with FTZ (but not DAZ) enabled does raise an exception. Performing the operation with both FTZ and DAZ set does not raise an exception.

The result of this is that while we won’t constant fold 1.0f/10.0f in strictfp mode, we will constant fold FLT_MIN/2.0f.

andykaylor · June 12, 2024, 11:48pm

I have come around to agreeing with this perspective over the course of this discussion. We should assume IEEE denormal behavior by default. That’s consistent with our general philosophy and I think it is what most users would want.

I’m still not happy with where we are in the case where the user has explicitly used the -fdenormal-fp-math=dynamic command line option. I really think we should try to avoid any assumptions about denormal behavior in that case. To me, that would include not constant-folding denormals and not optimizing away x * 1.0. Otherwise, we’re giving users an option that sometimes, maybe even usually, but not always, does what it says.

RalfJung · June 13, 2024, 6:43am

That makes this an entirely different question. With strict mode I agree you should get strict behavior – for SNaNs and denormals and everything else. (It seems like that is currently buggy?)

I was specifically commenting on the proposal (as I understood it) to treat denormals different from SNaN, and pessimize optimizations around them even in non-strict mode.

Yeah that’s where things go wrong. All the code needs to be compiled in strict mode to make this work. Again this is no different between SNaNs and denormals.

Having to honor the dynamic FP environment is an extreme pessimization for the vast majority of code, so code that wants to honor the dynamic FP environment needs to explicitly opt-in to that.

(Really the fundamental reason is that global mutable state is the arch enemy of any kind of compositional reasoning. The FP environment is a prime example of global mutable state, and compiler optimizations in a library need to rely on compositional reasoning.)

This boils down to what kind of divergence from standard behavior is supported by strict mode. In the extreme case where hardware may have flags that arbitrarily change the result of FP operations, absolutely nothing could be const-folded. My understanding is that currently, strict mode only supports the range of behaviors explicitly described in the IEEE 758 spec. The request here is to furthermore make an affordance for the non-standard but common case of hardware producing “wrong” (according to IEEE) results for denormals. Given how common that is, that does seem reasonable to support. FWIW, according to the docs, " The -ffp-model option does not modify the fdenormal-fp-math setting", so I would expect that you have to set that as well – but even then clang still seems to optimize this division: Compiler Explorer.

Strict mode could range from “expect hardware to only do things permitted by IEEE” (honoring e.g. exception state but nothing funny with denormals), “common extensions to the standard” (denormals), all the way to “completely non-standard behavior”. It seems like currently, the flag has the first meaning (expect hardware to stay within what IEEE describes), and a dedicated flag is needed to get the second (also expose hardware behavior for denormals) – and furthermore that flag doesn’t actually work as expected.

Also, “strict” seems like a misnomer to me. It’s more about exposing the underlying hardware behavior, and in a sense being less strict about the assumptions that LLVM makes about the hardware (and the FP environment).

andykaylor · June 13, 2024, 6:47pm

Yes, I suppose the question is whether we need to allow for differences in behavior with regard to FTZ/DAZ without strictfp mode. The fact that “denormal-fp-math” was introduced as function attribute and not a modifier on the constrained intrinsics led me to the conclusion that we were leaning towards different treatment for this case. That makes sense for architectures that just don’t support denormals, I also think supporting this mode is a bit less intrusive than even something like allowing dynamic rounding mode (which I feel is more similar than sNaNs).

The point of my example was to show that when denormals are being flushed to zero, our current constant folding behavior can lead to differences in numeric results. This is something we’re currently a bit sloppy about, and not just with denormal flushing. I have a similar complaint about constant evaluation of math library functions.

From a theoretical perspective, I’m willing to admit that allowing for dynamic FTZ/DAZ mode is an instance of the types of things we allow only under the strict FP model. From a practical perspective, I have to deal with customers who just want us not to optimize away explicit operations in their code while otherwise optimizing the code well. I don’t want to be in a position where I tell them they have to use full strict mode and they tell me they’ll need to use a different compiler.

In this sense, -fdenormal-fp-math is a subset of full strict mode in the same way that -ffp-exception-behavior=[strict|maytrap] is. We do implement -ffp-exception-behavior=[strict|maytrap] using constrained intrinsics, but that’s an implementation detail. The simple fact that we want to implement restricted denormal handling doesn’t necessarily mean that we need to use constrained intrinsics to do so. We could, but that’s an implementation detail.

What I am asking for is a way to honor a dynamic denormal mode and still optimize well. I’m open to suggestions as to how to do that. My suggestion is that we honor "denormal-fp-math"="dynamic,dynamic" by not eliminating x * 1.0 and not constant folding denormals when this attribute is present. I wouldn’t call that “an extreme pessimization” and I’m not even asking for it to be the default behavior.

Yes, I wrote that particular bit of documentation, and it reflected my evolving understanding of what the “denormal-fp-math” setting meant. The background is that we used to set "denormal-fp-math"="preserve-sign,preserve-sign" with fp-model=fast on x86 targets because we were linking “crtfastmath.o” when the fast model was used. When @jcranmer fixed the problem where linking that file to shared libraries had an unwanted global effect, we discussed the fact that even when fast-math was used we really had no way of knowing whether FTZ/DAZ would be set for any function we were compiling. So we stopped setting “denormal-fp-math” with fp-model=fast. At the time, I didn’t fully consider fp-model=strict, though I believe the clang driver code had been setting “denormal-fp-math” to “ieee,ieee” for both the strict and precise models.

The proper behavior seems target-specific. For x86-based targets, I think we should be setting "denormal-fp-math"="dynamic,dynamic" for strict and fast models (because we could be flushing to zero in either case), and "ieee,ieee" for precise (because we are assuming the default FP environment in that case).

I think it is “strict” in the sense of strictly honoring IEEE-754 requirements when FENV_ACCESS is on. I’m particularly thinking of section 10.4 which talks about “Applying the properties of real numbers to floating-point expressions only when they preserve numerical results and flags raised.”

RalfJung · June 14, 2024, 6:30am

I don’t see this as being sloppy. For denormals, if anything this is the hardware being sloppy and producing wrong results (when using the IEEE spec as the baseline). LLVM can provide ways to work with such hardware but by default we should assume well-behaved CPUs. (Like others above I don’t understand why you want a guarantee of getting a less precise result, but oh well.)

In general, there’s nothing wrong with code behaving different in debug and release builds when non-deterministic operations are involved. It’s surprising when you don’t realize that there is non-determinism, but I wouldn’t call it “sloppy”.

For math library functions, it is somewhat surprising that LLVM treats them basically as builtin intrinsics even though they look like normal functions, but even that has precedent – memcpy and friends are also treated like that.

Not quite, since the exception behavior is part of the IEEE spec but denormal flushing (in my understanding) is not. If “strict” is about strictly following the IEEE spec, denormals aren’t part of it.

I’m not very familiar with the clang frontend… according to the docs, there’s no “dynamic” for -fdenormal-fp-math, let alone a dynamic,dynamic. Seems like the docs are outdated, or are you suggesting an extension?

Anyway I have no issues with an opt-in flag that makes LLVM behave more like the underlying hardware when that hardware is not IEEE-compliant. We could even consider a flag that inhibits all compile-time assumptions about flaot behavior and leaves it entirely to the hardware.

FWIW I am also confused by the docs for the preserve-* options – does preserve-sign mean that denormals will be flushed to zero or that they may be flushed to zero? Either way I would have expected a name like flush-to-zero-preserve-sign or so; currently the most important part of this flag (that flushing can/will happen) isn’t even mentioned in its name…

arsenm · June 14, 2024, 9:19am

Strongly disagree with this, for any IEEE type, there are no such thing as non-canonical values and denormals are not flushed. You could instead make an argument that the IR default should be assumed denormal-fp-math=dynamic, but if we know the mode is definitively IEEE you should be rely on no incorrect flush happening.

They have mutually contradictory wishes. If you want numeric consistency, you must stay away from FTZ/DAZ.

We honor this by not constant folding the canonicalize intrinsic with dynamic. The canonicalize intrinsic exists specifically for this type of use case. If we were to go down this road, we should also start handling signaling nans in non-strictfp functions. I don’t think it makes any sense to have different signaling nan and denormal flushing policies.

It’s exactly this.

Right, denormal flushing is a noncompliant, buggy implementation.

It’s informative of the default floating point environment. The flush only may occur, except for the canonicalize intrinsic where it’s guaranteed.

andykaylor · June 14, 2024, 6:33pm

I don’t want to guarantee a less precise result. I want to guarantee that the compiler won’t change the numeric results of the program within the bounds of the options provided by the user. Many users value consistency over accuracy (within reasonable limits).

I don’t think I would call the case I’m pursuing non-determinism. You get one result if FTZ/DAZ are set and a different result if they are not set, but within either state the result is deterministic.

I don’t believe the memcpy optimizations change the result in any way. With either memcpy or math library calls you can block the behavior using -fno-builtin-*func*. With the possible exception of calls like sqrt and ldexp we don’t have any reason to believe we can reproduce the exact result of the library call, so I don’t think we should be constant folding them. What I would support instead is defining a set of intrinsics that represent correctly-rounded implementations of the functions unless marked otherwise. If those were used, we could constant-fold with confidence. As the LLVM math library is moving towards correctly-rounded implementations, this is becoming a reasonable possibility. The C standard is also moving towards expecting this, I believe.

Are you sure about that? From section 10.4 of IEEE-754 (2019):

A language implementation preserves the literal meaning of the source code by, for example:
…
Applying the identity laws (0 + x and 1× x) only when they preserve numerical results and
flags raised

I would argue that when I am permitted to modify the floating-point environment (I think restricting that is a language-specific detail), setting the FTZ or DAZ flags is a legitimate modification to the environment and when they are set applying the 1× x identity law changes the numerical result. It may not be explicitly mentioned in the standard (though there is an “abruptUnderflow” attribute that I don’t quite understand), but it seems to be allowed for.

Yes, the documentation is outdated – “dynamic” is a relatively recent addition and the clang documentation wasn’t updated when it was added. It is accepted by the driver though and sets the IR attribute as described in the Lang Ref.

I’m still struggling to understand exactly what you intend it to mean, and I’m not sure I don’t want to make some minor adjustments. If you want the compiler not to make any assumptions about whether flushing will occur, it seems like “dynamic” should be used. If you use “preserve-sign” or “positive-zero” shouldn’t the compiler assume that flushing will occur? So if we constant fold an operation invovling a denormal, it seems like we should flush it according to the state described.

I think this reflects your bias against this processor mode. I understand that changing the state of these flags changes the results, but with the flags set the results are predictable and consistent.

The case that led me to start this topic is actually quite simple. A program (a Fortran program, if that matters) has set the FTZ/DAZ flags in some way. It calls a library function with a denormal argument. The library function is supposed to return zero with the sign of the argument in the case where a denormal argument is passed and FTZ or DAZ are set. A couple of months ago, this was working. When compiled with the latest code from LLVM trunk, it returns the unmodified denormal instead.

I don’t know why it is important that the library call flush the denormal, but the library team has a regression test for it, so I would guess that someone complained about it not happening at some point.

I have told the library team that they can fix this using __builtin_canonicalize, but I’d like to establish some practices that will keep failures like this from popping up in the future. The library team is of the opinion that if the compiler would simply generate code that does what their source asked for they wouldn’t have problems like this.

RalfJung · June 14, 2024, 7:06pm

It is non-deterministic in the same sense that the sign of 0.0 / 0.0 is non-deterministic (as documented in the LangRef): using non-determinism as an overapproximation of the actual factors affecting the result, since the actual factors include “can const-prop determine the values flowing into this operation”, which we don’t want to be part of the spec.

The standard defines the range of possible behaviors for + and the other operators, and “flush denormals” is not one of them.

If we take your interpretation, then hardware may add a flag that makes all operations return imprecise results with much larger errors than usual. Now suddenly are compilers required to preserve those numerical results as well? No of course not. Compilers assume standards-compliant hardware, that’s then only way any kind of optimization can work.
This is a strawman of course, but it shows that just because the hardware has a flag to produce certain behavior, doesn’t mean compilers can be expected to support that. That would be completely impractical.

And what does “dynamic,dynamic” do – why are there two of them?

Seems like the library team wants + (and the other float operations) to mean “whatever my hardware does on +, and no matter whether what the hardware does has anything to do with the IEEE standard”. That kind of + can’t be optimized at all though.

Maybe LLVM should provide some flag to disable all float optimizations, even const-folding 1.0 + 2.0, for cases like this?

Or do you still want some optimizations applied? Then we’re quickly in the weeds of needing a new flag for each user as everyone will have different requirements for what exact constraints they are imposing on the hardware. I guess the point of the denormal-fp-math flag is that this particular case (largely IEEE-compliant, except when it comes to subnormals) is common enough to warrant its own flag?

Anyway, since you’re saying you are fine with not making this the default, I think I am not even disagreeing with you on the actionable part of this. denormal-fp-math=dynamic should do what it says on the tin, so in case of dynamic it shouldn’t optimize any float operations where denormal flushing may occur.

jcranmer · June 14, 2024, 7:06pm

My inclination is that we should not do this. I think we should expose “dynamic” as a command-line option (which we don’t, currently), but that we should not make it the default.

Doing so would pessimize the performance of normal code, for the dubious benefit of getting more consistent results in code which has explicitly asked for inconsistent results via enabling “fast-math”.

My argument for this position is thus:

On any platform where crtfastmath.o exists (from libgcc, this seems to be X86, ARM, AArch64, MIPS, LoongArch, and Sparc of the platforms that LLVM supports, and Itanium and Alpha of those that LLVM doesn’t support), Clang (and I presume gcc) will attempt to link in a global-disable-FTZ/DAZ if the linker command line enables fast-math and it’s not linking a shared library.

At the point we compile any individual C/C++ file, we have no idea what the linker command line will look like, especially the command line of the main program (which may be a separate project altogether). So the reality is we don’t have a strong guarantee that you won’t be in denormal-flushing mode even if you don’t explicitly turn it on yourself.

As for consistency and pessimization issues, the main concerns here are a) whether or not it’s possible to raise @llvm.is_fpclass (which can be synthesized via InstCombine from bit-test instructions!) to an fcmp instruction, and b) constant-folding of denormal inputs. But denormal constants should be relatively rare in practice, so I’m not sure the latter concern is actually a concern in practice. And in the former case, it’s a scenario where someone trying to be careful (by avoiding FP operations that might get screwed over by the fast-math-setting-the-denormal-mode) gets screwed over by the compiler thinking it’s safe: that’s the kind of consistency issue that (to me) obligates the compiler to be pessimistic.

andykaylor · June 14, 2024, 9:01pm

Now that you mention it, it’s not uncommon for the native fdiv instruction on GPUs to return results that aren’t correctly rounded. Should the compiler be obligated to generate a refinement sequence that produces correct results?

I discussed this briefly with @arsenm recently, and we were both of the opinion that the LLVM IR fdiv instruction has to imply correctly-rounded results, but that means that we need some way to decorate the instruction when compiling for a target that doesn’t do that in order to generate code with reasonable performance. We have the “fpmath” metadata, which exists for this purpose, but it can easily be dropped. I don’t think any backend for a device that doesn’t provide correctly-rounded division is likely to generate a refinement sequence by default, and so if we constant fold division it can change the numeric result. In that case, as long as we have a way to describe the special-case requirement, we can point to the specification and say that we are following it.

(As an aside, we’ve got a similar case with x87 instructions where we don’t truncate intermediate results to the IR type precision. It would kill performance if we did, so we live with this “bug” in the backend. If anyone still cared about using x87 instructions for single or double precision calculations, we’d probably provide a mode for forcing the intermediate truncation, as gcc and icc do, but I don’t think there’s much demand for it now, so we just let the backend ignore the IR semantics.)

I think we’re in agreement that the default behavior should be IEEE-compliant numeric results. However, I think we need ways to describe special cases where we know they exist.

The first part indicates flushing of results (FTZ), the second part indicates handling of inputs (DAZ). There are some observable differences in behavior between them. For instance, denormals compare as equal to zero with DAZ but not with FTZ alone. The other difference is exception behavior. When DAZ is set, denormal inputs are flushed to zero, but the underflow flag isn’t raised. So x * 1.0 raises the underflow exception with FTZ alone, but it does not raise it when DAZ is set. This would be important for deciding when we could constant fold in strict mode. We could fold x * 1.0 to zero for denormal values of x if DAZ is set, but not if only FTZ is set because we need to preserve the exception. Right now, the “denormal-fp-math” attribute is the only thing that tells us which to do.

Not really. These people are basically mathematicians for whom the use of a programming language is a necessary evil. They know the IEEE standard quite well. They also know the behavior of the hardware, and they’ve written code that will produce the exact behavior they need if the literal meaning of the code is respected.

Sometimes they’ll perform an operation just to canonicalize the input (I’m trying to convince them to use __builtin_canonicalize for this). Sometimes they’ll perform an operation just to set an exception flag. They work closely with our compiler team to fine tune the performance. They understand that the compiler has to work within the constraints of various standards, including the LLVM IR definition, but they ask us to provide accommodations to get good performance with the behavior they need. So, here I am.

andykaylor · June 14, 2024, 9:15pm

That’s true, but over the course of this discussion, I’ve been persuaded that the compiler should assume IEEE behavior (minus exception semantics) by default. I think that’s reasonable as long as we provide an option for those who need to respect the dynamic denormal behavior to get it.

This is consistent with the C standard. It says the compiler is free to assume the default floating point environment when FENV_ACCESS is off. The LLVM IR floating-point environment description seems to be targeted at implementing that assumption. Other front ends can set the denormal state according to whatever rules their standards require, but I would expect the C rules to be common.

I’m also concerned about the way we’re teaching the optimizer to recognize bit-pattern behavior as FP-related. I think I brought something like that up above and @arsenm reassured me that the cases where we are doing that now respect the side-effects or lack thereof of the operations being synthesized. I’d like to enshrine that in the LangRef in some way to be sure we don’t deviate in the future. I’m not sure what needs to be said.

We also have some potential for dropping performance as we convert bit-accesses to FP intrinsics and then need to lower them later, but I suppose that’s something that can be addressed by improving the target lowering as cases come up. I saw it happen with the fabs recognition with 128-bit types. There the x86 backend was introducing a function call where the original code had none. I think that has been fixed.

RalfJung · June 15, 2024, 7:00am

I don’t see why an LLVM IR fdiv would need an annotation that basically means “yes I really mean the standard fdiv”. You can consider this annotation to be already implicitly present on every single fdiv.

LLVM LangRef should just state explicitly that fdiv (and the others) must provide results as specified by IEEE. I assume this is the semantics frontends generally expect – it certainly is the semantics Rust expects. Then we can already point at that and tell backends that they need to be fixed. (If this affects more than one backend, the infrastructure for this can hopefully be shared, but I know basically nothing about that part of LLVM.)

One might consider adding an “even more strict than strict” mode that exposes the underlying hardware semantics and tells LLVM IR passes that they can make no assumptions about the exact behavior, but that’s a separate question. (Even here one can imagine multiple levels, e.g. maybe the instruction is still pure and can be moved around [and does not depend on some status register], we just can’t say what the numeric result will be. So in that sense this is an orthogonal direction to what strict controls, and more like denormal-fp-math, except it applies not just to denormals.)

I see, thanks!
I hope this can be put in the docs as well at some point.

“literal meaning” is the keyword here. I would say the literal meaning of + on floats is to use IEEE semantics without ifs and buts around denormals. They have written code that does what they need if their intepretation of the literal meaning of + is respected, but that interpretation is clearly different from mine. I don’t know what you/they mean by “literal meaning”, but it seems to be closer to “what the hardware does” than “what IEEE says”.

By saying “the literal meaning”, you are making an implicit claim here that there is “the one and only literal meaning of +” and LLVM/clang are diverging from it, but I don’t think that is a fair way to describe what happens. The reality is that there are many different + on floats (ranging from “IEEE standard without any regards for exceptions” all the way to “whatever the hardware does, even if what it does has nothing to do with IEEE”), and clang provides one (arguably the standard one for C and C++ code) while your customers want a different one, more specifically tailored to their needs.

andykaylor · June 17, 2024, 7:30pm

I think maybe I wasn’t clear. What I intended to say was that fdiv without any decoration (fast-math flag, metadata, attribute, etc.) should mean the IEEE-specified correctly-rounded fdiv operation. However, in order to generate reasonable code for targets that don’t have a correctly-rounded fdiv instruction, we need a way to indicate in the IR that the backend isn’t required to generate a correctly-rounded result.

I was having a discussion earlier today with SYCL developers and this exact topic came up. SYCL is frequently implemented using an OpenCL backend, but by default the OpenCL specification only requires 2.5 ulp accuracy for single-precision division and square root, but OpenCL provides a build option to require correctly-rounded single-precision division and square root. We’d like to be able to describe both cases in LLVM IR. Since the default behavior of fdiv in LLVM IR is correctly-rounded results, we need a way to indicate that some additional error is allowed. That’s what I was getting at.

The idea is that the SYCL runtime library would always use the build option to require correctly-rounded division, but we’d describe individual operations in a way that allows a relaxed implementation where that was desirable. The OpenCL build option is global, so you either get correctly-rounded results everywhere or nowhere. We don’t have any way to describe global settings like this in LLVM IR, and I really don’t think we want one. I’d prefer to allow more fine-grained control anyway.

I guess this is a difference of opinion in how to interpret the IEEE-754 standard. This library team referred me to the section in 10.4 I quoted earlier saying that the preserving the literal meaning includes “Applying the identity laws (0 + x and 1× x) only when they preserve numerical results.”

I don’t really want to litigate what the standard means here. Regardless of how you interpret it, I think it is reasonable to request an off-by-default compiler mode that respects the possibility of a dynamic denormal flushing mode and preserves numerical results of the user’s source code when the flushing modes are enabled, and if we have such a mode I think it is reasonable to enable it when -ffp-model=strict is used, even if that makes the meaning of strict ambiguous.

RalfJung · June 18, 2024, 7:21am

It seems your customers wouldn’t be satisfied by that though – they would want a guarantee that the result matches the precision of the underlying hardware? fdiv inexact with the semantics of “up to 2.5 ulp rounding error” could still be constant-folded with maximum precision, due to the “up to”.

If we follow your/their argument, the compiler would not be allowed to do any constant folding of any float operation, since hardware may as well decide that 3.0 / 2.0 has a numerical result of 1.0 and apparently we have to honor whatever “numerical result” someone else decides is correct. This interpretation of the standard is completely unworkable.

The correct “numerical results” are defined by that standard. LLVM is already correct in this sense: it will preserve numerical results. You can’t just pick-and-choose some paragraphs from the standard and expect a compiler to be correct for whatever subset of the standard you decided to honor today. When you replace some operations at the core of the standard by different operations (by flushing denormals to zero), you have lost the right to use any part of the standard that refers to these operations as a basis for your reasoning.

Agreed. And it seems clang even already offers this flag, it is just not implemented correctly everywhere.

Not sure I agree with that, since there are also people that want strict IEEE compliance without non-standard behavior such as denormal flushing.

andykaylor · June 18, 2024, 8:28pm

This would be a different set of customers. The SYCL and OpenCL programming models involve potential execution-time determination of the target with JIT compilation, so it’s much harder to guarantee floating-point consistency. The context of the discussion I mentioned was that the SYCL developers wanted to provide a way to allow users to request correctly-rounded division results apart from the OpenCL build option. As I said, I believe this is already indicating by the default IR representation of fdiv.

I agree with you that if the IR description of fdiv allows “up to 2.5 ulp error” that it would be perfectly acceptable to constant-fold the operation to a correctly-rounded result.

I can imagine there being users who are programming with SYCL or OpenCL and targeting a specific architecture who might want to disable constant-folding of fdiv and sqrt for numeric consistency reasons, but I haven’t gotten such a request. SYCL has a way to invoke the "native fdiv" operation explicitly, so that might be sufficient.

My argument would be that for processors that support a dynamic denormal flushing mode, that mode is part of the floating-point environment, so we should set “denormal-fp-math” to “dynamic” whenever FENV access is enabled for targets that have this mode. Users who don’t want that restriction would be able to turn it off.

lntue · June 23, 2024, 3:23am

Sorry for a bit tangential question, but how do FTZ/DAZ work with rounding modes? Are they applied before or after rounding?

OTOH, I feel like if we need a consistency model for FTZ/DAZ, maybe we can add / treat them as another rounding mode (or rounding modes, with 1 new one for each rounding mode). Then in that case, basically we can just assume that FTZ/DAZ is applied at the end of every operations. And of course some optimizations will need to be aware of this new rounding modes, just like other rounding modes.

Treating it as such, then together with some compiler builtins (maybe similar to get rounding modes) so that library writers don’t have to rely on fragile trick with some unspecified assumptions such as 1.0 * x.

andykaylor · June 24, 2024, 5:37pm

For x86-based targets, DAZ is applied on the input values, so before the operation even begins. FTZ is documented as applying to the result. That may happen before rounding, but I suppose in theory it could occur as part of the rounding. I don’t know the specifics of the hardware implementation. In some cases, x86-based processors use a microcode assist for denormal calculations, and that is entirely avoided when FTZ is set, but not all denormal calculations require microcode assist, and denormal results are still flushed to zero in those cases.

I don’t think we can treat it as a rounding mode, because the flushing occurs even in cases where the result is exact (such as multiplication by one). So APFloat would indicate that the result is exact and the optimizer would use the result even when the rounding mode is “fpround.dynamic”.

The “trick” with 1.0 * x is necessary because checking the bits in the MXCSR register is significantly slower. So while __builtin_canonicalize() is a reasonable substitute, something like _mm_getcsr() (which we already have) wouldn’t help.

jcranmer · June 24, 2024, 8:15pm

Sorry for a bit tangential question, but how do |FTZ/DAZ| work with
rounding modes? Are they applied before or after rounding?

I don’t have a firm grasp of the semantics of all hardware
implementations. For the manuals I have access to, DAZ is defined
essentially as any denormal input is treated as (signed) zero;
meanwhile, FTZ is essentially defined as “if the operation would trigger
FE_UNDERFLOW, it is (signed) zero instead”.

Note, however, that there are multiple possible definitions of when
operations trigger FE_UNDERFLOW (essentially, you can do it before or
after rounding), and from a quick scan, at least x86 and AArch64 use
different definitions of FTZ.

(fma(0x1.p-54,-0x1.p-1022, 0x1.p-1022) is a result that is smaller
than the smallest normal but rounds to a normal, so it should be
diagnostic of a pre- or post-rounding FTZ mode).

As an aside, after staring at IEEE 754 long enough, there are cases
where you can get an underflow exception triggered after rounding even
if the result is a normal number.

Topic		Replies	Views
should we have IR intrinsics for integer min/max? LLVM Dev List Archives	7	161	November 14, 2016
clarification needed for the constrained fp implementation. LLVM Dev List Archives	0	131	November 27, 2017
[RFC] Stronger guarantees for "denormal-fp-math IR & Optimizations	12	295	July 3, 2024
[RFC] Improving IR fast-math semantics IR & Optimizations core , rfc , llvm , llvm-ir	22	953	May 31, 2024
canonicalizing types for vector operations LLVM Dev List Archives	2	84	August 19, 2009

Questions about llvm.canonicalize

Related topics