I’m working with a team developing a math library, and they have some code that is doing something like dX = x * 1.0; to detect the case where the FTZ or DAZ control bits are set (on an x86-based system). They do not want to compile with -ffp-model=strict (for performance reasons), so the optimizer is assuming the default floating-point environment and optimizing the above instruction away.
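For readers unfamiliar with the idiom, one common way to keep such a multiply from being folded away is to hide the constant behind a volatile read. The sketch below is only an illustration of that idea, assuming a 32-bit `float`; the helper names `multiply_by_opaque_one` and `multiply_changed_bits` are hypothetical and are not the library team's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the library team's code. The volatile read of
   `one` hides the constant from the optimizer, so a real multiply is
   emitted and FTZ/DAZ (if set at run time) can act on x. */
float multiply_by_opaque_one(float x) {
    volatile float one = 1.0f;
    return x * one;
}

/* Returns 1 if the multiply changed the bit pattern of x. That happens for
   a subnormal x when the hardware flushes it, and also when a signaling
   NaN gets quieted. */
int multiply_changed_bits(float x) {
    float y = multiply_by_opaque_one(x);
    uint32_t xb, yb;
    assert(sizeof xb == sizeof x); /* this sketch assumes a 32-bit float */
    memcpy(&xb, &x, sizeof xb);
    memcpy(&yb, &y, sizeof yb);
    return xb != yb;
}
```

Calling `multiply_changed_bits` on a subnormal input would then report whether flushing is in effect, at the cost of a memory read the optimizer cannot remove.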
I think the llvm.canonicalize intrinsic should provide the functionality they want. Unfortunately, lowering of llvm.canonicalize isn’t implemented for the x86 backend (bug), but that should be easy to solve. I have a few concerns about the handling of this intrinsic that I’d like to ask about.
First, the Language Reference says this is useful for platforms “like GPUs or ARMv7 NEON” that treat subnormal values as non-canonical encodings of zero. It seems to me that an x86-based processor with the FTZ and DAZ control bits set in MXCSR behaves this same way, and so I think it makes sense to treat this intrinsic as a valid and reliable way to flush a denormal value to zero. Am I correct?
Second, the Lang Ref says, “(@llvm.canonicalize(x) == x) is equivalent to (x == x)”. Can I assume that this is based on the above remark about subnormals being treated as a non-canonical encoding of zero? The rest of the description seems to imply so, in particular the general statement that this intrinsic can be implemented as multiplication by 1.0. It just feels like a gray area for an x86-based processor that is capable of recognizing subnormal values but doesn’t do so in FTZ/DAZ mode.
Finally, the one that most concerns me, the Lang Ref says that this intrinsic can be optimized away if “The result is consumed only by (or fused with) other floating-point operations. That is, the bits of the floating-point value are not examined.” I looked at the case where the library team I’m working with needs this behavior, and the result is indeed used in various bit checks. My concern is that we seem to be trending towards recognizing bit-manipulation patterns in floating-point code and translating them into intrinsics so that we can reason about what they are doing. I guess I’d like stronger guarantees that my llvm.canonicalize intrinsic won’t be optimized away. Is there anything that can be done about that?
FWIW, modern GPUs can also recognize subnormals but are often configured not to (especially in graphics). So this sounds like it’s identical to what you’re seeing on x86.
Second, the Lang Ref says, “(@llvm.canonicalize(x) == x) is equivalent to (x == x)”. Can I assume that this is based on the above remark about subnormals being treated as a non-canonical encoding of zero?
Short answer: Yes, that matches my understanding.
Longer answer: It’s really a corollary of the statement that you get the (target-specific) canonical encoding of a number that has multiple encodings.
If the target treats subnormals as 0, then subnormals are non-canonical encodings of 0 and they should behave like 0.
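A bit-level sketch of what "canonical encoding" could mean on such a target, working on binary32 bit patterns. This is a software model only, under the assumption that a flushed subnormal becomes a zero of the same sign; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* binary32: exponent field all zeros and a nonzero fraction. */
int is_subnormal_bits(uint32_t bits) {
    return (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
}

/* Software model (assumption): on a target that treats subnormals as zero,
   the canonical encoding of a subnormal is the zero of the same sign.
   Every other encoding is returned unchanged by this model. */
uint32_t model_canonical_bits(uint32_t bits) {
    return is_subnormal_bits(bits) ? (bits & 0x80000000u) : bits;
}

void check_examples(void) {
    assert(model_canonical_bits(0x00000001u) == 0x00000000u); /* +min subnormal -> +0 */
    assert(model_canonical_bits(0x80000001u) == 0x80000000u); /* -min subnormal -> -0 */
    assert(model_canonical_bits(0x3F800000u) == 0x3F800000u); /* 1.0 is unchanged  */
}
```

The model ignores NaN canonicalization, which llvm.canonicalize also covers.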
Finally, the one that most concerns me, the Lang Ref says that this intrinsic can be optimized away if “The result is consumed only by (or fused with) other floating-point operations. That is, the bits of the floating-point value are not examined.” I looked at the case where the library team I’m working with needs this behavior, and the result is indeed used in various bit checks. My concern is that we seem to be trending towards recognizing bit-manipulation patterns in floating-point code and translating them into intrinsics so that we can reason about what they are doing. I guess I’d like stronger guarantees that my llvm.canonicalize intrinsic won’t be optimized away. Is there anything that can be done about that?
This is probably one of those areas where we just have to clear up ambiguities over time.
For example, fneg is defined as:
The value produced is a copy of the operand with its sign bit flipped. The value is otherwise completely identical; in particular, if the input is a NaN, then the quiet/signaling bit and payload are perfectly preserved.
This suggests to me that fneg float x really is identical to xor i32 x, 0x80000000, and fneg (canonicalize x) cannot be combined to just fneg x in general. But we can (and I believe we do) transform it to canonicalize (fneg x), which can lead to eliminating the canonicalize later.
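The argument can be checked on bit patterns. The sketch below models fneg as a sign-bit xor and canonicalize as a sign-preserving flush of subnormals (an assumption about the target, not a statement about any particular backend); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* fneg as described by the LangRef text above: flip the sign bit only. */
uint32_t fneg_bits(uint32_t bits) {
    return bits ^ 0x80000000u;
}

/* Assumed model of canonicalize on a flushing target: subnormals become a
   zero of the same sign, everything else is left alone. */
uint32_t model_canonicalize_bits(uint32_t bits) {
    int subnormal = (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
    return subnormal ? (bits & 0x80000000u) : bits;
}

void check_examples(void) {
    uint32_t x = 0x00000001u; /* smallest positive subnormal */
    /* Dropping the canonicalize changes the result bits ... */
    assert(fneg_bits(model_canonicalize_bits(x)) != fneg_bits(x));
    /* ... but commuting it with fneg does not. */
    assert(fneg_bits(model_canonicalize_bits(x)) ==
           model_canonicalize_bits(fneg_bits(x)));
}
```

In this model the two orders agree because the flush keeps the sign, so moving the canonicalize outward is safe while deleting it is not.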
Yes, assuming that in a non-strictfp function the assumed FP mode is denormal-fp-math=preserve-sign.
The optimizable case is where you aren’t using bit checks and only canonicalizing operations observe the value. Note we don’t guarantee canonicalization for “canonicalizing” non-strictfp operations, so this optimization can only really apply late in codegen unless we clean up a lot of code to insert new canonicalizes (with the obvious exception of the canonicalize(canonicalize(x)) → canonicalize(x) case).
You can’t bit recognize into a potentially canonicalizing operation, that would just be wrong.
I think the llvm.canonicalize intrinsic should provide the
functionality they want. Unfortunately, lowering of
llvm.canonicalize isn’t implemented for the x86 backend (bug https://github.com/llvm/llvm-project/issues/32650), but that should
be easy to solve. I have a few concerns about the handling of this
intrinsic that I’d like to ask about.
Looking at the LangRef here, I think the documentation here suffers from
the confusing interaction of a specific intrinsic with general
floating-point implementation rules. The way I’m interpreting the
semantics here is that it’s akin to doing a guaranteed-fmul 1.0, %x
with the current hardware mode, with the effects that it would have in
hardware, and without the looser rules on fmul that the compiler is
allowed to apply (e.g., LLVM doesn’t guarantee sNaN quieting in general,
but I interpret llvm.canonicalize to require sNaN quieting), and these
stricter semantics should apply even in non-strictfp mode.
And with that interpretation, I would argue that llvm.canonicalize should be what they want, and if it deviates from those semantics
today, that deviation is the bug that needs to be fixed.
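As a rough illustration of that interpretation, here is a toy bit-level model of a "guaranteed fmul by 1.0" on binary32 values. It assumes the common convention that a set top fraction bit marks a quiet NaN, and it uses a plain `flush` flag to stand in for the dynamic FTZ/DAZ state; none of these names come from LLVM.

```c
#include <assert.h>
#include <stdint.h>

/* binary32 layout: 1 sign bit, 8 exponent bits, 23 fraction bits. */
#define F32_SIGN  0x80000000u
#define F32_EXP   0x7F800000u
#define F32_FRAC  0x007FFFFFu
#define F32_QUIET 0x00400000u /* assumed: top fraction bit set means quiet */

int is_nan_bits(uint32_t b) {
    return (b & F32_EXP) == F32_EXP && (b & F32_FRAC) != 0;
}

int is_subnormal_bits(uint32_t b) {
    return (b & F32_EXP) == 0 && (b & F32_FRAC) != 0;
}

/* Toy model: NaNs are always quieted; subnormals are flushed (keeping the
   sign) only when `flush` is nonzero, standing in for the hardware mode. */
uint32_t model_guaranteed_fmul_one(uint32_t b, int flush) {
    if (is_nan_bits(b))
        return b | F32_QUIET;
    if (flush && is_subnormal_bits(b))
        return b & F32_SIGN;
    return b;
}

void check_examples(void) {
    assert(model_guaranteed_fmul_one(0x7FA00000u, 0) == 0x7FE00000u); /* sNaN -> qNaN */
    assert(model_guaranteed_fmul_one(0x00000001u, 1) == 0x00000000u); /* flushed */
    assert(model_guaranteed_fmul_one(0x00000001u, 0) == 0x00000001u); /* kept */
}
```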
First, the Language Reference says this is useful for platforms “like
GPUs or ARMv7 NEON” that treat subnormal values as non-canonical
encodings of zero. It seems to me that an x86-based processor with the
FTZ and DAZ control bits set in MXCSR behaves this same way, and so I
think it makes sense to treat this intrinsic as a valid and reliable
way to flush a denormal value to zero. Am I correct?
I think the text could be changed to say so, but yes, canonicalize
should flush zero on machines that have FTZ/DAZ control bits and those
bits are dynamically set during execution.
(There’s a side argument about what happens if you’ve got calls to llvm.canonicalize and FP environment calls that twiddle FTZ/DAZ bits
in the same function, with and without strictfp, but I think that’s
better saved for more general discussions around the FP environment).
Second, the Lang Ref says, “(@llvm.canonicalize(x) == x) is equivalent
to (x == x)”. Can I assume that this is based on the above remark
about subnormals being treated as a non-canonical encoding of zero?
The rest of the description seems to imply so. In particular, the
general statement that this intrinsic can be implementable as
multiplication by 1.0. It just feels like a gray area for an x86-based
processor that is capable of recognizing subnormal values but doesn’t
do so in FTZ/DAZ mode.
There’s a transform in InstCombine that drops llvm.canonicalize on
arguments to fcmp:
(Incidentally, the requirement here means that a mode where FTZ, but not
DAZ, is set implies that denormals are not noncanonical versions of 0 in
the mode. Which… I’m not sure this requirement in the LangRef is
actually correct the harder I think about it, although I’m not sure how
much we should care about FTZ & !DAZ mode in the first place.)
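The FTZ-without-DAZ concern can be made concrete with a toy model. The sketch assumes, as the parenthetical above does, that FTZ flushes the result of the canonicalizing multiply while a comparison without DAZ still sees a subnormal input as nonzero. The `fp_mode` struct and function names are hypothetical, and this is not a claim about exact hardware behavior.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int ftz; int daz; } fp_mode; /* hypothetical mode record */

const fp_mode MODE_FTZ_ONLY    = {1, 0};
const fp_mode MODE_FTZ_AND_DAZ = {1, 1};

static int is_nan_bits(uint32_t b) {
    return (b & 0x7F800000u) == 0x7F800000u && (b & 0x007FFFFFu) != 0;
}

static uint32_t flush_bits(uint32_t b) {
    int subnormal = (b & 0x7F800000u) == 0 && (b & 0x007FFFFFu) != 0;
    return subnormal ? (b & 0x80000000u) : b;
}

/* canonicalize modeled as x * 1.0: DAZ flushes the input, FTZ the result. */
uint32_t model_canonicalize(uint32_t x, fp_mode m) {
    if (m.daz) x = flush_bits(x);
    /* x * 1.0 is exact, so the product has the same bits as x here. */
    if (m.ftz) x = flush_bits(x);
    return x;
}

/* fcmp oeq modeled on bits: only DAZ affects what a compare sees. */
int model_fcmp_oeq(uint32_t a, uint32_t b, fp_mode m) {
    if (is_nan_bits(a) || is_nan_bits(b)) return 0;
    if (m.daz) { a = flush_bits(a); b = flush_bits(b); }
    if (((a | b) & 0x7FFFFFFFu) == 0) return 1; /* +0 == -0 */
    return a == b;
}

void check_examples(void) {
    uint32_t x = 0x00000001u; /* smallest positive subnormal */
    /* FTZ only: canonicalize(x) == x is false although x == x is true. */
    assert(!model_fcmp_oeq(model_canonicalize(x, MODE_FTZ_ONLY), x, MODE_FTZ_ONLY));
    assert(model_fcmp_oeq(x, x, MODE_FTZ_ONLY));
}
```

With both bits set the two comparisons agree again in this model, which is why the LangRef equivalence only looks questionable for the mixed mode.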
Finally, the one that most concerns me, the Lang Ref says that this
intrinsic can be optimized away if “The result is consumed only by (or
fused with) other floating-point operations. That is, the bits of the
floating-point value are not examined.” I looked at the case where the
library team I’m working with needs this behavior, and the result is
indeed used in various bit checks. My concern is that we seem to be
trending towards recognizing bit-manipulation patterns in
floating-point code and translating them into intrinsics so that we
can reason about what they are doing. I guess I’d like stronger
guarantees that my llvm.canonicalize intrinsic won’t be optimized
away. Is there anything that can be done about that?
The first thing that could be done is making sure the denormal-fp-math
attribute is set to dynamic, which prevents constant-folding of llvm.canonicalize on denormal inputs. I know we’ve had many
discussions on denormal handling, but this should be done by default in
Clang for any target with user-flippable ftz/daz bits. Beyond that, the
only optimizations that are done today on llvm.canonicalize are
removing one level in llvm.canonicalize(llvm.canonicalize(x)) and the
aforementioned InstCombine transformation. Optimizations on canonicalize
more generally seem hard to sanction unless we start changing identity
transformations to instead go onto canonicalize.
I think the bit check concern you have is mostly around replacing some icmp (bitcast to iN) checks with is.fpclass, which purposefully does
not canonicalize its input, so it isn’t legal to drop llvm.canonicalize calls that are inputs to that call. I wouldn’t be entirely
confident, however, that all backends lower llvm.is.fpclass to
something that is robust to noncanonical inputs (including tests for 0
in DAZ/FTZ mode), but that’s more in line of “bugs that need fixing”
rather than “design is broken”.
That’s the rub. What can/should we assume in a non-strictfp function about FTZ and DAZ on x86? They are part of the floating-point environment, so I think according to the Lang Ref we’d assume “ieee,ieee” unless the “denormal-fp-math” attribute is set, but as I’ve argued elsewhere for x86 we should really be setting this to “dynamic” unless instructed otherwise because we don’t know. The setting is global. It may have been changed in main. It may have been changed in a static initializer. If the source language is C, the user isn’t allowed to modify MXCSR unless FENV_ACCESS is enabled, but we have a command-line option to ask the compiler to set the initial state of MXCSR. So, for the purposes of llvm.canonicalize on X86, I don’t think we can make any assumption and need to perform the multiplication unless the “denormal-fp-math” attribute tells us otherwise.
OK, so it looks like llvm.canonicalize does what I want, except that we’re going to need to set “denormal-fp-math” to “dynamic” for x86 targets.
It turns out that setting “denormal-fp-math” to “ieee” by default is also how I got involved with the library team. They had a workaround for the fact that we were optimizing away their x = x * 1.0 code, but a recent improvement to ValueTracking broke their workaround. That improvement was based on checking “denormal-fp-math” to verify that we could count on the results of a comparison with zero meaning the input value was zero.
No. We do not guarantee canonicalization, but it may happen. Denormal flushing is semantically terrible, and not really defined in any standard. The only reason you would want to enable it is performance hacking, in which case you wouldn’t want a possible flush to block optimizations. The answer to this should be the same as for whether we guarantee that a signaling nan is quieted.
Denormal flushing may be semantically terrible, but if we have a way to describe it in the IR, that should work correctly. The Lang Ref says denormal-fp-math “indicates the denormal (subnormal) handling that may be assumed for the default floating-point environment.” If it is set to “dynamic,dynamic” that means that we can’t assume anything about denormal flushing, right? The x * 1.0 -> x transformation is based on the assumption that denormals won’t be flushed. Therefore, it should only be legal if we can assume IEEE behavior. Lang Ref is explicit about this, “If the mode is dynamic, the behavior is derived from the dynamic state of the floating-point environment. Transformations which depend on the behavior of denormal values should not be performed.”
I’ve been going back and forth about this in my head as to what the default should be for architectures that allow this to be changed dynamically. I don’t think it’s accurate to say that if flushing is enabled it means the user only cares about performance. If I’m writing a math library, I need the library to return correct results when FTZ is set. The user who set the flag may have done so for “performance hacking” but the library can’t assume that it is OK to return incorrect results for denormals in that case. So we absolutely need a way to handle that case. I had convinced myself yesterday that this should be the default behavior and the behavior that is set by -ffp-model=precise because it is most conservatively correct. After reflecting on it more, I think I’ve persuaded myself that “ieee” is the correct default, and the library people need to use -fdenormal-fp-math=dynamic (which would mean we need to document that). Either way, we need the “dynamic” mode to work correctly.
No, because we do not guarantee canonicalization. By the same logic we would have to try to be signaling nan clean in non-strictfp functions. It doesn’t make sense to have different policies for signaling nan quieting and denormal flushing; they are a set pair that behave the same way and would require additional instructions in the exact same contexts. Trying to maintain the behavior the underlying target implementation “probably” has will just introduce additional costs that nobody wants.
The mode is informative of whatever the “floating-point environment” is, but we don’t have a mapping from whatever the target’s idea of that is to any particular IR instruction behavior (i.e. denormal-fp-math=preserve-sign does not guarantee that an fadd will flush denormals when lowered). For example AMDGPU has instructions that ignore the denormal mode register. In some cases this goes in the bad direction and they always flush (which is the main reason we need to be aware of the mode at all in codegen), and in other cases they will never flush even if flushing is enabled.
This could be rephrased, but the intention was that you cannot replace non-canonicalizing operations with canonicalizing operations without knowing the mode. That is, you are allowed to drop canonicalizations but not introduce them. The primary example of which is llvm.is.fpclass to fcmp, which changes the behavior depending on whether a flush may happen.
The difference is that the Lang Ref says we do not distinguish between quiet and signaling NaNs. If we introduced an attribute that indicated that we did recognize the difference, we’d have to do that.
My position is that unless fast-math is enabled the compiler should not be doing anything that leads to numeric differences to the results. In particular, if the user has explicitly told us not to assume IEEE behavior for denormals, we cannot optimize away an operation that would have an observable effect on numeric results.
If you can’t replace non-canonicalizing operations with canonicalizing operations without knowing the mode, why is it OK to simply eliminate canonicalizing operations when the IR says not to make any assumptions about the denormal mode?
The statement I quoted about transformations that depend on the behavior of denormal values seems exactly correct as it’s currently written.
In the example I linked to in comment 9 above, we’re choosing not to assume that an equal comparison with 0.0 proves that the input value is zero when “denormal-fp-math” is “dynamic”. How is that any different than choosing not to optimize away x * 1.0?
The LangRef has never done a good job of describing this, but this description is unworkable. Practically speaking the only context where signaling nans matter is for minnum/maxnum, due to their unreasonable behavior of inverting the behavior for signaling nans vs. quiet. We’re going to have to fix ignoring this at some point, as apparently glibc fixed the non-IEEE behavior 10 years ago. It really should be something like we don’t guarantee quieting of signaling nan inputs.
You have no expectation of numeric consistency if you have disabled denormal handling. Flushing denormals is an illegitimate practice that we should not be trying to treat as an effect we must maintain as semantically desirable. The closest thing we have to a standard of how a denormal flushing option should behave is OpenCL, which places no requirement that operations flush if you enable -cl-denorms-are-zero. It permits, but does not mandate, flushing to occur. The entire concept only exists as a performance hack and mandating flushing only adds undesirable optimization barriers.
The reason we have the canonicalize intrinsic is to observe these non-guaranteed flush and quieting effects. If we guaranteed canonicalization, we would not need the intrinsic.
The fcmp-must-be-zero logic relies on knowing there will not be a hidden flush if the input mode is definitively IEEE. Similarly, if we know the input mode is DAZ, we can also treat the input as-if it could have been flushed, and assume the compare passes for all zero and denormal values.
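The reasoning above can be summarized as a small lookup. This is only a sketch of the logic as described in this thread, not the ValueTracking implementation; the enum and function names are hypothetical, with the mode names mirroring the denormal-fp-math attribute values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the denormal-fp-math input modes. */
typedef enum {
    DENORMAL_IEEE,
    DENORMAL_PRESERVE_SIGN,
    DENORMAL_POSITIVE_ZERO,
    DENORMAL_DYNAMIC
} denormal_input_mode;

/* What a passing `fcmp oeq x, 0.0` lets an analysis conclude about x. */
typedef enum {
    VALUE_IS_ZERO,
    VALUE_IS_ZERO_OR_SUBNORMAL
} zero_compare_fact;

zero_compare_fact fact_from_fcmp_oeq_zero(denormal_input_mode mode) {
    /* Only a definitively IEEE input mode rules out a hidden flush. With a
       flushing or unknown (dynamic) mode, a subnormal may also have passed. */
    return mode == DENORMAL_IEEE ? VALUE_IS_ZERO : VALUE_IS_ZERO_OR_SUBNORMAL;
}

/* Does a binary32 bit pattern satisfy the given fact? */
int bits_satisfy_fact(uint32_t bits, zero_compare_fact fact) {
    int is_zero = (bits & 0x7FFFFFFFu) == 0;
    int is_subnormal = (bits & 0x7F800000u) == 0 && !is_zero;
    return is_zero || (fact == VALUE_IS_ZERO_OR_SUBNORMAL && is_subnormal);
}

void check_examples(void) {
    assert(fact_from_fcmp_oeq_zero(DENORMAL_IEEE) == VALUE_IS_ZERO);
    assert(fact_from_fcmp_oeq_zero(DENORMAL_DYNAMIC) == VALUE_IS_ZERO_OR_SUBNORMAL);
}
```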
In the dropped-canonicalize x * 1 → x case, we’re allowed to drop a flush which isn’t guaranteed. We’re not relying on knowledge that the flush would or would not happen.
I don’t necessarily accept this. In the case I’m trying to handle, my customer is compiling a math library and while they are not setting FTZ/DAZ, they have a requirement of producing an expected set of results if their library is called with FTZ/DAZ set. If we want to say that library developers need to use the explicit canonicalize built-in to get this behavior, I can live with that. (In fact, that’s what I’ve told them already.) I think this will require some clarification in the Lang Ref.
The currently documented behavior for "denormal-fp-math"=["preserve-sign"|"positive-zero"] aligns well with what you’ve said. It says the compiler may flush denormals but is not guaranteed or required to do so.
The currently documented behavior for "denormal-fp-math"="dynamic" is not consistent with what you’ve said. The Lang Ref says, “If the mode is "dynamic" , the behavior is derived from the dynamic state of the floating-point environment. Transformations which depend on the behavior of denormal values should not be performed.”
Based on what you’ve said above, I think this would be more appropriate:
If the mode is “dynamic”, the optimizer may not make assumptions about the floating-point environment in which the program will be executed. The execution-time behavior is derived from the dynamic state of the floating-point environment. Transformations cannot assume that previous instructions have not flushed denormal values to zero or that previous instructions will have flushed denormal values to zero. However, the optimizer is not required to preserve flushing behavior, except in the case of the llvm.canonicalize intrinsic, which is provided for this exact purpose.
I think that describes the behavior you’re advocating for, right?
I don’t particularly like this. Why wouldn’t we want to provide a way to describe and respect denormal modes that correspond to the architecture of the hardware that we’re compiling for? For instance, if I have a target that doesn’t support denormal values for a particular type (which I believe is common for fp16), I would need the compiler to respect "denormal-fp-math"="preserve-sign" for that architecture in order to achieve numeric consistency. Similarly, if I have an architecture (and I do) for which the denormal state can be modified in a way that is outside of the control of the program or library which I am compiling, why shouldn’t we provide a mode that respects that dynamic behavior? I can understand why we wouldn’t want this to be the default behavior, but I don’t understand why we wouldn’t want to support it.
I realize that it would be difficult to maintain and enforce such semantics, but the “dynamic” semantics described above also seem difficult to enforce. We could at least set it as the expectation and then fix bugs as they occur.
I’m not sure what you mean by “illegitimate” here. Do you mean it doesn’t conform to IEEE-754? I suppose that’s true. On the other hand, there is hardware that flushes denormals, and there are reasons why doing so is desirable. It’s really no different conceptually than “flushing” values that are beyond the range of what can be expressed with the IEEE encoding. It just puts the limit in a different place. If the more limited range is acceptable to a developer and the program is faster with the limited range, why is that illegitimate?
I think that about covers it (but I might phrase it a bit differently).
There’s a large semantic gulf between what we have in the IR and what any particular target provides here. We don’t really have a model, much less one that satisfies what every target wants, that the general IR could match.
Ultimately the IR is defined by program semantics, not hardware. It’s the backend’s responsibility to conform the target implementation to those semantics, and for the most part not the other way around. The flushing behavior is target-defined, and not necessarily internally consistent.
Restricting higher level optimizations to conform to how a target might implement something is backwards. In the past there have been some IR mistakes of this pattern (e.g. long ago we wouldn’t constant fold sqrt intrinsics with negative arguments just in case some target didn’t properly emit a nan, when a target with busted handling should deal with that in lowering rather than blocking the general IR).
Supporting this is the point of the canonicalize intrinsic. The odd user that cares can observe the expected canonical target behavior.
I increasingly think the way we would handle this kind of thing (signaling nans and flushing) is to have an early pass insert canonicalizes around any floating point operations. It will multiply your instruction count, but should get you to consistent behavior.
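To give a feel for what such a pass would produce, here is a sketch of a single add after rewriting. `model_canonicalize` is a software stand-in for llvm.canonicalize on a flushing target (assumption: subnormals become a zero of the same sign), and every name in the block is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

uint32_t bits_from_float(float f) {
    uint32_t u;
    assert(sizeof u == sizeof f); /* this sketch assumes a 32-bit float */
    memcpy(&u, &f, sizeof u);
    return u;
}

float float_from_bits(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Stand-in for llvm.canonicalize on a flushing target (assumption:
   subnormals become a zero of the same sign, nothing else changes). */
float model_canonicalize(float x) {
    uint32_t b = bits_from_float(x);
    if ((b & 0x7F800000u) == 0 && (b & 0x007FFFFFu) != 0)
        b &= 0x80000000u;
    return float_from_bits(b);
}

/* What the hypothetical pass would turn `a + b` into: both inputs and the
   result are canonicalized, so one add becomes four operations. */
float fadd_canonicalized(float a, float b) {
    return model_canonicalize(model_canonicalize(a) + model_canonicalize(b));
}
```

The result no longer depends on whether the hardware add itself flushes, which is the consistency the pass would be buying with the extra instructions.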
Another option I’ve thought about is having a family of float_ftz_daz types, which is a different nightmare.
“For a particular type” is a much neater model of flushing than the variety that does exist, where some opcodes may or may not flush regardless of the mode.
Largely yes, and it’s not as if some other standard is there to fill in the gap with what the behavior should be. It’s nothing more than a hardware performance hack, and different devices have made different choices as to how that hack behaves.
It’s quite different. There are reasons for denormals to exist to manage underflow.
This is still describing a performance hack. Nobody would choose to have denormal flushing as an intrinsically desirable behavior.
I don’t 100% agree with this perspective. Yes, ultimately, the IR defines the semantics that we promise to follow, but the IR is driven by the needs of the compiler. It has to be flexible enough to describe the use cases that we want to support in the compiler, and if it isn’t we update it to make it so. Not that long ago, “denormal-fp-math” didn’t exist at all. What motivated its being added? Was it not a target-specific use case that we wanted to be able to handle? If we just say, “Flushing denormals is an illegitimate practice,” we could make everything “ieee” and pretend that other cases don’t matter.
Yes, I overstated my case a bit. My point is that there are use cases where flushing denormals doesn’t cause problems and does improve performance. Is it a “performance hack”? You can call it that, but it’s something supported by a lot of hardware that has proven useful.
I don’t particularly like this. Why wouldn’t we want to provide a way
to describe and respect denormal modes that correspond to the
architecture of the hardware that we’re compiling for? For instance,
if I have a target that doesn’t support denormal values for a
particular type (which I believe is common for fp16), I would need the
compiler to respect “denormal-fp-math”=“preserve-sign” for that
architecture in order to achieve numeric consistency. Similarly, if I
have an architecture (and I do) for which the denormal state can be
modified in a way that is outside of the control of the program or
library which I am compiling, why shouldn’t we provide a mode that
respects that dynamic behavior? I can understand why we wouldn’t want
this to be the default behavior, but I don’t understand why we
wouldn’t want to support it.
There are, I think, two ways that you can look at the role of a
programming language like C or LLVM IR (very different languages, though
I think the trend is the same):
You can see the language as a way to express the behavior of the
assembly you’re describing, and if that behavior varies among platforms,
well, your code varies among platforms.
Or you can see the language as a way to express a single behavior
independent of the hardware details, and the process of lowering to
hardware assembly requires generating extra code to get that behavior if
necessary.
The dominant trend has definitely been to move in the direction of the
latter, although it is further behind for floating-point (and not helped
by the fact that there is a stronger desire for
speed-at-the-expense-of-correctness for FP than there is for other
operations).
Denormal flushing is, from my perspective, an especially cursed
behavior. In some cases, hardware denormal flushing is mandatory (as in,
the hardware doesn’t support a non-FTZ/DAZ mode for a particular FP
type); in other cases, it’s controllable by the floating-point
environment. But where it is controllable, it seems to generally be
controllable with far more fine-grained controls than a simple “on” or
“off”–there’s a distinction between FTZ+DAZ, FTZ w/o DAZ, and DAZ w/o
FTZ, and I happen to have an architecture manual on my desk for one that
(if I’m reading it correctly) lets you pick between flushing to signed
zero or flushing to positive-zero at your leisure.
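For x86 specifically, the three combinations correspond to two independent MXCSR bits: DAZ is bit 6 and FTZ is bit 15, and the power-on default value of the register is 0x1F80. The small decoder below is a sketch with hypothetical names; it only classifies a register value and does not read or write the real MXCSR.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_DAZ 0x0040u /* bit 6: denormals-are-zero (inputs)  */
#define MXCSR_FTZ 0x8000u /* bit 15: flush-to-zero (results)     */

typedef enum {
    FLUSH_NONE,
    FLUSH_FTZ_ONLY,
    FLUSH_DAZ_ONLY,
    FLUSH_FTZ_AND_DAZ
} flush_config;

/* Classify an MXCSR value by its two flushing bits. */
flush_config classify_mxcsr(uint32_t mxcsr) {
    int ftz = (mxcsr & MXCSR_FTZ) != 0;
    int daz = (mxcsr & MXCSR_DAZ) != 0;
    if (ftz && daz) return FLUSH_FTZ_AND_DAZ;
    if (ftz)        return FLUSH_FTZ_ONLY;
    if (daz)        return FLUSH_DAZ_ONLY;
    return FLUSH_NONE;
}

void check_examples(void) {
    assert(classify_mxcsr(0x1F80u) == FLUSH_NONE);        /* default value */
    assert(classify_mxcsr(0x9FC0u) == FLUSH_FTZ_AND_DAZ); /* both bits set */
}
```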
But the real problem with denormal flushing to me is that it’s a
floating point environment-consequent change that we largely don’t want
to handle with our existing floating point environment mechanisms (i.e.,
constrained intrinsics). The semantics we generally have for FP IR is
“regular instructions assume default FP environment, and if you’re not
in the default FP environment, that’s what the constrained intrinsics
are for”. Denormal flushing is, well, not the default FP environment,
but its purpose is extra speed which is the exact opposite of what
constrained intrinsics provide, so we’re left with ungainly semantics as
a result.
To try to bring this thread to a conclusion, though, here are some
principles I hope everyone can agree on:
llvm.canonicalize has the expected behavior of flushing denormals if
denormal flushing is dynamically enabled.
llvm.canonicalize more broadly should be considered exempt from
standard LLVM rules on FP behavior, and instead a lot closer to call @llvm.experimental.constrained.fmul(%x, 1.0). (What keeps me from
saying it’s an exact equivalence is that I don’t have full knowledge of all
the funky FP knobs one can turn on every single piece of hardware.) As a
result, it should generally be treated as unoptimizable except in
limited circumstances (e.g., you can constant-fold a normal value, but
not a denormal [even with denormal-fp-math=ieee], and repeated calls
to llvm.canonicalize can be optimized away because it is
idempotent–but I am wary of llvm.canonicalize(x) == x => x == x,
because it’s not true for FTZ-without-DAZ).
The default value of denormal-fp-math for any platform with a crtfastmath.o that sets FTZ/DAZ should be set to dynamic. This value
should be used regardless of the current fast-math flags applied or if
clang can find crtfastmath.o because by the point that it’s possible
for someone to compile code that twiddles FTZ/DAZ with just -ffast-math, it’s exceedingly difficult for the toolchain (or even the
user in many cases) to honestly assert that nobody has twiddled it.
Floating-point operations (other than llvm.canonicalize) are not
guaranteed to have canonicalizing behavior, contrary to IEEE 754
semantics. This includes sNaN quieting rules–which are already
documented as such–and would include possible denormal flushing (and
decimal noncanonicals, when we get around to implementing decimal fp).
This is more controversial, but I think correct:
Constant folding in general should just fail to constant fold where
denormals are involved, independent of the mode.
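A sketch of what that last rule could look like for a single multiply, assuming a 32-bit `float`. The `fold_result` type and `try_fold_fmul` are hypothetical, and the sketch uses host arithmetic for brevity where a real compiler would use a software float implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { int folded; float value; } fold_result; /* hypothetical */

float float_from_bits(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

static int is_subnormal(float f) {
    uint32_t b;
    assert(sizeof b == sizeof f); /* this sketch assumes a 32-bit float */
    memcpy(&b, &f, sizeof b);
    return (b & 0x7F800000u) == 0 && (b & 0x007FFFFFu) != 0;
}

/* Fold a * b only when no denormal is involved, as an input or as the
   result. Otherwise report "not folded" and leave the multiply for run
   time, where the actual flushing mode decides the answer. */
fold_result try_fold_fmul(float a, float b) {
    fold_result r = {0, 0.0f};
    float product;
    if (is_subnormal(a) || is_subnormal(b))
        return r;
    product = a * b; /* host arithmetic, for illustration only */
    if (is_subnormal(product))
        return r;
    r.folded = 1;
    r.value = product;
    return r;
}
```

The point of refusing is that the folded value would bake in one flushing behavior regardless of what mode the program actually runs in.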