Handling of FP denormal values

andykaylor · September 16, 2019, 11:57pm

Hi all,

While reviewing a recent clang documentation change, I became aware of an issue with the way that clang is handling FP denormals. There is currently some support for variations in the way denormals are handled, but it isn’t consistent across architectures and generally feels kind of half-baked. I’d like to discuss possible solutions to this problem.

First, there is a clang command line option:

-fdenormal-fp-math=

Select which denormal numbers the code is permitted to require. Valid values are: ieee, preserve-sign, and positive-zero, which correspond to IEEE 754 denormal numbers, the sign of a flushed-to-zero number is preserved in the sign of 0, denormals are flushed to positive zero, respectively.

A quick survey of the code leads me to believe this has no effect for targets other than ARM. For X86 targets we may want different options. I’ll say more about that below. The wording of the documentation is sufficiently ambiguous that I’m not entirely certain whether it is intended to control the target hardware or just the optimizer.

In addition, when either -Ofast or -ffast-math is used, we attempt to link ‘crtfastmath.o’ if it can be found. For X86 targets, this object file adds a static constructor that sets the DAZ and FTZ bits of the MXCSR register. I expect that it has analogous behavior for other architectures when it is available. This object file is typically available on Linux systems, possibly also with things like MinGW. If it isn’t found, the denormal control flags will be left in their default state.
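For anyone who hasn’t looked at what such a startup object does, here is a rough C sketch of the idea on X86. This is not the actual crtfastmath.o source. The function names and the SKETCH_HAVE_MXCSR macro are made up for illustration, and the code assumes GCC or Clang (for the constructor attribute). On targets without SSE it compiles to a no-op.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__SSE__)
#include <xmmintrin.h>          /* _mm_getcsr, _mm_setcsr */
#define SKETCH_HAVE_MXCSR 1
#else
#define SKETCH_HAVE_MXCSR 0     /* other targets: the sketch does nothing */
#endif

#define MXCSR_DAZ 0x0040u       /* bit 6: denormal inputs are read as zero */
#define MXCSR_FTZ 0x8000u       /* bit 15: denormal results are flushed to zero */

/* Hypothetical stand-in for the startup hook; runs before main(). */
__attribute__((constructor))
static void sketch_enable_ftz_daz(void)
{
#if SKETCH_HAVE_MXCSR
    /* Read-modify-write so the other MXCSR fields are left alone. */
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
#endif
}

/* Reports whether both bits are set in the calling thread's MXCSR. */
bool sketch_ftz_daz_enabled(void)
{
#if SKETCH_HAVE_MXCSR
    unsigned int want = MXCSR_FTZ | MXCSR_DAZ;
    return (_mm_getcsr() & want) == want;
#else
    return false;
#endif
}
```

Note that MXCSR is per-thread state, so the constructor only directly affects the thread that runs it.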

There is also a CUDA-specific option, -f[no-]cuda-flush-denormals-to-zero. I don’t know how this is implemented, but the documentation says it is specific to CUDA device mode.

Finally, there is an OpenCL-specific option, -cl-denorms-are-zero. Again, I don’t know how it is implemented.

So… I’d like to talk about how we can corral all of this into some interface that is consistent (or at least consistently sensible) across architectures.

The problems I see are:

-fdenormal-fp-math needs to handle all scenarios needed by all architectures (or needs to be limited to a common subset).
-fdenormal-fp-math needs to be reconciled with -ffast-math and its variants.
-fdenormal-fp-math needs to be consistent about whether or not it imposes hardware changes when applicable.

I can only really speak to X86, so I’ll say a few words about that to start the discussion.

The current choices for -fdenormal-fp-math are: ieee, preserve-sign, and positive-zero. With X86, you get ieee behavior if neither DAZ nor FTZ is set. If FTZ is set you get ‘preserve sign’ behavior – i.e. denormal results are flushed to zero and the sign of the result is kept. There is no way to get ‘positive zero’ behavior with X86. At the hardware level, modern X86 processors have separate controls for ftz (results are flushed to zero) and daz (inputs are flushed to zero before calculations), but I doubt that they are used independently often enough to distinguish them at the command line option level.
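To make the FTZ/DAZ distinction concrete, here is a small C sketch, assuming an x86-64 target where float arithmetic uses SSE. The helper names (mul_in_mode, is_negative_zero) and the SKETCH_X86_64 macro are mine, not anything from clang. The volatile temporaries are only there to keep the compiler from folding the multiply at compile time.

```c
#include <float.h>   /* FLT_MIN */
#include <math.h>    /* signbit */

#if defined(__x86_64__)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr */
#define SKETCH_X86_64 1
#else
#define SKETCH_X86_64 0  /* elsewhere the mode argument is ignored */
#endif

#define MXCSR_DAZ 0x0040u
#define MXCSR_FTZ 0x8000u

/* Computes a * b with exactly the requested flush bits set,
   then restores the previous MXCSR value. */
float mul_in_mode(float a, float b, unsigned int mode_bits)
{
    volatile float va = a, vb = b, vr;
#if SKETCH_X86_64
    unsigned int saved = _mm_getcsr();
    _mm_setcsr((saved & ~(MXCSR_FTZ | MXCSR_DAZ)) | mode_bits);
    vr = va * vb;
    _mm_setcsr(saved);
#else
    (void)mode_bits;
    vr = va * vb;
#endif
    return vr;
}

/* True for -0.0f only. */
int is_negative_zero(float f)
{
    return f == 0.0f && signbit(f);
}
```

With this, mul_in_mode(-FLT_MIN, 0.3f, MXCSR_FTZ) gives a negative zero (the denormal result is flushed and the sign is kept), mul_in_mode(0x1p-127f, 4.0f, MXCSR_FTZ) gives 0x1p-125f (FTZ alone does not touch a denormal input), and the same multiply with MXCSR_DAZ gives zero because the input is read as zero.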

Also, any X87 instructions that happen to be generated (such as if the code contains ‘long double’ data on Linux) will ignore the ftz and daz settings. There are some early Pentium 4 processors that don’t support ‘daz’ but I hope we can safely ignore that fact.

Linking in crtfastmath.o when -Ofast or -ffast-math is used is consistent with GCC’s behavior. However, it implicitly ignores -fdenormal-fp-math, which GCC doesn’t have. In most cases if a user sets a fast math option they probably also want DAZ and FTZ, but there might be some reason why an advanced user would want to treat them separately. This can be done with intrinsics, of course, but if we have an option to control it, we should respect that option. Also, it is possible to construct fast math behavior cafeteria-style (i.e. setting some fast math flags and not others) so we should probably have a way to add ftz behaviors a la carte.

FWIW, ICC sets the FTZ and DAZ flags from a function call that is inserted into main depending on the options used to compile the file containing main.

Trying to go back to the general case, I’d like to solicit information about whether other targets have/need different denormal options than are described above. Further, I’d suggest that for any architecture that supports FTZ behavior, a well-documented default be automatically set when fast math is enabled via -Ofast, -ffast-math, or -funsafe-math-optimizations, unless that option is turned off by a subsequent -fno-fast-math/-fno-unsafe-math-optimizations option or overridden by a subsequent -fdenormal-fp-math option. If -fdenormal-fp-math is used, some code would be emitted to set the relevant hardware controls.
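The precedence I have in mind could be sketched roughly like this in C. It is not clang’s driver code. The enum, the function name, and the choice of preserve-sign as the fast-math default are all assumptions made for the example, and it uses a simple "last relevant option wins" rule.

```c
#include <string.h>

typedef enum {
    DENORM_IEEE,
    DENORM_PRESERVE_SIGN,
    DENORM_POSITIVE_ZERO
} denorm_mode;

/* Assumed default that fast math would imply; hypothetical. */
#define DENORM_FAST_MATH_DEFAULT DENORM_PRESERVE_SIGN

/* Scans the options left to right; the last relevant option wins. */
denorm_mode resolve_denorm_mode(int argc, const char *const *argv)
{
    static const char prefix[] = "-fdenormal-fp-math=";
    const size_t prefix_len = sizeof prefix - 1;
    denorm_mode mode = DENORM_IEEE;

    for (int i = 0; i < argc; ++i) {
        const char *arg = argv[i];
        if (strcmp(arg, "-Ofast") == 0 ||
            strcmp(arg, "-ffast-math") == 0 ||
            strcmp(arg, "-funsafe-math-optimizations") == 0) {
            mode = DENORM_FAST_MATH_DEFAULT;
        } else if (strcmp(arg, "-fno-fast-math") == 0 ||
                   strcmp(arg, "-fno-unsafe-math-optimizations") == 0) {
            mode = DENORM_IEEE;
        } else if (strncmp(arg, prefix, prefix_len) == 0) {
            const char *value = arg + prefix_len;
            if (strcmp(value, "ieee") == 0)
                mode = DENORM_IEEE;
            else if (strcmp(value, "preserve-sign") == 0)
                mode = DENORM_PRESERVE_SIGN;
            else if (strcmp(value, "positive-zero") == 0)
                mode = DENORM_POSITIVE_ZERO;
            /* Unknown values are ignored in this sketch. */
        }
    }
    return mode;
}
```

So "-Ofast -fdenormal-fp-math=ieee" resolves to ieee, while "-fdenormal-fp-math=ieee -ffast-math" resolves to the fast-math default. Whether that second case is what we want is one of the things to settle.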

I don’t have a strong opinion on whether it is better to emit a static constructor or to inject a call into main. The latter seems more predictable. I’d like to avoid a dependency on crtfastmath.o either way.

Do we need an ftz fast-math flag?

Are there any other facets to this problem that I’ve overlooked?

Thanks,

Andy

cjm345 · September 17, 2019, 12:58am

> I don’t have a strong opinion on whether it is better to emit a static constructor or to inject a call into main. The latter seems more predictable. I’d like to avoid a dependency on crtfastmath.o either way.

I would like to see it called from .init_array (or equivalent) with the highest init_priority. That way, dynamic initializers get the benefit too. If we’re requesting DAZ+FTZ on the command line, there’s no need for a slow start-up.
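As a rough illustration of the ordering, here is a C sketch using the GCC/Clang constructor attribute with a priority. The function names and the logging are placeholders; the real hook would set the FP control bits instead of logging. It assumes a toolchain that honors constructor priorities, where lower numbers run earlier and 101 is the first value not reserved for the implementation.

```c
#include <stddef.h>

/* Records the order in which the constructors ran, as a short string. */
static char init_log[4];
static size_t init_count;

static void log_step(char tag)
{
    if (init_count < sizeof init_log - 1)
        init_log[init_count++] = tag;
}

/* Stand-in for an ordinary dynamic initializer (default priority). */
__attribute__((constructor))
static void ordinary_initializer(void)
{
    log_step('i');
}

/* Stand-in for the FP-mode setup, requested to run as early as a
   user constructor can. */
__attribute__((constructor(101)))
static void fp_mode_setup(void)
{
    log_step('m');
}

/* Returns the recorded order, e.g. "mi" if the mode setup ran first. */
const char *init_order(void)
{
    return init_log;
}
```

The prioritized constructor runs first even though it appears second in the file, so ordinary initializers already see the requested FP mode.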

Digressing a bit, but I don’t like how some implementations of crtfastmath.o clear all the flags while setting the DAZ+FTZ flags (e.g. AArch64). Seems unnecessary and makes its position on the link line significant.
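The difference is easy to show on a plain integer standing in for the control register. This is only a model: the function names are made up, and the bit values are borrowed from the X86 MXCSR (FTZ is bit 15, DAZ is bit 6) purely as an example, not taken from any crtfastmath.o implementation.

```c
#include <stdint.h>

#define CTRL_FLUSH_BITS 0x8040u   /* example: MXCSR FTZ | DAZ */

/* Overwrite style: every other field in the register is lost. */
uint32_t set_flush_overwrite(uint32_t ctrl)
{
    (void)ctrl;
    return CTRL_FLUSH_BITS;
}

/* Read-modify-write style: only the flush bits change. */
uint32_t set_flush_rmw(uint32_t ctrl)
{
    return ctrl | CTRL_FLUSH_BITS;
}
```

Starting from 0x7F80 (some rounding and mask bits already set), the overwrite version yields 0x8040 and the read-modify-write version yields 0xFFC0. With the second style it no longer matters what ran earlier.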

arsenm · September 17, 2019, 1:43am

> Do we need an ftz fast-math flag?

This would be useful for matching a handful of AMDGPU instructions (an fmad that always flushes being the most important). We have a dedicated intrinsic to allow flushing in this case when denormals are enabled.

> Are there any other facets to this problem that I’ve overlooked?

For AMDGPU we need to split -denormal-fp-math into per-FP type flags (and the corresponding IR attribute). The denorm mode register has separate fields for f32, and f64+f16. The default for each of these is different depending on the subtarget/language combination. Mostly we want f64+f16 to always be on, and only change the f32 mode. The current naming implies changing all of the modes.

The different sign-of-zero modes that exist now aren’t available. There are however separate flags for enabling flushing on input and output. This isn’t particularly important, and currently we just set both bits at the same time, but it might be something to think about if this is being expanded.

-Matt

cjm345 · September 17, 2019, 3:07pm

>> Do we need an ftz fast-math flag?

> This would be useful for matching a handful of AMDGPU instructions (an fmad that always flushes being the most important). We have a dedicated intrinsic to allow flushing in this case when denormals are enabled.

+1

For FTZ/DAZ, we’re currently getting cases like this incorrect:

%add = fadd nnan ninf nsz float %a, 0.000000e+00

That cannot be safely optimized to ‘a’ with FTZ/DAZ enabled. Although, there’s admittedly a small chance of problems, since a following FP operation would normalize it, but here be dragons.
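Here is a small C sketch of the problem, assuming an x86-64 target where float arithmetic uses SSE. The helper names and the SKETCH_X86_64 macro are made up for the example. It evaluates a + 0.0f with FTZ and DAZ set and compares the result bit-for-bit with a, which is what the fold would produce.

```c
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr */
#define SKETCH_X86_64 1
#else
#define SKETCH_X86_64 0  /* elsewhere the add runs in the ambient mode */
#endif

#define MXCSR_DAZ 0x0040u
#define MXCSR_FTZ 0x8000u

static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Evaluates a + 0.0f with FTZ and DAZ set, then restores MXCSR.
   volatile keeps the compiler from simplifying the add itself. */
float add_zero_flushed(float a)
{
    volatile float va = a, vr;
#if SKETCH_X86_64
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_FTZ | MXCSR_DAZ);
    vr = va + 0.0f;
    _mm_setcsr(saved);
#else
    vr = va + 0.0f;
#endif
    return vr;
}

/* Nonzero if replacing a + 0.0f by a gives the same bit pattern. */
int fold_is_safe(float a)
{
    return float_bits(add_zero_flushed(a)) == float_bits(a);
}
```

For a normal input such as 1.0f the fold is harmless, but for the denormal 0x1p-127f the hardware result is zero while the folded result is still the denormal.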

>> Are there any other facets to this problem that I’ve overlooked?

> For AMDGPU we need to split -denormal-fp-math into per-FP type flags (and the corresponding IR attribute). The denorm mode register has separate fields for f32, and f64+f16. The default for each of these is different depending on the subtarget/language combination. Mostly we want f64+f16 to always be on, and only change the f32 mode. The current naming implies changing all of the modes.

> The different sign-of-zero modes that exist now aren’t available. There are however separate flags for enabling flushing on input and output. This isn’t particularly important, and currently we just set both bits at the same time, but it might be something to think about if this is being expanded.

At the command-line level, I don’t see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop.

cjm345 · September 17, 2019, 3:55pm

EDIT: ‘accuracy’ should be ‘precision’.

andykaylor · September 17, 2019, 4:27pm

> At the command-line level, I don’t see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop.

> EDIT: ‘accuracy’ should be ‘precision’.

How do you imagine that being specified in the local scope? Two ways that come to mind would be a pragma or an intrinsic. The pragma would probably be the cleanest, though more work for the front end. I suspect most architectures already have intrinsics to control this locally, but we could possibly add a target-independent intrinsic that would be better for the optimizer. But I think you want this to set or clear a flag on individual operations to help with instruction selection, right?
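For reference, the intrinsic route on X86 today looks roughly like the C sketch below. The flush_scope type and the function names are invented for the example, and on targets without SSE the helpers do nothing. Nothing here tells the optimizer that the FP operations between enter and leave depend on the mode change, which is part of why a pragma or a per-operation flag is attractive.

```c
#if defined(__x86_64__) || defined(__SSE__)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr */
#define SKETCH_HAVE_MXCSR 1
#else
#define SKETCH_HAVE_MXCSR 0
#endif

#define MXCSR_DAZ 0x0040u
#define MXCSR_FTZ 0x8000u
#define MXCSR_FLUSH_BITS (MXCSR_FTZ | MXCSR_DAZ)

/* Hypothetical handle holding the MXCSR value to restore. */
typedef struct { unsigned int saved_csr; } flush_scope;

/* Returns the FTZ/DAZ bits currently set for this thread. */
unsigned int current_flush_bits(void)
{
#if SKETCH_HAVE_MXCSR
    return _mm_getcsr() & MXCSR_FLUSH_BITS;
#else
    return 0u;
#endif
}

/* Turns on FTZ and DAZ and remembers the previous state. */
flush_scope flush_scope_enter(void)
{
    flush_scope scope = { 0u };
#if SKETCH_HAVE_MXCSR
    scope.saved_csr = _mm_getcsr();
    _mm_setcsr(scope.saved_csr | MXCSR_FLUSH_BITS);
#endif
    return scope;
}

/* Restores the previous state. This also discards any exception
   flags raised inside the scope, since it rewrites all of MXCSR. */
void flush_scope_leave(flush_scope scope)
{
#if SKETCH_HAVE_MXCSR
    _mm_setcsr(scope.saved_csr);
#else
    (void)scope;
#endif
}
```

A hot loop would sit between flush_scope_enter() and flush_scope_leave().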

cjm345 · September 17, 2019, 5:30pm

>> At the command-line level, I don't see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop.

> EDIT: 'accuracy' should be 'precision'.

> How do you imagine that being specified in the local scope? Two ways that come to mind would be a pragma or an intrinsic. The pragma would probably be the cleanest, though more work for the front end. I suspect most architectures already have intrinsics to control this locally, but we could possibly add a target-independent intrinsic that would be better for the optimizer.

Good question. I haven't thought about it. I don't know if I have a
strong opinion either. It's pretty clear that something will be
needed, since tracking bits being flipped in the control register is
dubious.

It's probably a question for the CFE experts. Assuming that we add
FTZ/DAZ fast math flags, what would be the best way to attach a
FTZ/DAZ fast math flag to an individual IR instruction? Is that
currently done for other FMFs? Or are they just toggled by the higher
level -ffast-math and friends?

> But I think you want this to set or clear a flag on individual operations to help with instruction selection, right?

I think that would be useful. Well, at least I imagine it could be
useful. My personal experience is that users want all-or-nothing
regarding DAZ+FTZ, so a command-line switch would be sufficient.
