While reviewing a recent clang documentation change, I became aware of an issue with the way that clang is handling FP denormals. There is currently some support for variations in the way denormals are handled, but it isn’t consistent across architectures and generally feels kind of half-baked. I’d like to discuss possible solutions to this problem.
First, there is a clang command line option, -fdenormal-fp-math=<arg>:
Select which denormal numbers the code is permitted to require.
Valid values are: ieee, preserve-sign, and positive-zero, which
correspond to IEEE 754 denormal numbers, the sign of a flushed-to-zero
number is preserved in the sign of 0, denormals are flushed to positive
zero, respectively.
A quick survey of the code leads me to believe this has no effect for targets other than ARM. For X86 targets we may want different options. I’ll say more about that below. The wording of the documentation is sufficiently ambiguous that I’m not entirely certain whether it is intended to control the target hardware or just the optimizer.
In addition, when either -Ofast or -ffast-math is used, we attempt to link ‘crtfastmath.o’ if it can be found. For X86 targets, this object file adds a static constructor that sets the DAZ and FTZ bits of the MXCSR register. I expect that it has analogous behavior for other architectures when it is available. This object file is typically available on Linux systems, possibly also with things like MinGW. If it isn’t found, the denormal control flags will be left in their default state.
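For anyone who hasn’t looked at it, here is roughly what I understand that constructor to do on X86. This is a minimal sketch, not the actual crtfastmath source: the function name is made up, it assumes an X86 target with SSE and a GCC-compatible compiler, and it skips the check for DAZ support that a real implementation would need on very old processors.

```c
#include <xmmintrin.h> /* _mm_getcsr, _mm_setcsr */

#define MXCSR_DAZ 0x0040u /* bit 6: denormal inputs are treated as zero */
#define MXCSR_FTZ 0x8000u /* bit 15: denormal results are flushed to zero */

/* Hypothetical stand-in for the crtfastmath.o constructor. It runs before
   main and only affects the thread that executes it; threads created later
   typically inherit the creating thread's MXCSR. */
__attribute__((constructor))
static void enable_ftz_daz(void) {
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
}
```

Linking an object like this into a program changes the denormal behavior of every translation unit in it, whichever options the other files were compiled with.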
There is also a CUDA-specific option, -f[no-]cuda-flush-denormals-to-zero. I don’t know how this is implemented, but the documentation says it is specific to CUDA device mode.
Finally, there is an OpenCL-specific option, -cl-denorms-are-zero. Again, I don’t know how it is implemented.
So… I’d like to talk about how we can corral all of this into some interface that is consistent (or at least consistently sensible) across architectures.
The problems I see are:
-fdenormal-fp-math needs to handle all scenarios needed by all architectures (or needs to be limited to a common subset).
-fdenormal-fp-math needs to be reconciled with -ffast-math and its variants.
-fdenormal-fp-math needs to be consistent about whether or not it imposes hardware changes when applicable.
I can only really speak to X86, so I’ll say a few words about that to start the discussion.
The current choices for -fdenormal-fp-math are: ieee, preserve-sign, and positive-zero. With X86, you get ieee behavior if neither DAZ nor FTZ is set. If FTZ is set you get ‘preserve sign’ behavior – i.e. denormal results are flushed to zero and the sign of the result is kept. There is no way to get ‘positive zero’ behavior with X86. At the hardware level, modern X86 processors have separate controls for ftz (results are flushed to zero) and daz (inputs are flushed to zero before calculations), but I doubt that they are used independently often enough to distinguish them at the command line option level.
Also, any X87 instructions that happen to be generated (such as if the code contains ‘long double’ data on Linux) will ignore the ftz and daz settings. There are some early Pentium 4 processors that don’t support ‘daz’ but I hope we can safely ignore that fact.
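To make the ftz/daz distinction concrete, here is a small sketch that toggles the two MXCSR bits independently around a single multiply. It assumes an x86-64 target (so float arithmetic goes through SSE rather than X87) and a processor that supports DAZ; mul_with_modes is a name I made up for illustration.

```c
#include <float.h>     /* FLT_MIN */
#include <math.h>      /* signbit */
#include <xmmintrin.h> /* _mm_getcsr, _mm_setcsr */

#define MXCSR_DAZ 0x0040u /* inputs: denormal operands are read as zero */
#define MXCSR_FTZ 0x8000u /* results: denormal results are written as zero */

/* Multiply a by b with exactly the requested FTZ/DAZ bits set, then restore
   the caller's MXCSR. The volatiles keep the multiply from being folded at
   compile time, so it really executes under the selected mode. */
float mul_with_modes(float a, float b, unsigned mode_bits) {
    unsigned saved = _mm_getcsr();
    _mm_setcsr((saved & ~(MXCSR_FTZ | MXCSR_DAZ)) | mode_bits);
    volatile float x = a, y = b;
    volatile float r = x * y;
    float result = r;
    _mm_setcsr(saved);
    return result;
}
```

With neither bit set, mul_with_modes(-FLT_MIN, 0.3f, 0) returns a negative denormal. With MXCSR_FTZ the same call returns -0.0f, which is the ‘preserve sign’ behavior. A denormal input such as -1e-40f multiplied by 1e10f gives a normal result under MXCSR_FTZ alone, since only results are flushed, but gives -0.0f under MXCSR_DAZ.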
Linking in crtfastmath.o when -Ofast or -ffast-math is used is consistent with GCC’s behavior. However, it implicitly ignores -fdenormal-fp-math, which GCC doesn’t have. In most cases if a user sets a fast math option they probably also want DAZ and FTZ, but there might be some reason why an advanced user would want to treat them separately. This can be done with intrinsics, of course, but if we have an option to control it, we should respect that option. Also, it is possible to construct fast math behavior cafeteria-style (i.e. setting some fast math flags and not others) so we should probably have a way to add ftz behaviors a la carte.
FWIW, ICC sets the FTZ and DAZ flags from a function call that is inserted into main depending on the options used to compile the file containing main.
Trying to go back to the general case, I’d like to solicit information about whether other targets have/need different denormal options than are described above. Further, I’d suggest that for any architecture that supports FTZ behavior, a well-documented default be set automatically when fast math is enabled via -Ofast, -ffast-math, or -funsafe-math-optimizations, unless that option is turned off by a subsequent -fno-fast-math/-fno-unsafe-math-optimizations option or overridden by a subsequent -fdenormal-fp-math option. If -fdenormal-fp-math is used, some code would be emitted to set the relevant hardware controls.
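To illustrate the ordering rule I have in mind, here is a sketch of a ‘last option wins’ resolution. It is not how the clang driver is written today; the enum, the function name, and the choice of preserve-sign as the fast-math default are all assumptions for the sake of discussion. Having -fno-fast-math reset the mode to ieee is just one possible reading of the proposal.

```c
#include <string.h>

typedef enum {
    DENORM_IEEE,
    DENORM_PRESERVE_SIGN,
    DENORM_POSITIVE_ZERO
} DenormMode;

/* Hypothetical helper: walk the options left to right and let each one
   that affects denormal handling replace the current mode. */
DenormMode resolve_denormal_mode(int argc, const char *const *argv) {
    static const char prefix[] = "-fdenormal-fp-math=";
    const size_t prefix_len = sizeof(prefix) - 1;
    DenormMode mode = DENORM_IEEE;

    for (int i = 0; i < argc; ++i) {
        const char *arg = argv[i];
        if (!strcmp(arg, "-Ofast") || !strcmp(arg, "-ffast-math") ||
            !strcmp(arg, "-funsafe-math-optimizations")) {
            mode = DENORM_PRESERVE_SIGN; /* assumed documented target default */
        } else if (!strcmp(arg, "-fno-fast-math") ||
                   !strcmp(arg, "-fno-unsafe-math-optimizations")) {
            mode = DENORM_IEEE;
        } else if (!strncmp(arg, prefix, prefix_len)) {
            const char *value = arg + prefix_len;
            if (!strcmp(value, "ieee"))
                mode = DENORM_IEEE;
            else if (!strcmp(value, "preserve-sign"))
                mode = DENORM_PRESERVE_SIGN;
            else if (!strcmp(value, "positive-zero"))
                mode = DENORM_POSITIVE_ZERO;
            /* Unknown values are ignored here; a real driver would diagnose. */
        }
    }
    return mode;
}
```

Under this rule, ‘-ffast-math -fdenormal-fp-math=ieee’ resolves to ieee, while ‘-fdenormal-fp-math=ieee -ffast-math’ resolves to the fast-math default.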
I don’t have a strong opinion on whether it is better to emit a static constructor or to inject a call into main. The latter seems more predictable. I’d like to avoid a dependency on crtfastmath.o either way.
Do we need an ftz fast-math flag?
Are there any other facets to this problem that I’ve overlooked?