Support for trapping math

Running applications with floating-point trapping enabled currently has some issues.

Users may want FP traps enabled for various reasons. Some need this mechanism to make applications more robust by catching errors even in a production environment. Others want better performance: for example, running computations with fast-math options and, if infinities or NaNs appear, switching to the slow version. Users know that enabling traps costs nothing at run time (it only sets bits in a control register), and they expect no performance drop.

Running an application with FP traps enabled requires some support from the compiler; GCC has the option -ftrapping-math for that. Clang also has such an option, but it is implemented as a synonym for -ffp-exception-behavior=strict. These are different things, however. Strict exception tracking guarantees that the FP status bits change according to the statement sequence in the source file. A trap, on the other hand, is a general mechanism provided by the processor. Setting status bits may initiate a trap, but otherwise the two are independent. One can easily imagine a core that does not have the FP exception bits required by IEEE-754 but still traps on overflow or invalid operation.
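
To make the distinction concrete, here is a minimal sketch (it assumes a glibc target, since feenableexcept is a GNU extension rather than standard C): the first division only sets the Invalid status flag; a trap is delivered only after the corresponding exception has been unmasked in the control register.

#define _GNU_SOURCE                  /* feenableexcept is a glibc extension */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON
int main(void) {
  volatile double zero = 0.0;
  volatile double r = zero / zero;   /* sets the Invalid status flag, no trap */
  printf("FE_INVALID flag: %d\n", !!fetestexcept(FE_INVALID));

  feclearexcept(FE_INVALID);
  feenableexcept(FE_INVALID);        /* unmask the trap in the control register */
  r = zero / zero;                   /* the same operation now traps (SIGFPE) */
  (void)r;
  return 0;
}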

Strict exception handling is currently the only way in Clang to get semantics consistent with trapping. Unfortunately, this mode is not suitable for practical use. It requires all floating-point operations to have side effects, which substantially limits optimizations. Strict exception tracking is also inconsistent with vectorization. As a result, performance is poor. For example, running SPEC 2017 fpspeed with -O3 demonstrates a 30% slowdown for 638.imagick_s if -ffp-exception-behavior=strict is also specified.

Calculation with the default FP modes is the most suitable solution, as it provides the best performance. However, Clang makes transformations that are not valid for trapping math, in particular:

  • Constant folding is allowed to evaluate expressions that would otherwise trap. For example, 0.0/0.0 can appear as a result of some optimization, such as LTO. Currently it is folded to NaN and the trap does not happen (see the sketch after this list).
  • The checks made by fcmp and is_fpclass are currently treated as interchangeable. However, they behave differently if traps are enabled and the argument is a signaling NaN (which may be used to represent an uninitialized FP value). If fcmp is used instead of is_fpclass, a superfluous trap occurs; in the reverse case the trap is missed.
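
To illustrate the first item, here is a minimal sketch (the function names are invented): after inlining or LTO the division becomes the literal 0.0/0.0, constant folding replaces it with NaN at compile time, and the Invalid Operation trap the user expects never fires.

static double ratio(double x, double y) { return x / y; }

/* After inlining this becomes the constant expression 0.0 / 0.0; constant
   folding turns it into NaN, so no Invalid exception is raised at run time. */
double always_nan(void) { return ratio(0.0, 0.0); }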

The C standard does not restrict the use of traps in the default mode; #pragma STDC FENV_ACCESS ON is not required for it, because enabling traps does not change the FP control modes. So trapping math should be enabled by default; GCC behaves this way.

I am interested in a proper implementation of -ftrapping-math. The option should be independent of other code generation options and combinable with any of them: strict exceptions, fast-math, or the default mode. The negative option “-fno-trapping-math” is added to the set of fast-math flags, as GCC does. In IR the option is represented by the existing function attribute no-trapping-math, whose default value depends on the target. If it is set to false, transformations must preserve traps on the FP exceptions Invalid Operation, Overflow, and Division by Zero, and must not introduce new traps. It is not intended to preserve all exceptions precisely; some may be eliminated by allowed transformations (like reassociation).
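
As an illustration of the last point, a hedged sketch (the constant and function are made up): if reassociation is enabled, e.g. under fast-math, the expression below may be simplified to just x, so the intermediate Overflow exception, and its trap, can legitimately disappear; what the proposed -ftrapping-math forbids is introducing a trap that the source program would not have raised.

double reassoc_example(double x) {
  /* The intermediate (x + 1.0e308) may overflow and trap for large x; if
     reassociation folds the whole expression to x, that trap disappears,
     which is allowed under the proposed semantics. */
  return (x + 1.0e308) - 1.0e308;
}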

Any feedback is appreciated.

Thanks,
Serge

It should be possible to vectorize strict-fp intrinsics, at least in some cases; the intrinsics support vector operands, and there isn’t any general prohibition on reordering strict-fp operations. Not sure how hard this is to implement.

We currently vectorize non-strict fp instructions in ways that aren’t legal under -ftrapping-math.

So either way, to get the optimizations you want, we need to mess with the vectorizer.


Most of the reason that we introduced the strict-fp intrinsics in the first place also applies to any form of trapping: we assume non-strict-fp floating-point operations don’t have side-effects, and changing that is hard. We intentionally made that tradeoff at the time to avoid impacting normal workloads… and I think it worked out in the sense that we managed to incrementally introduce strict-fp without breaking anything. We could maybe consider revisiting the split between strict and non-strict fp now that strict-fp is more mature.

I don’t think this is interpreting the standard correctly. FENV_ACCESS governs access to the “floating-point environment”. Annex F describes trapping as part of the floating-point environment.

It’s surprising to me this is the gcc default.

I don’t think this is interpreting the standard correctly. FENV_ACCESS governs access to the “floating-point environment”. Annex F describes trapping as part of the floating-point environment.

In the C2x spec (the version I have most easily available), the wording is:

If part of a program tests floating-point status flags or establishes non-default floating-point mode settings using any means other than the FENV_ROUND pragmas, but was translated with the state for the FENV_ACCESS pragma “off”, the behavior is undefined.

“Floating-point mode” isn’t precisely defined anywhere, but from the text of the standard it can be inferred to mean the equivalent of a bit in the MXCSR or FPCR or similar registers. As you mention, however, Annex F footnote 443 states

Dynamic rounding precision and trap enablement modes are examples of such extensions.

F.8.3 further provides that the initial dynamic floating-point environment is initialized so that “Trapping or stopping (if supported) is disabled on all floating-point exceptions.”, from which it can be inferred that the ‘default’ mode setting for trapping is to disable it.

IEEE 754-2008 removed trapping-based handlers for floating-point exceptions and replaced them with alternate exception handling; the C binding for alternate exception handling has languished in TS 18661-5 with little enthusiasm from WG14 for adopting it.

Also note that “undefined behavior” explicitly means that LLVM can choose to define whatever semantics it wants for the behavior.

As for the meat of the proposal, some thoughts:

I am of the opinion that the optimizer needs to be taught about the floating-point environment, so that it can understand the various flags when they have constant values and how to optimize them. This doesn’t mean supporting all of the hardware flags (I’m not proposing to support, say, the MIPS pre-2008 NaN handling bit, or x87 precision control), but those hardware flags that are reasonably common should be known to the compiler. Which is to say, knowledge of the current dynamic rounding mode, the status of trap-on-exception bits, and the FTZ/DAZ bits [note that existing FTZ/DAZ support going in the opposite direction from strictfp handling makes the current compiler behavior of this feature somewhat unstable].

We can make the behavior of the existing constrained intrinsics more fine-grained in terms of how they interact with the dynamic floating-point environment, and especially in the case where the environment is partially known (dominated by a fesetmode, for example), it should be possible to make their use less of an optimization barrier. There have been some suggestions in the past that we can ease some of the issues around constrained intrinsics, particularly with regard to target-specific intrinsics and libm functions, by using operand bundles; I do want to play around with this, but I haven’t had the time yet to do so.
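
As a sketch of the “partially known environment” case (the kernel is invented; fesetenv and FE_DFL_ENV are standard C99): once the optimizer knows a region is dominated by a reset to the default environment, it could prove that traps are masked and rounding is to-nearest there, and relax the ordering of constrained operations in that region.

#include <fenv.h>

double dot(const double *a, const double *b, int n) {
  #pragma STDC FENV_ACCESS ON
  fesetenv(FE_DFL_ENV);          /* default environment: traps masked, round-to-nearest */
  double s = 0.0;
  for (int i = 0; i < n; ++i)    /* this loop is dominated by the fesetenv above */
    s += a[i] * b[i];
  return s;
}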

Any implied twiddling of the actual floating-point environment I strongly believe should be done on a lexically-scoped basis, akin to how C2x specifies FENV_ROUND should work, rather than a “let’s just enable it once at startup” like how -ffast-math enables FTZ/DAZ bits.
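
For comparison, a sketch of that lexically scoped model using the C2x FENV_ROUND pragma (compiler support for this pragma still varies): the rounding change applies only to the enclosing block instead of being flipped globally at program startup.

#include <fenv.h>

double upward_sum(double a, double b) {
  #pragma STDC FENV_ROUND FE_UPWARD  /* applies only to this compound statement */
  return a + b;                      /* evaluated with round-toward-positive-infinity */
}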

The standard clarifies the meaning of control modes in footnote 12 of 5.1.2.3 Program execution:

12) The IEC 60559 standard for binary floating-point arithmetic requires certain user-accessible status flags and control modes. Floating-point operations implicitly set the status flags; modes affect result values of floating-point operations.

Trapping does not affect the result values of floating-point operations, so according to this note it is not a control mode. But footnote F.8p1 can indeed be understood as saying that trapping is a control mode. The standard is ambiguous on this point.

Thank you for the reference. It looks like this alternate exception handling is quite complex and tries to solve a different set of problems. It is more about setting up exception handling for small pieces of code, whereas the use cases mentioned above need traps enabled for the entire application or for large pieces of code, including functions called from that code.

From this viewpoint, in addition to -ftrapping-math it would be profitable to have more fine-grained options like -ftrap-on-overflow etc., so that the compiler has more opportunities for optimization.

It is definitely worth doing, but for this particular problem it is not the best solution. The constrained intrinsics still have side effects, and that would reduce performance. A much simpler and more effective solution is to add trapping to the default FP environment when -ftrapping-math is specified.

Trapping is not a side effect, or it is a side effect of a different nature than changing control modes. Unless the trap handler finishes by resuming with a modified instruction result, the changes to the execution environment are not observable in the interrupted code.

Question

Can we consider the effect of -ftrapping-math as adding trap-awareness to the default control modes?

Such a solution could support an important use case without substantial effort and would not disturb existing functionality. It could also enable some optimizations that are not possible now (like -ffp-exception-behavior=strict -fno-trapping-math).

@spavloff, we discussed this in 2019 ([cfe-dev] [llvm-dev] Floating point operations with specific rounding and exception properties) and I still agree with you. Without a performant implementation, trap-safe code generation is of little use besides checking a box on a feature list.

The team I’m currently on has fairly frequent requests to support trap-safe code generation. But unfortunately, I am not convinced that the constrained intrinsics will perform without a Herculean effort. It would be much faster and easier to attack the problem from a different angle.

P.S. note that I’m assuming the constrained intrinsics are not performing well. I haven’t benchmarked these in many years, so please shout if I’m mistaken.

I think there are two questions here: the semantic question, and the IR representation question.

The semantic question is, is there currently some way to represent the semantics you want? Constrained FP currently has three forms of trapping specification: “fpexcept.ignore”, “fpexcept.maytrap”, “fpexcept.strict”; does one of those match “-ftrapping-math”? Is there some other optimization forbidden by using constrained FP?

The representation question is, should we mess with Instruction::FAdd to add support for some subset of constrained FP semantics? This is purely a question about LLVM-internal datastructures, and how we structure checks for whether a given transform is legal.

I don’t think the two questions belong together in the same thread. Mixing the two together is guaranteed to just confuse everyone.

Apologies in advance for conflating your two questions, @efriedma-quic, but it’s inherent to the problem. Hopefully this explains why…

The semantic question is, is there currently some way to represent the semantics you want? Constrained FP currently has three forms of trapping specification: “fpexcept.ignore”, “fpexcept.maytrap”, “fpexcept.strict”; does one of those match “-ftrapping-math”?

“fpexcept.maytrap” is the match. @spavloff has his own ideas about this, but I suspect that our goals are the same. That is, fast code generation that doesn’t introduce traps.

Is there some other optimization forbidden by using constrained FP?

It’s not that optimizations are forbidden by using constrained FP. It’s that optimizing constrained FP intrinsics requires every optimization to be updated to support them.

There are only a handful of optimizations that aren’t trap-safe. I’m suggesting that rather than teach every optimization about the constrained FP intrinsics, we disable the unsafe optimizations for these operations instead (this assumes the constrained FP intrinsics are replaced with ordinary FP instructions in the IR).

Let me reframe the problem. We have two endpoints on a line:

A) 100% trap-safe code generation. I.e. no optimizations at all, so < -O0.
B) 0% trap-safe code generation. I.e. -Ofast.

Our goal is to find the point on that line that is trap-safe and performs optimally. The constrained intrinsics are starting the search from endpoint A. But a little intuition shows that the goal is not too far from endpoint B. Metaphorically, we’re traveling all the way around Earth to get to our neighbor’s house. Can this be done (constrained FP intrinsics)? It can. But it’s not the best way to go about solving the problem.

Starting the discussion from “fpexcept.maytrap generates slow code” is much different from starting with “clang is broken because LLVM allows hoisting Instruction::FAdd”, even if the proposed solution is essentially the same. Instead of actually discussing the existing infrastructure and the proposed new infrastructure, the discussion started with the meaning of Annex F.


Without further analysis, I’m a bit worried we start going to “the neighbor’s house”, but find the distance is more than we thought. For example, in SelectionDAG, non-STRICT ops don’t have a chain operand; how do we deal with that?

Without further analysis, I’m a bit worried we start going to “the neighbor’s house”, but find the distance is more than we thought. For example, in SelectionDAG, non-STRICT ops don’t have a chain operand; how do we deal with that?

It’s important to keep in mind that GNU’s -ftrapping-math isn’t “full FPEnv support down to the instruction level, including FENV_ACCESS support”. Rather it’s “optimizations performed don’t introduce traps”.

At least at first, we don’t need to tag an operation so that optimizations won’t move it. We should rather teach unsafe optimizations not to touch FP operations. E.g. make sure that MachineLICM leaves FP operations alone.

I can’t speak for everyone’s workflows, but fpexcept.strict doesn’t have a lot of value in HPC. Users won’t tolerate < -O0 performance in exchange for trap support. That’s why I’m proposing we focus on fpexcept.maytrap, with fpexcept.strict being a nice to have. This is more pragmatic and in line with GNU’s solution.

The main problem is that the constrained intrinsics always have side effects. This restricts optimizations, and the performance of such code is lower than that of code compiled with the default FP mode. Users expect no performance loss, because running with traps enabled requires only setting some bits in a control register.

The side effects prevent undesired instruction movement; for example, they keep the instruction that sets the rounding mode before the instruction that uses that mode. In other words, the constrained intrinsics are a tool for ordering accesses to the FP environment. If a function does not change the FP environment and does not read status bits, it does not need constrained intrinsics; it simply does not need ordering. Running code with traps enabled is exactly this case.
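
A small sketch of the ordering problem these side effects solve (nothing here is specific to the proposal): without some ordering mechanism, nothing prevents the addition from being scheduled before the fesetround call.

#include <fenv.h>

double add_rounded_up(double a, double b) {
  #pragma STDC FENV_ACCESS ON
  fesetround(FE_UPWARD);        /* must execute before the addition below */
  double r = a + b;             /* the side effect of the constrained fadd keeps it here */
  fesetround(FE_TONEAREST);
  return r;
}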

On the other hand, it is profitable to have orthogonal tools for different tasks; they can be combined in various ways and provide a flexible toolset. If support for trapping math were separate from the constrained intrinsics, we could have combinations like “fast-math” + “trapping” for more reliable fast-math calculations, or “rounding-math” + “no-trapping” to better optimize code.

No, we do not need them if FPE ordering is not required. As for the other things that are currently bound to the constrained intrinsics (like support for SNaNs), they should be represented as separate features, just like trapping.

Preventing reordering of FP instructions is crucial when running with traps on.

The issue that got me involved in this was the case where an optimized “(isnan(x) ? 0 : (int)x)” would, when given a SNaN, trap because the FP-to-int conversion was hoisted and done unconditionally.
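
A concrete version of that pattern, as a sketch (the function name is mine): if the conversion is hoisted above the check and executed unconditionally, a NaN input raises Invalid and, with traps enabled, delivers SIGFPE, even though the source never converts a NaN.

#include <math.h>

int guarded_convert(float x) {
  /* Source semantics: the conversion is reached only when x is not a NaN.
     If the compiler speculates (int)x above the isnan() check, a NaN input
     raises the Invalid exception unconditionally. */
  return isnan(x) ? 0 : (int)x;
}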

Without strict ordering there’s nothing to prevent an FP instruction from being sunk below the call that enables traps, or being hoisted above a call that disables traps, or some other movement that results in unexpected traps.

There is no third way for use with traps on. Either the constrained intrinsics (or some equivalent replacement) are used, or they aren’t. Traps on requires strict ordering, and today that requires the constrained intrinsics.

Is anyone besides me working on the optimizers? Except I’m stuck trying to get the IR Verifier changes landed.

It does seem to me that -ftrapping-math, in order to be correct, must significantly reduce optimization capability by restricting the ability to reorder floating-point math.

But, GCC doesn’t seem to restrict reordering in the way I would expect. E.g., this code, built by GCC, moves the multiplies before the call. That’s not correct if the multiply could trap, since the call to g() might exit, or disable traps. (Note that GCC has -ftrapping-math on by default). Is this just a GCC bug, or does their idea of -ftrapping-math mean something much weaker than it seems like it should? (And if so: what?)

void g();
float a[2], b[2], res[2];
void f() {
  float a1 = a[0], a2 = a[1];
  float b1 = b[0], b2 = b[1];
  g();
  res[0] = a1*b1;
  res[1] = a2*b2;
}

with gcc -O3 -o - -S test.c gives (trimmed)

	movq	a(%rip), %xmm0
	movq	b(%rip), %xmm1
	mulps	%xmm1, %xmm0    ;; !!!!!!
	movlps	%xmm0, 8(%rsp)
	call	g@PLT
	movq	8(%rsp), %xmm0
	movlps	%xmm0, res(%rip)

Is this just a GCC bug, or does their idea of -ftrapping-math mean something much weaker than it seems like it should? (And if so: what?)

Yes! Now we’re on the right track…

I’ve never worked inside GNU, so I can’t give you a definitive answer. But this is the de facto standard in HPC. -ftrapping-math guarantees that traps will not be introduced by compiler optimizations. There’s no guarantee that enabling or disabling traps at the source code level will be semantically correct. gfortran traditionally controls whether traps are masked or not through -ffpe-trap=list, which I believe is a link-time option (I might be wrong about this part).

Also, if I understand correctly, there is a desire for a stronger -ftrapping-math in GNU that would prevent reordering. I am not aware if that was ever implemented. I assume it has not been since they face the same problems we’re discussing in this thread.

That all said, -ftrapping-math gives smoking fast performance. And that’s what HPC users want. I.e. no compiler introduced traps and fast performance. If you’d like to convince yourself of this, try compiling your favorite benchmark with -ftrapping-math and measure the performance. Dollars to donuts you’ll see a stark difference over a strict/-O0ish implementation.

Ah, I’ve overlooked an important detail in the last comment. It’s also acceptable for -ftrapping-math to remove existing traps from the source code.

But it hasn’t guaranteed that, in my example! The compiler optimization has moved the multiply which may trap, in front of the call which may exit the program.

This is a different use case. The case you mention involves setting FP traps and making FP calculations in the same function, something like:

void run() {
  enable_fp_traps();
  /* ... do some calculations here ... */
  disable_fp_traps();
}

In this case the functions {enable,disable}_fp_traps change the FP environment, so the block where they are called must be compiled with #pragma STDC FENV_ACCESS ON. This is not related to the implementation of trapping-math; it is a requirement of the C standard.

The case I am interested in is like:

void run() {
  enable_fp_traps();
  do_calculations();
  disable_fp_traps();
}

The body of run must also have FENV_ACCESS for the same reason. But do_calculations and the functions called from it do not access the FPE and do not need access ordering. So it should be possible to compile them without strict ordering, but with trap-awareness.
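
A minimal sketch of the intended split (feenableexcept and fedisableexcept are glibc extensions, used here only as one possible implementation of {enable,disable}_fp_traps): this translation unit manipulates the FP environment and is compiled with FENV_ACCESS, while do_calculations is defined elsewhere and compiled with just -O3 -ftrapping-math, i.e. without strict exception ordering.

#define _GNU_SOURCE
#include <fenv.h>

void do_calculations(void);    /* defined in another TU; does not touch the FP environment */

#pragma STDC FENV_ACCESS ON
void run(void) {
  feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);   /* enable_fp_traps() */
  do_calculations();
  fedisableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);  /* disable_fp_traps() */
}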

Why is such a restriction required if the function does not access the FPE?

If you mean that traps in g() must be observed prior to the traps in the multiplications, then this is a separate use case. In this case the user needs code in which traps occur in the same order as specified by the statements in the source file. As the traps are triggered by changes in the FP exception bits, this means that FP exceptions should be raised according to source order. It is equivalent to code where each FP operation and function call is followed by fetestexcept. This is the responsibility of strict exception handling.

If the user is not interested in the exact location of the trapping instruction and the mere fact of an exception is enough, strict order is not required. It does not matter whether the calculation is interrupted by the division by zero in g() or by the overflow in res[0] = a1*b1; both are errors.

GCC currently does not support strict exception handling, so -ftrapping-math in GCC is implemented for the second use case. Clang could support both.

That gave me a chuckle and is horrible for my argument, but I still stand by my assessment. :slight_smile:

Again, I have no insight into how GNU makes their decisions, but handed down knowledge and lots of intuition tells me that this is acceptable in practice. The trap existed in the source code. The compiler didn’t create it. Whether optimizations expose the trap or not is unfortunate, but it’s not the result of some local optimization (e.g. reassociation or vector widening) that spontaneously created a trap that didn’t exist in the source code.

I admit that my argument is super weak since there’s no formal specification for -ftrapping-math, but it is a pragmatic solution. It’s proved itself useful over decades. We shouldn’t write it off because it is poorly (at least to us) defined.

No, I mean g exits the program! E.g.

void g() { exit(0); }

The correct behavior of calling f(), then, is to exit cleanly with exit code zero.

However, if floating-point traps are enabled and the compiler moves the multiply before the call, then the behavior will instead be “raise SIGFPE”, which is incorrect. The optimizer has changed the behavior of the function.

No. It’s about whether the compiler can reorder the code like (in pseudocode):

Original code:

if isnan(x)
  y = 0
else
  y = float_to_int(x)

After reordering:

y1 = 0
y2 = float_to_int(x)
if isnan(x)
  y = y1
else
  y = y2

This optimization is only permissible without changing program semantics if float_to_int doesn’t have any side effect. If float_to_int can raise a floating-point trap, we have changed the program behavior from returning 0 if x is NaN, to raising a trap if x is NaN.

The compiler may reorder calculations if it is allowed to do so. If the default control modes are in effect, FP operations do not have side effects and may be reordered. Compare with:

void g();
int a[2], b[2], res[2];
void f() {
  int a1 = a[0], a2 = a[1];
  int b1 = b[0], b2 = b[1];
  g();
  res[0] = a1/b1;
  res[1] = a2/b2;
}

the compiler puts the division before the function call: https://godbolt.org/z/EbcdsEPxz and it is not considered a problem. Why can FP arithmetic not have such a mode?

That is a truly excellent question!

I’d say that appears to be a serious bug. While invoking integer division-by-zero is undefined behavior, it’s not undefined behavior to have unexecuted division-by-zero in your program. Thus, I don’t see a valid justification for this reordering.

And yet…it’s worked like this for the entire lifetime of LLVM. So…huh.