[FP] Constant folding of floating-point operations

I opened an issue yesterday (powi(x, n) is constant folded as if it were pow(x, (double)n) · Issue #65088 · llvm/llvm-project · GitHub) about the way that llvm.powi() is constant folded, and today I’ve been working on a patch to fix it. Unfortunately, as I’ve been trying to wrap my head around the correct way to do that, I’ve discovered that I’m pulling on a thread that may unravel a lot of related things. So, I’d like to start a general discussion about constant folding of floating-point operations.

@AaronBallman opened an issue in May (Incorrect constant folding behavior of floating-point operations · Issue #62479 · llvm/llvm-project · GitHub) on this same topic, but the discussion there was centered on interpretation of the C and C++ language standards as they relate to evaluating constant expressions. That’s definitely relevant to what I’m bringing up here, but I’d like to shift the focus to rules for constant folding with regard to the LLVM IR language definition.

I was somewhat dismayed by the claim in issue 62479 that the C and C++ language standards don’t require that all calls to math library functions will return the same result for a given input. I can’t find any such requirement. I also can’t find anything in the LLVM IR language reference that explicitly states such a requirement. However, I am certain that users universally expect this to be the case, and in the case of LLVM IR, I think this behavior is implied by the existence of the ‘afn’ fast-math flag. That this flag grants permission to return an approximate result implies that the absence of the flag must require that the result not be approximated. To me that means that for a given input value, all calls to the function will return the same result. You can’t not approximate something that doesn’t return a consistent value.

There is a problem, however: when the LLVM optimizer sees a known math library function called with constant arguments, it will constant fold the call by invoking that function at compile time. This means my program may produce different numeric results depending on whether or not the compiler is able to deduce that the argument is constant. So, for example, my results may change based on a decision the compiler makes about function inlining. This seems to violate the very basic rule that the optimizer must respect program semantics.
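For concreteness, here is a minimal sketch of the inlining hazard I'm describing (the wrapper function and the choice of exp() are illustrative, not taken from either issue):

#include <math.h>
#include <stdio.h>

/* Whether call_exp(2.0) is folded at compile time depends on whether the
   optimizer inlines call_exp and propagates the constant; if it does, the
   value comes from whatever exp() the compiler evaluates at build time,
   otherwise from the libm the program links against at run time. */
static double call_exp(double x) {
  return exp(x);  /* exp is typically not correctly rounded */
}

int main(void) {
  double maybe_folded = call_exp(2.0);  /* may be folded after inlining */
  volatile double two = 2.0;            /* volatile blocks constant propagation */
  double at_runtime = call_exp(two);    /* evaluated by the linked libm */
  printf("%a\n%a\n", maybe_folded, at_runtime);  /* could differ in the last bit */
  return 0;
}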

During the discussion in issue 62479, there was an emphasis on the impact of constant folding for cross-compilation. However, the problem isn’t necessarily limited to cross-compilation. It can happen any time the program being compiled is linked against a different math library than the compiler itself was linked against. For instance, if the compiler is linked against an LLVM math library, and the program being compiled is linked against a GNU math library, there is a potential for this problem to arise. Possibly even different versions of a library from the same vendor could cause it.

I know of at least two other cases where this can be a problem even if the compiler and the program being compiled are both using the same math library. One is the llvm.powi() function as I described in issue 65088. The other has to do with FMA formation. Since I’ve already described the powi() issue in 65088, I’ll describe only the FMA issue here.

Consider the following code:

double f(double x, double y, double z) {
  return x * y + z;
}

If I compile this program using 'clang -O2 -ffp-contract=on', clang will produce this IR:

define double @f(double %x, double %y, double %z) {
entry:
  %0 = tail call double @llvm.fmuladd.f64(double %x, double %y, double %z)
  ret double %0
}

The intrinsic in this case gives the compiler permission to fuse the operations “if the code generator determines that (a) the target instruction set has support for a fused operation, and (b) that the fused operation is more efficient than the equivalent, separate pair of mul and add instructions.” However, the LLVM constant folder will evaluate this operation as a fused multiply and add even if I’m compiling for a target that doesn’t support fused operations. So, once again, the result can depend on earlier optimization decisions made by the compiler. What’s worse (to me at least) is that if I compile with -ffp-contract=fast instead of -ffp-contract=on, the constant folder doesn’t use fused operations, even if the target does support them.
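To make the stakes concrete, here is a small example (mine, not from the issue) of an input where the fused and unfused evaluations of x * y + z differ; build it with -ffp-contract=off so the first expression is not itself contracted:

#include <math.h>
#include <stdio.h>

/* With eps = 2^-52, the product (1 + eps) * (1 - eps) = 1 - eps^2 rounds
   to 1.0 in double, so the separate mul+add yields exactly 0.0, while a
   true fused multiply-add returns -eps^2 = -2^-104. */
int main(void) {
  volatile double eps = 0x1p-52;          /* volatile defeats constant folding */
  double x = 1.0 + eps, y = 1.0 - eps, z = -1.0;
  printf("unfused: %a\n", x * y + z);     /* 0x0p+0 */
  printf("fused:   %a\n", fma(x, y, z));  /* -0x1p-104 */
  return 0;
}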

In the cases of the llvm.fmuladd and llvm.powi intrinsics, our current definitions leave some room for this variation in behavior, but I’d like to suggest that this is a problem in those definitions.

My basic proposal here is that the LLVM IR language definition should assume that all floating point operations, including math library functions and math intrinsics, should produce consistent and reproducible results that are independent of compiler optimizations unless fast-math flags permit otherwise.

Opinions?

I’m unsure why calls are special. That said, I doubt we want to require “consistent reproducible” for approximate optimizations (but we want to stay deterministic), since I don’t see how we could guarantee that even if we wanted to.
Finally, I believe we can try to be more consistent, best effort!

I agree that calls shouldn’t be special. But we’re treating them differently today.

I’m not sure we have the same understanding of “consistent reproducible” results. What I’m trying to say is that no optimization should be allowed to change the value to which an operation would evaluate at runtime unless there is something in the IR explicitly giving that permission. So, for example, if I see a call to cos(4.0), in the absence of the ‘afn’ flag, I can’t replace it with a constant value unless I can guarantee that the value I am replacing it with is the value that the cos(4.0) call would return at runtime.

I should clarify that when I say calls shouldn’t be special, I mean that they shouldn’t have different rules with regard to value-changing optimizations. The reason that this problem is seen with calls and generally not with other FP operations is that the basic operations (aside from FMA) produce correctly-rounded results which are consistent across all implementations. The constant folding issues only occur for operations that don’t produce correctly rounded results in all implementations.

In theory, we could see this same problem if we constant folded a division operation on a platform (such as OpenCL with single-precision values) where division wasn’t required to produce correctly rounded results. I don’t know if users expect consistent results in that case. Those users tend to be more concerned with performance. Also, in that case, the fdiv instruction should have the fpmath metadata. If inconsistent results in that case aren’t acceptable, that would imply that metadata isn’t a suitable way to describe the permitted lower accuracy.

On an architecture for which there is a RoundingMode specifier in some control register (almost all of them), “x+y×z” will/may have different results under different rounding control. Changing x or y or z into a constant does not alter the fact that the calculation “x+y×z” delivers different results from the same operands under different rounding modes. Rounding Mode is a silent (and implied) operand that does alter the result.

In that case, you need to use constrained intrinsics for the operations. Except when constrained intrinsics are used, LLVM IR assumes the default floating-point environment (i.e. “Results assume the round-to-nearest rounding mode.”). I believe we only constant fold constrained intrinsic operations if APFloat reports that the result is exact.
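As a sketch of what that looks like from the C side (assuming a clang recent enough to honor the pragma), enabling FENV_ACCESS is what causes the front end to emit constrained intrinsics instead of ordinary FP instructions:

#include <fenv.h>

#pragma STDC FENV_ACCESS ON   /* the FP environment may be read or modified */

double sum_upward(double a, double b) {
  int saved = fegetround();
  fesetround(FE_UPWARD);      /* dynamic rounding-mode change */
  double r = a + b;           /* expected to lower to a constrained fadd, so it
                                 is not folded assuming round-to-nearest */
  fesetround(saved);
  return r;
}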

But all results are approximate, so it isn’t particularly clear what this means. The best you can hope for is correctly rounded, which is still an approximation and provided by very few real libm implementations for any of the difficult functions.

There certainly is no requirement. LLVM, C, and C++ barely have defined floating-point requirements, much less values for specific functions. OpenCL at least gives ULP tolerances (which also do not need to be consistent).

I don’t see this as an issue. There’s a broader question of what constitutes the “result” of the operation. In the presence of any kind of folding combine (e.g. any use of the contract or reassoc flag), the apparent result can change depending on the use context beyond constant folding.

Kind of the same thing, and I do think this is a problem. It’s been a long standing aspirational issue to have consistent constant folding (which then doesn’t necessarily match the target library behavior), just internally consistent in the compiler. Ideally we would just use MPFR or equivalent for this.

This is definitely fine. The point is to produce whatever is fastest, and constant folding as fma is perfectly fast (and also gives the better result). “If the target supports them” is meaningless. The compiler could always choose to implement fma in software to implement fmuladd. This is defined operationally and should not differ based on target codegen preference. If the target had some pipelining issue it could resolve by selectively emitting separate fmul+add or an fma, that should also be fine. If you want a specific behavior, you can choose to not use fmuladd.

Practically speaking, this proposal is “do no constant folding”, which will not go over well. I would love it if we had perfect host-independent constant folding, but that’s a lot of work that nobody has seriously thought about undertaking.

@arsenm Yes, most functions are returning an approximation of some kind. To be honest, I don’t really like the afn flag because it doesn’t say how accurate the approximation needs to be. But that’s a different problem. What I take it to mean is that if the flag isn’t set, the compiler isn’t allowed to substitute another means of performing the operation for the function call. For instance, this often blocks vectorization because vector implementations of a function don’t generally yield the same answer as the scalar call. This is inconvenient, but I think it’s necessary.

What I’m proposing doesn’t really amount to “do no constant folding.” It’s just “don’t constant fold function calls unless the answer is exact.” That could be a bit of a performance hit, but I’m not sure how big it is. I would support a specific option to allow function calls to be constant folded. I just don’t think we should take that liberty by default.

I am also going to claim that the optimizer does currently assume that the math intrinsics always produce the same result when called with the same value. This can be seen in the fact that we CSE calls to these functions. I think math library calls aren’t CSE’d in many cases because they may have side effects (specifically, setting errno), but otherwise, I think the same assumption would hold.
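A minimal illustration of that CSE assumption (compile with -fno-math-errno so the calls are treated as side-effect free; the function name is mine):

#include <math.h>

/* The optimizer will typically combine the two cos() calls into one, which
   only makes sense if cos is assumed to return the same value for the same
   input every time it is called. */
double g(double x) {
  return cos(x) + cos(x);
}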

With regard to consistent results, I realize that not everyone needs this, but it’s my understanding that the default rules for floating-point optimization are designed to provide consistent results. And, as far as I know, every optimization except constant folding supports this behavior.

Let me say again that my concern here is for consistency, not absolute accuracy. My experience is that users who care about the accuracy of floating-point operations have to have some way of validating their programs. Typically, the initial validation process is quite lengthy and then subsequent system changes (new versions, new compilers, new options sets, etc.) are validated by comparison against a previously validated reference implementation. If the new version has too much variation compared to the reference implementation, they have to go through the full validation process originally used. This is true even if the new version is theoretically providing more accurate answers than the reference version, because proving that is impractical. Obviously not all, and probably not most, users care about this, but it’s a usage model that we need to support.

This is true, but as you say such transformations require the use of the contract or reassoc flag. What I’m asking for here is that potentially value-changing constant folding be held to the same standard – it can’t happen without a flag saying it can happen.

Thanks for the link to issue 8565. That is absolutely an instance of what I’m talking about. I disagree that MPFR (or equivalent) is the solution, unless the customer is also using a library that provides correctly rounded results for run-time evaluated function calls, and once again that potentially introduces performance issues, and probably more significant ones than not constant folding function calls would.

I’m a bit more sympathetic in this case because the fmuladd intrinsic results from optimization settings (though the default ones in the case of clang host-compilation), but I still don’t think it’s right for the compiler to produce one result if the expression is constant folded and a different result if it is not. That’s my concern.

That I find very reasonable. I misunderstood the initial post apparently.

Simple CSE is allowed for functions with a non-deterministic result. As an extreme example, consider “freeze i32 poison”; the result is completely non-deterministic. You can CSE two “freeze i32 poison” into one operation; since both instructions could produce the same result, merging them only restricts the possible results.

The relevant restriction for functions with a non-deterministic result is that you can’t split the instruction: for example, sinking the instruction into a loop is illegal.

(I don’t think this negates the rest of your argument, but just wanted to point that out to avoid any future confusion.)

[quote=“arsenm, post:7, topic:73138”] But all results are approximate, so it isn’t particularly clear what this means. The best you can hope for is correctly rounded,
[/quote]

1.0+1.0 is not approximate, it is exact. You only round when you have an inexact result.

[quote=“arsenm, post:7, topic:73138”]There certainly is no requirement. LLVM, C, and C++ barely have defined floating-point requirements, much less values for specific functions.
[/quote]

At some point it becomes a quality of implementation issue.

You already mentioned that the results may differ even between different versions of the same runtime implementation. This pretty much means that such a “guarantee” is nigh impossible to provide for a lot of functions (or, taken to the extreme, plain impossible): the “guarantee” amounts to bug-for-bug compatibility with load-compatible libraries “from the future”.

In this way, functions are intrinsically special, and it sounds like you may be proposing to ban constant folding of function calls except where a flag is present.

Such a change to the IR semantics may warrant IR migration concerns: old IR that did not need to waive this “guarantee” explicitly should still be treated as waiving this “guarantee”.

My point was given the range of readily available implementation choices, it’s the same thing. We both do not have a method for computing the exact result, and we also do not necessarily know the underlying libm implementation will produce the exact result for the same value (e.g. denormal flushing enabled but not mandatory, and the implementation happened to not flush). Relying on the host functions as we do now, we can probably only get away with constant folding sqrt and nothing else.

I think the constant folding case is the only real situation (excluding FMF) where returning different results happens. I think if we produce a result that’s as good, it’s what most users would want. There are methods for defeating constant folding (e.g. volatile), and we could improve on that (e.g. don’t allow constant folding of llvm.arithmetic.fence or similar).

I was also recently thinking about interpreting the contract flag as allowing constant folding in a higher precision. It’s not a great overloaded interpretation, but better than requiring afn for simple constant folding.

Is there any distinction between wanting this level of consistency and using nobuiltin? You’d lose some other optimizations on it, but is that an issue if you’re aiming for exactly matching the underlying function behavior?

In an ideal world we would always provide the complete implementation of the math functions, and have that implementation available to exactly constant fold. A single software implementation that produces correctly rounded results (e.g. MPFR) would be more likely to be implemented in the real world.
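For reference, a rough sketch of what an MPFR-based, host-independent fold of a libm call could look like (this is not what LLVM does today; it is just the shape of the idea, and the helper name is mine):

#include <mpfr.h>

/* Correctly rounded cos() for double, independent of the host libm.
   Requires linking against MPFR/GMP (-lmpfr -lgmp). */
double fold_cos(double x) {
  mpfr_t t;
  mpfr_init2(t, 53);             /* 53 bits = double precision */
  mpfr_set_d(t, x, MPFR_RNDN);   /* exact: x is already a double */
  mpfr_cos(t, t, MPFR_RNDN);     /* correctly rounded to 53 bits */
  double r = mpfr_get_d(t, MPFR_RNDN);
  mpfr_clear(t);
  return r;
}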

You sort of get this today with the AMDGPU functions, since we link in the full bodies of every function. With the exception of a few currently unhandled target intrinsics, most of these functions exactly constant fold as a standard optimization. Hypothetically we could just ship the bitcode implementations of the functions in the compiler and interpret them for a more standard library linking scheme.


Consider the following code:

#include <fenv.h>

int    RM[]   = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
int    RMmax  = 4;
double table[32]  = { /* list of constant values */ };
double table1[32] = { /* list of other constant values */ };
double result[4][32];

for (int i = 0; i < RMmax; i++) {
  fesetround(RM[i]);
  for (int j = 0; j < 32; j++)
    result[i][j] = table[j] + table1[j];
}

If the sums table[j]+table1[j] are all exact, then you are allowed to do all of this at compile time without knowing the rounding mode, leaving only the result array and removing the loops altogether.

If the sums are inexact, then you can only perform them at run time, unless you can model the rounding mode during compile-time constant evaluation.
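A tiny worked example of that distinction (the values are chosen here purely for illustration):

double inexact_sum = 0.1 + 0.2;   /* the exact sum of the two nearest doubles is
                                     not representable, so the folded value depends
                                     on the rounding mode: 0x1.3333333333334p-2 to
                                     nearest vs. 0x1.3333333333333p-2 toward zero */
double exact_sum   = 0.5 + 0.25;  /* exactly 0.75 under every rounding mode */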

Yes, that is what I’m proposing. You would still be able to fold function calls that return exact results. If you don’t have this rule, you can’t get results that are independent of optimization levels. I think it’s a reasonable rule. MSVC and ICC seem to follow this rule. I know no one cares about ICC, but I mention it because I can tell you that, in the case of ICC, constant folding of function calls is not allowed for exactly the reasons I am presenting in this thread.

I wouldn’t call this a change to the IR semantics. I’m not aware of anything in the IR semantics that says function calls can be folded. The fact that the optimizer currently does this is an implementation detail and, in my opinion, a bug. There’s no guarantee that IR will be optimized in the same way that it was with previous versions of the compiler, so loading old IR and not constant folding function calls would be perfectly fine.

No, I suppose nobuiltin does produce the behavior I’m suggesting here with regard to constant folding (except that it also prevents the very limited number of cases where we could prove that the result is exact). I don’t know what other effects that would have.


That’s different because the LLVM language reference explicitly says that the default rounding mode is assumed. The C language standard also says that the user can’t change the rounding mode (such as by calling fesetround()) unless FENV_ACCESS is enabled. When FENV_ACCESS is enabled (and for any non-C front end that wants to allow dynamic rounding mode changes) the constrained floating point intrinsics should be used.

[quote]
My basic proposal here is that the LLVM IR language definition should assume that all floating point operations, including math library functions and math intrinsics, should produce consistent and reproducible results that are independent of compiler optimizations unless fast-math flags permit otherwise.
[/quote]

Strict application of this rule is, I think, impracticable. There are already cases we know about where we’ve decided to accept deviations between constant evaluation and hardware evaluation, namely things like the x87 extra precision issue or differences in NaN payload. Or, as you allude to earlier, there’s a possibility that basic division and sqrt are implemented to much lower precision (e.g., OpenCL) than the correctly-rounded version provided by APFloat. I know these issues aren’t what you’re thinking about with this proposal, but these issues do give me pause to endorse this kind of proposal.

Not that it solves the actual issue at hand, but I think the better starting point is to have LLVM IR use IEEE 754 semantics (insofar as IEEE 754 fully specifies it), with an exception for the Table 9.1 math functions that are rarely implemented to the mandated correct rounding. For these math functions in particular, we would instead have to adopt a more conservative approach. (The Table 9.1 math functions are the trigonometric, exp, log, and pow functions, which I think are the math functions you’re concerned about). Notably, this would mean that we would have no issue constant folding things like @llvm.sqrt or @llvm.ldexp or @llvm.round.

But it’s worth noting that there are still questions I have about how far we should be able to optimize calls to problematic functions:

  • Should sin(0.0) → 0.0 be permissible?
  • Should log(-1.0) → NaN be permissible?
  • Should sinpi(0.5) → 1.0 be permissible?
  • Should fmin(-0.0, 0.0) → -0.0 be permissible?
  • Should cos(fneg x) → cos(x) be permissible [i.e., can the optimizer rely on evenness/oddness of functions]?

The last question in particular is interesting, since I’m aware of a user who complained about gcc having an oddness assumption on a function whose implementation was not in fact odd.

Well, x87 is kind of a curse, and for the most part it doesn’t get used at runtime on modern processors, but strictly speaking, when compiling for targets with x87 and not SSE, we probably should consider the excess precision when constant folding. We do have ways to control the excess precision to get consistent results for anyone who needs that.

IEEE-754 says that changing the payload of a quiet NaN preserves the literal meaning of the source, and since LLVM IR assumes, by default, that all NaNs are quiet, I think that case is OK too.

To some extent OpenCL is going to provide inherently inconsistent results, since the target and therefore the math library implementation is typically unknown at compile time, but in the cases I referenced there should be explicit IR indicators that single-precision divide and square root are not required to be exact so in that case I would argue that the user has opted in to inconsistent results. Certainly anyone using that option favors speed over accuracy.

I’m not sure what you mean here. Can you elaborate on what IEEE 754 semantics we aren’t currently using?

Without going through these one-by-one, I would fall back on the question of whether we have any sort of guarantee that we know what result the called function will produce. For example, “sin(0.0) → 0.0” seems like a reasonable optimization, but I don’t think there is any guarantee that the call will return 0.0, so like reassociation it’s an algebraically correct optimization, but it isn’t necessarily value-safe.

I guess log(-1.0) is slightly more of a gray area, but assuming you accept my premise above that changing the payload of a NaN is OK, I think we could fold “log(-1.0) → NaN” in the case of the llvm.log intrinsic since log() should always return some NaN for negative numbers and the intrinsic (unlike the function call) doesn’t have side effects. It’s interesting that the current constant folding implementation doesn’t fold this case specifically because it is checking to see if an exception is raised.

This is one of many relics of pretending to try to handle strictfp before there was any kind of design or direction. It should just be fixed to fold. Similarly, we have three identical intrinsics for roundeven.