Implementing the constrained FP intrinsics for the x87

I’ve just recently handed in my PhD thesis on floating-point decision procedures and I’m now interested in going through the gymnastics of implementing x87 codegen support for the constrained FP intrinsics. However, I’m unsure where to start since I’ve never contributed to LLVM. Looking at the commit history, it looks like @arsenm and @spavloff might be the right people to talk to? Any advice would be appreciated. I’m very interested in seeing LLVM have proper support for floating-point.

We do support constrained FP for x87; see the lit tests:
grep -r llvm.experimental.constrained llvm/test/CodeGen/X86/ | cut -f 1 -d ':' | uniq | xargs grep X87

Ah, thank you. In which case, it seems that I should instead work on fixing the x87 constrained FP implementation. For instance, this test I extracted from llvm/test/CodeGen/X86/fp-strict-scalar.ll is incorrect: Compiler Explorer. The x87’s precision is not set appropriately, so this function may return a result which is not representable with f32. Additionally, if the inputs are large, the result may have an exponent greater than the maximum f32 exponent, since the x87 always uses a 15-bit exponent. Overcoming this is a little bit trickier. A simple way to do it is to store to memory and then reload the result.

The x87’s precision is a global setting (specified by the OS or a compiler option). It’s not semantically correct to change it for each arithmetic operation.
To store and then reload is the correct way, but I’m not sure it’s the desired behavior for constrained FP @andykaylor . We have an option, -fexcess-precision, which is intended to help with this. I’d consider it either orthogonal to constrained FP, or else constrained FP implicitly enables -fexcess-precision=standard.
So I think the actual problem is to implement -fexcess-precision=none/standard for x87.

I thought the objective of constrained FP was to provide IEEE 754-compliant behavior, including rounding-mode modification and exception handling? That includes producing compliant results, which that code does not ensure.

In any case, I don’t see how setting the precision at each operation is incorrect. Furthermore, store and reload doesn’t prevent double rounding, which is the core problem with using excess precision. Specifically, the excess precision in the 80-bit format does not ensure correctly rounded results for individual FP operations on smaller formats.

I’m considering this more from the angle of orthogonality. Might some users want both exception handling and fast excess precision? Note that the rounding mode can be non-dynamic under constrained FP, which usually means the user doesn’t care about it.

For example, suppose the OS or user sets the precision to 64 bits. Would you set it to 80 bits for an FP80 operation, or check the global setting before doing so? If you set it to 80 bits, you violate the user’s assumption, which is another example of the orthogonality.

I think excess precision just offers the user a choice between speed and exact results. Double rounding looks to me like a theoretical problem rather than a practical one. Furthermore, given that the CPU doesn’t do internal arithmetic in infinite precision, doesn’t that mean double rounding is occurring all the time?

LLVM currently provides no option that yields IEEE-compliant floating-point on the x87; it is noncompliant in all circumstances. Furthermore, double rounding is very much a practical problem. Any computation where data is passed between memory and the x87 FP stack may change behavior between compilations under the slightest perturbation. It’s only a non-problem if the inputs are conveniently distributed, which is far from guaranteed. More seriously, the same program may produce different numbers on different platforms. That is a practical problem for distributed computing, since you don’t want different nodes computing different values from the same data.

If a user wants fast execution, good numerical behavior, and full exception handling on an x87 unit, they should stick to using 80-bit FP numbers. At least then they will get consistent behavior. The current behavior with 32-bit and 64-bit floats causes branches to evaluate both true and false: Wrong optimization: instability of x87 floating-point results leads to nonsense · Issue #44218 · llvm/llvm-project · GitHub

The CPU is in fact computing exactly rounded results, “as if” to infinite precision. That is, it computes enough bits to round the result correctly. This is true on any IEEE 754-like platform, including the x87. It’s a core feature: compliant IEEE 754 implementations must produce the exact same results for the same computation (as long as it only involves +, -, *, /, fma, sqrt and mod). Therefore there is no double rounding until it is introduced by a faulty compiler.

If a user wants to perform a 64-bit op, they should convert their inputs to 64-bit floats. Note that setting the control word to a reduced precision truncates everything on the x87 stack to that precision (Intel Architecture Manual vol. 1, section 8.1.5.2) so there’s no advantage to not doing so.

If I recall correctly, to get correctly rounded results out of x87, you need to both mess with the control word, and store every result to the stack to force overflow to ±Inf. Or something like that; I don’t remember the details. Adding an option to do that seems reasonable. But this is mostly orthogonal to “constrained FP”, which has nothing to do with precision; it just means we allow users to touch the floating point rounding mode/exception flags.

If we document “if you use -fexcess-precision, code generated by clang will override the precision set in the FP control word”, or something like that, that seems sufficient. The only reason the user would mess with the precision globally in the first place is if they’re using a compiler that doesn’t know how to modify it as needed.

Right, it seems I got the wrong idea about constrained FP. In any case, I’m keen to make whatever changes are necessary to make this happen. It’s been a bit of a problem for Rust, which nominally provides IEEE 754-compliant floating-point operations. I just want to know where I should start looking. As I said, I haven’t yet done any work on LLVM. (Also, the bug I linked earlier could be fixed by telling LLVM to spill the full 80-bit result instead of storing the 64-bit result, but I have no idea where to start with that.)

That suffices for addition, subtraction, sqrt and mod, but I’m not sure it does for multiplication, division or fma. For instance, an f64 product could be smaller than the smallest normal f64 number, but still produce a normal product on the x87 because it has 4 extra exponent bits. It might (hopefully) be possible to show that double rounding is benign in that case? I’ll have to investigate.

The issue would be that the number is normal before rounding, but subnormal after rounding, so we end up double-rounding despite the markings? Hmm, I see what you mean, that’s probably an issue for double precision.

If you’re completely unfamiliar with LLVM, this might be a bit tricky to tackle… but start by grepping for FP32_TO_INT32_IN_MEM; that code messes with the control word in a similar way to what you need. You then want to do the same sort of lowering for the relevant ops.

To get reasonable performance, you might need to consider rewriting the way we do control word modifications; something similar to X86VZeroUpper.cpp.

I’m not sure it’s really that simple, but you can change the opcode used for spilling in getLoadStoreRegOpcode in X86InstrInfo.cpp. And then do something to mess with the spill slot size, which is defined in the .td file. (I think this is done with RegInfos?)

Yeah, you might end up with something normalized inside the x87 register but with an exponent too small for an actual f64, so you get hit by double rounding on the store anyway.

The hope would be to batch them. What about calls to functions outside the current compilation unit? Functions in libraries will likely not be compiled with this option. Should we reset the rounding mode to the system default before every call?

The most conservative option would be to save the mode, and restore it before any call/return. Not that expensive relative to all the other code we’re inserting anyway, so probably just go with that.

Some of the discussion around the RISC-V vxrm register might be a helpful reference; see D113439 ([RISCV] Add IR intrinsics for reading/writing vxrm) and D151396 ([2/3][RISCV][POC] Model vxrm in LLVM intrinsics and machine instructions for RVV fixed-point instructions).