[RFC] Floating-point accuracy control

I think the simplest thing is just to apply the accuracy globally to all recognized library calls in the region. (I don’t think it makes sense to apply it to calls we can’t identify.) The usual restrictions, such as errno requirements and ‘nobuiltin’ attributes, would apply. That’s what I’d recommend for the initial implementation.

You’re probably aware of the extensive set of ‘-fimf-’ options that the Intel C/C++ and Fortran compilers support to handle this kind of thing. Those options allow you to apply the option to individual functions and even the divide operation. I’m not proposing recreating the entire Intel infrastructure in LLVM, but I think the ability to apply the command-line option to individual functions could be useful if anyone has the time and motivation to implement it.

We can sort that all out if/when this gets as far as a clang RFC.

I guess I don’t see why this much complexity is warranted, or how it will actually fix the issue.

In general, I think that programmers who are upset that their results change when offloading will still be upset even if the differing value is within the ULP estimate of their original libc implementation. (And this assumes that their libc even documents its accuracy, and that that accuracy value is even correct!)

You even mention this issue in the context of metadata dropping: that a developer will be unhappy when they ask for 2.5ulp error implementation of division, if they sometimes get a 0ulp implementation instead. Both values comply with the accuracy requirement – but the developer is unhappy because the results are different, not because they’re not accurate enough.

My anecdotal observations regarding user expectations match @jyknight’s. Users do get upset by the changes in the results, even when they are an improvement. Or, rather, most of the time they tend to complain about the results when they differ from a reference computation done on the host. If both implementations match, the changes would be easier to accept.

IMO, global accuracy control will not be very useful in practice – large enough code base will likely have different components with different accuracy requirements. I think a block-scope #pragma would work better for this.

I wonder if we could provide some sort of __builtin_fp_accuracy(value, accuracy) which would work as a hint for LLVM to take the accuracy into account when we optimize the value and its data dependencies. This way we may not need to grow whole families of limited-precision equivalents of all functions. When we encounter a library or intrinsic call, we could just query the target – do you have a lower-accuracy equivalent of function X or operation Y (e.g. division)? If not, which will likely be the case for most calls, we just leave it alone. If target has a better alternative, we’ll use it.

Yes, I acknowledge that libraries providing more-accurate-than-requested implementations can be a problem. In my case, there is a library team that I work very closely with to achieve the results our customers want. So I can solve this problem for Intel customers using Intel compilers and Intel libraries on Intel hardware. However, one of the primary goals of SYCL is to avoid silos like that and open up the possibility to move across different architectures. That’s why I’m trying to add some mechanism to enable this. There will still be an onus on library vendors to get the details right. I’m just trying to provide a mechanism that enables this sort of development.

I also want to address the comment about users complaining when the results differ from “a reference computation done on the host.” I don’t think we can assume that. It is the most common case right now, but I think there are also users whose reference is a CUDA/OpenMP/OpenCL/SYCL implementation. And so in those cases, the host might need to match the device results.

Do you see what I’m trying to achieve here? The intent it’s basically to expand what’s possible.

BTW, I’ve proposed a BoF to discuss this at the upcoming LLVM Dev Meeting.

Say we make a standardized set of algorithms. For each function, we can offer a correctly rounded algorithm, a very close algorithm (within 1 ulp), and something super-approximate-but-fast (not sure what accuracy bound we want here). Name them “standard-cr”, “standard-accurate”, and “standard-fast”, or something like that. We publish the algorithms as a standard, so they mean the same thing for every vendor. Then we provide some mechanism to extend the list, so vendors can provide their own named algorithms that are more optimized for specific uses/hardware (e.g. “vendorname-accurate”, “vendorname-fast”, “vendorname-superfast-sqrt”).

The user can then request whatever algorithm is appropriate for their use (either for every function, or on a function-by-function basis). Users who care about reproducible results on different hardware will stick with the standardized algorithms. Users who care more about speed than the precise algorithm can request vendor-specific algorithms. And we can provide a way to let the implementation choose an algorithm within constraints: for example, “any-accurate” chooses the fastest algorithm the implementation knows about that produces results within 1 ulp.

In LLVM, this looks something like call float @llvm.fpalgorithm.exp10(float %x, !metadata !"standard-accurate").

The primary problem with this is finding someone willing to write specifications for all the “standard” algorithms.

This is just a rough sketch I wrote in a few minutes, but hopefully conveys the idea of how we could solve the predictability and accuracy problems together.

@efriedma-quic That’s not far from what I have in mind, but I think it might be expecting too much from library vendors. It would set up a nice infrastructure for open source libraries, but I expect at least some library vendors will want to keep their algorithms private. Something like what you are suggesting is probably the only way to get consistent results across different library implementations, but I don’t think that’s the only goal.

So for instance, if vendor A provides a 4 ulp implementation of some function and vendor B provides a 4 ulp implementation of the same function, they may still give very different results within those 4 ulps. But I think there’s an important use case where developers calling a function within their own implementation just need to know that the function is 4 ulp accurate and the amount of error doesn’t need to be consistent. Whether the error gets magnified or canceled out depends on how the result is used. If I’m developing code and I know a function returns a result with 4 ulp error, I can reason about what I need to do to make that acceptable. In this case, specifying the allowable error is all about performance.

I know that kind of contradicts some of the things I’ve agreed with earlier, but it’s a different use case. It relies on very highly skilled users, but they’re out there.

My suggestion doesn’t require vendors to provide any particular documentation for their proprietary algorithms. Only the “standard” algorithms would be required to be documented.

Does “consistent” here mean consistent across platforms, or consistent across invocations? e.g. is it a problem if calling sin(1) twice produces different results?

I’m sure a few people out there know how to do numerical analysis based on ulps, but I expect the vast majority of users don’t. So I’m not against helping those users, but we should make sure whatever direction we take doesn’t make it harder to satisfy users who don’t know how to write that sort of proof.

I whole heartedly agree with this. That’s also one of my goals. Maybe I should have included my front end proposal here too, because that does some hiding of the implementation details.

One thing I’d like to have is a command-line option that something like this:

-ffp-accuracy=[low|medium|high|glibc|sycl|cuda]

Then there’d be some mechanism for the front end to translate whichever accuracy level was chosen into an ulp spec for each function. One of the problems here is that within existing libraries and standards, different functions have different accuracy, and I do want to isolate users from the details if they don’t want to know them.

It’s pretty clear that this is a rather sprawling problem with a lot of conflicting priorities and use cases. I’m trying to come up with something that allows the most expert people to get very fine control over things without getting in the way of people who just want to use the standard library or anything in between. Satisfying the most expert users is a very big job, and I don’t imagine that my proposal gets there completely, but I hope it will open up some new possibilities.

Going back to my statement about vendors keeping private algorithms, what I was trying to get at is that having some interface that lets someone call “vendorname-fast” doesn’t provide a way for other vendors to produce results that are consistent with that algorithm unless the algorithm is shared. And maybe I’m not understanding your suggested but it seemed to me like there would be no way for a user to invoke that algorithm unless they knew it existed.

What I’m suggesting instead is that the intrinsics will provide a way for front ends to specify that the user wants to call some function, cosf() for example, and they require a result with at most 4 ulp error. The front end can do this in any number of ways. One might be to have an option like “-ffp-accuracy=medium” and the front end decodes that to 4 ulp for cosf(). Then the LLVM backend (or a runtime library in cases with deferred compilation) can use some mechanism to match that requirement to an available implementation.

Of course, if the front end has no need for such accuracy control, it can go on using the existing IR representations and they’ll get optimized the same way they are today. I don’t think my proposal will have any impact on optimization in cases where it isn’t used.

Regarding my remark that “the amount of error doesn’t need to be consistent” I meant across platforms. I do intend that multiple calls to the same function with the same accuracy requirements and the same inputs should produce the same results within the parts of an executable that run on the same hardware.

There is a problem to be solved there regarding vector function calls. Strictly requiring consistent results across a program generally prevents vectorization. When fast-math (or even just ‘afn’) is used, it may be reasonable to interpret that as permitting some variation in the vector case (icc and gcc do this). When fast-math isn’t used, you just can’t vectorize in these cases without something we don’t currently have. In the Intel compilers, we offer an option to call an implementation from our vector library in place of scalar functions to achieve vector/scalar consistency in the precise case, but that changes the default accuracy for many functions. Example: Compiler Explorer

Isn’t there another consideration here. With FP, ulp is not normally enough on its own. Doesn’t the accuracy vary depending on the input value range?
When tuning an algorithm, some of the FP instructions need to be at a higher accuracy than other FP instructions. Normally depending on whether subsequent FP instructions amplify the error or reduce the error.
So, is your proposal to somehow let the compiler work all that out, and give a ulp for the algorithm as a whole, or is your proposal more of a global one, to select each instructions to a particular ulp and not take the function as a whole?

No, I’m just focusing on the accuracy of the functions that are provided by the standard math library and equivalent built-in functions provided by various offload programming models. Basically, the calls that might get represented using LLVM math intrinsics today. You’re right that there’s a lot more that goes into figuring the accuracy of an algorithm, but that’s beyond the scope of what I’m trying to do here. I just want to provide a way to select the accuracy of parts of the algorithm that might otherwise be treated as black boxes.

I do think my proposal can be extended to allow non-precise versions of basic operations. OpenCL needs that for single precision divide and square root. My colleagues who work on FPGA compilers tell me that it would be useful for their customers to have the same capability for other operations such as addition and multiplication. Those operations are assumed to provide correctly rounded results by default, but in cases where the target instruction set can provide a faster implementation with less accuracy it can be worth introducing intrinsics to allow that.

In terms of the overall accuracy of some user algorithm, all I want to do is to provide a mechanism to constrain or relax the accuracy of a small set of known operations/functions. It will be left to the user to factor that accuracy into their algorithm to get the results they want.

Regarding input value range, that’s a potential consideration but one that I didn’t want to tackle. Most of the library specifications I’ve looked at provide a single accuracy value in ulps for each function. I would guess that there is some variability across the range of possible input values where the results will be more accurate for some inputs than for others, but the stated ulp accuracy should be the worst case. There are some cases where the documented accuracy of a function is more complicated. For example, CUDA’s __logf() intrinsic is documented as “For x in [0.5, 2], the maximum absolute error is 2-21.41, otherwise, the maximum ulp error is 3.” I don’t want to try to generalize that to represent it in IR, but if anyone has a suggestion I’d be open to it.

I had a lot of helpful discussions about this topic at the LLVM Dev Meeting. Thanks to everyone who gave their input.

Tue Ly (@lntue) suggested that it would be helpful to have additional properties like rounding mode represented. That would allow us to select different implementations of functions that provide correctly rounded results. Other constraints like limited domain might also be helpful in the future. This leads me to want a more open-ended and extensible way of specifying constraints, which could possibly be combined with the existing constrained fp handling.

Johannes Doerfert (@jdoerfert) said he would like to get rid of the existing math intrinsics (llvm.sin, llvm.cos, etc.) since they almost entirely duplicate what is done with the LibFunc handling, with the minor difference being errno handling. He suggested I could get the behavior I want by attaching attributes to the regular function calls and the math intrinsics could be eliminated altogether.

I’m not entirely comfortable with that suggestion for several reasons. (1) A lot of existing optimizations would need to be updated to respect the attribute and future optimizations could break things if they didn’t know to look for the attributes. (2) I’m not comfortable with the idea of replacing one named function call with a completely different function call. It’s technically doable, and probably happens already in some case, but it feels to me like something that the IR isn’t specifically allowing. (3) I intend for my function accuracy handling to target various calls that are equivalent to the standard math library call but original as something different, such as the SYCL or CUDA builtin transcendental functions.

I have a new vision for how to bring this all together, and I think it’s pretty good. I’ll put together a new RFC providing more details and even a patch to go along with it, but for now I’d like to sketch out the basic idea here and ask for feedback.

First, I’d like to introduce a new set of math intrinsics, which I hope will eventually become the only math intrinsics. I want to give them a new name so they don’t inherit any unwanted behavior and the existing intrinsics can be phased out gradually. There will be two key characteristics of these intrinsics: (1) They will be tightly coupled with a new set of call site attributes that are valid only for these intrinsics and have defined default values that apply if the attribute is not present. (2) They will be defined explicitly as being intended to be replaced with alternate implementations that meet the requirements described by the associated attributes.

So for my accuracy use case, I’d imagine a call like this:

%0 = call double @llvm.fpbuiltin.cos(double %x) #0 
...
attribute #0 = "fp-max-error"="4.0"

I’d add a wrapper class that allows FPMathBuiltin to act like a first class instruction, and for any attribute we add it would have an accessor function like FPMathBuiltin::getRequiredAccuracy().

For the first phase of implementation, I imagine using this only for the accuracy case. Later, we could move over the constrained intrinsics by adding support for “fp-rounding-mode” and “fp-exception-behavior” attributes. Any other constraints we need later could be added without requiring a new set of intrinsics.

Once everyone is comfortable with the idea that you have to check the attributes before doing anything with these intrinsics, they could replace the existing math intrinsics.

I’d also like to tie this back to the correctly rounded math library implementations Tue Ly has created. As I understand it, some of these rely on the availability of non-default architecture features like FMA. The mechanism I’m proposing here could be used to select alternate math library implementations. I’m picturing something analogous to clang’s -fveclib. This could potentially have multiple implementations like __cos_cr_avx2, __cos_cr_sse, etc., not to mention vector versions like __cos_cr_f64x4_avx2.

1 Like

New proposal seems more flexible for future changes.

This is unavoidable. The point is that we should provide an alternative to the current state of affairs, where the compiler implicitly picks “vendorname-fast” by default.

We can have two modes; one where the user picks the exact algorithm, and one where we let the vendor pick an algorithm. So the vendor can automatically pick vendorname-fast in the latter mode.

There is a VFDatabase in LLVM, a Vector Function Database: ⚙ D67572 [VectorUtils] Introduce the Vector Function Database (VFDatabase).

Maybe there is space for a similar design for floating point. Vendors could provide several variants of the same operation, i.e, cosine. During compilation the database is queried for the most suitable variant. Fast or not-fast is probably not precise enough.

@tschuett Thanks for pointing that out. There is a fair amount of overlap between my proposal and the vector function handling that I need to figure out. I definitely want them to be aligned.

It is about separation of concerns and being independent of TLI.

Imagine there is a Floating point function database (FPFDatabase). You are a vendor of an FPGA and you probably offer more than one cosine with different accuracies. It is your job to populate the database with your functions, accuracy, rounding mode, and how to invoke your functions.

The customers can query the database: I need a cosine with this accuracy. The selection algorithm is up to you.

If you are a vendor of an Nvidia H100, then you will probably only offer one cosine with complete different properties.

The customers do the same queries to the database.

Is there a design document somewhere for the Vector Function Database or is the code and review discussion the only documentation?

I’ve been modeling my implementation on the veclib handling, and it isn’t clear to me how that is supposed to fit in with the Vector Function Database. The general idea sounds like what I want, but I haven’t quite processed how it all is supposed to fit together.

There were several RFCs on the mailing lists:
https://lists.llvm.org/pipermail/llvm-dev/2019-June/133484.html

I found it at the top of the review.

BTW, that reminds me of that bug in Chrome that fdlibm pow is worse precision.

Still not fixed.

I’ve posted a preliminary implementation of what I’m proposing here: ⚙ D138867 [RFC] Add new intrinsics and attribute to control accuracy of FP calls

This doesn’t yet incorporate the suggestion to follow the design of the vector function database. My proposed design is based on the implementations of constrained FP intrinsics and veclib handling.

" single function (like, say, atan(y / x)atan2(y, x) "

Just so everyone is aware:: atan( y/x ) cannot produce 3pi/4 whereas it is a required value for ATAN2()

But the problem to be solved, here, is bit accurate results on a per instruction basis.

In general, there are 4 categories of results for round to nearest and round to nearest magnitude::
a) bigger than 1 ULP
b) 1 ULP
c) faithful
d) 0.5 ULP

Most FP elementary function libraries are in class (a), and very few libraries are in class (d), most libraries that claim “compliant” rounding only do 0.5 ULP in RNE mode, and even here when exotic argument reduction (Payne and Hanek) is required, they may only achieve 0.505 ULP rounding. So, we have nothing in class (d) for 3 of the 5 rounding modes, and we are uncertain for rnm !?!

For example; I know of only 1 implementation that gets COS( 2^709 ) IEEE correct.

Most of the libraries in class (a) might average 150 cycles per elementary function, while those in class (d) might average 300 cycles with worse cases in the 20,000 cycle range. All for ½-bit more accuracies.
The libraries purporting to be IEEE accurate have been available for over a decade (for free) that fact they are not being used more is a testament of computer consumers than available software.