[RFC] Floating-point accuracy control

Hi everyone,

I’ve been working on a number of problems related to floating-point accuracy of math library calls, and I’d like to propose a set of new intrinsics to specify required accuracy for math library calls and the corresponding LLVM intrinsics. I’m also going to suggest some clarifications of the semantics of existing IR constructs related to floating-point accuracy.

I think this has some potential applicability for traditional host-only optimization, but it is especially important for heterogeneous programming models like OpenCL, OpenMP, CUDA, and SYCL.

I’m going to spend quite a bit more time in this RFC describing the problem than my proposed solution. I think this is necessary to establish a common understanding of exactly what I am trying to do so that you can properly evaluate whether or not my proposed solution accomplishes what it intends to. Feel free to skip to the end and just look at the proposed intrinsics, but thanks in advance if you have time to read the rest.

Background

Historically, this has been an area where there isn’t a whole lot to be done. If you call a math library function, you get whatever accuracy the library implementation provides and there’s not much the compiler can do about it. Maybe you enable fast-math and in a few cases the compiler might substitute an approximation sequence, but things can get shaky really fast when you head down that path. There’s also a bit of a problem if the compiler decides to vectorize the call, which I’ll discuss more in a bit.

With the rise of heterogeneous computing, however, this is a much bigger issue. If I have an algorithm that calls cos(x), for instance, I might get one result if the code executes on the host computer and something different if the code executes on an offload device. As a developer, I need some way to get this under control.

Math Functions in a Heterogeneous Environment

This is a tricky problem. Ordinarily, you think of the math library as a single entity. I can call cosf(x) from anywhere in my program, and I can count on every one of those calls being resolved to the same function. I can go to the documentation for the library (hopefully) and look up the accuracy of the cosf() function provided by that library. However, in a heterogeneous environment I don’t have just one math library. I potentially have a different math library for each device I’m using, and each of these libraries is likely to provide different accuracy for its functions. For example, the OpenCL specification requires cosf() to provide 4 ulp accuracy. Conforming implementations from different vendors are required to be at least this accurate, but they may be more accurate. The same function provided by CUDA will have 2 ulp accuracy. The x86_64 glibc implementation has 1 ulp accuracy. I’m doing a bit of hand-waving here with regard to the definition of the function. Namespaces and such distinguish which is which within the program, but as a developer I still think of it as a call to the single-precision cosine function.
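
Since everything below is stated in terms of ulp bounds, here is a minimal sketch of one common way to measure ulp error (assumptions: finite, normal values; real library test harnesses handle zeros, denormals, and infinities more carefully):

  #include <cmath>

  // Approximate ulp error of a computed float against a higher-precision
  // reference: the absolute error divided by the size of one ulp at the
  // reference's binade (2^(e-23) for a normal float with exponent e).
  double ulpError(float computed, double reference) {
    double ulp = std::ldexp(1.0, std::ilogb(reference) - 23);
    return std::fabs(static_cast<double>(computed) - reference) / ulp;
  }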

So the problem that this introduces is that I can use the same algorithm in code that executes on the host computer or various offload devices and get significantly different results. As a compiler vendor, I can patiently explain to my users why that happens, but my experience is that while they understand and accept my explanation they really don’t care why it happens. They want it not to happen.

This is the thing that’s motivating the RFC for me. I want to be able to provide a way for users to get the results they want. For example, if someone is developing a SYCL kernel, the default assumption is set by the accuracy requirements of the SYCL standard. These requirements are rather loose and geared toward performance. I want to be able to offer options to instead allow the user to specify that their kernels should be constrained to an accuracy that would be consistent with what they get when executing equivalent CUDA code on a different device, or accuracy that is consistent with the host math library. And I want to be able to define an interface that allows device vendors to provide multiple library implementations and select one at runtime based on options that were passed to the compiler when the main program was compiled. That’s a different problem, but it’s one I can’t solve without having an LLVM IR representation to describe function accuracy requirements.

Existing Constructs

Currently, we have two ways of describing floating point accuracy in LLVM IR – the ‘afn’ fast-math flag allows the compiler to substitute an approximation for a math function, and the ‘fpmath’ metadata tells the compiler how much error is acceptable in the result of a floating-point operation. I think there are potential problems with both of these.

The ‘afn’ fast-math flag is defined in the LLVM Language Reference as allowing “substitution of approximate calculations for functions (sin, log, sqrt, etc).” The problem with this is that there is no indication of how “approximate” the calculation will be. In practice, we seem to use it only for transformations that are algebraically equivalent, like ‘pow(X, -0.5) → 1/sqrt(X)’, which is good. However, there are cases in code generation where you might want to use instructions like RSQRT14PS plus a number of refinement iterations to approximate square root. How many iterations do you need? That depends on how accurate the answer has to be, and the ‘afn’ flag doesn’t say.
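
To make the iteration-count question concrete, here is a minimal scalar sketch; rsqrt_est stands in for a hardware estimate such as the one RSQRT14PS produces (about 14 correct bits), and each Newton-Raphson step roughly doubles the number of correct bits:

  // Illustration only: refine a hardware reciprocal-sqrt estimate.
  float sqrtViaRsqrt(float x, float rsqrt_est, int iterations) {
    float y = rsqrt_est;                  // initial guess for 1/sqrt(x)
    for (int i = 0; i < iterations; ++i)
      y = y * (1.5f - 0.5f * x * y * y);  // one Newton-Raphson step
    return x * y;                         // sqrt(x) == x * (1/sqrt(x))
  }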

The ‘fpmath’ metadata has a different problem. I believe it was introduced to handle the case where OpenCL allows 2.5 ulp error on single-precision divide and square root. This is great. If the metadata is present, I can substitute an approximate square root or divide; if it isn’t (including cases where it got dropped during optimization), I have to generate correctly rounded results.

One problem with ‘fpmath’ is that it only works in cases where the operation returns a correctly rounded result by default. What happens if I add “!fpmath !{float 2.5}” to a function call whose math library implementation only provides 4 ulp accuracy? Is that an error? What if the default implementation provides 4 ulp accuracy but another implementation is available that provides 2 ulp accuracy? In that case, I might select the more accurate implementation if the ‘fpmath’ metadata is present but select the less accurate implementation if it is dropped. This means I can’t reliably use ‘fpmath’ to require an implementation that is more accurate than what I’d get by default.

Another problem with ‘fpmath’ is that because it is metadata it can and will be dropped. It may seem like this is OK in cases like fdiv where the default implementation provides a correctly rounded result, but I think it’s actually problematic. Consider the OpenCL case. The OpenCL specification allows 2.5 ulp error for single-precision division. This isn’t a case like the fast-math flags where the user is signing off on a bit of wiggle room in the answer. It’s just a lower accuracy requirement. The generated code can use an implementation that returns 2.5 ulp error, and that’s OK. The generated code can instead use an implementation that returns a correctly rounded result, and that’s OK. However, it can’t do both in different places. Or at least, as a developer I wouldn’t be happy if it did. So if I have a function that contains “%z = fdiv float %x, %y, !fpmath !{float 2.5}” in multiple locations and the compiler drops the metadata from one of them, I’m potentially going to get two different answers for what should be the exact same operation.

Vectorization of Function Calls

One last thing I’d like to mention before I get to my proposed intrinsics is a problem I came across recently with function vectorization. Consider this loop:

  for (int i = 0; i < 4; ++i)
    x[i] = sinf(y[i]);

If I have turned off math errno and enabled a vector library, clang will compile that to this call instruction for x86 targets (Compiler Explorer):

  %4 = tail call <4 x float> @_ZGVbN4v_sinf(<4 x float> %3)

That’s nice. I like it. The problem is, this function isn’t as accurate as the default math library implementation of the scalar sinf function. Unless I’ve done something to relax the accuracy requirements, that shouldn’t happen. GCC won’t generate this vector call unless I use the -ffast-math option. That’s better, but it takes us back to the problem I described previously with the ‘afn’ flag not defining how accurate the substitute implementation has to be.

This problem could be solved just by looking for the ‘afn’ flag if we’re willing to assume that the vector library implementations are accurate enough to conform to the semantics of that flag (which, as I’ve said above, is insufficiently defined). Ideally, I’d like to have an interface that allows the vector library to define the accuracy of its implementations and allows the vectorizer to decide if that’s OK.

Proposal: fpaccuracy intrinsics

To allow control over the required accuracy of floating-point library calls, I’d like to introduce a new set of intrinsics of the form:

  {type} @llvm.fpaccuracy.{function}.{type-specifier}({type} %x, float {accuracy})

For example

  float @llvm.fpaccuracy.cos.f32(float %x, float 2.5)
  <2 x float> @llvm.fpaccuracy.cos.v2f32(<2 x float> %x, float 2.5)
  double @llvm.fpaccuracy.cos.f64(double %x, float 2.5)

There are definitely problems with introducing yet another set of floating point intrinsics. We already have constrained intrinsics and vector-predication intrinsics, and this is another set that aren’t mutually exclusive with either of the other two.

The other possibility I considered is defining a new type of operand bundle that can be attached to existing intrinsics. Something like this:

  %y = call float @llvm.cos.f32(float %x) [ "max-error"(float 2.5) ]
  %y = call float @llvm.experimental.constrained.cos.f32(float %x, metadata "round.dynamic", metadata "fpexcept.strict") [ "max-error"(float 2.5) ]
  %r = call <4 x float> @llvm.vp.fdiv.v4f32(<4 x float> %a, <4 x float> %b, <4 x i1> %mask, i32 %evl) [ "max-error"(float 2.5) ]

We’ve previously discussed combining the constrained floating-point intrinsics with the vector predication intrinsics using operand bundles, but I don’t think that has made much progress, and in general I’m not sure there’s consensus that it’s reasonable to use operand bundles this way.

For now, I’m going to proceed on the assumption that new intrinsics will be used, but if we decide operand bundles are better, I think most of the reasoning will apply in the same way.

TargetLibraryInfo interface additions

The semantics of the fpaccuracy intrinsics I am proposing introduce requirements that won’t be met in most cases. They require the compiler to have some knowledge of the accuracy of the math library functions. If the accuracy provided by the function library is unknown, in most cases we would need to assume that it does not meet explicit accuracy requirements.

My expectation is that front ends would only generate calls to the fpaccuracy intrinsics if they know that there will be target library support for these intrinsics. This would be accomplished using the TargetLibraryInfo interface. Similarly, the vectorizer would use new TLI methods to determine if a function call could be vectorized with the required accuracy.

Here are the new TLI functions I am proposing:

  bool isFPAccuracyAvailable() const
  bool isFPAccuracyResolvedAtRuntime() const
  bool isFunctionAccuracyAvailable(StringRef F, float Accuracy) const
  StringRef getFunctionImplWithAccuracy(StringRef F, float Accuracy) const
  bool isFunctionVectorizableWithAccuracy(StringRef F, const ElementCount &VF, float Accuracy) const
  bool isFunctionVectorizableWithAccuracy(StringRef F, float Accuracy) const
  StringRef getVectorizedFunctionWithAccuracy(StringRef F, const ElementCount &VF, float Accuracy) const

The TargetLibraryInfo implementation may defer the decision on the availability of specific function accuracy implementations for cases like SYCL or OpenMP where the call will be resolved at runtime. In such cases, the TLI implementation should return true for all accuracy requests it wants to allow and provide some means of reporting an error at runtime if the required accuracy is not available.
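
To illustrate how the vectorizer might consult these hooks, here is a hedged sketch; the TLI methods are the proposed ones above, and nothing here exists in-tree today:

  #include "llvm/Analysis/TargetLibraryInfo.h"
  #include "llvm/Support/TypeSize.h"
  using namespace llvm;

  // Returns the name of a vector variant known (or deferred to runtime)
  // to meet the caller's ulp requirement, or an empty StringRef if the
  // call must stay scalar.
  static StringRef chooseVectorVariant(const TargetLibraryInfo &TLI,
                                       StringRef Fn, const ElementCount &VF,
                                       float RequiredAccuracy) {
    if (!TLI.isFunctionVectorizableWithAccuracy(Fn, VF, RequiredAccuracy))
      return StringRef();
    return TLI.getVectorizedFunctionWithAccuracy(Fn, VF, RequiredAccuracy);
  }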

Lowering of fpaccuracy intrinsics

I propose that a new pass will be introduced in the CodeGen phase which will lower fpaccuracy intrinsics to specific function calls before instruction selection. If no function which is known to meet the accuracy requirements exists, this pass should report a fatal error. Otherwise, the intrinsic call will be replaced by a call to the function returned by TLI::getFunctionImplWithAccuracy().

For example:

  %y = call float @llvm.fpaccuracy.exp10.f32(float %x, float 1.0)

may become

  %y = call float @__exp10_ha(float %x)
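
A rough sketch of the replacement step follows, assuming the proposed TLI hook and eliding the mapping from intrinsic to library function name (this is not an existing pass):

  #include "llvm/Analysis/TargetLibraryInfo.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/ErrorHandling.h"
  using namespace llvm;

  static void lowerFPAccuracyCall(CallInst &CI, const TargetLibraryInfo &TLI,
                                  StringRef LibFuncName) {
    // The accuracy requirement is the trailing constant operand.
    auto *Acc = cast<ConstantFP>(CI.getArgOperand(CI.arg_size() - 1));
    float Ulps = Acc->getValueAPF().convertToFloat();

    StringRef Impl = TLI.getFunctionImplWithAccuracy(LibFuncName, Ulps);
    if (Impl.empty())
      report_fatal_error("no implementation meets the required accuracy");

    // Rebuild the call against the selected implementation, dropping the
    // accuracy operand.
    Module *M = CI.getModule();
    FunctionCallee Callee = M->getOrInsertFunction(
        Impl, CI.getType(), CI.getArgOperand(0)->getType());
    IRBuilder<> B(&CI);
    CI.replaceAllUsesWith(B.CreateCall(Callee, {CI.getArgOperand(0)}));
    CI.eraseFromParent();
  }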

For programming models like SYCL that might involve runtime resolution of these functions, the programming model will need a way to describe that in its intermediate representation. There are people working on this for SPIR-V.

Clang support

If this proposal gets enough traction to move forward, I will be proposing corresponding clang changes to enable the interface, including a new command-line option (-ffp-accuracy) and a new pragma (#pragma clang fp accuracy). However, I’d like to hold off on that for now until I see if others agree that this is a useful capability to support in the backend.

What do you mean by “different”? As far as I know, correctly-rounded cos() implementations are not generally in use. Given that, every implementation is going to produce different results to some extent. Accuracy controls can’t solve this; at best, they make the issue slightly less visible. If you want bit-identical results, the only solution is for the developer/compiler to provide an implementation for use on all platforms, built on top of correctly-rounded operations, and stop relying on functions provided by device vendors.

That isn’t to say that accuracy controls are useless, but I don’t think they solve the problem you’re saying they solve.

No, I’m not expecting to get bitwise identical results. I’m just looking for something more bounded. My experience in dealing with customers who are concerned about floating-point accuracy is that they typically have a relatively expensive process of validation to verify the acceptability of some implementation that is then accepted as a reference. Subsequent implementations must yield final results (whatever they’re measuring as the output of the algorithm, not individual function calls) within some margin of error relative to the reference implementation. If the results are not within this margin of error, the new implementation has to go through the full validation process. This is a major barrier to moving from one hardware architecture to another, or often even from one compiler to another. That’s one of the main problems I’m trying to solve.

So, for example, if I have a reference implementation on x86-64-based hardware that has been verified and I want to port some part of the computation to an offload device using SYCL, the relatively large error tolerated by the SYCL standard is a barrier that must be overcome.

Another problem, which is admittedly a bit more ambitious, is that by relaxing the accuracy requirements where they can be relaxed and tightening them where they must be tightened, a developer can fine tune an algorithm to maximize performance while maintaining acceptable results. Here’s an example of academic work related to what I have in mind: [1509.01347] Verificarlo: checking floating point accuracy through Monte Carlo Arithmetic

The LLVM libc guys are building a libm and there are a lot of diffs about correctly rounded math functions:
https://reviews.llvm.org/D118962?id=406579

Maybe the libc can be ported to your SYCL device?!?

Yes, it could be. That’s just the start of the problem though. I need a way to tell the SYCL device that I want the accuracy that is consistent with the host libm implementation.

That is, there are any number of accuracy levels that might make sense for a user to select. I might want the SYCL-compliant implementation, which is less accurate but faster, or I might want the host-compliant accuracy, or I might want accuracy comparable to what the CUDA library provides.

Apart from wanting to achieve consistency between different library implementations, I might also want to choose different accuracy for performance reasons. In general, less accurate implementations are significantly faster, and if I know I don’t need the full accuracy of a function I can get some extra performance by letting the compiler know that.

The ‘afn’ fast-math flag is defined in the LLVM Language Reference as allowing “substitution of approximate calculations for functions (sin, log, sqrt, etc).” The problem with this is that there is no indication of how “approximate” the calculation will be. In practice, we seem to use it only for transformations that are algebraically equivalent, like ‘pow(X, -0.5) → 1/sqrt(X)’, which is good. However, there are cases in code generation where you might want to use instructions like RSQRT14PS plus a number of refinement iterations to approximate square root. How many iterations do you need? That depends on how accurate the answer has to be, and the ‘afn’ flag doesn’t say.

Fast-math in general can be summarized as “prefer speed over floating-point accuracy” (leaving aside the issue, as you note, that the degree to which accuracy can be specified is poorly indicated), but it also is treated in LLVM largely as the conjunction of the individual fast-math flags. This means that some optimizations that introduce faster approximate calculations that require more than just a single function (like, say, atan(y / x) → atan2(y, x), though I don’t think that one’s actually a speed improvement) seem like they should be enabled by -ffast-math but don’t map neatly to any of the flags, given that afn’s definition seems to imply it only allows approximation of libm functions directly.

To allow control over the required accuracy of floating-point library calls, I’d like to introduce a new set of intrinsics of the form:

{type} @llvm.fpaccuracy.{function}.{type-specifier}({type} %x, float {accuracy})

The issue I see with this approach is that it doesn’t generalize well to constraining the accuracy of things that aren’t individual operations, and this first iteration focuses on constraining libm functions. While I appreciate that there are likely to be few opportunities for a compiler to optimize these library operations beyond substituting different function definitions, the one case that does come to mind is if you’ve got something like (float)f((double)x) (where f is some small function or sequence of instructions whose body the compiler can see), where you can do the computation in float instead of double if you know it’ll be good enough. Using operand bundles seems like it could handle that case where f is an outlined function, but if f was inlined, you’d still lose that information.
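
For illustration, a small sketch of that case; ‘f’ is a stand-in for some visible, inlinable function, and none of this comes from the proposal itself:

  // 'f' stands in for a small function whose body the compiler can see.
  static double f(double x) { return x * x + 2.0 * x + 1.0; }

  float g(float x) {
    // As written, the computation is widened to double and narrowed back.
    // Under a relaxed accuracy bound, the compiler could instead evaluate
    // the inlined body entirely in float: x * x + 2.0f * x + 1.0f.
    return (float)f((double)x);
  }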

If this proposal gets enough traction to move forward, I will be proposing corresponding clang changes to enable the interface, including a new command-line option (-ffp-accuracy) and a new pragma (#pragma clang fp accuracy).

How would those options work? Apply accuracy globally to all of the function calls in the region?

I’d note that the latest C standard draft has reserved names for a new family of functions starting with “cr_” – e.g. “cr_sin”, “cr_exp10”, etc., which allows user code to indicate the requirement for a correctly-rounded operation, for the calls where the user has such a requirement. (At this point, the names have only been reserved in C for that purpose, not actually specified.)

Perhaps it’d be better to push that path forward, than adding llvm-specific functionality?

@jcranmer I’m not proposing getting rid of the ‘afn’ fast-math flag. I think it’s useful. I just want to tighten up the definition with some indication of what it permits. The primary way that the ‘afn’ flag relates to the new intrinsics I’m proposing is that the intrinsics allow approximation that goes beyond what I think the ‘afn’ flag should permit.

As an example, the Intel Software Optimization Manual describes three methods for calculating single-precision square root: the SQRTPS instruction, the RSQRT14PS instruction followed by the MULPS instruction, and the RSQRT14PS instruction followed by the MULPS instruction and one iteration of Newton-Raphson refinement. The SQRTPS instruction provides correctly rounded results, so it’s always an acceptable substitute for a call to sqrtf(). RSQRT14PS + MULPS + one Newton-Raphson iteration provides ~2 ulp error, so it’s probably an OK substitution if ‘afn’ is set, but the performance difference is such that it may not be a good universal substitution. RSQRT14PS + MULPS only provides ~1024 ulp accuracy (14 correct bits), so it shouldn’t ever be used with just the ‘afn’ flag as justification, but there are circumstances in which a user might be OK with this level of accuracy.

@jyknight I was aware that there was some movement in that direction in the standards, but I was kind of viewing this in the opposite way. If I have existing code with math library functions, it would be great to just recompile with a new option and have my math functions remapped to the correctly rounded versions.

Ultimately, this will end up with the functions being resolved to different implementations, probably with some indication in the function name of what it is. What I’m trying to do is generalize the handling of it so that this option doesn’t necessarily have to be hard-wired in the code. The code can be written once, and then the user can recompile with different options to try things out. Maybe they try a very relaxed accuracy requirement and it’s very fast but the results aren’t within the acceptable range, so they recompile and try again with medium accuracy, and so on.

Putting this in the IR has an additional benefit that relying on function names doesn’t. Let’s say my code is calling sinf(x) and I don’t need the math errno value to be maintained. The front end will generate a call to llvm.sin.f32(x). With my proposal, if I’ve further indicated that I can get by with 4 ulp accuracy for that function, the front end can instead generate a call to llvm.fpaccuracy.sin.f32(x, 4.0). What’s significant about this is that if I have a vector library available, the vectorizer can turn it into a call to llvm.fpaccuracy.sin.v4f32(x_vec, 4.0) or whatever. Now it’ll end up getting resolved to a different function, but it’s still in this abstract form that the optimizer can recognize and potentially do other things with.

Ultimately I’m not against standard function naming to solve this problem for anyone who is happy with that. I would just like to have a facility to also generalize it within the compiler.

I think the simplest thing is just to apply the accuracy globally to all recognized library calls in the region. (I don’t think it makes sense to apply it to calls we can’t identify.) The usual restrictions, such as errno requirements and ‘nobuiltin’ attributes, would apply. That’s what I’d recommend for the initial implementation.
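
To make that concrete, here is a hypothetical sketch of how the pragma might be written; the exact spelling and scoping rules are precisely the things a future clang RFC would settle:

  #include <cmath>

  // Hypothetical pragma usage: every recognized libm call in the block
  // gets the region's accuracy bound; calls outside keep the default.
  float kernel(float a, float b) {
  #pragma clang fp accuracy(medium)
    return sinf(a) + cosf(b);
  }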

You’re probably aware of the extensive set of ‘-fimf-’ options that the Intel C/C++ and Fortran compilers support to handle this kind of thing. Those options allow you to apply the option to individual functions and even the divide operation. I’m not proposing recreating the entire Intel infrastructure in LLVM, but I think the ability to apply the command-line option to individual functions could be useful if anyone has the time and motivation to implement it.

We can sort that all out if/when this gets as far as a clang RFC.

I guess I don’t see why this much complexity is warranted, or how it will actually fix the issue.

In general, I think that programmers who are upset that their results change when offloading will still be upset even if the differing value is within the ULP estimate of their original libc implementation. (And this assumes that their libc even documents its accuracy, and that that accuracy value is even correct!)

You even mention this issue in the context of metadata dropping: that a developer will be unhappy when they ask for a 2.5 ulp error implementation of division, if they sometimes get a 0 ulp implementation instead. Both values comply with the accuracy requirement – but the developer is unhappy because the results are different, not because they’re not accurate enough.

My anecdotal observations regarding user expectations match @jyknight’s. Users do get upset by the changes in the results, even when they are an improvement. Or, rather, most of the time they tend to complain about the results when they differ from a reference computation done on the host. If both implementations match, the changes would be easier to accept.

IMO, global accuracy control will not be very useful in practice – a large enough code base will likely have different components with different accuracy requirements. I think a block-scope #pragma would work better for this.

I wonder if we could provide some sort of __builtin_fp_accuracy(value, accuracy) which would work as a hint for LLVM to take the accuracy into account when we optimize the value and its data dependencies. This way we may not need to grow whole families of limited-precision equivalents of all functions. When we encounter a library or intrinsic call, we could just query the target – do you have a lower-accuracy equivalent of function X or operation Y (e.g. division)? If not, which will likely be the case for most calls, we just leave it alone. If the target has a better alternative, we’ll use it.
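
A hypothetical use of that hint might look like this (no such builtin exists; the semantics are as described above, applying to the value and its data dependencies):

  #include <cmath>

  float relaxedSin(float x) {
    float v = sinf(x);
    // Hint: 4 ulp of error is acceptable for v and the computation feeding
    // it, so the compiler may resolve sinf to a faster, less accurate
    // variant if the target has one.
    return __builtin_fp_accuracy(v, 4.0f);
  }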

Yes, I acknowledge that libraries providing more-accurate-than-requested implementations can be a problem. In my case, there is a library team that I work very closely with to achieve the results our customers want. So I can solve this problem for Intel customers using Intel compilers and Intel libraries on Intel hardware. However, one of the primary goals of SYCL is to avoid silos like that and open up the possibility to move across different architectures. That’s why I’m trying to add some mechanism to enable this. There will still be an onus on library vendors to get the details right. I’m just trying to provide a mechanism that enables this sort of development.

I also want to address the comment about users complaining when the results differ from “a reference computation done on the host.” I don’t think we can assume that. It is the most common case right now, but I think there are also users whose reference is a CUDA/OpenMP/OpenCL/SYCL implementation. And so in those cases, the host might need to match the device results.

Do you see what I’m trying to achieve here? The intent is basically to expand what’s possible.

BTW, I’ve proposed a BoF to discuss this at the upcoming LLVM Dev Meeting.

Say we make a standardized set of algorithms. For each function, we can offer a correctly rounded algorithm, a very close algorithm (within 1 ulp), and something super-approximate-but-fast (not sure what accuracy bound we want here). Name them “standard-cr”, “standard-accurate”, and “standard-fast”, or something like that. We publish the algorithms as a standard, so they mean the same thing for every vendor. Then we provide some mechanism to extend the list, so vendors can provide their own named algorithms that are more optimized for specific uses/hardware (e.g. “vendorname-accurate”, “vendorname-fast”, “vendorname-superfast-sqrt”).

The user can then request whatever algorithm is appropriate for their use (either for every function, or on a function-by-function basis). Users who care about reproducible results on different hardware will stick with the standardized algorithms. Users who care more about speed than the precise algorithm can request vendor-specific algorithms. And we can provide a way to let the implementation choose an algorithm within constraints: for example, “any-accurate” chooses the fastest algorithm the implementation knows about that produces results within 1 ulp.

In LLVM, this looks something like call float @llvm.fpalgorithm.exp10(float %x, metadata !"standard-accurate").

The primary problem with this is finding someone willing to write specifications for all the “standard” algorithms.

This is just a rough sketch I wrote in a few minutes, but hopefully conveys the idea of how we could solve the predictability and accuracy problems together.

@efriedma-quic That’s not far from what I have in mind, but I think it might be expecting too much from library vendors. It would set up a nice infrastructure for open source libraries, but I expect at least some library vendors will want to keep their algorithms private. Something like what you are suggesting is probably the only way to get consistent results across different library implementations, but I don’t think that’s the only goal.

So for instance, if vendor A provides a 4 ulp implementation of some function and vendor B provides a 4 ulp implementation of the same function, they may still give very different results within those 4 ulps. But I think there’s an important use case where developers calling a function within their own implementation just need to know that the function is 4 ulp accurate and the amount of error doesn’t need to be consistent. Whether the error gets magnified or canceled out depends on how the result is used. If I’m developing code and I know a function returns a result with 4 ulp error, I can reason about what I need to do to make that acceptable. In this case, specifying the allowable error is all about performance.

I know that kind of contradicts some of the things I’ve agreed with earlier, but it’s a different use case. It relies on very highly skilled users, but they’re out there.

My suggestion doesn’t require vendors to provide any particular documentation for their proprietary algorithms. Only the “standard” algorithms would be required to be documented.

Does “consistent” here mean consistent across platforms, or consistent across invocations? e.g. is it a problem if calling sin(1) twice produces different results?

I’m sure a few people out there know how to do numerical analysis based on ulps, but I expect the vast majority of users don’t. So I’m not against helping those users, but we should make sure whatever direction we take doesn’t make it harder to satisfy users who don’t know how to write that sort of proof.

I wholeheartedly agree with this. That’s also one of my goals. Maybe I should have included my front-end proposal here too, because that does some hiding of the implementation details.

One thing I’d like to have is a command-line option something like this:

-ffp-accuracy=[low|medium|high|glibc|sycl|cuda]

Then there’d be some mechanism for the front end to translate whichever accuracy level was chosen into an ulp spec for each function. One of the problems here is that within existing libraries and standards, different functions have different accuracy, and I do want to isolate users from the details if they don’t want to know them.
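
As a sketch of the kind of table the front end would need (every number below is illustrative, not quoted from any spec):

  struct FuncAccuracy { const char *Name; float MaxUlps; };

  // One hypothetical expansion of -ffp-accuracy=medium: each named level
  // becomes a per-function ulp bound before IR generation.
  static const FuncAccuracy MediumLevel[] = {
      {"sinf", 4.0f},
      {"cosf", 4.0f},
      {"logf", 3.0f},  // different functions get different bounds
  };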

It’s pretty clear that this is a rather sprawling problem with a lot of conflicting priorities and use cases. I’m trying to come up with something that allows the most expert people to get very fine control over things without getting in the way of people who just want to use the standard library or anything in between. Satisfying the most expert users is a very big job, and I don’t imagine that my proposal gets there completely, but I hope it will open up some new possibilities.

Going back to my statement about vendors keeping private algorithms, what I was trying to get at is that having some interface that lets someone call “vendorname-fast” doesn’t provide a way for other vendors to produce results that are consistent with that algorithm unless the algorithm is shared. And maybe I’m not understanding your suggestion, but it seemed to me like there would be no way for a user to invoke that algorithm unless they knew it existed.

What I’m suggesting instead is that the intrinsics will provide a way for front ends to specify that the user wants to call some function, cosf() for example, and they require a result with at most 4 ulp error. The front end can do this in any number of ways. One might be to have an option like “-ffp-accuracy=medium” and the front end decodes that to 4 ulp for cosf(). Then the LLVM backend (or a runtime library in cases with deferred compilation) can use some mechanism to match that requirement to an available implementation.

Of course, if the front end has no need for such accuracy control, it can go on using the existing IR representations and they’ll get optimized the same way they are today. I don’t think my proposal will have any impact on optimization in cases where it isn’t used.

Regarding my remark that “the amount of error doesn’t need to be consistent” I meant across platforms. I do intend that multiple calls to the same function with the same accuracy requirements and the same inputs should produce the same results within the parts of an executable that run on the same hardware.

There is a problem to be solved there regarding vector function calls. Strictly requiring consistent results across a program generally prevents vectorization. When fast-math (or even just ‘afn’) is used, it may be reasonable to interpret that as permitting some variation in the vector case (icc and gcc do this). When fast-math isn’t used, you just can’t vectorize in these cases without something we don’t currently have. In the Intel compilers, we offer an option to call an implementation from our vector library in place of scalar functions to achieve vector/scalar consistency in the precise case, but that changes the default accuracy for many functions. Example: Compiler Explorer

Isn’t there another consideration here? With FP, ulp is not normally enough on its own. Doesn’t the accuracy vary depending on the input value range?
When tuning an algorithm, some of the FP instructions need to be at a higher accuracy than others, normally depending on whether subsequent FP instructions amplify the error or reduce it.
So, is your proposal to somehow let the compiler work all that out and give a ulp for the algorithm as a whole, or is it more granular, holding each instruction to a particular ulp rather than taking the function as a whole?

No, I’m just focusing on the accuracy of the functions that are provided by the standard math library and equivalent built-in functions provided by various offload programming models. Basically, the calls that might get represented using LLVM math intrinsics today. You’re right that there’s a lot more that goes into figuring the accuracy of an algorithm, but that’s beyond the scope of what I’m trying to do here. I just want to provide a way to select the accuracy of parts of the algorithm that might otherwise be treated as black boxes.

I do think my proposal can be extended to allow non-precise versions of basic operations. OpenCL needs that for single precision divide and square root. My colleagues who work on FPGA compilers tell me that it would be useful for their customers to have the same capability for other operations such as addition and multiplication. Those operations are assumed to provide correctly rounded results by default, but in cases where the target instruction set can provide a faster implementation with less accuracy it can be worth introducing intrinsics to allow that.

In terms of the overall accuracy of some user algorithm, all I want to do is to provide a mechanism to constrain or relax the accuracy of a small set of known operations/functions. It will be left to the user to factor that accuracy into their algorithm to get the results they want.

Regarding input value range, that’s a potential consideration but one that I didn’t want to tackle. Most of the library specifications I’ve looked at provide a single accuracy value in ulps for each function. I would guess that there is some variability across the range of possible input values where the results will be more accurate for some inputs than for others, but the stated ulp accuracy should be the worst case. There are some cases where the documented accuracy of a function is more complicated. For example, CUDA’s __logf() intrinsic is documented as “For x in [0.5, 2], the maximum absolute error is 2^-21.41, otherwise, the maximum ulp error is 3.” I don’t want to try to generalize that to represent it in IR, but if anyone has a suggestion I’d be open to it.

I had a lot of helpful discussions about this topic at the LLVM Dev Meeting. Thanks to everyone who gave their input.

Tue Ly (@lntue) suggested that it would be helpful to have additional properties like rounding mode represented. That would allow us to select different implementations of functions that provide correctly rounded results. Other constraints like limited domain might also be helpful in the future. This leads me to want a more open-ended and extensible way of specifying constraints, which could possibly be combined with the existing constrained fp handling.

Johannes Doerfert (@jdoerfert) said he would like to get rid of the existing math intrinsics (llvm.sin, llvm.cos, etc.) since they almost entirely duplicate what is done with the LibFunc handling, with the minor difference being errno handling. He suggested I could get the behavior I want by attaching attributes to the regular function calls and the math intrinsics could be eliminated altogether.

I’m not entirely comfortable with that suggestion for several reasons. (1) A lot of existing optimizations would need to be updated to respect the attribute, and future optimizations could break things if they didn’t know to look for the attributes. (2) I’m not comfortable with the idea of replacing one named function call with a completely different function call. It’s technically doable, and probably happens already in some cases, but it feels to me like something that the IR isn’t specifically allowing. (3) I intend for my function accuracy handling to target various calls that are equivalent to the standard math library call but originate as something different, such as the SYCL or CUDA builtin transcendental functions.

I have a new vision for how to bring this all together, and I think it’s pretty good. I’ll put together a new RFC providing more details and even a patch to go along with it, but for now I’d like to sketch out the basic idea here and ask for feedback.

First, I’d like to introduce a new set of math intrinsics, which I hope will eventually become the only math intrinsics. I want to give them a new name so they don’t inherit any unwanted behavior and the existing intrinsics can be phased out gradually. There will be two key characteristics of these intrinsics: (1) They will be tightly coupled with a new set of call site attributes that are valid only for these intrinsics and have defined default values that apply if the attribute is not present. (2) They will be defined explicitly as being intended to be replaced with alternate implementations that meet the requirements described by the associated attributes.

So for my accuracy use case, I’d imagine a call like this:

  %0 = call double @llvm.fpbuiltin.cos(double %x) #0
  ...
  attributes #0 = { "fp-max-error"="4.0" }

I’d add a wrapper class that allows FPMathBuiltin to act like a first-class instruction, and for any attribute we add it would have an accessor function like FPMathBuiltin::getRequiredAccuracy().
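
A hedged sketch of what that wrapper might look like, modeled on the existing IntrinsicInst subclasses (the names and the default value are assumptions, not settled design):

  #include "llvm/IR/IntrinsicInst.h"

  namespace llvm {
  class FPMathBuiltin : public IntrinsicInst {
  public:
    // Reads the proposed "fp-max-error" call-site attribute; if it is
    // absent, the assumed default applies (a correctly rounded result).
    float getRequiredAccuracy() const {
      Attribute A = getAttributes().getFnAttr("fp-max-error");
      double Ulps = 0.0; // default: correctly rounded
      if (A.isValid())
        (void)A.getValueAsString().getAsDouble(Ulps);
      return static_cast<float>(Ulps);
    }
  };
  } // namespace llvm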

For the first phase of implementation, I imagine using this only for the accuracy case. Later, we could move over the constrained intrinsics by adding support for “fp-rounding-mode” and “fp-exception-behavior” attributes. Any other constraints we need later could be added without requiring a new set of intrinsics.

Once everyone is comfortable with the idea that you have to check the attributes before doing anything with these intrinsics, they could replace the existing math intrinsics.

I’d also like to tie this back to the correctly rounded math library implementations Tue Ly has created. As I understand it, some of these rely on the availability of non-default architecture features like FMA. The mechanism I’m proposing here could be used to select alternate math library implementations. I’m picturing something analogous to clang’s -fveclib. This could potentially have multiple implementations like __cos_cr_avx2, __cos_cr_sse, etc., not to mention vector versions like __cos_cr_f64x4_avx2.
