[RFC] All the math intrinsics

I think we should add LLVM intrinsics for the following operations:

My primary motivation here is HLSL support in clang, but these are generally useful anywhere we (1) need vector expansions of these operations, (2) want to handle these operations even in freestanding environments, or (3) have high-level operations in the target that we want to preserve.

HLSL and the DirectX and SPIR-V backends overlap with all three of these cases.

This list mostly consists of C stdlib math functions that have corresponding operations in HLSL, with the exception of dot, rsqrt, and clamp. These three don’t have C standard equivalents but are similarly generic and come up a lot in GPUs. Note that we could separate the non-C ones into a separate proposal if folks want separate justification for those three.

Some History

While we’ve had math intrinsics like llvm.sin since time immemorial, there’s generally been contention around adding more of them. There are a couple of reasons for this:

  1. For most targets these are just a libcall anyway, so it’s kind of pointless.
  2. There are a bunch of places where we (used to) treat intrinsics specially rather than look at function attributes.

IMO this has been keeping us in a place that’s the worst of both worlds.

Why Now?

There are two main things that are different today from some of the times we’ve discussed this in the past.

  1. We have targets that have these operations. The SPIR-V and DirectX backends need to maintain the high level function in their output.

  2. We have languages that want vector versions of all of these operations; OpenCL and HLSL both need them.

Additionally, any approach to handling this differently is made awkward by the fact that we already have intrinsics for many of these math functions. If we don’t add these intrinsics, a comprehensive lowering solution for the DirectX or SPIR-V backend has to handle both library-call recognition and intrinsics, whereas an intrinsics-only solution is much simpler.

Alternatives

The most reasonable alternative to this solution would be to remove the set of math intrinsics that currently exist, and settle on a library call recognition based approach across the board. While I do think there are advantages to this idea, the scope of the change would be enormous, and I don’t think it’s practical to hold up progress on a redesign of parts of LLVM that have been in place for nearly 20 years. Also note that if such a redesign ever did occur having a handful more intrinsics to deal with wouldn’t make it substantially more or less difficult.

The other alternative to this solution is just much worse: implement all of these math functions as target intrinsics in the DirectX and SPIR-V backends, and in any other backend that wants them. This would result in a huge amount of duplication with very little benefit IMO.

Conclusion

Adding these intrinsics is pragmatic and avoids accruing technical debt from the hoops that we’d have to jump through without them. We should add these 16 intrinsics generically to LLVM.


We already have copysign (llvm.copysign), modf (the frem instruction), ldexp (llvm.ldexp), and frexp (llvm.frexp).

Is rsqrt the same as the C23 function?

Can “clamp” be expressed in terms of llvm.maxnum+llvm.minnum?

Can “dot” be expressed in terms of fmul+llvm.vector.reduce.fadd?
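For concreteness, the expansion this question refers to can be sketched in scalar Python terms. This is an illustration of the semantics only, not IR or an LLVM API; the helper name `dot_expanded` is made up:

```python
# Illustrative semantics only: "dot" as an elementwise multiply followed by an
# ordered add-reduction, mirroring an fmul on the two vectors followed by
# llvm.vector.reduce.fadd with a 0.0 start value.
def dot_expanded(a, b):
    products = [x * y for x, y in zip(a, b)]  # fmul <N x float> %a, %b
    total = 0.0                               # start operand of the reduction
    for p in products:                        # llvm.vector.reduce.fadd (ordered)
        total += p
    return total

print(dot_expanded([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```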


I think the historical resistance to adding the trigonometric functions is based around a few things:

  • CPU-based backends don’t have any native instructions that are helpful for expanding them.
  • The only commonly available implementation on CPU targets is in libc, and on Linux that modifies errno (which causes issues because the intrinsics are marked readnone).
  • APFloat can’t constant-fold these functions.

But I think it’s fair to say we should do something to accommodate GPU backends, which do tend to need special handling in the backend.


I think these intrinsics could have default expansions, like the one you gave. Hopefully that would be good enough for CPU.

For “dot” in particular, we should allow integer versions as well.

Adding a new “llvm.dot” intrinsic, as opposed to making the frontend just emit “fmul” and “llvm.vector.reduce.fadd”, means all the existing optimizations that understand “fmul” and “llvm.vector.reduce.fadd” need to be rewritten to also handle llvm.dot. Adding that handling is more work than just adding a pattern-match for this pattern in the relevant backends.

We’ve occasionally added intrinsics for things which could be expressed other ways because we ran into specific issues in optimizations (see llvm.smax etc.), but I don’t think that applies here.

This is under the assumption that the two formulations are actually equivalent, of course. (The linked documentation doesn’t specify how the rounding for “dot” is supposed to work.)


“Dot” is a popular operation lately with convolutions showing up everywhere. Some architectures have some sort of hardware support for it (Hexagon has it for integers, AMDGPUs have it for floats and integers, for example).

It’s easier to teach optimizations about “dot” than it is to synthesize it from code that has gone through an optimizer.


I must have missed these when I was going through the list of what was available. Thanks!

Yes it is. I hadn’t realized this had been added in C23.

It can be, but it’s fairly difficult to pattern match it back to clamp from there. You can implement clamp(x, minVal, maxVal) simply via something like this:

min(max(x, minVal), maxVal)

However, you can’t necessarily convert that back to clamp unless you can prove that minVal < maxVal.
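A small Python sketch makes this concrete (the helper name is hypothetical): the min/max expansion and clamp agree when minVal <= maxVal, but with inverted bounds the expansion unconditionally returns maxVal, so pattern-matching it back to clamp requires proving the ordering of the bounds.

```python
# min(max(x, lo), hi): the standard expansion of clamp(x, lo, hi).
def clamp_expanded(x, lo, hi):
    return min(max(x, lo), hi)

# Well-ordered bounds: behaves exactly like clamp.
print(clamp_expanded(-3.0, 0.0, 10.0))  # 0.0
print(clamp_expanded(5.0, 0.0, 10.0))   # 5.0

# Inverted bounds (lo > hi): the expansion pins every input to hi,
# regardless of x, so it no longer trivially corresponds to clamp.
print(clamp_expanded(5.0, 10.0, 0.0))   # 0.0
print(clamp_expanded(20.0, 10.0, 0.0))  # 0.0
```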

It can, and recovering dot from that pattern is probably easier than in the clamp case, but as @kparzysz mentions, this is becoming a more ubiquitous operation all the time.


For something like “dot”, how are we handling aspects like the precision of the accumulator (often different from the precision of the multiplication)?

I would like to see intrinsics for all of the IEEE 754 operations eventually (well, at least those that are reflected in C23; the operations in sections 9.3-9.5 are currently languishing in a floating-point TS after WG14 declined to adopt them for C23) [0]. So in principle, I’m not opposed to this change. That said…

While we’ve had math intrinsics like llvm.sin since time immemorial, there’s generally been contention around adding more of them.

I’m not sure you’ve correctly captured what I find to be the main point of contention here: the LLVM intrinsics don’t share the semantics of the C library functions (the intrinsics don’t touch errno, the C library functions do [1]), so it’s not always safe to use intrinsics, so you need to end up supporting both intrinsics and libcalls to do things ‘right’. (Many of the new trigonometric optimizations people have been adding recently only support one or the other, not both).

Now, we’ve known about the intrinsics/libcall issue for a decade now, without much progress on fixing it. I don’t believe it’s reasonable to demand waiting for a fix that is unlikely to be forthcoming here. The better solution to the problem in the long run, I believe, is to (eventually) normalize everything into intrinsics, which helps sidestep annoyances like the frexp pointer argument issue (as C lacks multiple return values, which LLVM can fake better) or what-type-is-long-double-on-this-target annoyances. If intrinsics are likely to be the future, there’s even less reason to object to adding new intrinsics now.

This list mostly consists of C stdlib math functions that have corresponding operations in HLSL, with the exception of dot, rsqrt, and clamp. These three don’t have C standard equivalents but are similarly generic and come up a lot in GPUs. Note that we could separate the non-C ones into a separate proposal if folks want separate justification for those three.

rsqrt was added in C23. TS 18661-4 has reduc_sumprod which implements an IEEE 754 dot operator, although it’s not in a convenient vector-based form as C lacks hardware vector types.

[0] I actually have a document that maps between IEEE 754 operations, C version(s), LLVM intrinsics, and APFloat methods, if anyone is interested.

[1] Yes, this means libcalls lowering these to libm are wrong.

For AMDGPU the docs aren’t explicit, but it may be the case that f16 dot uses 32-bit accumulator (@arsenm will probably know better). They do state that all such dot operations treat subnormal values as zeros.

The integer case on Hexagon worked on four i8 (plus maybe two i16) at a time, resulting in a vector of i32. From then on you’d have to do rotates and adds. Either way any overflow was impossible for i8/i16.

As for f16 dot specifically, AFAIK RDNA3 has a native f16 dot (V_DOT2_F16_F16), but the precision is really bad; generally, do not use this one. CDNA3 has a variant that accumulates to fp32 and is usable (V_DOT2C_F32_F16), with the caveat that it does not support denormals or rounding modes.
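The effect of the accumulator width is easy to demonstrate. The following Python sketch (function names and input values invented for illustration) simulates an f16 dot two ways using the struct module’s IEEE binary16 (‘e’) format: once rounding every partial sum to f16, and once keeping the accumulator wide and rounding only at the end:

```python
import struct

def to_f16(x):
    # Round a Python float to the nearest IEEE binary16 value.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def dot_f16_accumulator(a, b):
    # Every product and every partial sum is rounded to f16, like a pure-f16 dot.
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(x * y))
    return acc

def dot_wide_accumulator(a, b):
    # Products are still f16, but the accumulator stays wide; round once at the end.
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_f16(x * y)
    return to_f16(acc)

ones = [1.0] * 4096
# Above 2048, f16 values are spaced 2 apart, so acc + 1.0 rounds back down
# (ties-to-even) and the running sum stalls at 2048.
print(dot_f16_accumulator(ones, ones))   # 2048.0
print(dot_wide_accumulator(ones, ones))  # 4096.0
```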

It’s not obvious to me that the math intrinsics are a feature. Our optimization pipeline is inconsistent about handling them, e.g. constant folding different arguments when it’s a function vs an intrinsic.

An intrinsic is a known function whose address is not taken (modulo a proposal to remove that invariant which I think was rejected last time around) and which has type overloads burned into it more directly than C++ name mangling.

For math, I think we have some additional things, like the intrinsic being assumed not to set errno, which is on dubious grounds when said intrinsic is lowered to a libc call that does. I think rounding is underspecified as well: maybe they’re perfectly rounded, maybe they’re very imperfectly rounded, and that might differ between backends.

Somewhat related: a little while ago amdgpu had a whole load of intrinsics for math functions all of its own.

I am nervous about the constant folding part. It’s unlikely that llvm will do the same rounding as whatever libc would have done. I’m also nervous about the freestanding behaviour and generally assuming that the input language is hosted C in terms of turning functions called sin into intrinsics.

Is the actual missing feature driving this that we have no way to mark an IR function as existing for various types? We can’t take @sin, see that it’s being called in a loop of length 8, and turn it into a call to @sin.v8f64 in the vectoriser without also making it an intrinsic, since IR functions don’t have the sort of array of types in their structure that intrinsics do.

I wonder how scary a change that would be. Each declaration/definition of a function would have specific types, printed in the same foo.f32.i64 style as intrinsics. For math functions, we’d lower them in the same fashion we would anyway.

I think erasing that limitation would have uses elsewhere, and it would allow us to discard all the math intrinsics if we chose to. “Assume no errno” or similar is very like a function attribute.

Thoughts?

I talked to @jcranmer about this at the floating point working group meeting earlier this week. I’m not convinced that we ever meant to promise that the intrinsics won’t set errno, even though we explicitly say that in the lang ref definition. That wording was mostly introduced by ⚙ D39304 [IR] redefine 'reassoc' fast-math-flag and add 'trans' fast-math-flag and I can’t find any comments in that review that seem like they would have led to that definition.

I think what we meant to say was that we don’t guarantee that the intrinsic will set errno. I don’t think we’ve ever had an x86-backend implementation that respected the documented behavior of not setting errno.

In practice, clang will generate the intrinsics for math library calls that have intrinsics available if you pass -fno-math-errno on the command line. This is very useful because we can’t do things like vectorization if the user is relying on errno being set, so if we’re using a vector library that has vector implementations available, we can vectorize calls to the intrinsic, while the non-intrinsic call tells us we can’t do that (Compiler Explorer). I suppose the front end could mark these calls with the no-builtins attribute whenever errno is required, but I don’t know what other problems that might introduce.

On the other hand, if the user’s code is checking errno for something other than math error handling, and we stomp on it because we hoisted an intrinsic that we didn’t even need to set errno in the first place, that’s not so great.

By the way, regarding the hosted vs. freestanding issue, clang does use the no-builtins attribute for that, though I believe C23 adds the common math library functions, conditionally, as part of the freestanding environment.

In any event, I am in favor of adding the proposed intrinsics.

We can, if we so choose, not mention errno at all for math intrinsics.
If we do so, we could always use the intrinsics, and get rid of the library call handling for good.
If the frontend/user is aware errno is to be ignored (-fno-math-errno), readnone is placed on the call site.
This does not solve the issue that we might pick an errno setting implementation for a non-errno setting intrinsic, but it also doesn’t make it worse.
What this would do is to unify our handling of “math” in the IR.

FWIW, I always advocated against more intrinsics and in favor of removing the ones we have.
Since that never got enough traction, I am happy to support the “all intrinsic” approach.

I think we’ve got a lot of work to do to go “all intrinsics” but I do find it appealing. In addition to the standard libm calls, which we may or may not recognize via the LibFunc interface, we have the vector library handling, which uses library-specific function names when the calls have been vectorized rather than vector forms of the intrinsics. There are also the vector-predicated intrinsics and the constrained intrinsics.

Then there’s the problem that programming models like SYCL, CUDA, and OpenCL have their own builtin math library functions, which may need to go through some additional layer like SPIR-V with all their semantics intact. And, of course, you have other languages like Julia and Fortran with the same basic set of functions but possibly additional semantic differences.

Finally, if we want to mix correctly rounded implementations with implementations that aren’t necessarily correctly rounded, that’s another wrinkle. Speaking of which, I forgot to mention that I completely agree with the concern @JonChesterfield raised about constant folding. If we don’t know that the function being called returns correctly rounded results or we don’t know the rounding mode that will be in use at execution time, we shouldn’t be constant folding library calls without the ‘afn’ fast-math flag set.

I think we would all agree that finding a way to unify all of these representations would be fantastic and that tying everything to the standard C math library is surely not a good solution.

It seems to me that what we need is a set of intrinsics that represent, at least, the full set of IEEE-754 recommended operations and have an extensible capability to describe semantic requirements such as vector predication, errno support, exception handling, rounding mode, required accuracy, domain requirements, and whatever else comes along. I think this can be done with a combination of attributes and operand bundles, but only if we have some mechanism to let optimizations indicate which attributes they know how to handle and tell them when an intrinsic has attributes they don’t know about.

If this sounds at all familiar, it’s because I proposed this a couple of years ago. The basic shape of this came out of discussions I had at the 2022 LLVM DevMeeting in San Jose and was described here: [RFC] Floating-point accuracy control - #20 by andykaylor. I also posted a preliminary implementation here: ⚙ D138867 [RFC] Add new intrinsics and attribute to control accuracy of FP calls. Of particular interest in that review was the idea of adding a function, hasUnrecognizedFPAttrs, to the FPBuiltinIntrinsic interface.

The proposal didn’t seem to have a lot of support at the time, but I implemented it in the intel/llvm fork that we use for SYCL development, and we’re using it there to select different implementations based on accuracy requirements for SYCL and OpenMP offload. We want to merge all of the SYCL support into the main LLVM repo over time, so I’m eventually going to need some solution for this problem. Of course, I’m happy to adapt it as needed to make it general enough to support the needs of all LLVM-based targets.

It seems to me that what we need is a set of intrinsics that represent, at least, the full set of IEEE-754 recommended operations and have an extensible capability to describe semantic requirements such as vector predication, errno support, exception handling, rounding mode, required accuracy, domain requirements, and whatever else comes along. I think this can be done with a combination of attributes and operand bundles, but only if we have some mechanism to let optimizations indicate which attributes they know how to handle and tell them when an intrinsic has attributes they don’t know about.

This is basically the model I’ve also been leaning towards: just unify everything into one set of intrinsics and use attributes, metadata, and/or operand bundles to differentiate between slightly-different semantics such as errno. (Prototyping using operand bundles for strictfp is still on my TODO list.) That llvm-libc is working towards a complete set of correctly-rounded implementations of all the operations even gives us a starting point to correctly lower attribute combinations that the host target isn’t natively capable of supporting.

It sounds to me like there’s a general consensus that adding these intrinsics is the right direction, or at least not a harmful direction to go. There are some questions on aspects like “what’s allowed w.r.t. constant folding?” and “when can we replace operations with these intrinsics (be it library functions or instruction sequences)?”, but I think those are somewhat orthogonal to actually adding the intrinsics. We can continue to explore improving those situations with attributes/metadata/operand bundles as needed in the future.

So I believe the next steps are individual reviews for the intrinsics we intend to add, as listed in the original post.

It has been a week, and I don’t see any disagreement, so I’ve implemented the tan intrinsic in four stages: HLSL support in clang, plus the DXIL, SPIR-V, and x86 backends.

  1. [clang][hlsl] Add tan intrinsic part 1 by farzonl · Pull Request #90276 · llvm/llvm-project · GitHub
  2. [DXIL] Add tan intrinsic Part 2 by farzonl · Pull Request #90277 · llvm/llvm-project · GitHub
  3. [SPIRV] Add tan intrinsic part 3 by farzonl · Pull Request #90278 · llvm/llvm-project · GitHub
  4. [x86] Add tan intrinsic part 4 by farzonl · Pull Request #90503 · llvm/llvm-project · GitHub