The particular motivation for this e-mail is my desire for feedback on how to fix PR17758; but there is a core design issue here, so I'd like a wide audience.
The underlying issue is that, because the semantics of llvm.sqrt are purposefully defined to be different from libm sqrt (unlike all of the other llvm.<libm function> intrinsics) (*), and because autovectorization relies on the vector forms of these intrinsics when vectorizing function calls to libm math functions, we cannot vectorize a libm sqrt() call into a vector llvm.sqrt call. However, in fast-math mode, we'd like to vectorize calls to sqrt, and so I modified Clang to emit calls to llvm.sqrt in fast-math mode for sqrt (and sqrt[fl]). This makes it similar to the libm pow and fma calls, which Clang always transforms into the llvm.pow and llvm.fma intrinsics.
Here's the problem: There is an InstCombine optimization for sqrt (inside visitFPTrunc), and a bunch of optimizations inside SimplifyLibCalls that apply only to the sqrt libm call, and not to the intrinsics. The result, among other things, is PR17758, where fast-math mode actually produces slower code for non-vectorized sqrt calls.
- Is the asymmetry between optimizations performed on libm calls and their corresponding llvm.<libm function> intrinsics intentional, or just due to a lack of motivation?
- Even if unintentional, is this asymmetry in any way desirable (for sqrt in particular, or in general)?
- I can refactor all existing optimizations to be libm-call vs. intrinsics agnostic, but is that the desired solution? If so, any advice on a particularly-nice way to do this would certainly be appreciated.
For example, an alternative solution for PR17758 in particular would be to revert the Clang change, introduce a new intrinsic for sqrt that does match the libm semantics, and have vectorization use that when available.
Another alternative is to revert the Clang change and make autovectorization of libm sqrt -> llvm.sqrt dependent on the NoNaNsFPMath TargetOptions flag (this requires directly exposing parts of TargetOptions to the IR level).
I believe that both of these alternatives also require fixing the inliner to deal properly with fast-math attributes during LTO (unless I can just ignore this for now). This was the objection raised to the TargetOptions solution when I first brought it up.
(*) According to the language reference, the specific difference is, "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for negative numbers other than -0.0 (which allows for better optimization, because there is no need to worry about errno being set). llvm.sqrt(-0.0) is defined to return -0.0 like IEEE sqrt."