[RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry

Hello everyone,

The particular motivation for this e-mail is my desire for feedback on how to fix PR17758; but there is a core design issue here, so I'd like a wide audience.

The underlying issue is that, because the semantics of llvm.sqrt are purposefully defined to be different from libm sqrt (unlike all of the other llvm.<libm function> intrinsics) (*), and because autovectorization relies on the vector forms of these intrinsics when vectorizing function calls to libm math functions, we cannot vectorize a libm sqrt() call into a vector llvm.sqrt call. However, in fast-math mode, we'd like to vectorize calls to sqrt, and so I modified Clang to emit calls to llvm.sqrt in fast-math mode for sqrt (and sqrt[fl]). This makes it similar to the libm pow and fma calls, which Clang always transforms into the llvm.pow and llvm.fma intrinsics.

Here's the problem: There is an InstCombine optimization for sqrt (inside visitFPTrunc), and a bunch of optimizations inside SimplifyLibCalls that apply only to the sqrt libm call, and not to the intrinsics. The result, among other things, is PR17758, where fast-math mode actually produces slower code for non-vectorized sqrt calls.

Some questions:

- Is the asymmetry between optimizations performed on libm calls and their corresponding llvm.<libm function> intrinsics intentional, or just due to a lack of motivation?

- Even if unintentional, is this asymmetry in any way desirable (for sqrt in particular, or in general)?

- I can refactor all existing optimizations to be libm-call vs. intrinsics agnostic, but is that the desired solution? If so, any advice on a particularly-nice way to do this would certainly be appreciated.

For example, an alternative solution for PR17758 in particular would be to revert the Clang change, introduce a new intrinsic for sqrt that does match the libm semantics, and have vectorization use that when available.

Another alternative is to revert the Clang change and make autovectorization of libm sqrt -> llvm.sqrt dependent on the NoNaNsFPMath TargetOptions flag (this requires directly exposing parts of TargetOptions to the IR level).

I believe that both of these alternatives also require fixing the inliner to deal properly with fast-math attributes during LTO (unless I can just ignore this for now). This was the objection raised to the TargetOptions solution when I first brought it up.

(*) According to the language reference, the specific difference is, "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for negative numbers other than -0.0 (which allows for better optimization, because there is no need to worry about errno being set). llvm.sqrt(-0.0) is defined to return -0.0 like IEEE sqrt."

Thanks again,
Hal

Hi Hal, all.

I'm not sure why llvm.sqrt is 'special'. Maybe because there is a SSE packed sqrt instruction (SQRTPS) but not e.g. a packed sin instruction AFAIK.

As mentioned in a recent mail to this list, I would like llvm.sqrt to be defined as NaN for argument x < 0. I believe this would bring it more into line with the other intrinsics, and with the libm result, which is NaN for x < 0: sqrt

Cheers,
     Nick

Hi Hal, all.

I'm not sure why llvm.sqrt is 'special'. Maybe because there is a
SSE
packed sqrt instruction (SQRTPS) but not e.g. a packed sin
instruction
AFAIK.

This seems relevant: http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-August/010248.html

Chris, et al., does the decision on how to treat sqrt predate our current way of handling errno?

-Hal

The intention of llvm.sqrt is that it is the same as sqrt(), but doesn't affect errno. If it helps, I think it would be completely reasonable to change this in langref.html:

"Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for ..."

to "... produces an undefined value", with a link back to ##undefined-values.

Transformations that apply to sqrt() should generally apply to llvm.sqrt as well.

-Chris

The intention of llvm.sqrt is that it is the same as sqrt(), but doesn't affect errno. If it helps, I think it would be completely reasonable to change this in langref.html:

"Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for ..."

to "... produces an undefined value", with a link back to ##undefined-values.

I don't think that helps the front-end writers significantly (they'll
still have to check the input to get a sensible total function).

CodeGen and backends can already do whatever they like with the
current semantics, but it looks like we could tighten those up to "...
produces a NaN" without inconveniencing any targets I've found. This
also matches at least OSX and Linux man-pages about the sqrt function.

Cheers.

Tim.

>> Hi Hal, all.
>>
>> I'm not sure why llvm.sqrt is 'special'. Maybe because there is a
>> SSE
>> packed sqrt instruction (SQRTPS) but not e.g. a packed sin
>> instruction
>> AFAIK.
>
> This seems relevant:
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-August/010248.html
>
> Chris, et al., does the decision on how to treat sqrt predate our
> current way of handling errno?

The intention of llvm.sqrt is that it is the same as sqrt(), but
doesn't affect errno. If it helps, I think it would be completely
reasonable to change this in langref.html:

"Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for
..."

to "... produces an undefined value", with a link back to
##undefined-values.

I'm not sure that helps, because it will prevents sqrt + -fno-math-errno (a readnone sqrt) -> llvm.sqrt -- and thus still prevents the use of llvm.sqrt to vectorize sqrt. I think what will help is just saying something like this:

  "Unlike sqrt in libm, after calling llvm.sqrt with a negative argument (except for -0.0), the state of errno, and any other portions of the floating-point environment that are used to record errors, is undefined."

FWIW, we may want to make a similar change for the other llvm.<libm function> intrinsics, at least for the vector versions (as vector ops on many ISAs don't affect the FP state bits like the scalar ops do).

Transformations that apply to sqrt() should generally apply to
llvm.sqrt as well.

And the intent is for this to be true for all of the other llvm.<libm function> intrinsics as well, right?

Thanks again,
Hal

>
>
> >> Hi Hal, all.
> >>
> >> I'm not sure why llvm.sqrt is 'special'. Maybe because there is
> >> a
> >> SSE
> >> packed sqrt instruction (SQRTPS) but not e.g. a packed sin
> >> instruction
> >> AFAIK.
> >
> > This seems relevant:
> > http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-August/010248.html
> >
> > Chris, et al., does the decision on how to treat sqrt predate our
> > current way of handling errno?
>
> The intention of llvm.sqrt is that it is the same as sqrt(), but
> doesn't affect errno. If it helps, I think it would be completely
> reasonable to change this in langref.html:
>
> "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for
> ..."
>
> to "... produces an undefined value", with a link back to
> ##undefined-values.

I'm not sure that helps, because it will prevents sqrt +
-fno-math-errno (a readnone sqrt) -> llvm.sqrt -- and thus still
prevents the use of llvm.sqrt to vectorize sqrt. I think what will
help is just saying something like this:

  "Unlike sqrt in libm, after calling llvm.sqrt with a negative
  argument (except for -0.0), the state of errno, and any other
  portions of the floating-point environment that are used to record
  errors, is undefined."

(to use the correct terminology): "used to record errors" -> "used to record exceptions".

-Hal

"Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for
..."

to "... produces an undefined value", with a link back to
##undefined-values.

I'm not sure that helps, because it will prevents sqrt + -fno-math-errno (a readnone sqrt) -> llvm.sqrt -- and thus still prevents the use of llvm.sqrt to vectorize sqrt. I think what will help is just saying something like this:

"Unlike sqrt in libm, after calling llvm.sqrt with a negative argument (except for -0.0), the state of errno, and any other portions of the floating-point environment that are used to record errors, is undefined.”

Makes sense to me.

FWIW, we may want to make a similar change for the other llvm.<libm function> intrinsics, at least for the vector versions (as vector ops on many ISAs don't affect the FP state bits like the scalar ops do).

Yep.

Transformations that apply to sqrt() should generally apply to
llvm.sqrt as well.

And the intent is for this to be true for all of the other llvm.<libm function> intrinsics as well, right?

Yep!

-Chris

>> "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior
>> for
>> ..."
>>
>> to "... produces an undefined value", with a link back to
>> ##undefined-values.
>
> I'm not sure that helps, because it will prevents sqrt +
> -fno-math-errno (a readnone sqrt) -> llvm.sqrt -- and thus still
> prevents the use of llvm.sqrt to vectorize sqrt. I think what will
> help is just saying something like this:
>
> "Unlike sqrt in libm, after calling llvm.sqrt with a negative
> argument (except for -0.0), the state of errno, and any other
> portions of the floating-point environment that are used to
> record errors, is undefined.”

Makes sense to me.

>
> FWIW, we may want to make a similar change for the other llvm.<libm
> > intrinsics, at least for the vector versions (as vector
> ops on many ISAs don't affect the FP state bits like the scalar
> ops do).

Yep.

Okay, good. I'll construct a documentation patch for review shortly.

>
>>
>> Transformations that apply to sqrt() should generally apply to
>> llvm.sqrt as well.
>
> And the intent is for this to be true for all of the other
> llvm.<libm function> intrinsics as well, right?

Yep!

Okay; I'll work on refactoring the existing optimizations to apply to both.

-Hal