[PATCH 0/2] More trig builtins

I've implemented asin in terms of acos (which was sent to the list
a few days ago).

Tangent is implemented using a sin and a square root instead of sin(x)/cos(x).
sin(x)/cos(x) which produces much more verbose assembly than using the sqrt.

That being said, I am not sure if there's a better way to implement tan(x)
while still keeping the required precision. If someone has a better option,
I'm all ears. This implementation passes the piglit unit tests, at least.

I haven't checked if llvm.sin and llvm.cos intrinsics have enough precision for
float when used together (they didn't for just calculating sin/cos, so I figured
using both intrinsics together would just increase the error).

asin(x) = PI/2 - acos(x)

We already have an implementation of acos(x), so just use that.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Uses the algorithm:
tan(x) = sin(x) / sqrt(1-sin^2(x))

An alternative is:
tan(x) = sin(x) / cos(x)

Which produces more verbose bitcode and longer assembly.

Either way, the generated bitcode seems pretty nasty and a more optimized
but still precise-enough solution is welcome.

Signed-off-by: Aaron Watry <awatry@gmail.com>

asin(x) = PI/2 - acos(x)

LGTM.

just out of curiosity.
How does the precision compare to just using
atan2(x, ( sqrt(1-x^2) ) )
from 5) of your acos patch?

I assume (PI/2 -) does not shift the balance.

jan

Uses the algorithm:
tan(x) = sin(x) / sqrt(1-sin^2(x))

An alternative is:
tan(x) = sin(x) / cos(x)

Which produces more verbose bitcode and longer assembly.

this is weird. both EG and SI have both sin and cos instructions. Is the
input normalization code so bad that we are better of doing MUL+SUB+SQRT
instead?

Uses the algorithm:
tan(x) = sin(x) / sqrt(1-sin^2(x))

An alternative is:
tan(x) = sin(x) / cos(x)

Which produces more verbose bitcode and longer assembly.

this is weird. both EG and SI have both sin and cos instructions. Is the
input normalization code so bad that we are better of doing MUL+SUB+SQRT
instead?

Those are only useful for native_sin / native_cos. For the standard function, they are far from precise enough. The current (float) sin implementation should be correct, though native_sin right now is still defined to just be the regular sin function instead of the LLVM intrinsic

>> Uses the algorithm:
>> tan(x) = sin(x) / sqrt(1-sin^2(x))
>>
>> An alternative is:
>> tan(x) = sin(x) / cos(x)
>>
>> Which produces more verbose bitcode and longer assembly.
>
> this is weird. both EG and SI have both sin and cos instructions. Is the
> input normalization code so bad that we are better of doing MUL+SUB+SQRT
> instead?

Those are only useful for native_sin / native_cos. For the standard
function, they are far from precise enough. The current (float) sin
implementation should be correct, though native_sin right now is still
defined to just be the regular sin function instead of the LLVM
intrinsic

oh I didn't know the hw implementaion was so imprecise. In that case it
makes sense. Although I wonder why it ended up needing twice as many
instructions. it looks to me that sin and cos don't differ in more than
4 operations, so CSE should have eliminated most of it.
either way it's not going to be more efficient than this patch.

LGTM

asin(x) = PI/2 - acos(x)

LGTM.

just out of curiosity.
How does the precision compare to just using
atan2(x, ( sqrt(1-x^2) ) )
from 5) of your acos patch?

I assume (PI/2 -) does not shift the balance.

The precision of both implementations looks ok. The existing piglit
tests pass when tightened down to a tolerance of 1 ULP and fail at 0
ULP. Given that the spec gives us 4 ULP as allowed variance, it seems
like we're good.

I did the following, alternate implementations and did a quick check
on bitcode length and number of instructions on evergreen. It seems
like the second variation gives us sufficient precision and fewer
hardware instructions for at least the tested architecture (CEDAR on
latest svn llvm/clang).

If you prefer, I can commit the second implementation instead.

--Aaron

diff --git a/generic/lib/math/asin.inc b/generic/lib/math/asin.inc
index f1a65b3..661663a 100644
--- a/generic/lib/math/asin.inc
+++ b/generic/lib/math/asin.inc
@@ -5,7 +5,15 @@
#endif

_CLC_OVERLOAD _CLC_DEF __CLC_GENTYPE asin(__CLC_GENTYPE x) {
+#if 0
+ //Passes with 1ulp on evergreen, fails at 0
+ //(float16): 1786 DW on CEDAR, 22 GPRs, %694 is highest numbered
bitcode instr
   return ( (__CLC_GENTYPE)PI2 - acos(x));
+#else
+ //Passes with 1ulp on evergreen, fails at 0
+ //(float16): 1622 DW on CEDAR, 22 GPRs, %691 is highest numbered
bitcode instr
+ return atan2(x, sqrt((__CLC_GENTYPE)1.0 -(x*x) ) );
+#endif
}

#undef PI2

>> asin(x) = PI/2 - acos(x)
>
> LGTM.
>
> just out of curiosity.
> How does the precision compare to just using
> atan2(x, ( sqrt(1-x^2) ) )
> from 5) of your acos patch?
>
> I assume (PI/2 -) does not shift the balance.
>

The precision of both implementations looks ok. The existing piglit
tests pass when tightened down to a tolerance of 1 ULP and fail at 0
ULP. Given that the spec gives us 4 ULP as allowed variance, it seems
like we're good.

I did the following, alternate implementations and did a quick check
on bitcode length and number of instructions on evergreen. It seems
like the second variation gives us sufficient precision and fewer
hardware instructions for at least the tested architecture (CEDAR on
latest svn llvm/clang).

If you prefer, I can commit the second implementation instead.

I'm ok with both, I'll leave the decision to you.

jan