[PATCH] math: Implement remainder(x, y)

Mostly ported from the amd-builtins branch.

The amd-builtins branch uses __amdil_improved_fdiv_f32 and FTZ which aren't
available in generic CLC.

__amdil_improved_fdiv_f32 points to native_divide which does
native_recip(y)*x.

Since we don't have native_divide or native_recip yet, I've just stuck an
actual division here.
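
For illustration only, the two forms can be compared in plain C. This is a rough sketch, not the patch's code: recip_sketch, divide_via_recip and divide_plain are hypothetical names, and the reciprocal here is computed at full precision, whereas a real native_recip may be less accurate.

```c
/* Hypothetical stand-in for native_recip: a plain 1/y at full precision.
   A real native_recip may return a lower-precision approximation. */
static float recip_sketch(float y) { return 1.0f / y; }

/* The shape described for native_divide: reciprocal of y, times x. */
static float divide_via_recip(float x, float y) { return recip_sketch(y) * x; }

/* The fallback used for now: an actual division. */
static float divide_plain(float x, float y) { return x / y; }
```

The two agree whenever 1/y is exactly representable (y = 4, for example). For other inputs they can differ in the last bit, because the reciprocal form rounds twice.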

I've taken a shot at a replacement for FTZ(x), but feel free to suggest
alternatives.

Tested via piglit on a Radeon HD 7850 using the tests just sent to that list.

Signed-off-by: Aaron Watry <awatry@gmail.com>

> Mostly ported from the amd-builtins branch.
>
> The amd-builtins branch uses __amdil_improved_fdiv_f32 and FTZ which aren't
> available in generic CLC.
>
> __amdil_improved_fdiv_f32 points to native_divide which does
> native_recip(y)*x.
>
> Since we don't have native_divide or native_recip yet, I've just stuck an
> actual division here.

Those can be trivially implemented as the regular divide. A refinement would be to use the intrinsic (that should only be needed for f64; for f32/f16 it should just happen for amdgcn).

> I've taken a shot at a replacement for FTZ(x), but feel free to suggest
> alternatives.

A denormal flush function should be implemented with llvm.canonicalize (for subtargets where f32 denormals are off by default, which is always true now but will probably change for VI). It is currently only implemented for amdgcn in the backend.

-Matt