CUDA and libm

Hi, I'm looking for some advice on the right way to implement math
functions in device-side CUDA code. I'm pretty new here, so please
forgive the likely many gross misunderstandings below.

There are three classes of things I'm concerned with:

* Functions declared in math.h (e.g. sinf),
* Builtin math functions (e.g. __builtin_sinf), and
* LLVM intrinsic math instructions (e.g. llvm.sin.f32).

At the moment the way this works is that the CUDA headers provided by
nvidia define inline functions along the lines of

  __device__ float sinf(float v) {
  #ifdef FAST_MATH
    return __nv_fast_sinf(v);
  #else
    return __nv_sinf(v);
  #endif
  }

This function is C++, *not* extern "C". __nv_sinf and __nv_fast_sinf
are defined in libdevice [1], a bitcode library provided by nvidia.

Some functions inside libdevice, e.g. __nv_fast_sinf, do nothing more
than call the equivalent llvm nvvm intrinsic, e.g.
llvm.nvvm.sin.approx.f. This then gets lowered to the equivalent PTX
machine instruction. Other functions in libdevice, such as __nv_sinf,
do nontrivial computation explicitly written out in the llvm bitcode.
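
For concreteness, here is a hedged sketch of roughly what the fast
variant amounts to, written as CUDA C++ rather than bitcode for
readability (the wrapper name is mine, and the clang builtin
__nvvm_sin_approx_f is just the source-level spelling of
llvm.nvvm.sin.approx.f; the real libdevice body is hand-written IR,
not this source):

  // Sketch: the "fast" flavor is one hardware approximation op.
  __device__ float sketch_fast_sinf(float v) {
    return __nvvm_sin_approx_f(v);  // lowers to PTX sin.approx.f32
  }
  // __nv_sinf, by contrast, does real range reduction plus a
  // polynomial, spelled out operation by operation in the bitcode.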

Following so far? If so, you may note that the state of the world is
rather incomplete! Here are the problems I see at the moment:

* Many builtins don't work. For example, clang emits __builtin_sinf
as a call to sinf [2]. But that function doesn't exist; nvidia's
not-extern-"C" ::sinf is not the same symbol. In particular this means
that libstdc++ is not going to work well, since it implements e.g.
std::sin(float) as a call to __builtin_sinf (see the first sketch
after this list).

* Many math optimizations aren't going to work (I think), because e.g.
SimplifyLibCalls checks for function names like "log" [3], but by the
time we get there, we're calling __nv_log (see the second sketch after
this list).

* (At least some) llvm intrinsics sort of work, but e.g. llvm.sin.f32
gets lowered to the PTX instruction sin.approx.f32, while ::sinf from
the CUDA headers only does this transformation if fast-math is
enabled. Maybe this is sound if we only emit llvm.sin.f32 when
fast-math is enabled; I dunno.
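
To see the first problem end to end, consider this hedged sketch (the
libstdc++ line is paraphrased from memory, not its exact source):

  #include <cmath>
  // libstdc++'s <cmath> float overload is roughly:
  //   inline float std::sin(float x) { return __builtin_sinf(x); }
  __device__ float device_sin(float x) {
    return std::sin(x);  // -> __builtin_sinf(x) -> a call to the C
                         // symbol "sinf", which nothing on the device
                         // side defines; CUDA's ::sinf is a mangled
                         // C++ function, a different symbol entirely.
  }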

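And the second problem, sketched with powf (the particular rewrite is
illustrative; SimplifyLibCalls has a pow(x, 2) => x*x style
simplification, keyed on the libm name of the callee):

  __device__ float square(float x) {
    // Once the CUDA headers inline ::powf, the IR calls __nv_powf,
    // a name SimplifyLibCalls does not recognize, so the rewrite to
    // a plain multiply never fires.
    return powf(x, 2.0f);
  }
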
My question for the list is about the right way to fix these problems.
It seems to me that since the optimizer explicitly uses knowledge of
the various math functions, we shouldn't define inline versions of
them. Instead, we should leave them as plain calls to e.g. sinf until
we lower to nvvm in llvm. We have a header baked into clang that
already disables some CUDA headers; in theory we should be able to
disable CUDA's math_functions.h using the same mechanism.
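
As a hedged illustration of that direction (the include-guard name and
declarations below are hypothetical, not clang's actual wrapper
header), the baked-in header could pre-define CUDA's include guard and
declare the libm entry points with no bodies, so the IR keeps plain
calls to names the optimizer knows:

  // Hypothetical fragment of a clang CUDA wrapper header:
  #define __MATH_FUNCTIONS_H__  // pretend math_functions.h was included
  extern "C" {
    __device__ float sinf(float);   // no body: stays a plain call to
    __device__ double sin(double);  // "sinf"/"sin" until lowering
  }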

When lowering to nvvm, we can make the determination as to whether we
want to call a function defined in libdevice, invoke an nvvm
intrinsic, or whatever. If we call into libdevice we'll want to rerun
some optimization passes, but I presume we can order these passes
appropriately.

In order to do this, we'll need libdevice to be available to llvm.
It's not clear to me whether it is at the moment; clang sees it, but
I'm not sure if clang passes both its generated IR and all of
libdevice to llvm, or if it just copies the relevant definitions from
libdevice into the IR it sends to llvm. If it's the latter, we could
always copy *all* of libdevice into the generated IR. But I hope we
could do better.

I think this proposal lets us solve all three problems above, but
again I'm not sure I'm not missing something, or if there's a more
canonical way to do this. Any feedback is appreciated!

-Justin

[1] https://docs.nvidia.com/cuda/libdevice-users-guide/index.html
[2] https://github.com/llvm-mirror/clang/blob/e2636ac0bad65451c3eb6272d7ab3abbba96da17/lib/CodeGen/CGBuiltin.cpp#L1971
[3] https://github.com/llvm-mirror/llvm/blob/b3bc79d5556108307026be07e7eaa644cce041ab/lib/Transforms/Utils/SimplifyLibCalls.cpp

From: "Justin Lebar via cfe-dev" <cfe-dev@lists.llvm.org>
To: cfe-dev@lists.llvm.org, c-gpu-team@google.com
Sent: Sunday, January 24, 2016 1:29:10 PM
Subject: [cfe-dev] CUDA and libm

> * Many math optimizations aren't going to work (I think), because
> e.g. SimplifyLibCalls checks for function names like "log" [3], but
> by the time we get there, we're calling __nv_log.

We already have a solution to this problem. When you instantiate TLI,
you can call TLI::setAvailableWithName to set an alternate name for
some of the library calls. We already do some of this in
lib/Analysis/TargetLibraryInfo.cpp, in initialize().
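
A minimal sketch of that mechanism, assuming the circa-2016 API
(TargetLibraryInfoImpl::setAvailableWithName and the LibFunc::Func
enum); the triple check and the particular entries are illustrative,
not an actual patch:

  // Inside initialize() in lib/Analysis/TargetLibraryInfo.cpp:
  if (T.getArch() == Triple::nvptx || T.getArch() == Triple::nvptx64) {
    // "sinf" is available, but under a different symbol name, so
    // passes like SimplifyLibCalls can keep reasoning about it.
    TLI.setAvailableWithName(LibFunc::sinf, "__nv_sinf");
    TLI.setAvailableWithName(LibFunc::cosf, "__nv_cosf");
  }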

-Hal

Thanks a lot, Hal. I've sent a patch disabling the standard library
for NVVM in TLI for now, since none of those functions is guaranteed
to be available. See D16604 for details about why we couldn't just
map e.g. __nv_sin to sin.
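
A hedged sketch of what that disabling could look like (illustrative,
not necessarily the shape of the actual D16604 patch):

  // In initialize(): device code links no C library, so mark every
  // libcall unavailable when targeting NVPTX.
  if (T.getArch() == Triple::nvptx || T.getArch() == Triple::nvptx64) {
    TLI.disableAllFunctions();
  }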

-Justin