Hi, I'm looking for some advice on the right way to implement math functions in device-side CUDA code. I'm pretty new here, so please forgive the likely many gross misunderstandings below.

There are three classes of things I'm concerned with:

* Functions declared in math.h (e.g. sinf),
* Builtin math functions (e.g. __builtin_sinf), and
* LLVM intrinsic math instructions (e.g. llvm.sin.f32).

At the moment the way this works is that the CUDA headers provided by nvidia define inline functions along the lines of

__device__ float sinf(float v) {
#ifdef FAST_MATH
  return __nv_fast_sinf(v);
#else
  return __nv_sinf(v);
#endif
}

This function is C++, *not* extern "C". __nv_sinf and __nv_fast_sinf are defined in libdevice [1], a bitcode library provided by nvidia. Some functions inside libdevice, e.g. __nv_fast_sinf, do nothing more than call the equivalent llvm nvvm intrinsic, e.g. llvm.nvvm.sin.approx.f, which then gets lowered to the equivalent PTX instruction. Other functions in libdevice, such as __nv_sinf, do nontrivial computation written out explicitly in the llvm bitcode.
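Concretely, the thin-wrapper style amounts to something like the following IR. This is a sketch based on the description above, not the actual contents of libdevice:

```llvm
; Illustrative only -- not the real libdevice bitcode.
; The wrapper is a one-line forward to the nvvm intrinsic:
define float @__nv_fast_sinf(float %x) {
entry:
  %r = call float @llvm.nvvm.sin.approx.f(float %x)
  ret float %r
}

declare float @llvm.nvvm.sin.approx.f(float)

; __nv_sinf, by contrast, is a long function body in the bitcode
; (argument reduction, polynomial evaluation, etc.), elided here.
```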

Following so far? If so, you may note that the state of the world is rather incomplete! Here are the problems I see at the moment:

* Many builtins don't work. For example, clang emits __builtin_sinf as a call to sinf [2]. But that function doesn't exist; nvidia's not-extern-"C" ::sinf is not the same symbol. In particular this means that libstdc++ is not going to work well, since it implements e.g. std::sin(float) as a call to __builtin_sinf.

* Many math optimizations aren't going to work (I think), because e.g. SimplifyLibCalls checks for function names like "log" [3], but by the time we get there, we're calling __nv_log.

* (At least some) llvm intrinsics sort of work, but e.g. llvm.sin.f32 gets lowered to the PTX instruction sin.approx.f32, while ::sinf from the CUDA headers only does this transformation if fast-math is enabled. Maybe this is sound if we only emit llvm.sin.f32 when fast-math is enabled; I dunno.

My question for the list is about the right way to fix these problems. It seems to me that since the optimizer explicitly uses knowledge of the various math functions, we shouldn't define inline versions of them. Instead, we should leave them as plain calls to e.g. sinf until we lower to nvvm in llvm. We have a header baked into clang that already disables some CUDA headers; in theory we should be able to disable CUDA's math_functions.h using the same mechanism.

When lowering to nvvm, we can make the determination as to whether we want to call a function defined in libdevice, invoke an nvvm intrinsic, or whatever. If we call into libdevice we'll want to rerun some optimization passes, but I presume we can order these passes appropriately.

In order to do this, we'll need libdevice to be available to llvm. It's not clear to me whether it is at the moment; clang sees it, but I'm not sure if clang passes both its generated IR and all of libdevice to llvm, or if it just copies the relevant definitions from libdevice into the IR it sends to llvm. If it's the latter, we could always copy *all* of libdevice into the generated IR. But I hope we could do better.

I think this proposal lets us solve all three problems above, but again I'm not sure I'm not missing something, or if there's a more canonical way to do this. Any feedback is appreciated!

-Justin

[1] libdevice User's Guide, CUDA Toolkit Documentation

[2] https://github.com/llvm-mirror/clang/blob/e2636ac0bad65451c3eb6272d7ab3abbba96da17/lib/CodeGen/CGBuiltin.cpp#L1971

[3] SimplifyLibCalls.cpp at b3bc79d5556108307026be07e7eaa644cce041ab, llvm-mirror/llvm (GitHub)