Cannot pass __device__ function as template parameter in CUDA?


I’m compiling this CUDA C++ code with Clang 14.0.5 (and CUDA 11.5):

#include <cstdio>

template<class F>
__global__ void launch (F f) { f(); }

template <typename p> class c {
public:
    __device__ static bool test();
    template <typename F> static void f(F&& test);
};

template <typename p>
__device__ auto c<p>::test() -> bool { return true; }

template <typename p>
template <typename F>
void c<p>::f(F&& test) {
    launch<<<1,1>>>([=] __device__ () { if (test()) { printf("success!"); } });
}

struct s {};

void x() {
    c<s>::f(c<s>::test);
}

but it fails with the error:

$ clang -x cuda -std=c++17 --cuda-gpu-arch=sm_70 -L$CUDA_HOME/lib64 -lcudart test.cpp
test.cpp:22:19: error: reference to __device__ function 'test' in __host__ function
test.cpp:6:28: note: 'test' declared here
    __device__ static bool test();
1 error generated when compiling for host.

I think this should be legal to do – is this supported in Clang?

If I change test() to a __host__ __device__ function, then it compiles without warnings, but does not launch any kernels at runtime. Regardless of the earlier point, I think this should be considered a bug.


The problem is that c<s>::test does not exist on the host side, so we physically do not have anything to pass to c<s>::f.
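The same restriction shows up without any of the templates. This is a hypothetical reduced example, not code from the question:

```cuda
__device__ bool dev_fn() { return true; }

void host_fn() {
    // error: reference to __device__ function 'dev_fn' in __host__ function.
    // dev_fn exists only in the device compilation; there is no host-side
    // symbol for this expression to refer to.
    auto p = &dev_fn;
}
```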

You could launch test as a member of class c<s>: Compiler Explorer
Or wrap it in a lambda, so we have a lambda object to pass: Compiler Explorer
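Sketched out, the lambda-wrapping call site might look like this (an assumption about what the linked Compiler Explorer example does; names follow the code above):

```cuda
void x() {
    // Pass a lambda object rather than the function itself: the lambda
    // object can be constructed on the host, and its __device__ operator()
    // only calls c<s>::test from device code.
    c<s>::f([] __device__ () { return c<s>::test(); });
}
```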

In general, the test function is not a member of class c<s>, but wrapping it in a lambda would work. Why does that work, though? Is the lambda itself not a device function?

Why does that work, though? Is the lambda itself not a device function?

A lambda is an object, which we can construct on the host side just fine. A non-capturing lambda can also be converted to a function pointer, and that is what makes it work: the object gives us something to pass on the host side, while the GPU-side operator() of a non-capturing lambda gives us something to execute on the GPU (and it does not depend on the lambda object itself).