Linking CUDA bitcode files and generating a CUDA executable

Hi everyone,

Could someone share the recipe for getting bitcode files out of a CUDA program and then linking them to generate an executable? I’m following the steps below, but when I run the resulting executable, no CUDA kernels are executed:

   clang++ -emit-llvm -c program.cu --cuda-path=$(CUDA_PATH) --cuda-gpu-arch=sm_35
   clang++ -c program.bc -o program.o
   llc program-cuda-nvptx64-nvidia-cuda-sm_35.bc -o program-cuda-nvptx64-nvidia-cuda-sm_35.ptx
   nvcc -arch=sm_35 --device-c program-cuda-nvptx64-nvidia-cuda-sm_35.ptx -o program-cuda-nvptx64-nvidia-cuda-sm_35.o
   nvcc -arch=sm_35 -dlink program.o program-cuda-nvptx64-nvidia-cuda-sm_35.o -o linkedcode.o
   clang++ -o program linkedcode.o program.o program-cuda-nvptx64-nvidia-cuda-sm_35.o -L$(CUDA_LIB) -lcudart_static -lcudadevrt -ldl -lrt -pthread

None of these steps reports an error, but when I run the program, no kernels execute. Running it under cuda-memcheck reports the following:

“Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunch.”
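
In case it helps with diagnosis, here is a minimal sketch of how the final binary could be checked for embedded device code. It assumes cuobjdump (shipped with the CUDA toolkit) is on PATH; the helper name list_device_code is hypothetical and only for illustration.

```shell
# Hypothetical helper: list the device code embedded in a host binary.
# Assumes cuobjdump from the CUDA toolkit is available on PATH.
list_device_code() {
    binary="$1"

    # Bail out early if the file to inspect does not exist.
    if [ ! -f "$binary" ]; then
        echo "no such file: $binary" >&2
        return 2
    fi

    # cuobjdump is needed to look inside the embedded fatbinary.
    if ! command -v cuobjdump >/dev/null 2>&1; then
        echo "cuobjdump not found on PATH" >&2
        return 3
    fi

    # --list-elf lists embedded cubin (SASS) images, --list-ptx lists embedded PTX.
    cuobjdump --list-elf "$binary"
    cuobjdump --list-ptx "$binary"
}
```

Calling `list_device_code ./program` should list whatever cubin and PTX images are embedded in the executable. If nothing for sm_35 shows up, that would suggest the device code never made it into the host binary, which might explain the "invalid device function" error.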

My device is a Tesla K40m with compute capability 3.5. I'm using Clang/LLVM 4.0 and CUDA 8.0. Can someone point out what I am doing wrong?

Thank you very much in advance,