NVPTX Back-end: relocatable device code support for dynamic parallelism

Hi everyone,

CUDA allows some runtime functions to be called from device code as well. On a multi-GPU system, this lets a GPU determine its own device ID via cudaGetDevice().
Unfortunately, I cannot get this working when compiling with clang. When compiling with nvcc, relocatable device code has to be enabled (-rdc=true) and cudadevrt has to be linked in [0]. I did not find any such switches to turn on rdc for clang. Simply compiling does not work, because ptxas cannot find the function cudaGetDevice().
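For reference, here is a minimal sketch of the kind of program I mean. The file name, kernel name, and variable names are made up for illustration, and the build line in the comment is only the nvcc invocation as I understand it from the flags above, not something I have verified on every toolkit version.

```cuda
// rdc_example.cu (hypothetical file name): a kernel that asks the device
// runtime which device it is running on.
//
// Assumed nvcc build line, using the flags mentioned above:
//   nvcc -rdc=true rdc_example.cu -o rdc_example -lcudadevrt
// The device runtime needs compute capability 3.5 or higher, so an
// explicit -arch option may be required depending on the toolkit default.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void report_device(int *out_id, cudaError_t *out_err) {
    int id = -1;
    // Device-side runtime call; this is the symbol ptxas cannot find
    // unless relocatable device code is enabled and cudadevrt is linked.
    *out_err = cudaGetDevice(&id);
    *out_id = id;
}

int main() {
    int *d_id = nullptr;
    cudaError_t *d_err = nullptr;
    if (cudaMalloc(&d_id, sizeof(int)) != cudaSuccess ||
        cudaMalloc(&d_err, sizeof(cudaError_t)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // A single thread is enough to query the device ID.
    report_device<<<1, 1>>>(d_id, d_err);

    int h_id = -1;
    cudaError_t h_err = cudaSuccess;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t copy_id = cudaMemcpy(&h_id, d_id, sizeof(int),
                                     cudaMemcpyDeviceToHost);
    cudaError_t copy_err = cudaMemcpy(&h_err, d_err, sizeof(cudaError_t),
                                      cudaMemcpyDeviceToHost);
    if (copy_id != cudaSuccess || copy_err != cudaSuccess) {
        std::fprintf(stderr, "kernel launch or copy failed: %s\n",
                     cudaGetErrorString(cudaGetLastError()));
        return 1;
    }

    std::printf("device-side cudaGetDevice: id=%d, status=%s\n",
                h_id, cudaGetErrorString(h_err));

    cudaFree(d_id);
    cudaFree(d_err);
    return 0;
}
```

The status is copied back alongside the ID so that a failing device-side call shows up in the output instead of being silently ignored.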

My guess is that this feature is not supported. Does anyone know if this is the case?

I also tried to find out what nvcc does when rdc is turned on, but had a few problems understanding what is going on. I will attach the verbose output of nvcc. I have no clue what the binaries cudafe/cudafe++ and cicc are doing, so it is rather hard to guess what is happening.
There are additional options like -D__CUDACC_RDC__, --device-c and --compile-only that are not used when rdc is off. All but --device-c can be used with clang, and I can compile my program; however, I can't get it to run properly. For each runtime call I get an unknown error with code 30.

I have little hope that someone has already figured out how to get rdc to work with clang, but I will be grateful for any hint. Whom could I write to regarding this problem? Maybe the NVPTX developers can help?

[0] Programming Guide :: CUDA Toolkit Documentation

nvcc-rdc-verbose.txt (6.32 KB)

Sorry for the long delay in replying, I'm not good at reading the mailing list.

My guess is that this feature is not supported. Does anyone know if this is the case?

Clang does not support dynamic parallelism or relocatable CUDA code
today. Patches -- and documentation fixes to mention this at
Compiling CUDA with clang — LLVM 18.0.0git documentation -- are certainly