linking relocatable device object code with clang

Dear clang developers,
thanks to the recent work by Alexey Bataev, Jonas Hahnfeld, and others, the trunk version of clang includes support for compiling device code into relocatable object files [1].

These object files can be linked with nvlink (once per GPU architecture), combined with fatbin, embedded in a host object file, and linked with the other host code by the host linker.
Usually nvcc takes care of this part, but it refuses to do so when the host compiler is one it does not support (gcc 8, clang 6).
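For illustration, the sequence above can be sketched as a dry-run shell script that only prints the commands for one GPU architecture. This is a rough sketch, not a verified recipe: the tool flags are modelled on what `nvcc -dlink --dryrun` typically prints and may differ between CUDA versions, and the function name `device_link_plan`, the `app_dlink.*` file names, the `a.o`/`b.o` inputs and the default `CUDA_HOME` are all hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the explicit "device link" commands, does not run them.
# ASSUMPTION: the flags are modelled on typical `nvcc -dlink --dryrun` output
# and should be checked against the installed CUDA version.
# Hypothetical names: device_link_plan, app_dlink.*, default CUDA_HOME.

CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

device_link_plan() {
    arch="$1"; shift    # GPU architecture, e.g. sm_60; remaining args are object files

    # 1. link the relocatable device code for this architecture into a cubin
    echo "nvlink --arch=$arch --register-link-binaries=app_dlink.reg.c -m64 $* -L$CUDA_HOME/lib64 -lcudadevrt -o app_dlink.$arch.cubin"

    # 2. wrap the cubin in a fat binary and emit a C file that embeds it
    echo "fatbinary --create=app_dlink.fatbin -64 -link --image=profile=$arch,file=app_dlink.$arch.cubin --embedded-fatbin=app_dlink.fatbin.c"

    # 3. compile the CUDA link stub with the host compiler to get a host object
    echo "clang++ -c -x c++ -DFATBINFILE='\"app_dlink.fatbin.c\"' -DREGISTERLINKBINARYFILE='\"app_dlink.reg.c\"' -I$CUDA_HOME/include -o app_dlink.o $CUDA_HOME/bin/crt/link.stub"

    # 4. link everything with the host linker
    echo "clang++ $* app_dlink.o -L$CUDA_HOME/lib64 -lcudart -o app"
}

device_link_plan sm_60 a.o b.o
```

With more than one GPU architecture, step 1 would be repeated per architecture and step 2 would receive one `--image` entry for each resulting cubin.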

It would be great if support for this “device link” step could be added to the clang driver.
I am interested in working on it myself, but I would need some guidance on how to get started.

In the meantime, to show each step and validate that different approaches are equivalent, I have adapted the original example by NVIDIA and set up an example on GitHub at https://github.com/fwyzard/cuda-linking/ :

clone the repository:

    git clone git@github.com:fwyzard/cuda-linking.git
    cd cuda-linking

build and link with nvcc:

    make clean nvcc
    ./app

build with nvcc, link explicitly with nvlink/fatbin:

    make clean nvlink
    ./app

build with clang, link explicitly with nvlink/fatbin:

    make clean clang
    ./app

Best regards,
.Andrea

[1] Separate Compilation and Linking of CUDA C++ Device Code, https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/

You might want to start by looking around in the Driver code to see how things are done, especially how external tools are invoked.
There is D47394, "[OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain", which does somewhat related things for OpenMP; maybe you can reuse some parts once the change lands?

Cheers,
Jonas