Kernel launch fails with "invalid device function" when both CUDA and OpenMP offload are in the same CMake shared library

When I build a shared library with CMake that contains both CUDA code (compiled with clang++) and OpenMP offload code, I get unexplained runtime crashes, e.g. the kernel launch failure in the title. I’ve also seen a crash in the first call to the CUDA runtime, before any kernel launch.

This only happens with a shared library. When I use a static library the code runs correctly. I can also compile the same code into two separate libraries, link both into my executable, and run without issue. I believe the code itself is correct and that the problem comes from linking the two into the same shared library.

I am using CMake and telling it to use clang as both CMAKE_CXX_COMPILER and CMAKE_CUDA_COMPILER. I have also set CMAKE_CUDA_SEPARABLE_COMPILATION to ON (for relocatable device code). CMake handles CUDA with clang seamlessly, but it does not know anything about OpenMP offload, so I have to pass those flags manually via target_compile_options and target_link_options. Perhaps the flags are not correct?

The problematic library and executable are configured as:

# library with both OpenMP and CUDA
set_source_files_properties(../cu_impl.cpp PROPERTIES LANGUAGE CUDA)
add_library(cump_impl ../mp_impl.cpp ../cu_impl.cpp)
target_compile_options(cump_impl PUBLIC -fopenmp --offload-arch=sm_75 -fopenmp-offload-mandatory --offload-new-driver)
set_target_properties(cump_impl PROPERTIES CUDA_ARCHITECTURES "75")

# executable that uses the library
add_executable(cump_both_2 ../main.cpp)
target_link_libraries(cump_both_2 cump_impl omp omptarget)
target_compile_definitions(cump_both_2 PUBLIC -DCUMP_USE_OPENMP -DCUMP_USE_CUDA)
target_compile_options(cump_both_2 PUBLIC -fopenmp --offload-arch=sm_75 -fopenmp-offload-mandatory --offload-new-driver)
target_link_options(cump_both_2  PUBLIC -L/home/bloring/work/llvm/llvm-install/lib/ -fopenmp --offload-arch=sm_75 -fopenmp-offload-mandatory --offload-new-driver)

There is a reproducer here:

This can be compiled and run with:

mkdir build
cd build
cmake -DBUILD_TESTING=ON -DBUILD_SHARED_LIBS=ON -DCMAKE_CXX_COMPILER=`which clang++` -DCMAKE_CUDA_COMPILER=`which clang++` -DCMAKE_BUILD_TYPE=Debug ../
make -j8
ctest

The last of the 4 tests fails, the one that uses the library with both CUDA and OpenMP offload. Change to -DBUILD_SHARED_LIBS=OFF and all tests run correctly.

I’m using clang 17 built from git in early June, and CMake 3.26.2 from Fedora 37.

Are you using --offload-new-driver for both the OpenMP and CUDA compilations? That forces them to go through the same pipeline and allows them to be linked together. I believe CMake supports separable compilation for Clang by doing a lot of build system magic; using --offload-new-driver should make that unnecessary, as it has the magic built in. In general, when you do this you’re going to be running two copies of the CUDA driver, one for OpenMP and one for CUDA. So it’s possible that when this is done via a shared library there’s some reference counting or delayed constructor weirdness when they get loaded later.


I am, but now that you point this out I see that it was not in the linking step for the shared library. Adding it to the link options:

target_link_options(cump_impl  PUBLIC -L/home/bloring/work/llvm/llvm-17_5_25-install/lib/ -fopenmp --offload-arch=sm_75 -fopenmp-offload-mandatory --offload-new-driver) 

solved the problem. Thanks for the suggestion!
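
For anyone who lands here later, this is a minimal sketch of the library configuration with the fix folded in. It assumes the same sources and sm_75 target as above. The variable name CUMP_OFFLOAD_FLAGS is made up for illustration. The point is to keep the offload flags in one list so the compile step and the shared library's link step cannot drift apart, which is what went wrong here. The -L path to the LLVM install is left out of the sketch; add it to the link options if the OpenMP runtime libraries are not on the default search path.

```cmake
# Hypothetical variable name: one list holding the offload flags used in this thread.
set(CUMP_OFFLOAD_FLAGS
    -fopenmp
    --offload-arch=sm_75
    -fopenmp-offload-mandatory
    --offload-new-driver)

# Library with both OpenMP offload and CUDA code.
set_source_files_properties(../cu_impl.cpp PROPERTIES LANGUAGE CUDA)
add_library(cump_impl ../mp_impl.cpp ../cu_impl.cpp)
set_target_properties(cump_impl PROPERTIES CUDA_ARCHITECTURES "75")

# Pass the same flags at compile time AND at link time.
# The link step of the shared library was the piece missing originally.
target_compile_options(cump_impl PUBLIC ${CUMP_OFFLOAD_FLAGS})
target_link_options(cump_impl PUBLIC ${CUMP_OFFLOAD_FLAGS})
```

Because the options are PUBLIC, CMake also applies them to any target that links against cump_impl, so the executable should not need its own copy of the flag list.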