OpenMP offload: Using global variable with a library results in "CUDA error: Loading global 'xxxx' Failed" or "nvlink error : Undefined reference to ..."

Dear All,

I have two questions regarding the usage of global variables with OpenMP offload:

  1. When the code from which I build a static library uses a global variable, I get a “CUDA error: Loading global ‘x’ Failed” error at run time. Here is a simple reproducer showing the issue:

$ cat test.sh

CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g -O2"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o

$ cat test.cpp

#pragma omp declare target
int y;
#pragma omp end declare target

int test() {
  y = 24;
  #pragma omp target update to(y)
  y = 42;

  int x;
  #pragma omp target map(from:x)
  {
    x = y;
  }
  return x;
}

$ cat main.cpp
extern int test();

int main() {
  return test();
}

Running ./test2 works as expected because it links the object file directly rather than the static library, but ./test1 fails with the error shown below:

$ ./test2
$ echo $?
24
$ ./test1
CUDA error: Loading global ‘y’ Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted

Is this expected behaviour? Is there any workaround? I have tested this with PGI/NVHPC compiler and it works there.

  2. The second scenario is similar, but now I am trying to use a global variable from the library inside an offload region in main.cpp, i.e. the modified main.cpp looks like this:

$ cat main.cpp
extern int test();

#include <cstdio>

#pragma omp declare target
extern int y;
#pragma omp end declare target

int main() {
  #pragma omp target teams distribute parallel for
  for(int i=0; i<5; i++) {
    printf("--> %d \n", y + i);
  }
  return test();
}

This now fails to build when linking against the library (the nvlink "Undefined reference" error from the title).

In a big application, we build & ship libraries and it’s not easy/convenient to use objects for linking. Do you have any recommendations to solve/workaround issues for this use case?

Thank you very much!

Regards,
Pramod Kumbhar

As of today, linking static libraries that contain device code remains problematic. If your project uses CMake, use an object library target to link the object files directly.
Ye

Kumbhar,

As Ye said, the current support for linking static libraries is a little flaky. I’m currently working on a series of patches that will hopefully address this problem; they are under review at https://reviews.llvm.org/D116541 and its child revisions. On my development branch I can currently link and run your tests, after adding some additional support for searching the library paths for static libraries. This should solve your problem once these patches are ready to land upstream.

Thanks,
Joseph Huber

Thank you Ye and Joseph for the clarifications! I am glad that the issue is already being addressed!

I would just like to add one clarification: issue/question #1 is resolved by switching to a shared library, but issue/question #2 (i.e. extern global variables in an offload region) remains with the shared library:

clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g -O2 -fpic -shared test.cpp -o libtest.so
clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g -O2 -o test3 main.cpp -L. -ltest -Wl,-rpath .
nvlink error : Undefined reference to ‘y’ in ‘/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/151397/main-1999e6.cubin
clang-13: error: nvlink command failed with exit code 255 (use -v to see invocation)
clang version 13.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix

I haven’t built the PR D116541 locally yet but I assume the underlying issue is the same and will be addressed by the same PR?

Thank you again!

Shared libraries work as long as offloading code isn’t shared between the library and the application. When we link offloading code, we have a fat binary that contains code for both the host and the device. With static linking we can extract the device code and use it to create a complete device image prior to linking. Shared libraries are loaded at runtime, so we can’t take the same approach and will need to leverage some kind of JIT to create a complete host binary. The work in the PR does not directly address the problem with shared libraries, but will make it easier to implement JIT functionality and potentially allow linking with shared libraries.

Thanks,
Joseph Huber
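
Until extern device globals can be resolved across a library boundary, one possible workaround for question #2 is to avoid referencing the library's global inside the application's target region at all: read it on the host and pass the value in explicitly. The following is only a minimal single-file sketch under that assumption, not something proposed in this thread. `lib::get_y()` and `sum_with_offset()` are hypothetical names, and the `lib` namespace stands in for code that would live in the library. It only helps when the application needs the value of the variable rather than the device-resident variable itself.

```cpp
#include <cstdio>

// --- Stand-in for the library side (would live in e.g. test.cpp) ---
namespace lib {
int y = 24;                // ordinary host global, no declare target needed here
int get_y() { return y; }  // hypothetical host-side accessor exported by the library
}  // namespace lib

// --- Stand-in for the application side (would live in e.g. main.cpp) ---
// Sums (y + i) for i in [0, n) inside a target region without ever naming
// the library's global in device code.
int sum_with_offset(int n) {
  const int y_local = lib::get_y();  // read the value on the host
  int sum = 0;
  // y_local is copied into the region as a firstprivate scalar, so the device
  // image of the application has no undefined reference to the library's 'y'.
  #pragma omp target teams distribute parallel for firstprivate(y_local) map(tofrom: sum) reduction(+: sum)
  for (int i = 0; i < n; i++) {
    sum += y_local + i;
  }
  return sum;
}
```

With `lib::y` set to 24, `sum_with_offset(5)` evaluates to 5*24 + (0+1+2+3+4) = 130, both when the region is offloaded and when it falls back to the host. The trade-off is that the application only sees a snapshot of the value taken at the time of the call.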