Compiling OpenMP code with CUDA support using libc++

Hi,

I just compiled LLVM and enabled NVPTX with -DLLVM_TARGETS_TO_BUILD="X86;NVPTX". But when I compiled my code using the command below:

clang++ main.cpp -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o a_gpu.exe

I got the below error:

/usr/lib64/gcc/x86_64-pc-linux-gnu/9.2.1/…/…/…/…/include/c++/9.2.1/bits/std_abs.h:75:3: error: declaration conflicts with target of using declaration already in scope
abs(float __x)

I think the reason is that I am using GCC 9. I searched around a bit, and someone said that using libc++ could address this. So I compiled libcxx and libcxxabi and used the command below to compile my code again:

clang++ -stdlib=libc++ main.cpp -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o a_gpu.exe

I got different errors:

nvlink error : Undefined reference to ‘_ZNKSt3__120__vector_base_commonILb1EE20__throw_length_errorEv’ in ‘/tmp/main-42e0a6.cubin’
nvlink error : Undefined reference to ‘abort’ in ‘/tmp/main-42e0a6.cubin’

I think the reason here is that nvlink doesn't know it should link libc++ together with those cubin files, but I don't know how to solve this.

Does anyone know a workaround for this?

Thanks!

Neither libc++ nor libstdc++ can be linked with nvlink. NVIDIA does not provide device-side implementations of either library. You must exclude the use of the standard C++ library from target regions.
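For example, if the host code uses std::vector, one option is to keep the vector on the host and map only its raw buffer, so that nothing from the library is referenced inside the region. A minimal sketch (the function and the names v, p, n are just illustrative):

#include <vector>

void scale(std::vector<float> &v) {
  // The vector object stays on the host; only its raw buffer is mapped.
  float *p = v.data();
  int n = (int)v.size();
  #pragma omp target parallel for map(tofrom: p[:n])
  for (int i = 0; i < n; ++i)
    p[i] *= 2.0f;   // only plain pointers and arithmetic in the target region
}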

Best regards,
Alexey Bataev

Thanks! I didn't notice this. The code is from a legacy project, and I just checked all the target regions: they did use STL vectors. I replaced all those vectors with plain arrays, and now it compiles successfully. But when I run it, there is another error:

Libomptarget fatal error 1: failure of target construct while offloading is mandatory

I tried it on a simple vector add example, and got the same error.

Below is the debug information with LIBOMPTARGET_DEBUG=1:

Libomptarget → Loading RTLs…
Libomptarget → Loading library ‘libomptarget.rtl.ppc64.so’…
Libomptarget → Unable to load library ‘libomptarget.rtl.ppc64.so’: libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget → Loading library ‘libomptarget.rtl.x86_64.so’…
Libomptarget → Successfully loaded library ‘libomptarget.rtl.x86_64.so’!
Libomptarget → Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget → Loading library ‘libomptarget.rtl.cuda.so’…
Target CUDA RTL → Start initializing CUDA
Libomptarget → Successfully loaded library ‘libomptarget.rtl.cuda.so’!
Libomptarget → Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget → Loading library ‘libomptarget.rtl.aarch64.so’…
Libomptarget → Unable to load library ‘libomptarget.rtl.aarch64.so’: libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget → RTLs loaded!
Libomptarget → Image 0x000000000041ad20 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget → Image 0x000000000041ad20 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget → RTL 0x00000000015b3c40 has index 0!
Libomptarget → Registering image 0x000000000041ad20 with RTL libomptarget.rtl.cuda.so!
Libomptarget → Done registering entries!
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget → Checking whether device 0 is ready.
Libomptarget → Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL → Init requires flags to 1
Target CUDA RTL → Getting device 0
Target CUDA RTL → Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL → Using 1024 CUDA threads per block
Target CUDA RTL → Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL → Default number of teams set according to library’s default 128
Target CUDA RTL → Default number of threads set according to library’s default 128
Libomptarget → Device 0 is ready to use.
Target CUDA RTL → Load data from image 0x000000000041ad20
Target CUDA RTL → Error when loading CUDA module
Target CUDA RTL → CUDA error is: device kernel image is invalid
Libomptarget → Unable to generate entries table for device id 0.
Libomptarget → Failed to init globals on device 0
Libomptarget → Failed to get device 0 ready
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Libomptarget → Unloading target library!
Libomptarget → Image 0x000000000041ad20 is compatible with RTL 0x00000000015b3c40!
Libomptarget → Unregistered image 0x000000000041ad20 from RTL 0x00000000015b3c40!
Libomptarget → Done unregistering images!
Libomptarget → Removing translation table for descriptor 0x0000000000440810
Libomptarget → Done unregistering library!
Libomptarget → Deinit target library!

Any hints about this?

Regards,
Gang Zhao

Alexey Bataev <a.bataev@hotmail.com> wrote on Thu, Mar 5, 2020, at 5:44 AM:

I ran into this 'Target CUDA RTL → CUDA error is: device kernel image is invalid' issue. It might be due to a mismatch between the SM architecture the device supports and the one the clang installation was configured for. Running deviceQuery from the samples shipped with the NVIDIA SDK will tell you the compute capability. For instance, if it is 6.1, check the CMake configuration to see whether it is configured for 61. It can be set with -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=61.
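If I remember correctly, that variable also accepts a comma-separated list, so a reconfigure along these lines (keeping whatever other options you already use) covers several architectures at once:

cmake <your existing options> -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=61,70,75 <path to llvm source>

Then rebuild the OpenMP runtime so that the matching libomptarget-nvptx-sm_XX.bc files are generated.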

Hope this helps.
-Sriram

Yeah, the example code is below:

int N = 1<<20;

float *x = new float[N];
float *y = new float[N];

for (int i = 0; i < N; i++) {
  x[i] = 1.0f;
  y[i] = 2.0f;
}

float *z = new float[N];
int i;
#pragma omp target map(x, y, z)
#pragma omp parallel for
for (i = 0; i < N; i++) {
  z[i] = x[i] + y[i];
}

I just grabbed this piece of code from https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf for testing. I also tested other examples in that document, but none of them worked. Initially, I was working on a piece of code from a legacy project.

BTW, when I compiled the example code, I got some warnings:

clang-11: warning: Unknown CUDA version 10.2. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]
clang-11: warning: Unknown CUDA version 10.2. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]
clang-11: warning: No library ‘libomptarget-nvptx-sm_35.bc’ found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]

I am not sure whether the error is caused by the newer CUDA version (I thought 10.2 should be compatible with 10.1).

Thanks!

Gang Zhao

Alexey.Bataev <a.bataev@outlook.com> wrote on Thu, Mar 5, 2020, at 12:59 PM:

  1. Yes. I forgot that I had changed the static arrays to dynamic ones, so my mapping is wrong. I changed it to array-section form like x[:N] (roughly as sketched below), but the error still occurred.
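That is, something like this, with the same x, y, z, and N as in the earlier snippet:

#pragma omp target map(tofrom: x[:N], y[:N], z[:N])
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  z[i] = x[i] + y[i];
}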

I'll search more to see whether it's due to CUDA 10.2. Thanks!

Alexey.Bataev <a.bataev@outlook.com> wrote on Thu, Mar 5, 2020, at 2:31 PM:

The compute capability of my device is 7.5, but when I compiled LLVM I did not set -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH or -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES. It's mentioned somewhere that the default value is 3.5, and I thought it would be backward compatible. I'll recompile with the exact number to see if it works. Thanks!
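Concretely, the plan is roughly the following (assuming sm_75 is the right value for compute capability 7.5):

cmake <my existing options> -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_75 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=75 <path to llvm source>

and, if I understand the docs correctly, the device architecture can also be overridden per compilation with something like:

clang++ main.cpp -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_75 -o a_gpu.exe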

Regards,

Gang Zhao

Sriram Ananthakrishnan <sriram.aananth@gmail.com> wrote on Thu, Mar 5, 2020, at 2:13 PM: