Fail to generate cubin file when compiling OpenMP Applications with Nvidia GPUs

Hi,

I have compiled the latest LLVM from the GitHub main branch, but the clang++ compiler reports an nvlink error, "could not open input file 'OMPStream-766936.cubin'", when compiling the OpenMP version of the BabelStream benchmark.

I found out the nvlink error arises when CMake executes the following three commands:
"/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-offload-bundler" -type=o -targets=host-x86_64-unknown-linux-gnu,openmp-nvptx64-nvidia-cuda-sm_70 -inputs=CMakeFiles/omp-stream.dir/src/omp/OMPStream.cpp.o -outputs=/home/lechen/.tmux_tmp/OMPStream-feb037.o,/home/lechen/.tmux_tmp/OMPStream-766936.cubin -unbundle -allow-missing-bundles

"/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-offload-bundler" -type=o -targets=host-x86_64-unknown-linux-gnu,openmp-nvptx64-nvidia-cuda-sm_70 -inputs=CMakeFiles/omp-stream.dir/src/main.cpp.o -outputs=/home/lechen/.tmux_tmp/main-e9854a.o,/home/lechen/.tmux_tmp/main-81e658.cubin -unbundle -allow-missing-bundles

"/rwthfs/rz/cluster/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-nvlink-wrapper" -o /home/lechen/.tmux_tmp/OMPStream-d73a78.out -v -arch sm_70 -L/home/lechen/Repository/OpenMP/llvm-install/main/lib -L/usr/local_rwth/sw/cuda/11.2.0/nvvm/libdevice -L/rwthfs/rz/cluster/home/lechen/Repository/OpenMP/llvm-install/main/lib /home/lechen/.tmux_tmp/OMPStream-766936.cubin /home/lechen/.tmux_tmp/main-81e658.cubin --nvlink-path=/rwthfs/rz/SW/cuda/11.2.0/RHEL_7/cuda/bin/nvlink
nvlink fatal : Could not open input file '/home/lechen/.tmux_tmp/OMPStream-766936.cubin'

It seems that the cubin file OMPStream-766936.cubin is not generated correctly. I also tried executing the three commands manually. This time OMPStream-766936.cubin is generated, but it is an empty file, and the other generated cubin file, main-81e658.cubin, is also empty.

I was wondering whether this is due to an incorrect CMake configuration for LLVM. I have listed the details of this error below. If anyone knows the root cause of this issue, please let me know. Thanks.

CMake setting for LLVM:
cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=$INSTALLDIR \
-DLLVM_ENABLE_LIBCXX=ON \
-DLLVM_LIT_ARGS="-sv -j12" \
-DPAPI_PREFIX=${PAPI_ROOT} \
-DCLANG_DEFAULT_CXX_STDLIB=libc++ \
-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
-DLIBOMPTARGET_ENABLE_DEBUG=on \
-DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true \
-DLIBOMPTARGET_NVPTX_AUTODETECT_COMPUTE_CAPABILITY=OFF \
-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES="70" \
-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;flang;clang-tools-extra;openmp" \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DLLVM_INSTALL_UTILS=ON \
$LLVM_SOURCE

CMake setting for BabelStream:
cmake -DCMAKE_CXX_COMPILER=clang++ \
-DMODEL="omp" \
-DOFFLOAD=NVIDIA:sm_70 \
$BABELSTREAM_SOURCE

System and LLVM details:
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/lechen/Repository/OpenMP/llvm-install/main/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /rwthfs/rz/SW/cuda/11.2.0/RHEL_7/cuda, version 11.2

Warnings and errors reported by clang++:
warning:
linking module ‘/home/lechen/Repository/OpenMP/llvm-install/main/lib/libomptarget-new-nvptx-sm_70.bc’: Linking two modules of different target triples: ‘/home/lechen/Repository/OpenMP/llvm-install/main/lib/libomptarget-new-nvptx-sm_70.bc’ is ‘nvptx64’ whereas ‘/home/lechen/Repository/BabelStream/src/main.cpp’ is ‘nvptx64-nvidia-cuda’

nvlink error:
nvlink fatal : Could not open input file '/home/lechen/.tmux_tmp/OMPStream-766936.cubin'
home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-nvlink-wrapper: error: ‘nvlink’ failed
clang-14: error: nvlink command failed with exit code 1 (use -v to see invocation)

This looks like the good old “no device code no cookie” error. Basically, if you have a source file w/o device code it will trip up nvlink in the current driver. The new driver under review is totally fine with this (@jhuber6 can point you at it and it should be available soon/in 14).
To work around this till then, I'd recommend adding some target directive to the file that causes the problem here. It can be in a dead but externally visible function, or a declare target variable, I think.
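A minimal sketch of what such a workaround could look like, assuming the goal is only to make sure the translation unit emits some device code. The names `omp_offload_anchor` and `omp_offload_anchor_fn` are hypothetical and not part of BabelStream or LLVM; either of the two constructs alone might be enough.

```cpp
// Option 1: a declare target variable, so the file has a device-side symbol.
// The name is made up for illustration.
#pragma omp declare target
int omp_offload_anchor = 0;
#pragma omp end declare target

// Option 2: a function that is never called ("dead") but has external
// linkage, so the compiler cannot discard it and its target region.
int omp_offload_anchor_fn() {
  int value = 0;
#pragma omp target map(tofrom : value)
  { value = 42; }
  return value;
}
```

Without `-fopenmp` the pragmas are ignored and the function simply runs on the host, so the file still compiles in non-offloading builds.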

Let me know if this helped.

Yes, currently if no device code is present when we extract device files an empty one will be created. This causes a problem because nvlink does not ignore empty cubin files. The new driver solves this problem by only linking files that were actually found to contain offloading code. The reviews can be found here (more patches under the “stack” view) which should hopefully land in the next week or so. If you want to try it now you can use my development branch. That being said, now that we use the clang-nvlink-wrapper we could probably solve this with the old approach for now until it gets removed.

Hi, @jdoerfert @jhuber6 ,

Thanks for the prompt reply. I have tried the workaround, but the nvlink error still arises. Now both source files (main.cpp, OMPStream.cpp) have code snippets using target constructs, but the generated cubin files are still empty. Could there be some other cause?

@jhuber6 has made D117777 "[OpenMP] Don't pass empty files to nvlink" to fix this. You could try that patch out if you build your own clang anyway.

@lechenyu I landed D117777 "[OpenMP] Don't pass empty files to nvlink", which should solve your problem. Can you verify the solution?

Hi,

I have tried the patch. Unfortunately, it still does not work. Now the nvlink error becomes “No input files specified”. The details of the error are attached in the following paragraph. If you need more information, please let me know. Thanks.

Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/lechen/Repository/OpenMP/llvm-install/main/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /rwthfs/rz/SW/cuda/11.2.0/RHEL_7/cuda, version 11.2
"/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-offload-bundler" -type=o -targets=host-x86_64-unknown-linux-gnu,openmp-nvptx64-nvidia-cuda-sm_70 -inputs=CMakeFiles/omp-stream.dir/src/omp/OMPStream.cpp.o -outputs=/home/lechen/.tmux_tmp/OMPStream-adfe27.o,/home/lechen/.tmux_tmp/OMPStream-775b86.cubin -unbundle -allow-missing-bundles
"/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-offload-bundler" -type=o -targets=host-x86_64-unknown-linux-gnu,openmp-nvptx64-nvidia-cuda-sm_70 -inputs=CMakeFiles/omp-stream.dir/src/main.cpp.o -outputs=/home/lechen/.tmux_tmp/main-d65256.o,/home/lechen/.tmux_tmp/main-efd820.cubin -unbundle -allow-missing-bundles
"/rwthfs/rz/cluster/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-nvlink-wrapper" -o /home/lechen/.tmux_tmp/OMPStream-e7a11d.out -v -arch sm_70 -L/home/lechen/Repository/OpenMP/llvm-install/main/lib -L/usr/local_rwth/sw/cuda/11.2.0/nvvm/libdevice -L/rwthfs/rz/cluster/home/lechen/Repository/OpenMP/llvm-install/main/lib /home/lechen/.tmux_tmp/OMPStream-775b86.cubin /home/lechen/.tmux_tmp/main-efd820.cubin --nvlink-path=/rwthfs/rz/SW/cuda/11.2.0/RHEL_7/cuda/bin/nvlink
nvlink fatal : No input files specified; use option --help for more information
/rwthfs/rz/cluster/home/lechen/Repository/OpenMP/llvm-install/main/bin/clang-nvlink-wrapper: error: ‘nvlink’ failed
clang-14: error: nvlink command failed with exit code 1 (use -v to see invocation)

That means that none of the input files had any offloading code. I could also work around this error by simply returning an empty file from the linker, but it suggests that something's probably wrong if we're not getting any offloading code whatsoever. Also, looking at your CMake configuration, do you use a two-step build? When compiling with OpenMP offloading support you need to build the OpenMP project with the clang compiler that was just built. The easiest way to do this is to remove openmp from the projects and add -DLLVM_ENABLE_RUNTIMES=openmp.

Hi, @jhuber6.

The benchmark I use is BabelStream. The source file OMPStream.cpp does have offloading code. However, after applying the patch the generated cubin file is still empty.

For the llvm installation, I use an existing clang-11 to compile the latest llvm. I am used to installing llvm in this way and it works. So I think the root cause is not the two-step installation.

As I learned recently, the main reason for using LLVM_ENABLE_RUNTIMES is to build the LLVM runtimes for a different architecture in a cross-compiler setup. The CMake files for libomptarget are properly set up to use the clang compiler that was just compiled.

I can reproduce the issue that all cubin files are empty, although the source code contains offloading regions. After a bit of manual compiling, I think the reason is that BabelStream's CMake build separates compilation to object files from a separate linking step. When we compile the source files into a binary in one step, the linking succeeds.

I can reproduce it as well. When I compile it manually from the source files I don't get the error, but it doesn't seem to perform offloading. I get this output from nvprof:

No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   62.36%  17.274us         3  5.7580us     247ns  16.776us  cuDeviceGetAttribute
                   28.24%  7.8230us         1  7.8230us  7.8230us  7.8230us  cuDeviceGetPCIBusId
                    6.43%  1.7820us         3     594ns     245ns  1.1680us  cuDeviceGetCount
                    2.97%     822ns         1     822ns     822ns     822ns  cuDeviceGet

But the runtime indicates that a CUDA module was at least loaded:

Libomptarget --> Image 0x000000000040b900 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x0000000001d7dd00 has index 0!
Libomptarget --> Registering image 0x000000000040b900 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!

You probably need to add -DOMP_TARGET_GPU for compilation, so that the offloading is actually turned on.

You're right, it's working now. So it seems this is an issue with BabelStream's build system for offloading.

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   30.33%  537.33ms       100  5.3733ms  5.3340ms  5.6322ms  __omp_offloading_fd02_2167efb7__ZN9OMPStreamIdE3dotEv_l229
                   19.65%  348.23ms       117  2.9763ms  1.2150us  116.23ms  [CUDA memcpy DtoH]
                   13.89%  246.05ms       100  2.4605ms  2.4548ms  2.4659ms  __omp_offloading_fd02_2167efb7__ZN9OMPStreamIdE3addEv_l155
                   13.85%  245.44ms       100  2.4544ms  2.4453ms  2.4608ms  __omp_offloading_fd02_2167efb7__ZN9OMPStreamIdE5triadEv_l180
                   11.07%  196.17ms       100  1.9617ms  1.9550ms  1.9712ms  __omp_offloading_fd02_2167efb7__ZN9OMPStreamIdE4copyEv_l108
                   11.05%  195.80ms       100  1.9580ms  1.9512ms  1.9646ms  __omp_offloading_fd02_2167efb7__ZN9OMPStreamIdE3mulEv_l132
                    0.15%  2.7280ms         1  2.7280ms  2.7280ms  2.7280ms  __omp_offloading_fd02_2167efb7__ZN9OMPStreamIdE11init_arraysEddd_l62

Hi, @jdoerfert @jhuber6

I just found that I am able to compile BabelStream manually with a single clang++ command, but if I use CMake, which first generates the .o files separately and then links them together, the nvlink error arises.

Command to manually compile BabelStream:
clang++ -o stream14.exe -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_70 -O2 -DOMP_TARGET_GPU=1 -DOMP=1 -I. -I./omp main.cpp omp/OMPStream.cp

On a separate note, I also noticed that the compiled BabelStream reaches only about half of the peak performance of the Nvidia V100. With clang-11, the throughput for most kernels is around 800 GB/s, but the latest LLVM only achieves 400 - 500 GB/s. What could be the root cause of the performance loss?

[Screenshot: performance with clang-11 and clang-14]

We’ll look into this. I think I know the answer and it basically boils down to upstreaming the missing OpenMP optimizations we have.
Will keep you posted.

I just pushed a patch that addresses the performance differences when I run it. Can you do a pull from upstream LLVM now and verify that the performance improves?