BTW, now you can simply use clang++ -fopenmp --offload-arch=sm_80 to compile. No need to use the long -fopenmp-targets=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -Xopenmp-target -march=sm_80.
We only started supporting this fully in LLVM 15, see this talk. LLVM 14 still requires using -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_80 to specify it. I would recommend using LLVM 15 because it fully supports -foffload-lto, which generally improves performance on Nvidia targets.
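For anyone wanting something small to try these flags on, here is a minimal sketch of an offloaded loop. It is not from the thread: the function name `saxpy` and the file name in the build comment are hypothetical, and the build line simply reuses the short form quoted above. It assumes `y` is at least as long as `x`.

```cpp
// Hypothetical build line, using the short flags discussed above (LLVM 15+):
//   clang++ -fopenmp --offload-arch=sm_80 saxpy.cpp -c
#include <cstddef>
#include <vector>

// Computes y[i] += a * x[i] for every element of x.
// With offloading enabled the loop runs on the default device; without
// -fopenmp the pragma is ignored and the loop runs on the host.
// Assumption: y.size() >= x.size().
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
  const float* xp = x.data();
  float* yp = y.data();
  const std::size_t n = x.size();
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (std::size_t i = 0; i < n; ++i)
    yp[i] += a * xp[i];
}
```

The result is the same whether the region runs on the GPU or falls back to the host, so it only checks that the toolchain builds and runs, not that offloading actually happened.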
I’d recommend updating to LLVM 15 or even a development version (built from git). The FAQ on openmp.llvm.org has information on the latter.
I think it (nowadays) is. Though that’s not the point of the question.
FWIW, with LLVM 15 and newer you can drop --cuda-gpu-arch and replace -Xopenmp-target -march=sm_80 with --offload-arch=sm_80. If you like cuda mode, clang --help-hidden | grep fopenmp has some more assume flags you might like. As @jhuber6 noted, -foffload-lto is also something to consider for sure. (Side note: Not sure if we already describe all of them on our webpage (openmp.llvm.org), @jhuber6.)
Devices 0 and 1 are what should work, so the example as shown looks OK. I’m assuming @jhuber6’s answer will fix your problem. LIBOMPTARGET_INFO=-1 ./testapp will give you more information about what’s going on, e.g., it might tell you that the image is not compatible with the device because sm_80 has not been piped through properly.
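For reference, a minimal sketch of what addressing devices by number can look like. This is not the original poster's code: `num_offload_devices` and `scale_on_all_devices` are hypothetical names. The array is split into one contiguous chunk per device and each chunk is sent to its device with the `device(d)` clause. When no offload device is available, or the file is built without -fopenmp, everything runs on the host as a single chunk.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of offload devices; treated as 0 when built without OpenMP.
int num_offload_devices() {
#ifdef _OPENMP
  return omp_get_num_devices();
#else
  return 0;
#endif
}

// Doubles every element, giving each device one contiguous chunk.
// With two GPUs this issues target regions on device(0) and device(1).
void scale_on_all_devices(std::vector<double>& data) {
  const int ndev = num_offload_devices();
  const int chunks = ndev > 0 ? ndev : 1;  // host fallback: a single chunk
  double* base = data.data();
  const std::size_t n = data.size();
  for (int d = 0; d < chunks; ++d) {
    const std::size_t begin = n * d / chunks;
    const std::size_t end = n * (d + 1) / chunks;
    const std::size_t len = end - begin;
    double* chunk = base + begin;
    #pragma omp target teams distribute parallel for device(d) \
        map(tofrom: chunk[0:len])
    for (std::size_t i = 0; i < len; ++i)
      chunk[i] *= 2.0;
  }
}
```

As written, the chunks are processed one after another. Overlapping the devices would need something like `nowait` plus a `taskwait`, which is left out to keep the sketch short. Running the resulting binary with LIBOMPTARGET_INFO=-1 set should show which device each region was launched on.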
Thank you for the clarification, Shilei! I now feel convinced that the code should indeed work with device(0) and device(1). As suggested by Johannes and Joseph I will install the latest version of Clang instead and see if the issue persists.
Thank you, Johannes! I definitely have a lot to learn about Clang. I will try out the development branch and if I get it to work, I will try out some of the assume flags to improve the performance.
I am very impressed that I got so many replies in such a short time!
That is because clang picks up the system GCC 4.8.5 in the second build, which doesn’t fully support C++14. In the past you could set GCC_INSTALL_PREFIX when building LLVM, but unfortunately it was removed recently (Add --gcc-install-dir=, deprecate --gcc-toolchain=, and remove GCC_INSTALL_PREFIX). CCC_OVERRIDE_OPTIONS seems to be the only way to tell clang to use your own GCC. You can check clang/tools/driver/driver.cpp line 105 to see how to use that environment variable.
Okay, I checked the wrong directory before. GCC_INSTALL_PREFIX is not removed yet. You can still use that. I’m not sure if it will be removed soon, but at least it can still work.
The CMake variable GCC_INSTALL_PREFIX is discouraged. Which value do you use? It’s recommended to specify --gcc-install-dir= (e.g. --gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/10) if you have a specific GCC version requirement.
The problem is that users don’t have direct control of the second build (the runtime build in LLVM_ENABLE_RUNTIMES). It is invoked by CMake directly. In this case, we can either tell clang which GCC to use via CCC_OVERRIDE_OPTIONS, or clang can use the one specified in GCC_INSTALL_PREFIX when it was built.
I omitted a couple of unrelated arguments. Here openmp is set in LLVM_ENABLE_RUNTIMES. In this mode, both clang and llvm are built first, and then openmp is configured (CMake) and built. During the configuration, the clang that was just built is used as the compiler. The second stage is invoked by CMake automatically because runtimes is a CMake target at the top level. So at that stage, clang will use the GCC specified in GCC_INSTALL_PREFIX directly.
I was trying to say that asking users to specify --gcc-install-dir is not feasible in this case.
Yes, but when using the runtime build (LLVM_ENABLE_RUNTIMES), either users don’t have write access to the system folder, or the clang binary folder has not been created yet.
Passing --config /path/to/config.cfg in CMAKE_CXX_FLAGS might be a way to solve this with fewer steps, but this would also load the config file when building clang/llvm (which may not be what you want). Is there a way to pass cxx flags only to the runtime builds?
AFAIK, only specific CMake arguments of the form <runtime>_ABC can be passed to the runtime builds; when the runtime is configured, <runtime>_ABC is passed through. That being said, there is no way to pass general CMake arguments to the runtime build. (That was my impression about one year ago when I enabled the runtime build for OpenMP. I’m not sure if it has been improved.) That’s also the main reason I never use the runtime build: my usual setup is release LLVM plus debug OpenMP, and there is no way to do that with the runtime build. However, for OpenMP users, the runtime build is the recommended method.
Thank you for all of the really good recommendations! I guess the issue of specifying the device number has been fixed in a more recent version of Clang. With the release version compiled from the main branch there is indeed no issue with multi-GPU offloading.
@shiltian Adding -DGCC_INSTALL_PREFIX made the second phase of the compilation work. Thank you so much for suggesting that. I would never have found out how to do that on my own.
The following resulted in an installation that allows me to do multi-GPU target offloading in Clang++.