How can I use multiple GPUs in OpenMP?

Dear LLVM community,

I have developed some C and C++ applications using multiple GPUs which I have so far only compiled with NVC. I have made a few simple test problems to get started with Clang, but I can’t get them to work as expected. Everything goes fine as long as I use only one GPU. Do any of you know what I am doing wrong?

With llvm-omp-device-info I find my two Nvidia A100 GPUs listed as devices 4 and 5. However, omp_get_num_devices() returns 2, indicating that the other four devices (0 to 3) do not count.

Also, omp_get_initial_device() returns 2, whereas it returns -1 when compiled with NVC (which is fine, since the standard does not specify what number it should be). I am unsure whether I am simply not specifying the correct device numbers or doing something else wrong. I compile my examples with

clang++ -Wall -g -v -std=c++17 -O3 -fopenmp -fopenmp-cuda-mode -fopenmp-targets=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -Xopenmp-target -march=sm_80 -o testapp testapp.cpp -lm -L/appl/gcc/11.3.0-binutils-2.38/lib64/ -Wl,-rpath,/appl/gcc/11.3.0-binutils-2.38/lib64/

The following example works

#include <iostream>
#include <omp.h>

using namespace std;

int main(void)
{
	int len = 1000000;
	double * arr = new double[len];

	#pragma omp parallel for schedule(static)
	for(int i=0;i<len;i++){
		arr[i] = ((double) i)/100000.0;
	}

	double result = 0.0;
	#pragma omp target teams distribute parallel for schedule(static) map(to:arr[0:len]) \
	map(tofrom:result) reduction(+:result)
	for(int i=0;i<len;i++){
		result += arr[i];
	}

	cout << "The GPU reduction yielded " << result << endl;
	delete[] arr;
	return 0;
}

And the following does not

#include <iostream>
#include <omp.h>

using namespace std;

int main(void)
{
	int len = 1000000;
	double * arr = new double[len];

	#pragma omp parallel for schedule(static)
	for(int i=0;i<len;i++){
		arr[i] = ((double) i)/100000.0;
	}

	for (int j=0;j<omp_get_num_devices();j++){
		double result = 0.0;
		#pragma omp target teams distribute parallel for schedule(static) map(to:arr[0:len]) \
		map(tofrom:result) reduction(+:result) device(j)
		for(int i=0;i<len;i++){
			result += arr[i];
		}

		cout << "The GPU reduction yielded " << result << endl;
	}
	delete[] arr;
	return 0;
}

I also tried replacing device(j) with device(4+j) to match numbers 4 and 5, but that also does not work. I get the error: Libomptarget error: Device XXX is not ready.

I have Clang 14.0.6 installed. Any help would be greatly appreciated!

Kind regards, Anton

With llvm-omp-device-info I find my two Nvidia A100 GPUs listed as devices 4 and 5. However, omp_get_num_devices() returns 2, indicating that the other four devices (0 to 3) do not count.

The reason llvm-omp-device-info reports more than two devices is that it also opens the generic plugin, which reports four CPU devices (a hardcoded number). What omp_get_num_devices gives you is correct: when the program starts, the device images are checked and loaded, and only devices compatible with those images are counted. Since you compile for the A100 (sm_80), only the two GPUs are counted.

Based on OpenMP spec:

The effect of this routine is to return the device number of the host device. The value of the device number is the value returned by the omp_get_num_devices routine.

So the LLVM OpenMP implementation does conform to the spec. Device id -1 is used to indicate that the user did not specify any device id on a target directive.

To summarize, in your case devices 0 and 1 are the two A100 GPUs, because they are the only two devices compatible with the device images.
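A minimal sketch of that numbering (it just prints the ids and touches each device; the empty target region is a placeholder, not your reduction):

#include <iostream>
#include <omp.h>

int main(void)
{
	int num_devs = omp_get_num_devices();    // 2 on your system: the two A100s
	int host_dev = omp_get_initial_device(); // equal to num_devs per the spec, so 2 here

	std::cout << "found " << num_devs << " devices, host device id " << host_dev << std::endl;

	// Non-host device ids always run from 0 to num_devs - 1.
	for (int j = 0; j < num_devs; j++) {
		#pragma omp target device(j)
		{
			// empty region offloaded to device j
		}
	}
	return 0;
}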

BTW, now you can simply use clang++ -fopenmp --offload-arch=sm_80 to compile. No need to use the long -fopenmp-targets=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -Xopenmp-target -march=sm_80.


BTW, now you can simply use clang++ -fopenmp --offload-arch=sm_80 to compile. No need to use the long -fopenmp-targets=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -Xopenmp-target -march=sm_80.

We only started supporting this fully in LLVM 15, see this talk. LLVM 14 still requires using -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_80 to specify it. I would recommend using LLVM 15 because it fully supports -foffload-lto, which generally improves performance on Nvidia targets.
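For example, with LLVM 15 your test program could be built with something like the following (I kept the GCC library path and -lm from your original command; those are specific to your system):

clang++ -Wall -g -std=c++17 -O3 -fopenmp --offload-arch=sm_80 -foffload-lto -o testapp testapp.cpp -lm -L/appl/gcc/11.3.0-binutils-2.38/lib64/ -Wl,-rpath,/appl/gcc/11.3.0-binutils-2.38/lib64/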


I’d recommend updating to LLVM 15 or even a development version (built from git). The FAQ on openmp.llvm.org has information on the latter.

I think it (nowadays) is. Though that’s not the point of the question.

FWIW, with LLVM 15 and newer you can drop the cuda-gpu-arch and replace -Xopenmp-target -march=sm_80 with --offload-arch=sm_80. If you like cuda mode, clang --help-hidden | grep fopenmp has some more assume flags you might like. As @jhuber6 noted, -foffload-lto is also something to consider for sure. (Side note: Not sure if we already describe all of them on our webpage (openmp.llvm.org), @jhuber6.)

Devices 0 and 1 are what should work, so the example as shown looks OK. I’m assuming @jhuber6’s answer will fix your problem. LIBOMPTARGET_INFO=-1 ./testapp will give you more information about what’s going on, e.g., it might tell you that the image is not compatible with the device because sm_80 has not been piped through properly.


Thank you for the clarification, Shilei! I now feel convinced that the code should indeed work with device(0) and device(1). As suggested by Johannes and Joseph I will install the latest version of Clang instead and see if the issue persists. :smiling_face:

Thank you, Joseph! I will revisit your slides and try to install a newer version of Clang.

Thank you, Johannes! I definitely have a lot to learn about Clang. I will try out the development branch and if I get it to work, I will try out some of the assume flags to improve the performance.

I am very impressed that I got so many replies in such a short time!

I am facing some problems when installing LLVM from GitHub. I tried

mkdir LLVM
cd LLVM
git clone https://github.com/llvm/llvm-project.git
module load cmake/3.23.2
module load gcc/11.3.0-binutils-2.38
export CC=`which gcc`
export CXX=`which g++`
cmake llvm-project/llvm -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_ENABLE_RUNTIMES='openmp' -DCMAKE_BUILD_TYPE=Release
make -j 16

But I got a lot of errors like the following

Do you know how I can fix those?

See below :wink:

That is because clang picks up the system GCC 4.8.5 in the second build, which does not fully support C++14. In the past you could set GCC_INSTALL_PREFIX when building LLVM, but it was removed recently (Add --gcc-install-dir=, deprecate --gcc-toolchain=, and remove GCC_INSTALL_PREFIX). CCC_OVERRIDE_OPTIONS seems to be the only way to tell clang to use your own GCC. You can check clang/tools/driver/driver.cpp line 105 to see how to use that environment variable.
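A sketch of how that could look for the runtimes build (the ^ edit prepends an argument to every clang invocation; the --gcc-install-dir path is a placeholder you would replace with your own GCC installation directory):

# placeholder path: point --gcc-install-dir at your GCC's lib/gcc/<triple>/<version> directory
export CCC_OVERRIDE_OPTIONS="^--gcc-install-dir=/path/to/gcc/lib/gcc/x86_64-pc-linux-gnu/11.3.0"
make -j 16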


Okay, I checked the wrong directory before. GCC_INSTALL_PREFIX is not removed yet. You can still use that. I’m not sure if it will be removed soon, but at least it can still work.

The CMake variable GCC_INSTALL_PREFIX is discouraged. Which value do you use? It’s recommended to specify --gcc-install-dir= (e.g. --gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/10) if you have a specific GCC version requirement.

The problem is that users don’t have direct control over the second build (the runtimes build under LLVM_ENABLE_RUNTIMES); it is invoked by CMake directly. In this case, we can either tell clang which GCC to use via CCC_OVERRIDE_OPTIONS, or clang can use the one specified in GCC_INSTALL_PREFIX when it is built.

I don’t understand the request. At which stage do you specify GCC_INSTALL_PREFIX? Have you checked Clang in Gentoo now sets default runtimes via config file – Michał Górny ?

At which stage do you specify GCC_INSTALL_PREFIX ?

Typical CMake arguments we recommend for building LLVM and OpenMP on an HPC system would be something like:

-DLLVM_ENABLE_PROJECTS="clang"
-DLLVM_ENABLE_RUNTIMES="openmp"
-DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU"
-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_XX
-DGCC_INSTALL_PREFIX=/opt/gnu/gcc/8.4.0/

I omitted a couple of unrelated arguments. Here openmp is set in LLVM_ENABLE_RUNTIMES. In this mode, both clang and llvm are built first, and then openmp is configured (with CMake) and built. During that configuration, the clang that was just built is used as the compiler. The second stage is invoked by CMake automatically because runtimes is a CMake target at the top level. So at that stage, clang will use the GCC specified in GCC_INSTALL_PREFIX directly.

I was trying to say that asking users to specify --gcc-install-dir is not feasible in this case.

Have you checked Clang in Gentoo now sets default runtimes via config file – Michał Górny ?

I did a quick experiment on my system; it generally requires two steps:

  1. Set CLANG_CONFIG_FILE_USER_DIR when building LLVM. Otherwise there will be no default config file search directory.
  2. Create a config file under CLANG_CONFIG_FILE_USER_DIR, and put --gcc-install-dir in that cfg file (see the sketch below).
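A rough sketch of those two steps (the directory, the config file name, and the GCC path are placeholders; clang derives the file name from the driver mode, e.g. clang++.cfg when invoked as clang++):

# at LLVM configure time, add for example:
#   -DCLANG_CONFIG_FILE_USER_DIR=$HOME/.config/clang
# then create the config file for the clang++ driver mode:
mkdir -p $HOME/.config/clang
echo "--gcc-install-dir=/path/to/gcc/lib/gcc/x86_64-pc-linux-gnu/11.3.0" > $HOME/.config/clang/clang++.cfg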

It generally works, but it is not as convenient as GCC_INSTALL_PREFIX.

According to the docs, there are supposed to be default search directories (system directory and clang binary directory).

Yes, but when using the runtimes build (LLVM_ENABLE_RUNTIMES), either users don’t have write access to the system folder, or the clang binary folder is not created yet.

I see, so you are still running clang out of the build directory, because it’s being used to build OpenMP and nothing has been installed yet.

Exactly

Passing --config /path/to/config.cfg to CMAKE_CXX_FLAGS might be a way to solve this with fewer steps, but this would also load the config file when building clang/llvm (which may not be what you want). Is there a way to pass cxx flags only to the runtime builds?