What should `omp_get_num_devices()` return?

Hi,

I have a general question about the behavior of omp_get_num_devices(), which compiler vendors seem to interpret differently (I tried Clang/LLVM and IBM XL).

Consider this simple example:

#include <stdio.h>
#include <omp.h>

int main() {
    /* omp_get_num_devices() returns int, so print it with %d */
    printf("num devices: %d\n", omp_get_num_devices());
    return 0;
}

With Clang, when this program is compiled for the host only (-fopenmp), omp_get_num_devices() returns 0, even if there are GPUs in the system. If it is compiled with -fopenmp-targets=nvptx64, it returns a value > 0 (provided there are GPUs in the system, of course).

With IBM XL, it seems to always return a value > 0 if there are GPUs in the system, whether or not you compile with offload support.
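To see the difference on a given compiler, it can help to print the reported device count next to where a target region actually runs. The following is only a minimal sketch: the helper names default_target_runs_on_host() and report_offload_status() are made up for illustration, and the fallback stubs assume that a build without OpenMP should behave like a host-only system.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumption: without OpenMP, behave like a host-only system. */
static int omp_get_num_devices(void) { return 0; }
static int omp_is_initial_device(void) { return 1; }
#endif

/* Hypothetical helper: returns 1 if a target region on the default
   device executes on the host, 0 if it executes on an offload device. */
static int default_target_runs_on_host(void) {
    int on_host = 1;
#pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device() ? 1 : 0;
    }
    return on_host;
}

/* Hypothetical helper: prints the reported device count together with
   the place where the default target region really ran. */
static void report_offload_status(void) {
    int num_dev = omp_get_num_devices();
    assert(num_dev >= 0);
    printf("num devices: %d, default target region ran on the %s\n",
           num_dev, default_target_runs_on_host() ? "host" : "device");
}
```

Calling report_offload_status() from main() in a host-only build and in an offload build should show whether a compiler reports devices that the binary cannot actually use.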

It looks like the OpenMP spec is not clear about what the right behavior is in this case. I can see that both the Clang and the IBM XL approaches are reasonable (especially if this is implementation-defined). However, this can cause ambiguity in tests that expect one behavior or the other.

Consider this SOLLVE test: https://raw.githubusercontent.com/SOLLVE/sollve_vv/master/tests/4.5/target_teams_distribute_parallel_for/test_target_teams_distribute_parallel_for_devices.c:

int num_dev = omp_get_num_devices();

for (dev = 0; dev < num_dev; ++dev) {
#pragma omp target enter data map(to: a[0:SIZE_N]) device(dev)
}

for (dev = 0; dev < num_dev; ++dev) {
#pragma omp target teams distribute parallel for device(dev) map(tofrom: isHost)
  for (i = 0; i < SIZE_N; i++) {
    if (omp_get_team_num() == 0 && omp_get_thread_num() == 0) {
      isHost[dev] = omp_is_initial_device(); // Checking if running on a device
    }
    a[i] += dev;
  }
}

for (dev = 0; dev < num_dev; ++dev) {
#pragma omp target exit data map(from: a[0:SIZE_N]) device(dev)
  OMPVV_INFOMSG("Device %d ran on the %s", dev, isHost[dev] ? "host" : "device");
  OMPVV_TEST_AND_SET(errors, isHost[dev] && dev != omp_get_initial_device());
  for (i = 0; i < SIZE_N; i++) {
    OMPVV_TEST_AND_SET(errors, a[i] != 1 + dev);
  }
}

This test fails with IBM XL when compiled for the host only, because the reported number of devices is > 0. It passes with Clang because the reported number of devices is 0, so the device loops are simply skipped.

Of course the test can be changed, but I have seen this pattern quite often in different benchmarks, and this difference makes it harder to write a test that works both on host and device (I know the OpenMP spec doesn't guarantee this) and across different compilers.
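One way such a test could be changed is to stop relying on omp_get_num_devices() alone and instead probe each reported device with a small target region. This is only a sketch under assumptions: count_offload_devices() is a hypothetical helper, not part of SOLLVE or of any compiler runtime, and the fallback stubs assume that a build without OpenMP should look like a host-only system.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumption: without OpenMP, behave like a host-only system. */
static int omp_get_num_devices(void) { return 0; }
static int omp_is_initial_device(void) { return 1; }
#endif

/* Hypothetical helper: counts the reported devices on which a target
   region really runs off the host. Devices whose target region falls
   back to the host are not counted. */
static int count_offload_devices(void) {
    int num_dev = omp_get_num_devices();
    int usable = 0;
    assert(num_dev >= 0);
    for (int dev = 0; dev < num_dev; ++dev) {
        int on_host = 1;
#pragma omp target device(dev) map(tofrom: on_host)
        {
            on_host = omp_is_initial_device() ? 1 : 0;
        }
        if (!on_host) {
            ++usable;
        }
    }
    return usable;
}
```

A test could then skip its device loops, or report a warning instead of an error, when count_offload_devices() returns 0. With that guard, a host-only build should give the same result whether the compiler reports zero devices or reports devices that fall back to the host.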

I think it is always nice to have behavior that is as consistent as possible across compilers, but maybe I am missing something, so I wanted to hear your thoughts.

Thanks.

Simone