CUDA error is: invalid device ordinal

Hi all,
Hopefully I can get some insights from the wider community.
My application runs fine on x86-64 + CUDA.
When I built the same version of Clang and the application on Power9 + V100, I got “CUDA error is: invalid device ordinal”. It seems that the CUDA plugin got device 0 but failed to create a context. I have pasted the debug and nvprof output at the end of this email.

I used the same compiler to build a small test program. It runs fine.
What can be a potential cause of this CUDA error?

Ye

Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget → Entering data begin region for device -1 with 1 mappings
Libomptarget → Use default device id 0
Libomptarget → Checking whether device 0 is ready.
Libomptarget → Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL → Init requires flags to 1
Target CUDA RTL → Getting device 0
Target CUDA RTL → Error returned from cuCtxCreate
Target CUDA RTL → CUDA error is: invalid device ordinal
Libomptarget → Failed to init device 0
Libomptarget → Device 0 is not ready.
Libomptarget → Failed to get device 0 ready
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
==176195== Profiling application: …/…/…/…/bin/qmcpack qmc_short_vmcbatch.in.xml
Libomptarget → Unloading target library!
Libomptarget → Image 0x00000000107b6470 is compatible with RTL 0x000000003b329020!
Libomptarget → Unregistered image 0x00000000107b6470 from RTL 0x000000003b329020!
Libomptarget → Done unregistering images!
Libomptarget → Removing translation table for descriptor 0x0000000010900318
Libomptarget → Done unregistering library!
Libomptarget → Deinit target library!
==176195== Profiling result:
No kernels were profiled.
Type        Time(%)      Time  Calls       Avg       Min       Max  Name
API calls:   87.10%  1.75034s      7  250.05ms  250.00ms  250.28ms  cudaFree
             12.02%  241.59ms      1  241.59ms  241.59ms  241.59ms  cuDevicePrimaryCtxRelease
              0.42%  8.4971ms      1  8.4971ms  8.4971ms  8.4971ms  cuCtxCreate
              0.31%  6.1826ms      3  2.0609ms  827.87us  3.7271ms  cuModuleUnload
              0.08%  1.5932ms     97  16.424us     241ns  652.53us  cuDeviceGetAttribute
              0.05%  1.0525ms      1  1.0525ms  1.0525ms  1.0525ms  cuDeviceTotalMem
              0.01%  209.36us      1  209.36us  209.36us  209.36us  cuDeviceGetName
              0.00%  73.862us      7  10.551us  4.6310us  28.909us  cudaSetDevice
              0.00%  4.3990us      3  1.4660us     543ns  2.6840us  cuDeviceGet
              0.00%  3.9920us      1  3.9920us  3.9920us  3.9920us  cuDeviceGetPCIBusId
              0.00%  3.0740us      1  3.0740us  3.0740us  3.0740us  cudaGetDeviceCount
              0.00%  3.0000us      4     750ns     407ns  1.2090us  cuDeviceGetCount
              0.00%  2.1410us      1  2.1410us  2.1410us  2.1410us  cuInit
              0.00%  2.1080us      1  2.1080us  2.1080us  2.1080us  cuDriverGetVersion
              0.00%  1.9570us      1  1.9570us  1.9570us  1.9570us  cuGetErrorString
              0.00%  1.2870us      1  1.2870us  1.2870us  1.2870us  cuCtxSetCurrent
              0.00%     393ns      1     393ns     393ns     393ns  cuDeviceGetUuid
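
For anyone triaging similar output, here is a minimal sketch of a helper that scans libomptarget debug text and reports the first plugin call that failed. It is not part of libomptarget or any LLVM tool. The names `LINE_RE` and `first_failure` are made up for illustration, and it assumes only the line format visible in the log above ("Error returned from <call>" followed by a "... error is: <message>" line).

```python
import re

# Matches debug lines such as "Target CUDA RTL → Getting device 0".
# Both the "→" shown in this post and an ASCII "-->" arrow are accepted (assumption).
LINE_RE = re.compile(
    r"^(?P<source>Libomptarget|Target \w+ RTL)\s*(?:\u2192|-->)\s*(?P<msg>.*)$"
)

ERROR_PREFIX = "Error returned from"


def first_failure(log_text):
    """Return (source, failing_call, detail) for the first failed call, or None."""
    records = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            records.append((match.group("source"), match.group("msg")))

    for index, (source, msg) in enumerate(records):
        if msg.startswith(ERROR_PREFIX):
            call = msg[len(ERROR_PREFIX):].strip()
            detail = None
            # The human-readable message is expected on the following debug line.
            if index + 1 < len(records) and "error is:" in records[index + 1][1]:
                detail = records[index + 1][1].split("error is:", 1)[1].strip()
            return source, call, detail
    return None


# Excerpt copied from the debug output in this post.
SAMPLE = """\
Libomptarget → Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL → Init requires flags to 1
Target CUDA RTL → Getting device 0
Target CUDA RTL → Error returned from cuCtxCreate
Target CUDA RTL → CUDA error is: invalid device ordinal
Libomptarget → Failed to init device 0
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
"""

if __name__ == "__main__":
    print(first_failure(SAMPLE))
```

On the excerpt it reports that `cuCtxCreate` in the CUDA plugin is the first failing call. It only locates the failure and says nothing about the cause.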

It is on the Summit supercomputer. I will ask the administrators for help.
Ye

@Alexey Why do you think it is a CUDA error and not a race in libomptarget?

@Ye Can we run this on a different system too?

@Johannes This is not related to the race I mentioned to you yesterday. The error shows up at the very beginning of execution, in the serial part, and appears only on Summit.

I can run the full code on x86_64 with CUDA 11 installed. On x86_64 with CUDA < 11, the code passes that point but stops in another place.
Ye

Sounds “good”. So we “just” need to figure out which of these it is ;)

@Ye, you’ll talk to the Summit admins, correct?

I will probably try a local P9+V100 machine first.

Ye

No need to bother OLCF support. I resolved the problem while investigating another issue.
The problem was caused by libomptarget.

https://reviews.llvm.org/D82718

Ye