I cannot get a simple loop like the one below
#pragma omp target parallel for map(to: nranks)
for ( thread rank = 0; rank < nranks;
to offload to the device at run time. Instead, I get the following output:
Libomptarget → Entering target region with entry point 0x000010000191abd6 and device Id -1
Libomptarget → Checking whether device 0 is ready.
Libomptarget → Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL → Init requires flags to 1
Target CUDA RTL → Getting device 0
Target CUDA RTL → Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL → Using 1024 CUDA threads per block
Target CUDA RTL → Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL → Default number of teams set according to library’s default 128
Target CUDA RTL → Default number of threads set according to library’s default 128
Libomptarget → Device 0 is ready to use.
Target CUDA RTL → Load data from image 0x000000001006bc00
Target CUDA RTL → CUDA module successfully loaded!
Target CUDA RTL → Sending global device environment data 4 bytes
Libomptarget → Unable to generate entries table for device id 0.
Libomptarget → Failed to init globals on device 0
Libomptarget → Failed to get device 0 ready
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
In a simple standalone test program the same construct offloads as expected,
so I am stuck at the moment. Any help would be appreciated.
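
For comparison, a standalone test of this kind might look like the sketch below. It is not the actual application code: the function name run_offload_test, the output array out, and the loop body are hypothetical and only illustrate the same "target parallel for" construct with nranks mapped to the device.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: fills out[0..nranks-1]
   with 2*rank inside a target region, then sums the results on the host. */
int run_offload_test(int nranks, int *out)
{
    /* nranks is mapped to the device as in the pragma above; out is
       mapped back (an assumption) so the host can check the results. */
    #pragma omp target parallel for map(to: nranks) map(from: out[0:nranks])
    for (int rank = 0; rank < nranks; ++rank) {
        out[rank] = 2 * rank;
    }

    /* Host-side reduction over the values produced by the loop. */
    int sum = 0;
    for (int rank = 0; rank < nranks; ++rank) {
        sum += out[rank];
    }
    return sum;
}
```

Calling this function from main in a single source file, built with the same offloading flags as the failing application, is one way to check whether the problem lies in the toolchain setup or in how the larger code is compiled and linked.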