Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Hi,

Today I installed llvm-trunk. Unfortunately, one of my programs now fails
with an error.

loki introduction 110 clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda dot_prod_accelerator_OpenMP.c
loki introduction 111 a.out
Number of processors: 24
Number of devices: 1
Default device: 0
Is initial device: 1
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

loki introduction 112 setenv OMP_DEFAULT_DEVICE 1
loki introduction 113 a.out
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

loki introduction 114 clang -v
clang version 8.0.0 (trunk 343447)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/llvm-trunk/bin
Found candidate GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
Selected GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-9.0, version 9.0
loki introduction 115

The program works fine with llvm-7.0.0.

loki introduction 125 clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda dot_prod_accelerator_OpenMP.c
loki introduction 126 a.out
Number of processors: 24
Number of devices: 1
Default device: 0
Is initial device: 1
sum = 6.000000e+08

loki introduction 127 setenv OMP_DEFAULT_DEVICE 1
loki introduction 128 a.out
Number of processors: 24
Number of devices: 1
Default device: 1
Is initial device: 1
sum = 6.000000e+08

loki introduction 129 clang -v
clang version 7.0.0 (tags/RELEASE_700/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/llvm-7.0.0/bin
Found candidate GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
Selected GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-9.0, version 9.0
loki introduction 130

Hopefully somebody can fix the problem. Do you need anything else to locate the error? Thank you very much in advance for any help.

Kind regards

Siegmar

dot_prod_accelerator_OpenMP.c (1.95 KB)

Hi Siegmar,

Apparently your application fails to offload to the GPU. And because offloading is mandatory (that’s the default behavior) the library terminates the application.

Can you compile libomptarget in debug mode and run the app with LIBOMPTARGET_DEBUG=1 to see the debug output? That will help us identify the problem.

George

Hi Siegmar,

In addition to the explanation by George:

After the 7.0 release, the handling of OMP_TARGET_OFFLOAD was fixed to follow the OpenMP 5.0 behavior.
In the 7.0 release, if execution on the device fails for whatever reason (a signal, the device being used exclusively by another user, ...), the target region is executed on the host instead.

Now, if offloading fails, you get an error and execution aborts.
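
To check from within a program where a target region actually ran, you can query the runtime inside the region. This is only a minimal sketch: the helper name target_ran_on_host is made up for illustration, and the #else branch is a stand-in so that the file still builds without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumption: stand-in so the sketch builds without -fopenmp (host only). */
static int omp_is_initial_device(void) { return 1; }
#endif

/* Hypothetical helper: returns 1 if the target region was executed by the
 * host, 0 if it was executed on an offload device. */
int target_ran_on_host(void)
{
    int on_host = 1;

    /* on_host is only copied back, so it is assigned inside the region. */
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
    return on_host;
}
```

With the 7.0 behavior described above, a failed offload would silently fall back to the host and the helper would return 1. With trunk, the same failure ends the program with the "offloading is mandatory" error instead of returning.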

Best
Joachim

A bit late to the party: You can also run under cuda-gdb. This should give you the CUDA error message.

Based on the filename (dot_prod_accelerator_OpenMP.c) I'd guess that you are using reduction() on a teams directive. This doesn't work yet, see
http://clang.llvm.org/docs/OpenMPSupport.html#features-not-supported-or-with-limited-support-for-cuda-devices

Jonas

Hi Siegmar,

Apparently your application fails to offload to the GPU. And because
offloading is mandatory (that's the default behavior) the library
terminates the application.

(pedantic on; the default is "default" which is raised to "mandatory" iff libomptarget can find devices; pedantic off)
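
The rule in that remark can be written down as a small model. This is not libomptarget code, just a sketch of the rule as stated here: the enum and function names are made up, the environment value is compared in upper case only for simplicity, and the case "default with no devices" is left as "default" because this thread does not say what happens then.

```c
#include <string.h>

/* Policies that OMP_TARGET_OFFLOAD can name in OpenMP 5.0. */
enum offload_policy { POLICY_DISABLED, POLICY_DEFAULT, POLICY_MANDATORY };

/* Hypothetical model of the rule described above.
 * env_value:   value of OMP_TARGET_OFFLOAD, or NULL if it is not set.
 * num_devices: number of offload devices the runtime found. */
enum offload_policy effective_policy(const char *env_value, int num_devices)
{
    enum offload_policy requested = POLICY_DEFAULT;

    if (env_value != NULL) {
        if (strcmp(env_value, "MANDATORY") == 0)
            requested = POLICY_MANDATORY;
        else if (strcmp(env_value, "DISABLED") == 0)
            requested = POLICY_DISABLED;
    }

    /* "default" is raised to "mandatory" iff devices were found. */
    if (requested == POLICY_DEFAULT && num_devices > 0)
        return POLICY_MANDATORY;

    return requested;
}
```

For Siegmar's runs (variable not set, one device found) the model gives POLICY_MANDATORY, which matches the debug line "Default TARGET OFFLOAD policy is now mandatory".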

Hi George,

thank you very much for your suggestions.

loki introduction 115 clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda dot_prod_accelerator_OpenMP.c
loki introduction 116 a.out
Number of processors: 24
Number of devices: 1
Default device: 0
Is initial device: 1
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

loki introduction 117 setenv LIBOMPTARGET_DEBUG 1
loki introduction 118 a.out
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000602090 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000602090 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x00000000609f95d0 has index 0!
Libomptarget --> Registering image 0x0000000000602090 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devicew were found)
Libomptarget --> Entering target region with entry point 0x00000000004012d0 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000000602090
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_2b_1890d30_main_l48 (0x0000000060f23320)
Target CUDA RTL --> Entry point 0x0000000000000001 maps to __omp_offloading_2b_1890d30_main_l67 (0x0000000060f27c70)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> Entry 0: Base=0x0000000000613bf0, Begin=0x0000000000613bf0, Size=800000000, Type=0x22
Libomptarget --> Entry 1: Base=0x00000000301043f0, Begin=0x00000000301043f0, Size=800000000, Type=0x22
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
Libomptarget --> Creating new map entry: HstBase=0x0000000000613bf0, HstBegin=0x0000000000613bf0, HstEnd=0x00000000301043f0, TgtBegin=0x0000000b08c20000
Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b08c20000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
Libomptarget --> Creating new map entry: HstBase=0x00000000301043f0, HstBegin=0x00000000301043f0, HstEnd=0x000000005fbf4bf0, TgtBegin=0x0000000b38720000
Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b38720000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000, RefCount=1
Libomptarget --> Obtained target argument 0x0000000b08c20000 from host pointer 0x0000000000613bf0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000, RefCount=1
Libomptarget --> Obtained target argument 0x0000000b38720000 from host pointer 0x00000000301043f0
Libomptarget --> Launching target execution __omp_offloading_2b_1890d30_main_l48 with pointer 0x0000000060ee2ee0 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launch kernel with 1 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x0000000060ee2ee0 successful!
Target CUDA RTL --> Kernel execution at 0x0000000060ee2ee0 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000, updated RefCount=1
Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b38720000 - is last
Libomptarget --> Moving 800000000 bytes (tgt:0x0000000b38720000) -> (hst:0x00000000301043f0)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
Libomptarget --> Deleting tgt data 0x0000000b38720000 of size 800000000
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000, updated RefCount=1
Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b08c20000 - is last
Libomptarget --> Moving 800000000 bytes (tgt:0x0000000b08c20000) -> (hst:0x0000000000613bf0)
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
Libomptarget --> Deleting tgt data 0x0000000b08c20000 of size 800000000
Libomptarget --> Removing mapping with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000
Libomptarget --> Call to omp_get_num_devices returning 1
Number of processors: 24
Number of devices: 1
Default device: 0
Is initial device: 1
Libomptarget --> Entering target region with entry point 0x00000000004012d1 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry 0: Base=0x0000000000613bf0, Begin=0x0000000000613bf0, Size=800000000, Type=0x21
Libomptarget --> Entry 1: Base=0x00000000301043f0, Begin=0x00000000301043f0, Size=800000000, Type=0x21
Libomptarget --> Entry 2: Base=0x00007fff707a86e8, Begin=0x00007fff707a86e8, Size=8, Type=0x23
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
Libomptarget --> Creating new map entry: HstBase=0x0000000000613bf0, HstBegin=0x0000000000613bf0, HstEnd=0x00000000301043f0, TgtBegin=0x0000000b08c20000
Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b08c20000 - is new
Libomptarget --> Moving 800000000 bytes (hst:0x0000000000613bf0) -> (tgt:0x0000000b08c20000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
Libomptarget --> Creating new map entry: HstBase=0x00000000301043f0, HstBegin=0x00000000301043f0, HstEnd=0x000000005fbf4bf0, TgtBegin=0x0000000b38720000
Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b38720000 - is new
Libomptarget --> Moving 800000000 bytes (hst:0x00000000301043f0) -> (tgt:0x0000000b38720000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007fff707a86e8, Size=8)...
Libomptarget --> Creating new map entry: HstBase=0x00007fff707a86e8, HstBegin=0x00007fff707a86e8, HstEnd=0x00007fff707a86f0, TgtBegin=0x0000000b68220000
Libomptarget --> There are 8 bytes allocated at target address 0x0000000b68220000 - is new
Libomptarget --> Moving 8 bytes (hst:0x00007fff707a86e8) -> (tgt:0x0000000b68220000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000, RefCount=1
Libomptarget --> Obtained target argument 0x0000000b08c20000 from host pointer 0x0000000000613bf0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000, RefCount=1
Libomptarget --> Obtained target argument 0x0000000b38720000 from host pointer 0x00000000301043f0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007fff707a86e8, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007fff707a86e8, TgtPtrBegin=0x0000000b68220000, Size=8, RefCount=1
Libomptarget --> Obtained target argument 0x0000000b68220000 from host pointer 0x00007fff707a86e8
Libomptarget --> Launching target execution __omp_offloading_2b_1890d30_main_l67 with pointer 0x0000000060ee2e70 (index=1).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launch kernel with 1 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x0000000060ee2e70 successful!
Target CUDA RTL --> Kernel execution error at 0x0000000060ee2e70!
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
Libomptarget --> Executing target region abort target.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000602090 is compatible with RTL 0x00000000609f95d0!
Libomptarget --> Unregistered image 0x0000000000602090 from RTL 0x00000000609f95d0!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000613b90
Libomptarget --> Done unregistering library!
Target CUDA RTL --> Error when unloading CUDA module
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
loki introduction 119

Thank you very much for your help in advance.

Best regards

Siegmar

Hi Siegmar,

Something goes wrong during the execution of the second target region. The only thing I suspect is the reduction. I don’t have access to a workstation right now to test this, though…
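
For reading the "Entry" lines in the debug output, the Type= field says how each argument is mapped. A small decoder sketch follows. The flag values are an assumption taken from libomptarget's omptarget.h (TO = 0x001, FROM = 0x002, TARGET_PARAM = 0x020), so please check them against your checkout. The helper name is made up, only these three flags are handled, and the static buffer makes it unsuitable for threaded use.

```c
#include <stdio.h>
#include <string.h>

/* Assumed flag values (from libomptarget's omptarget.h; verify locally). */
enum {
    MAP_TO           = 0x001,
    MAP_FROM         = 0x002,
    MAP_TARGET_PARAM = 0x020
};

/* Hypothetical helper: turns a Type= value into e.g. "TARGET_PARAM|TO". */
const char *describe_map_type(unsigned type)
{
    static const struct { unsigned bit; const char *name; } flags[] = {
        { MAP_TARGET_PARAM, "TARGET_PARAM" },
        { MAP_TO,           "TO"           },
        { MAP_FROM,         "FROM"         }
    };
    static char buf[64];

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (type & flags[i].bit) {
            size_t used = strlen(buf);
            /* Append the flag name, separated by '|' after the first one. */
            snprintf(buf + used, sizeof buf - used, "%s%s",
                     used ? "|" : "", flags[i].name);
        }
    }
    return buf;
}
```

Under that assumption, Type=0x22 in the first region is TARGET_PARAM|FROM, and Type=0x21 and Type=0x23 in the second region are TARGET_PARAM|TO and TARGET_PARAM|TO|FROM. That agrees with the data movement in the log: the first region only copies back to the host, while the second copies both arrays and an 8-byte scalar to the device before the kernel fails.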

Can anyone confirm whether thread-level (parallel) reductions are working correctly in the trunk version?

George

Nope, it's broken: "Warp Illegal Address".
I can confirm that the program works correctly when compiled with Clang 7.0.0. (@Siegmar, please ignore my previous statement about teams reductions: I didn't notice that you had attached the source code to your initial post, and it only uses parallel reductions.)

Interestingly, it works with Clang trunk when compiled with -fopenmp-cuda-force-full-runtime, so maybe the optimization that this flag turns off is not correct for reductions?
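
For readers without the attachment, the pattern in question is a reduction on a "parallel for" inside a target region. The following is only a sketch of that pattern, not Siegmar's file: the function name, the signature and the map clauses are assumptions, chosen to be consistent with the debug log (two large arrays mapped to the device, one 8-byte scalar mapped both ways, one team).

```c
/* Hypothetical reconstruction of the pattern, not the attached source. */
double dot_product_target(const double *a, const double *b, int n)
{
    double sum = 0.0;

    /* One target region containing one parallel loop. The reduction
     * combines the per-thread partial sums into sum. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(tofrom: sum)
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Built with the command line from the first post, a function like this should show whether your compiler is affected. Adding -fopenmp-cuda-force-full-runtime to that command line is the workaround mentioned above. Without -fopenmp the pragmas are ignored and the loop simply runs on the host.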

Jonas

Okay, that was easy: https://reviews.llvm.org/D52725
This works for me, please give it a try.

Jonas

Hi Siegmar,

The problem was related to parallel reductions. Jonas Hahnfeld submitted a patch (thanks, Jonas). Your code should now work. Please update your copy of libomptarget and recompile it. Let us know if there are any further issues.

George