Target architecture does not support unified addressing

Hi,
Using a pragma like the one below:

$ cat tmp.cpp
#pragma omp requires unified_shared_memory

int main() {
}

produces an error on a POWER8-based system with P100 devices (which support unified memory).

$ clang++ -fopenmp -fopenmp-targets=nvptx64 tmp.cpp
tmp.cpp:1:22: error: Target architecture does not support unified addressing
#pragma omp requires unified_shared_memory
^
1 error generated.

Clang was built locally and natively with the appropriate capability, so
what does this error mean?

I’ve been building trunk Clang locally, targeting the P100 device attached to the host. Should I check the toolchain?

Thank you, Alexey. Now I am seeing:

$ clang++ -fopenmp -fopenmp-targets=nvptx64 tmp.cpp
tmp.cpp:1:22: error: Target architecture sm_60 does not support unified addressing
#pragma omp requires unified_shared_memory
^
1 error generated.

The P100 is an sm_60 device, but it supports unified memory. Is a requirement of sm_70 or greater
enforced in Clang?

Can you briefly explain why sm_60, while capable of handling unified addressing, is not supported in Clang?

Doru, can you jump into this discussion?

Hi Itaru,

We did not test those features on an sm_60 machine such as a Pascal GPU, so I can’t guarantee they will work. I suggest you enable it locally and see how it performs.
You only need to make a small change in “void CGOpenMPRuntimeNVPTX::checkArchForUnifiedAddressing(const OMPRequiresDecl *D)” to allow sm_60 to be accepted as a valid target (see the sketch after this message).

Thanks,

–Doru
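For reference, a rough sketch of the kind of change Doru describes, assuming the pre-refactoring CGOpenMPRuntimeNVPTX.cpp layout; the exact case list, helper names, and diagnostic text may differ in current trunk:

// Sketch only: CGM, getCudaArch and the CudaArch enum come from the
// surrounding Clang sources and are assumed here, not reproduced.
void CGOpenMPRuntimeNVPTX::checkArchForUnifiedAddressing(
    const OMPRequiresDecl *D) {
  for (const OMPClause *Clause : D->clauselists()) {
    if (Clause->getClauseKind() == OMPC_unified_shared_memory) {
      switch (getCudaArch(CGM)) {
      // Architectures below sm_60 keep rejecting the clause.
      case CudaArch::SM_20:
      case CudaArch::SM_21:
      case CudaArch::SM_30:
      case CudaArch::SM_32:
      case CudaArch::SM_35:
      case CudaArch::SM_37:
      case CudaArch::SM_50:
      case CudaArch::SM_52:
      case CudaArch::SM_53:
        CGM.Error(Clause->getBeginLoc(),
                  "Target architecture does not support unified addressing");
        return;
      // The change: treat Pascal (sm_60/61/62) like sm_70 and newer, so the
      // 'unified_shared_memory' requirement is accepted instead of erroring.
      case CudaArch::SM_60:
      case CudaArch::SM_61:
      case CudaArch::SM_62:
      default:
        break;
      }
    }
  }
}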

Doru,

Would you mind adding sm_60 support to trunk so I can continuously test the device with the POWER8 host I have access to?

Doru,
What’s the current way of enabling unified addressing support for the sm_60 CUDA architecture?
The code has been modified since we last exchanged messages.

Executing shared_update.c on the P100 results in errors:

==130340== NVPROF is profiling process 130340, command: ./a.out
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
==130340== Profiling application: ./a.out
==130340== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==130340== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 89.68% 40.950us 2 20.475us 18.103us 22.847us [CUDA memcpy DtoH]
10.32% 4.7100us 1 4.7100us 4.7100us 4.7100us [CUDA memcpy HtoD]
API calls: 69.95% 400.85ms 1 400.85ms 400.85ms 400.85ms cuCtxCreate
15.17% 86.932ms 1 86.932ms 86.932ms 86.932ms cuStreamSynchronize
12.11% 69.398ms 1 69.398ms 69.398ms 69.398ms cuCtxDestroy
2.68% 15.375ms 1 15.375ms 15.375ms 15.375ms cuModuleLoadDataEx
0.06% 363.13us 32 11.347us 754ns 171.53us cuStreamCreate
0.01% 48.938us 2 24.469us 19.581us 29.357us cuMemcpyDtoH
0.00% 22.184us 1 22.184us 22.184us 22.184us cuLaunchKernel
0.00% 7.6760us 1 7.6760us 7.6760us 7.6760us cuMemcpyHtoD
0.00% 4.7430us 32 148ns 113ns 520ns cuStreamDestroy
0.00% 2.9060us 3 968ns 562ns 1.5750us cuModuleGetGlobal
0.00% 2.8940us 2 1.4470us 336ns 2.5580us cuModuleGetFunction
0.00% 2.8250us 3 941ns 181ns 2.2050us cuDeviceGetCount
0.00% 2.6040us 2 1.3020us 965ns 1.6390us cuDeviceGet
0.00% 2.4200us 5 484ns 137ns 882ns cuCtxSetCurrent
0.00% 1.6450us 6 274ns 117ns 671ns cuDeviceGetAttribute
0.00% 804ns 1 804ns 804ns 804ns cuFuncGetAttribute
0.00% 296ns 1 296ns 296ns 296ns cuModuleUnload
======== Error: Application returned non-zero code 1
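For context, shared_update.c here is assumed to be the libomptarget unified_shared_memory test of that name; the pattern it exercises is roughly the following (an illustrative sketch, not the actual test source):

#include <stdio.h>

#pragma omp requires unified_shared_memory

int main() {
  int host_data = 1;
  int *p = &host_data;

  // Under unified shared memory the target region dereferences the host
  // pointer directly; no explicit map of *p should be needed for the
  // device-side update to be visible on the host afterwards.
  #pragma omp target
  { *p = 2; }

  printf("host_data = %d (expected 2)\n", host_data);
  return host_data == 2 ? 0 : 1;
}

Built with the same clang++ -fopenmp -fopenmp-targets=nvptx64 invocation as above, this should print 2 when unified shared memory works end to end.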

deviceQuery returns:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Tesla P100-SXM2-16GB”
CUDA Driver Version / Runtime Version 10.1 / 8.0
CUDA Capability Major/Minor version number: 6.0
Total amount of global memory: 16281 MBytes (17071734784 bytes)
(56) Multiprocessors, ( 64) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1481 MHz (1.48 GHz)
Memory Clock rate: 715 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 5 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes

Setting LIBOMPTARGET_DEBUG=1, on POWER8 with P100 GPUs I get:

$ ./a.out
Libomptarget → Loading RTLs…
Libomptarget → Loading library ‘libomptarget.rtl.ppc64.so’…
Libomptarget → Successfully loaded library ‘libomptarget.rtl.ppc64.so’!
Libomptarget → Registering RTL libomptarget.rtl.ppc64.so supporting 4 devices!
Libomptarget → Loading library ‘libomptarget.rtl.x86_64.so’…
Libomptarget → Unable to load library ‘libomptarget.rtl.x86_64.so’: libomptarget.rtl.x86_64.so: cannot open shared object file: No such file or directory!
Libomptarget → Loading library ‘libomptarget.rtl.cuda.so’…
Target CUDA RTL → Start initializing CUDA
Libomptarget → Successfully loaded library ‘libomptarget.rtl.cuda.so’!
Libomptarget → Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget → Loading library ‘libomptarget.rtl.aarch64.so’…
Libomptarget → Unable to load library ‘libomptarget.rtl.aarch64.so’: libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget → RTLs loaded!
Libomptarget → Image 0x0000000010001300 is NOT compatible with RTL libomptarget.rtl.ppc64.so!
Libomptarget → Image 0x0000000010001300 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget → RTL 0x0000010001b6d860 has index 0!
Libomptarget → Registering image 0x0000000010001300 with RTL libomptarget.rtl.cuda.so!
Libomptarget → Done registering entries!
Libomptarget → New requires flags 8 compatible with existing 8!
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget → Entering target region with entry point 0x0000000010001110 and device Id -1
Libomptarget → Checking whether device 0 is ready.
Libomptarget → Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL → Init requires flags to 8
Target CUDA RTL → Getting device 0
Target CUDA RTL → Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL → Using 1024 CUDA threads per block
Target CUDA RTL → Using warp size 32
Target CUDA RTL → Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL → Default number of teams set according to library’s default 128
Target CUDA RTL → Default number of threads set according to library’s default 128
Libomptarget → Device 0 is ready to use.
Target CUDA RTL → Load data from image 0x0000000010001300
Target CUDA RTL → CUDA module successfully loaded!
Target CUDA RTL → Entry point 0x0000000000000000 maps to __omp_offloading_46_804afcb6_main_l41 (0x0000110000350fd0)
Target CUDA RTL → Entry point 0x0000000000000001 maps to __omp_offloading_46_804afcb6_main_l89 (0x0000110000361810)
Target CUDA RTL → Sending global device environment data 4 bytes
Libomptarget → Entry 0: Base=0x00003ffff55df0b0, Begin=0x00003ffff55df0b0, Size=8, Type=0x23
Libomptarget → Entry 1: Base=0x00003ffff55de0a8, Begin=0x00003ffff55de0a8, Size=4096, Type=0x223
Libomptarget → Entry 2: Base=0x00003ffff55df0c0, Begin=0x00003ffff55df0c0, Size=8, Type=0x23
Libomptarget → Entry 3: Base=0x0000010001bd1a80, Begin=0x0000010001bd1a80, Size=0, Type=0x220
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0b0, Size=8)…
Libomptarget → Return HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0b0 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55de0a8, Size=4096)…
Libomptarget → Return HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount= updated
Libomptarget → There are 4096 bytes allocated at target address 0x00003ffff55de0a8 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0c0, Size=8)…
Libomptarget → Return HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0c0 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x0000010001bd1a80, Size=0)…
Libomptarget → There are 0 bytes allocated at target address 0x0000000000000000 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0b0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount=
Libomptarget → Obtained target argument 0x00003ffff55df0b0 from host pointer 0x00003ffff55df0b0
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55de0a8, Size=4096)…
Libomptarget → Get HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount=
Libomptarget → Obtained target argument 0x00003ffff55de0a8 from host pointer 0x00003ffff55de0a8
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0c0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount=
Libomptarget → Obtained target argument 0x00003ffff55df0c0 from host pointer 0x00003ffff55df0c0
Libomptarget → Looking up mapping(HstPtrBegin=0x0000010001bd1a80, Size=0)…
Libomptarget → Get HstPtrBegin 0x0000010001bd1a80 Size=0 RefCount=
Libomptarget → Obtained target argument 0x0000010001bd1a80 from host pointer 0x0000010001bd1a80
Libomptarget → Launching target execution __omp_offloading_46_804afcb6_main_l41 with pointer 0x0000110000322840 (index=0).
Target CUDA RTL → Setting CUDA threads per block to requested 1
Target CUDA RTL → Adding master warp: +32 threads
Target CUDA RTL → Using requested number of teams 1
Target CUDA RTL → Launch kernel with 1 blocks and 33 threads
Target CUDA RTL → Launch of entry point at 0x0000110000322840 successful!
Libomptarget → Looking up mapping(HstPtrBegin=0x0000010001bd1a80, Size=0)…
Libomptarget → Get HstPtrBegin 0x0000010001bd1a80 Size=0 RefCount= updated
Libomptarget → There are 0 bytes allocated at target address 0x0000010001bd1a80 - is not last
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0c0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0c0 - is not last
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55de0a8, Size=4096)…
Libomptarget → Get HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount= updated
Libomptarget → There are 4096 bytes allocated at target address 0x00003ffff55de0a8 - is not last
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0b0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0b0 - is not last
Target CUDA RTL → Error when synchronizing stream. stream = 0x00001100002cd7c0, async info ptr = 0x00003ffff55dddf8
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuModuleUnload
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Libomptarget → Unloading target library!
Libomptarget → Image 0x0000000010001300 is compatible with RTL 0x0000010001b6d860!
Libomptarget → Unregistered image 0x0000000010001300 from RTL 0x0000010001b6d860!
Libomptarget → Done unregistering images!
Libomptarget → Removing translation table for descriptor 0x00000000100193e8
Libomptarget → Done unregistering library!
Libomptarget → Deinit target library!