Target architecture does not support unified addressing

Hi,
Using a pragma like the one below:

$ cat tmp.cpp
#pragma omp requires unified_shared_memory

int main() {
}

produces an error on a POWER8-based system with P100 devices (which support unified memory).

$ clang++ -fopenmp -fopenmp-targets=nvptx64 tmp.cpp
tmp.cpp:1:22: error: Target architecture does not support unified addressing
#pragma omp requires unified_shared_memory
^
1 error generated.

Clang was built locally and natively with the appropriate capability, so
what does this error mean?

I’ve been building trunk Clang locally, targeting the P100 device attached to the host. Should I check the toolchain?

Thank you, Alexey. Now I am seeing:

$ clang++ -fopenmp -fopenmp-targets=nvptx64 tmp.cpp
tmp.cpp:1:22: error: Target architecture sm_60 does not support unified addressing
#pragma omp requires unified_shared_memory
^
1 error generated.

The P100 is an sm_60 device, but it supports unified memory. Is a requirement of sm_70 or greater
enforced in Clang?

Can you briefly explain why sm_60, while capable of handling unified addressing, is not supported in Clang?

Doru, can you jump into this discussion?

Hi Itaru,

We did not test those features on an sm_60 machine such as a Pascal GPU, so I can’t guarantee they will work. I suggest you enable it locally and see how it performs.
You only need to make a small change in “void CGOpenMPRuntimeNVPTX::checkArchForUnifiedAddressing(const OMPRequiresDecl *D)” to allow sm_60 to be accepted as a valid target (see the sketch after this message).

Thanks,

–Doru
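For reference, a rough sketch of the kind of change Doru describes, assuming the pre-refactoring CGOpenMPRuntimeNVPTX.cpp layout; the exact case list, helper names, and diagnostic text may differ in current trunk:

// Sketch only: CGM, getCudaArch and the CudaArch enum come from the
// surrounding Clang sources and are assumed here, not reproduced.
void CGOpenMPRuntimeNVPTX::checkArchForUnifiedAddressing(
    const OMPRequiresDecl *D) {
  for (const OMPClause *Clause : D->clauselists()) {
    if (Clause->getClauseKind() == OMPC_unified_shared_memory) {
      switch (getCudaArch(CGM)) {
      // Architectures below sm_60 keep rejecting the clause.
      case CudaArch::SM_20:
      case CudaArch::SM_21:
      case CudaArch::SM_30:
      case CudaArch::SM_32:
      case CudaArch::SM_35:
      case CudaArch::SM_37:
      case CudaArch::SM_50:
      case CudaArch::SM_52:
      case CudaArch::SM_53:
        CGM.Error(Clause->getBeginLoc(),
                  "Target architecture does not support unified addressing");
        return;
      // The change: treat Pascal (sm_60/61/62) like sm_70 and newer, so the
      // 'unified_shared_memory' requirement is accepted instead of erroring.
      case CudaArch::SM_60:
      case CudaArch::SM_61:
      case CudaArch::SM_62:
      default:
        break;
      }
    }
  }
}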

Doru,

Would you mind adding sm_60 support to trunk so I can continuously test the device with the POWER8 host I have access to?

Doru,
What’s the current way of enabling unified addressing support for the sm_60 CUDA architecture?
The code has been modified since we last exchanged messages.

Executing shared_update.c on the P100 results in errors:

==130340== NVPROF is profiling process 130340, command: ./a.out
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
==130340== Profiling application: ./a.out
==130340== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==130340== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 89.68% 40.950us 2 20.475us 18.103us 22.847us [CUDA memcpy DtoH]
10.32% 4.7100us 1 4.7100us 4.7100us 4.7100us [CUDA memcpy HtoD]
API calls: 69.95% 400.85ms 1 400.85ms 400.85ms 400.85ms cuCtxCreate
15.17% 86.932ms 1 86.932ms 86.932ms 86.932ms cuStreamSynchronize
12.11% 69.398ms 1 69.398ms 69.398ms 69.398ms cuCtxDestroy
2.68% 15.375ms 1 15.375ms 15.375ms 15.375ms cuModuleLoadDataEx
0.06% 363.13us 32 11.347us 754ns 171.53us cuStreamCreate
0.01% 48.938us 2 24.469us 19.581us 29.357us cuMemcpyDtoH
0.00% 22.184us 1 22.184us 22.184us 22.184us cuLaunchKernel
0.00% 7.6760us 1 7.6760us 7.6760us 7.6760us cuMemcpyHtoD
0.00% 4.7430us 32 148ns 113ns 520ns cuStreamDestroy
0.00% 2.9060us 3 968ns 562ns 1.5750us cuModuleGetGlobal
0.00% 2.8940us 2 1.4470us 336ns 2.5580us cuModuleGetFunction
0.00% 2.8250us 3 941ns 181ns 2.2050us cuDeviceGetCount
0.00% 2.6040us 2 1.3020us 965ns 1.6390us cuDeviceGet
0.00% 2.4200us 5 484ns 137ns 882ns cuCtxSetCurrent
0.00% 1.6450us 6 274ns 117ns 671ns cuDeviceGetAttribute
0.00% 804ns 1 804ns 804ns 804ns cuFuncGetAttribute
0.00% 296ns 1 296ns 296ns 296ns cuModuleUnload
======== Error: Application returned non-zero code 1
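For context, shared_update.c here is assumed to be the libomptarget unified_shared_memory test of that name; the pattern it exercises is roughly the following (an illustrative sketch, not the actual test source):

#include <stdio.h>

#pragma omp requires unified_shared_memory

int main() {
  int host_data = 1;
  int *p = &host_data;

  // Under unified shared memory the target region dereferences the host
  // pointer directly; no explicit map of *p should be needed for the
  // device-side update to be visible on the host afterwards.
  #pragma omp target
  { *p = 2; }

  printf("host_data = %d (expected 2)\n", host_data);
  return host_data == 2 ? 0 : 1;
}

Built with the same clang++ -fopenmp -fopenmp-targets=nvptx64 invocation as above, this should print 2 when unified shared memory works end to end.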

deviceQuery returns:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Tesla P100-SXM2-16GB”
CUDA Driver Version / Runtime Version 10.1 / 8.0
CUDA Capability Major/Minor version number: 6.0
Total amount of global memory: 16281 MBytes (17071734784 bytes)
(56) Multiprocessors, ( 64) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1481 MHz (1.48 GHz)
Memory Clock rate: 715 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 5 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes

Setting LIBOMPTARGET_DEBUG=1, on POWER8 with P100 GPUs I get:

$ ./a.out
Libomptarget → Loading RTLs…
Libomptarget → Loading library ‘libomptarget.rtl.ppc64.so’…
Libomptarget → Successfully loaded library ‘libomptarget.rtl.ppc64.so’!
Libomptarget → Registering RTL libomptarget.rtl.ppc64.so supporting 4 devices!
Libomptarget → Loading library ‘libomptarget.rtl.x86_64.so’…
Libomptarget → Unable to load library ‘libomptarget.rtl.x86_64.so’: libomptarget.rtl.x86_64.so: cannot open shared object file: No such file or directory!
Libomptarget → Loading library ‘libomptarget.rtl.cuda.so’…
Target CUDA RTL → Start initializing CUDA
Libomptarget → Successfully loaded library ‘libomptarget.rtl.cuda.so’!
Libomptarget → Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget → Loading library ‘libomptarget.rtl.aarch64.so’…
Libomptarget → Unable to load library ‘libomptarget.rtl.aarch64.so’: libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget → RTLs loaded!
Libomptarget → Image 0x0000000010001300 is NOT compatible with RTL libomptarget.rtl.ppc64.so!
Libomptarget → Image 0x0000000010001300 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget → RTL 0x0000010001b6d860 has index 0!
Libomptarget → Registering image 0x0000000010001300 with RTL libomptarget.rtl.cuda.so!
Libomptarget → Done registering entries!
Libomptarget → New requires flags 8 compatible with existing 8!
Libomptarget → Call to omp_get_num_devices returning 1
Libomptarget → Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget → Entering target region with entry point 0x0000000010001110 and device Id -1
Libomptarget → Checking whether device 0 is ready.
Libomptarget → Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL → Init requires flags to 8
Target CUDA RTL → Getting device 0
Target CUDA RTL → Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL → Using 1024 CUDA threads per block
Target CUDA RTL → Using warp size 32
Target CUDA RTL → Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL → Default number of teams set according to library’s default 128
Target CUDA RTL → Default number of threads set according to library’s default 128
Libomptarget → Device 0 is ready to use.
Target CUDA RTL → Load data from image 0x0000000010001300
Target CUDA RTL → CUDA module successfully loaded!
Target CUDA RTL → Entry point 0x0000000000000000 maps to __omp_offloading_46_804afcb6_main_l41 (0x0000110000350fd0)
Target CUDA RTL → Entry point 0x0000000000000001 maps to __omp_offloading_46_804afcb6_main_l89 (0x0000110000361810)
Target CUDA RTL → Sending global device environment data 4 bytes
Libomptarget → Entry 0: Base=0x00003ffff55df0b0, Begin=0x00003ffff55df0b0, Size=8, Type=0x23
Libomptarget → Entry 1: Base=0x00003ffff55de0a8, Begin=0x00003ffff55de0a8, Size=4096, Type=0x223
Libomptarget → Entry 2: Base=0x00003ffff55df0c0, Begin=0x00003ffff55df0c0, Size=8, Type=0x23
Libomptarget → Entry 3: Base=0x0000010001bd1a80, Begin=0x0000010001bd1a80, Size=0, Type=0x220
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0b0, Size=8)…
Libomptarget → Return HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0b0 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55de0a8, Size=4096)…
Libomptarget → Return HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount= updated
Libomptarget → There are 4096 bytes allocated at target address 0x00003ffff55de0a8 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0c0, Size=8)…
Libomptarget → Return HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0c0 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x0000010001bd1a80, Size=0)…
Libomptarget → There are 0 bytes allocated at target address 0x0000000000000000 - is not new
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0b0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount=
Libomptarget → Obtained target argument 0x00003ffff55df0b0 from host pointer 0x00003ffff55df0b0
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55de0a8, Size=4096)…
Libomptarget → Get HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount=
Libomptarget → Obtained target argument 0x00003ffff55de0a8 from host pointer 0x00003ffff55de0a8
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0c0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount=
Libomptarget → Obtained target argument 0x00003ffff55df0c0 from host pointer 0x00003ffff55df0c0
Libomptarget → Looking up mapping(HstPtrBegin=0x0000010001bd1a80, Size=0)…
Libomptarget → Get HstPtrBegin 0x0000010001bd1a80 Size=0 RefCount=
Libomptarget → Obtained target argument 0x0000010001bd1a80 from host pointer 0x0000010001bd1a80
Libomptarget → Launching target execution __omp_offloading_46_804afcb6_main_l41 with pointer 0x0000110000322840 (index=0).
Target CUDA RTL → Setting CUDA threads per block to requested 1
Target CUDA RTL → Adding master warp: +32 threads
Target CUDA RTL → Using requested number of teams 1
Target CUDA RTL → Launch kernel with 1 blocks and 33 threads
Target CUDA RTL → Launch of entry point at 0x0000110000322840 successful!
Libomptarget → Looking up mapping(HstPtrBegin=0x0000010001bd1a80, Size=0)…
Libomptarget → Get HstPtrBegin 0x0000010001bd1a80 Size=0 RefCount= updated
Libomptarget → There are 0 bytes allocated at target address 0x0000010001bd1a80 - is not last
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0c0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0c0 - is not last
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55de0a8, Size=4096)…
Libomptarget → Get HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount= updated
Libomptarget → There are 4096 bytes allocated at target address 0x00003ffff55de0a8 - is not last
Libomptarget → Looking up mapping(HstPtrBegin=0x00003ffff55df0b0, Size=8)…
Libomptarget → Get HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount= updated
Libomptarget → There are 8 bytes allocated at target address 0x00003ffff55df0b0 - is not last
Target CUDA RTL → Error when synchronizing stream. stream = 0x00001100002cd7c0, async info ptr = 0x00003ffff55dddf8
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuStreamDestroy
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Target CUDA RTL → Error returned from cuModuleUnload
Target CUDA RTL → CUDA error is: an illegal memory access was encountered
Libomptarget → Unloading target library!
Libomptarget → Image 0x0000000010001300 is compatible with RTL 0x0000010001b6d860!
Libomptarget → Unregistered image 0x0000000010001300 from RTL 0x0000010001b6d860!
Libomptarget → Done unregistering images!
Libomptarget → Removing translation table for descriptor 0x00000000100193e8
Libomptarget → Done unregistering library!
Libomptarget → Deinit target library!