OpenMP GPU shared memory

Hello everybody,

I have a question about GPU shared memory in the OpenMP implementation in LLVM.

In the paper by Grinberg, Bertolli, and Haque (Hands on with OpenMP 4.5 and Unified Memory: Developing Applications for IBM's Hybrid CPU + GPU systems (Part II), IWOMP 2017) I found section "3. Clang's Extension for OpenMP 4.5 for device On-chip Memory Allocation" and learnt that GPU shared memory can be used, in a somewhat tricky manner, through OpenMP directives. To find the compiler's limit for this static memory allocation, I looked at the source files under `openmp`. The relevant files seem to be:

1. openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
     * commit: 197b7b24
     * line: DS_Slot_Size = 256,

2. openmp/libomptarget/deviceRTLs/common/omptarget.h
     * commit: d0b9ed5c
     * line: char Data[DS_Slot_Size];

My questions are:

1. Is the hard-coded limit for GPU shared memory 256 bytes or (256 * 4) bytes? I ask because of this comment in `openmp/libomptarget/deviceRTLs/common/omptarget.h`:

// Additional master slot type which is initialized with the default master slot
// size of 4 bytes.

2. Could we enlarge this limit to, e.g., 512 bytes or even 1024 bytes? Going by the hardware specification of NVIDIA ("green") GPUs, if we assume 48 KB of shared memory per multiprocessor and at most 32 thread blocks (or contention groups) resident on one multiprocessor, this limit could be as large as 1536 bytes (48 KB / 32), couldn't it?

3. How can we verify that the static memory allocation ends up in GPU shared memory (not in global memory) when an OpenMP source file is compiled by Clang/LLVM? My current approach is to look at the generated assembly code (`-S`), which is not really convenient. It would be good if the compiler could print a message or give a short report during compilation.

Thank you in advance!

Best wishes!


Hi Xin,

I think what you found is some runtime code that lives in shared memory. This is not to be confused with user data put into shared memory.

To do the latter, you can use the allocate directive, e.g.,

int Global[32];

#pragma omp allocate(Global) allocator(omp_pteam_mem_alloc)

Regarding the feedback, I don't think there is anything in place. You could perhaps use nvprof when you run the program. However, I agree we should have a flag that provides better information.

I hope this helps.