Potential offload kernel data race

Hi all,
I got a libomptarget failure when offloading the same kernel function from multiple host threads. I reported the bug as https://bugs.llvm.org/show_bug.cgi?id=46257.

After a bit of investigation, I was able to narrow down the scope of the issue.

The error from CUDA runtime is “an illegal memory access was encountered”.

A) On the libomptarget side

In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization calls before and after cuLaunchKernel, plus a mutex to protect the kernel execution.

Err = cuCtxSynchronize();
if (!checkResult(Err, "Error before cuLaunchKernel\n"))
  return OFFLOAD_FAIL;

Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,
                     /* gridDimZ */ 1, CudaThreadsPerBlock,
                     /* blockDimY */ 1, /* blockDimZ */ 1,
                     /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);
if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
  return OFFLOAD_FAIL;

Err = cuCtxSynchronize();
if (!checkResult(Err, "Error after cuLaunchKernel\n"))
  return OFFLOAD_FAIL;

Without the mutex, the first error reported by every thread is “Error after cuLaunchKernel”. With the mutex enabled, the code runs well. If I move the mutex unlock() to before the second cuCtxSynchronize, the code still hits the error. So I think the error comes from the kernel execution itself, not from something else.
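The locking experiment can be sketched without CUDA as follows. This is only an illustration of the serialization pattern, not the plugin code: fake_launch and fake_synchronize are hypothetical stand-ins for cuLaunchKernel and cuCtxSynchronize, and the counters exist only to show that at most one "kernel" is in flight while the lock is held across both calls.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for cuLaunchKernel / cuCtxSynchronize.
// They only count how many "kernels" are in flight at once.
std::atomic<int> kernels_in_flight{0};
std::atomic<int> max_in_flight{0};

void fake_launch() {
  int now = ++kernels_in_flight;
  int prev = max_in_flight.load();
  // Record the highest concurrency level observed so far.
  while (now > prev && !max_in_flight.compare_exchange_weak(prev, now)) {
  }
}

void fake_synchronize() { --kernels_in_flight; }

std::mutex launch_mutex;

// The lock covers both the launch and the following synchronize, so the
// kernel has finished before another host thread can launch.
void guarded_launch() {
  std::lock_guard<std::mutex> lock(launch_mutex);
  fake_launch();
  fake_synchronize();
}

// Launch from several host threads and return the highest number of
// kernels that were ever in flight at the same time.
int run_guarded(int host_threads, int launches_per_thread) {
  kernels_in_flight = 0;
  max_in_flight = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < host_threads; ++t)
    workers.emplace_back([launches_per_thread] {
      for (int i = 0; i < launches_per_thread; ++i)
        guarded_launch();
    });
  for (auto &w : workers)
    w.join();
  return max_in_flight.load();
}
```

Releasing the lock before fake_synchronize would let a second launch start while the first kernel may still be running, which corresponds to the variant that still failed for me.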

B) On my application side

I tried to see what triggers the error. Once I commented out the “omp parallel” in the hierarchical parallelism at https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258, offloading from multiple host threads started to run without error.
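For readers who do not want to open the miniqmc source, the kind of pattern involved looks roughly like the sketch below. It is a simplified illustration, not the miniqmc kernel: scale_rows and run_from_host_threads are hypothetical names, and the arithmetic is a placeholder. The outer loop is distributed over teams, the inner loop uses a nested “omp parallel” region, and the same kernel is offloaded concurrently from several host threads.

```cpp
#include <cassert>
#include <vector>

// Hypothetical kernel with hierarchical parallelism: one team per row,
// and the threads of each team share the inner loop.
void scale_rows(double *data, int rows, int cols, double factor) {
  #pragma omp target teams distribute map(tofrom: data[0:rows * cols])
  for (int r = 0; r < rows; ++r) {
    // This nested parallel region is the construct that, when commented
    // out in my application, made the failure disappear.
    #pragma omp parallel for
    for (int c = 0; c < cols; ++c)
      data[r * cols + c] *= factor;
  }
}

// Offload the same kernel from several host threads, each thread working
// on its own buffer.
std::vector<std::vector<double>> run_from_host_threads(int host_threads,
                                                       int rows, int cols) {
  std::vector<std::vector<double>> buffers(
      host_threads, std::vector<double>(rows * cols, 1.0));
  #pragma omp parallel for num_threads(host_threads)
  for (int t = 0; t < host_threads; ++t)
    scale_rows(buffers[t].data(), rows, cols, 2.0);
  return buffers;
}
```

Built without OpenMP support, the pragmas are ignored and the code simply runs serially on the host, so the sketch only exercises the problem when compiled for offload.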

I’d like to ask the wider community about potential causes of this bug.

Is there a thread-unsafe state machine in a kernel generated from a target region? Or is there any other potential reason for the error?


It was from commit 1c3d7709dec22c61d9c3105e4838edce8e6ac014 (HEAD → master, origin/master, origin/HEAD)

I added those synchronizations for debugging purposes.

Yes. It makes the code work.
Do you have any idea about the code pattern?

Before the compiler fix gets in, I can put the workaround in the real code which has multiple places similar to the miniapp.


It is working well now. I tested both the miniapp and the full application.

Could you add documentation for this option?
The name doesn't reflect the actual effect.
The differential review mentions

`Added support for dynamic memory allocation for globalized variables in`
`case if execution of target regions in parallel is required.`

So what are the globalized variables? When do they occur? Since this impacts performance, application developers would like to know more details.


In general, it might use "slow" allocation functions, since it may call malloc on the device side. When disabled, it uses statically preallocated memory, which might be faster if parallel target regions are not required.


In the slow code path, does it always call malloc on the device, or is malloc only called when a size limit is exceeded? Is the call explicitly generated by clang, or does the CUDA driver handle it?

I saw some UseSharedMemory flags in the review. Does the slow code path use CUDA shared memory and fall back to global memory when the size limit is exceeded?

In the fast code path, what is the scope of this statically preallocated memory? Is it owned by the CUDA plugin?



I think the hard-coded size of 128 is suboptimal. Can we allow users to select it?
Since we cannot use shared memory directly, I believe there is plenty of shared memory available.
It seems that my test code exceeds 128 bytes, and I’d like to set a higher limit.


I cannot use openmp-cuda-mode. I don’t remember now exactly where (reduction?) but it breaks other things in my code.


Do you have an example using shared memory via #pragma omp allocate?

Thank you. I will try that.