OpenMP 4.5 and CUDA streams

Dear All,

I'm using clang 9.0.0 to compile code that offloads sections to a GPU using the OpenMP target construct.
I also use the nowait clause to overlap the execution of certain kernels and/or host<->device memory transfers.
However, using the NVIDIA profiler I've noticed that when I compile the code with clang only one CUDA stream is active,
and therefore the execution gets serialized. On the other hand, when compiling with XLC I see that kernels are executed
on different streams. I could not work out whether this is the expected behavior (e.g. the nowait clause is currently not supported)
or whether I'm missing something. I'm using an NVIDIA Tesla P100 GPU and compiling with the following options:

-target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_60
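
For reference, here is a minimal sketch of the pattern I mean (illustrative only, not my actual code): two independent target regions launched with nowait, which I would expect to run on different streams.

  #include <omp.h>
  #define N (1 << 20)
  double a[N], b[N];

  int main(void) {
    // Two independent kernels launched as deferred target tasks; with a
    // runtime that maps them to different CUDA streams they can overlap.
    #pragma omp target teams distribute parallel for map(tofrom: a[0:N]) nowait
    for (int i = 0; i < N; ++i) a[i] += 1.0;

    #pragma omp target teams distribute parallel for map(tofrom: b[0:N]) nowait
    for (int i = 0; i < N; ++i) b[i] += 2.0;

    #pragma omp taskwait  // wait for both target tasks to complete
    return 0;
  }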

best wishes

Alessandro

[+Ye, Johannes]

I recall that we've also observed this behavior. Ye, Johannes, we had a
work-around and a patch, correct?

-Hal

I don’t think it will be very easy. It requires some additional work in libomptarget + some fixes in the clang itself. Otherwise there might be some race conditions.

Can you be more specific? I thought that the mapping table, etc. were
already appropriately protected.

As a general thought, we should probably have a mode in which the
runtime is compiled with ThreadSanitizer to check for these kinds of things.
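
(For example, something along these lines for a standalone build of the runtime; these are just the generic sanitizer flags, so treat the exact invocation as a sketch:

  cmake -DCMAKE_C_FLAGS="-fsanitize=thread" \
        -DCMAKE_CXX_FLAGS="-fsanitize=thread" \
        ../openmp

and then run the offloading tests against the instrumented libomptarget.)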

Thanks again,

Hal

Hal, it seems to me that not everything is protected. Some buffers are reused for different kernels, I assume. Better to ask Alex Eichenberger; he knows more about it, as I did not investigate this problem myself.

As for clang, we try to reduce the size of the buffers in global memory for the reduction/lastprivate/etc. variables that may escape their declaration context. These buffers cannot be combined in streams mode; we would need to allocate a unique buffer for each particular kernel. It is not very hard to do, it is just not implemented yet.
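
Roughly like this (purely illustrative, with made-up names and sizes; not the actual codegen):

  // Today the globalized locals of all kernels can share one union in
  // device global memory, which keeps the total allocation small:
  union _globalized_locals_ty {
    struct { double red_buf[1024]; } kernel1;  // reduction temporaries
    struct { int lastpriv[512]; } kernel2;     // escaping lastprivates
  };
  __device__ union _globalized_locals_ty _globalized_locals; // one shared instance

  // If kernel1 and kernel2 run concurrently on different streams, they
  // would race on _globalized_locals. For streams support, each kernel
  // needs its own buffer instead:
  __device__ struct { double red_buf[1024]; } _globalized_kernel1;
  __device__ struct { int lastpriv[512]; } _globalized_kernel2;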

Hi Hal,
My experience of llvm/clang so far shows:

  1. All the target offload is blocking and synchronous, using the default stream. nowait is not supported.
  2. All the memory transfer calls invoke cudaMemcpy. There are no async calls.
  3. In the past, I experimented with turning on CUDA_API_PER_THREAD_DEFAULT_STREAM in libomptarget and then using multiple host threads to do individual blocking synchronous offloads (see the sketch below). I got it sort of running and saw multiple streams, but the code crashes with memory corruption, probably due to a data race in libomptarget.
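
The experiment looked roughly like this (a sketch, not the exact code): each host thread issues its own blocking target region, and with per-thread default streams each offload should get its own stream.

  #include <omp.h>
  #define NT 4
  #define M 1024
  double a[NT][M];

  int main(void) {
    // One blocking offload per host thread; with libomptarget built with
    // CUDA_API_PER_THREAD_DEFAULT_STREAM, each thread's default stream is
    // distinct, so the offloads may overlap.
    #pragma omp parallel for num_threads(NT)
    for (int t = 0; t < NT; ++t) {
      double *p = a[t];
      #pragma omp target teams distribute parallel for map(tofrom: p[0:M])
      for (int i = 0; i < M; ++i)
        p[i] += 1.0;
    }
    return 0;
  }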

Best,
Ye

Thanks, Ye. That's consistent with Alexey's comments.

Is there already a bug open on this? If not, we should open one.

Alexey, the buffer-reuse optimizations in Clang that you mentioned, how much memory/overhead do they save? Is it worth keeping them in some mode?

-Hal

Hoping this message goes out from the main dev e-mail this time :)

Well, about the memory: it depends on the number of kernels you have. All the memory in the kernels that must be globalized is squashed into a union. With streams, we would need to use a separate structure for each particular kernel. Plus, we cannot use shared memory for this buffer anymore, again because of possible conflicts.

We can add a new compiler option to compile only some files with streams support and use a unique memory buffer for the globalized variables. Plus, some work in libomptarget is required, of course.

Do we also need some kind of libomptarget API change in order to communicate the fact that it's allowed to run multiple target regions concurrently?

Thanks again,

Hal

Not sure about the API; most probably just some internal work is required. Better to ask Alex Eichenberger; he knows more about this.