Zero-copy OpenMP offload

Greetings,

I am porting an application from CUDA to OpenMP offload. While the application works as expected, nvprof/nvvp shows a large number of H2D and D2H calls. I am presuming this is because the original code uses cudaMallocHost whereas the OpenMP offload code uses a normal malloc. Is it possible to pin memory on the host for zero-copy access with OpenMP offload?

This application is portable, so I’d like to avoid a solution like mixing CUDA code with OpenMP offload (since AMD and Intel support is also needed). Also, I’m unsure how to mix them in the first place (compiling CUDA with clang).

OpenMP by default maintains a mapping of host pointers to device pointers. When we enter a target region, we automatically find the device pointer associated with the given host pointer. This is the standard behavior when using the mapping clauses. I don’t know if there’s an OpenMP standard solution, but currently you can allocate CUDA host memory using an LLVM extension allocator. There’s an example in the tests https://github.com/llvm/llvm-project/blob/main/openmp/libomptarget/test/api/omp_device_managed_memory_alloc.c. Specifically we have

  int *shared_ptr =
      omp_alloc(N * sizeof(int), llvm_omp_target_shared_mem_alloc);

#pragma omp target is_device_ptr(shared_ptr)
  { ... }

  omp_free(shared_ptr, llvm_omp_target_shared_mem_alloc);

The is_device_ptr clause is important here, as it instructs the runtime to forgo the normal mapping table and instead copy the associated pointer value directly. Is this what you were looking for?

Also, w.r.t. mixing CUDA and OpenMP offloading: you can partially do it, with some caveats. Namely, they can’t be in the same file, and global state won’t be shared between the two. See the FAQ for an example: Support, Getting Involved, and FAQ — LLVM/OpenMP 18.0.0git documentation.
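A minimal sketch of what that separation could look like (file layout and function names here are made up for illustration; the FAQ has the authoritative example):

// vector_add.cu — compiled as CUDA by clang; only a plain C entry point is exposed
extern "C" void vector_add_cuda(float *out, const float *a, const float *b, int n);

// main.cpp — compiled with -fopenmp; it reaches the CUDA side only through that
// C interface, and no global state is shared between the two translation units
extern "C" void vector_add_cuda(float *out, const float *a, const float *b, int n);

int main() {
  float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
  vector_add_cuda(out, a, b, 4);      // CUDA path, linked in from vector_add.cu
#pragma omp target map(tofrom : out)  // OpenMP offload path, in this file
  { out[0] += 1.0f; }
  return 0;
}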

No, I am looking for something along the lines of llvm_omp_target_alloc_host, i.e. I don’t want the memory to be migratable, just page-locked.

Since I was able to find the extension, is there a macro I can check to see whether LLVM is being used? I am also testing this on nvhpc, so a conditional compilation directive will be needed.

Yes, the host variant maps to cuMemAllocHost while the shared variant maps to cuMemAllocManaged; glad you found the one you were looking for.

The easiest way is most likely to just check for the __clang__ macro. I think you might be able to use a declare variant for some overloading-type implementations, but I don’t think nvhpc even supports those. That would look something like this:

#pragma omp declare variant(foo) match(implementation = {vendor(llvm)})
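For completeness, a minimal sketch of the declare variant mechanics with made-up function names. Note that this only redirects the call; any LLVM-only identifiers (like the extension allocators) would still need an #ifdef __clang__ guard to compile under nvhpc:

#include <stdio.h>

// Variant selected when the OpenMP implementation identifies itself as LLVM.
void say_hello_llvm(void) { printf("hello from the LLVM variant\n"); }

#pragma omp declare variant(say_hello_llvm) match(implementation = {vendor(llvm)})
void say_hello(void) { printf("hello from the base version\n"); }

int main(void) {
  say_hello(); // resolves to say_hello_llvm when built with clang -fopenmp
  return 0;
}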

@jhuber6 I am getting the error:

src/ompoffload/ops_ompoffload_common.cpp:71:10: error: use of undeclared identifier 'llvm_omp_target_alloc_host'
   71 |   *ptr = llvm_omp_target_alloc_host(bytes, device);

I am using clang version 17.0.0 (https://github.com/llvm/llvm-project.git c6e065ef22c29a341dcc764f8f6ed9ab5ec1c57a), so llvm_omp_target_alloc_host should be available. Do I need to include something else?

I believe that’s the internal version. You need to forward declare it because it’s not really intended to be accessed by users. The expected way is to use omp_alloc and omp_free, e.g.

void *ptr = omp_alloc(size, llvm_omp_target_host_mem_alloc);
omp_free(ptr, llvm_omp_target_host_mem_alloc);
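(If you really did want to call the internal entry point directly, a forward declaration along these lines should work; the signature is inferred from your call site, it is not a documented, user-facing API:)

// Internal libomptarget entry point; forward-declared only because it is not exposed in omp.h
extern "C" void *llvm_omp_target_alloc_host(size_t size, int device_num);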

@jhuber6 Thanks, that worked. I am waiting for the GPU allocation and will test it soon. Will update you if that helps.

@jhuber6 llvm_omp_target_host_mem_alloc does not solve the issue. Zero copy still does not work; nvvp shows the same behaviour nonetheless.

To copy the data, I am using

  int host = omp_get_initial_device();
  int device = omp_get_default_device();

  void* device_ptr = omp_get_mapped_ptr(*to, device);
  omp_target_memcpy(device_ptr, *from, size, 0, 0, device, host);

Is that the issue?

I’m confused. You want zero copy, right? So direct access from the device to the host memory? And you said you want the host memory to be pinned and stay on the host, right?
If that is so, do not copy memory to the device. If you call omp_target_memcpy, we will copy memory, so no zero copy.

int *ptr = (int *)omp_alloc(4, llvm_omp_target_host_mem_alloc);
#pragma omp target is_device_ptr(ptr)
{ *ptr = 42; }            // the device writes straight into the pinned host allocation
printf("val %i\n", *ptr); // the host reads it back, no explicit copy involved

I do need to copy the data from *from to *to. In that case, do I just do a memcpy? Or a target memcpy where both to and from are device pointers?

*to is a device pointer associated with a host ptr.

  void* device_ptr = omp_target_alloc(bytes, device);
  omp_target_associate_ptr(*to, device_ptr, bytes, 0, device);

I’m a little confused as well. The CUDA documentation states that the pointers allocated here should be unified, so there should not need to be any copies; see CUDA Driver API :: CUDA Toolkit Documentation.

All host memory allocated in all contexts using cuMemAllocHost() and cuMemHostAlloc() is always directly accessible from all contexts on all devices that support unified addressing. This is the case regardless of whether or not the flags CU_MEMHOSTALLOC_PORTABLE and CU_MEMHOSTALLOC_DEVICEMAP are specified.

You shouldn’t need to do any association here. I believe since OpenMP 5.1 we treat implicitly captured arguments as firstprivate if they are not already present in the mapping table. That means you should just be able to do the following and have it “just work”

#include <omp.h>

int main() {
  int *ptr = (int*)omp_alloc(4, llvm_omp_target_host_mem_alloc);
#pragma omp target
  { *ptr = 1; }
  int v = *ptr;
  omp_free(ptr, llvm_omp_target_host_mem_alloc);
  return v;
}

*to and *from are both dynamically allocated arrays. *from is allocated with llvm_omp_target_host_mem_alloc whereas *to is a device array. The original data type of to and from is void**, so it might be confusing when I wrote *to; *to is an array, not a value.

Here is the complete mallochost; *ptr here corresponds to *from above:

void ops_device_mallochost(OPS_instance *instance, void** ptr, size_t bytes) {
#if defined(__clang__)
  *ptr = omp_alloc(bytes, llvm_omp_target_host_mem_alloc);
#else
  *ptr = ops_malloc(bytes);
#endif
}

And in ops_device_malloc, *ptr corresponds to *to:

void ops_device_malloc(OPS_instance *instance, void** ptr, size_t bytes) {
  *ptr = ops_malloc(bytes);
  int device = omp_get_default_device();

  void* device_ptr = omp_target_alloc(bytes, device);
  omp_target_associate_ptr(*ptr, device_ptr, bytes, 0, device);
}

And this is the H2D copy:

void ops_device_memcpy_h2d(OPS_instance *instance, void** to, void **from, size_t size) {
  int host = omp_get_initial_device();
  int device = omp_get_default_device();

  void* device_ptr = omp_get_mapped_ptr(*to, device);
  omp_target_memcpy(device_ptr, *from, size, 0, 0, device, host);
}

The only option I can think of is

void *dst = *to, *src = *from;
#pragma omp target is_device_ptr(dst, src)
  memcpy(dst, src, size);

Will that work?

That would definitely “work”, but it would probably be very slow. Also, you’d need to use __builtin_memcpy unless you’re an early adopter of my libc for GPUs (The LLVM C Library project).
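A hedged sketch of that substitution, reusing the dst/src locals from the snippet above:

void *dst = *to, *src = *from;
#pragma omp target is_device_ptr(dst, src)
  __builtin_memcpy(dst, src, size); // runs inside a kernel, effectively single-threaded, hence slow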

You should be able to copy this stuff with omp_target_memcpy as you’ve done, but then it’s not zero copy anymore. What I thought you had in mind when you said “zero copy” was to use this memory directly instead of device memory. The overhead of launching a kernel to do the copying is almost certainly going to be on the order of the memcpy itself. If the amount of memory behind the host pointer is small, it’d probably be best to roll writing the result back into the main kernel, e.g.

void *device = omp_alloc(size, llvm_omp_target_device_mem_alloc);
void *host = omp_alloc(size, llvm_omp_target_host_mem_alloc);
#pragma omp target
{
  work(device);         // main computation in device memory
  update(device, host); // write the (small) result straight into pinned host memory
}

With how the application is designed, it is not possible to change that. I’ll need to search and change a fair bit of code in order to do that. What I want is to copy all the values from the pointer allocated with mallochost to a device pointer. Is there no way the target memcpy will work in that case?
In the CUDA variant, these operations work just fine:

void ops_device_malloc(OPS_instance *instance, void** ptr, size_t bytes) {
  cutilSafeCall(instance->ostream(), cudaMalloc(ptr, bytes));
}
void ops_device_mallochost(OPS_instance *instance, void** ptr, size_t bytes) {
  cutilSafeCall(instance->ostream(), cudaMallocHost(ptr, bytes));
}
void ops_device_memcpy_h2d(OPS_instance *instance, void** to, void **from, size_t size) {
    cutilSafeCall(instance->ostream(), cudaMemcpy(*to, *from, size, cudaMemcpyHostToDevice));
}

cudaMalloc → omp_target_alloc
cudaMallocHost → omp_alloc(size, llvm_omp_target_host_mem_alloc)
cudaMemcpy → omp_target_memcpy

EDIT:
I now see you want to return the host pointer from ops_device_mallochost, OK. What did not work with that approach? Zero-copy OpenMP offload - #13 by addy419?


So far, so good.

Now you say you want 0 copy, but in fact you want explicit copies, correct?
You can get either, but not both; I mean, zero copies contradicts explicit copies.
Let’s assume for a second you want to emulate your ops_device_ functions:

void ops_device_malloc(OPS_instance *instance, void** ptr, size_t bytes) {
  *ptr = omp_target_alloc(bytes, omp_get_default_device());
}
void ops_device_mallochost(OPS_instance *instance, void** ptr, size_t bytes) {
  *ptr = omp_alloc(bytes, llvm_omp_target_host_mem_alloc);
}
void ops_device_memcpy_h2d(OPS_instance *instance, void** to, void **from, size_t size) {
    omp_target_memcpy(*to, *from, size, 0, 0, /* Dst Device */ omp_get_default_device(), /* Src Device */ omp_get_initial_device());
}

There are far more cudaMemcpy invocations with OpenMP offload than with CUDA. For example, with OpenMP offload the application makes 5561 calls to [CUDA memcpy HtoD]; with CUDA, the same application makes 192 calls. HtoD takes a total of 229.15us with CUDA and 6.5745ms with OpenMP offload. I assumed this was because the memory was not pinned, but the same thing happened after pinning. Due to this, the average runtime, which is 27 sec for CUDA, becomes 92 sec for OpenMP offload when running more iterations (the profiling was done on 2 iterations; I need 2955 iterations in total for the whole application).

OK, now we are talking.

You can check why and what OpenMP is actually transferring. Our documentation has more info (LLVM/OpenMP Runtimes — LLVM/OpenMP 18.0.0git documentation), but basically try:

env LIBOMPTARGET_INFO=$((0x20)) ./your-application

If you don’t have line numbers and variable names in the output, add -gline-tables-only to your compilation.
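For example (the compile flags below are an assumption for an NVIDIA target with clang’s OpenMP offloading; adjust for your setup):

clang++ -fopenmp --offload-arch=sm_80 -gline-tables-only -O2 app.cpp -o your-application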
You can also use
env LIBOMPTARGET_PROFILE=prof.json ./your-application
to get a Chrome trace file (viewable in chrome://tracing), which might shed light on some things; see also LLVM/OpenMP Runtimes — LLVM/OpenMP 18.0.0git documentation

My office hour was earlier today, but if you are stuck, let me know.
We also have a slack where you can get quick help:
https://join.slack.com/t/openmp-opt/shared_invite/zt-1my2xbwds-VqpEQc2SIOkQ5~U7vZzZXg

Hi @jdoerfert, thank you so much for the help. I figured out that the loop bound variables (start[0], end[0], start[1], end[1], etc.) were being moved to the target on every launch, which was causing this issue. Changing them to normal integers reduced the runtime from 81 sec to 53 sec. With just that one change, HtoD calls dropped from 5561 to 1767. I need to move the other variables as well, but I’m thankful for the hint. Roughly, the change was along the lines of the sketch below.
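(Simplified sketch, not the actual OPS code; names are made up and a is assumed to be a device allocation from omp_target_alloc:)

void kernel_before(double *a, const int *start, const int *end) {
  // start/end live on the host, so their elements are copied H2D
  // every single time the kernel is launched
#pragma omp target teams distribute parallel for is_device_ptr(a) map(to : start[0:1], end[0:1])
  for (int i = start[0]; i < end[0]; i++)
    a[i] += 1.0;
}

void kernel_after(double *a, const int *start, const int *end) {
  // copy the bounds into plain scalars first; scalars are implicitly
  // firstprivate and travel as cheap kernel arguments, not separate copies
  int begin0 = start[0], finish0 = end[0];
#pragma omp target teams distribute parallel for is_device_ptr(a)
  for (int i = begin0; i < finish0; i++)
    a[i] += 1.0;
}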

Kind Regards,
Aditya

@jdoerfert The variables for reductions are transferred every time as well. Is that a normal thing?

Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6c8, TgtPtr=0x000014b24d806000, Size=8, Name=p_a7_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6c0, TgtPtr=0x000014b24d806200, Size=8, Name=p_a8_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6b8, TgtPtr=0x000014b24d805a00, Size=8, Name=p_a9_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6b0, TgtPtr=0x000014b24d805c00, Size=8, Name=p_a10_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e7b8, TgtPtr=0x000014b24d805e00, Size=8, Name=p_a11_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d805e00, HstPtr=0x00007ffcab90e7b8, Size=8, Name=p_a11_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d805c00, HstPtr=0x00007ffcab90e6b0, Size=8, Name=p_a10_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d805a00, HstPtr=0x00007ffcab90e6b8, Size=8, Name=p_a9_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d806200, HstPtr=0x00007ffcab90e6c0, Size=8, Name=p_a8_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d806000, HstPtr=0x00007ffcab90e6c8, Size=8, Name=p_a7_0