Zero Copy openmp offload

@jhuber6 Thanks, that worked. I am waiting for the GPU allocation and will test it soon. Will update you if that helps.

@jhuber6 llvm_omp_target_host_mem_alloc does not solve the issue. Zero copy still does not work, nvvp shows the same behaviour nonetheless.

To copy the data, I am using

  int host = omp_get_initial_device();
  int device = omp_get_default_device();

  void* device_ptr = omp_get_mapped_ptr(*to, device);
  omp_target_memcpy(device_ptr, *from, size, 0, 0, device, host);

Is that the issue?

I’m confused. You want zero copy, right? So direct access from the device to the host memory? And you said you want the host memory to be pinned and stay on the host, right?
If that is so, do not copy memory to the device. If you call omp_target_memcpy, we will copy memory, so no zero copy.

int *ptr = (int*)omp_alloc(4, llvm_omp_target_host_mem_alloc);
#pragma omp target is_device_ptr(ptr)
*ptr = 42
printf("val %i\n", *ptr);

I do need to copy it to *to from *from, in that case, do I just do a memcpy? or a target memcpy but both to and from are device?

*to is a device pointer associated with a host ptr.

  void* device_ptr = omp_target_alloc(bytes, device);
  omp_target_associate_ptr(*to, device_ptr, bytes, 0, device);

I’m a little confused as well. The CUDA documentation describes that the pointers allocated here should be unified. There should not need to be any copies, see CUDA Driver API :: CUDA Toolkit Documentation .

All host memory allocated in all contexts using cuMemAllocHost() and cuMemHostAlloc() is always directly accessible from all contexts on all devices that support unified addressing. This is the case regardless of whether or not the flags CU_MEMHOSTALLOC_PORTABLE and CU_MEMHOSTALLOC_DEVICEMAP are specified.

You shouldn’t need to do any association here. I believe since OpenMP 5.1 we treat implicitly captured arguments as firstprivate if they are not already present in the mapping table. That means you should just be able to do the following and have it “just work”

#include <omp.h>

int main() {
  int *ptr = (int*)omp_alloc(4, llvm_omp_target_host_mem_alloc);
#pragma omp target
  { *ptr = 1; }
  int v = *ptr
  omp_free(ptr, llvm_omp_target_host_mem_alloc);
  return v;
}

*to and *from are both dynamically allocated arrays. *from is allocated with the llvm_omp_target_host_mem_alloc where as *to is a device array. The original datatypes of to and from is **void so it might be confusing when I wrote *to, *to is an array not a value.

complete mallochost. *ptr should be *from

void ops_device_mallochost(OPS_instance *instance, void** ptr, size_t bytes) {
#if defined(__clang__)
  *ptr = omp_alloc(bytes, llvm_omp_target_host_mem_alloc);
#else
  *ptr = ops_malloc(bytes);
#endif
}

*to should be *ptr:

void ops_device_malloc(OPS_instance *instance, void** ptr, size_t bytes) {
  *ptr = ops_malloc(bytes);
  int device = omp_get_default_device();

  void* device_ptr = omp_target_alloc(bytes, device);
  omp_target_associate_ptr(*ptr, device_ptr, bytes, 0, device);
}

and this is h2d:

void ops_device_memcpy_h2d(OPS_instance *instance, void** to, void **from, size_t size) {
  int host = omp_get_initial_device();
  int device = omp_get_default_device();

  void* device_ptr = omp_get_mapped_ptr(*to, device);
  omp_target_memcpy(device_ptr, *from, size, 0, 0, device, host);
}

The only option I can think of is

#pragma omp target is_device_ptr(*from)
memcpy(*to, *from, size);

will that work?

That would definitely “work”, but it would probably be very slow. Also you’d need to use __builtin_memcpy unless you’re an early adopter of my libc for GPUs — The LLVM C Library project.

You should be able to copy this stuff with omp_target_memcpy as you’ve done, but then it’s not zero copy anymore. What I thought you had in mind when you said “zero copy” was to use this memory directly instead of device memory. The overhead of launching a kernel to do the copying is almost certainly going to be on the order of the memcpy itself. If the amount of memory in the host pointer is smaller it’d probably be best to roll writing back the result into the main kernel, e.g.

void *device = omp_alloc(size, llvm_omp_target_device_mem_alloc)
void *host = omp_alloc(size, llvm_omp_target_host_mem_alloc)
#pragma omp target
{
  work(device);
  update(device, host);
}

With how the application is designed, it is not possible to change that. I’ll need to search and change a fair bit of code in order to that. What I want is to copy call the values from the pointer allocated with mallochost to a device pointer. Is there no way the target memcpy will work in that case?
In the cuda variant, these operations work just fine:

void ops_device_malloc(OPS_instance *instance, void** ptr, size_t bytes) {
  cutilSafeCall(instance->ostream(), cudaMalloc(ptr, bytes));
}
void ops_device_mallochost(OPS_instance *instance, void** ptr, size_t bytes) {
  cutilSafeCall(instance->ostream(), cudaMallocHost(ptr, bytes));
}
void ops_device_memcpy_h2d(OPS_instance *instance, void** to, void **from, size_t size) {
    cutilSafeCall(instance->ostream(), cudaMemcpy(*to, *from, size, cudaMemcpyHostToDevice));
}

cudaMalloc → omp_target_alloc
cudaMallocHost → omp_alloc(size, llvm_omp_target_host_mem_alloc)
cudaMemcpy → omp_target_memcpy

EDIT:
I now see you want to return the host pointer from ops_device_mallochost, ok. What did not work with that approach? Zero Copy openmp offload - #13 by addy419 ?


So far, so good.

Now you say you want 0 copy, but in fact you want explicit copies, correct?
You can get either, but not both, I mean 0 copies contradicts explicit copies.
Let’s assume for a second you want to emulate your ops_device_ functions:

void ops_device_malloc(OPS_instance *instance, void** ptr, size_t bytes) {
  *ptr = omp_target_alloc(bytes, omp_get_default_device());
}
void ops_device_mallochost(OPS_instance *instance, void** ptr, size_t bytes) {
  *ptr = omp_alloc(size, llvm_omp_target_host_mem_alloc);
}
void ops_device_memcpy_h2d(OPS_instance *instance, void** to, void **from, size_t size) {
    omp_target_memcpy(*to, *from, size, 0, 0, /* Dst Device */ omp_get_default_device(), /* Src Device */ omp_get_initial_device());
}

The invocation for cudaMemcpy are far more in openmp offload than in cuda. For example in openmp offload, application makes 5561 calls to [CUDA memcpy HtoD]. For cuda, same application makes 192 calls. It takes a total of 229.15us for HtoD in cuda and 6.5745ms for ompoffload. I assumed this is because the memory is not pinned, but same thing happened after pinning. Due to this, the average runtime which is 27 sec for cuda changes to 92 sec for ompoffload when running more iterations (the profiling was done on 2 iterations, I need 2955 iterations in total for the whole application).

OK, now we are talking.

You can check why and what OpenMP is actually transferring. Our documentation has more info (LLVM/OpenMP Runtimes — LLVM/OpenMP 18.0.0git documentation), but basically try:

env LIBOMPTARGET_INFO=$((0x20)) ./your-application

If you don’t have line numbers and variable names in the output, add -gline-tables-only to your compilation.
You can also use
env LIBOMPTARGET_PROFILE=prof.json ./your-application
to get a chrome profile file, which might shed light on some things, see also LLVM/OpenMP Runtimes — LLVM/OpenMP 18.0.0git documentation

My office hour was earlier today, but if you are stuck, let me know.
We also have a slack where you can get quick help:
https://join.slack.com/t/openmp-opt/shared_invite/zt-1my2xbwds-VqpEQc2SIOkQ5~U7vZzZXg

Hi @jdoerfert, thank you so much for the help. I figured out that the loop variables (start[0], end[0], start[1], end[1] etc) were being continuously moved to the target which was causing this issue. changing them to normal integers reduced the runtime from 81 sec to 53 sec. With just that one change, HtoD calls reduced from 5561 to 1767. I need to move other variables as well, but I’m thankful for the hint.

Kind Regards,
Aditya

@jdoerfert The variables for reductions are transferred every time as well. Is that a normal thing? Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6c8, TgtPtr=0x000014b24d806000, Size=8, Name=p_a7_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6c0, TgtPtr=0x000014b24d806200, Size=8, Name=p_a8_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6b8, TgtPtr=0x000014b24d805a00, Size=8, Name=p_a9_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e6b0, TgtPtr=0x000014b24d805c00, Size=8, Name=p_a10_0
Libomptarget device 0 info: Copying data from host to device, HstPtr=0x00007ffcab90e7b8, TgtPtr=0x000014b24d805e00, Size=8, Name=p_a11_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d805e00, HstPtr=0x00007ffcab90e7b8, Size=8, Name=p_a11_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d805c00, HstPtr=0x00007ffcab90e6b0, Size=8, Name=p_a10_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d805a00, HstPtr=0x00007ffcab90e6b8, Size=8, Name=p_a9_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d806200, HstPtr=0x00007ffcab90e6c0, Size=8, Name=p_a8_0
Libomptarget device 0 info: Copying data from device to host, TgtPtr=0x000014b24d806000, HstPtr=0x00007ffcab90e6c8, Size=8, Name=p_a7_0

It depends, if you have a

#pragma omp target teams distribute parallel for reduction(+:var) <clauses>

we need to copy the value at least from the device so you can use it on the host, right?
That said, we probably could avoid the h2d transfer.
If you want the data to stay on the device, before and after, you need to write it differently, e.g.

#pragma omp target teams
{
double var = ...;
#pragma omp distribute parallel for reduction(+:var)
...
}

That said, you probably want to enable remarks (-Rpass=openmp-opt -Rpass-analysis=openmp-opt -Rpass-missed=openmp-opt, see OpenMP Optimization Remarks — LLVM/OpenMP 18.0.0git documentation) to ensure the code is properly optimized.
If any remark is a “bad” one, OMP112 or OMP130, you should investigate.
Also, LIBOMPTARGET_INFO=16 will tell you how the code is executed, which should hopefully be (Generic-)SPMD mode.

Hi @jdoerfert I am wasting quite a bit of time in this

67.98%  264.83ms         3  88.277ms     569ns  157.89ms  cuDevicePrimaryCtxRetain

Can you tell me what this is?

Btw, these are the remarks, found one 112, but I don’t know where it is, since it is a library function.

remark: <unknown>:0:0: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
remark: ./OpenMP_offload/initialise_chunk_kernel_xx_ompoffload_kernel.cpp:101:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_xx_ompoffload_kernel.cpp:101:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_yy_ompoffload_kernel.cpp:101:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_yy_ompoffload_kernel.cpp:101:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_zz_ompoffload_kernel.cpp:101:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_zz_ompoffload_kernel.cpp:101:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_x_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_x_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_y_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_y_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_z_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_z_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_cellx_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_cellx_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_celly_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_celly_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_cellz_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_cellz_ompoffload_kernel.cpp:97:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_volume_ompoffload_kernel.cpp:122:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/initialise_chunk_kernel_volume_ompoffload_kernel.cpp:122:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/generate_chunk_kernel_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/generate_chunk_kernel_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/ideal_gas_kernel_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/ideal_gas_kernel_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_b2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_b2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_b1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_b1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_t2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_t2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_t1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_t1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_l2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_l2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_l1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_l1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_r2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_r2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_r1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_r1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_ba2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_ba2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_ba1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_ba1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_fr2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_fr2_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_fr1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel1_fr1_ompoffload_kernel.cpp:143:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_minus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_xvel_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_4_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_4_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_2_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_2_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_4_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_4_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_2_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_minus_2_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_yvel_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_bot_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_top_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_plus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel2_zvel_minus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_minus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel3_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_minus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel4_plus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_a_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_b_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_left_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_4_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_plus_2_right_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_4_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_2_back_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_4_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/update_halo_kernel5_minus_2_front_ompoffload_kernel.cpp:112:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/field_summary_kernel_ompoffload_kernel.cpp:176:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/field_summary_kernel_ompoffload_kernel.cpp:176:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/viscosity_kernel_ompoffload_kernel.cpp:153:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/viscosity_kernel_ompoffload_kernel.cpp:153:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_min_ompoffload_kernel.cpp:102:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_min_ompoffload_kernel.cpp:102:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_get_ompoffload_kernel.cpp:133:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_get_ompoffload_kernel.cpp:133:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_print_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/calc_dt_kernel_print_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/PdV_kernel_predict_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/PdV_kernel_predict_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/PdV_kernel_nopredict_ompoffload_kernel.cpp:185:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/PdV_kernel_nopredict_ompoffload_kernel.cpp:185:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/revert_kernel_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/revert_kernel_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/accelerate_kernel_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/accelerate_kernel_ompoffload_kernel.cpp:166:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/flux_calc_kernelx_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/flux_calc_kernelx_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/flux_calc_kernely_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/flux_calc_kernely_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/flux_calc_kernelz_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/flux_calc_kernelz_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel1_xdir_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel1_xdir_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel2_xdir_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel2_xdir_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel3_xdir_ompoffload_kernel.cpp:128:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel3_xdir_ompoffload_kernel.cpp:128:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel4_xdir_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel4_xdir_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel1_ydir_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel1_ydir_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel2_ydir_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel2_ydir_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel3_ydir_ompoffload_kernel.cpp:128:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel3_ydir_ompoffload_kernel.cpp:128:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel4_ydir_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel4_ydir_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel1_zdir_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel1_zdir_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel2_zdir_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel2_zdir_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel3_zdir_ompoffload_kernel.cpp:128:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel3_zdir_ompoffload_kernel.cpp:128:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel4_zdir_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_cell_kernel4_zdir_ompoffload_kernel.cpp:147:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_x1_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_x1_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_z1_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_z1_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_x2_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_x2_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_y2_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_y2_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_x3_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_x3_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_z3_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_z3_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_mass_flux_x_ompoffload_kernel.cpp:91:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_mass_flux_x_ompoffload_kernel.cpp:91:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_post_pre_advec_x_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_post_pre_advec_x_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel1_x_nonvector_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel1_x_nonvector_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel2_x_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel2_x_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_mass_flux_y_ompoffload_kernel.cpp:91:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_mass_flux_y_ompoffload_kernel.cpp:91:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_post_pre_advec_y_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_post_pre_advec_y_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel1_y_nonvector_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel1_y_nonvector_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel2_y_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel2_y_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_mass_flux_z_ompoffload_kernel.cpp:91:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_mass_flux_z_ompoffload_kernel.cpp:91:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_post_pre_advec_z_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel_post_pre_advec_z_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel1_z_nonvector_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel1_z_nonvector_ompoffload_kernel.cpp:110:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel2_z_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/advec_mom_kernel2_z_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/reset_field_kernel1_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/reset_field_kernel1_ompoffload_kernel.cpp:103:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/reset_field_kernel2_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: ./OpenMP_offload/reset_field_kernel2_ompoffload_kernel.cpp:116:3: Moving memory allocation from the heap to the stack.
remark: src/ompoffload/ops_ompoffload_common.cpp:125:3: Moving memory allocation from the heap to the stack.
remark: src/ompoffload/ops_ompoffload_common.cpp:125:3: Moving memory allocation from the heap to the stack.
remark: src/ompoffload/ops_ompoffload_rt_support_kernels.cpp:68:1: Moving memory allocation from the heap to the stack.
remark: src/ompoffload/ops_ompoffload_rt_support_kernels.cpp:68:1: Moving memory allocation from the heap to the stack.
remark: src/ompoffload/ops_ompoffload_rt_support_kernels.cpp:117:1: Moving memory allocation from the heap to the stack.
remark: src/ompoffload/ops_ompoffload_rt_support_kernels.cpp:117:1: Moving memory allocation from the heap to the stack.

Not much more than the NVIDIA docs can:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX_1g9051f2d5c31501997a6cb0530290a300

That said, are you really “wasting time in this” IT says 68% but it also says 265ms. This function should not be called often, I am surprised to see it called 3 times, potentially once per device? Are you sure this is a bottleneck in your application?

That might be just fine. You can try to compile with assertions on, that often helps;
-fopenmp-target-assume-no-thread-state
-fopenmp-target-assume-no-nested-parallelism

1 Like

It was an issue with SPMD, looking more into it, I realised that clang does not support omp target loop yet (not perfectly at least). With NVHPC, a significant performance bump was observed.

So, omp target loop is not openmp at all. omp target teams loop can be used as a shorthand for the omp target teams distribute parallel for simd form.
That said, did LIBOMPTARGET_INFO=16 tell you the kernels were executed in Generic mode? or (Generic-)SPMD mode?