Libomptarget fatal error 1: '#pragma omp requires unified_shared_memory' not used consistently!

The C++ code below builds, but the executable fails at runtime.
(It is taken from the C code under the libomptarget subdirectory's test directory.)

#include <cstdint> // for int64_t used by __tgt_register_requires
#include <omp.h>

#pragma omp requires unified_shared_memory
#define N 1024
extern "C" void __tgt_register_requires(int64_t);

int main() {
  int a[N] = {0};
  int b[N] = {0};
  int *device_data;
  // The compiler also registers the requires clauses it saw; a manual
  // call with a different flag value can trip the consistency check.
  __tgt_register_requires(1);
#pragma omp target map(tofrom : device_data)
  {
    device_data = &a[0];
    for (int i = 0; i < N; i++) {
      a[i] += 1;
    }
  }
}
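
For reference, here is the same test without the manual registration (the compiler already emits a __tgt_register_requires call for the requires directive, so my reading is that the extra call with a different flag value is what trips the consistency check). On a machine without working USM it can still fail at kernel runtime, as the log further below shows:

#include <omp.h>

#pragma omp requires unified_shared_memory
#define N 1024

int main() {
  int a[N] = {0};
  int *device_data;
#pragma omp target map(tofrom : device_data)
  {
    // Point the mapped pointer at the host array; with true USM this
    // address is valid on the device as well.
    device_data = &a[0];
    for (int i = 0; i < N; i++)
      a[i] += 1;
  }
}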

I don't see this test, nor do I understand what you are trying to say.
Is the test failing? If so, which test is this?

~ Johannes

I'm trying to build a C++ test that uses part of
unified_shared_memory/shared_update.c.

I removed the internal function, but I get:

CUDA device 0 info: Device supports up to 65536 CUDA blocks and 1024
threads with a warp size of 32
CUDA device 0 info: Launching kernel
__omp_offloading_34_8009dd23_main_l12 with 1 blocks and 33 threads in
Generic mode
CUDA error: Error when synchronizing stream. stream =
0x0000000001d22ae0, async info ptr = 0x00007ffe73ea2728
CUDA error: an illegal memory access was encountered
Libomptarget error: Failed to synchronize device.
Libomptarget error: Call to targetDataEnd failed, abort target.
Libomptarget error: Failed to process data after launching the kernel.
Libomptarget error: Source location information not present. Compile
with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while
offloading is mandatory
/var/spool/parastation/jobs/8941317: line 23: 20812 Aborted
     (core dumped) ./a.out

You have an illegal memory access; some memory is not properly
mapped.

This is the code:

#include <iostream>
#include <omp.h>

#pragma omp requires unified_shared_memory
#define N 1024

int main() {
  int a[N] = {0};
  int *device_data = new int[N];
#pragma omp target map(tofrom : device_data[0:N])
  {
    // Reassign the mapped pointer to the host stack array `a`, which
    // is not mapped; without working USM the stores below then touch
    // memory the device cannot access.
    device_data = &a[0];
    for (int i = 0; i < N; i++) {
      device_data[i] = 123;
    }
  }
  std::cout << a[0] << std::endl;
}

Alexey, all,

Looking at the header file omp_target.h, I am not so certain the __tgt_
functions are internal ones, as they are grouped together with
omp_get_num_devices et al.

Hi all,

even a simpler example segfaults when the requires directive is present:

#include <iostream>
#include <omp.h>
#include <stdio.h>

#pragma omp requires unified_shared_memory
#define N 1024

int main() {
  int a[N];
  printf("a=%p\n", a);
#pragma omp target map(tofrom : a[0:N])
  {
    printf("a=%p\n", a);
    for (int i = 0; i < N; i++) {
      a[i] = 123;
    }
  }
  printf("a[0]=%i, a[%i]=%i\n", a[0], N/2, a[N/2]);
}

The code runs successfully when the requires directive is removed, because
the mapping of `a` is explicitly specified.

For this code to run successfully, would it be necessary to allocate `a`
specially, as CUDA managed memory? I don't see any special treatment of
`a` in the LLVM IR. As I understand the OpenMP spec, the requires directive
should lead to a compile error if clang fails to generate such code.

The requires example from the OpenMP Examples also fails with the same
runtime error:

https://github.com/OpenMP/Examples/blob/main/sources/Example_requires.1.cpp

- Joachim

Are the kernel/hardware requirements LLVM-specific?

I can compile and execute the add_grid.cu example successfully:
https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

So I would expect that an OpenMP program should also run successfully.

- Joachim

I think that is it; I have heard of problems with our USM before.
We need to use the managed allocators if USM is active; they are
about to be upstreamed (I hope).
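
Once they land, usage might look roughly like this. A sketch only:
llvm_omp_target_shared_mem_alloc is the LLVM-specific predefined
allocator (an extension, not part of the OpenMP spec), and on CUDA
devices it would be backed by managed memory.

#include <cstdio>
#include <omp.h>

#pragma omp requires unified_shared_memory
#define N 1024

int main() {
  // Allocate through a managed-memory-backed allocator instead of the
  // stack, so the address is valid on both host and device.
  int *a = (int *)omp_alloc(N * sizeof(int), llvm_omp_target_shared_mem_alloc);
#pragma omp target
  for (int i = 0; i < N; i++)
    a[i] = 123;
  printf("a[0]=%d, a[%d]=%d\n", a[0], N / 2, a[N / 2]);
  omp_free(a, llvm_omp_target_shared_mem_alloc);
}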

Hi all,
In the meantime, what do I do?

I’m on JURECA and some nodes are attached to A100 GPUs.

Don't use required USM for now, I would assume.

Ok, thanks. But I wonder: given that there are USM tests in C, has no one noticed these errors so far?

Are you sure your machine/OS supports USM? Summit-like nodes (P9+V100) are the only architecture with an NVIDIA GPU which supports USM that I'm aware of.

Ye

I’ll ask the JURECA’s admins at JSC.

No!

What you suggest (that Itaru check whether the system supports "whatever?!")
would be necessary for a fictitious:
#pragma omp assume unified_shared_memory

but certainly not for

#pragma omp requires unified_shared_memory

The OpenMP implementation must support the required property or must
state that it is not supported. Causing an error during the execution of
a target region is not compliant behavior.

As Johannes suggested, managed allocators might make it possible to provide
software USM when hardware USM is not available. In any case, compiler
and runtime together need to make sure that the requirements for
supporting USM are met, or abort otherwise.

BTW: will the managed allocators also change the behavior of stack
allocations? This would be necessary for compliant USM support.

- Joachim

Are you sure your machine/OS supports USM? Summit-like nodes (P9+V100) are
the only architecture with an NVIDIA GPU which supports USM that I'm aware of.

As you can read on the page I linked earlier, even Kepler already supports
USM, just not in hardware.

Explicit (software) USM requires explicitly invoking allocators such as CUDA managed memory.
Implicit (OS-backed) USM doesn't have such restrictions: memory allocated by malloc on the host can be directly accessed on the device. P9+V100 is the only architecture I have seen work in this category.

In the test case, “int a[N]” is stack memory, not explicitly allocated as CUDA managed memory. So the invalid memory access is a legitimate CUDA error indicating this situation on a machine without implicit USM support.

“#pragma omp requires unified_shared_memory” doesn’t express which level of USM support is required. To me, that means it requires USM up to the level of support on that machine, and the application developer needs to be aware of that. Beyond that is just undefined behavior.

If Itaru wants code like the test case to work, he needs a machine with implicit USM support.
Otherwise, he has to avoid OS allocation and call CUDA managed malloc explicitly, or not use USM at all. I don't think stack allocation can be handled in the explicit USM scenario.
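
The explicit route might look like this. An untested sketch:
cudaMallocManaged is the CUDA runtime call (link against cudart), and
nothing here is OpenMP-specific beyond the target region itself.

#include <cstdio>
#include <cuda_runtime.h>

#pragma omp requires unified_shared_memory
#define N 1024

int main() {
  int *a = nullptr;
  // Managed memory migrates between host and device, so the pointer
  // is valid on both sides even without implicit (OS-backed) USM.
  if (cudaMallocManaged(&a, N * sizeof(int)) != cudaSuccess)
    return 1;
#pragma omp target
  for (int i = 0; i < N; i++)
    a[i] = 123;
  printf("a[0]=%d, a[%d]=%d\n", a[0], N / 2, a[N / 2]);
  cudaFree(a);
}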

There may be improvements that could be made in libomptarget to add more protection, but that wouldn't help make the code work in this case.

Ye

Agreed. Such features can help provide a consistent way of using USM in OpenMP.
Ye

I'm not completely up to date with what the OpenMP allocator traits can
de/prescribe (can they affect stack allocations without an explicit
allocate directive?). But from my perspective, compiler and runtime
should support two modes:

- completely rely on hardware support for USM, but error out at runtime
if the runtime determines during execution that hardware support is not
available (see the probe sketched after this list). Would it be possible
to encode this into the sm-specific runtime library?

- rely on the use of an OpenMP allocator which uses managed memory. For
compilation units with requires USM, the compiler must use __kmpc_alloc
for all stack variables for which it cannot prove that they will not
need managed memory, e.g., if they can escape to a function potentially
spawning a target region.
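
For the first mode, a host-side probe might look like this. A sketch
only: it uses the CUDA runtime attribute for pageable-memory access,
which is roughly what implicit USM needs; a real implementation would
live inside the plugin.

#include <cstdio>
#include <cuda_runtime.h>

// Return true if the device can directly access pageable host memory
// (i.e., plain malloc'ed or stack memory), which is what
// "requires unified_shared_memory" effectively needs without managed
// allocators.
static bool deviceSupportsImplicitUSM(int dev) {
  int pageable = 0;
  if (cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess,
                             dev) != cudaSuccess)
    return false;
  return pageable != 0;
}

int main() {
  if (!deviceSupportsImplicitUSM(0)) {
    fprintf(stderr, "unified_shared_memory required but not supported\n");
    return 1;
  }
  printf("implicit USM supported on device 0\n");
}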

- Joachim

FYI: I also posted a related question on this topic to the omp-lang
list, asking for clarification on the expected behavior if USM cannot
be guaranteed.