Implementing "Partial Shared Virtual Memory" for OpenMP Offload Target

Dear all,

in our downstream LLVM, we want to provide some kind of “partial shared virtual memory” support for OpenMP offloading to our hardware accelerator, and I would appreciate some hints on how to integrate with the OpenMP API + runtime.

A special situation for us is that we offload from a 64-bit host to 32-bit accelerators. A further complication is that normal load and store accesses made by our 32-bit cores to host RAM are slow. We can, however, transfer working data from host RAM to device scratchpad memory via DMA, and accesses to device memory are then much faster.

Because we only have 32-bit pointers on the accelerators, we cannot provide true shared virtual memory. Instead, what we can roughly do is map the 4 GB address space of an accelerator to a part of the host address space. Data in this part of host RAM is what we want to make accessible via “partial shared virtual memory” on the device.
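To make the window idea concrete, here is a minimal sketch of the address translation it implies. It assumes that device address 0 corresponds to the start of the host window; on real hardware there would likely be an extra offset. The type psvm_window_t and all function names are hypothetical and are not part of libomptarget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of the host region that mirrors the device's
 * 32-bit address space. Assumption: device address 0 maps to host_base. */
typedef struct {
  uintptr_t host_base; /* host virtual address where the window starts */
  uint64_t size;       /* window size in bytes, at most 4 GiB */
} psvm_window_t;

/* True if [host_ptr, host_ptr + len) lies entirely inside the window. */
static bool psvm_contains(const psvm_window_t *w, const void *host_ptr,
                          size_t len) {
  uintptr_t p = (uintptr_t)host_ptr;
  if (p < w->host_base)
    return false;
  uint64_t off = (uint64_t)(p - w->host_base);
  return off <= w->size && (uint64_t)len <= w->size - off;
}

/* Host pointer -> 32-bit device address (offset from the window base). */
static uint32_t psvm_host_to_device(const psvm_window_t *w,
                                    const void *host_ptr) {
  assert(psvm_contains(w, host_ptr, 1));
  return (uint32_t)((uintptr_t)host_ptr - w->host_base);
}

/* 32-bit device address -> host pointer inside the window. */
static void *psvm_device_to_host(const psvm_window_t *w, uint32_t dev_addr) {
  assert((uint64_t)dev_addr < w->size);
  return (void *)(w->host_base + dev_addr);
}
```

Anything allocated outside the window fails psvm_contains() and therefore has no device-side representation, which is exactly the “partial” part of the scheme.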

At first glance, it seems that we could make use of the following functions in order to make some part of host RAM accessible on our accelerators from a user programming model perspective:

void *llvm_omp_target_alloc_device(size_t size, int device_num);
void *llvm_omp_target_alloc_host(size_t size, int device_num);
void *llvm_omp_target_alloc_shared(size_t size, int device_num);

Would llvm_omp_target_alloc_host() be reasonable here? That is, could we provide a special implementation of it, so that data allocated this way in host RAM is mapped into the 32-bit address space of our device? As far as I understand, the host variant would be appropriate, since data allocated this way “cannot migrate to the device”, which results in slow accesses given the hardware properties described above. But I do not fully understand the difference between the host and shared variants, nor what “migratable” means here.
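One way such a special implementation might hand out memory is a simple bump allocator over the host region that the device can see. This is only a sketch: window_arena_t and window_alloc() are hypothetical names, the region is assumed to be reserved and suitably aligned beforehand, and freeing is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena covering the host region that is visible to the
 * device. Assumption: base itself is aligned at least as strictly as any
 * alignment requested later. */
typedef struct {
  unsigned char *base; /* start of the device-visible host region */
  size_t size;         /* total bytes in the region */
  size_t used;         /* bytes handed out so far, including padding */
} window_arena_t;

/* Allocate size bytes at an offset that is a multiple of align (a power of
 * two). Returns NULL if the region is exhausted. */
static void *window_alloc(window_arena_t *a, size_t size, size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  size_t start = (a->used + align - 1) & ~(align - 1);
  if (start > a->size || size > a->size - start)
    return NULL;
  a->used = start + size;
  return a->base + start;
}
```

Every pointer returned this way lies inside the region, so its offset from base always fits the device's address range as long as the region is no larger than 4 GB.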

Furthermore, we would have to implement corresponding support in libomptarget and our target-dependent offload RTL. Has anyone provided a similar kind of shared virtual memory support for another target and could provide any hints on how this was roughly done?

Any hints would be greatly appreciated.

If you’re familiar with CUDA, we currently map llvm_omp_target_alloc_host and llvm_omp_target_alloc_shared to CUDA host pinned memory, i.e. both live in a unified address space accessible from both the host and the device. I’m a little rusty on the details, but I think the difference is that llvm_omp_target_alloc_shared is expected to support asynchronous memory accesses, e.g. you can read values written to it from the host while the kernel is executing.

I’m not sure if one of these would work for your purposes, since they assume a unified address space. Here’s an example of that function being used: llvm-project/omp_device_managed_memory_alloc.c (on main in llvm/llvm-project on GitHub). As you can see, we do not use a mapping table and instead pass the pointer directly via is_device_ptr, so it’s firstprivate and bypasses the normal mapping done to pointers.

Right now we assume that the host and device at least have compatible address spaces, e.g. both the host and device are 64-bit. It’s theoretically possible to map a 64-bit address to a 32-bit address, however, since it’s just a map. The obvious caveat is that you could have more host pointers than there are device pointers to map to, but that’s highly unlikely to happen in practice.
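To illustrate the “it’s just a map” point, here is a rough sketch of a table that translates 64-bit host pointers into 32-bit device addresses. It is not how libomptarget stores its mappings; ptr_map_t, the fixed capacity, and the function names are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTR_MAP_CAPACITY 16 /* arbitrary limit for this sketch */

/* One mapped range: a host buffer and the device address it starts at. */
typedef struct {
  uintptr_t host_begin;
  size_t size;
  uint32_t dev_begin;
} ptr_map_entry_t;

typedef struct {
  ptr_map_entry_t entries[PTR_MAP_CAPACITY];
  size_t count;
} ptr_map_t;

/* Record a mapping. Returns 0 on success, -1 if the table is full. */
static int ptr_map_insert(ptr_map_t *m, const void *host, size_t size,
                          uint32_t dev) {
  assert(m != NULL);
  if (m->count == PTR_MAP_CAPACITY)
    return -1;
  m->entries[m->count++] = (ptr_map_entry_t){(uintptr_t)host, size, dev};
  return 0;
}

/* Translate a host pointer that falls inside a mapped range. Returns 0 and
 * writes *dev_out on success, -1 if the pointer is not mapped. */
static int ptr_map_lookup(const ptr_map_t *m, const void *host,
                          uint32_t *dev_out) {
  assert(m != NULL && dev_out != NULL);
  uintptr_t p = (uintptr_t)host;
  for (size_t i = 0; i < m->count; ++i) {
    const ptr_map_entry_t *e = &m->entries[i];
    if (p >= e->host_begin && p - e->host_begin < e->size) {
      *dev_out = e->dev_begin + (uint32_t)(p - e->host_begin);
      return 0;
    }
  }
  return -1;
}
```

The caveat above shows up here as the -1 return from ptr_map_insert(): the device side can run out of room long before the host does.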

Thanks a lot for your explanations! I think we’ll go for trying to implement it using these functions then. I think we can make it work this way.