[RFC] OpenMP Offloading Backend for C++ Parallel Algorithms

Dear LLVM community, we should support offloading C++ parallel algorithms to GPUs, as NVIDIA already does. AMD recently posted a request for comments on incorporating the AMD implementation of stdpar in Clang. This RFC is similar in spirit but addresses most of the shortcomings of that approach: with a surmountable effort, we can add an OpenMP offloading backend to LLVM’s libc++ project.

Modifications Needed

libc++ already has an OpenMP SIMD backend for the C++ parallel algorithms, and for many of them, extending the existing SIMD backend to target GPUs is straightforward. We can simply define the macro _PSTL_PRAGMA_SIMD to expand to #pragma omp target teams distribute parallel for simd rather than #pragma omp simd, depending on, for instance, a compiler option. (Technically, we could always offload, but then offloading would kick in as soon as the code is compiled for devices, so we might want an opt-out mechanism.)
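As a minimal sketch, the dispatch could look like the following, where the opt-in macro _LIBCPP_PSTL_GPU_OFFLOAD is a hypothetical placeholder for whatever compiler option ends up controlling offloading:

// Sketch only: _LIBCPP_PSTL_GPU_OFFLOAD is a hypothetical opt-in macro,
// not an existing libc++ configuration knob.
#ifdef _LIBCPP_PSTL_GPU_OFFLOAD
#  define _PSTL_PRAGMA_SIMD _Pragma("omp target teams distribute parallel for simd")
#else
#  define _PSTL_PRAGMA_SIMD _Pragma("omp simd")
#endif

With such a definition in place, the algorithms already written against the SIMD backend would pick up offloading without further changes.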

We must also add map clauses to transfer data to and from the device and to allocate/deallocate the corresponding device memory: #pragma omp target … map(to/from:...). If a user has already mapped the data to the device, the map clause will not transfer it again, so users can keep data resident on the device and avoid transfers if they want to. The map clauses have no effect on unified shared memory (USM) systems, provided OpenMP is informed about USM (e.g., via #pragma omp requires unified_shared_memory).
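To illustrate, an offloaded fill could look roughly like this; the names __gpu_fill, __first, etc. are illustrative and not the actual libc++ backend code:

#include <cstddef>

// Sketch: offloaded fill with an explicit map clause. The output range is
// mapped back from the device; if __first was already mapped by the user,
// the runtime reuses the existing device copy instead of transferring.
template <class _Tp>
void __gpu_fill(_Tp* __first, std::size_t __n, const _Tp& __value) {
  _Tp __v = __value; // local copy so the target region does not capture a host reference
#pragma omp target teams distribute parallel for simd map(from : __first[0 : __n])
  for (std::size_t __i = 0; __i < __n; ++__i)
    __first[__i] = __v;
}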

Functors and Function Pointers

Lambdas passed to the algorithms are compiled correctly for GPU targets without any modifications. However, on non-USM systems, users need to map any additional variables accessed by the lambda to the device.
The only major problem we have encountered is that we cannot tell whether the user passes a host or a device function pointer into the function template. So at the moment, we have to pass device function pointers to the templates, for instance like this:

#include <algorithm>
#include <execution>
#include <cassert>
#include <iostream>
#include "definitions.h"

#pragma omp declare target
void init(int& n) { n = -1; }
void increment(int& n) { n++; }
#pragma omp end declare target

int main()
{
  void (*initdev)(int& n);
  void (*incrementdev)(int& n);

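  // Retrieve the device addresses of init and increment from within a target region.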
  #pragma omp target map(always,from:initdev,incrementdev)
  {
    initdev = &init;
    incrementdev = &increment;
  }

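  // LEN is a problem size provided by definitions.h.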
  int* a = new int[LEN];
  std::for_each(std::execution::par_unseq, a, a + LEN, initdev);
  for (int i = 0; i < 100; i++) {
    std::for_each(std::execution::par_unseq, a, a + LEN, incrementdev);
    assert(a[0] == i);
    assert(a[LEN-1] == i);
  }
  delete[] a;
  return 0;
}

That said, we will soon map function pointers to their device counterparts automatically, eliminating the need for the initdev and incrementdev pointers above.

Patch

I have made a draft PR to show an example of how to offload std::for_each, std::fill, and std::transform to GPUs using OpenMP. It is simple and naive at this point, but the work gets offloaded to the GPU as expected.

If anybody would like to take a look at the code generated when applying the patch, I have made another GitHub repository with temporary files from 12 small test programs.


I think it makes a lot of sense to add a backend with OpenMP offloading support, assuming that we can fulfill the standard’s requirements. This shouldn’t be an extension of the SIMD backend, though. Instead, it should get its own backend with the corresponding algorithms and markups.


This is a really interesting investigation. I left some comments on the PR to avoid spreading the conversation too much, but overall, dispatching to the GPU has been one of our design goals for the backend API, so we would clearly like a backend that actually uses these capabilities. I have some thoughts on your actual patch, but the direction makes sense to me.


It was a PoC, after all, to get the conversation going. I think the comments are encouraging, and we are very much interested in feedback on how to do this right. That is one of the reasons we didn’t try to do too much at this stage.

@philnik That’s fine. I think the point of the PR was to show how easy it is to extend the SIMD backend. We can reasonably use (a copy of) it as a starting point that allows more elaborate modifications.

@ldionne There is a GPU meetup before the dev meeting. I will put this on the agenda; if you can join, please do.

Do you mean the workshop on October 10th? Unfortunately, I don’t think I’m going to make it; I’ll only get there in the afternoon :frowning:

Yep, that one. We’ll still discuss this topic though :wink:

For now, I think the next step is a “fresh” PR with a standalone GPU backend, or at least the first parts of it, with tests, etc., so we can have a proper review. While lambdas and objects with a call operator already work, in the sense that we auto-magically compile them for the GPU target together with their visible transitive dependencies, we need to look at syntax/APIs for users that “dislike OpenMP”.

That should not be much of a problem: we simply hide the pragmas and such behind agnostic llvm offload APIs. Note that the following is only needed for complex cases and if you want to improve performance by reducing memory movement on non-unified-shared-memory systems.

To move an array to the GPU and keep it there to avoid moves when std::par algorithms are used in sequence:

#pragma omp target enter data map(to: Array[0:N]) device(DevNo)

can be hidden in a function like

void llvm_offload_host_to_device(void * ptr, size_t bytes, int device_no); 

The runtime will notice the existing mapping, skip the transfer, and automatically use the array’s device address in the GPU code.
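A minimal sketch of how such a wrapper could be implemented; the signature comes from above, while the body is an assumption:

#include <cstddef>

// Map the raw bytes of ptr to the given device and keep them resident;
// later map clauses on the same pointer will find the data already present.
void llvm_offload_host_to_device(void* ptr, std::size_t bytes, int device_no) {
  char* p = static_cast<char*>(ptr);
  #pragma omp target enter data map(to : p[0 : bytes]) device(device_no)
}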

Similarly, to compile a function for the GPU explicitly, e.g., if it is passed via a function pointer or in a different translation unit:

#pragma omp declare target to(fn_foo) [indirect]

could be replaced by an attribute. We have __device__, we have [[omp::declare_target]], and we could have something for the llvm offload API; all of these do effectively the same thing.
(Please ignore syntax errors, the idea should be sound.)

Wrt. tests:
We have ~4 GPU buildbots right now; I’ll check with the owners whether they build libc++ and whether we can test the GPU backend. We are also in the process of setting up CI capabilities on a multi-GPU system, but that might need some more time.
If the libcxx buildbots could be migrated to a GPU system, or if we want to test them via CPU offload, that would be great too. All of the required code, runtimes, etc. are in upstream LLVM, so the only outside dependencies are the GPU “drivers” (ROCm for AMD and CUDA for NVIDIA).


Thanks for the detailed feedback!

I will push a few modifications that I have locally to the draft PR. Then we can start working on adding a separate offloading backend in libcxx/include/__algorithm/pstl_backends.

I have tried to add a separate OpenMP offloading backend under libcxx/include/__algorithm/pstl_backends/gpu_backends. What do you think of this draft PR?

For anyone interested, we’re planning on meeting at 09:00 AM PST (aka 12:00 EST) on Monday October 2nd to go over the patch and discuss the RFC. If you want to join, DM me your e-mail address on Discord and I’ll send you the calendar invitation!
