Dear LLVM community,

We should also support offloading C++ parallel algorithms to GPUs, as NVIDIA does. AMD recently made a request for comments to incorporate their implementation of stdpar in Clang. This proposal is similar but addresses most of the shortcomings of that approach: with a surmountable effort, we can add an OpenMP offloading backend to LLVM's libc++ project.
Modifications Needed
libc++ already has an OpenMP SIMD backend for the C++ parallel algorithms, and for many of them, extending that backend to target GPUs is straightforward. We can simply define the macro `_PSTL_PRAGMA_SIMD` to expand to `#pragma omp target teams distribute parallel for simd` rather than `#pragma omp simd`, depending on, for instance, a compiler option. (Technically, we could always emit the target directive, but then code would be offloaded as soon as it is compiled for devices, so we probably want an opt-out mechanism.)
We must also add mapping clauses, `#pragma omp target … map(to/from: ...)`, to transfer data to and from the device and to allocate/deallocate the corresponding device memory. If the user has already mapped the data to the device, the map clause will not transfer it again, so users can keep data resident on the device and avoid transfers if they want to. On unified shared memory (USM) systems, the mapping clauses have no effect (provided OpenMP is informed about USM, e.g. via `#pragma omp requires unified_shared_memory`).
Functors and Function Pointers
Lambdas passed to the algorithms are compiled correctly for GPU targets without any modifications; however, on non-USM systems, users need to map any additional variables accessed inside the lambda to the device.
The only major problem we have encountered is that the user may pass either a host or a device function pointer into the function template. At the moment, we therefore have to pass device function pointers to the templates, for instance:
```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <iostream>

#include "definitions.h" // defines LEN

#pragma omp declare target
void init(int& n) { n = -1; }
void increment(int& n) { n++; }
#pragma omp end declare target

int main() {
  void (*initdev)(int&);
  void (*incrementdev)(int&);

  // Retrieve the device addresses of init and increment.
  #pragma omp target map(always, from: initdev, incrementdev)
  {
    initdev = &init;
    incrementdev = &increment;
  }

  int* a = new int[LEN];
  std::for_each(std::execution::par_unseq, a, a + LEN, initdev);
  for (int i = 0; i < 100; i++) {
    std::for_each(std::execution::par_unseq, a, a + LEN, incrementdev);
    assert(a[0] == i);
    assert(a[LEN - 1] == i);
  }
  delete[] a;
  return 0;
}
```
That said, we will soon map function pointers to their device counterparts automatically, eliminating the need for the `initdev` and `incrementdev` pointers above.
Patch
I have made a draft PR showing an example of how to offload `std::for_each`, `std::fill`, and `std::transform` to GPUs using OpenMP. It is simple and naive at this point, but the algorithms are offloaded to the GPU as expected.
If anybody would like to look at the code generated when the patch is applied, I have made another GitHub repository with the temporary files from 12 small test programs.