AMD is interested in contributing support for seamlessly offloading the execution of C++ Algorithms to GPUs. We have been working on this for a while and feel it has reached a point where it can be beneficial to the wider community. Internally, we’ve used the mnemonic `stdpar` for this functionality.
Why Do We Want To Do This?
GPU programming remains a highly specialist skill that requires one to become familiar with new idioms and engage in rituals (e.g. explicit memory management) that can be less than ergonomic for the experienced C++ programmer. We (AMD) believe that there is great merit in making GPUs extremely easy to leverage from standard C++, without requiring the programmer to change their source code. The addition of overloads that give implementations the possibility to parallelise standard algorithms, coupled with the evolution of GPU hardware, makes it possible to achieve this goal: with the `stdpar` mechanism being proposed, algorithms can be seamlessly offloaded to GPU execution, without changes to the source, simply by adding a compiler flag. This allows one not only to re-use code that they might have maintained and thoroughly validated for some time, but also to directly access standard library components such as containers (e.g. `std::array`), utilities (e.g. `std::pair`, `std::move`) or serial algorithms (e.g. `std::sort`).
Although it would have been possible to develop the feature separately, or make it proprietary, we think it is more important to participate in the wider ecosystem. We expect that this will ultimately lead to a refined tool which can serve the programming community at large, without enforcing vendor lock-in of any sort.
What Is `stdpar`?
`stdpar` is merely a composition of components that are already available in Clang / LLVM, coupled with some sugar that takes advantage of the C++ language specification. Practically, it is a flag that one passes to `clang`. Said flag establishes the following basic but potent invariant: any and all functions that are reachable from an algorithm invoked with a suitable execution policy (initially `parallel_unsequenced`) can and should be offloaded to GPU execution. This is built atop pre-existing support for the HIP language, relaxing some of the latter’s restriction checks / codegen constraints, and adding the ability to compute a kernel’s reachability.
Note that similar, but not entirely equivalent, functionality has been offered by NVIDIA’s NVC++, with encouraging results.
How Is `stdpar` Used?
The user sets the `-stdpar` flag, and algorithms that were invoked with the `parallel_unsequenced_policy` execution policy then get offloaded to the GPU, assuming that one is present. In practice, this helps users avoid rewriting large, complex algorithms against new, unfamiliar interfaces, allowing them to focus on their actual work.
What Targets Will Be Supported Initially?
We have validated HIP, as the implicit language, on AMDGPU targets, using rocThrust as the algorithm library, since these are under our direct maintenance and control. Whilst the mechanisms being proposed are generic and there is no inherent AMD-specific lock-in, we were not in a position to validate more widely. We also introduce two levels of capability (both validated): one relies on transparent on-demand paging (as offered via HMM), whilst the other falls back to traditional explicit GPU memory management, which imposes some restrictions. Finally, bring-up and validation on Windows is something we will undertake at a later date, for the HIP + AMDGPU targets + rocThrust combination.
How Are We Going To Do This?
We have submitted the patches that enable this capability for review:
⚙ D155769 [Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload. We are also making the runtime component (what we call the compatibility header) available here: GitHub - ROCmSoftwarePlatform/roc-stdpar. Within the same repository, we are providing both a single diff against trunk, for perusal by those interested in early local testing, and diffs against a series of codebases that use the functionality. In general, we would like for `stdpar` to converge on being indistinguishable from any generic Clang/LLVM component. More specifically, its evolution will be driven exclusively within the confines of the project, and there will be no separate, out-of-tree development. We would also like to coordinate with the rest of the GPU LLVM ecosystem around broadening support past HIP on AMDGPU, and it might be opportune to set up a working group aimed at achieving this coordination.
Will AMD Continue Working On `stdpar`?
Yes, and we look forward to cooperation with other groups / component owners around extending support / validation to, amongst others:
- SYCL + oneDPL (Alexey Bader)
- OpenCL + Boost.Compute (Anastasia Stulova)
- HIPSPV / chipStar + oneDPL (Pekka Jääskeläinen)
- CUDA + thrust (Justin Lebar, Artem Belevich)
Also, since this can be regarded as a language extension of sorts, it is likely of interest to @zygoloid and @rjmccall as well.