[RFC] Adding C++ Parallel Algorithm Offload Support To Clang & LLVM

AMD is interested in contributing support for seamlessly offloading the execution of C++ Algorithms to GPUs. We have been working on this for a while and feel it has reached a point where it can be beneficial to the wider community. Internally, we’ve used the mnemonic stdpar for this functionality.

Why Do We Want To Do This?

GPU programming remains a highly specialist skill that requires one to become familiar with new idioms and to engage in rituals (e.g. explicit memory management) that might be less than ergonomic for the experienced C++ programmer. We (AMD) believe that there is great merit in making GPUs extremely easy to leverage from standard C++, without requiring the programmer to change their source code. The addition of overloads that give implementations the possibility to parallelise standard algorithms, coupled with the evolution of GPU hardware, makes it possible to achieve this goal: with the stdpar mechanism that is being proposed, algorithms can be seamlessly offloaded to GPU execution, without changes to the source, simply by adding a compiler flag. This allows one not only to re-use code that they might have maintained and thoroughly validated for some time, but also to directly access standard library components such as containers (e.g. std::array), utilities (e.g. std::pair, std::move) or serial algorithms (e.g. std::sort).

Although it would have been possible to develop the feature separately, or make it proprietary, we think it is more important to participate in the wider ecosystem. We expect that this will ultimately lead to a refined tool which can serve the programming community at large, without enforcing vendor lock-in of any sort.

What Is stdpar?

stdpar is merely a composition of components that are already available in Clang / LLVM, coupled with some sugar that takes advantage of the C++ language specification. Practically, it is a flag that one passes to clang. Said flag establishes the following basic but potent invariant: any and all functions that are reachable from an algorithm invoked with a suitable execution policy (initially parallel_unsequenced) can and should be offloaded to GPU execution. This is built atop pre-existing support for the HIP language, relaxing some of the latter’s restriction checks / codegen constraints, and adding the ability to compute the set of functions reachable from a kernel.

Note that similar, but not entirely equivalent, functionality has been offered by NVIDIA’s NVC++, with encouraging results.

How Is stdpar Used?

The user sets the -stdpar flag, and then algorithms that were invoked with the parallel_unsequenced_policy execution policy get offloaded to the GPU, assuming that one is present. In practice, this helps users avoid rewriting large, complex algorithms against new, alien interfaces, allowing them to focus on their work.
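To illustrate, here is a minimal sketch (the scaling example is illustrative, not taken from the patch set); the source is plain standard C++, and only the compiler invocation changes:

    // Plain ISO C++; no GPU annotations, no vendor headers.
    // Illustrative invocation: clang++ -stdpar scale.cpp
    #include <algorithm>
    #include <execution>
    #include <vector>

    void scale(std::vector<float> &v, float a) {
        // With -stdpar this par_unseq invocation is offloaded to the GPU
        // (if one is present); without the flag, it is an ordinary
        // parallel STL call.
        std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                      [a](float &x) { x *= a; });
    }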

What Targets Will Be Supported Initially?

We have validated HIP, as the implicit language, on AMDGPU targets, using rocThrust as the algorithm library, since these are under our direct maintenance and control. Whilst the mechanisms being proposed are generic and there is no inherent AMD-specific lock-in, we were not in a position to validate more widely. We also introduce two levels of capability (both validated): one which relies on transparent on-demand paging (as offered via HMM), and another which implicitly relies on traditional explicit GPU memory management but imposes some restrictions. Finally, bring-up and validation on Windows is something we will undertake at a later date, for the HIP + AMDGPU targets + rocThrust combination.

How Are We Going To Do This?

We have submitted the patches that enable this capability for review: D155769 [Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload. We are also making the runtime component (what we call the compatibility header) available here: GitHub - ROCmSoftwarePlatform/roc-stdpar. Within the same repository, we are providing both a single diff against trunk, for perusal by those interested in early local testing, and diffs against a series of codebases that use the functionality. In general, we would like for stdpar to converge on being indistinguishable from any generic Clang/LLVM component. More specifically, its evolution will be driven exclusively within the confines of the project, and there will be no separate, out-of-tree development. We would also like to coordinate with the rest of the GPU LLVM ecosystem around broadening support past HIP on AMDGPU, and it might be opportune to set up a working group aimed at achieving this coordination.

Will AMD Continue Working On stdpar?

Yes, and we look forward to cooperation with other groups / component owners around extending support / validation to, amongst others:

  • SYCL + oneDPL (Alexey Bader)
  • OpenCL + Boost.Compute (Anastasia Stulova)
  • HIPSPV / chipStar + oneDPL (Pekka Jääskeläinen)
  • CUDA + thrust (Justin Lebar, Artem Belevich)

Also, since this can be regarded as a language extension, of sorts, it is likely of interest to @zygoloid @rjmccall as well.

Given the restrictions on new users, I couldn’t directly reference the various component owners in the initial post, so I’m doing it here, apologies for the inconvenience: @bader, @AnastasiaStulova

@pjaaskel @Artem-B

We should probably get someone from NVIDIA’s libcu++ involved. I’ll check tomorrow whether they are interested.

First of all, thanks for working on this. I think this is very interesting and would be great to have in the LLVM project. However, I’d like to understand a bit more how exactly things are supposed to work. You say that adding -stdpar to the command line should enable offloading to the GPU, but you don’t say what this flag would do inside the compiler. Does this include new headers? Does it enable some builtins? Does it enable new optimization passes? Something completely different?

Also worth mentioning is that this should be coordinated with the effort to add the PSTL to libc++.

Thank you for the quick reply! I’ll try to be brief here, noting that the patch set will be more detailed for each component:

  • passing -stdpar has the consequence of enabling HIP (in a sense it’s as if you passed -x hip), and setting a specific language option we’re adding;
  • the HIP / ROCm drivers are augmented to also search for the path to the forwarding header available here: https://github.com/ROCmSoftwarePlatform/roc-stdpar/blob/main/include/stdpar_lib.hpp, and if it is found and we are compiling with stdpar, it gets implicitly included (HIP already does this for other headers such as the one introducing the device side math function overloads, so it’s not quite novel);
  • correspondingly, Sema is relaxed so as to allow calls from the device side to unannotated host functions;
  • CodeGen is similarly relaxed to allow emission for unannotated functions;
  • subsequently, we run a pass over the IR which computes the set of functions reachable from kernels; functions that are not reachable get removed (see the sketch right after this list).
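For the last bullet, a rough sketch of the reachability computation, in LLVM-style C++ (illustrative only, not the actual pass from the patch set; for simplicity it only follows direct calls):

    #include "llvm/ADT/SmallPtrSet.h"
    #include "llvm/ADT/SmallVector.h"
    #include "llvm/IR/InstIterator.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    // Collect all functions transitively reachable from AMDGPU kernel entries.
    static SmallPtrSet<Function *, 32> reachableFromKernels(Module &M) {
      SmallVector<Function *> Worklist;
      SmallPtrSet<Function *, 32> Reachable;
      for (Function &F : M)
        if (F.getCallingConv() == CallingConv::AMDGPU_KERNEL &&
            Reachable.insert(&F).second)
          Worklist.push_back(&F);
      while (!Worklist.empty()) {
        Function *F = Worklist.pop_back_val();
        for (Instruction &I : instructions(*F))
          if (auto *CB = dyn_cast<CallBase>(&I))
            if (Function *Callee = CB->getCalledFunction()) // direct calls only
              if (Reachable.insert(Callee).second)
                Worklist.push_back(Callee);
      }
      return Reachable; // everything outside this set can be removed
    }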

Everything else proceeds as normal for a HIP compilation, and we do not modify other components (so it’s just Driver + FE + ME). The forwarding header merely introduces into std overloads that take parallel_unsequenced_policy as the explicit type of their first argument, which makes them better candidates than the master template. The body merely forwards to an accelerator specific algorithm library (for now, rocThrust).
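To make that concrete, here is a hedged sketch of the overload pattern (the sort forwarding below is illustrative; the actual header is the roc-stdpar one linked above):

    #include <execution>
    // rocThrust mirrors the Thrust interface.
    #include <thrust/execution_policy.h>
    #include <thrust/sort.h>

    namespace std {
    // Taking parallel_unsequenced_policy as the concrete first parameter makes
    // this more specialised than the standard library's generic
    // template <class ExecutionPolicy, ...> overload, so it wins overload
    // resolution for std::sort(std::execution::par_unseq, ...).
    template <typename RandomIt>
    void sort(execution::parallel_unsequenced_policy, RandomIt first,
              RandomIt last) {
      ::thrust::sort(::thrust::device, first, last); // forward to the accelerator library
    }
    } // namespace std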

As regards coordination with the libc++ PSTL effort, I admit that I’m not sure what that entails, so if you could provide additional details, they’d be very welcome. As far as I know, and I could be wrong, the TBB PSTL implementation was donated by Intel some time ago and both libc++ and libstdc++ incorporated it. Independently of that, please note that what we are proposing is intended to compose at the standard C++ interface level and wouldn’t interfere with what libc++ does: algorithms that are invoked with par_unseq just get redirected to the offload implementation without burdening the standard library.

I’ll point to the first patch in the series, since that has the docs and they might be useful: D155769 [Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload.

That’s a great idea, thanks! Perhaps it’s worth polling Jared & the Thrust guys too? This is more up their alley and Thrust would be a natural implementation detail for a CUDA variant, I think.

Not really. In libc++ we have started implementing our own version, since the Intel PSTL doesn’t really meet our requirements for a number of reasons.

I think you are completely wrong here. This is interfering with libc++ simply by adding overloads to the namespace std, and thus restricting what the standard library itself can do without breaking you. While unlikely in this specific case, these kinds of shenanigans regularly break people, and I don’t think we want to start doing this at the toolchain level. This will also inevitably burden the standard library, since what you want to do is basically transparent to the user, resulting in bug reports against libc++ and not some third-party library.
In libc++ we have customization points which are specifically for adding different backends, which is basically what you are proposing here (plus support from the compiler). I think it would be much healthier for the library and the toolchain as a whole to instead use these customization points and thus avoid lots of problems.
Another problem is that the wrapper does a lot of things that break configurations libc++ supports.
These are the problems that I found taking a quick glance:

  • introducing a bunch of non-reserved identifiers
  • using non-reserved namespaces
  • not annotating functions to make them hidden
  • not hiding try/catch behind macros
  • using ADL

So, TBH, I think the only way to make this seamless is to properly integrate support into libc++.

CC @crtrott

Thank you for the thorough feedback; this is exactly the sort of discussion that motivated the RFC and our choice to develop this in the open. I will try to cover all the issues you raised slightly later today, if you do not mind.

I’m curious what changes will be required to interface with libc++. Will these changes all be relegated to the host? Or are we planning on enabling certain functions on the device that aren’t constexpr, like we require with CUDA? I’ve had passing thoughts about doing something similar with the libc++ library to what I’m doing with the libc library to enable calling it from the GPU; see libc for GPUs — The LLVM C Library.

Thanks for working on this. I agree with @philnik that it would be great if this could be integrated into libc++. I think that would make it easier to use stdpar with the standard library.

At a glance, this seems like a reasonable thing for Clang to take, but it’s definitely an extension, and we’ll have to consider it according to the extensions policy; I’d appreciate it if you could make a post addressing each of those criteria.

I’m somewhat concerned about putting std in the name when there’s an accepted standards process for C++ and this isn’t actually part of it. It might be more appropriate to name your project something less loaded.

Thanks for bringing this up. Our efforts have been stalled for a while, maybe this helps us get the momentum we need.

A few general thoughts:

  1. Any approach based on direct inclusion of HIP/CUDA/SYCL headers will not work well for anything that is not also HIP/CUDA/SYCL. For one, these languages are “like C++” but not quite. The differences are key and will cause never-ending friction. Further, you cannot, as of now, combine them with other offloading models, e.g., OpenCL and OpenMP. This is an actual problem since the latter is the only upstream offload runtime.
  2. Integration (only) in libc++ is suboptimal as not everyone can convince/force their users to change the stdlib. That said, I agree that libc++ needs to be involved to avoid conflicts, if libc++ is used by the application.
  3. I fail to see how this tackles the two main (or most interesting) problems with “magically offload this”:
    a) Where is my data? This implies determining whether offloading is worthwhile, whether movement is required, etc. (I understand it works with unified shared memory, but still…)
    b) Can I execute the functor on the device/host or not?
    I looked at the tests in the GH repo; they do not seem to touch the second point at all.

That all said, what we’ve been working on is a wrapper layer to solve 1) and 2) by design. 3a) can be added to any approach, 3b) probably requires compiler support but can also be added to any approach.

The current status can be found here: GitHub - markdewing/libompx at add_catch
An example would look like this:

 ompx::device::sort_by_key(keys_begin, keys_begin + N, values);

and resolves to a vendor-provided implementation, or a fallback.
FWIW, we are not married to the spelling and such.

Note that we do not include the vendor-provided headers in the user application but instead create a library from them (which solves 1).
Wrt. 2), we could make this a part of libc++ that can be used standalone, or use it in libc++ as a dependency (assuming this will live in llvm-project/openmp/ or llvm-project/offload).
For 3a), we have some prototype code to check where the data is, perform movement if necessary, etc. 3b) was not yet approached.

My thoughts regarding the points above:

3a) I don’t think it’s worth trying to deal with “Where is my data?” without having executors to give a name to the device aspect of “where”. I may even go as far as saying there may be no reason to refine automagical data dispatch if I can’t name the from/to aspect of “where”. Strictly speaking, you may not need this, as all APIs implicitly define where data goes when dispatching from the host, but given how C++ seemingly wants to introduce GPGPU and heterogeneity in an incremental fashion, a first step supporting only automagic dispatch is reasonable.
3b) I don’t think any API besides C++AMP ever bothered to take on that burden as part of the language. AMP’s definition was fairly nice and could be borrowed, naturally without the restrict keyword, which caused a few allergic reactions even though it had merits. (This, of course, @AlexVlx is well aware of.)

Actually, it seems to work in a lot of cases already.
For OpenCL it works by design, as this is a pure runtime problem and OpenCL is just seen as another C library. hipSYCL seems to work with some CUDA and HIP code in the same application. oneAPI SYCL seems compatible with OpenMP in the same code.
For the RFC here, I guess they have their implementation working alongside HIP, since it comes from the AMD HIP team.
That said, I agree that having some consolidation effort across the various approaches to reduce end-user friction is an important goal, because it will be necessary in any big enough project relying on libraries, for example.

C++ doesn’t seem (to me) to be moving in the direction of adding “places” for memory. Even if it does, that is far out. What we have now is the offloading models defining APIs for that. Let’s assume that is all we will have for years to come.
Those APIs already allow the user to allocate “shared” and “dedicated” memory, move stuff, etc. People will need to use those if they have a distributed memory system w/o unified shared memory (USM), or if they don’t get what they want from USM. Let’s assume either of the two will be true for a while to come. So, we have APIs for allocation and movement outside of C++, and we assume people will need to use them. The question now is how this plays into C++ algorithms. My argument was that the algorithm can check data placement, and thereby implicitly get information about whether offloading was desired. Always offloading (w/ movement or USM) is not what everyone wants. Never offloading isn’t either. What people want is to offload when it makes sense, and, given a fixed interface, we need to determine when that is. Data movement cost is often the biggest factor in this determination.
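To sketch that placement check concretely (hipPointerGetAttributes is a real HIP API; the dispatch heuristic, and the member/enumerator spellings, which vary across ROCm releases, are assumptions here):

    #include <hip/hip_runtime.h>

    // Heuristic sketch: offload only when the runtime says the data already
    // lives on (or can migrate to) the device, avoiding a host->device copy.
    static bool worth_offloading(const void *p) {
        hipPointerAttribute_t attr;
        if (hipPointerGetAttributes(&attr, p) != hipSuccess)
            return false; // unknown to the runtime: assume plain host memory
        // ROCm 6 spelling; older releases name this member memoryType.
        return attr.type == hipMemoryTypeDevice ||
               attr.type == hipMemoryTypeManaged;
    }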

I do not understand this response. All offload languages have a way to declare code for the device, host, or both, but that is not the only problem here. You need a way to determine if a pointer/executor points to a device function, or, if it points to a host function with equivalent device code at a known address. Maybe I did not express that properly before.

I do not follow. OpenCL uses a distinct offload driver and runtime. You can write code with OpenCL and have host, or even offload OpenMP, in the same program, but you cannot communicate. There are no cross-language device calls. Sharing global state is not a thing, and sharing pointers to heap memory may or may not work, depending on the setup and target architecture (one or multiple “contexts”).
SYCL and OpenMP, even in the oneAPI implementation, share some of these problems (as far as I can tell). They use the same “context”, which should make heap sharing possible, but I don’t think they use the same driver, so no inter-language calls or shared globals. I am happy if someone can correct me on this one.

So far I only mentioned the driver and runtime issues. We have other things, like differences in overload rules, explicit vs. implicit device code generation, different (lambda) capture rules, different ways to make math/complex/… work, …

I think we agree on the facts, not on the expectations.
In my modest world, just having a program that uses different accelerators from different vendors, with various offload models and back-ends, compile and not crash when run, with each one offloading efficiently in its own isolated world and just “communicating” through host memory, is already an achievement. :wink:
A second-order optimization for me is to rely on some explicit interoperability concepts.
Of course, I would also love what I see as a third-order optimization: having the big picture working in an optimal way, transparently, with high performance, low power consumption, even against vendor will, etc.
But I feel this is outside the scope of this RFC, or am I missing some points?

What I want to avoid is that we paint ourselves into a corner. This RFC is based on HIP, which is problematic for many reasons. What I would suggest is to build something that works with “all” models, is based on upstream code, and can be shared/endorsed by all vendors. Hence my suggestion to develop the shim layer over the vendor libraries rather than a header-only HIP/CUDA/… implementation.

“a pointer/executor points to a device function”: I may have missed a few iterations of executors, but don’t executors define the “where and how” of executing a function? It’s equivalent to a thread pool or a logical device.

What C++AMP did was introduce an extra property on functions, its restrictness, which could be cpu, amp, and a proposed auto, e.g. int sq(int x) restrict(cpu,amp). The restrictness was part of the function type, much like the exception specification in C++, and participated in the type system; it had deduction rules, etc. It was immediately visible (and deducible) which function had what restrictness. This was a nice take on the issue of “it’s C++, but not really”. It was clearly defined what worked and what didn’t in amp-restricted contexts. It was also clearly visible that the STL was off limits, because the STL has blanket permission to implement any of its internals in a DLL, rendering it impossible to generate device code for. (STL functions were undecorated, and that defaulted to restrict(cpu). A planned future version would have introduced restrict(auto) as the default, which corresponds to lazy device codegen based on call sites. A related read is SYCL_EXTERNAL.)
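To illustrate, a small C++AMP-style sketch (C++AMP was a Microsoft extension, so this needs an AMP-capable compiler; the names here are mine):

    #include <amp.h>
    using namespace concurrency;

    // restrict(cpu, amp): usable from host code and amp-restricted contexts.
    int sq(int x) restrict(cpu, amp) { return x * x; }

    // Undecorated, so it defaults to restrict(cpu): host only, no device code.
    int host_only(int x) { return x + 1; }

    void square_all(array_view<int, 1> data) {
        parallel_for_each(data.extent, [=](index<1> i) restrict(amp) {
            data[i] = sq(data[i]);       // OK: sq carries the amp restriction
            // data[i] = host_only(...); // ill-formed in this amp context
        });
    }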

As for pointer values to such functions, an implementation could have a fixed offset separating host and various device target pointers. Let’s say a function of type int(*)(int) sits at 0x00001234 and the implementation knows that functions are separated by 0x0100000 per target ID; then, if the executor of a function is a GPU whose binary is gfx908, and internally that corresponds to target ID 2, it will offload the function at 2 * 0x0100000 + 0x00001234. If everything passed front-end (type?) checks, the runtime can be sure that this function exists.
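Spelled out as toy code (entirely hypothetical; it just restates the arithmetic above, whereas a real implementation would consult proper symbol tables):

    #include <cstdint>

    // Hypothetical fixed stride separating per-target copies of a function.
    constexpr std::uintptr_t kTargetStride = 0x0100000;

    // Host entry at 0x00001234, gfx908 mapped to target ID 2 as above:
    // device entry is 2 * 0x0100000 + 0x00001234 == 0x0201234.
    std::uintptr_t device_entry(std::uintptr_t host_addr, unsigned target_id) {
        return target_id * kTargetStride + host_addr;
    }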

But given how there’s a working implementation running on HIP, I take it this is a solved problem.

I don’t have the bandwidth to attend ISO meetings, and because we all know Twitter’s the best place for detailed and nuanced conversations (/s), the temperature in the room regarding ISO absorbing offloading APIs isn’t great. Without going full Futhark, with an optimizer generating everything including the offload boundary, heterogeneous programming will need explicit naming of memory namespaces. I don’t see that happening anytime soon, but let’s hope efforts like this take us one step closer to users banging on the committee’s door for that to happen.