[RFC] Adding C++ Parallel Algorithm Offload Support To Clang & LLVM

Apologies for the delay in getting to this; please see below for an attempt at addressing the extension criteria (happy to expand on any of these; I’ll answer the naming question more extensively in another post, but the short of it is that the name was chosen based on prior art and the fact that it is already in use - it can definitely be changed).

stdpar as a Clang Extension

Evidence of a significant user community

There is a growing body of literature around the precursor / proprietary alternative to this extension, which has been available since 2019.

Selecting some conclusions will help to outline the benefits of the proposed extension:

Our conclusion is that stdpar can be a good candidate to develop a performance portable and
productive code targeting the Exascale era platform, assuming this approach will be available
on AMD and/or Intel GPUs in the future.
Asahi et al. (2022)

A few hundred lines of code, without hardware-specific knowledge, achieve cluster-level
performance.
Latt (2021)

To continue writing scientific code efficiently with a large and not always professionally trained
user community to run on all hardware architectures, we think we need community solutions for
portability techniques that will allow the coding of an algorithm once, and the ability to execute it
on a variety of hardware products from many vendors.
Bhattacharya et al. (2022)

Beyond performance portability, this study has demonstrated how traditional HPC programming
techniques such index-based traversal are well-supported use cases in ISO C++17. In general,
none of the C++17 implementations impose unreasonable requirements on algorithm use:
captured pointers are allowed.
Based on the three ports, we conclude that only minimal code transformation is required coming
from either a vendor-supported programming model such as CUDA or a portability layer such as
Kokkos.
Lin, McIntosh-Smith, Deakin (2023)

Yet because it utilizes standard C++, it provides a very low entry bar for developers, offering a
simplified path for porting from serial CPU code to GPUs, without the need to explicitly manage
memory transfers. […] All APIs have a learning curve, though std::par is closest to normal C++,
providing an easy transition path to GPU programming for little effort.
Atif et al. (2023)

Currently only a small part of the overall OpenFOAM codebase runs on GPUs (the gradient
evaluation). In the near future, we plan to extend to other routines of immediate interest and
perform scalability tests on multiple nodes. The work caught the attention of OpenCFD, the
company maintaining OpenFOAM codebase, confirming the approach based on standard ISO
C++ parallelism has potential to become mainstream and widely adopted.
Malenza (2022)

In the same vein, some of the issues that affect the precursor alternative, and which would be addressed by this extension, are also made apparent:

Must compile the whole program with nvc++ when offloading for NVIDIA GPU
– To avoid One Definition Rule violations with e.g. std::vector
Andriotis et al. (2021)

Both of these compilers are still rather immature, exhibiting a number of compiler bugs and lack
of build system integration, so performance numbers should be taken lightly.
Bhattacharya et al. (2022)

While, in theory, nvc++ is link compatible with gcc libraries, there are certain limitations. Any
memory that will be offloaded to the device must be allocated in code compiled by nvc++, and
there are issues with some STL containers. The linking of the final executable must also be
performed by nvc++. The compiler is new enough that critical bugs are still being found, though
are often rapidly patched. Furthermore, it is not fully recognized by many build systems, requiring
significant user intervention for more complex installations, especially when some packages require
g++ and others nvc++.
The compilation time with nvc++ tends to be significantly longer than with the other portability
layers.
Atif et al. (2023)

This suggests widespread interest from the scientific community, at least. Generally, conclusions from the above tend to be convergent in appreciating the ease of use of the extension, and its role as a very smooth path to using GPUs to accelerate execution without having to forfeit accrued knowledge / embrace new & unfamiliar idioms. At the same time, there are some challenges with existing solutions that would be addressed via incorporation of the extension into a mainstream compiler, which is cooperatively developed, and composes with standard toolchains / libraries in an organic way.

As a proof of feasibility, we also present an implementation based on HIP. There are two primary reasons for this choice:

  1. HIP is already available in Clang / LLVM, and is by now mature and well integrated with all of its
    components;

  2. HIP is an interface to GPUs from multiple vendors, and is actively used in production by complex
    projects, which means that the reach of the extension is maximised without forfeiting robustness.

The implementation is “pressure-tested” by way of a series of apps of varying complexity, for which we make the changes available: https://github.com/ROCmSoftwarePlatform/roc-stdpar/tree/main/data/patches. Some of these apps, in spite of having a large footprint, can be executed, via the proposed extension and the sample implementation, with minimal changes to the
build infrastructure e.g.:

  • stlbm: 32834 lines of C++, 9 lines of CMake changes
  • miniBUDE: 142834 lines of C++, 22 lines of CMake changes

A specific need to reside within the Clang tree

Transparently implementing the feature being discussed, without requiring user intervention, requires adjusting the compilation process:

  • the driver must handle new flags;
  • Sema must treat certain constructs differently;
  • IR transformations must be added, gated on said flags; etc.

Trying to handle the above out-of-tree would be infeasible without essentially creating and maintaining a fork. Furthermore, it is not possible to address the concerns expressed in the literature, and outlined above, from levels above the compiler. We posit that the functionality we are proposing is generic and generally beneficial, and we would rather contribute it to the community than keep it AMD specific.

A specification

The documentation describing this feature, its implementation and characteristics has been put up for review here: ⚙ D155769 [HIP][Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload. Any gaps identified via the review process shall be filled. We also provide a fully functional implementation based on the existing ROCm toolchain. The runtime components used for this are fully open sourced and available here: GitHub - ROCmSoftwarePlatform/roc-stdpar. The latter are meant as an illustration of how a toolchain might compose with the compiler capability being proposed in order to support the feature. They are not binding and do not form part of what is being put forth via this RFC; no restrictions or requirements are imposed on other toolchains.

Representation within the appropriate governing organization

We do not anticipate that this feature will ever be submitted for C++ standardisation, and do not intend to push it for standardisation. It is meant to address a gap in the extant and near-future iterations of the C++ Standard, without requiring modifications to said standard. Furthermore, this is not a language change, but rather a compiler one. A possible, conservative interpretation is that this proposal describes an extension to the HIP language. Even if one assumes this conservative interpretation, we will note that there is no extension to the actual shared HIP interface as reflected via
GitHub - ROCm-Developer-Tools/HIP: HIP: C++ Heterogeneous-Compute Interface for Portability.

A long-term support plan

The extension being proposed is going to constitute a key component of AMD’s ROCm Stack and AMD’s heterogeneous compute offerings. It will receive the same high level of support that the HIP language has received since being upstreamed. Obviously, AMD cannot ensure coverage for other toolchains that choose to add support for the proposed extension; therefore, the above applies only to the common, generic compiler parts and the ROCm toolchain implementation of the extension.

A high-quality implementation

We have put the patch set containing the proposed changes up for review here: ⚙ D155769 [HIP][Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload. We have made efforts to ensure that the code is aligned with LLVM standards, and that the overall footprint is minimal i.e. things that could be done without mutating the compiler were kept out. As mentioned above, the runtime components are also open source.

A test suite

All of the components we are contributing are covered by unit tests that are also being added to the Clang / LLVM tree, and which can be consulted via the associated patch set. Furthermore, since this is building on mature, existing functionality within Clang / LLVM, it leverages existing test coverage.

Additional notes

It is important to clarify and reiterate that we are not proposing an addition to the C++ standard, and we are not charting a course for the future of heterogeneous computing in C++. The extension we are proposing adds a feature that has existed in a constrained form for some time, and which is in current use. Furthermore, we are not proposing a unifying “One Model To Rule Them All” for heterogeneous computing / offload in Clang / LLVM. This feature is meant to compose a generic compiler interface and generic FE and ME processing with target-specific BE and runtime handling. To remove any ambiguity, we will note that we envision that composition between targets / implementations will be handled by the user, in user space, and not automatically by the compiler & linker. Otherwise stated, today’s workflow, where one compiles for HIP, CUDA, OpenCL or SYCL separately, remains, with stdpar being an extension supported by those toolchains. This is necessary at this point in time and for the foreseeable future in order to allow optimum implementations to exist.


Thank you all for the interest and lively conversation. A number of very interesting matters have been brought to the fore, and now that the discussion appears to have settled a bit I will try to provide some additional details. Hopefully this will help provide clarity.

What This Proposal Is

We are proposing a non-standard compiler extension that enables offload to target accelerators without changes to C++ source code, via a (set of) compiler flag(s), in the presence of a toolchain that uses the functionality. This is aimed at bringing offload to code authored in iterations of the C++ standard that lack any mechanism in this regard (17, 20, 23, etc.). It is done without adding keywords, public interface functions, or any other language-level alteration. It is strictly opt-in, i.e. the user must set the flag(s). We submit that the compiler mechanisms being proposed will generalise to other languages, such as Fortran.
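
To make the intended user experience concrete, here is a minimal sketch: plain ISO C++17 parallel algorithm use, with offload requested solely through compiler flags. The compile invocation in the comment is illustrative of the proposed -stdpar flag together with an arbitrarily chosen AMD GPU target; it is an assumption for illustration, not a finalised interface.

    // Plain ISO C++17; no GPU-specific constructs in the source.
    // Illustrative invocation (assumed, not normative):
    //   clang++ -std=c++17 -stdpar --offload-arch=gfx90a saxpy.cpp
    #include <algorithm>
    #include <execution>
    #include <vector>

    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
      std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                     y.begin(), [=](float xi, float yi) { return a * xi + yi; });
    }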

What This Proposal Is Not

We are not proposing a C++ language feature or an addition to the C++ Standard. This extension is non-standard, and using it can lead to behaviours that are outside of the purview of said standard. We are also not proposing a replacement for existing offload mechanisms, but rather a sibling that will coexist, and which addresses different needs and abides by different constraints (e.g. no source level changes, no accelerator aware memory allocation in the source language). Finally, at this time, we are not proposing a novel compositional mechanism that supersedes current Clang/LLVM practices when it comes to offload: as is the case today, -stdpar offloading via different toolchains is not expected to coexist within the same compilation. Otherwise stated: today it is not possible to have -x hip and -x cuda passed to the same compilation, for example, and this proposal does not change this behaviour. We posit it is important to maximise toolchains’ freedom of implementation and that the landscape is not yet unified enough to permit pursuing a singular unified implementation layer.

Choice of Naming

The stdpar name(space) was chosen to match pre-existing, similar functionality. It has gained some mnemonic qualities in public conversations and the literature, and thus it appeared preferable to adopt it in order to minimise user confusion / friction. The concerns about its “loaded” significance are noted, and if necessary we can always change it, at the cost of placing some extra cognitive load on users / build infrastructures. For example, it is likely to lead to a proliferation of control variables in CMake or Make.

The Forwarding Header

We are not submitting the header for upstreaming into Clang/LLVM, and it is not part of this proposal per se. It is provided as illustration, and developed in the open for two reasons:

  1. Demonstrate a plausible, fully featured implementation of the runtime components for one of the
    toolchains that are already in upstream (ROCm);
  2. Facilitate community participation and cooperative development, as opposed to AMD merely
    producing an opaque turnkey solution.

The compiler components which are being proposed here do not depend on the forwarding header in general, although the ROCm Toolchain chooses to use this mechanism. Other toolchains can choose completely different approaches, which may or may not rely on a header. The header is authored with a focus on readability, and, as a consequence, the code is neither optimised nor hardened. The choice of using overloads in the std namespace had two motivations:

  1. There is historical precedent, with all offload toolchains (CUDA, HIP, OMP) in upstream already
    doing it to e.g. add device overloads for math functions; this is / has been acceptable as selecting
    offload via any of those mechanisms implies opting into a non-standard extension to the C++
    language, as is the case with what we are proposing;
  2. It allowed us to reduce complexity in the front-end, whilst enabling rapid iteration on the
    implementation / easier debugging.
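
To make the overload mechanism concrete, here is a minimal, illustrative sketch of what such a forwarding overload can look like; it is not the actual roc-stdpar header, and the Thrust-style backend (e.g. rocThrust) is simply a plausible target chosen for this example. As discussed later in the thread, injecting declarations into namespace std is itself non-standard, and is only tolerable because the user has already opted into a non-standard extension.

    // Illustrative only: a concrete-policy overload in namespace std is preferred by
    // partial ordering over the standard library's generic ExecutionPolicy template,
    // and forwards the call to a Thrust-style device backend.
    #include <execution>
    #include <utility>

    #include <thrust/execution_policy.h>
    #include <thrust/for_each.h>

    namespace std {  // non-standard: adding declarations to std is normally forbidden
    template <typename I, typename F>
    inline void for_each(const execution::parallel_unsequenced_policy&, I first,
                         I last, F f) {
      ::thrust::for_each(::thrust::device, first, last, ::std::move(f));
    }
    } // namespace std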

libc++ Interactions

The extension we are proposing is dependent on tidy interactions and composition with the C++ standard library. As such, it is necessary to ensure that stdpar works with any and all standard library implementations (e.g. libstdc++, libc++, and the Microsoft STL).

The only reliably available customisation point appears to be the standard-defined and mandated interface, with standard library implementers differing in their choice of implementation. For the ROCm toolchain we have chosen to rely only on that standard interface and, as described above, have preferred to avoid injecting complexity into the compiler front-end, at the cost of a non-standard runtime component implementation. This is a choice made for AMD’s demonstrative implementation; other toolchains / implementers are not constrained in their choices - it is perfectly valid for another implementer to choose either to pursue upstreaming their runtime implementation into libc++ or to pursue tight coupling with libc++'s customisation points.

Having said that, the feedback and concerns expressed regarding the forwarding header, such as its lack of uglification and its reliance on ADL, are absolutely valid, and would have to be addressed before the header is incorporated into the ROCm stack itself. To this end, @philnik’s insight as an experienced standard library developer would be invaluable, and we’d be extremely appreciative if it were possible to create issues on the roc-stdpar repository, so that we can improve.

libompx

This looks like an amazing project, and one that will benefit from the extension that is being proposed. There is no reason it cannot be the runtime component of e.g. a future -stdpar implementation that is target agnostic. However, as stated above, at this point in time we are not proposing any particular implementation strategy and we are not mandating a choice of toolchain / offload mechanism. At the same time, this proposal neither specifies nor assumes any particular attitude towards offload: a toolchain’s implementation can choose to always offload or to offload algorithmically / opportunistically. Similarly, it can choose to provide various types of hints such as prefetching, it can choose whatever accelerator-aware allocation function it deems fit for interposition mode, etc. All of these are implementation details that go beyond the scope of this proposal, which limits itself to a fairly simple AS-IF view: the expected side-effects of an algorithm’s application manifest as if offload hadn’t occurred.

The Two Main (Or Most Interesting) Problems

Quoting @jdoerfert:

I fail to see how this tackles the two main (or most interesting) problems with “magically offload this”:
a) Where is my data? Which implies if offloading is worth, movement is required, etc. (I understand it
works with unified shared memory, but still…)
b) Can I execute the functor on the device/host or not?

These are indeed the main, most interesting problems for traditional offload / GPU compute models, and they all try to address them in fairly similar ways (__device__, restrict(amp), #pragma omp declare target, hipMalloc, #pragma omp target data use_device_ptr, etc.). However, we should note that these are questions that the C and C++ (and, indeed, most mainstream) languages do not ask. What we are proposing is intended to work with a mainstream language (C++) in its existing form, without any language-level or machine-model user-visible additions. Therefore, we humbly submit the following:

  1. Data placement is not a concept that is expressible in standard C or C++, as both have a flat, uniform
    view of memory. Thus, as long as one forms a valid address (pointer), one can legally use said
    pointer in any context where it is usable, until it is invalidated. How this is implemented is up to the
    toolchain and the runtime components it uses, and is not something that we propose should be
    handled by the compiler. We document two modes of execution: one based on HMM, i.e. transparent
    on-demand paging and direct use of any and all allocations, and a fallback mode called
    Interposition, which allows the runtime to interpose all heap/free-store allocations to make them
    accelerator accessible via some implementation-defined mechanism (a conceptual sketch follows
    this list). Our demonstrative implementation illustrates both, but other toolchains might make
    different choices. Interposition mode is, as stated, a fallback, and does not fully model the C++ view
    of the world;
  2. C++ is not a linguistically segregated / partitioned language at this time, i.e. all valid C++ code
    should be compilable and executable on any and all processors that can be targeted by a C++
    toolchain, for which a suitable run time environment exists. Thus, from the user’s perspective, any
    and all callables can and should execute, and given that the extension being proposed does not
    modify the linguistic level, it is necessary to adopt that perspective. In practice, this translates into
    optimistically emitting all code. There are limitations to this optimistic approach, and they are
    documented, but they appear to be acceptable in practice given the precursor feature’s uptake (as
    mentioned in the extension specification shared in a prior post above).
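
As a purely conceptual illustration of the Interposition fallback mentioned in point 1 (not the mechanism roc-stdpar actually ships; symbol choices and error handling here are assumptions), one can think of it as routing free-store allocations through an accelerator-aware allocator, e.g. HIP managed memory, so that pointers captured by offloaded algorithms remain dereferenceable on the device:

    // Conceptual sketch only: replace the global allocation functions so heap memory
    // is host- and device-accessible. Array and sized forms omitted for brevity.
    #include <cstddef>
    #include <new>

    #include <hip/hip_runtime.h>

    void* operator new(std::size_t size) {
      void* p = nullptr;
      if (hipMallocManaged(&p, size ? size : 1) != hipSuccess || p == nullptr)
        throw std::bad_alloc{};
      return p;  // accessible from both host and device
    }

    void operator delete(void* p) noexcept {
      if (p)
        (void)hipFree(p);
    }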

Those reasons, and many more, are true for OpenMP too.

While your HIP is portable argument is nice, the patch states:

At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,

which is far less portable.

I agree that users want C++ parallelism. But users also want performance, portability, and interoperability with what they have. I fear this approach cuts too many corners to provide what users really want and will end up useable for a select few only.

Take performance: Users eventually need control over data placement. If they would not, everyone would use USM right now. People are not (for other reasons as well).

Take portability: Users will require an implementation for more than just AMD GPUs; they also need this to interoperate nicely with other compilers. Finally, the requirements of this approach are simply too strict for practical applications; USM, kernel version, all of this precludes most HPC users, which is what this is supposed to target, for at least another 1-5 years. So they would need to write an alternative impl and carry both around if they really wanted to use this.

Interoperability issues are discussed below.


Now, some of the actual problems:

This is very problematic. The former breaks code (always, eventually), the latter is known not to work well in practice (O0, non default pipelines, intermediate IR states, …).

This means, we have switches for things that do nothing in upstream, correct?

This is not only an apples vs oranges comparison but also doesn’t make much sense. Nobody opts into non-standard extensions of C++ because they call an overloaded math function in the offloaded toolchain. On the contrary, they want standard C++ available and we provide a subset through these overloads. That said, these overloads are device only. Nothing we have right now implicitly crosses the host-device boundary, so “historical precedent” is not available. Finally, overload headers are a means to an end. We are, driven by AMD, developing the right solution now since the overlay headers are problematic (see GPU libc). The lesson here, if any, is that we should do the right thing for C++ parallel algorithms from the beginning and provide a library solution.

Correct. However, you can link together device files targeting the same arch that come from HIP/CUDA/OpenMP and run LTO on them. The proposed model can break this because of double definition errors, among others. Given that AMD effectively requires this linking step, the approach proposed here can effectively prevent the use of stdpar and any offloading mechanism in the same application.

This reads like the plan is to “compile everything”, if so, please elaborate on the problems that are implied by compiling for the device, including: exceptions, assembly, differences in macros, redefinition of functions, missing global symbols (global variables, external functions, …), thread local, unsupported types/operations, … compile time.


This proposal is a new language. It’s a C++ dialect, possibly a subdialect of HIP as presently implemented. We can call it an extension instead, but it changes the meaning of tokens in source files to do something different to C++, so that’s a different language. Calling it stdpar seems suspect in that light, as does using -stdpar to opt into pieces of the hip driver.

Could we go with clang -x hip -hipstdpar as the interface? It leaves the door open to drop the hip names later as alternative implementations come online, where cuda for testing against the nvcc implementation seems likely to be useful.

If it doesn’t enable a load of hip stuff (kernels, math overloads etc) then it could be -hipstdpar or -stdpar=hip or similar, but if it does switch on all the hip front end stuff it’s far more explicit to require the -x hip control.

It’s not obvious to me that openmp would like magic offloading of things in algorithm but maybe it would. I’m not aware of anything like the thrust library for openmp so that seems like a lot of extra work. Something to do later perhaps.


From https://discourse.llvm.org/t/rfc-adding-c-parallel-algorithm-offload-support-to-clang-llvm/72159/13:

The current status can be found here: GitHub - markdewing/libompx at add_catch

Hi all,

I am lead developer of oneDPL.

Thanks for that proposal.

At Intel we think this is a good motion to create a community-available and community-driven solution for standard parallelism on GPUs/accelerators.

What we can see in the different vendor solutions is that the idea of C++ Standard algorithms offloading is the same, but the implementation strategies/techniques differ, as do the trade-offs.

Eventually, at Intel we think there are several important design goals we would like to achieve for the eventual solution. Most of them have already been touched on, one way or another, in this thread.

  1. It should work with any platform C++ standard library (GNU, LLVM, Microsoft). It can be done in different ways, but the point is that it provides interoperability with existing software.
    • From what we see, the AMD implementation can already do that. Yes, adding the overloads to the std:: namespace is hacky and fragile, but it might be good enough as a short-term solution.
    • As @philnik proposes, it can be done via customization point mechanisms well defined by each standard library implementation (LLVM/GNU/Microsoft). If we want to go this way we need
      to work with the vendors to make it possible.
  2. It should be able to compile SYCL/CUDA/HIP/etc. code based on the user choice.
  3. It should allow redirecting C++ parallel algorithms to different heterogeneous libraries (Thrust, oneDPL, rocThrust, Kokkos, etc.). This is important because different algorithm implementations might be more or less suitable for enabling specific hardware. For example, oneDPL and Thrust have different “algorithm graphs” when one algorithm is expressed through another. To show the difference, oneDPL’s copy_if is expressed via scan-based algorithms while Nvidia has copy_if as a separate “primitive” (a small, illustrative sketch of the scan-based formulation follows this list). Sure thing, we should support rocThrust for HIP, oneDPL for SYCL, Thrust for CUDA, etc. But to me it should not be a 1-to-1 mapping. It might be oneDPL for CUDA devices, if users want that.
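
To make the “algorithm graph” point concrete, here is a small, purely illustrative (and deliberately serial) sketch of how copy_if can be decomposed into a scan plus a scatter; it is not oneDPL’s or Thrust’s actual implementation, and random-access iterators are assumed for brevity.

    // 1) flag elements that satisfy the predicate, 2) exclusive-scan the flags to get
    // output offsets, 3) scatter the selected elements to those offsets.
    #include <cstddef>
    #include <iterator>
    #include <numeric>
    #include <vector>

    template <typename RandIt, typename RandOutIt, typename Pred>
    RandOutIt scan_based_copy_if(RandIt first, RandIt last, RandOutIt out, Pred pred) {
      const auto n = static_cast<std::size_t>(std::distance(first, last));
      std::vector<int> flags(n), positions(n);
      for (std::size_t i = 0; i < n; ++i)
        flags[i] = pred(first[i]) ? 1 : 0;
      std::exclusive_scan(flags.begin(), flags.end(), positions.begin(), 0);
      std::size_t count = 0;
      for (std::size_t i = 0; i < n; ++i)
        if (flags[i]) {
          out[positions[i]] = first[i];
          ++count;
        }
      return out + count;
    }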

I think the Parallel STL upstream that lives in LLVM might be a good place for implementing the prototype once we agree on the design.
It would allow us to prove the feasibility of the desired design without affecting standard library implementations at the prototyping stage.
We could use the Parallel STL tag-based customization points that we previously agreed on in discussions with GNU, LLVM, and Nvidia. Further, this solution might be adopted by at least GNU and oneDPL, which are Parallel STL based, and, as far as I remember, the LLVM libc++ developers also had a plan to sync with Parallel STL from time to time.


Thank you everyone for the replies and the insight, it’s highly appreciated. Putting together what @rjmccall, @JonChesterfield, @jdoerfert and @rarutyun have said, it seems that the following would constitute a preferable course of action:

  1. This proposal, as embodied in this RFC, should be a HIP extension exclusively:
    • -stdpar and friends become -hipstdpar, freeing up the general name for future solutions;
    • documentation is reworded to clearly and unambiguously signal it’s an extension only available for HIP;
    • this unambiguously places the support burden on HIP and avoids any mishaps, such as the ones @philnik was worried about.
  2. Separately (concurrently), an effort, possibly under the [ParallelSTL] tag, should take place, pursuing a solution that is more generic, as suggested / requested by @rarutyun & @jdoerfert:
    • this should be built around the customisation points that various standard library implementations expose;
    • should allow for flexible selection of an algorithm implementation orthogonally to the accelerator (use oneDPL on NVidia, for example);
    • might be library only, which libompx suggests might be possible, and @jdoerfert indicates as preferable.

(1) would allow HIP users to start experiencing the feature right away, without causing collateral damage. (2) would lead to a better overall solution, but is a longer-term effort. Once (2) materialises, users wouldn’t need to alter their source to transition, just their compilation flow. Only when (2) materialises should prime naming real estate prefixed with -std become available. We can consider uplifting components from the HIP extension’s implementation in the future, once the parSTL one comes along. I do hope I’ve not misrepresented anybody’s position here; please correct me if I did. Thank you!

Can you speak to what would be required to bring up an implementation of this for non-AMD GPUs? Where it is on the spectrum between “implement a few straightforward intrinsics in a backend” and “reimplement essentially all of the HIP infrastructure below the frontend”?

I most definitely can, thank you for the question! Let me prefix things with the fact that this is not AMD’s answer / stance, or any sort of official statement, but rather my view on the topic. AMD cannot commit to implementing support for toolchains other than HIP, at this point in time, and it would be up to the owners / maintainers of said toolchains to act on this proposal.

At a glance, generalisation to other toolchains would require a few steps:

  1. A prerequisite is the existence of an algorithm library that has implementations for the std algorithms (e.g. rocThrust, Thrust, or oneDPL);
  2. The currently proposed implementation is set up to forward to HIP; the straightforward way to generalise it is to change -stdpar into a flag that takes a value, so that the user can say -stdpar=hip, -stdpar=cuda, -stdpar=sycl, etc.; where now we use the existence of the flag to set the input to HIP, we’d use the value to pick the type, but we’d still be enabling LangOpts.StdPar etc.; different offload toolchains would be mutually exclusive, as is the case in upstream today
  3. The toolchains would have to deal with -stdpar in whatever way fits them best:
    • the CUDA & SYCL drivers can do exactly what the HIP / ROCm driver does i.e. implicitly include the algorithm library and the forwarding header
    • the CUDA FE can directly re-use the relaxation we do for HIP (in practice this is already shared), whilst SYCL wouldn’t need to do anything special as a consequence of -stdpar being enabled since it’s not explicitly partitioned between host & device
    • the CUDA ME should directly re-use the accelerator code selection pass (should be run preferably post-LTO / BC linking, as we do on AMDGPU), SYCL can, unless there’s already similar capability in upstream (I am assuming that for SYCL one has to deal with parallel_for entry-points similarly, but I couldn’t find a pass doing that)

Offload backends that are already in upstream wouldn’t have to change. By the point code reaches the BE it’s exactly what you’d expect to see from a standard HIP/CUDA/SYCL offload codebase. Toolchains would need to implement their own forwarding header (a fairly mechanical process), and decide on how/if they want to offer interposition; if they choose to they’d need to provide definitions for the accelerator aware replacements the pass expects - this is all in the runtime layer, not in the compiler.

I’m not mentioning OMP because @JonChesterfield or @jhuber6 would be better sources for what OMP might do and because there’s no algorithms library for OMP at the moment (libompx will change that when it lands, it appears).

Please let me know if this is more or less what you were wondering about, or if I can provide any additional clarification.

P.S.: HIP can target more than AMD GPUs (NV, Intel), but AMD cannot bring those targets up, which is why support is constrained to AMD GPUs as that is something that AMD can validate and verify.


The point of libompx, as presented above, is not “something with/about OpenMP offload”. That is why I explicitly mentioned the name can be changed. What the project shows is how we should make all of this available to the user. Not with driver flags and mandatory middle end passes that change the default C++ behavior in ways nobody will grasp. But instead, provide a (mostly) library solution that lives in LLVM, likely as part of libc++, and which directs requests conformant with the standard. It will target the GPU vendor libraries when possible and appropriate, otherwise the used host C++ stdlib.

As an extension to my answer, as well as further elaboration on point 1 “Evidence of a significant user community” from “Contributing Extensions to Clang”, please consider the following:

All of the primary GPU ecosystems / toolchains (CUDA, HIP, SYCL) either already have this capability or are adding it (we will add it to the AMD HIP toolchain). I will also observe that we have all converged on a pretty similar solution. Thus, to @jdoerfert’s point that we should absolutely not do this but rather go with a library solution, I will humbly disagree. A library solution cannot provide the experience / interface that is being sought here, which is already being offered by other implementations, even though it has other benefits and advantages. I submit that we should have both some form of the -stdpar extension (be it the generic one, or the HIP-only one as a safe, conservative start) and a library solution (possibly / probably nested under libc++); these aren’t perfect substitutes, but rather different points in the offload continuum, and they present the user with different questions / tradeoffs.

Could you actually list anything specific?
The interface looks the same for both, so what is the difference?

Similar, maybe, but your solution is arguably worse / broken, compared to NVIDIA’s.
The idea of “compile all for the GPU” is one example. NVIDIA says:

Function pointers can’t be passed to Parallel Algorithms to be run on the GPU, and functions may not be called through a function pointer within GPU code.

Which is sound, albeit not satisfactory.

Could you explain how the interface is the same? Upstream, you say that:

An example would look like this:
ompx::device::sort_by_key(keys_begin, keys_begin + N, values);

Which looks like you expect the user to explicitly opt into calling something from the library. Perhaps I misunderstood your example / hypothesis, case in which I apologise.

Ah, it appears that you have identified a gap in documentation, which should be addressed, thank you ever so much! To clarify, exactly the same restrictions apply around pointers to function and access to the full call graph, so it’s encouraging that you assess them as being “sound”. I’m not sure what you mean by “compile all for the GPU”, possibly another misunderstanding due to unfortunate wording on my part - I do apologise again!

Okay, so let me try to understand the library controversy here. C++ officially merged the Parallelism TS in 2016, which means that there are now algorithms in the standard library that are parameterized by execution policy, some of which permit parallel execution and even offloading of the algorithms. libc++ apparently provides customization points which they say should be sufficient to enable this kind of offloading. However, AMD is concerned about relying on those customization points because they need to support other STL implementations, so instead AMD would like the compiler to inject headers that overload these algorithms. libc++ is then concerned about the likely maintenance burden of having these overloads in their namespace.

Does AMD have concerns about the customization points offered by libc++ beyond their need to support other STLs? Is it reasonably feasible to use those customization points when using libc++ (or other STLs that offer similar features) and only rely on overload injection on STLs that don’t cooperate?

There was also a claim up-thread that this proposal introduces a not-entirely-compatible dialect of C++. Is that just because the source has to be processed as HIP in order to compile for offloading, or is there something more to the language changes required? I don’t remember what all changes HIP introduces, but if this is isolated to the offloading compile, that doesn’t seem particularly problematic (since it is, after all, something the user opted into). Or do the injected headers do something ABI-affecting?

As far as Clang policy goes, my initial reaction is that the compiler has an obligation to not break our entente with the standard library and other system headers. Adding overloads that are explicitly meant to be preferred over the main STL implementation for conformant calls into the STL is a very invasive step. If this project is proposing to do that, I think at a minimum you need to explicitly take responsibility for any ongoing maintenance burdens: your project documentation needs to clearly and prominently state that you rely on invasive changes to the STL, that you have only tested with specific releases of specific STLs, and that any and all compatibility problems with other STL releases are always your responsibility and should be filed with you rather than with the STL vendor. But it would be much better if you can avoid that as much as possible.

Thank you for the very good question! As a prefix, I will suggest that execution policies, and the standard as is, do not actually have any provisions for offload. What we are doing (all of us in the -stdpar boat) is adding a non-standard mechanism that exploits the contract that execution policies establish in terms of execution sequencing. As currently written, the standard cannot really accommodate offload, as it assumes that the machine is “uniform”, for lack of a better word. Future versions of the C++ standard will address this via the std::execution proposal, and, specifically for the case of algorithms, via compositional mechanisms that allow placing an execution on a scheduler. This is forward-looking and will serve users that are on C++26+, and would require source-level changes and toolchain upgrades, but will be the “standard” way of solving this conundrum going forward. The functionality being proposed works within the constraints of the present, and is aimed at users that would strongly prefer not to change their source code / purely “standard” C++ surface-level experience.

Regarding AMD’s concerns, I assume the customisation mechanisms are of two kinds here (if I am omitting something, I apologise in advance): either custom execution policies, or internal libc++ implementation details which allow targeting algorithm execution to different backends. The former has the problem that it would require the user to change their source to type in a different execution policy, which makes the user experience slightly inferior to “I set a compiler flag, and then enjoy myself” (see the contrast sketched below). The latter is a possible challenge because it forces the user to change their standard library provider, which is generally expensive and disruptive for the user. It would also be limited to future versions of libc++, so it would not work with extant versions that already expose the parallel algorithms interface. However, this is not to say that a library solution at the libc++ level is not desirable - it absolutely is - but it would be complementary, and would provide users that are willing to rewrite their source and update their standard library with greater control.
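
To illustrate the difference in user experience between the two routes (the vendor policy name below is hypothetical and used only for contrast), the flag-driven approach leaves the call site untouched, whereas a custom policy requires editing every call site:

    #include <algorithm>
    #include <execution>
    #include <vector>

    void negate_all(std::vector<int>& v) {
      // Flag-driven approach proposed here: the source stays standard C++17.
      std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                    [](int& x) { x = -x; });

      // Custom-policy approach (hypothetical policy name, shown for contrast only):
      // std::for_each(vendor::execution::par_offload, v.begin(), v.end(),
      //               [](int& x) { x = -x; });
    }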

Regarding the not-entirely-compatible dialect aspect, compiling as HIP is indeed the main problem / source of incompatibility, since HIP itself as a language is an incompatible dialect of C++ (at best). The header itself should not mess with the ABI, and, depending on how you read the standard:

  • pre C++20 It is allowed to add template specializations for any standard library function template to the namespace std only if the declaration depends on at least one program-defined type and the specialization satisfies all requirements for the original template, except where such specializations are prohibited.;
  • post C++20 It is undefined behavior to declare a full specialization of any standard library function template.

and consider that we’re not adding full specialisations, you could possibly (tenuously) argue that the overloading is legal. However, compiling as HIP involves some other unambiguously non-standard changes, such as adding __device__ overloads into std for e.g. math functions, which is a double whammy both because the standard does not give any consideration to heterogeneity (as it stands), and because it explicitly bans it: It is undefined behavior to add declarations or definitions to namespace std or to any namespace nested within std. Hence, I would say that this is unambiguously a non-standard extension, and opting in implies opting into non-standard C++, even though the user-visible surface area remains unchanged.

This is more than fair, thank you for making the point. I should probably have clarified that this is the intention / AMD’s stance - we are not looking to offload the maintenance burden for -stdpar uses on AMD HW via HIP to other actors / components. I will update the documentation to unambiguously clarify the points you identified, thank you. At the same time, this seems like a pretty compelling reason to go with -hipstdpar rather than -stdpar, as that would make things even less ambiguous.

As for avoiding unpleasantness around redirecting calls, I agree that this is hardly ideal. Unfortunately, it’s not clear how to cater to the user’s desire to just type e.g. std::for_each(std::execution::par_unseq, f, l, [](auto& x) { /* big lambda */ }), and get offload from that without any other change except a (set of) compiler flag(s), without subverting their expectations in one way or another, in extant C++ contexts. We could, of course, hoist the forwarding into the front-end, but that would just make things opaque whilst retaining the invasiveness. Other alternatives would require some amount of user buy-in (use a different execution policy, use a different standard library). Which is why I submitted above that it is important to have both classes of capability, to serve the needs of all users. In general, assuming that at some point in the future some other toolchain decides to adopt this, they do not need to use overloading / interposition. They’d still need to redirect conformant calls into the STL to conformant calls into some other library, at least for some implementations of the standard library.

I followed this up with:

So, why is ompx::device::sort_by_key(keys_begin, keys_end, values) conceptually different than std::sort_by_key(std::unseq, keys_begin, keys_end, values)? I don’t see how it would be different. I mean, we all provide an API, nothing I said anywhere speaks against making it look like the std par API, right?

Do you have anything that is conceptually worse in the library solution, compared to the header solution (which is the conceptual difference between what you propose and what I propose)?

What I mean is that the following statement of yours sounds like you want to translate all CPU functions to device IR and then prune what you “don’t need”. Can you elaborate on what “optimistically emitting all code” means in this context?

With a bit more reflection on the above, and the fact that there is no interest from other toolchains, there appear to be no compelling reasons to pursue a generic, disruptive solution at this point. Since we (AMD) know that this is an important feature for HIP users, we will add it to HIP and only to HIP, only for AMDGPU targets, which means that any and all maintenance burdens rest with us. This ensures that no fallout can touch standard library providers. I shall re-tag the reviews to reflect their HIP-only nature, re-organise the code, where applicable, to nest it under the AMDGPU target / the ROCm toolchain, and rename the flags, warnings & variables. As for this RFC, we could repurpose it to discuss library solutions and their applicability, or leave it as is for future reference should we want to revisit the topic (for example, if another toolchain becomes interested).

Overall, this has been extremely useful in pointing out challenges and issues associated with generalisation. Juxtaposing these with hindsight makes it clear that it was quite ambitious to pursue a generic solution from the get-go, so I would like to apologise for that, and thank everyone for the comments and insight.