[RFC] Adding C++ Parallel Algorithm Offload Support To Clang & LLVM

Apologies for the delay in getting to this; please see below for an attempt at addressing the extension criteria (happy to extend). I will answer the naming question more extensively in another post, but the short of it is that the name was chosen based on prior art and on it already being in use - it can definitely be changed.

stdpar as a Clang Extension

Evidence of a significant user community

There is a growing body of literature around the precursor / proprietary alternative to this extension, which has been available since 2019. Selecting some conclusions from that literature helps to outline the benefits of the proposed extension:

Our conclusion is that stdpar can be a good candidate to develop a performance portable and
productive code targeting the Exascale era platform, assuming this approach will be available
on AMD and/or Intel GPUs in the future.
Asahi et al. (2022)

A few hundred lines of code, without hardware-specific knowledge, achieve cluster-level
performance.
Latt (2021)

To continue writing scientific code efficiently with a large and not always professionally trained
user community to run on all hardware architectures, we think we need community solutions for
portability techniques that will allow the coding of an algorithm once, and the ability to execute it
on a variety of hardware products from many vendors.
Bhattacharya et al. (2022)

Beyond performance portability, this study has demonstrated how traditional HPC programming
techniques such index-based traversal are well-supported use cases in ISO C++17. In general,
none of the C++17 implementations impose unreasonable requirements on algorithm use:
captured pointers are allowed.
Based on the three ports, we conclude that only minimal code transformation is required coming
from either a vendor-supported programming model such as CUDA or a portability layer such as
Kokkos.
Lin, McIntosh-Smith, Deakin (2023)

Yet because it utilizes standard C++, it provides a very low entry bar for developers, offering a
simplified path for porting from serial CPU code to GPUs, without the need to explicitly manage
memory transfers. […] All APIs have a learning curve, though std::par is closest to normal C++,
providing an easy transition path to GPU programming for little effort.
Atif et al. (2023)

Currently only a small part of the overall OpenFOAM codebase runs on GPUs (the gradient
evaluation). In the near future, we plan to extend to other routines of immediate interest and
perform scalability tests on multiple nodes. The work caught the attention of OpenCFD, the
company maintaining OpenFOAM codebase, confirming the approach based on standard ISO
C++ parallelism has potential to become mainstream and widely adopted.
Malenza (2022)

In the same vein, some of the issues that affect the precursor alternative, and which would be addressed by this extension, are also made apparent:

Must compile the whole program with nvc++ when offloading for NVIDIA GPU
– To avoid One Definition Rule violations with e.g. std::vector
Andriotis et al. (2021)

Both of these compilers are still rather immature, exhibiting a number of compiler bugs and lack
of build system integration, so performance numbers should be taken lightly.
Bhattacharya et al. (2022)

While, in theory, nvc++ is link compatible with gcc libraries, there are certain limitations. Any
memory that will be offloaded to the device must be allocated in code compiled by nvc++, and
there are issues with some STL containers. The linking of the final executable must also be
performed by nvc++. The compiler is new enough that critical bugs are still being found, though
are often rapidly patched. Furthermore, it is not fully recognized by many build systems, requiring
significant user intervention for more complex installations, especially when some packages require
g++ and others nvc++.
The compilation time with nvc++ tends to be significantly longer than with the other portability
layers.
Atif et al. (2023)

This suggests widespread interest, at least from the scientific community. The conclusions above converge in appreciating the ease of use of the extension, and its role as a very smooth path to using GPUs to accelerate execution, without having to forfeit accrued knowledge or embrace new and unfamiliar idioms. At the same time, existing solutions face challenges that would be addressed by incorporating the extension into a mainstream compiler that is cooperatively developed and composes with standard toolchains / libraries in an organic way.

As a proof of feasibility, we also present an implementation based on HIP. There are two primary reasons for this choice:

  1. HIP is already available in Clang / LLVM, and is by now mature and well integrated with all of its
    components;

  2. HIP is an interface to GPUs that is actively used in production by complex projects, which means that
    the reach of the extension is maximised without forfeiting robustness.

The implementation is “pressure-tested” by way of a series of apps of varying complexity, for which we make the required changes available: https://github.com/ROCmSoftwarePlatform/roc-stdpar/tree/main/data/patches. Some of these apps, in spite of having a large footprint, can be executed, via the proposed extension and the sample implementation, with minimal changes to the build infrastructure, e.g.:

  • stlbm: 32834 lines of C++, 9 lines of CMake changes
  • miniBUDE: 142834 lines of C++, 22 lines of CMake changes

A specific need to reside within the Clang tree

Transparently implementing the feature under discussion, without requiring user intervention, requires adjusting the compilation process:

  • the driver must handle new flags;
  • Sema must treat certain constructs differently;
  • IR transformations must be added depending on said flags; etc.

Trying to handle the above out-of-tree would be infeasible without essentially creating and maintaining a fork. Furthermore, the concerns expressed in the literature, and outlined above, cannot be addressed from levels above the compiler. We posit that the functionality we are proposing is generic and generally beneficial, and we would therefore rather contribute it to the community than make it AMD specific.

A specification

The documentation describing this feature, its implementation, and its characteristics has been put up for review here: ⚙ D155769 [HIP][Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload. Any gaps identified via the review process shall be filled. We also provide a fully functional implementation based on the existing ROCm toolchain. The runtime components used for this are fully open source and available here: GitHub - ROCmSoftwarePlatform/roc-stdpar. The latter are meant as an illustration of how a toolchain might compose with the compiler capability being proposed in order to support the feature. They are not binding and do not form part of what is being put forth via this RFC; no restrictions or requirements are imposed on other toolchains.

Representation within the appropriate governing organization

We do not anticipate that this feature will ever be submitted for C++ standardisation, and we do not intend to push for its standardisation. It is meant to address a gap in the extant and near-future iterations of the C++ Standard, without requiring modifications to said Standard. Furthermore, this is not a language change, but rather a compiler one. A possible, conservative interpretation is that this proposal describes an extension to the HIP language. Even under this conservative interpretation, we will note that there is no extension to the actual shared HIP interface as reflected via
GitHub - ROCm-Developer-Tools/HIP: HIP: C++ Heterogeneous-Compute Interface for Portability.

A long-term support plan

The extension being proposed is going to constitute a key component of AMD’s ROCm Stack and AMD’s heterogeneous compute offerings. It will receive the same high level of support that the HIP language has received since being upstreamed. Obviously, AMD cannot ensure coverage for other toolchains that choose to add support for the proposed extension; therefore the above applies only to the common, generic compiler parts and to the ROCm toolchain implementation of the extension.

A high-quality implementation

We have put the patch set containing the proposed changes up for review here: ⚙ D155769 [HIP][Clang][docs][RFC] Add documentation for C++ Parallel Algorithm Offload. We have made efforts to ensure that the code is aligned with LLVM standards, and that the overall footprint is minimal, i.e. things that could be done without modifying the compiler were kept out of it. As mentioned above, the runtime components are also open source.

A test suite

All of the components we are contributing are covered by unit tests that are also being added to the Clang / LLVM tree, and which can be consulted via the associated patch set. Furthermore, since this builds on mature, existing functionality within Clang / LLVM, it leverages existing test coverage.

Additional notes

It is important to clarify and reiterate that we are not proposing an addition to the C++ standard, and we are not charting a course for the future of heterogeneous computing in C++. The extension we are proposing adds a feature that has existed in a constrained form for some time, and which is in current use. Furthermore, we are not proposing a unifying “One Model To Rule Them All” for heterogeneous computing / offload in Clang / LLVM. This is a feature that composes a generic compiler interface and generic FE and ME processing with target-specific BE and run-time handling. To remove any ambiguity, we will note that we envision that composition between targets / implementations will be handled by the user, in user space, and not automatically by the compiler & linker. Otherwise stated, today’s work-flow, where one compiles for HIP, CUDA, OpenCL or SYCL separately, remains, with stdpar being an extension supported by those toolchains. This is necessary at this point in time, and for the foreseeable future, in order to allow optimum implementations to exist.
