[RFC] Implementation of OpenACC 3.3 for Offload in Clang

What is OpenACC?

OpenACC is a standardized, directives-based parallel computing language extension which permits easier targeting of CPU and GPU devices. With OpenACC, programmers provide instructions to the compiler in the form of pragmas (in C and C++) describing how to optimize, parallelize, and offload code. OpenACC was designed from the ground up for heterogeneous computing, with each individual directive targeted at the problems and solutions that best apply to accelerators. OpenACC takes a descriptive approach to parallelization, giving compilers more freedom than other approaches in how to parallelize for a given architecture, while allowing programmers to optimize through progressively more prescriptive clauses.

Other Implementations

There are currently multiple widely available implementations of OpenACC, including one still in progress. First, the NVIDIA HPC C++/Fortran compilers 1 are both extremely mature OpenACC implementations. GCC 2 has support for OpenACC, with experimental support available as of GCC 5 and a mature implementation available as of GCC 9. Additionally, the Flang 3 project has recently started its implementation of OpenACC. Finally, a DoE project known as CLACC 4, a Clang/LLVM-based OpenACC-to-OpenMP translator, has implemented OpenACC, albeit with limited use. While the Cray Fortran compiler also supports OpenACC, support for OpenACC was dropped from their C/C++ compilers when the Clang front end was adopted. However, based on comments during Supercomputing 2022, they would be interested in regaining OpenACC support in these compilers if support were added to Clang.

The proliferation of available implementations shows extensive interest in OpenACC, which I believe makes it a good idea for Clang to implement it as well.

Implementation Strategy

Much of the implementation has already been designed by the Flang community, so we intend to leverage as much of the infrastructure developed there as possible. We are confident that the two compilers can share runtime and code generation facilities, which will greatly simplify both implementations.

However, one consideration is the OpenACC optimization, loop analysis, and scheduling components. Flang is implementing these in MLIR via a special OpenACC MLIR dialect 5. It is our intention to use that as well, so NVIDIA is beginning to participate in the CIR/MLIR effort with a goal of accelerating its implementation and adoption, at least for the purposes of dialects. If this DOES NOT end up working by the time we’re ready for front-end code generation, we have identified, and are prepared to implement, a strategy where we can use LLVM-IR-based metadata to enable translation to the OpenACC MLIR dialect for the purposes of reusing those MLIR-based passes.

Short term, we are going to start the Clang effort by implementing parsing and semantic analysis of the directives, enabled by the -fopenacc flag, chosen for GCC compatibility. We will also implement a temporary flag to override the _OPENACC macro (to be removed once we have a complete implementation), which will permit existing programs to be compiled with Clang to leverage our semantic analysis checking.

We believe this to be the best way to start for a few reasons. First, it permits us to make necessary progress while the Flang design and engineering of the shared components continues, so that we can help guide that effort, and start using the infrastructure and implementation when it is mature. We believe this will shorten the time required to develop OpenACC in Clang.

Second, it permits us to implement the OpenACC semantic analysis rules to the standard, such that Clang can be used to validate existing programs, thus becoming more immediately useful. This is possible because OpenACC directives are treated as advanced hints to the compiler, so ignoring them is a compliant implementation model, though we obviously intend to add the offload analysis as soon as it is available to do so.

We intend to leverage the existing tests from the NVHPC product, and development will follow the LLVM convention for lit tests, as well as use the lit tests written for CLACC. Additional testing/test suites are described below in the “A Test Suite” section.

While this is expected to be a multi-year project, we are excited to start implementing it.

Contribution Requirements: Clang - Get Involved

Evidence of a significant user community

The NVIDIA HPC compilers both have a sizable OpenACC community. Additionally, there is an annual OpenACC Computing Summit 6 that is well attended. CLACC has a number of users thanks to its support for Kokkos. While user numbers for GCC’s OpenACC implementation are unavailable, we believe that there is a significant user base, and that the presence of an upstreamed Clang implementation will only improve that situation.

Artificial intelligence and other parallel computing tasks benefit from offload, which OpenACC provides with a low barrier to entry; we anticipate this will also produce a growing user base.

Specific Need to reside within the Clang Tree

As OpenACC is a pragma-based language that interacts closely with expressions and modifies code generation, it is necessary for it to be a part of Clang, and for its programming model to be a part of LLVM. Additionally, we intend to use the OpenACC MLIR dialect 5 (currently being developed and used by the Flang community), which requires an LLVM-based frontend to generate.

A Specification

OpenACC has been an active standard since the release of the 1.0 specification in 2011 7. Since then, there have been 8 additional specifications released, with the current version being the OpenACC 3.3 specification 8, released roughly a year ago. The committee continues to meet frequently, and intends to continue releasing specifications.

Representation within the appropriate Governing Organization

NVIDIA is active in the OpenACC Technical Committee, employing Jeff Larkin, the Technical Committee Chair, among other participants. Additionally, Duncan Poole and Jack Wells, also NVIDIA employees 9, are the chairman of the OpenACC board of directors and the president of OpenACC, respectively. NVIDIA has been active on the OpenACC Technical Committee since its inception, and intends to continue active participation for the foreseeable future.

Long Term Support Plan

NVIDIA as a company is dedicated to the success of OpenACC, as should be clear from our participation in the OpenACC standardization efforts and our commitment to implementing OpenACC in Flang, and is equally committed to proliferating its use in multiple compilers. We intend to continue development and support of OpenACC in Flang and Clang in perpetuity by funding multiple compiler engineers. We are committed to this support.

High Quality Implementation

While we don’t yet have an implementation, we intend to develop it entirely ‘upstream’ in Clang, where it is subject to extensive review and validation by the code owners and other contributors. Additionally, as the Attributes Code Owner (and primary reviewer as additional contributors start helping), I intend to ensure that every bit of code contributed meets or exceeds the LLVM and Clang coding standards and levels of quality.

A Test Suite

As we are developing in Clang directly, we intend to contribute extensive ‘lit’ tests. Additionally, the CLACC effort has resulted in a significant test suite that we intend to leverage along the way 5. For runtime-based testing, as the product matures, we also intend to develop and contribute runtime tests. There is also a current effort to add the UDel OpenACC V&V test suite 10 to the llvm test-suite 11. Finally, as OpenACC is an extensively used language, we will be leveraging existing open-source applications for runtime correctness and performance validation.


I’ve prepared 3 patches to start the implementation; however, GitHub’s lack of a ‘patch stack’ has resulted in them all being in one pull request. At the conclusion of that pull request, we’ll have -fopenacc, an _OPENACC macro defined to 1, a command line flag to override it for the purposes of validation, and initial parsing support that diagnoses and recovers from all uses of #pragma acc. Please see the PR here: https://github.com/llvm/llvm-project/pull/70234.


Thanks for the RFC, I was expecting this to show up any time now :wink:

I find it interesting that you managed to write this while mentioning OpenMP only once, as part of the CLACC description. I think there are more things to consider w.r.t. usefulness, interplay, sharing, etc. For now, let me be brief and ask the most obvious questions:

High-level: How is this effort different from CLACC, and why shouldn’t we just “upstream CLACC”? Along those lines, and only as far as I remember, CLACC’s problem was that there are few, if any, codes that use OpenACC in C/C++ (note that I am not talking about Fortran here!); has that fundamentally changed? Do we have a list of codes and/or entities that are interested in OpenACC on C/C++?
Impl-level: Flang OpenACC is set up on top of OpenMP offload (soon llvm-project/offload); is that the plan for Clang OpenACC as well? If so, at which stage would it converge? If not, what is the lowering plan?
Unrelated: Does that mean NVIDIA will help (more) to improve our NVPTX backend?

Thanks for the response!

Upstreaming CLACC would require keeping the ‘translate at the AST level to OpenMP’ part, which has a few negatives. First, it leaves quite a bit of performance on the floor; second, it is not acceptable as part of the Clang design: it is critical that our AST continues to represent the source code directly for the purposes of tools. That said, a VAST amount of CLACC will be re-used along the way.

As far as users: OpenACC is well used in C/C++ for the NVHPC compiler, and has quite a few users for the GCC implementation as far as we can tell. As this is an entirely upstream implementation, we don’t think we’ll face the ‘chicken or egg’ problem that CLACC faces (that is: Folks unwilling to use it because it isn’t upstream, and won’t be upstreamed because it doesn’t have user support).

There IS a list of users of OpenACC in C++ that, to my understanding, has been published at the OpenACC conference, but I don’t have access to it right now. Either way, we expect that with NVIDIA’s full backing, more users who are otherwise unable to use NVHPC products will be attracted to the Clang product.

That IS effectively the plan. For this RFC, we did a ‘plan to not plan’ for the lowering, with the intent to have our frontend code-gen lower as closely to Flang’s implementation as possible. As you know, this will end up being set up on top of the OpenMP offload work that you’ve been working on.

That is not a question I’m able to answer as I don’t have authorization to discuss our long-term plans with the NVPTX backend.


That all makes sense to me. I’m hoping to get more testing and other improvements out of this, and as long as we all share the same lowering code path, that should be the case. We’ll also get interoperability and similar fun properties.
(If someone has that list of C++ OpenACC users/entities, I’d be interested in general.)


Can you explain in more detail about the performance loss?

Clacc’s AST is designed so it does represent the OpenACC source code. Can you explain in more detail?

Which parts? From your description above, I thought you were going to reimplement everything starting with parsing and sema.

If NVIDIA can convince the community that users want OpenACC in upstream Clang, then that part of Clacc’s problem is solved.


Hi Joel- I appreciate your responses! I was hoping you’d be able to come comment/participate.

The conversion to an OpenMP AST causes some of the language-specific nuance to be lost, and it means our MLIR optimizations from Flang will not be usable. Our internal analysis shows that there are opportunities in this method that otherwise aren’t available.

The conversion to OpenMP loses source fidelity as far as I can tell, which is not acceptable, and at minimum results in an extra layer of translation that isn’t particularly acceptable to us.

As stated above, we intend to take much of CLACC’s lit testing, and I’m very much basing the parsing/semantic analysis on CLACC’s implementation. As I go further, many of the decisions made in CLACC’s implementation of those components are likely to be reflected here.

NVIDIA’s interest here is to have a Clang implementation that reflects what we’re doing in Flang/using the above RFC, as this is important for our use cases. We’re not likely to invest in the user community for an alternative.


Just so I get this right: The plan is that you will hook up MLIR as a production lowering strategy for upstream Clang (which may or may not contain OpenACC)?

+1, OpenACC nodes should have their own representation in the AST.


It is our intent to ‘end up’ there some day (in the OpenACC dialect), with that then lowering to the existing offload mechanisms, yes. We are also investing in the CIR effort with a hope to accelerate that to be ready when we need it.

If that doesn’t come to fruition in time, we have alternative code generation mechanisms that will continue to leverage the offload project (including a polygeist like solution, or just lowering as OMP does for now).


It doesn’t. Clacc builds an OpenACC AST just as it would if it were not translating to OpenMP. The OpenMP translation is implemented in a separate module and is attached to the OpenACC AST.

Keep in mind that many of Clacc’s tests are designed to check the translation of OpenACC to OpenMP.

Yep! And much of that would not be reusable as-is, and would likely need translation to our implementation.

+1 Thanks Erich for this RFC.

Having the possibility to lower to the OpenACC dialect will be very nice. It will be a good test for our generic design.


This is cool Erich, makes total sense to have this clang upstream.

Sounds like a good incremental approach!

We’re excited with the ClangIR bits of the collaboration, and happy to help with integration efforts, discussions, PRs, etc.


It sounds like an exciting project!

I agree; this is a strong part of OpenACC. It would be valuable if these analyses, particularly loop dependence analysis, were implemented somewhere common in MLIR, not just for the OpenACC dialect. I believe that OpenMP could also benefit from this analysis, even though it is not mandated by the spec.


Thank you for this RFC!

I did have one question related to this bit about testing:

What is the license for the existing tests from the NVHPC product and is it compatible with our licensing? I think the CLACC license is the same as LLVM’s license and so we’d be fine to pull in their tests, but it would be good if @jdenny-ornl could verify that.

The NVHPC tests will be run internally (as their licensing status is unclear), but any issues found by them will be reduced/anonymized internally and then released as lit tests.

And yes, CLACC is under the LLVM License (https://github.com/llvm-doe-org/llvm-project/blob/clacc/main/LICENSE.TXT).

That would be lovely as well! At the moment, the Flang team is running with the design of the passes, so in this RFC, I’m just proposing to use whatever Flang ends up designing. They also have OpenMP code to optimize, so I suspect it’ll end up having some sort of common optimizations/pipeline/etc.

Ah, I took your meaning as these tests would be contributed rather than stay downstream. So long as there is sufficient upstream testing, that’s fine, but we shouldn’t rely heavily on downstream testing for things implemented upstream. (Basically, I expect you’ll have sufficient test coverage for what’s being contributed but if you need help on questions of licensing for existing tests to help provide that coverage, the community has a path to get some help with that.)


I absolutely intend to have extensive coverage in lit tests as I contribute patches. The NVHPC tests are just ‘also being run’, and are of benefit because they are reduced from ‘real world’ code.

The UDel compliance suite WILL be upstreamed as well, so I intend there to be multiple layers of test coverage.


Perfect, thank you!
