[RFC] Add Full Support for the SYCL Programming Model

Add full support for the SYCL programming model

We (Intel) propose adding full support for the SYCL programming model to the LLVM compiler infrastructure. This will facilitate collaboration on C++ single-source heterogeneous programming for accelerators (e.g., GPU, FPGA, DSP) from different hardware and software vendors. The SYCL 2020 Specification is available at the Khronos site: SYCL™ 2020 Specification (revision 8).

Our prior RFCs toward this effort can be seen here:

In the last four years, here are some of the major developments in our SYCL compilation efforts:

  • both our design and SYCL have evolved (e.g. the previous RFC supported SYCL 1.2.1)
  • we have multiple years of SYCL implementation experience
  • we have received feedback from the community for both the original RFC as well as patches that we have upstreamed
  • we have real-world user feedback from talking with our customers
  • we have worked with Codeplay in supporting non-Intel hardware

As a result, we now have a more mature set of design and implementation that we can upstream that is more in line with community standards and expectations.

SYCL is an open specification from the Khronos Group and Intel supports a community implementation that fully meets this specification. This would allow for open tooling and be in line with the goals of the Unified Acceleration (UXL) Foundation: Announcing the Unified Acceleration (UXL) Foundation - oneAPI.io

We have a long term commitment to supporting SYCL and intend to use the upstreamed implementation in our compiler products. Alexey Bader is the SYCL code owner and has a long history of work within the LLVM community and the development of the SYCL specification. We have worked with Codeplay to enable non-Intel hardware and hope that a community implementation will further enable SYCL support on a broader range of devices.

Purpose of this RFC

The purpose of this RFC is twofold. Much of the RFC describes how we propose to support SYCL in clang in the near term using our current work. However, we would also like to describe the limitations we have found and outline a direction for how these limitations could be lifted. We are particularly interested in the community’s thoughts on these directions. We are, of course, committed to improving the design as we maintain the SYCL support.

Support for third party host compilers

Like many offload projects in clang, the design for SYCL takes a multi-pass compiler approach. The application’s source file is compiled once with a “device compiler”, which mostly ignores the host code and generates code only for the offload kernels. The source file is then compiled a second time with a “host compiler”, which generates code only for the host parts of the application. Normally, the host and device compilers are both clang, but run with different options. Some users, however, need to use their own host compiler, which might not be clang. Our design currently supports this through a command line option named “-fsycl-host-compiler”.

Our preference is to upstream this support also because we think it is generally useful for people using clang to compile SYCL programs. Other clang developers may also fork clang as a starting point for their own SYCL implementations, and they could also benefit from the common support for a third-party host compiler. Does the community see this as sufficient reason to upstream support that interoperates with a non-clang compiler?

Mechanism for integrating the host and device code

One central problem that must be solved with a multi-pass compiler design is to logically connect the host and device code, which are separately compiled. Our current strategy is to compile the device code first, and have the device compiler generate two header files which are automatically included into the host code. One header (the integration header) is automatically included at the start of the host compilation phase and the other header (the integration footer) is automatically included at the end of the host compilation phase. This design is described in detail in TODO.

This design is well suited for supporting a third-party host compiler because the connection between the device and host code is standard C++ header code, which can be consumed by any conformant C++ host compiler. However, we have observed that relying on standard C++ imposes certain limitations on the input SYCL source code.

One example is the use of the C++ type that is used to “name” a kernel. With the header file approach, these types must be forward declared in the integration header. However, this places a limitation on these types to those that can be legally forward declared at namespace scope. For example, it is not possible to forward declare a type that is defined at block scope or to forward declare a type that is in the “std” namespace. Thus, these types may not be legally used to name kernels in the user’s SYCL source code.

We would like to lift these limitations, at least in the common case where the user uses clang for both the host and device compilation phases. Our thought was to implement a different integration approach for this common case, which might assume that the host compiler understands clang attributes like __builtin_sycl_unique_stable_name and decorates the LLVM IR with some new attributes. Our thinking is very early at this point.

Even if we develop this new approach, though, we think it makes sense to keep the approach based on the integration header because this will continue to be used for a third-party host compiler.

Location for the logic for generating the integration support

Another open question is the location for the logic that generates the integration header (or the new integration information alluded to above). This logic is currently located in the Sema phase of the CFE, and the community has expressed concern about this. We are considering two alternatives. One option is to move the logic to the CodeGen phase of the CFE. Another is to move the logic out of the CFE entirely and create a new LLVM IR pass instead. If we do this, the new IR pass would likely run near the start of the pipeline, before other IR passes can transform the IR that is emitted by the CFE. This option may be attractive especially if we end up supporting two integration methods because we could isolate each method in its own IR pass, and then enable one or the other pass depending on whether the user requests a third-party host compiler.

Dependency on the Khronos LLVM-SPIRV translator

With our current approach, device code corresponding to the offload region in SYCL source file is compiled and linked into LLVM IR. Various backend compilers (e.g., Intel Graphics Compiler), require the device code to be present in SPIR-V IR. Hence, we need a compiler flow to generate SPIR-V IR from LLVM IR during the post-link stage.

This final translation to SPIR-V currently relies on the Khronos SPIRV-LLVM translator, which is a project external to the LLVM compiler infrastructure. As a result, end users must ensure that the translator is available in their environment, so that the clang driver is able to find it during post-link stage of compilation for SYCL. We would like to eliminate this dependency in order to make it easier for people to use clang for SYCL.

A SPIR-V backend is currently under development inside the LLVM compiler infrastructure. Once deployed, this backend can replace the current translator in the post-link stage of SYCL compilation flow.

Topics of interest

In this RFC, we would like to request your feedback for the following items:

12 Likes

Thank you for this RFC and splitting it up into constituent components! While this particular part of the RFC did not garner any comments, there was healthy discussion on the individual component RFCs. That discussion showed that the community is interested in the technology and supports upstreaming efforts in general. I’ll leave comments on the individual RFCs for things to keep in mind when going through those efforts.

There is one part of the omnibus RFC (Run-time Library: SYCL runtime upstreaming) that was just posted today; we should get feedback from the community on that RFC before starting work on that aspect of the project.

What I do wonder about is what is the minimum set of functionality that clang/LLVM would need in order to be able to implement SYCL efficiently as a library? I suspect that a lot of the required functionality for SYCL support could be exposed as generic, programming model-independent primitives.

That would have the advantage that the same functionality could be shared across different models already present in clang, shared with other SYCL implementations, and be otherwise useful to end users.

As developer of AdaptiveCpp, the other LLVM-based SYCL implementation, my concern is that if the upstreamed SYCL support is too integrated within itself and too close to DPC++, it will make it very difficult for us to potentially upstream some of our own features, because interfaces and assumptions will be too different (e.g. we don’t use -fsycl and we don’t need a SYCL-specific clang driver). The more upstream SYCL functionality is expressed in terms of independent primitives, the less of an issue this will be.

As an example, implementing accessors could be done without SYCL-specific compiler support if there is a builtin to introspect the content of a struct and its data layout (which can be useful independently of SYCL). The same holds for inspection of attributes attached to a function, which would be sufficient for most of the SYCL attribute functionality. Builtins to get mangled function names could, at least in the long term, be unified with e.g. __builtin_get_device_side_mangled_name etc.

1 Like

I’m curious about that as well and I agree with your perspective on generic primitives.

I agree. I’m not familiar with much of the inner workings for the oneAPI implementation, but my limited understanding is that it is driven by attributes such as [[clang::sycl_kernel]] that are used in the library implementation. The use of attributes avoids more intrusive integration.

I suspect the best approach for identifying opportunities for generic primitives will come from a collaborative review process. I’ll make sure that you are tagged on any related code review submissions.

If needed, yes. The consensus reached for the “SYCL Host Compiler Integration Header and Footer” RFC involves pursuit of an approach that doesn’t rely on mangled names.

If you have suggestions for other primitive functionality in mind, please share!

Making progress on all the ISO C++ SG7 proposals would make all this easier because some parts of the heavy lifting could be done at the library level just by using some standard or experimental introspection and code generation.
Perhaps there is some possible consolidation with the implementation of Reflection for C++26 and others? For example, while P2996 does not seem to allow introspection of lambda capture with nsdm_description because the layout is not specified by the C++ standard, the real Clang implementation would do it anyway and this can be reused by a SYCL implementation.

That seems possible, yes. I have not been following the reflection work, so I don’t have a good sense of its limitations or what all it could enable.