[RFC] SYCL runtime upstreaming

The main RFC for SYCL implementation upstreaming can be found here:

The SYCL runtime provides an implementation of the SYCL 2020 API specification.
It is responsible for managing resources for enqueuing tasks to the offload
device, tracking dependencies between those tasks, and moving data between the
host and devices. The SYCL runtime is device-agnostic and uses Unified Runtime
(https://github.com/oneapi-src/unified-runtime) as an external dependency, which
serves as an interface layer between the SYCL runtime and device-specific
backends. Unified Runtime has several adapters that bind to various backends in
a similar fashion to libomptarget, and the best way to support those backends is
discussed in the recent offload RFC.

The SYCL runtime uses semantic versioning (libsycl.so.major.minor.patch, see
https://semver.org/). Somewhat unusually, the major version of the library is
also appended to its name on Windows (e.g. sycl7.dll) to avoid unexpected usage
of an old library at runtime. The Windows library with the correct version is
chosen by the driver when compiling with “-fsycl”.

This RFC contains a brief overview of the core functionality of the SYCL runtime
and its major components.

(Figure “SYCLHeadersAndLibraries”: high-level overview of SYCL headers and libraries)

SYCL API overview

The SYCL API provides a collection of classes that manage backend resources:
sycl::platform, sycl::device, sycl::context, and sycl::queue. For the most part,
these classes (with the exception of queue) are thin wrappers over their backend
counterparts, with methods for querying their capabilities and other
information. When the first of these objects is created, all available platforms
are queried for their devices, and the appropriate device is chosen by either
the default selector or one specified by the user. The default selector uses an
implementation-defined heuristic to choose among all available devices. In our
implementation, it considers whether there are device binaries compatible with
the device, the type of the device (GPU/CPU/accelerator), and its backend. An
environment variable that restricts the devices and platforms visible to a SYCL
application is also supported, which allows overriding the selector behavior.
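
For illustration, here is a minimal sketch of default versus explicit device
selection using the standard SYCL 2020 selectors (error handling kept to a
minimum):

#include <sycl/sycl.hpp>

#include <iostream>

int main() {
  // Default selector: the runtime picks a device using its heuristic.
  sycl::queue defaultQueue;
  std::cout << "Default device: "
            << defaultQueue.get_device().get_info<sycl::info::device::name>()
            << "\n";

  // Explicit selector: request a GPU; a sycl::exception is thrown if no
  // compatible GPU is available.
  try {
    sycl::queue gpuQueue{sycl::gpu_selector_v};
    std::cout << "GPU device: "
              << gpuQueue.get_device().get_info<sycl::info::device::name>()
              << "\n";
  } catch (const sycl::exception &e) {
    std::cout << "No GPU available: " << e.what() << "\n";
  }
}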

For memory management, SYCL provides sycl::buffer, sycl::unsampled_image,
and sycl::sampled_image. These classes represent memory that can be accessed on any
device or the host using sycl::accessor or sycl::host_accessor. SYCL also
provides a pointer-based Unified Shared Memory API as an alternative to these
memory objects, which requires the user to specify dependencies and data movement
explicitly instead of relying on accessors.

Here’s an example of a small SYCL application that allocates memory on the device,
executes a kernel there, then reads the results on the host.

#include <sycl/sycl.hpp>

#include <cassert>
#include <numeric>
#include <vector>

int main() {
  sycl::range<3> range{8, 8, 8};
  std::vector<int> vec(range.size());
  std::iota(vec.begin(), vec.end(), 0);
  // The buffer takes over the host allocation for its lifetime.
  sycl::buffer<int, 3> b{vec.data(), range};

  int val = 12;
  sycl::queue q;
  sycl::event e = q.submit([&](sycl::handler &cgh) {
    // Requesting access through the handler registers the buffer as a
    // dependency of this command group.
    sycl::accessor acc{b, cgh};
    cgh.parallel_for(range, [=](sycl::id<3> idx) {
      acc[idx] += val;
    });
  });
  // Constructing a host accessor blocks until the kernel above has finished.
  sycl::host_accessor hostAcc{b, sycl::read_only};
  int i = 0;
  for (const int &x : hostAcc)
    assert(x == val + i++);
}

The lambda passed to sycl::queue::submit represents a command group: a task to
be submitted to the queue together with its dependencies. sycl::handler is an
object constructed internally by the SYCL runtime and passed to the command
group function object. The command group handler is used to construct accessors,
which registers the corresponding memory objects as dependencies, and to invoke
a kernel or submit another command, e.g. a memory operation or an asynchronous
host task.

After the kernel is submitted, the application requests access to the buffer on
the host, waiting for the kernel to complete because of its dependency on the
same buffer. Other host-device synchronization points in a SYCL application
include buffer destruction and explicit calls to sycl::queue::wait() or
sycl::event::wait().
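
For comparison, here is a sketch of the same computation using the pointer-based
USM API mentioned earlier. It assumes the chosen device supports USM device
allocations; dependencies and data movement are specified explicitly instead of
being derived from accessors:

#include <sycl/sycl.hpp>

#include <cassert>
#include <numeric>
#include <vector>

int main() {
  constexpr size_t size = 8 * 8 * 8;
  std::vector<int> vec(size);
  std::iota(vec.begin(), vec.end(), 0);

  int val = 12;
  sycl::queue q;
  // Explicit allocation and copies instead of buffers/accessors.
  int *dev = sycl::malloc_device<int>(size, q);
  sycl::event copyIn = q.memcpy(dev, vec.data(), size * sizeof(int));
  sycl::event kernel = q.parallel_for(
      sycl::range<1>{size}, copyIn, [=](sycl::id<1> idx) { dev[idx] += val; });
  q.memcpy(vec.data(), dev, size * sizeof(int), kernel).wait();
  sycl::free(dev, q);

  for (size_t i = 0; i < size; ++i)
    assert(vec[i] == static_cast<int>(i) + val);
}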

Host and device code integration

The lambda passed to the sycl::handler::parallel_for function represents a
device kernel. When a SYCL application starts or a dynamic SYCL library is
loaded, all device images contained in the multi-targeted binary are registered
in the program manager, one of the larger components of the SYCL runtime, by
calling the __sycl_register_lib runtime function, a counterpart to
__tgt_register_lib (see the Clang Offloading Design & Internals documentation,
https://clang.llvm.org/docs/OffloadingDesign.html).
The device image wrappers generated by the SYCL compilation tools also include
information such as device kernel symbols and build options. The integration
header, which is generated during device compilation and included during host
compilation (see
[RFC] SYCL Host Compiler Integration Header and Footer),
helps the runtime map SYCL kernel invocations to their corresponding
symbols in the device code. The program manager then uses information from the
device image wrappers to map the kernel name to a set of device images, chooses
an image compatible with the requested device, passes it to the backend to be
just-in-time compiled if needed, and caches the result for reuse.
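
The registration entry points have roughly the following shape (a hypothetical
sketch for illustration only; the actual descriptor type and its fields are
defined by the SYCL compilation tools and are covered in the compilation RFCs):

// Hypothetical sketch: a wrapper object describing the device images embedded
// in the binary, handed to the runtime from a global constructor.
struct sycl_device_image {
  const char *target;               // target/architecture the image was built for
  const unsigned char *start, *end; // image bytes
  // ... kernel symbol entries, build options, device-library dependencies ...
};

struct sycl_bin_desc {
  sycl_device_image *images;
  unsigned num_images;
};

extern "C" void __sycl_register_lib(sycl_bin_desc *desc);   // called at load time
extern "C" void __sycl_unregister_lib(sycl_bin_desc *desc); // called at unload time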

The integration header also provides information about kernel arguments. It
assumes that the lambda layout is the same for both device and host compilation;
the runtime uses the recorded argument sizes and offsets to extract the argument
values from the lambda object passed to sycl::handler::parallel_for and to set
them when enqueuing the kernel. The integration header also includes information
about argument types, which is needed to handle special cases such as
sycl::accessor objects.
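
As a rough illustration of the kind of information involved (the names and
layout below are hypothetical, not the actual generated header; see the linked
RFC for the real contents):

// Hypothetical sketch of the per-argument metadata the integration header
// makes available to the runtime for each kernel.
enum class kernel_param_kind {
  std_layout, // trivially-copyable capture, set directly from the lambda bytes
  accessor,   // needs special handling: translated into backend memory arguments
  pointer     // USM pointer capture
};

struct kernel_param_desc {
  kernel_param_kind kind;
  int size;   // size of the captured argument in bytes
  int offset; // offset of the capture within the lambda object
};

// For the example kernel above, the header would describe two arguments: the
// accessor `acc` and the int capture `val`, each with its offset within the
// closure, so the runtime can read their values and set the kernel arguments.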

For more details about device code compilation, see

and

Dependency management

Another large internal component of the SYCL runtime is the scheduler. It
maintains a directed acyclic graph of the submitted command groups, managing the
dependencies and data movement that the application code expresses through
accessors. Most of the actual synchronization between tasks is handled by the
underlying device backend, with the scheduler simply enqueuing tasks as they are
submitted and passing along a list of dependencies. There are some exceptions to
this: a SYCL application may create a host accessor or submit an asynchronous
host task. Such cases are handled by the scheduler directly by delaying the
submission of device tasks until their host dependencies are satisfied.
Cross-context dependencies, which can’t be passed to a backend directly, are
handled in a similar fashion by using an internal host task as a proxy
dependency that waits for the real one.
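
As an illustration (a minimal sketch; devPtr is assumed to be a USM device
allocation associated with q), interleaving an asynchronous host task between
two device kernels forces the scheduler to hold back the second kernel until the
host work has finished:

#include <sycl/sycl.hpp>

#include <cstdio>

void example(sycl::queue &q, int *devPtr, size_t n) {
  // Device kernel filling the allocation.
  sycl::event fill = q.parallel_for(
      sycl::range<1>{n}, [=](sycl::id<1> i) { devPtr[i] = static_cast<int>(i[0]); });

  // Asynchronous host task that depends on the kernel above. The backend is
  // unaware of host tasks, so the SYCL scheduler runs it itself once the
  // kernel has completed.
  sycl::event hostWork = q.submit([&](sycl::handler &cgh) {
    cgh.depends_on(fill);
    cgh.host_task([] { std::printf("host task ran after the kernel\n"); });
  });

  // This kernel depends on the host task; the scheduler delays its submission
  // to the backend until the host task has finished.
  q.parallel_for(sycl::range<1>{n}, hostWork,
                 [=](sycl::id<1> i) { devPtr[i] += 1; });
}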

Nodes of the dependency graph along with the resources allocated for them are
cleaned up during buffer destruction and after submitting a command group. This
occurs as soon as they are no longer needed, i.e. when the task has been
enqueued and the node no longer represents a potential implicit dependency for
subsequent command submissions.

For more information about SYCL’s command execution order, see
https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:command-groups-exec-order

Device headers and libraries

The SYCL runtime headers provide declarations of various algorithms and built-in
functions to be used in device code. A large portion of these algorithms and
built-in functions are simple enough to be implemented in the headers (for
example, those lowered directly into SPIR-V built-ins), but some have more
complex implementations (for example, fallback implementations for functionality
that might not be supported natively). The more complex implementations are
available in the form of several pre-built device libraries that are linked with
the user device code either during ahead-of-time compilation or at runtime by
the program manager, using the information about device library dependencies
embedded in the device image wrappers by the SYCL compilation tools.
Some of the device library functionality overlaps with the libc/libm
implementations for GPUs. We are looking for ways to re-use LLVM code for SYCL
and welcome any feedback.
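
As a rough illustration of the split (based on the intel/llvm implementation;
the exact division is an implementation detail and may change during
upstreaming): a simple built-in such as sycl::sqrt can be implemented entirely
in the headers and lowered to a SPIR-V built-in, while an assert() in device
code may be routed to an entry point provided by a pre-built fallback device
library on backends without native support.

#include <sycl/sycl.hpp>

#include <cassert>

void scale(sycl::queue &q, float *data, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    // Typically resolved by the headers, e.g. lowered to a SPIR-V built-in.
    data[i] = sycl::sqrt(data[i]);
    // May rely on a fallback device library when the backend has no native
    // support for device-side assertions.
    assert(data[i] >= 0.0f);
  });
}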

Could you please provide some details on this environment variable such as its name and value format?

Do you have a proposed location for where the source for the runtime library would be added to the llvm-project repo? Perhaps a new top-level libsycl directory to match the other run-time libraries?

Does the library require non-standard C++ extensions beyond the __builtin_sycl_unique_stable_name builtin? (That builtin is being discussed at RFC: SYCL support for unnamed SYCL kernel functions).

As long as you ditch the unified-runtime, put everything on top of offloading, put everything into a new top-level project (libsycl), and support the SYCL runtime on non-clang compilers with CPUs, move as fast as possible.


Sorry for such a delayed response.

The name of the variable implemented in intel/llvm is ONEAPI_DEVICE_SELECTOR (the name might need to be changed once upstreamed). It allows selecting backends and device types with values like “opencl:gpu”; its syntax is described in detail here: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector

Yes, making it a top-level project makes sense to me.

I’m not aware of anything else besides that and __builtin_sycl_unique_stable_id, which is quite similar but provides a name for a global variable; that name is used for specialization constants.


What is the current status of this RFC?

We’ve posted a follow-up with some open questions on runtime upstreaming here: [RFC] SYCL runtime upstreaming questions
It was pointed out there that we should contact the LLVM board about the current external dependencies of the SYCL runtime, which we did this Monday; we are waiting for a response on that.
