LLVM has a growing community around, and interest in, offloading. People target other CPUs on the system, GPUs, AI accelerators, FPGAs, remote threads or machines, and probably more things. The base principles are always the same, yet we have only loosely connected approaches to tackle this. Everyone has to write, maintain, and distribute their own flavor of an offload runtime and the connected components. This is certainly not the situation we want as a community.
The idea is to provide a space for all of us to work together on the offloading infrastructure. Overall, the goal is a unified user experience, less and better code, interoperability between offloading models, and portability for the models that are currently not as portable as we would like them to be. I believe a subproject is adequate for this, given the wide scope and the way it couples with other (sub)projects.
To bootstrap a generic LLVM offload, I suggest renaming and moving libomptarget to llvm-project/offload. This gives us an initial offloading backend that has been actively developed upstream for years and is used by companies like AMD and Intel. It is already only loosely coupled to the OpenMP host runtime, and it has never required users to start with OpenMP.
It contains three “kinds” of runtimes, tests, and some utilities. The runtimes are:
- libomptarget.so, a host runtime that orchestrates target-independent offloading tasks: reading the images, registering and launching kernels (via the plugins below), reference counting for "mapped" memory (optional), and more. This is the only "legacy" code we still have. As part of an upcoming rewrite, we will make the names in the library and the user-facing API more generic, and design it so that non-OpenMP users can even more easily bypass OpenMP features (like reference counting). The subclass approach (see below) could also be used to specialize generic offloading for each "model", e.g., OpenMP, CUDA, HIP, etc., while sharing most of the code.
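To make the "reference counting for mapped memory" concrete, here is a minimal sketch of such a host-side mapping table. All names (MappingTable, map, unmap) are illustrative, not the real libomptarget API; the point is only that repeated maps of the same host pointer reuse one device allocation and the last unmap releases it.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical reference-counted host-to-device mapping table.
struct MappingTable {
  struct Entry {
    uintptr_t DevicePtr;
    unsigned RefCount;
  };
  std::map<uintptr_t, Entry> Entries;

  // Map host memory; mapping an already-mapped pointer only bumps the
  // reference count and reuses the existing device allocation.
  uintptr_t map(uintptr_t HostPtr, uintptr_t DevicePtr) {
    auto It = Entries.try_emplace(HostPtr, Entry{DevicePtr, 0u}).first;
    ++It->second.RefCount;
    return It->second.DevicePtr;
  }

  // Unmap host memory; the device copy is released only when the count
  // drops to zero. Returns true if the entry was actually removed.
  bool unmap(uintptr_t HostPtr) {
    auto It = Entries.find(HostPtr);
    assert(It != Entries.end() && "unmapping memory that was never mapped");
    if (--It->second.RefCount == 0) {
      Entries.erase(It);
      return true;
    }
    return false;
  }
};
```

This is exactly the kind of OpenMP-motivated policy a non-OpenMP client (say, a lowered CUDA program) should be able to skip entirely.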
- Plugins, e.g., libomptarget.rtl.amdgpu.so, contain the code to talk to "the hardware", for some notion of it. This can be a GPU (via ROCm or CUDA), a remote system (via gRPC or MPI), or the CPU itself. The (nextgen-)plugins share most of their code, including 95% of the JIT logic and a device memory manager; only the target-dependent parts are implemented via subclasses. This ensures a consistent user experience and helps to maintain feature parity across targets.
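The subclass layering can be sketched as follows. The class and method names here are modeled loosely after the upstream plugin interface but are not the real signatures: a generic base class implements the shared checks and bookkeeping once, and each target only overrides the hardware-specific hook.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative generic plugin device: shared logic lives here.
struct GenericDeviceTy {
  virtual ~GenericDeviceTy() = default;

  // Shared entry point; argument validation happens once for all targets.
  bool submitData(void *Dst, const void *Src, size_t Size) {
    if (!Dst || !Src)
      return false;
    return dataSubmitImpl(Dst, Src, Size);
  }

protected:
  // Target-dependent part; a real plugin would issue a ROCm/CUDA/MPI
  // transfer here instead.
  virtual bool dataSubmitImpl(void *Dst, const void *Src, size_t Size) = 0;
};

// A "host" device that treats the CPU itself as the offload target.
struct HostDeviceTy final : GenericDeviceTy {
protected:
  bool dataSubmitImpl(void *Dst, const void *Src, size_t Size) override {
    std::memcpy(Dst, Src, Size);
    return true;
  }
};
```

Adding a new backend then means writing one such subclass, while JIT, memory management, and the user-visible behavior stay shared.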
- Device runtimes, e.g., libomptarget.devicertl.a, contain the device code for all supported targets, e.g., GPUs. The linker pulls in only the parts required by the device portions of an application, which allows us to write our portability and performance abstraction layer in C++.
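The style of that C++ abstraction layer can be sketched like this: one small function per low-level query, with the target-dependent intrinsic chosen at compile time. The function name and namespace are illustrative; the Clang builtins shown are the real per-target intrinsics, and the host fallback exists only so the sketch compiles and runs anywhere.

```cpp
// Hypothetical portability wrapper in the style of the device runtime.
namespace mapping {
inline unsigned getThreadIdInBlock() {
#if defined(__AMDGCN__)
  return __builtin_amdgcn_workitem_id_x(); // AMD GPU lane/work-item id
#elif defined(__NVPTX__)
  return __nvvm_read_ptx_sreg_tid_x();     // NVIDIA PTX thread id
#else
  return 0; // host fallback: a single "thread 0" for testing
#endif
}
} // namespace mapping
```

Kernel code written against such wrappers compiles unchanged for every target the device runtime supports.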
libomptarget is capable of offloading code to CPUs as well as AMD and NVIDIA GPUs, and it has many extensions for things like remote machines or a "virtual" GPU running on the host. The API is generic enough to lower CUDA programs onto it and thereby make them portable. The library provides a ready-to-go JIT for the offloaded regions and comes with various controls, e.g., to enable basic profiling and debugging capabilities. You can record and replay offloaded kernels, or use the portable wrappers around low-level, target-dependent instructions to write portable kernel code. The libc-gpu runtime is supported out of the box (incl. RPCs), and the std::parallel offloading in libcxx targets the runtime as well.
I can talk more about features, design, and future plans, but for now I want to get a sense of whether there is any opposition to such a move/rename. Please feel free to ask questions.
I also want to thank the people who provided early feedback, but I don't want to name names in case this doesn't work out. Anyway, you know who you are.