[RFC] Introducing `llvm-project/offload`

LLVM has a growing community around, and interest in, offloading. People target other CPUs on the system, GPUs, AI accelerators, FPGAs, remote threads or machines, and probably more things. The base principles are always the same, yet we have only loosely connected approaches to tackle this. Everyone has to write, maintain, and distribute their own flavor of an offload runtime and the connected components. This is certainly not the situation we want as a community.

Introducing llvm-project/offload

The idea is to provide a space for all of us to work together on the offloading infrastructure. Overall, the goal is a unified user experience, less and better code, interoperability between offloading models, and portability for the models that are currently not as portable as we would like them to be. I believe a subproject is adequate for this, given the wide scope and the way it couples with other (sub)projects.

OpenMP offload

To bootstrap a generic LLVM offload, I suggest renaming and moving llvm-project/openmp/libomptarget into llvm-project/offload. This will give us an initial offloading backend that has been actively developed upstream for years and is used by companies like AMD and Intel. It is already only loosely coupled to the OpenMP host runtime, and it has never required users to start with #pragma omp.

The llvm-project/openmp/libomptarget folder contains three "kinds" of runtimes, plus tests and some utilities. The runtimes are:

  • libomptarget.so, a host runtime that orchestrates target-independent offloading tasks, like reading the images, registration and launch of kernels (via the plugins below), reference counting for "mapped" memory (optional), etc. This is the only "legacy" code we still have. As part of a soon-to-happen rewrite, we will make the names in the library and the user-facing API more generic, and design it in a way that non-OpenMP users can even more easily circumvent OpenMP features (like reference counting). The subclass approach (see below) could also be used to specialize generic offloading for each "model", e.g., OpenMP, CUDA, HIP, etc., while sharing most code. (A usage sketch follows after this list.)
  • Plugins, e.g., libomptarget.rtl.amdgpu.so, contain the code to talk to "the hardware", for some notion of it. This can be a GPU, via ROCm or CUDA, a remote system, via gRPC or MPI, or the CPU. The (nextgen-)plugins share most of their code, including 95% of the JIT logic and a device memory manager; only the target-dependent parts are implemented via subclasses. This ensures a consistent user experience and helps to maintain feature parity across targets.
  • Device runtimes, e.g., libomptarget.devicertl.a, contain the device code for all supported targets, e.g., GPUs. The linker pulls in the parts required by the device parts of an application, which allows us to write our portability and performance abstraction layer in C++.
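As a concrete illustration of the "never required #pragma omp" point, here is a minimal sketch (mine, not from the RFC) that drives the host runtime purely through the standard OpenMP device routines libomptarget implements; the kernel launch itself is elided, since that goes through the plugin interface:

```cpp
// Minimal sketch: explicit data movement through libomptarget via the
// standard OpenMP device routines (omp_target_alloc, omp_target_memcpy,
// omp_target_free). No "#pragma omp", no implicit mapping or reference
// counting involved.
#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
  if (omp_get_num_devices() == 0) {
    std::fprintf(stderr, "no offload device available\n");
    return 1;
  }
  const int Dev = omp_get_default_device();
  const int Host = omp_get_initial_device();

  std::vector<double> Data(1024, 1.0);
  const size_t Bytes = Data.size() * sizeof(double);

  void *DevPtr = omp_target_alloc(Bytes, Dev);
  omp_target_memcpy(DevPtr, Data.data(), Bytes, /*DstOffset=*/0,
                    /*SrcOffset=*/0, Dev, Host);
  // ... launch a kernel on DevPtr (this goes through the plugins) ...
  omp_target_memcpy(Data.data(), DevPtr, Bytes, 0, 0, Host, Dev);
  omp_target_free(DevPtr, Dev);
  return 0;
}
```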

libomptarget is capable of offloading code to CPUs and to AMD and NVIDIA GPUs, and it has many extensions for things like remote machines or a "virtual" GPU running on the host. The API is generic enough to lower CUDA programs onto it to make them portable. The library provides a ready-to-go JIT for the offloaded regions and comes with various controls, e.g., to enable basic profiling and debugging capabilities. You can record and replay offloaded kernels, or use the portable wrappers around low-level target-dependent instructions to write portable kernel code. The libc-gpu runtime is automatically supported (including RPC), and the parallel STL offloading in libcxx targets the runtime as well.

I can talk more about features, design, and future plans, but for now I want to get a feel for whether there is any opposition to such a move/rename. Please feel free to ask questions.


I also want to thank the people that provided early feedback, but I don't want to name names in case this doesn't work out. Anyways, you know who you are :wink:

17 Likes

+1

Most of the parts of offloading, such as kernel launch and device management, are pretty common. For now, any framework or programming model that would like to offload to GPUs has to write its own device launcher, whether it is a machine learning framework, such as PyTorch, or a modern programming model, such as Triton. It's great that LLVM can provide a generic base implementation that all the others can extend based on their actual needs.

We can move the common stuff to llvm/offload, probably make it an LLVM component, and rewrite libomptarget by combining llvm/offload and the OpenMP-specific parts.

2 Likes

+1

Projects like MLIR could benefit a lot from this.

I think it would be great also to consolidate & move the tooling to generate & embed GPU binaries, kernel registration, etc., to llvm/offload. But that might be a separate discussion.

1 Like

llvm-project/llvm/offload or llvm-project/offload? The latter gives it more visibility, and we are not on SVN anymore, so there is room for more top-level projects.

3 Likes

+1. It'd be great to be able to figure out how to extend this to be a good target for SYCL upstreaming.

1 Like

I don't have any appreciable experience with the existing work, so I can't comment on the details, but the objective makes sense to me and the direction sounds good!

2 Likes

There are build hazards if we go with the former. This will have code in it that needs to be compiled with clang, probably in the LLVM_ENABLE_RUNTIMES fashion, so I think we'll have an easier time with llvm-project/offload.

I have a longstanding interest in a C++ wrapper over the C HSA interface used on amdgpu. That would be an hsa.hpp included by the openmp host plugin, by the libc GPU loader, and by the amdgpu-arch tool found under clang. Currently that is done by copy&paste&cry because there is no good place in tree to put it. So llvm-project/offload/any-string/hsa.hpp would make me happy. Even if we just had some header-only library code with no CMake, that would be a step forward.

In general I think there are some really useful generic things under the openmp project. For example, there is some library code under DeviceRTL which approximates a compiler-rt: clang emits library calls into it which expand to device-specific builtins. I like that because it means the IR for nvptx and for amdgpu is more similar. If nvptx used address spaces, we'd probably be able to get a single lit test to check both of them.

The HSA and CUDA libraries expose roughly similar capabilities for asking GPUs to do things, behind very different interfaces. The 'plugins-nextgen' layer is a mix of openmp-specific stuff and an abstraction over that divergence.
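To make the "abstraction over that divergence" concrete, here is a simplified sketch of the subclass pattern (the names echo the real GenericDeviceTy hierarchy in plugins-nextgen, but this interface is illustrative, not the actual one):

```cpp
// Illustrative sketch only: target-independent logic lives in a generic
// base class, and only the vendor-runtime calls are virtual. The real
// interface (PluginInterface.h) is much richer.
#include <cstddef>

struct GenericDeviceTy {
  virtual ~GenericDeviceTy() = default;
  // Shared services (JIT, device memory manager, ...) are implemented
  // here, on top of the virtual hooks below.
protected:
  virtual void *dataAlloc(size_t Size) = 0;
  virtual void dataSubmit(void *Dst, const void *Src, size_t Size) = 0;
};

struct AMDGPUDeviceTy final : GenericDeviceTy {
protected:
  void *dataAlloc(size_t Size) override {
    // Would call hsa_amd_memory_pool_allocate(...).
    return nullptr;
  }
  void dataSubmit(void *Dst, const void *Src, size_t Size) override {
    // Would call hsa_memory_copy(Dst, Src, Size).
  }
};

struct CUDADeviceTy final : GenericDeviceTy {
protected:
  void *dataAlloc(size_t Size) override {
    // Would call cuMemAlloc(...).
    return nullptr;
  }
  void dataSubmit(void *Dst, const void *Src, size_t Size) override {
    // Would call cuMemcpyHtoD(...).
  }
};
```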

Thus, the things I think would be wonderful in both the short and the long term are:

  • A compiler-rt style GPU lib that abstracts over architecture differences

  • A host library that abstracts over cuda / hsa / other GPU kernel control languages

It's not as clear to me that the bulk of libomptarget is inherently reusable - a lot of it seems to be plumbing to the dynamically linked plugins that would largely disappear if we stopped dynamically linking them - but others hopefully see more potential there.

As a variation on this proposal, I think we should create an offloading top-level project and use it as a library for implementing parts of openmp, libc, and maybe hip or cuda support. We move code from openmp into it incrementally, renaming & maybe even testing as it goes, leaving the openmp runtime libraries ultimately much smaller and dedicated to the openmp-specific behaviour.

Specifically, I'd suggest llvm/offload builds a smallish number of static libraries that are statically linked into other projects, and that we treat the API defined by offload as toolchain-internal and free to change as we see fit, avoiding getting tagged with backwards-compatibility ideas from other languages.

This also means all the code landing in offload can be reviewed by non-openmp people to help establish credibility that this is an engineering play.

I also endorse this idea. Just to add to the list of projects that already use the OpenMP offloading runtime: the OpenACC compiler Clacc is also leveraging it, as far as I know.

https://www.openacc.org/events/open-accelerated-computing-summit-2023

1 Like

The latter; I used llvm instead of llvm-project everywhere, corrected now. Thanks.

Regardless of this RFC, that has been on our TODO list for a while.

1 Like

The direction of this sounds great, as I think we should have a place for offloading that is not only OpenMP, but provides the building blocks for offloading in a more general sense.

I'm not completely sold on the idea of just refactoring/moving everything that currently lives in openmp/libomptarget into llvm/offload or llvm/offload/openmp, and I wonder if that transition should include some more brain-cycles on what should live there, so we build that common place more incrementally.
In case we decide that a plain refactor/move is indeed the right call to make, I think we should plan and arrange things with OpenMP being a use case of llvm/offload, not the root of it.

On a more general note, I agree that this move would give us a nice place to include some nicer-to-work-with abstractions/wrappers for, e.g., HSA, to help with things like compatibility across HSA versions or similar. I know too little about the libc-on-GPU work to say to what extent parts from there would also live under an llvm/offload project.

libc(-on-GPU) stays in llvm-project/libc. The logic by which the offloading thread serves RPC requests, if the kernel needs RPC, stays in llvm-project/offload.

I'm afraid this is going to derail the general direction discussion. Long story short, llvm-project/offload is not for OpenMP, but OpenMP is a user (at least as time moves on). We've shown CUDA can be a user next to OpenMP just fine. All users should have their user-specific code and APIs in /offload; there is little reason to split that again into 5 places. Similarly, all offload-specific stuff should be in there: wrappers, launchers, image handling, JIT, device code, ...
Doing it incrementally sounds great, and I am all for it if: (1) we actually have proper technical reasons for it, that is, real reasons, not just that we "want to start fresh because it's always better"; and (2) we find 3-5 volunteers who spend 80+% of their time doing it and also avoid duplication with libomptarget by using the new functionality there right away. That said, I doubt it's necessary or helpful to do it incrementally; on the contrary, I believe it will make the effort a non-starter or cause enormous pain.

That's right. Clacc's current design of building OpenACC support on OpenMP already works pretty well without the generalization proposed in this RFC. However, one issue is that runtime diagnostics often read like OpenMP, not OpenACC. I assume that generalizing/customizing the diagnostics to make more sense for other programming models is one possible improvement that falls under this RFC.

2 Likes

I'm afraid this is going to derail the general direction discussion. Long story short, llvm-project/offload is not for OpenMP, but OpenMP is a user (at least as time moves on).

I'm not sure I understand your comment. As I mentioned, I like the general direction.

My understanding from the RFC was that we would bulk-move all of libomptarget to llvm/offload as is and then continue from there. The "as is" part is what I was referring to, and I wondered if it makes more sense to move pieces somewhat incrementally. Eventually all offloading will be in llvm/offload. The incremental move allows us to have another look at the parts, strip the OpenMP-named parts, and, while we do that, separate specific parts from more general things.

As an example, we have a variety of kmpc and omp types, functions, etc. sprinkled around the DeviceRTL. This, and similar OpenMP "specifics" or artifacts, made me wonder if we should transition the common parts from libomptarget to llvm/offload first, making it more visible from the start that this is not OpenMP-centric.
I'm likely missing a ton of technical intricacies on why we would not want to do that, so I'm happy to learn more about these reasons.

I do understand the appeal of just moving/renaming all of libomptarget (even if it's just for time constraints and practicality reasons), but I am unsure if this is the way we should do it. And I would think that learning about the pros and cons is what this RFC is all about.

This RFC was about creating such a sub-project, with one suggestion on how to do it to get a head start. We'll have the opportunity for technical discussion as part of the PR I will hopefully put out soon.

Hi All,

I wanted to express support for turning libomptarget into a more general solution. We've attempted to use it as part of the Chapel (https://chapel-lang.org/) runtime in the past, but we needed more features from it, such that our solution included wrapping some CUDA Driver API functions independently. The specifics aren't fresh in my memory, but I can dig some stuff up. Currently, we have our own runtime implementation that interfaces with CUDA/HIP directly.

In any case, we'd be happy to participate in discussions, and we are very interested in giving the "offload" project another try in the Chapel runtime.

Engin

2 Likes

There are some components that I've wanted for long enough that I'll write them myself, given some indication that they'll be reviewed. Well, move & adapt them from where they already exist. I've added some links to the existing things, which will get merged/cleaned and proposed by me shortly.

HSA.hpp
Exists in various degrees of completeness in openmp, libc, and my GitHub. It puts a C++ API over hsa.h so one doesn't have to pass void* to callbacks and similar.
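A minimal sketch of the kind of thing that header could contain, assuming only the real hsa_iterate_agents entry point (the hsa_cpp namespace and helper are invented for this example):

```cpp
// Illustrative sketch: a trampoline so HSA's C callbacks can take a
// capturing C++ lambda instead of a void* payload. hsa_iterate_agents
// and hsa_agent_t are real HSA API; everything else is made up here.
#include <hsa/hsa.h>
#include <type_traits>
#include <vector>

namespace hsa_cpp {
template <typename F> hsa_status_t iterate_agents(F &&Fn) {
  // A captureless lambda decays to the C function pointer HSA expects;
  // the user's callable travels through the void* parameter.
  hsa_status_t (*Trampoline)(hsa_agent_t, void *) =
      [](hsa_agent_t Agent, void *Data) -> hsa_status_t {
        return (*static_cast<std::remove_reference_t<F> *>(Data))(Agent);
      };
  return hsa_iterate_agents(Trampoline, &Fn);
}
} // namespace hsa_cpp

// Usage: collect all agents with a capturing lambda, no void* in sight.
std::vector<hsa_agent_t> collectAgents() {
  std::vector<hsa_agent_t> Agents;
  hsa_cpp::iterate_agents([&](hsa_agent_t A) {
    Agents.push_back(A);
    return HSA_STATUS_SUCCESS;
  });
  return Agents;
}
```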

Dynamic cuda/hsa
Exists in openmp as a way to deal with compiling on machines that don't have the vendor libs installed. They're a copy of a subset of the header, plus a .cpp which exports the same symbols and, under the hood, does a dlopen and dlsym on init.
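A sketch of that pattern, with hedged details (the library soname and error value are from the HSA runtime as I know it; the real dynamic_hsa code in openmp is the authoritative version):

```cpp
// Illustrative sketch: export the vendor symbol but resolve it lazily via
// dlopen/dlsym, so neither building nor starting the program requires the
// vendor runtime to be installed.
#include <hsa/hsa.h> // the copied header subset mentioned above
#include <dlfcn.h>

namespace {
template <typename FnTy> FnTy lookup(const char *Name) {
  static void *Lib = dlopen("libhsa-runtime64.so", RTLD_NOW | RTLD_GLOBAL);
  return Lib ? reinterpret_cast<FnTy>(dlsym(Lib, Name)) : nullptr;
}
} // namespace

// Same name and signature as the real entry point; the extern "C"
// declaration in hsa.h gives this definition C linkage.
hsa_status_t hsa_init() {
  static const auto Real = lookup<hsa_status_t (*)()>("hsa_init");
  return Real ? Real() : HSA_STATUS_ERROR;
}
```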

gpu-builtins
A small C library that defines things like ballot and broadcast behind target-agnostic names. It probably uses uint64_t for the wave type. Compiled to IR and statically linked into libc, the DeviceRTL, and whoever else wants it. Very similar in idea to the compiler-rt builtins. (A sketch follows after the next paragraph.)

DeviceRTL has the same sort of thing, but written via variant and scattered around.
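A sketch of one such builtin, assuming device compilation with clang (the clang builtins used are real; the __gpu_ballot name and the uint64_t wave type are placeholders, per the description above):

```cpp
// Illustrative sketch: one target-agnostic entry point, per-target bodies
// selected at compile time, compiled to IR and statically linked in.
#include <cstdint>

extern "C" uint64_t __gpu_ballot(bool Pred) {
#if defined(__AMDGCN__)
  // Wave64 assumed here for simplicity.
  return __builtin_amdgcn_ballot_w64(Pred);
#elif defined(__NVPTX__)
  // 32-wide warp: widen the 32-bit ballot result to the common type.
  return __nvvm_vote_ballot_sync(~0u, Pred);
#else
#error "compile this for a GPU target"
#endif
}
```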

Those are the ones I've written repeatedly in different projects, and I am volunteering to write them an Nth time under llvm/offload, including patching libc and openmp to use said libraries.

It seems I read the RFC wrong then.

I appreciate and support an llvm/offload where a generic offload infrastructure lives and we can leave technical discussions to PRs. Sounds good.

This is a good direction. We need to make sure it is generic enough for SYCL, OpenMP (C, C++, and Fortran), Triton, Mojo, etc., and try not to break backward compatibility.

Can you be more precise about backwards compatibility?

I was imagining this to be an LLVM internal thing, in the sense that clang and llvm know the names of symbols in the library, and the whole system co-evolves.

Once that has established itself as useful and the initial API mistakes have been identified and fixed, maybe it could market itself as usable independently of LLVM, but even then it would be more sensible to have that as a separate entity with a fixed API, to localise the backwards-compatible cruft accumulation in one place.

For example, avoiding any need to enshrine the stable API of libomptarget in llvm/offload is one of the reasons I think the two should be separate - users other than openmp really don't care about the name stability of libomptarget, and openmp really does care about it, so localising the versioning and the associated games/overhead at the libomptarget boundary is the right place to do it.

E.g., if you have code generated against the current libomptarget ABI and linked dynamically (with the current entry points and data layout), then when we switch to linking the new liboffload dynamically, the binary should still work.
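For context on what "current entry points and data layout" means, here is the offload entry record that generated code bakes in, as I understand it from libomptarget's omptarget.h (shown for illustration; the header is authoritative). A dynamically linked binary carries an array of these, so a drop-in liboffload would need to keep the layout stable:

```cpp
// Sketch of the current offload entry ABI record (mirrors omptarget.h as
// I understand it; verify against the real header).
#include <cstddef>
#include <cstdint>

struct __tgt_offload_entry {
  void *addr;       // Address of the global or kernel entry.
  char *name;       // Name of the symbol.
  size_t size;      // Size of the global in bytes (0 for functions).
  int32_t flags;    // Entry flags.
  int32_t reserved; // Reserved for the runtime.
};
```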