[RFC] `llvm-project/offload` roadmap

Follow up to [RFC] Introducing `llvm-project/offload`

There was basically unanimous support for the goals stated in the LLVM/Offload RFC, but there was some discussion about the way forward. Since then, I have had offline conversations with interested parties and spent some time cleaning up more of the existing code base. I hope I can convince people that this is a feasible way forward and that many concerns can be addressed, e.g., untangling OpenMP from core features, providing proper error checking, encapsulating concurrently modified data structures, etc.

Project board:
I started a “development planning board” on GH. The hope is that we can write down what has to be done, or at least discussed. This way we can address concerns and missing features in a structured way. I pre-populated the board with some tasks for myself, and tasks I took from the past discussions. People should feel free to add, modify, and assign tasks as we do elsewhere.

Meetings:
To facilitate communication as we have hopefully many contributors (and for sure a lot to do), I plan to resurrect the (probably bi-weekly) GPU/Offloading meeting which died down after we lost our star organizer (@kuhar).

First meeting (repeats every 2 weeks): 2024-01-12T17:00:00Z→2024-01-12T18:00:00Z
ICS file: LLVM_Offload --- Design and Impl. Discussions.ics - Google Drive

MS Teams
Click here to join the meeting
Meeting ID: 236 392 634 256
Passcode: hRkG3z
Agenda and meeting minutes:
Meeting notes - Google Docs

Working groups:
A couple issues, like API design, are big. We might want to form working groups that come up with solutions and alternatives to present to the rest in the meeting.

Roadmap:
I still believe the most feasible way to do this is to move/rename libomptarget. Here is a PR that works for my setup. Once accepted, we will have a starting point and modify things in place, e.g., rename the libraries. This way, we can easily add “missing features” in parallel. As examples, CUDA API support, CI, testing, and API (re-)design can be worked on and tested independently.

I’ll update this post as things evolve but I also hope we have consensus and can move discussion + development onto GH.

~ J


Tag: @JonChesterfield, @xtian-github, @jplehr, @alycm, @jhuber6, @shiltian, @antonrydahl, @grypp, @josemonsalve2, @e-kayrakli, @tschuett, @tahonermann, @Artem-B, @Anyee, @jbrodman, @jdenny-ornl, @fabianmc, @mjklemm


What is your intent for the API and ABI of libomptarget in the future?

I note the plan to leave a symlink behind after the rename.

To what extent will programs built against old versions of libomptarget continue to work with the new project?

We have at least three downstream toolchains building programs that use the libomptarget interface: HPE, ROCm, and Intel. Changing clang to emit code that only works with the new offload library would cause problems in those. Is that the expectation here?

Similar questions were asked by @xtian-github in the original RFC thread. As I mentioned there, the plan is to keep backwards compatibility the same way we have so far. We keep all APIs around and we keep them working for a reasonable amount of time. After that, downstream can own the wrappers. In our experience, we never removed functionality, just “repackaged it” when we extended things. This was always done to allow backwards-compatible wrappers. I expect that to be true in the future.

Thank you for clarifying. This means adding new interfaces that clang emits calls into, and downstream toolchains will have to add those interfaces to their libomptarget implementations (or diverge from clang) to keep things working.

We could introduce the new function in llvm/offload implemented in terms of the existing API when changing clang to emit calls to it. Then in a subsequent patch rewire offload to do the cleverer thing. That would let downstream toolchains use that otherwise transient stub implementation as a specification for what they’re meant to do and/or as their intermediate implementation.
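Purely as an illustration (not a proposal), a transient stub along those lines could look like the sketch below. The entry-point name `__offload_memcpy_to_device` is made up for this example; `omp_target_memcpy` and `omp_get_initial_device` are the existing OpenMP runtime calls.

```cpp
// Hypothetical example only: __offload_memcpy_to_device is a made-up name,
// not a proposed interface. It shows the "new entry point implemented in
// terms of the existing API" idea described above.
#include <cstddef>
#include <omp.h>

extern "C" int __offload_memcpy_to_device(void *DstDevPtr, void *SrcHostPtr,
                                          size_t Bytes, int DeviceId) {
  // Transient stub: forward to the long-standing OpenMP runtime call so that
  // existing libomptarget implementations keep working unchanged. A later
  // patch could rewire this to the "cleverer thing" without touching clang.
  return omp_target_memcpy(DstDevPtr, SrcHostPtr, Bytes,
                           /*dst_offset=*/0, /*src_offset=*/0,
                           /*dst_device_num=*/DeviceId,
                           /*src_device_num=*/omp_get_initial_device());
}
```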

@jdoerfert Thanks for the updated RFC. This looks like a promising direction. We are looking forward to working with you and the community on the design and parts of the implementation. As we discussed, the goal is for the “new offload libraries” to support multiple languages such as OpenMP, SYCL, and Triton using a common “offload lib core” while allowing language-specific extensions.


Thanks for the roadmap, Johannes. We remain interested in using llvm/offload in the Chapel runtime in the future. I’d be happy to participate in the meetings and/or working groups and to give it a try once some of the interface mismatches are addressed.


Thank you for this follow-on RFC. We’re also interested in supporting high-level offload compute languages in LLVM, and are interested in participating in the technical discussions about how this API should look. We (either myself or some of my colleagues) will join the calls and working groups that you have described.

We have experience with putting SYCL, OpenMP, and OpenCL on top of portable low-level interfaces, and are committed to bringing technical proposals and engineering effort on how this API could look. We’ll start commenting on, or creating, issues in the roadmap board that you created.


Thanks for the roadmap as well as the organization of work, Johannes. We are interested in leveraging the offloading infrastructure internally and look forward to participating in the community discussions and the meetings.


As Alastair said above, we (Codeplay) are keen to be involved. I don’t have permission to add items or leave comments on your GitHub planning board right now, but would be happy to start leaving some initial feedback on those tickets when possible.


Cool!

People who are in the LLVM organization (which you might want to join) should have write access. For everyone else, I can invite them explicitly once I have the GH handle. Please PM or email me your handle, or ask to be added to the LLVM organization (which you’ll need for upstreaming code anyway).

Given the positive responses by >=5 institutions that want to invest in this, plus lots of positive comments on the original RFC, I hope this clears the bar set by our developer policy:

  • Must be generally aligned with the mission of the LLVM project to advance compilers, languages, tools, runtimes, etc. :white_check_mark: (I don’t think anyone would object to this.)
  • Must conform to all of the policies laid out in this developer policy document, including license, patent, coding standards, and code of conduct. :white_check_mark: (we already do and we plan to continue to do so.)
  • Must have an active community that maintains the code, including established code owners. :white_check_mark: (I am happy to continue as code owner, but we will revisit this in our early meetings. The responses above show an active community is willing to form.)
  • Should have reasonable documentation about how it works, including a high quality README file. :white_check_mark: (we have a non-trivial amount of information on the openmp.llvm.org webpage, and we will spend effort on better documentation as one of the initial goals.)
  • Should have CI to catch breakage within the project itself or due to underlying LLVM dependencies. :white_check_mark: (more CI will be added but we have one staging and one production buildbot)
  • Should have code free of issues the community finds contentious, or be on a clear path to resolving them. :white_check_mark: (the code is upstream and we established the development board, we’ll add the Offload meeting, and the thematic working groups asap.)
  • Must be proposed through the LLVM RFC process, and have its addition approved by the LLVM community - this ultimately mediates the resolution of the “should” concerns above. :white_check_mark: (I think the responses here show broad agreement, including on the way we can resolve existing issues.)

If nobody objects, I would like to start with some of the administrative tasks such that we can hit the ground running in 2024.
I will send out invites for the meeting later this year; I’m still waiting for more people to put their availability into the calendar above.

For now, I want to create the subproject folder, README and docs. Then webpage and GH teams/hooks. I’ll also make sure we test the initial commit on more systems to verify it does not break existing functionality of OpenMP offload.

I don’t recall how the webpage redirect is done but I can start writing content in llvm-project/offload/docs which we would show under offload.llvm.org.

~ J

Tag: @tonic @akorobeynikov @tstellar


(Side note: It hasn’t even been 8 years since something like this was first proposed: RFC: Proposing an LLVM subproject for parallelism runtime and support libraries; LLVM moves fast after all.)

I still object.

As stated previously, I believe we have enthusiastic support for a LLVM GPU offloading library that is used by multiple different languages, in and out of tree. We are all tired of targeting multiple different vendor libraries and would like a common abstraction. We have an opening to do something really useful here.

Your plan of record is mv libomptarget LLVMGPU and then a comprehensive rewrite. I do not see support for that from any group other than OpenMP developers, and even those developers have mixed feelings about the rewrite part. People want their pain points addressed; it does not follow that they want you to do this.

Instead of your plan, I maintain that we should create an LLVMGPU project and write an API for it which has features like “find a kernel” and “launch a kernel” that aren’t interwoven with OpenMP semantics. We then use that library as part of the implementation of OpenMP and of other users. That follows the collection-of-reusable-libraries strategy of LLVM, which I believe does and should have widespread support.

I have concerns about your plan to move libomptarget and then immediately rewrite it. Given that’s the intent, you should write your new library instead and optionally use it from libomptarget. That drastically reduces the risk of breaking users of libomptarget and has zero effect on your stated development plan. It also leaves you free to write a more useful, language-agnostic interface, free of the existing API. That is the technically superior strategy to achieve your stated goals, and I am totally bewildered that you are unwilling to consider it.

I also have concerns that you have been unwilling to engage with AMD engineers about this strategy, given that the ROCm compiler is the only downstream compiler that uses libomptarget (based on attendance in the multi-company OpenMP meeting). Intel and HPE both use their own library and are thus unaffected by this change, except for where you subsequently break the API as planned above. I am arguing internally that AMD should likewise abandon your library in favour of a hard fork of the runtime libraries until such time as the new GPU offloading library becomes equivalently usable to the one currently in tree.

You consider my suggestion to take too long relative to hacking things together. That fundamentally misunderstands the purpose of a common library. The goal should be to reduce the implementation cost of GPU languages, not minimise the implementation cost of the common library. A library called “LLVM GPU” which contains all the bugs and quirks of OpenMP is not going to decrease the implementation cost for anyone.

Hi @JonChesterfield

From Codeplay’s perspective we’re happy to go with whatever the consensus is on the best way forward. That said, we do share your concerns about starting with libomptarget wholesale and then trying to remove all the OpenMP quirks. Introducing a new API and reimplementing libomptarget (and other languages) on top of it would be less disruptive and probably lead to more productive conversations about required features and design decisions. The combination of multiple components in libomptarget may also confuse things (host runtime, plugins-nextgen, DeviceRTL). If the goal is simply to create a library that abstracts away specific GPU APIs, then the new library would only have to replace the plugins, and the remaining libomptarget components could stay OpenMP specific.

We mentioned in the previous RFC that we’re willing to contribute Unified Runtime in its entirety, which may be a better starting point if development goes in that direction but we want to avoid a completely blank slate. Unified Runtime handles the host runtime aspects of offloading in a backend-agnostic way, but unlike libomptarget it aims to be language independent and does not implement device library code, so it has a smaller footprint. In terms of design it’s very much like OpenCL, which of course means we do inherit some quirks from SYCL/OpenCL - but we would want to deal with these if any problems come up. We have working backends (called adapters) for CUDA, Level Zero, HIP, and OpenCL.

Hello everyone.

I’ll restate some of my earlier concerns from the other RFC as one of the contributors to libomptarget. Ultimately I won’t complain if we end up going with the currently proposed method, but I don’t think it’s the best course of action.

As mentioned by others, changing libomptarget as it exists is disruptive, and I personally don’t believe that the OpenMP interface is inherently valuable for non-OpenMP users. The push for this most likely comes from the work Johannes posted above that implements CUDA / HIP in terms of OpenMP runtime calls, as it would represent less work to get a functioning product. However, this does not necessarily mean we will end up with an optimal library.

Realistically, the valuable portions of libomptarget are the “plugins”. That is, the implementations that abstract over the details of the CUDA driver / HSA / OpenCL / whatever. My proposal is that we write an API that abstracts over the “plugins” and export that as a usable, stable interface. Then, on top of that, we can write runtime entry points made to be emitted by the compiler. This is how CUDA works, at least: it provides the CUDA driver, which is a supported interface, and it provides things like __cudaRegisterFunction, which eventually call into the CUDA driver and maintain a few bits of internal state.
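As a purely hypothetical sketch of that layering (none of the `ol*` or `__llvm_offload_launch` names exist anywhere; they only illustrate the split between a stable plugin-level API and a thin compiler-facing entry point):

```cpp
// Illustrative layering sketch only; every name here is invented.
#include <cstddef>
#include <cstdint>

// Layer 1: a stable, vendor-agnostic "plugin" API exported by the library;
// the actual implementations would live in the CUDA/HSA/host plugins.
struct ol_device_t;
struct ol_kernel_t;
extern "C" int olFindKernel(ol_device_t *Device, const char *Name,
                            ol_kernel_t **Kernel);
extern "C" int olLaunchKernel(ol_device_t *Device, ol_kernel_t *Kernel,
                              uint32_t NumBlocks, uint32_t NumThreads,
                              void **Args, size_t NumArgs);

// Layer 2: a thin, compiler-facing entry point (analogous to CUDA's
// __cudaRegisterFunction / launch glue) that keeps a little internal state
// and otherwise just calls into layer 1.
extern "C" int __llvm_offload_launch(ol_device_t *Device, const char *Name,
                                     uint32_t NumBlocks, uint32_t NumThreads,
                                     void **Args, size_t NumArgs) {
  ol_kernel_t *Kernel = nullptr;
  if (int Err = olFindKernel(Device, Name, &Kernel))
    return Err;
  return olLaunchKernel(Device, Kernel, NumBlocks, NumThreads, Args, NumArgs);
}
```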

My original suggestion was that we start (mostly) from scratch for just the interface by copying the plugins over into the new llvm-project/offload directory and then discuss with the other interested groups how to best provide an expressive layer over them. This would minimize collisions with libomptarget and minimize exposure to OpenMP-specific constructs. I don’t think this would take an unreasonable amount of time so long as we are ignoring backwards compatibility with OpenMP (for now). While this is still my preferred method, I am perfectly fine with just building a new API next to libomptarget, so long as we can have them as separate users.

I spent a few minutes typing out a really, really rough draft of some such API, llvm-project/offload API scratch · GitHub. The desire is to discuss this further to refine what people need.

One thing I like about the libomptarget plugins is that they are already written in an LLVM style (mostly). I’m not in favor of upstreaming the Unified Runtime verbatim for this reason, though we can look at it for commonality and inspiration. They seem to solve similar problems after all.

My two-cents, cheers.

Noted. As I see it, you want us to go a different way and it seems non-resolvable.
I tried to alleviate some of your early concerns with the encapsulation of features, but it seems that was not enough.
At this point, short of moving to your solution, I am unsure there is anything we can do, is there?
Which means we are either deadlocked or we commit to something and develop from there.
Talking to others again, I believe this RFC has a broad majority backing it.
Rewriting APIs, and the other things you want to do, is obviously not out of the question; we can, and want to, still do that as we develop this.

Good.

This is odd. Above I see people working on 3+ different programming models targeting 3+ different accelerators supporting this RFC. If we include the original RFC I made, people working on 5+ established and a few researchy programming models have commented positively on the goal and for the most part not objected to the path.

I don’t see why your goal is different from mine or somehow excluded by the proposed path. There is no need to start from scratch to achieve what you are describing.
We can have that exact API and we can use it without OpenMP semantics. The POC shows this nicely.

We talked about this in detail for hours, on at least 3 different occasions. I would not describe it as “me not considering it”.

Wrt. downstream users (esp. AMD): Why would you not just keep openmp/libomptarget downstream? Then the end result is the same for you: no breakage downstream, even if we change the API and such.

Again, I spent hours talking to multiple AMD people 1-on-1, in addition to the many hours we spent on this in the OpenMP meetings. I walked you personally through the rationale multiple times. I also discussed it with others who are now working on documents to discuss in our new meeting (see @jhuber6’s post above).

AMD’s is not the only downstream compiler that uses libomptarget. I am unsure why you would think that. Intel, for example, uses upstream libomptarget with extensions, and they regularly apply patches. Their fork is open; you can look. I know of multiple other compilers with (experimental) lowering to libomptarget.
All that said, if your concern is that LLVM/Offload is going to be unstable for your downstream compiler, it should be easy to not opt-in at the beginning. As mentioned above, you can keep libomptarget around and simply ignore LLVM/Offload until you have confidence to switch over.

I don’t see why the proposed path makes it impossible or harder to spend time isolating OpenMP “quirks” after the move. We would always have a working solution that we can rewrite, as we do right now. Patch by patch, we would make OpenMP quirks “opt-in” (as not all non-OpenMP languages would necessarily always want to opt out of all the features).

I don’t think I understand how one could not have a layer to select the right plugin/adaptor. Say you have 4 devices: where is device 3 mapped to the x86 plugin and device 4 to the NVIDIA one? That said, I can see how you could move other libomptarget capabilities into the plugins if we want to. Linking the plugins into libomptarget statically is already on the TODO list, in case the dynamic loading was your issue with the two libraries.
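To make the point concrete, here is a minimal sketch of the mapping layer that has to live somewhere above the plugins; the `Plugin` and `DeviceTable` types are placeholders invented for this example, not the existing plugin interface.

```cpp
// Minimal sketch of the device-number mapping described above.
#include <cstddef>
#include <utility>
#include <vector>

struct Plugin {
  const char *Name;  // e.g., "x86_64", "cuda", "amdgpu"
  size_t NumDevices; // devices this plugin discovered
};

// Some layer above the plugins has to own this table: the user-visible
// device number N resolves to "which plugin, which local device index".
struct DeviceTable {
  std::vector<std::pair<const Plugin *, size_t>> Entries;

  void addPlugin(const Plugin &P) {
    for (size_t I = 0; I != P.NumDevices; ++I)
      Entries.push_back({&P, I});
  }

  // With an x86 plugin exposing 3 devices and a CUDA plugin exposing 1,
  // global device 3 resolves to the CUDA plugin's local device 0.
  std::pair<const Plugin *, size_t> resolve(size_t GlobalDeviceId) const {
    return Entries.at(GlobalDeviceId);
  }
};
```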


We are not going to reach consensus on the right approach. Evidently our incentives are misaligned. Let’s accept that as a constraint and seek to move forward.

The thing I want - an abstraction over different GPU architectures - is useful in its own right. It’s roughly what you’d get if the existing libomptarget plugins were split into OpenMP-specific and vendor-specific pieces, or if the current refactoring of the plugins towards sharing code was continued to the fixed point, where all the OpenMP semantics are in the common core. It’s also roughly what libc uses for testing, and what I think essentially every project that doesn’t benefit from a ready-made general-purpose offloading library, such as the one you’re proposing here, has already written for itself.

I’ve posted that as a separate RFC, [RFC] GPU builtins runtime. I can hopefully instantiate that independently of llvm-project/offload if necessary. Please consider making use of it.

Good luck with the offload project either way. I still believe in the end point you want to reach, despite total gridlock on discussing the path to it.

My general point is that, from the discussion I’ve seen here and elsewhere, people are mostly interested in what we call the “plugins”. That is, the layer of abstraction over the vendor-specific runtimes like HSA and CUDA. The implementation of libomptarget itself is a language-specific runtime that abstracts the plugin interface. I don’t think there’s anything valuable about the OpenMP runtime API itself, as it was primarily designed for clang code generation and the OpenMP language. The existing art is the CUDAOMP implementation, which rewrites CUDA calls into OpenMP calls, but all those calls are fairly standard and could be rewritten as calls into the plugin API once it is made stable.

My suggestion is to simply copy the plugins to the proposed llvm-project/offload and design a unified API around them. This would be immediately useful and relatively easy to set up. Later we could port libomptarget to use this interface once it’s well tested. We’ve already had such temporary duplication in the past (e.g., the new driver, new plugins, and new DeviceRTL), so I don’t think this is a breach of existing protocol.

I believe that this would be the smoothest path forward, as changing the existing libomptarget structure has clearly proven somewhat contentious. Because libomptarget is itself a language-specific abstraction over the plugins, I believe the plugins are what we should export. The draft API I posted earlier is pretty much an attempt at that. Would love to hear others’ perspectives on the best path forward, cheers.

“Only having plugins” does not describe what happens to all the non-OpenMP code in libomptarget. Is it moved into the plugins? Do we lose the capability? Do we create a new layer that is just not based on libomptarget but that duplicates the non-OpenMP functionality? E.g.,

  • Who loads the right plugin(s)?
  • Who assigns numbers and does the mapping?
  • Who provides the API functions that are more than simple plugin entry points, e.g., 3D-rect memcpy?

That said, there is no reason to keep the current split. We have various changes already on the TODO list, e.g., static linking of the plugins and configurability for language-dependent extensions and APIs. Long term, the non-plugin part will likely grow, so calling for its removal now does not make much sense to me.

There is a vast design space out there and the problems we already solved with the code base, plus the ones we want to solve, cannot be described in a few paragraphs. The point of starting with a working solution is that we have one. People can and have started offloading libraries from scratch, but it will take time, lots of time, to settle and adapt anything if we have a multitude of interests. Incremental changes can be made faster, and new problems as well as design considerations will follow. We can also use data to justify changes, rather than investing lots of time to realize something won’t scale or won’t combine well with other features.

Wrt. changing libomptarget structure: The contention comes from downstream, as far as I can tell. They have a way out, simply by keeping the original libomptarget. For them, it’s like we do a new project but we have a head start.

Wrt. the draft API: I don’t see how it tackles error reporting in a reasonable way; copying CUDA is not really appealing (to me). Similarly, there is no mention of source locations, and it seems extensibility was not considered. Finally, we might not want to expose queues, and if we do, we should be very careful, as a FIFO-queue interface is what CUDA is moving away from and what HSA never had. Most of what I describe here was, or is, lacking in libomptarget (incl. the plugins); let’s try to fix these things, potentially one by one.

Thanks for the response, I appreciate going through the pros and cons of the implementation, since we’re all in agreement on the goals of this project at least.

That was my intention. We would have a new intermediate layer that exposes the plugins more directly. Some code will not live in the plugins, for example the asynchronous structs and logic for iterating the devices, but overall it’s a common enough set that it would still be usable by libomptarget or whatever other language runtime wants to have a generic interface to vendor specific libraries.

Handling loading and detection of devices is fairly straightforward. APIs like Vulkan or HSA just give you a way to iterate all the known devices and inspect their capabilities (e.g., whether it’s a GPU or a CPU, integrated or dedicated). Additional logic like ID mapping would presumably be handled by the language runtime, i.e., libomptarget. The rough draft I made also proposes some way to just get all devices that can run a given image, so you can avoid runtime initialization overhead.
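For illustration, a discovery interface in that spirit might look roughly like the sketch below; the `ol*` names and the device-kind enum are invented for this example and only loosely echo vkEnumeratePhysicalDevices / hsa_iterate_agents, not the actual draft.

```cpp
// Hypothetical discovery API sketch; all names are made up.
#include <cstddef>

enum ol_device_kind_t {
  OL_DEVICE_CPU,
  OL_DEVICE_GPU_INTEGRATED,
  OL_DEVICE_GPU_DISCRETE,
};

struct ol_device_info_t {
  ol_device_kind_t Kind;
  const char *Name;
};

// The runtime enumerates every known device and lets callers inspect it.
extern "C" size_t olGetNumDevices();
extern "C" int olGetDeviceInfo(size_t DeviceId, ol_device_info_t *Info);

// Optional convenience: only report devices that can actually run an image,
// so a language runtime can skip initializing backends it will never use.
extern "C" int olDeviceCanRun(size_t DeviceId, const void *ImageStart,
                              size_t ImageSize);

// Example use by a language runtime: pick the first discrete GPU, if any.
inline long pickFirstDiscreteGPU() {
  for (size_t I = 0, E = olGetNumDevices(); I != E; ++I) {
    ol_device_info_t Info;
    if (olGetDeviceInfo(I, &Info) == 0 && Info.Kind == OL_DEVICE_GPU_DISCRETE)
      return static_cast<long>(I);
  }
  return -1; // no discrete GPU found
}
```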

I’ve been looking into this; it’s a large change, however. My initial idea was that we would do this in my version of llvm-project/offload and then move libomptarget to it later. The main reason I want to get rid of the dlopen interface the plugins use is so we aren’t forced into a C API to communicate with the plugins and can have shared state. I realistically just want an API like Vulkan or OpenCL that is hosted in LLVM.

Personally, I think mostly re-using the plugins is a sufficiently large head start. I just don’t see what is special about OpenMP’s preferred interface that makes it a compelling target rather than a small abstraction around the plugins. Though I suppose I have a selfish reason to not create such a split, given that it would increase my workload.

Thanks for looking over it. The interface is more similar to existing APIs like CUDA, HSA, Vulkan, OpenCL, and Intel’s Unified Runtime. My plan was to make a good C interface that abstracts over the common bits, then provide a C++ wrapper that we can use internally. I think that this is a simpler and more useful approach than the OpenMP runtime interface.

I don’t think that this layer needs to be concerned with source location information. The current plugin API that libomptarget uses doesn’t take any source location information; I think that’s a consideration for the language runtime, not for this interface. For extensibility, that’s definitely something we want to figure out through successive discussions; that’s why I’ve written something out as a starting point. For error reporting, it’s the same as every other API; I’m unsure what more we need besides a call like CUDA’s cudaGetLastError, but even that shouldn’t be strictly necessary.
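For what it’s worth, a minimal sketch of that error model, using invented names (`ol_result_t`, `olGetLastError`, `olSetDevice`), could be as simple as:

```cpp
// Invented names for illustration only: every call returns a status code,
// and a thread-local "last error" string gives optional extra detail,
// roughly in the cudaGetLastError spirit.
#include <cstring>

enum ol_result_t {
  OL_SUCCESS = 0,
  OL_ERROR_INVALID_DEVICE,
  OL_ERROR_OUT_OF_RESOURCES,
};

namespace {
thread_local char LastErrorMessage[256] = "";
} // namespace

extern "C" const char *olGetLastError() { return LastErrorMessage; }

// How an API implementation might report failure: return a code and stash a
// human-readable message for callers that want more than the enum value.
extern "C" ol_result_t olSetDevice(int DeviceId) {
  if (DeviceId < 0) {
    std::strncpy(LastErrorMessage, "negative device id",
                 sizeof(LastErrorMessage) - 1);
    return OL_ERROR_INVALID_DEVICE;
  }
  LastErrorMessage[0] = '\0';
  return OL_SUCCESS;
}
```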

For queues, granted, I’m not the most knowledgeable guy on their underlying implementations, but in my experience the FIFO queue is the common way to expose this, and it’s what we already have implemented in the plugins, so I figured it was a good start. We can refine that later, but it would be a similar issue in either implementation.
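To illustrate the FIFO-queue shape being discussed, here is a hedged usage sketch; `olCreateQueue`, `olEnqueueMemcpyH2D`, `olEnqueueLaunch`, and `olSynchronizeQueue` are made up for this example and do not reflect the current plugin entry points.

```cpp
// Hypothetical FIFO-queue API and caller pattern; all names are invented.
#include <cstddef>

struct ol_queue_t;
struct ol_kernel_t;

extern "C" int olCreateQueue(int DeviceId, ol_queue_t **Queue);
extern "C" int olEnqueueMemcpyH2D(ol_queue_t *Queue, void *DstDev,
                                  const void *SrcHost, size_t Bytes);
extern "C" int olEnqueueLaunch(ol_queue_t *Queue, ol_kernel_t *Kernel,
                               unsigned NumBlocks, unsigned NumThreads,
                               void **Args, size_t NumArgs);
extern "C" int olSynchronizeQueue(ol_queue_t *Queue); // block until drained

// Typical caller pattern: work submitted to one queue executes in order,
// so a copy enqueued before a launch is visible to that launch.
inline int copyThenLaunch(int DeviceId, void *DstDev, const void *SrcHost,
                          size_t Bytes, ol_kernel_t *Kernel, void **Args,
                          size_t NumArgs) {
  ol_queue_t *Q = nullptr;
  if (int Err = olCreateQueue(DeviceId, &Q))
    return Err;
  if (int Err = olEnqueueMemcpyH2D(Q, DstDev, SrcHost, Bytes))
    return Err;
  if (int Err = olEnqueueLaunch(Q, Kernel, /*NumBlocks=*/1, /*NumThreads=*/64,
                                Args, NumArgs))
    return Err;
  return olSynchronizeQueue(Q);
}
```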

Thanks.