[RFC] OpenMP offload from one accelerator to another (i.e., nested target directives)

Hello.

There are many kinds of accelerators, and they in turn can offload work to other accelerators.
For example, GPUs can offload operations to in-memory devices (e.g., Aquabolt-XL HBM2-PIM, LPDDR5-PIM with in-memory processing, and AXDIMM with an acceleration buffer; see the IEEE Xplore article of that name).
I want to implement a programming model for offloading from one accelerator to another, compile such programs with LLVM, and make sure the programming model fits the design of the OpenMP offload (or LLVM Offload) system.

So I would like to hear opinions on the following design questions.

  1. Is it reasonable, within OpenMP's design and concepts, to use nested target directives to offload from one accelerator to another?
  2. If it is reasonable, is there any plan to support nested target directives?
  3. If it is not, how could this kind of offloading (host (CPU) → accelerator A → accelerator B) be supported in OpenMP?

Any discussion and opinions would be helpful.

Thanks
Sincerely
YoungjooKo


Hi YoungjooKo,

ref 1) Yes, as an OpenMP extension. The OpenMP standard does not yet allow this, but it is clear (to me) that it will have to in the future. No offloading language supports this by default (at least not the 5 main ones we implement via Clang). I’m happy to help where I can to extend our implementation as a proof of concept for the method.

ref 2) We briefly looked into it as part of our “Remote OpenMP Offloading” work (preprint: Remote_Offloading.pdf on Google Drive). However, we never implemented much. @kai_plociennik has a prototype, IIRC, but it’s unclear if we want to “start fresh” with the lessons learned. For the standard, we should probably build a prototype first (and, for example, write an IWOMP paper).

ref 3) As mentioned in 2, upstream does not support it and neither does the standard. That said, one can add support, and basic support should not be too hard.

We can continue the discussion here, or in a meeting.

~ J


Hi @jdoerfert,
Thanks for sharing your suggestions, and sorry for the late response. We had an internal discussion about this topic.
We’ve checked @kai_plociennik’s approach that you mentioned. Starting from his approach could be an option.
We would also like to continue this discussion here or in a meeting. It would be very helpful for us. :slightly_smiling_face:

The main purpose of our suggestion is to design a general interface for nested offloading, not one tightly coupled to a specific architecture.
That is, the interface should work for various types of ‘host’ (CPU, GPU, or other possible hosts) and ‘target’ (likewise CPU, GPU, …).
To make this possible, we propose two discussion points: the programming model and the (runtime) offload interface.

For a general programming model, our first option is using OpenMP’s ‘target’ pragma in nested form (perhaps as @kai_plociennik did):

#pragma omp target
{
    #pragma omp target parallel for
    for (int i = 0; i < 10; i++)
        a[i] = 10;
}

The second option is to use a new directive, or a new clause on the target directive, to mark the nested offloading target.

#pragma omp target
{
    #pragma omp ntarget parallel for // [new directive] nested target
    for (int i = 0; i < 10; i++)
        a[i] = 10;
}

or

#pragma omp target
{
    #pragma omp target parallel for nested // [new clause] nested target
    for (int i = 0; i < 10; i++)
        a[i] = 10;
}

We think the first option provides the more general programming interface for nested offloading.
In that case, we can also support various types of hosts and offloading targets without any special hints in the code.
The matching between each code section and its host/offload target could then be decided by a compile option or by the compiler itself.
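
As a purely hypothetical illustration of the compile-option route: only ‘-fopenmp-targets=’ exists in Clang today, while ‘-fopenmp-nested-targets=’ and the ‘pim’ target name are made up for this sketch.

# today: compile target regions for a GPU
clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda app.c

# hypothetical: additionally compile nested target regions for a PIM device
clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -fopenmp-nested-targets=pim app.c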
WDYT?

The second discussion point is the design of a unified API for nested offload.
As the LLVM Offload project progresses, we also want to provide a common API for nested offload, regardless of the types of host and target.

Extending libomptarget is one option, for the sake of compatibility.
However, the difference in host-device functionality seems to be a problem for reusing the existing libomptarget:
its operations are designed to execute on a CPU, and they may not all work on an accelerator acting as the host.

Since it seems difficult to provide all the functionality of the existing libomptarget (due to hardware limitations), it would be good to provide a common API suited to accelerators.
For example, we could expose only the functions essential for offload, such as kernel launch and data movement (see the sketch below).
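
A minimal sketch of what such an API could look like. All names here (the noff_* prefix and the signatures) are hypothetical, invented only to illustrate the “kernel launch plus data movement” scope; they are not part of libomptarget or the LLVM Offload project.

#include <stddef.h>
#include <stdint.h>

typedef int32_t noff_device_t; /* handle for a nested offload target */

/* Memory management on the nested target. */
void *noff_data_alloc(noff_device_t dev, size_t bytes);
void noff_data_delete(noff_device_t dev, void *ptr);

/* Data movement between the launching device (acting as host) and the target. */
int32_t noff_data_submit(noff_device_t dev, void *tgt_ptr,
                         const void *src_ptr, size_t bytes);
int32_t noff_data_retrieve(noff_device_t dev, void *dst_ptr,
                           const void *tgt_ptr, size_t bytes);

/* Kernel launch on the nested target. */
int32_t noff_kernel_launch(noff_device_t dev, const void *kernel_entry,
                           void **args, size_t num_args);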

We think that providing such a common API is reasonable regardless of the type of accelerator, but as the LLVM Offload project progresses, we wonder whether the characteristics of specific accelerators, or other practical issues, would make a common API difficult to provide.

Thanks
Sincerely
Youngjoo Ko

  1. Syntax: I don’t think there is a need for a new pragma or clause just yet. However, depending on the answer to 2), this might change.
  2. Matching: This is the tricky part. We have at least two options that are somewhat reasonable. Let’s assume we have a host and target X from which we want to offload to target Y:
    A) We could compile all target regions for both X and Y, place the two images next to each other in the host object, and move the Y image to X when we offload to X.
    B) We could let the user specify Y is the nested target to only compile the nested regions for Y. Without special syntax we’d likely also compile all orphaned regions for Y. We could then embed the Y image into the X image.
    I’m personally in favor of A). People can use #ifdef or begin/end declare variant to create code for a single target only (see the sketch after this list).
  3. API: The API we have should technically be sufficient. All you’d need to do is set up libomptarget and the plugins on X. If X can’t handle such a setup, you could still use a new library with this API. All that said, we will redesign the API soon. If you have specific points that you think are important, please participate in the meetings and the API working group.
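
For reference, a minimal sketch of the begin/end declare variant mechanism mentioned under 2A). This is existing OpenMP 5.x syntax that Clang supports; nvptx64 is just an example architecture selector, chosen here for illustration.

/* Generic version, used on any target without a matching variant. */
void compute(int *a, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = 10;
}

/* Specialized version, only used when compiling for nvptx64. */
#pragma omp begin declare variant match(device = {arch(nvptx64)})
void compute(int *a, int n) {
  /* nvptx64-specific implementation would go here. */
  for (int i = 0; i < n; ++i)
    a[i] = 10;
}
#pragma omp end declare variant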

Now, for something like the above to work you’d need to:

  • Allow nested target pragmas in clang and generate the same API calls for them (easy-ish; see the sketch after this list).
  • Generate the two images and embed them properly (A or B above) (medium).
  • Set up a libomptarget on X and handle image movement (unclear).
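
To give an idea of the first bullet: the declaration below is the old, simplified form of a libomptarget entry point (the current runtime uses richer variants with more arguments), shown only to indicate the shape of the call clang emits for a target region. A nested pragma would emit the same style of call, serviced by a runtime running on X instead of the host.

#include <stdint.h>

/* Old-style libomptarget entry point (simplified, for illustration only). */
int __tgt_target(int64_t device_id, void *host_ptr, int32_t arg_num,
                 void **args_base, void **args, int64_t *arg_sizes,
                 int64_t *arg_types);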

The next meeting is tomorrow, 7am Pacific. The .ics file is in the agenda and should be on the LLVM calendar.