Is nested offloading from host to accelerator A to accelerator B possible?

Dear all,

Sorry for the repeated question, but since this is really important for our project, I want to ask again: is it possible to have some kind of nested offloading from the host to accelerator A and then on to accelerator B? For example, in the style

// start on host
#pragma omp target              // first offload: host -> accelerator A
{
  #pragma omp target            // second offload: accelerator A -> accelerator B
  {
    // code to be executed on accelerator B, reached via
    // first offload to accelerator A, then to accelerator B
  }
}

or using “nested declare target”? Any hints on whether this is possible would be greatly appreciated.
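
For concreteness, here is a minimal sketch of the “nested declare target” variant I have in mind, with placeholder function names and assuming such nesting were allowed at all: a function compiled for accelerator A via declare target itself contains a target region that should offload further to accelerator B.

#pragma omp declare target
// compiled for accelerator A, but contains a target region that should
// offload further to accelerator B
void work_on_B(double *x, int n)
{
  #pragma omp target map(tofrom: x[0:n])
  for (int i = 0; i < n; ++i)
    x[i] *= 2.0;
}
#pragma omp end declare target

// on the host: first offload to accelerator A, which would then offload to B
void run(double *x, int n)
{
  #pragma omp target map(tofrom: x[0:n])
  work_on_B(x, n);
}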

Best regards,

Kai Plociennik

On the spec side, I don’t think nested target would be allowed. On the implementation side, we also don’t support launching a target region from a target device.

OK thanks, that helps!

If accelerator A is a CPU, one can probably make this work. If not, more pieces are missing.

Thanks @jdoerfert, yes, the intermediate accelerator A is a RISC-V CPU.

Meanwhile, I received an interesting response from the OpenMP Architecture Review Board: openmp - Question to ARB on target construct limitations - Stack Overflow

To summarize their answer, nested offload is not prohibited, and they are even working on supporting this concept in the future. They also encourage me to try to implement it, which, as far as I understand, might be helpful for them. Hence, it would be interesting for us to try to add support for this in our downstream Clang/LLVM, so any hints on my following naive approach would be very helpful.

As far as I see, we could implement nested offload by doing the following:

  • Currently, Clang aborts compilation with an error when encountering a plain nested #pragma omp target. Hence, we first have to make such source code be accepted and processed further.
  • I would assume that OpenMP code generation in Clang would then in principle work and create the correct outlined functions and calls into the OpenMP runtime (see the rough sketch after this list). If not, we have to fix this.
  • Currently, the way -fopenmp-targets=... is handled allows specifying targets only for the first level of offload. We would have to extend this so that we can do something like -fopenmp-targets-first-level=... and -fopenmp-targets-second-level=....
  • Currently, -fopenmp-is-device is added to the Clang invocations for the device. This prevents Clang from recursively performing on the device all the instrumentation it does when creating the fat host binary. If we enable this, roughly the same compilation and linking steps, with intermediate results, that are done for the combination “host, accelerator A” have to be done for the combination “accelerator A, accelerator B”. We would have to remove -fopenmp-is-device and maybe adapt the Clang toolchain for accelerator A (RISC-V) to perform the proper steps.
  • Currently, building a statically linked “executable” (object file) for our bare-metal RISC-V accelerator A would be enough for our use case. This means I would try to build on something like libc and libstdc++ and statically link libomp and libomptarget for RISC-V against them, together with our user device code. If I understand correctly, as long as we can make sure that basic loading/initialization of the resulting binary on the RISC-V accelerator happens correctly, we might succeed.
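
To make the second and fourth points a bit more concrete, here is a very rough structural sketch of the pieces I imagine the compiler would have to produce, and of the binary each piece would live in. The names are placeholders; in particular, launch_on_device only stands in for the real libomptarget kernel-launch entry points, whose names and signatures differ.

#include <cstddef>

// Placeholder for the offloading runtime's "launch this kernel on that
// device" functionality; not the real libomptarget interface.
using kernel_fn = void (*)(double *, std::size_t);
static void launch_on_device(int device_id, kernel_fn kernel,
                             double *data, std::size_t n)
{
  // A real implementation would map the data, look up the device image
  // containing `kernel`, and invoke it through the device plugin.
  (void)device_id;
  kernel(data, n);
}

// Would live in the image built for accelerator B.
static void outlined_for_B(double *x, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    x[i] *= 2.0;
}

// Would live in the image built for accelerator A (RISC-V). Note that it has
// to be able to launch kernels itself, which is exactly what the current
// -fopenmp-is-device handling rules out.
static void outlined_for_A(double *x, std::size_t n)
{
  launch_on_device(/*accelerator B*/ 1, outlined_for_B, x, n);
}

// Would live in the host binary.
void host_entry(double *x, std::size_t n)
{
  launch_on_device(/*accelerator A*/ 0, outlined_for_A, x, n);
}

In particular, accelerator A would need its own copy of the offloading runtime to be able to launch outlined_for_B, which is what statically linking libomp and libomptarget for RISC-V in the last point is meant to cover.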

Do you think the above plan could work? Or do you see any problems or necessary modifications?

You can already specify multiple targets, -fopenmp-targets=a,b,c, and that should be sufficient. Given the lack of control-flow and context-sensitive analysis in the frontend, there is no “level” when we encounter orphaned target constructs anyway.

I somehow doubt we want to remove it, though this might depend on what the final setup would look like. As an example, does the host or offload target 1 register the kernels for target 2? I’m not sure whether this particular question makes a difference, but I think it would be good to first lay out the overall design before we go into detailed questions like the is-device flag. We might need to keep it in a modified form; let’s see.

The new driver for OpenMP (-fopenmp-new-driver) created by @jhuber6 links the same libraries into an offload CPU image as you specify for the host. That means that if you have static libraries for everything you need, and those contain OpenMP target offload code for the accelerator CPU architecture, things should be linked together just fine.

IIRC, Atmn Patel looked at hierarchical offloading as part of our remote offloading work. I will ask him and start an email thread, as he’s not registered here.

OK, thanks a lot for your hints. So my “naive approach” would indeed be naive; it seems I have to understand the existing architecture a lot better, and that an overall design for this still has to be developed.