Is nested offloading from host to accelerator A to accelerator B possible?

Dear all,

Sorry for the repeated question, but since this is really important for our project, I want to ask again: is it possible to have some kind of nested offloading from the host to accelerator A and then on to accelerator B? For example, in the style

// start on host
#pragma omp target              // first offload: host -> accelerator A
{
  #pragma omp target            // second offload: accelerator A -> accelerator B
  {
    // code to be executed on accelerator B, reached via
    // first offload to accelerator A, then to accelerator B
  }
}

or using “nested declare target”? Any hints on whether this is possible would be greatly appreciated.
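
For concreteness, here is a minimal sketch of the “nested declare target” variant I have in mind, with placeholder function names and assuming such nesting were allowed at all: a function compiled for accelerator A via declare target itself contains a target region that should offload further to accelerator B.

#pragma omp declare target
// compiled for accelerator A, but contains a target region that should
// offload further to accelerator B
void work_on_B(double *x, int n)
{
  #pragma omp target map(tofrom: x[0:n])
  for (int i = 0; i < n; ++i)
    x[i] *= 2.0;
}
#pragma omp end declare target

// on the host: first offload to accelerator A, which would then offload to B
void run(double *x, int n)
{
  #pragma omp target map(tofrom: x[0:n])
  work_on_B(x, n);
}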

Best regards,

Kai Plociennik

On the spec side, I don’t think nested target would be allowed. On the implementation side, we also don’t support launching a target region from a target device.

OK thanks, that helps!

If accelerator A is a CPU, one can probably make this work. If not, more pieces are missing.

Thanks @jdoerfert, yes, the intermediate accelerator A is a RISC-V CPU.

Meanwhile, I received an interesting response from the OpenMP Architecture Review Board: openmp - Question to ARB on target construct limitations - Stack Overflow

To summarize their answer, nested offload is not prohibited, and they are even working on supporting this concept in the future. They also encourage me to try to implement it, which, as far as I understand, might be helpful for them. Hence, it would be interesting for us to try to add support for this in our downstream Clang/LLVM, so any hints on my following naive approach would be very helpful.

As far as I see, we could implement nested offload by doing the following:

  • Currently, Clang aborts compilation with an error when encountering a plain nested #pragma omp target. Hence, we first have to make such source code be accepted and processed further.
  • I would assume that OpenMP code generation in Clang would then in principle work and create the correct outlined functions and calls into the OpenMP runtime (see the rough sketch after this list). If not, we have to fix this.
  • Currently, the way -fopenmp-targets=... is handled allows specifying targets only for the first level of offload. We would have to extend this so that we can do something like -fopenmp-targets-first-level=... and -fopenmp-targets-second-level=....
  • Currently, -fopenmp-is-device is added to the Clang invocations for the device. This prevents Clang from recursively performing on the device all the instrumentation it does when creating the fat host binary. If we enable this, roughly the same compilation and linking steps, with intermediate results, that are done for the combination “host, accelerator A” have to be done for the combination “accelerator A, accelerator B”. We would have to remove -fopenmp-is-device and maybe adapt the Clang toolchain for accelerator A (RISC-V) to perform the proper steps.
  • Currently, building a statically linked “executable” (object file) for our bare-metal RISC-V accelerator A would be enough for our use case. This means I would try to build on something like libc and libstdc++ and statically link libomp and libomptarget for RISC-V against them, together with our user device code. If I understand correctly, as long as we can make sure that basic loading/initialization of the resulting binary on the RISC-V accelerator happens correctly, we might succeed.
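
To make the second and fourth points a bit more concrete, here is a very rough structural sketch of the pieces I imagine the compiler would have to produce, and of the binary each piece would live in. The names are placeholders; in particular, launch_on_device only stands in for the real libomptarget kernel-launch entry points, whose names and signatures differ.

#include <cstddef>

// Placeholder for the offloading runtime's "launch this kernel on that
// device" functionality; not the real libomptarget interface.
using kernel_fn = void (*)(double *, std::size_t);
static void launch_on_device(int device_id, kernel_fn kernel,
                             double *data, std::size_t n)
{
  // A real implementation would map the data, look up the device image
  // containing `kernel`, and invoke it through the device plugin.
  (void)device_id;
  kernel(data, n);
}

// Would live in the image built for accelerator B.
static void outlined_for_B(double *x, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    x[i] *= 2.0;
}

// Would live in the image built for accelerator A (RISC-V). Note that it has
// to be able to launch kernels itself, which is exactly what the current
// -fopenmp-is-device handling rules out.
static void outlined_for_A(double *x, std::size_t n)
{
  launch_on_device(/*accelerator B*/ 1, outlined_for_B, x, n);
}

// Would live in the host binary.
void host_entry(double *x, std::size_t n)
{
  launch_on_device(/*accelerator A*/ 0, outlined_for_A, x, n);
}

In particular, accelerator A would need its own copy of the offloading runtime to be able to launch outlined_for_B, which is what statically linking libomp and libomptarget for RISC-V in the last point is meant to cover.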

Do you think the above plan could work? Or do you see any problems or necessary modifications?

You can already specify multiple targets, -fopenmp-targets=a,b,c, and that should be sufficient. Given the lack of control-flow and context-sensitive analysis in the frontend, there is no “level” when we encounter orphaned target constructs anyway.

I somehow doubt we want to remove it, though this might depend on what the final setup would look like. As an example, does the host or offload target 1 register the kernels for target 2? I’m not sure whether this particular question makes a difference, but I think it would be good to first lay out the overall design before we go into detailed questions like the is-device flag. We might need to keep it in a modified form; let’s see.

The new driver for OpenMP (-fopenmp-new-driver) created by @jhuber6 links the same libraries into an offload CPU image as you specify for the host. That means that if you have static libraries for everything you need, and those contain OpenMP target offload code for the accelerator CPU architecture, things should be linked together just fine.

IIRC, Atmn Patel looked at hierarchical offloading as part of our remote offloading work. I will ask him and start an email thread, as he’s not registered here.

OK, thanks a lot for your hints. So my “naive approach” would indeed be naive; it seems I have to understand the existing architecture a lot better, and that an overall design for this still has to be developed.