[RFC] Device runtime library (re)design

Tl;dr
  We should extract device specific code out of the OpenMP deviceRTL such that we can reuse the common logic (>90%) for all devices.
  We also need to improve the documentation and we should think about bringing the code into the LLVM coding style.

Requested changes:
I would like to change the OpenMP device runtime library design (openmp/libomptarget/deviceRTLs) towards the following goals:
1) Allow reuse of common logic between different devices in a clean and extensible way.
2) Improve the documentation of the code, e.g., doxygen comments and inline comments.
3) Follow the same coding style as LLVM core.

Disclaimer:
First, I do not want to say it is currently impossible to reuse the code for other devices, or that the code is not documented at all. What I
think is that we can improve both substantially if we choose to do so. Also, a change in coding style is easier now than later, so if we
decide to do refactoring, that can be included without adding too much churn.

Motivation:
We can certainly discuss whether we should do any of the proposed changes, but I think most of them have clear benefits. I am also not the first
to suggest them. Point 1 was mentioned with the initial drop of the device runtime [0], but it was rejected for time reasons. Point 2
was recently discussed as a pressing issue in multiple reviews. Point 3 is a general observation as writing and reviewing code for the
OpenMP subproject is unnecessarily hard for LLVM developers due to the different coding style.

Proposed structure:
In order to ease reuse by new devices we should have a common core with device-independent logic, e.g., in
  openmp/libomptarget/deviceRTLs/common
including an interface that declares all device-specific methods needed by the core logic. The interface is then the
only thing implemented in the device subfolders, e.g.,
   openmp/libomptarget/deviceRTLs/nvptx, openmp/libomptarget/deviceRTLs/amdgpu, ...
To get to this goal, all device-specific code has to be extracted from the core logic. The prototypes below show that this
is fairly easy to do.
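To make the intended split concrete, here is a minimal sketch. Every name in it (targetThreadId, staticChunkBounds, the file paths in the comments) is made up for illustration and is not an actual deviceRTL symbol:

```cpp
#include <cstdint>

// common/target_interface.h (sketch): the device-specific methods the
// core logic needs. Only *declared* here; each device folder defines them.
uint32_t targetThreadId();  // e.g., threadIdx.x on nvptx
uint32_t targetTeamSize();  // e.g., blockDim.x on nvptx

// common/worksharing.h (sketch): generic logic written purely against the
// interface above; it never mentions CUDA or HIP directly.
inline void staticChunkBounds(uint32_t TripCount, uint32_t &Lo, uint32_t &Hi) {
  uint32_t Chunk = (TripCount + targetTeamSize() - 1) / targetTeamSize();
  Lo = targetThreadId() * Chunk;
  Hi = Lo + Chunk < TripCount ? Lo + Chunk : TripCount;
}

// CPU/target_impl.cpp (sketch): a host implementation of the interface,
// useful for testing the common logic without a GPU; this one pretends
// to be "thread 0 of a 4-thread team".
uint32_t targetThreadId() { return 0; }
uint32_t targetTeamSize() { return 4; }
```

The point is that the worksharing logic above would be byte-for-byte identical for nvptx, amdgpu, and a host build; only the two one-line interface functions differ per device.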

Feasibility and prototypes:
To showcase the direction I would like to move in, I "redesigned" three files (out of ~20) with the above goals in mind. The patches
can be found here:
  D64217 [OpenMP][NFCI] Cleanup the target state queue implementation
  https://reviews.llvm.org/D64218
  D64219 [OpenMP][NFCI] Cleanup the target worksharing implementation
Note that there is a vast design space even if we agree to the above three goals. As a consequence, I'd like us to use the patches to
discuss general design decisions, not specific ones, until we agree on a path forward.

Please let me know what you think,
  Johannes

[0] D14254 [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

This proposal does not try to have a pure C++ device runtime library. I want a core implemented in C++, or at least interpretable as C++, and device-specific parts implemented in a different clang-supported language.

When I tried to do the OpenMP rewrite a while ago I also realized we would need OpenMP 5.0 or even 5.1 features to do so. Even once we have those, there will still be device-specific parts that require other languages or language extensions, e.g., intrinsics. However, clang already supports CUDA intrinsics in offloaded regions when we are targeting nvptx.

You are proposing to use C++ in the OpenMP RTL? That will make it dependent on the C++ compiler runtime, no?

Jeff

Not necessarily. First, I do not really "want/need" to use C++. Second, even if we did we could do so without the runtime part.

The main observation is the following: ~95% of the "CUDA" code in the device runtime library right now is not "CUDA" code but generic C/C++ code interpreted as CUDA.

At the end of the day it doesn't matter if the file ending is .cu, .cpp, or .hip; what matters is how we interpret the code. Once we separate the generic code from the
device (language) specific parts, we can interpret the generic code as CUDA when we target nvptx, as HIP when we target AMDGPU, and as something else when we target another device.

One possible design that would limit the "interpretation decision" to a single file would be as follows:

deviceRTL.X // entry point for the device runtime library which includes all headers below. Compiled with "clang -x LANGUAGE ..."

common/abc.h // implementation of functionality abc in a generic way
common/...
common/xyz.h

nvptx/target_impl.h // defines macros specific to the device language and provide implementation for device specific code, e.g., what is the spelling of an atomic_add.
amdgpu/target_impl.h // defines macros specific ...
...

If you look at the design and use case of the device runtime right now, this is basically what we do anyway. Some code is put into "CUDA" files, but it is linked at the IR level anyway for performance.
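A hedged sketch of what such a target_impl.h split could look like; the names below are hypothetical (not the real deviceRTL symbols), and the host stand-in merely mimics the shape of the device versions:

```cpp
#include <cstdint>

// nvptx/target_impl.h, compiled as CUDA, would spell the device-specific
// bits, e.g. (shown as a comment since it needs a CUDA compiler):
//   inline uint32_t targetAtomicAdd(uint32_t *Addr, uint32_t V) {
//     return atomicAdd(Addr, V);
//   }
// amdgpu/target_impl.h, compiled as HIP, would spell the same function
// with the HIP atomic instead.

// A host stand-in with the same shape so the sketch is runnable here;
// __atomic_fetch_add is the clang/GCC builtin and returns the old value.
inline uint32_t targetAtomicAdd(uint32_t *Addr, uint32_t V) {
  return __atomic_fetch_add(Addr, V, __ATOMIC_SEQ_CST);
}

// common/counter.h (sketch): generic code that only ever uses the
// interface spelling, never a device-specific atomic directly.
inline uint32_t bumpCounter(uint32_t *Counter) {
  return targetAtomicAdd(Counter, 1u);
}
```

The common file compiles unchanged whether the translation unit is interpreted as CUDA, HIP, or plain C++; only the one-line atomic spelling in target_impl.h changes.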

I think there have already been some discussions on the base language for this effort, and it is also one of my concerns.

There can be a situation where C++ is not the best language to use for a certain device RTL development, so it may not be a good idea to write the "common" code in C++.

Thanks,
Hansang

Hi Hansang,

I have the feeling the file ending and modeline (file comment) changes I made overshadow the important parts, namely the separation of device-agnostic runtime logic and device-specific implementation.

To exaggerate a bit:
   If people really want the modeline to say "CUDA" and the file ending to be ".cu", that is at the end of the day OK with me.
   After this rewrite, with the separation of common logic and device implementation, I can simply interpret the common logic "CUDA" files differently if I need to, e.g., as HIP for AMDGPU.

As I already mentioned in the other discussion [0], we can easily have a design in which none of the files has a ".cpp" ending.
Even if they have, that does not mean we actually interpret/compile them as C++ files ("clang -x cuda a.cpp" works just fine).
Calling them CUDA is simply misleading if they are not CUDA specific.
Calling them C++ might be misleading as well, I agree, but for now the logic part is basically C++ code.
During the rewrite we can make the code C, though I would prefer not to.

Thanks,
  Johannes

[0] http://lists.llvm.org/pipermail/openmp-dev/2019-July/002589.html

Johannes,
   Suppose we define today some common logic based on existing deviceRTLs. Are you requiring that all future deviceRTLs support this? What if they cannot? Are you going to change the common logic, or claim that the new deviceRTL is not compliant?
Thanks
Ravi

Good point.

I would say we change it where it makes sense, e.g., we allow overriding some parts with a different logic. Though, at the end of the day, it is the interface that needs to be implemented; how is up to the deviceRTL. What I want is to avoid duplication as much as possible, so the idea is not to force some logic, e.g., the current GPU-focused one, onto an implementation but to offer it. Maybe we will end up with multiple "common logics", but any reuse we can get out of it is worth separating the device-specific parts, especially since there is no runtime cost to it.

Does that make sense and answer your question?

Do you have a proposal based on what is in open source? We can take a look at it and provide feedback based on what we have implemented for our arch.

Thanks

Ravi

Yes, the RFC I sent to openmp-dev has three example patches that clean up the code, add documentation, and extract target-specific code. Once we have done this, we can move the files.

Short summary of the phone call we had Wednesday for people that
couldn't make it, are not aware of the phone calls, or simply forgot:
  - We do split the requested changes into three different activities,
    thus separate patch sets:
    1) splitting common logic and target specific code
    2) improving documentation
    3) unifying coding style
  - We need to improve the testing situation, both CI testing and test
    coverage. Tasks include:
    * Set up buildbots that run OpenMP tests and report the results,
      detect failures, maybe later measure performance, etc.
    * Add OpenMP execution tests, both host and device, to the OpenMP
      subproject (see also below) and the LLVM Test Suite.
  - We need to agree on a directory structure and build model for the
    device runtime library or better libomptarget (potentially later
    also for other parts). As discussed, I outline what I think is a
    good solution below.

Please raise concerns or objections if you have any, comments and
encouragement are obviously also welcome!

If no objections are raised, we will go ahead with D65836 and hopefully
make faster progress from there.

Hi Johannes,

Thanks for working on this. A brief question inline…


libomptarget* directory structure

  • everything currently contained in “openmp/libomptarget/”

Folder and file names (including file extensions) are subject to the
usual bikeshedding. Let’s agree on the idea first and the color of the
shed later.

Sources

For the device RTL it currently looks like this:
openmp/libomptarget/deviceRTLs/nvptx/{docs,src,test,CMakeLists.txt}
The test structure is discussed later. docs is pretty much empty and it
should probably live with the common logic. src is discussed now:

I now assume we will have a common core (probably header only), and
device specific code that may or may not use common core parts. The
structure I imagine looks like this:
openmp/libomptarget/deviceRTLs/common/include/{queue.h, target_impl.h,…}
openmp/libomptarget/deviceRTLs/nvptx/{libomptarget.cu,CMakeLists.txt,…,include/target_impl.h,…}
openmp/libomptarget/deviceRTLs/amdgpu/{libomptarget.hip,CMakeLists.txt,…,include/target_impl.h,…}
openmp/libomptarget/deviceRTLs/CPU/{libomptarget.XYZ,CMakeLists.txt,…,include/target_impl.h,…}
openmp/libomptarget/deviceRTLs/…/{libomptarget.XYZ,CMakeLists.txt,…,include/target_impl.h,…}
I added a CPU version above to ease testing (see below) and to provide a
target_impl.h declaration for better IDE support.

  • The common files could be header only (debatable and not strictly necessary).
  • All source files that need to be compiled and linked into a library
    for this device (as IR or object files) are listed in the device
    CMakeLists.txt file. This can include common source files if the
    common files are not header only. There is no need to include any
    common files if the device implementation is self-contained.
  • To make target definitions available to common files we would:

Why do target definitions need to be made available to common files? Is it only for the case that common has source not just headers, and that source must be compiled per device?

Thanks.

Joel


> Why do target definitions need to be made available to common files? Is it
> only for the case that common has source not just headers, and that source
> must be compiled per device?

Good question. Yes, and especially when the common source (that is
compiled alone) contains references to template functions/classes which
we can only declare in the common code but need to define in the target
code. The atomic add example below was my motivation. We want a generic
atomic add for various types so templates seemed like a good option but
the definition is target specific. As we discussed before, we might not
want to use templates after all but specify the versions we would
generate explicitly. We then wouldn't need to make the definitions
available. Neither would we need that for uses in the "header only"
part of the common code.
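A tiny sketch of the template situation described above, with hypothetical names; a host definition stands in for the target-specific one:

```cpp
#include <cstdint>

// common/target_decls.h (sketch): the common code only *declares* the
// template. A common source file compiled on its own references it, which
// is exactly why the target definitions must be made available to it.
template <typename T> T targetAtomicAdd(T *Addr, T V);

// common/counter.h (sketch): a use site in common code, generic over T.
template <typename T> T bumpBy(T *Counter, T V) {
  return targetAtomicAdd(Counter, V);
}

// nvptx/target_impl.h would define targetAtomicAdd via atomicAdd; here a
// host stand-in (clang/GCC __atomic builtin) so this sketch compiles:
template <typename T> T targetAtomicAdd(T *Addr, T V) {
  return __atomic_fetch_add(Addr, V, __ATOMIC_SEQ_CST);
}
```

The alternative mentioned above, explicitly enumerating the needed overloads instead of using a template, would indeed remove the need to expose the target definitions to the common sources.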

Does that make sense?

[Adding back the list, which I dropped accidentally.]