[RFC] Device runtime library (re)design

Short summary of the phone call we had Wednesday for people who
couldn’t make it, are not aware of the phone calls, or simply forgot:

  • We will split the requested changes into three different activities,
    and thus separate patch sets:
  1. splitting common logic and target specific code
  2. improving documentation
  3. unifying coding style
  • We need to improve the testing situation…
  • We need to agree on a directory structure and build model…

Great summary, thanks. Sounds good to me.

libomptarget* directory structure

  • everything currently contained in “openmp/libomptarget/”

Folder and file names (including file extensions) are subject to the
usual bikeshedding. Let’s agree on the idea first and the color of the
shed later.

Yep. I’ve spent a couple of days staring at the differences between amdgcn and nvptx and have come up with a few use cases for the dir structure and build process.

  • Totally device independent code. Can compile exactly the same thing for various targets. Basically not a problem, put in headers or src/common/foo.x. Hopefully as time goes on we’ll have more of this.

  • Necessarily device dependent. Access to architectural constants, performance counters, thread IDs or similar. To the extent they address common concerns, I’d like these to have a common interface. Or at least, common to hardware that has said concern. Functions like unsigned GetWarpId() perhaps. Declared in target_impl.h or equivalent.

  • Optionally device dependent. The runtime is performance sensitive and inevitably the general purpose implementation of functions (in terms of the target_impl API) will be considered for specialisation. In the extreme case a vendor may want to implement everything from scratch and use nothing from common, but I suspect wanting to override particular parts will be the usual case.

  • Vendor extensions. Extra functions, targeted from clang, that encapsulate some performance sensitive behaviour. Some will end up migrating into common, but a staging ground is likely to be helpful. Unlikely to be a problem in this context - it’s just code that happens to only exist under one arch subdir.
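To make the second category concrete, here is a minimal sketch of what a common target_impl.h interface plus one implementation might look like. All names are hypothetical (GetWarpId is mentioned above; GetWarpSize and the CPU bodies are illustrative assumptions, not proposed code):

```cpp
// Hypothetical common/include/target_impl.h: the device-independent
// interface that every architecture must implement.
unsigned GetWarpId();
unsigned GetWarpSize();

// Hypothetical CPU implementation (e.g. CPU/src/target_impl.cpp):
// a single "warp" of size one, purely for host-side testing.
unsigned GetWarpId() { return 0; }
unsigned GetWarpSize() { return 1; }
```

Common code would then be written entirely against the declarations, with the definitions supplied per architecture at link time.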

For the device RTL…
I now assume we will have a common core (probably header only), and
device specific code that may or may not use common core parts. The
structure I imagine looks like this:
openmp/libomptarget/deviceRTLs/common/include/{queue.h, target_impl.h,…}
openmp/libomptarget/deviceRTLs/nvptx/{libomptarget.cu,CMakeLists.txt,…,include/target_impl.h,…}
openmp/libomptarget/deviceRTLs/amdgpu/{libomptarget.hip,CMakeLists.txt,…,include/target_impl.h,…}
openmp/libomptarget/deviceRTLs/CPU/{libomptarget.XYZ,CMakeLists.txt,…,include/target_impl.h,…}
openmp/libomptarget/deviceRTLs/…/{libomptarget.XYZ,CMakeLists.txt,…,include/target_impl.h,…}
I added a CPU version above to ease testing (see below) and to provide a
target_impl.h declaration for better IDE support.

Seems reasonable. CPU offload target is an interesting exercise that risks derailing this discussion a bit.

  • The common files could be header only (debatable and not strictly necessary).

I’m less confident that the common code benefits from being header only. Declarations in a header plus source files that get linked in gain a degree of encapsulation that I suspect will be beneficial here. Reversible decision either way.

  • All source files…
  • To make…
  • put implementations in {nvptx,amdgpu,…}/include/target_impl.h.

Likewise, doesn’t seem to matter if the implementations are in the header or elsewhere under the architecture subdir.

  • have a #include_next "target_impl.h" in common/include/target_impl.h.
  • have the common/include and one device include folder in the
    include path (in this order) when we compile any file.

#include_next doesn’t appear to be necessary. One can either give the implementation header a different name or let target specific implementations be in arch/target_impl.x.
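For illustration, a sketch of the #include_next-free alternative, collapsed into one snippet (the file names in the comments and all function names are hypothetical):

```cpp
// common/include/target_impl.h would declare the interface only:
unsigned GetThreadId();

// Common code (e.g. some file under common/src/) includes just the
// declaration and calls through it:
unsigned LowerBoundForThread(unsigned chunk) {
  return GetThreadId() * chunk;
}

// Each architecture then defines the function in its own source file
// (e.g. nvptx/src/target_impl.cu or amdgpu/src/target_impl.hip).
// A stand-in single-threaded definition for illustration:
unsigned GetThreadId() { return 0; }
```

No include-order tricks are needed: the include path only ever contains common/include, and the linker picks up the one architecture-specific definition.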

I think your proposal (modulo the header-only discussion) covers the case of mostly common code that calls architecture specific leaf functions well. Notably that’s sufficient for today’s amdgcn target. I’d like to keep the option for architecture specific functions open.

One argument in favour of declarations in one shared header, #included by a source file that implements them for each architecture, is that it’s about as simple a model as one can get while keeping the arch dependent code separate. I’d personally prefer we start there.

Thanks all,

Jon

The reason I mentioned target_impl.h should be header only is that we can then use templates to construct similar target dependent things.
(See the barrier_impl in https://reviews.llvm.org/D64218) Though, that can obviously also be done differently, e.g., functions with constant arguments.
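To show the shape of that idea without reproducing D64218 (this is not the actual barrier_impl; BarrierKind and all bodies are made up for illustration), a header-only target_impl.h lets common code stamp out variants via templates, or via one function taking a constant argument:

```cpp
// Template form: each variant is a distinct specialisation the common
// code can instantiate by name.
enum class BarrierKind { Simple, Counting };

template <BarrierKind K> struct Barrier;

template <> struct Barrier<BarrierKind::Simple> {
  static int wait() { return 1; } // stand-in body
};
template <> struct Barrier<BarrierKind::Counting> {
  static int wait() { return 2; } // stand-in body
};

// Alternative mentioned above: one function with a constant argument,
// which the optimiser can specialise at each call site.
inline int barrier_wait(BarrierKind k) {
  return k == BarrierKind::Simple ? 1 : 2;
}
```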

Let's forget about the header stuff, extract the code in a "natural way" and see what happens.

Next steps:
  There are three pending patches for which we could:
   - go ahead with the review,
   - abandon and try to separate the "extraction" part, or
   - abandon and start fresh.
Regardless, there are many more files that need extraction.
@Jon: Would you start posting patches for them? I would not change the directory structure initially, though, but simply create the target_impl.X stuff in the same folder.

Cheers,
  Johannes

One diff currently up uses inline free functions to implement target specific stuff. That’s not obviously a good idea so my reasoning is below. Please let me know what you think.

Avoiding call overhead means implementations need to be in headers, or the nvptx cmake needs to apply cross-translation-unit optimisation. I’m wary of doing the latter without access to heavy test coverage.

In general, I like static member functions with CRTP for interfaces. No overhead, can wire in compile time constraints. Quite a lot of boilerplate. Probably appropriate for things like shuffle. Doesn’t handle partial implementations especially well, e.g. if only part of it is used by a particular target.
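A minimal sketch of the CRTP shape (all names hypothetical; the CPU body assumes a warp of one lane for illustration):

```cpp
// The common interface: static member functions that forward to the
// target's static implementation. No virtual dispatch, no overhead,
// and a missing member fails at compile time.
template <typename Target> struct ShuffleInterface {
  static unsigned down(unsigned var, unsigned delta) {
    return Target::shuffle_down(var, delta);
  }
};

// A hypothetical CPU "target": with a single lane, a shuffle just
// reads back the caller's own value.
struct CpuTarget : ShuffleInterface<CpuTarget> {
  static unsigned shuffle_down(unsigned var, unsigned /*delta*/) {
    return var;
  }
};
```

The boilerplate cost is visible even in this tiny example, which is the trade-off mentioned above.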

For compatibility shims around popcount or atomicMax we probably want overloaded free functions or templates though.
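A sketch of the free-function shim shape, assuming a GCC/Clang-style compiler (the Popcount name is made up; the builtins are the usual __builtin_popcount family):

```cpp
// Overloads forward each integer width to the matching builtin, so
// common code can just write Popcount(x) regardless of the type.
inline int Popcount(unsigned x) { return __builtin_popcount(x); }
inline int Popcount(unsigned long x) { return __builtin_popcountl(x); }
inline int Popcount(unsigned long long x) { return __builtin_popcountll(x); }
```

An atomicMax shim would follow the same pattern, with each architecture subdir supplying the overload bodies in terms of its own intrinsics.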

So I don’t yet know what the right interface will be. It should be safe, obvious how to implement, flexible enough to adapt to new targets. Allow shared implementations. Classically difficult problem.

As an interim, I’d like to use free functions. They’re not very safe and the interface is implicit: I wrote the amdgcn version of target_impl by copying and then changing the bodies. Bad times. But they’ll be very easy to move behind whatever set of interfaces looks like the right choice when the fog clears. Definitely not my preferred final design.

Cheers!

Jon