[RFC] Device Runtime: Directions and Build Requirements

Hi,

I wanted to inform the list of our preliminary plans for the device
runtime and plugin. Please provide feedback and concerns.

The overall goal is still to make the device runtime and plugin as
independent from the target as possible. While Jon already made
excellent progress, we still have "CUDA" files, and we actually compile
certain code as "CUDA", e.g., we use __shared__, __device__, ... I
don't think we need to do any of that. OpenMP should be sufficient to
create a device runtime on its own, and an efficient one when we also
use clang extensions where necessary (D74361).

A rough idea of what this would look like can be found here:
D77472 [OpenMP][PoC][WIP] An OpenMP-based OpenMP device runtime [NOT FOR COMMIT]

In this proof-of-concept we replaced __device__ with
`omp declare target to(NAME)`
and __shared__ with
`omp allocate(NAME) allocator(omp_pteam_mem_alloc)`.
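
To make the replacement concrete, here is a minimal sketch of the two
spellings next to their CUDA counterparts. The variable names
(ExecutionMode, TeamCounter) are made up for illustration and are not
taken from the runtime or from D77472. The guarded <omp.h> include is my
assumption about what a standalone file needs so that
omp_pteam_mem_alloc is declared when OpenMP is enabled.

```cpp
#include <cstdint>
#ifdef _OPENMP
#include <omp.h> // declares omp_pteam_mem_alloc and its handle type
#endif

// Hypothetical runtime state; names are illustrative only.

// CUDA spelling: __device__ uint32_t ExecutionMode;
uint32_t ExecutionMode = 0;
#pragma omp declare target to(ExecutionMode)

// CUDA spelling: __shared__ uint32_t TeamCounter;
// The allocate directive requests team-local memory for a variable
// with static storage duration.
static uint32_t TeamCounter;
#pragma omp allocate(TeamCounter) allocator(omp_pteam_mem_alloc)
```

Without OpenMP enabled the pragmas are ignored and both variables are
ordinary globals, so the same source still builds as plain C++.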

We should have all the critical parts in place to compile the runtime
only with the CUDA intrinsics known to Clang and the generic atomics for
NVIDIA. We wrap the CUDA intrinsics with `omp begin/end declare variant`
and can even have all intrinsics (NVIDIA, AMDGPU, ...) in the same file.
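
As a rough sketch of the wrapping described above, a single file could
look roughly like this. The function name getThreadIdInBlock and the
fallback that returns 0 are hypothetical and only serve the example; the
preprocessor guard is my addition so the file still builds when OpenMP
is off or the compiler does not support `begin declare variant`. The two
builtins are the ones Clang provides for the respective targets.

```cpp
// Generic fallback, used when no variant below matches (e.g. on the
// host). Hypothetical helper, not an existing runtime entry point.
int getThreadIdInBlock() { return 0; }

#if defined(_OPENMP) && defined(__clang__)
// Selected only when compiling for an NVPTX device.
#pragma omp begin declare variant match(device = {arch(nvptx, nvptx64)})
int getThreadIdInBlock() { return __nvvm_read_ptx_sreg_tid_x(); }
#pragma omp end declare variant

// Selected only when compiling for an AMDGCN device.
#pragma omp begin declare variant match(device = {arch(amdgcn)})
int getThreadIdInBlock() { return __builtin_amdgcn_workitem_id_x(); }
#pragma omp end declare variant
#endif
```

Callers only ever see getThreadIdInBlock(); which body is used is
decided by the context selector at compile time, so the target-specific
intrinsics never leak into the generic code.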

I'm not sure if the OpenMP driver can already emit a pure LLVM-IR
device module, but that seems to be a doable addition. That IR is then
linked into the device library bitcode (bc) file as we do right now,
which connects everything to the existing infrastructure.

Again, please provide feedback and concerns!

Cheers,
Johannes