[RFC] Port RTL to openmp

From: “Doerfert, Johannes via Openmp-dev” <openmp-dev@lists.llvm.org>


> > I'm proposing defining the shared variables with pragma omp target declare
> > and rewriting the uses of volatile to follow c++ semantics. Expecting
> > some clang work around address spaces and constructors.
>
> We need to give the users a way to declare shared (global and dynamic)
> memory anyway so figuring this out is worth it for sure.
>

That is well observed. Accessing device memory from OpenMP is likely to be
necessary on performance grounds, independent of the runtime implementation
tradeoffs.

OK. Let's do this now then.

I can't find a description of this in the OpenMP spec. There is a proposal
from 2017 at https://www.openmp.org/wp-content/uploads/openmp-TR5-final.pdf
which describes a considerable superset of such functionality. I don't know
the background there, but I can imagine the scope of the paper is
challenging at committee stage.

It seems to be an official technical report (TR) by the committee. See
page 159ff in the current "standard" TR8.

I'd suggest following the syntax from that paper and, once the
implementation is sound, proposing it for inclusion in the standard. Would
you be open to reviewing such a proposal? I'm not sure how it would fit in
with the current rewrite, so rough pointers are welcome there. I'm also
happy to write it for nvptx first.

I looked around a little and I'm wondering what happens if we just replace
  ...
  DEVICE SHARED uint32_t usedMemIdx;
  DEVICE SHARED uint32_t usedSlotIdx;
  ...

from common/src/omp_data.cu, with

  #pragma omp begin declare target
    ...
    /* shared */ uint32_t usedMemIdx;
    /* shared */ uint32_t usedSlotIdx;
    ...

  #pragma omp allocate(usedMemIdx, usedSlotIdx) allocator(omp_pteam_mem_alloc)
  #pragma omp end declare target

in a cpp version of the file.

@Ravi The above is a perfect use case for attributes if we don't have
      them already! /* shared */ would be replaced by the attribute and
      we would directly see what's happening.

> I'm in favor with a caveat. An OpenMP implementation of the runtime was
> on my agenda for the (far ahead) future anyway. However, we have a lot
> of ongoing projects and I would postpone this *iff* we have a workaround
> for the AMDGPU backend for now. If not, this is very reasonable.
>

There are paths forward for the AMDGPU backend via HIP, but timescales and
details are unclear. The pragmatic win would be for a company that doesn't
have a CUDA or HIP implementation and doesn't want to ship a bunch of
vendor extensions to OpenCL.

You have me convinced.