Multi-arch deviceRTL status

Hello OpenMP dev and AOMP team,

It’s been a little while since my last update on deviceRTL changes, and we’ve just passed the point of having 50% of the code available under common/. Therefore, here is the state of play as I see it.

Design premise:

  • Provide a single interface.h file declaring everything in the deviceRTL library

  • Write a thin abstraction layer over synchronisation, atomics, architecture specific functions. This exists as target_impl.h, implemented for nvptx and amdgcn

  • Provide source under common/, written in terms of said abstraction, which can optionally be used by targets

  • Functions that are not drawn from common/ are implemented under target/

  • Provide a test suite written against interface.h

Status:

  • Interface implemented, looks OK. May want to reconsider how constants are shared with the compiler

  • Abstraction layer sufficient for most of the existing code. Needs an atomic wrapper, some refactoring

  • About half the SLOC is under common/. All of it is used by nvptx, and all of it will be used by amdgcn once atomics are wrapped

  • Some functions still missing from amdgcn in tree, all available in AOMP. WIP

  • Test suite is vapourware. I have undocumented plans

Next steps:

  • Rename files that don’t contain any CUDA to use the .cpp suffix

  • Fill in last gaps in target_impl

  • Build the testing infra

End goal:

  • Demonstrably correct (unit tested!) OpenMP device runtime library

  • Running on various nvptx and amdgcn GPUs with minimal compiler complexity. Now is a great time to join in as a third accelerator vendor

  • Support for combining generic implementations under common with target specialised versions

Thanks all,

Jon

I said it before, but I’ll say it again: thanks for your work on this!

Without this rewrite we could not (reasonably) develop, maintain, and
test our runtime library for more than one target. With it, I'm hopeful :wink:

The situation looks good already, but I was hoping we'd get AMD support up
and running before we fork off clang 10. So this part, and the TRegion
part, need to get done this year (almost). Do you think that is feasible
(for the AMD runtime and plugin)? (I'll resurrect the TRegions, rebase them
and move them into the OMPIRBuilder, so it will be mostly a question of
fast reviews on that part.)

Hey,

The amdgcn deviceRTL needs a shim around atomics and a copy of libcall.cu to be broadly functional. That seems minor.

There’s some refactoring work going on in the aomp branch to reduce the libraries it depends on. The nvptx/cuda openmp needs an entire second toolchain installed. I don’t want that to be true for amdgcn as well.

The hsa plugin is about 1200 lines total, already working, with a few outstanding todos and stylistic improvements available. Ron is looking at the todos at present. I’d be equally happy to iterate on that in tree - it’s not really code that can be used for other architectures so making it beautiful isn’t strictly necessary. It may also get reimplemented in terms of a different underlying API at some point next year.

Aside from that… it’s down to the clang/llvm support, and how much customisation it takes to target nvptx & amdgcn from the same code path. Hopefully the differences largely lie in the runtime. I need to pull down a copy of your patches and see what needs to be tweaked to get a second gpu target working.

Getting support in prior to the clang fork would make me happy. Up for working pretty long days to hit that. After the Christmas party tomorrow at least :slight_smile:

One hazard: the runtime makes use of function pointers, which the llvm amdgcn backend (i.e. llc) doesn’t support. We inline very aggressively so that mostly works out anyway, but there are a couple of places that route the function pointer through memory (reduction, iirc), and the aomp workaround for that is not pretty. I’m looking for better options.

Thanks!

Jon

> The amdgcn deviceRTL needs a shim around atomics and a copy of libcall.cu
> to be broadly functional. That seems minor.

Minor, agreed. The "copy" part only holds if "copy" means you split it into
common and target code, and call the target code through target_impl.h from
the common code :wink:

> There's some refactoring work going on in the aomp branch to reduce the
> libraries it depends on. The nvptx/cuda openmp needs an entire second
> toolchain installed. I don't want that to be true for amdgcn as well.

Sounds good, though that is a secondary goal (IMHO). If we get support
up and running people will install another toolchain :wink:

> The hsa plugin is about 1200 lines total, already working, with a few
> outstanding todos and stylistic improvements available. Ron is looking at
> the todos at present. I'd be equally happy to iterate on that in tree -
> it's not really code that can be used for other architectures so making it
> beautiful isn't strictly necessary. It may also get reimplemented in terms
> of a different underlying API at some point next year.

That sounds fair. Reusing plugin code was never a top priority.

> Aside from that... it's down to the clang/llvm support, and how much
> customisation it takes to target nvptx & amdgcn from the same code path.
> Hopefully the differences largely lie in the runtime. I need to pull down a
> copy of your patches and see what needs to be tweaked to get a second gpu
> target working.

I will try to put a rebase of the TRegion IRBuilder stuff on Phab tomorrow.
We can start with unit tests to trigger it. The runtime patches should
apply, but the interface is not "up to date".

> Getting support in prior to the clang fork would make me happy. Up for
> working pretty long days to hit that. After the Christmas party tomorrow at
> least :slight_smile:

Same here, though no Christmas party :wink:

> One hazard: the runtime makes use of function pointers, which the llvm
> amdgcn backend (i.e. llc) doesn't support. We inline very aggressively so
> that mostly works out anyway, but there are a couple of places that route
> the function pointer through memory (reduction, iirc), and the aomp
> workaround for that is not pretty. I'm looking for better options.

The better option is TRegion reductions, which never made it to Phab.
Function pointers will stay for some things, but we won't need to store
them, making constant propagation and inlining easy. Let's talk about
that later, though.