The deviceRTL dependency graph is not great. TLDR: I want to use a unity build to work around nvcc lacking LTO so that we don’t have to write everything in circularly referenced headers.
omptarget-nvptx uses symbols from support and supporti uses symbols from omptarget-nvptx. This currently works as supporti is included at the end of omptarget-nvptx and everything includes omptarget-nvptx, but it means that support.h doesn’t work if you include it anywhere else.
This is sad because support.h declares an API that wraps the built-in CUDA variables, which would be really useful, except that I can’t use it in debug.h because all headers use debug, including omptarget-nvptx.
The root problem is that everything includes omptarget-nvptx and most stuff is implemented inline in headers, quite a lot of which is then used by omptarget-nvptx. Related hazards are some missing include guards and headers that don’t work unless they’re included in a specific order relative to each other.
I can cut through this tangle if we agree that moving functions out of headers is fair game. amdgcn inlines everything anyway so we lose no optimisation. The clang build of nvptx deviceRTL can do likewise. It’ll also make target_impl substantially more readable as I can put the function declarations under a common directory and link in implementations.
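A minimal sketch of the split I have in mind, written as one file so it compiles on the host. The comment banners mark what would be separate files, and every function name here is hypothetical rather than the real deviceRTL API. The constant return values stand in for the CUDA builtins, and real device code would also carry the usual device qualifiers.

```cpp
#include <cassert>

// ---- support.h (sketch): declarations only, includes no other header ----
#ifndef SKETCH_SUPPORT_H
#define SKETCH_SUPPORT_H
int GetThreadId();    // hypothetical wrapper; would read threadIdx.x on device
int GetNumThreads();  // hypothetical wrapper; would read blockDim.x on device
#endif

// ---- debug.h (sketch): can now depend on support.h without a cycle ----
#ifndef SKETCH_DEBUG_H
#define SKETCH_DEBUG_H
int DebugTag(int code);  // hypothetical helper that tags a code with the thread
#endif

// ---- support.cpp (sketch): the implementation lives out of line ----
int GetThreadId() { return 3; }     // host stand-in value, not a real builtin
int GetNumThreads() { return 32; }  // host stand-in value, not a real builtin

// ---- debug.cpp (sketch): uses support through its declarations only ----
int DebugTag(int code) {
  assert(code >= 0);
  return code * GetNumThreads() + GetThreadId();
}
```

Since the headers only declare things, they can be included in any order and from anywhere, debug.h included.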
The internet suggests that nvcc is not capable of cross-translation-unit optimisations. Newer versions may be, but it’s difficult to search for. I suggest we #include all the source into a single TU and build that when using nvcc, game dev style.
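Roughly what the single TU would look like, as a sketch. The file names and functions are hypothetical, the warp size of 32 is an assumption, and the source bodies are pasted inline where the #include lines would expand so that the example compiles on its own.

```cpp
#include <cassert>

// ---- unity.cpp (hypothetical name): the only file handed to nvcc ----
// In the real tree this file would contain nothing but:
//   #include "support.cpp"
//   #include "reduction.cpp"
// Below, each body appears where its #include would expand.

// ---- expansion of support.cpp ----
int LaneId(int tid) { return tid % 32; }  // 32 is an assumed warp size

// ---- expansion of reduction.cpp ----
int LaneId(int tid);  // the declaration it would normally get from support.h
bool IsWarpLeader(int tid) {
  assert(tid >= 0);
  // LaneId's definition is visible in this same TU, so the compiler
  // can inline it here without any link-time optimisation.
  return LaneId(tid) == 0;
}
```

The same source files can still be compiled separately and linked by a toolchain that inlines across translation units, so only the nvcc build needs the extra file.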