There are fundamental differences between the GPU targets and there are incidental ones. To the extent we can abstract over the differences, we get to reuse optimisations and testing across the targets.
GPUs are vector machines modelled as SIMT in IR. This spawns a bunch of architecture-specific intrinsics for things like thread id within the warp or warp-level shuffles, with at least one variant for amdgpu and one for nvptx.
OpenMP has largely dealt with this by emitting calls into a runtime library whose implementations dispatch to the architecture-specific intrinsics; the OpenMP optimisations therefore act in part on those runtime functions.
I believe HIP uses header files that dispatch to the amdgpu intrinsic; I'm unclear what the story is for running HIP on nvptx. Fortran won't be using C++ header files and probably has its own dispatch layer, hopefully somewhere in MLIR.
Libc has its own header abstracting over these, with link-time selection of the target architecture.
I wish to collapse this divergence into the following and am seeking to uncover support or opposition:
1/ add an llvm.gpu.name intrinsic for each of these things
2/ add a codegen IR pass that lowers those intrinsics to the target specific ones, doing some limited impedance matching as required
3/ call that pass (early) from amdgpu and nvptx codegen
4/ add trivial clang builtins that expand to the llvm intrinsics
5/ fold the existing divergence onto these with incremental patches as one goes along
This is architecture-independent until the back end and provides a common substrate for the GPU programming languages to rely on. The work is mechanical; most of the mental energy probably goes into choosing names for the intrinsics that annoy all parties equally.