Hello all. Seeking support and/or warning of opposition before I make a start on the following. Tagging mlir because they might like to work with a single GPU target that turns into amdgcn/nvptx somewhere downstream.
There are lots of differences in the details between GPUs. Despite that, we’ve had reasonable success emitting LLVM bitcode that ignores some of those differences and leaves it to the backend to sort them out for a given architecture. Libc and openmp currently build one bitcode file for nvptx and a second for amdgpu, roughly by refusing to burn assumptions like the number of compute units or the wave size into the front end and letting the backend deal with them later.
I think we could similarly paint over the differences between amdgpu, nvptx, intel and so forth with reasonable success: compile application code that doesn’t make architecture-specific assumptions or call vendor-specific intrinsics to a single file, then sort out the details later, once we know what the hardware is actually going to be. That sounds a lot like spirv--.
Shall we make that the reality?
Sketch:
- Lift gpuintrin.h to llvm.gpu.* intrinsics and give them clang builtins (sketched below)
- A backend-specific IR pass translates those intrinsics to the target ones
- Compile code to spirv64-- (or maybe spirv64-llvm-) with references to the llvm.gpu.* intrinsics
- Translate that gpu-agnostic IR to gpu-specific IR sometime later when the target is known
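To make the first couple of bullets concrete, here’s a minimal sketch of the application-side view, assuming the current gpuintrin.h spellings (__gpu_block_id_x and friends; I may have the exact names slightly off) and compiled as GPU device code. The __builtin_gpu_* builtins and llvm.gpu.* intrinsics mentioned in the comments are the hypothetical part of the proposal, not anything that exists today.

```cpp
#include <gpuintrin.h>
#include <stdint.h>

// Portable device code: no wave size, block size or compute-unit count is
// burned in at compile time. Today these wrappers expand to amdgpu or nvptx
// builtins chosen by the header; under the proposal they would instead hit
// target-agnostic __builtin_gpu_* builtins that lower to llvm.gpu.*
// intrinsics, and the vendor choice moves to a later IR pass.
uint64_t global_thread_id_x(void) {
  return (uint64_t)__gpu_block_id_x() * __gpu_num_threads_x() +
         __gpu_thread_id_x();
}

// True on exactly one lane of the wave, whatever the wave size turns out to be.
uint32_t is_wave_leader(void) { return __gpu_lane_id() == 0; }
```

A translation unit like this could then be built once for the agnostic triple and specialised per vendor later.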
We’d compile application code, libraries or whatever to the spirv-- triple, possibly to spirv64-foo- (I’m not sure what the favourite string encoding is). Applications that want to know the wave size or similar at compile time won’t build, but that’s fine; they’ve still got amdgcn etc. available.
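For contrast, this is the sort of thing that would correctly refuse to build for the agnostic triple. It’s just an illustration using amdgpu’s wavefront-size macro; the exact macro spelling isn’t the point, the baked-in target assumption is.

```cpp
#include <stdint.h>

// Not portable: the wave size is assumed at compile time. For spirv64-- the
// macro is not defined and this fails to build, which is the intended
// outcome; code like this keeps using the amdgcn/nvptx triples directly.
#ifndef __AMDGCN_WAVEFRONT_SIZE__
#error "this kernel is tuned for a known wavefront size"
#endif
typedef uint64_t lane_mask_t; // one bit per lane, assumes wave64
static_assert(__AMDGCN_WAVEFRONT_SIZE__ == 64, "retune for wave32 targets");
```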
This interacts positively with the SPV / llvm-spirv translator setup. Provided the translator will write spirv64-- out and convert it back (and preserve the names of the llvm.gpu intrinsics, which might mean an llvm extension, which seems fine), we get all that goodness without much bother.
Crucially for me, it would resolve some hassle in the openmp-on-spirv prototype I’m working on. The llvm.gpu intrinsic set derived from gpuintrin.h is probably useful regardless (e.g. getting it out of a C header is nice for fortran, and the IR doesn’t pick up the target flavour quite so early).
The backends could either learn to accept spirv64-- modules directly and emit machine code / ptx from them, or we could have an IR-module-to-IR-module translator that emits IR with the right triple, address spaces fixed up and so on (rough sketch below). I think I want that module translator anyway to help debug spirv vs direct-to-amdgcn differences, but it’s maybe too ugly to use in the production pipeline.
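For the module translator, a minimal sketch of the shape I have in mind against the new pass manager, using amdgcn as the example. The pass name, the llvm.gpu.thread.id.x intrinsic and its i32 signature are made up for illustration; llvm.amdgcn.workitem.id.x and the amdgcn-amd-amdhsa triple are real. Datalayout and address-space rewriting are only waved at in comments, and the rest of the llvm.gpu.* set would be table driven rather than hard coded like this.

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// Hypothetical pass: take a spirv64-- module full of llvm.gpu.* calls and
// turn it into an amdgcn module the existing backend pipeline understands.
struct GPUSpecializeToAMDGCN : PassInfoMixin<GPUSpecializeToAMDGCN> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    // Retarget the module. The datalayout string would come from the amdgpu
    // backend; setTargetTriple's exact signature differs across LLVM versions.
    M.setTargetTriple("amdgcn-amd-amdhsa");

    // Declare (or find) the vendor intrinsic we lower to: i32 (), no operands.
    FunctionCallee WorkitemIdX = M.getOrInsertFunction(
        "llvm.amdgcn.workitem.id.x", Type::getInt32Ty(M.getContext()));

    // Rewrite the gpu-agnostic intrinsic to the vendor one.
    for (Function &F : M)
      for (Instruction &I : make_early_inc_range(instructions(F))) {
        auto *CI = dyn_cast<CallInst>(&I);
        if (!CI || !CI->getCalledFunction() ||
            CI->getCalledFunction()->getName() != "llvm.gpu.thread.id.x")
          continue;
        IRBuilder<> B(CI);
        CI->replaceAllUsesWith(B.CreateCall(WorkitemIdX));
        CI->eraseFromParent();
      }

    // Drop the now-unused gpu-agnostic declaration if it was present.
    if (Function *Old = M.getFunction("llvm.gpu.thread.id.x"))
      if (Old->use_empty())
        Old->eraseFromParent();

    return PreservedAnalyses::none();
  }
};
```

The same skeleton pointed at the nvptx names would give the other half of the spirv-vs-direct comparison I want for debugging.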