RFC: SPIR-V IR as a vendor-agnostic GPU representation

Hello all. Seeking support and/or warning of opposition before I make a start on the following. Tagging mlir because they might like to work with a single GPU target that turns into amdgcn/nvptx somewhere downstream.

There are lots of differences in the details between different GPUs. Despite that, we’ve had reasonable success emitting LLVM bitcode that ignores some of those differences, leaving it until the backend to sort out the details for a given architecture. Libc and openmp currently build one bitcode file for nvptx and a second one for amdgpu. That’s achieved roughly by refusing to burn in assumptions like number of compute units and wave size in the front end and letting the backend sort out the details later.

I think we could similarly paint over the differences between amdgpu, nvptx, intel and so forth with reasonable success. Compile application code that doesn’t make architecture-specific assumptions or call vendor-specific intrinsics to a single file and then sort out the details later, after specifying what the hardware is actually going to be. That sounds a lot like spirv--.

Shall we make that the reality?

Sketch:

  • Lift gpuintrin.h to llvm.gpu.* intrinsics and give them clang builtins
  • A backend-specific IR pass translates those intrinsics to the target ones (sketched just after this list)
  • Compile code to spirv64-- or maybe spirv64-llvm- with references to llvm.gpu
  • Translate that gpu-agnostic IR to gpu-specific IR sometime later when the target is known
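
For concreteness, a minimal sketch of what the first two bullets could look like, assuming hypothetical llvm.gpu.* names (none of these intrinsics exist today; only the amdgcn/nvptx intrinsics named in the trailing comment are real):

```llvm
; Target-agnostic module roughly as the frontend might emit it.
target triple = "spirv64--"

; Hypothetical generic intrinsic, name purely illustrative.
declare i32 @llvm.gpu.thread.id.x()

define i32 @tid() {
entry:
  %id = call i32 @llvm.gpu.thread.id.x()
  ret i32 %id
}

; Once the concrete target is known, a backend-scheduled IR pass would
; rewrite the call, e.g. to @llvm.amdgcn.workitem.id.x() for amdgcn or
; @llvm.nvvm.read.ptx.sreg.tid.x() for nvptx, and fix up the triple.
```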

We’d compile application code, libraries or whatever to the spirv-- triple, possibly to spirv64-foo- (I’m not sure what the favourite string encoding is). Applications that want to know the wave size or similar at compile time won’t build, but that’s fine; they’ve still got amdgcn etc. available.

This interacts positively with the SPV / llvm-spirv translator setup. Provided the translator will write spirv64-- out and convert it back (and preserve the names of the llvm.gpu intrinsics, which might mean an llvm extension, which seems fine) then we get all that goodness without much bother.

Crucially for me it would resolve some hassle in the openmp-on-spirv prototype I’m working on. The llvm.gpu intrinsic set derived from gpuintrin.h is probably useful regardless (e.g. getting it out of a C header is nice for fortran and the IR doesn’t pick up the target flavour quite so early).

The backends could either learn to accept spirv64-- directly and emit code/ptx from it, or we could have an IR-module-to-IR-module translator that emits a module with the right triple, address spaces fixed up, and so on. I think I want that module translator anyway to help debug spirv vs direct-to-amdgcn differences, but it’s maybe too ugly to use in the production pipeline.

When I last looked into making vendor-agnostic intrinsics I ran into the problem that it’s not easy to simply alias an intrinsic to another one. There’s also the issue that we would then need to decide what the ‘common’ behavior is. Standards like OpenCL might help, but there are a few edge cases.

Overall, it would be nice if SPIR-V behaved more like a true generic GPU target than a simple serialization of LLVM IR. If we had generic LLVM intrinsics we could use those from other targets as well. That’s what the gpuintrin.h header does right now, but a header is easier to modify than a set of intrinsics.

It’s worth clarifying that aliasing intrinsics is not the intent here. Let there be a whole new intrinsic called llvm.gpu.whatever which is lowered in an IR pass scheduled by the backend to whatever that target wants it to mean.

I agree that the header is easier to edit than an IR pass, and we might want the header to remain for some of the functions that are implemented in terms of other functions anyway (and/or to paper over some differences, e.g. if a 128-bit warp size shows up). However, sticking target-specific intrinsics in the IR makes translating to a different target a mess, and keeping target-agnostic intrinsics in the IR means we need them to be intrinsics. It’s definitely a hassle (and I acknowledge I’ve wanted this for years…) but it’s clearly implementable and not that much bother overall.

Interesting side point: if we somewhat canonicalised on the llvm.gpu intrinsics even when targeting amdgcn explicitly, we’d have an obvious point for guaranteed lowering to constants, branch elision, etc.
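
A rough illustration of that, again assuming a hypothetical llvm.gpu.num.lanes intrinsic (name purely illustrative):

```llvm
declare i32 @llvm.gpu.num.lanes()   ; hypothetical generic intrinsic

define i32 @pick_path() {
entry:
  %n = call i32 @llvm.gpu.num.lanes()
  %is32 = icmp eq i32 %n, 32
  br i1 %is32, label %wave32, label %wave64

wave32:                             ; code specialised for 32-wide waves
  ret i32 32

wave64:                             ; code specialised for 64-wide waves
  ret i32 64
}

; On a wave64 amdgcn target the lowering pass could fold the call to the
; constant 64, after which existing passes delete the wave32 block entirely.
```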

That’s an interesting approach. You have an additional step though; AFAIK the translator only supports SPIR-V to LLVM using the SPIR convention. Not a deal breaker of course, but something to add to your list.

FYI, DPC++ (github.com/intel/llvm, from which we are upstreaming SYCL support) uses a somewhat related approach for the support of NVPTX and AMDGPU: the SYCL headers use SPIR-V builtins to access specific functionality; to support NVPTX / AMDGPU from there we provide an implementation of these via libclc. We never emit an actual SPIR-V module though, but I could see a reuse of what you are proposing here.

cc @frasercrmck @asudarsa @MrSidims

That’s a similar idea: a collection of function names with an instantiation somewhere downstream. @AlexVlx suggested “spv lib functions”, which might be the same as libclc. It’s some sort of compiler-rt anyway. I’m carrying some scars from weaving various compiler-runtime code into toolchains, which makes me biased towards a dedicated IR pass that instantiates intrinsics on the fly, where Intel could easily add instantiations in the same pass in their fork.

Being able to express the GPU SIMT stuff without gpuintrin.h/compiler-rt/libclc/libc/devicertl and so forth seems pretty high value to me. We have a rough consensus that the basis set looks pretty much like it did in OpenCL, and we’ve resolved the warp vs wavefront naming challenge with “wave” to annoy both sides equally.

I think the intrinsics are worth having just to pull them out of gpuintrin.h and offloading’s devicertl, and because the compiler knows stuff about them and can occasionally optimise based on that, but as a side effect it would also give spirv sufficient target-agnostic capability to target the various GPUs easily without any further work.

The question is whose problem are you solving, and where? The goals and problems of an implementor and an end user are different. At the bottom of the stack, trying to be portable can be actively harmful. I would put libc at the bottom of the stack of compiler support, on top of which you would build a spirv implementation.

I’m not sure what the value add is for the middle end or backend. It expands the number of cases we need to worry about (for both correctness and optimization) in a variety of places.

In terms of “frontend user” convenience, I don’t see what upgrading the wrapper functions gpuintrin has into intrinsics buys over a utility library.

The problem, broadly, is that stuff is more complicated than it ought to be, and that makes things more awkward for users and for the implementation. Perhaps it’s a bad idea to have two things in one RFC - I didn’t anticipate the llvm.gpu intrinsics being contentious. I think we can generalise spirv slightly and have a bunch of special-case stuff drop out of GPU toolchains all over the place from the same intrinsics. The middle end could check for the llvm intrinsics instead of both the amdgpu and nvptx ones, which seems a win. Attributor likewise.

Being able to build with -fopenmp --target=any or similar to get a binary that will run on amdgpu or nvptx or intel systems is a good feature however the details work out. Today we can do that by building the application lots of times and bundling it up in an archive to dig out later, which is functional but not particularly delightful. Anyone who wants to ship a library gets to ship it as N copies of pretty similar bitcode, optionally bundled up in an archive. We have quite a lot of infrastructure to hide that from the user at present.

Any program that wants to build as nvptx or amdgcn ends up with a header abstracting over things like ballot to deal with the difference in names. Libc currently uses the gpuintrin.h in clang, which is static inline functions that handle the dispatch. Openmp currently emits calls into the devicertl, which contains implementations that call into the intrinsics (or maybe gpuintrin by now), as switching on triple to pick different names makes the test cases spuriously messier. So openmp codegen and devicertl would both be simplified by llvm.gpu.* intrinsics, and libc would be roughly a wash.
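
For reference, these are roughly the two target-specific ballot forms that gpuintrin.h-style headers dispatch between in the frontend today; a hypothetical llvm.gpu.ballot would defer that choice to a backend IR pass instead:

```llvm
declare i64 @llvm.amdgcn.ballot.i64(i1)           ; amdgcn form
declare i32 @llvm.nvvm.vote.ballot.sync(i32, i1)  ; nvptx form, takes a member mask

define i64 @ballot_amdgcn(i1 %pred) {
entry:
  ; Each active lane contributes one bit of the result.
  %bits = call i64 @llvm.amdgcn.ballot.i64(i1 %pred)
  ret i64 %bits
}
```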

I don’t know what fortran/mlir are doing for this. I expect there’s a similar layer somewhere; sometimes they do ugly things like compiling C header files as if they were source code under macro definitions. The spirv path mentioned above in the Intel fork calls into libclc. We could move code under compiler-rt instead if we wanted.

Generally GPU toolchains have a pile of weird quirks / annoyances. For spirv specifically, being able to have llvm.gpu intrinsics bubble through and then turn into the right thing means I wouldn’t need to pass around an extra library file containing implementations of them.

End users would notice the clang builtins that turn into llvm intrinsics and possibly use that instead of their existing transform shim. That’s probably value add too, though maybe contentious.

The core issue is that the way the toolchains were thrown together does not make sense. It only appears from the perspective of the frontend implementor that pushing the problem down for the compiler middle end / backend to deal with is the easiest solution, since the pieces that work OK in the current scheme are the simplest generic llvm intrinsics, and anything more complicated runs into issues.

This is approximately my feeling on this from a moral perspective. If you want portable code, you need to use abstractions and accept the limitations that come with that. The implementation of the abstractions needs to not be injected directly into the source of the program, which is what most of the systems are doing now (rocm-device-libs/libclc, device-rtl, mlir, and gpuintrin are all approximately doing the same thing).

But that does not mean the set of llvm intrinsics is the only possible way to provide that abstraction. We don’t have much in the way of transforms using these types of functions now, but a system of well-known call names could work, as it does for ordinary libc and libm functions (or compiler-rt, although that’s slightly different). We would just need to define a system for managing function availability as a target triple property.

The intrinsics just sidestep the fact that we have to rework how the toolchain links the components together and agree on some common interface. But I think we need to solve that problem anyway.

The intrinsics have the issue that they would only be implemented in terms of other intrinsics. So we would have to make sure they are intrinsics that behave identically to an unknown call that could call other leaf intrinsics, at which point it’s not much of an intrinsic and is just a name.

I’ve suggested a compiler-rt-like solution in the past. The way that OpenCL makes things ‘portable’ is by just introducing functions with defined behavior; it’s then the job of the OpenCL runtime to convert those functions into something meaningful for the target. It wouldn’t be overly difficult to make a GPU version of the compiler-rt builtins that more or less just adds linkable versions of the ones we already have in gpuintrin.h.
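
For illustration, a sketch of what one entry in such a linkable library could look like for amdgcn, reusing the existing __gpu_* naming purely as an example (details like the exact return type are illustrative):

```llvm
; A linkable (non-inline) amdgcn build of a gpuintrin.h-style function.
target triple = "amdgcn-amd-amdhsa"

declare i32 @llvm.amdgcn.wavefrontsize()

define i32 @__gpu_num_lanes() {
entry:
  %n = call i32 @llvm.amdgcn.wavefrontsize()
  ret i32 %n
}
```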

However, one issue with that is that I don’t think SPIR-V correctly supports linking yet, at least not within LLVM. For a standard approach to work, we’d need standard behavior for linkers.

Got my warning of opposition sufficiently clearly through side channels. I won’t be pursuing this enhancement to spirv.