RFC: llvm.gpu builtins for target agnostic code representation

Motivation

I’d like to compile code to spirv64--, pass the resulting SPV format file around, then JIT it to whichever GPU architecture (or interpreter?) is in use locally. There are multiple moving parts for that to work. This RFC is to discuss the first one I’d like to land. It’s straightforward and hopefully useful to the wider community beyond that aspiration.

Implementation is at pull/131190; previous discussion is here.

Background

We have a clang library header, gpuintrin.h, which provides a minimal abstraction over the amdgpu and nvptx GPU targets. It is primarily used by llvm libc, with patches under review to move the openmp runtime onto it. The functions are static inline, serving as a minimal compiler runtime.
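To make the shape of that abstraction concrete, here is a minimal sketch of the dispatch pattern such a header uses: one static inline wrapper per operation, selecting the vendor builtin at compile time. The wrapper name and the host fallback are illustrative, not the exact gpuintrin.h API; the two vendor builtins are real clang builtins.

```c
#include <stdint.h>

// Sketch of the gpuintrin.h-style dispatch pattern (names illustrative).
static inline uint32_t gpu_thread_id_x(void) {
#if defined(__AMDGCN__)
  return __builtin_amdgcn_workitem_id_x(); // amdgpu lane/workitem id
#elif defined(__NVPTX__)
  return __nvvm_read_ptx_sreg_tid_x();     // nvptx thread id register
#else
  return 0; // host fallback so the sketch compiles off-GPU
#endif
}
```

Because everything is static inline and resolved by the preprocessor, the abstraction costs nothing over calling the vendor builtin directly.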

As of PR 131164, the gpuintrin.h layer also tries to handle the spirv64 target. That is, once that header is filled out, we should have libc compiling as spirv64-- (modulo the loader/crt infra). Similarly, once the openmp runtime is moved over to gpuintrin.h (and some extra work around address space conversions lands), we’ll be able to build the openmp runtime for spirv64-- as well.

That is, we should be able to use a single SPV file to represent the openmp or libc runtimes, such that it is lowered to whichever vendor is running the program at JIT time.

Method

I propose we add some llvm intrinsics, using llvm.gpu.* as the naming convention, and have the spirv translator pass them through unchanged. They’ll eventually reach a specific machine backend, which will know how to lower them to whatever is native. That’s pull/131190. The lowering is an IR pass run unconditionally by the target backend. It is mostly a lookup table from intrinsic to lowering, so adding a new GPU target means adding a column to the table. The pass could possibly also run early in the pipeline, so that code which knows from the start it is being built for a specific architecture can use the generic builtins as cheaply as native intrinsics.
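A rough sketch of that lookup table, under the assumption that it is keyed by generic intrinsic name with one column per target. The llvm.gpu.* spellings here are illustrative (see pull/131190 for the real list); the amdgcn and nvvm names are existing intrinsics.

```c
#include <stddef.h>
#include <string.h>

// One row per generic intrinsic, one column per target (names illustrative).
struct LoweringRow {
  const char *generic; // llvm.gpu.* name
  const char *amdgpu;  // amdgcn replacement
  const char *nvptx;   // nvvm replacement
};

static const struct LoweringRow kTable[] = {
    {"llvm.gpu.thread.id.x", "llvm.amdgcn.workitem.id.x",
     "llvm.nvvm.read.ptx.sreg.tid.x"},
    {"llvm.gpu.block.id.x", "llvm.amdgcn.workgroup.id.x",
     "llvm.nvvm.read.ptx.sreg.ctaid.x"},
};

// Adding a new GPU target means adding a column; adding a new generic
// intrinsic means adding a row.
static const char *lower_for_amdgpu(const char *generic) {
  for (size_t i = 0; i < sizeof kTable / sizeof kTable[0]; ++i)
    if (strcmp(kTable[i].generic, generic) == 0)
      return kTable[i].amdgpu;
  return NULL; // unknown intrinsic: leave it for the verifier to reject
}
```

In the real pass the "column" is of course a rewrite action (replace the call, expand to a sequence, etc.) rather than a name, but the table-per-target structure is the point.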

Getting abstractions right is difficult. I expect the set of functions to change slightly as a third vendor comes on board, as we discover missing features and so forth. The ones proposed here are exactly the architecture-dependent ones from llvm libc, where there has already been some evolution. The premise is to be able to write C/openmp/etc targeting a SIMT GPU, without really minding which one, and specialise it to the exact hardware in question later. That means extending IR sufficiently to describe SIMT operations, which currently we can only do by committing to a specific architecture up front.

There are some follow-on implications here. Some of the intrinsics added are functionally identical to existing vendor-specific ones - we could use the generic one through the amdgpu backend if we like. Some lowering done by CGBuiltin can be moved into the IR pass, e.g. to create code that doesn’t need to know what code object version it will run against. Some of the llvm.gpu builtins may be useful as target-independent optimisation constructs and/or lowered directly by the different backends, without needing to go through a lowering pass.

FAQ:

Why are these intrinsics and not in compiler-rt / libc / ockl / devicertl / libclc / otherlib?

Because the operations map directly to SIMT programming on a GPU, exactly like the GPU target-specific intrinsics do - it’s essentially the same idea. Also because this lets us pass IR around until it reaches the backend, where it turns into the right machine code for the target, without needing to package another library alongside or assume some dependency is present.

Why don’t we just say what the target is at clang time, instead of delaying until JIT?

That’s what we’ve already got. I’m not proposing changing that, though these intrinsics would work just fine there as well, maybe letting us deprecate some of the target specific ones.

Don’t these intrinsics already exist in my favourite language?

Yes, probably - otherwise you wouldn’t be able to program GPUs with it. The point is that every language has written essentially the same intrinsics, with the same lowering but slightly different names, and so has every target, so there’s a ridiculously long list of ways of spelling thread_id_x, all with the same semantic meaning. We should bravely fold the really common ones into common IR intrinsics so that all these projects get a chance to partially deduplicate themselves.