Goal and assumptions
The main goal is to set up the conversion patterns and transformations needed to make a multi-threaded conversion of the SPIR-V dialect to the LLVM dialect possible.
The focus will be on a simple kernel with no function calls and no notion of control flow (yet). This is primarily because control flow operations may require complicated transformations to perform satisfactorily, mostly due to control flow divergence. While this can be tackled with vector predication (essentially mapping GPU control flow masks), there are other approaches that may be more beneficial. These can be considered future extensions of this project, and they are easier to build once the core infrastructure is already in place.
Parallelism
To reiterate, the main idea is to map a single workgroup to a single CPU thread. In the ideal case, this would also map subgroups to SIMD vectors and invocations to vector elements.
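Schematically, the ideal mapping looks as follows (a conceptual sketch only, not a commitment to a particular lowering):

// Ideal GPU-to-CPU mapping (conceptual only):
//   workgroup (x, y, z)          -> one CPU thread
//   subgroup within a workgroup  -> one SIMD vector
//   invocation within a subgroup -> one vector lane / element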
Short summary
I propose the following project outline (at least for the near future).
1. Have a hello world kernel conversion working. As the basic kernel we can consider one that uses the ids of workgroups, subgroups and invocations in its computations, but that has no synchronisation (and no control flow, per the assumption above).
2. Model SPIR-V barrier synchronisation: the cases here are spv.ControlBarrier and spv.MemoryBarrier.
This will help to tackle the GPU-to-CPU mapping challenges one by one, and will help to support the main features quickly. More details about step 1 can be found below. I also think that a separate thread to discuss barriers (step 2) would be more convenient; plus, step 1 can be prototyped by that point. A small example of the barrier ops in question is shown below.
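For reference, this is roughly how these ops appear in the spv dialect (a hedged sketch: the scope and memory-semantics operands below are chosen purely for illustration, and the exact assembly syntax varies between MLIR versions):

// Synchronise all invocations of a workgroup and order workgroup-memory accesses.
spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease|WorkgroupMemory"
// Memory-only barrier: orders memory accesses without synchronising execution.
spv.MemoryBarrier "Workgroup", "AcquireRelease|WorkgroupMemory"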
1. Hello world kernel
An example of a simple kernel that can be supported is the following:
spv.module @hello_world Logical GLSL450 {
  spv.globalVariable @nw built_in("NumWorkgroups") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @lid built_in("LocalInvocationId") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @wid built_in("WorkgroupId") : !spv.ptr<vector<3xi32>, Input>
  spv.func @hello_world_kernel(/* some buffer arguments */) "None" {
    // SPIR-V code that represents something like:
    //   buffer[wid][lid] = ...
    spv.Return
  }
  spv.EntryPoint "GLCompute" @hello_world_kernel, @wid, @lid, @nw
  spv.ExecutionMode @hello_world_kernel "LocalSize", 16, 1, 1
}
1.1 “Unrolling” the kernel
The idea is to make the code “single-threaded” for every workgroup id in dimensions x, y and z. I propose to introduce a kernel unrolling pass (arbitrary naming for now) that essentially wraps the kernel body into for loops (this approach nicely follows OpenCL-to-CPU compilation: see @mehdi_amini’s links above). The structure (with pseudocode for the loops) will then be:
spv.module @hello_world Logical GLSL450 {
  spv.globalVariable @nw built_in("NumWorkgroups") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @lid built_in("LocalInvocationId") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @wid built_in("WorkgroupId") : !spv.ptr<vector<3xi32>, Input>
  spv.func @hello_world_kernel(/* some buffer arguments */) "None" {
    for every workgroupId from (0, 0, 0) to numWorkgroups
      for every subgroup from 0 to subgroup size // Optional
        for every localId from (0, 0, 0) to LocalSize
          // Old kernel code here
  }
  spv.EntryPoint "GLCompute" @hello_world_kernel, @wid, @lid, @nw
  spv.ExecutionMode @hello_world_kernel "LocalSize", 16, 1, 1
}
The number of workgroups and (if present) the number of subgroups are passed in as global values. The number of local invocations per workgroup is extracted from spv.ExecutionMode.
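As a rough sketch of where the loop bounds come from (op names such as spv.mlir.addressof, spv.Load, spv.CompositeExtract and spv.Constant are used here for illustration; the exact spellings differ between MLIR versions):

// Workgroup-loop bound: loaded from the NumWorkgroups input global.
%nw_ptr = spv.mlir.addressof @nw : !spv.ptr<vector<3xi32>, Input>
%nw = spv.Load "Input" %nw_ptr : vector<3xi32>
%nw_x = spv.CompositeExtract %nw[0 : i32] : vector<3xi32>
// Invocation-loop bound: taken from the "LocalSize" execution mode
// (16, 1, 1 in the example above) and materialised as constants.
%ls_x = spv.Constant 16 : i32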
Having done this, it remains to parallelise the workgroup for loop so that it dispatches numWorkgroups threads, and to handle the scalar kernel separately.
1.2 Parallelising the workgroup loop
To create the 1D workgroup for loop, I propose to reuse the lowering from the OpenMP dialect to the LLVM dialect. This can be modelled with omp.parallel to specify the number of threads, and omp.WSLoop to actually create the parallel loop.
To create 2D or 3D workgroups we can proceed in the same way, but also set the collapse value on omp.WSLoop so that the nested loops are handled together. I haven’t properly looked at this part yet, so I think that starting with 1D is reasonable. A rough sketch of the 1D case is shown below.
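As a hedged sketch (the omp dialect assembly below is illustrative only; the exact omp.parallel / omp.wsloop syntax has changed between MLIR versions and should be checked against the current dialect documentation):

// One loop iteration corresponds to one workgroup; iterations are shared
// between the threads of the parallel region.
omp.parallel num_threads(%nw_x : i32) {
  omp.wsloop (%wg_x) : i32 = (%c0) to (%nw_x) step (%c1) {
    // ... outlined kernel body for workgroup id %wg_x (see below) ...
    omp.yield
  }
  omp.terminator
}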
The actual code inside the workgroup loops can be packed into a function that additionally takes all the ids as parameters, roughly as sketched below.
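A minimal sketch of the outlined function (the name and the choice of passing the ids as vector<3xi32> arguments are placeholders, not a fixed design):

// Kernel body outlined into a plain function: the ids become ordinary
// arguments instead of Input global variables.
spv.func @hello_world_kernel_body(%wid : vector<3xi32>, %lid : vector<3xi32>
                                  /* plus the buffer arguments */) "None" {
  // buffer[wid][lid] = ...
  spv.Return
}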
1.3 Vectorizing the kernel
I propose to use a second pass after “unrolling” the kernel to pack the instructions into vectors. There are a number of points here (see the sketch after this list):
- The vector width should be parametrised and passed as an argument to the pass. Otherwise, an analysis is required to find the optimal SIMD width (as in the LLVM loop vectoriser), which is again an extra feature that can be considered separately from the initial goal.
- Require LocalSize % width == 0 to avoid non-trivial cases in the beginning. The starting point will be a 1D LocalSize.
- All local ids should be used directly in the kernel to avoid gather/scatter ops.
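To illustrate the intent, here is a before/after sketch in the same pseudocode style as above (width = 4 is an arbitrary choice; LocalSize = 16 comes from the example kernel, so LocalSize % width == 0 holds):

// Before vectorisation: one scalar access per local invocation.
for every localId from 0 to LocalSize                // LocalSize = 16
  buffer[wid][localId] = f(localId)

// After vectorisation with width = 4: four consecutive invocations are packed
// into one vector operation; using localId directly keeps the access
// contiguous, so no gather/scatter is needed.
for every localId from 0 to LocalSize step 4
  buffer[wid][localId .. localId + 3] = f_vec(localId) : vector<4xf32>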