I am working on SPIR-V to LLVM conversion at the moment, and together with @antiagainst and @MaheshRavishankar we were thinking of having a “mlir-spirv-runner” tool. This runner will in some way resemble the CUDA/Vulkan runners and aims at JITing SPIR-V via the SPIR-V to LLVM conversion. One of the results I am particularly interested in is executing GPU/SPIR-V modules on the CPU.
Encoding descriptor sets
Kernel arguments in SPIR-V are represented as global variables with set and binding numbers specified, e.g.
spv.globalVariable @__var bind(0, 1) : ...
I think this can be encoded in the symbolic reference of the variable:
spv.globalVariable @__var_set0_binding1 : ...
so that we can lower it to llvm.mlir.global via the existing conversion pass.
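A rough sketch of the intended lowering (the pointer/struct types and the exact shape of the resulting global are illustrative, not the actual conversion output):

```mlir
// Hypothetical input: set and binding carried as attributes on the variable.
spv.module Logical GLSL450 {
  spv.globalVariable @__var bind(0, 1) : !spv.ptr<!spv.struct<(f32)>, StorageBuffer>
}

// After encoding the descriptor set and binding into the symbol name,
// the existing SPIR-V to LLVM conversion can emit a plain LLVM global:
llvm.mlir.global external @__var_set0_binding1() : !llvm.struct<(f32)>
```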
I see two options for structuring the pipeline:
In the first option, the input is a module with a gpu module containing the kernel, the main function, and function declarations for helpers (LHS of the diagram). The outline of the passes is the following:
- Convert GPU dialect to SPIR-V dialect
- Lower ABI attributes and update the VCE triple
- Encode descriptor sets in the spv.module so that it can be lowered to LLVM (called SPIRVEncodeDescriptorSets in the diagram for now and described in more detail above)
- Convert SPIR-V to LLVM (and drop entry points for now, assuming there are no “internal” functions)
- Convert standard to LLVM
- Handle the GPU launch op (ConvertGPULaunchToLLVM). For that, we can get the source pointer to the buffer data and the destination pointer of the kernel’s global variable. We would naturally want to transfer the buffer from the host to the device before executing the kernel. But since we are running on the CPU, we instead “emulate” this memory transfer by copying the data to the destination pointer (a global variable in our case), and then execute the kernel, which now has all of its global variables set up with data.
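The copy-then-call lowering could look roughly like this; the global’s type, the `@kernel` name, and the use of the memcpy intrinsic are assumptions for illustration only:

```mlir
// Hypothetical result of ConvertGPULaunchToLLVM for one launch.
llvm.func @kernel()

llvm.func @launch(%src: !llvm.ptr<i8>, %size: i64) {
  // Destination: the kernel's global variable that models the descriptor.
  %dst = llvm.mlir.addressof @__var_set0_binding1 : !llvm.ptr<!llvm.array<4 x f32>>
  %dst8 = llvm.bitcast %dst : !llvm.ptr<!llvm.array<4 x f32>> to !llvm.ptr<i8>
  %volatile = llvm.mlir.constant(false) : i1
  // "Emulated" host-to-device transfer: a plain memcpy on the CPU.
  "llvm.intr.memcpy"(%dst8, %src, %size, %volatile) : (!llvm.ptr<i8>, !llvm.ptr<i8>, i64, i1) -> ()
  // The kernel now reads its data from the global variable.
  llvm.call @kernel() : () -> ()
  llvm.return
}
```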
The problem with this approach is that the result of running the passes is a nested module that cannot be translated to proper LLVM IR. To take care of this, we can “embed” the kernel’s module into the main one, resolving possible conflicts in symbolic references.
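For example, the embedding could turn a nested structure like the following into a flat one (symbol names are made up):

```mlir
// Nested result of the pipeline: not directly translatable to LLVM IR.
module {
  module @kernels {
    llvm.func @foo() {
      llvm.return
    }
  }
  llvm.func @main() {
    llvm.return
  }
}

// After embedding: one flat module, with kernel symbols renamed on conflict.
module {
  llvm.func @kernels_foo() {
    llvm.return
  }
  llvm.func @main() {
    llvm.return
  }
}
```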
The second option separates the host code and the device code into two modules in two files. The pipeline is similar to the one above, but in the end we compile the nested module and the main module separately into two object files, and then link them.
This approach has a number of drawbacks however:
- We would need to specify the variable/kernel declarations in the main module to tell the compiler that they exist in some other module.
- More importantly, there is no support for crossing module boundaries in MLIR at the moment. Handling this is a separate case and has to be discussed separately.
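For the first drawback, the host module would need external declarations along these lines (the names and types are hypothetical):

```mlir
// Declarations in the main module; the definitions live in the device
// module and are resolved when the two object files are linked.
llvm.mlir.global external @__var_set0_binding1() : !llvm.array<4 x f32>
llvm.func @kernel()
```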
I find the second approach more natural, but given the current state of handling separate modules, I think embedding may be preferable.
It would be great to hear any other comments on this!