I am trying to understand the current status of GPU code generation.
I can see there are several dialects: GPU, NVVM, SPIR-V, ROCDL.
As far as I understand:
NVVM is Nvidia’s language for compute kernels (e.g. CUDA)
SPIR-V allows the specification of kernels that can be used with Vulkan, OpenGL or OpenCL
GPU allows the abstract representation of the host-side, but there are lowering passes towards NVVM, ROCDL, SPV.
ROCDL - there is no documentation explaining what it is used for.
Are there some tutorials on how to take an application, optimize it, split it into host-side and kernels, and then generate code and run it? If there are many ways of doing this, the keywords that would interest me most are Nvidia and OpenCL.
These are great questions! There are different existing APIs and software stacks for driving GPUs in reality, so you see different dialects in MLIR matching them. Your categorization above looks good to me: ROCDL (for AMD), NVVM (for NVIDIA), and SPIR-V (for various Khronos APIs) are dialects for GPU device kernels. The GPU dialect mostly models the host-device ABI and launch contract, akin to CUDA/OpenCL, at the moment. MLIR allows one to mix dialects, so these dialects, while maintaining a clear separation and focus, can work together for progressive lowering from some high-level abstraction down to the GPU technology you’d like to target.
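To make the mixing concrete, here is a rough sketch (op syntax may differ a bit between MLIR versions) of a function where host-side Standard ops and a GPU dialect launch coexist:

```mlir
// Host-side code (Standard dialect) and the launch contract (GPU dialect)
// coexisting in one function; the body of gpu.launch is the device code.
func @scale(%data: memref<1024xf32>) {
  %c1 = constant 1 : index
  %c1024 = constant 1024 : index
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %c1024, %sy = %c1, %sz = %c1) {
    // Device-side body: plain Standard ops indexed by the GPU thread id.
    %v = load %data[%tx] : memref<1024xf32>
    %two = constant 2.0 : f32
    %r = mulf %v, %two : f32
    store %r, %data[%tx] : memref<1024xf32>
    gpu.terminator
  }
  return
}
```

The kernel outlining pass (-gpu-kernel-outlining) can then split such a region into a separate gpu.module, after which the kernel side can be lowered to NVVM, ROCDL, or SPIR-V and the host side to the matching runtime calls.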
MLIR itself just provides reusable dialects/patterns/passes to let one compose an end-to-end compiler; one often needs to find other components in other project repos to piece together a fully end-to-end flow (which I guess is what you are asking with “take an application, optimize it, split it into host-side and kernels, and then generate code and run it”), because the end-to-end flow typically involves a framework and its own format for representing models, etc.

In MLIR you can find various runners that execute IR snippets on GPUs (written as FileCheck tests for these runners). The input IR snippets already start at the GPU level, though, where the host and device have been separated. There are a few passes you can chain together to go from, say, the Linalg dialect (see the discussion here for the list of passes for SPIR-V). To look at the flow at an even larger scope, you might need to see how these components are used in projects built on MLIR, like TensorFlow, IREE, mlir-npcomp, and others.

For the CUDA path, @herhut might chime in to point to an example flow and other resources. For the SPIR-V path, what we have in MLIR is mainly for Vulkan compute at the moment. IREE relies on it extensively for GPU CodeGen; you can find an example going from MHLO ops all the way down to Vulkan API IR + a SPIR-V blob in an executable format. The core CodeGen flow is also documented here. Regarding current status, it is able to fully convert a few vision models. We are also seeing momentum on bringing up OpenCL support via SPIR-V in MLIR; quite a few commits have landed in that direction, and they have started to be used in PlaidML AFAICT.
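To give a sense of what those runner inputs look like, here is a hedged sketch from memory (required attributes and the exact custom syntax have changed across MLIR versions) of IR at the “GPU level where host and device are separated”:

```mlir
// The kernel lives in its own gpu.module; the host function launches it
// through gpu.launch_func, which is what gets lowered to CUDA driver calls
// or Vulkan API calls by the respective runner.
module attributes {gpu.container_module} {
  gpu.module @kernels {
    gpu.func @add_one(%buf: memref<8xf32>) kernel {
      %tid = "gpu.thread_id"() {dimension = "x"} : () -> index
      %v = load %buf[%tid] : memref<8xf32>
      %one = constant 1.0 : f32
      %r = addf %v, %one : f32
      store %r, %buf[%tid] : memref<8xf32>
      gpu.return
    }
  }
  func @main(%buf: memref<8xf32>) {
    %c1 = constant 1 : index
    %c8 = constant 8 : index
    gpu.launch_func @kernels::@add_one
        blocks in (%c1, %c1, %c1) threads in (%c8, %c1, %c1)
        args(%buf : memref<8xf32>)
    return
  }
}
```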
Thanks for this very detailed reply. Let me ask just a few practical questions to make sure I understand it:
1. There are several paths to go from Linalg or TensorFlow to execution on NVIDIA hardware using MLIR. They may involve external tools, but they can be used in practice today.
2. One of these paths uses the CUDA API, the other uses the Vulkan API (OpenCL is not there yet).
3. In both cases, I should:
   - use NVVM for the lowering of the kernel to executable code, sticking to the correct API;
   - use GPU to lower towards the correct host-side API, again sticking to the correct API.
Is this correct?
Furthermore: for the OpenCL path, is the problem one related to the implementation of the API, or something more fundamental, related to the kinds of memory organization or optimization one uses? In other words, what would be the investment in time to make it run?
For 1), it depends on what you mean by “can be used in practice”. If that means using it as an off-the-shelf solution to support production workloads, I don’t think we are there yet. Many things still remain to be done (and contributions of any sort are certainly very welcome). But the flow should be there to show how everything fits together.

For 2), that’s correct.

For 3), NVVM is the kernel side for CUDA; if going through the Vulkan compute path, you’ll need SPIR-V. NVVM is not accepted by Vulkan, which is a cross-vendor standard. For the host side, you can use the GPU dialect and lower its launches into Vulkan API calls like what we do in mlir-vulkan-runner, which is similar to the other runners. You can also look at how the host side is modelled in IREE, which is more akin to native Vulkan.
For OpenCL, I don’t think there are fundamental reasons why we cannot have it; it’s just not really implemented end-to-end yet. (Another thing to keep in mind is that MLIR is still young and evolving actively; even if something is not immediately working, it probably just means RFCs and improvements are needed, rather than the feature being designed not to be supported.) Most of the stuff in MLIR land is needs-based, so if you have the need, please certainly feel free to push it forward.

Regarding the investment, it depends on how far you’d like to push it. The SPIR-V dialect is pretty much built out at the moment, including the target environment support and ways to automate pulling in missing ops. Lowerings from other dialects to SPIR-V are reasonably good, given that we can already convert non-trivial ML models. To bring up OpenCL support, all of this can be shared and reused; one just needs to pull in more OpenCL-related SPIR-V ops, define a proper OpenCL conversion target, iron out problems in the lowering procedure, and add more OpenCL-oriented patterns/passes. Oh, and also have a way to run the result, like an OpenCL runner or something. I don’t think it’s a huge amount of work to get just a minimal example working end-to-end, especially with so many components that can be reused and lots of examples to look at. I might be oversimplifying things, but if somebody can dedicate time to it, I’m roughly expecting it to take weeks.
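To make the “define a proper OpenCL conversion target” part a bit more concrete: the SPIR-V lowering is driven by an spv.target_env attribute describing the target’s version, capabilities, and extensions. The snippet below is only an illustration from memory (treat the exact attribute syntax and the resource limits as placeholders that may not match your MLIR revision); the point is that an OpenCL-flavored target would advertise Kernel/Addresses-style capabilities instead of the Shader capability the Vulkan path uses:

```mlir
// Hypothetical OpenCL-flavored target environment; attribute syntax and
// limits are placeholders, not a verified configuration.
module attributes {
  spv.target_env = #spv.target_env<
      #spv.vce<v1.0, [Kernel, Addresses], []>,
      {max_compute_workgroup_invocations = 128 : i32,
       max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>
} {
  // Functions here would be lowered into an OpenCL-flavored spv.module.
}
```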
I don’t think you need to target NVVM directly: it models the LLVM-level intrinsics for NVIDIA GPUs (unless you need very specific low-level features). A combination of GPU + Standard ops should be enough in the common case, I think; the GPU dialect should have enough to produce kernels.
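To illustrate, a sketch of such a kernel (again, op syntax may vary between MLIR versions): only GPU and Standard ops appear, and the GPU-to-NVVM or GPU-to-SPIR-V conversions are what map the id/dimension queries onto the target’s intrinsics, so you never have to write nvvm.* ops by hand.

```mlir
gpu.module @kernels {
  // A simple axpy kernel: the global index is computed from GPU dialect
  // queries, while arithmetic and memory accesses are plain Standard ops.
  // No NVVM/ROCDL/SPIR-V specific ops appear at this level.
  gpu.func @axpy(%x: memref<?xf32>, %y: memref<?xf32>, %a: f32) kernel {
    %bid = "gpu.block_id"() {dimension = "x"} : () -> index
    %bdim = "gpu.block_dim"() {dimension = "x"} : () -> index
    %tid = "gpu.thread_id"() {dimension = "x"} : () -> index
    %off = muli %bid, %bdim : index
    %i = addi %off, %tid : index
    %xv = load %x[%i] : memref<?xf32>
    %yv = load %y[%i] : memref<?xf32>
    %ax = mulf %a, %xv : f32
    %r = addf %ax, %yv : f32
    store %r, %y[%i] : memref<?xf32>
    gpu.return
  }
}
```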