This RFC is the beginning of an effort to clean up the GPU dialect and strengthen its semantics.
Currently the GPU dialect is filled with operations that don’t really belong in the dialect. This RFC proposes removing them from gpu via one of three alternatives:
Creating new dialects to host them.
Moving them to an existing dialect where they make more sense.
Removing them from the monorepo if no one shows interest in maintaining them upstream.
NOTE: This is only a proposal, and one possible outcome is to keep some or all of the ops in the dialect.
Why?
The MLIR charter RFC established a best-effort project mandate to: check the status of dialects, improve documentation, and productize dialects. Currently, the first sentence in the GPU dialect docs is:
Note: this dialect is more likely to change than others in the near future; use with caution.
IMO, we have abused this experimental note, and as a result the dialect has expanded beyond its original description to include operations that are not semantically aligned with it.
This dialect provides middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL…
Its goal is to abstract away device- and driver-specific manipulations to launch a GPU kernel and provide a simple path towards GPU execution from MLIR…
This dialect also abstracts away primitives commonly available in GPU code, such as with gpu.thread_id…
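For reference, here is a minimal sketch of the kind of device-side code this description covers (the kernel name and shapes are made up):

```mlir
gpu.module @kernels {
  gpu.func @add(%x: memref<256xf32>, %y: memref<256xf32>) kernel {
    // SIMT primitive: each thread picks its own element.
    %tid = gpu.thread_id x
    %a = memref.load %x[%tid] : memref<256xf32>
    %b = memref.load %y[%tid] : memref<256xf32>
    %sum = arith.addf %a, %b : f32
    memref.store %sum, %y[%tid] : memref<256xf32>
    gpu.return
  }
}
```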
The criteria I used for selecting which operations to remove are:
The operations don’t conform to the dialect description.
The operations seem to be used by just one particular target or only by legacy code.
Removal proposal
Each of the following groups should be considered a unit to be removed by one of the above alternatives.
My initial proposal for this set is to create a sparse or sparse.gpu dialect. However, if no one is interested in maintaining the dialect, then removal is the next best choice.
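For reference, a doc-style sketch of one of the ops in this set (operands are assumed to be defined; the async token models the runtime's stream ordering):

```mlir
// Build a COO sparse-matrix handle on the device, ordered after %dep.
%spmat, %token = gpu.create_coo async [%dep] %rows, %cols, %nnz, %rowIdxs,
    %colIdxs, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
```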
As far as I could tell, these ops are on some legacy paths or used by SPIR-V. If SPIR-V folks are interested in them, we could find them a place in a different dialect.
My initial reaction is a -1 to, at least, the removal part of the proposal, mainly because the RFC does not clearly motivate why the changes are needed (other than the somewhat vague “an effort to clean the GPU dialect and strengthen its semantics”). And if changes are indeed needed, I would prefer to keep this functionality alive through a new sparse gpu dialect.
But, of course, the community should chime in as well. I am quite often told offline that people are building on top of the sparse dialect and that they appreciate all the work done by the MLIR Sparsifier Team in the past. However, in situations like this, it would be nice if people spoke out more publicly in favor of keeping the sparse functionality by stating active use cases.
What’s the issue with the subgroup MMA operations? They are portable across GPUs and have intrinsics at the LLVM level. They provide little to no value on NVIDIA architectures; that said, they do exist in the CUDA ecosystem.
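For context, the ops in question look roughly like this (a doc-style sketch; buffers and index constants are assumed to be defined):

```mlir
// Cooperative 16x16x16 matrix multiply-accumulate performed by a whole subgroup.
%a = gpu.subgroup_mma_load_matrix %bufA[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%b = gpu.subgroup_mma_load_matrix %bufB[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
%c = gpu.subgroup_mma_load_matrix %bufC[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
%d = gpu.subgroup_mma_compute %a, %b, %c
    : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
    -> !gpu.mma_matrix<16x16xf16, "COp">
gpu.subgroup_mma_store_matrix %d, %bufC[%c0, %c0] {leadDimension = 16 : index}
    : !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>
```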
The dialect currently states…
This dialect provides middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL. It provides abstractions for kernel invocations (and may eventually provide those for device management) that are not present at the lower level (e.g., as LLVM IR intrinsics for GPUs).
Thanks @fabianmc, this is a good initiative; I support making the GPU dialect more focused on its goals. It seems that, one addition at a time, its scope crept into a mix of things that is hard to disentangle right now.
That says nothing about the value of what is there: all of these ops were added for good reasons and likely have use cases. However, it stretches the dialect when dedicated dialects seem better (from a code-organization point of view, from how the ops fit in the lowering flow, and from a mental-model / documentation point of view).
Of course, one of the contributing factors to scope creep is how difficult the discussion is each time a new dialect is proposed: the bar is much lower for folks to add to an existing dialect instead.
The problem is that they don’t fit the quote you mention. This is about “kernel invocation” (and, in general, host-side management of the device), and possibly device management when it’s coupled to the kernel structure IMO (like gpu.block_id and such), but not about arbitrary computation. We can of course change the description you quote instead, if we prefer to go this route.
The quote/doc calls for middle-level abstractions; gpu.subgroup*** kind of fits that bar.
I don’t see why we’re treating gpu.subgroup*** differently from gpu.shuffle, gpu.rotate, or gpu.all_reduce: from a codegen perspective they’re equivalent; all of them lower to GPU-specific LLVM intrinsics.
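For illustration, a sketch (%val, %offset, and %width are assumed to be defined):

```mlir
// Both are portable SIMT primitives that lower to target intrinsics
// (e.g. nvvm.shfl.sync on NVIDIA).
%shfl, %valid = gpu.shuffle xor %val, %offset, %width : f32
%red = gpu.all_reduce add %val {} : (f32) -> (f32)
```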
We also have convenience ops that don’t necessarily map to GPU intrinsics but improve ergonomics; where will they go? (See the sketch after this list.)
gpu.dynamic_shared_memory
gpu.warp_execute_on_lane_0
gpu.printf
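For instance, a doc-style sketch of the first of these (the view shape is made up):

```mlir
// Get the dynamically sized workgroup (shared) memory and carve a typed view out of it.
%shmem = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
%c0 = arith.constant 0 : index
%tile = memref.view %shmem[%c0][]
    : memref<?xi8, #gpu.address_space<workgroup>>
      to memref<64xf32, #gpu.address_space<workgroup>>
```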
Maybe we should split the GPU dialect into two dialects: one for the host/runtime side (launching kernels) and one for the device-side primitives.
Just to be clear: my first option was creating a new dialect for them, because I also know people using the ops. Removal would only happen if no one showed interest, but somebody already showed interest, so a new dialect is the only option.
Now, the reason for removal from gpu is that there’s nothing about those ops that is inherent to GPUs. You could move them to a sparse dialect, add a sparse-to-gpu-runtime conversion, and nobody would see a difference in behavior. Further, if they were moved to their own dialect, someone could explore non-GPU lowering paths for these ops. Therefore, IMO gpu is not the best place to host them, and a dedicated dialect would be better.
Even if we view them as target independent ops, I’d argue vector.contract and other vector ops are better suited for the role here.
I agree that we could expand the definition of what is accepted in the dialect. But, in that case I’d also argue for removal of these ops with the above arguments.
Now, the idea of splitting the dialect in two has been suggested to me before, and I’m starting to like it. However, I’d argue we should clean up the dialect first, and then think about splitting it.
Back at the time, @herhut, @thomasjoerg, and I were just starting to bring up GPU support in MLIR, so everything was kind of in flux. We had maybe 5 dialects. Even then, I think we discussed the idea of separating the “kernel” dialect, which would contain things like thread id and subgroup operations, from the “host” dialect, which would contain things like launch and library call abstractions. We didn’t do that because we only needed a handful of operations.
I’m rather in favor of moving/removing ops that are no longer needed, but let’s clearly state the problem this is trying to solve. What is it?
Thanks for the proposal. I do agree that the gpu dialect could use some cleanup, but IMO the biggest issue we have right now is that the dialect covers the runtime aspect (how to launch kernels), the host/device separation, and the SIMT primitives used in device code all at once. From the dialect docs, IIRC, it’s not clear whether the device primitives can be used outside of gpu modules.
My suggestion would be to think about two issues here before listing ops that we want removed:
Revisit the dialect docs/charter and decide if the current scope works for us or not
Agree on op inclusion criteria, especially for device primitives
I think it would make sense to iterate on the docs/charter first and make this logical separation clearer, and potentially move the device primitives to their own dialect (e.g., simt, or have a new dialect that covers launching of kernels and host/device code separation). I think this is in line with what @grypp and @ftynse mentioned higher up the thread.
I think the key aspect of device primitives is that they lower to both LLVM intrinsics and SPIR-V ops, but are higher-level, which makes it easier to connect them to other dialects. For example, gpu.subgroup_reduce allows arbitrary scalar and vector types, and provides patterns to decompose these into reductions that are natively supported by the hardware (usually i32/f32 only).
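A sketch of what that enables (%v is assumed to be defined; the vector width is made up):

```mlir
// A reduction over a vector type with no single hardware instruction; upstream
// patterns can decompose it into natively supported scalar reductions.
%sum = gpu.subgroup_reduce add %v : (vector<4xf16>) -> (vector<4xf16>)
```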
RE op removal, I don’t see the motivation for removing the sparse and opaque subgroup MMA ops. Correct me if I’m wrong, but I think there’s little maintenance burden in keeping them around, even if they are not widely used? If we were to remove some ops, I think the first step would be to agree on more precise criteria for op inclusion in general. For example, we could say that an op has to have a reasonable lowering path for both LLVM and SPIR-V and be supported by at least two GPU vendors/implementations; otherwise, op inclusion would need some other strong motivation.
That needs to be done as well. As I said, this is the beginning of a cleanup process for the dialect. Regardless of the charter, I’d still argue for removing the above ops from the GPU dialect, on the basis that they are not intrinsically related to GPUs, they are legacy ops, or they don’t seem to be used by many targets.
The end goal is to remove the experimental tag from the dialect. For me, the first step is making sure the dialect has a cohesive meaning, not just a bag of ops. To implement that first step, I decided to begin by removing ops that clearly shouldn’t be there.
Making sure the ops are being used: If there’s dead or unmaintained code, it should be removed.
Code organization: some of the ops are likely better hosted in a different dialect, as their semantics are only loosely related to GPUs. For example, all the sparse ops could form a dialect of their own.
If an op is used primarily by just one target, it’s likely better suited to be placed in a different dialect.
That needs to be done. But it’s out of scope for this particular RFC. For this RFC, we are using the current definition of the dialect, which includes utilities to execute code as well as device primitives.
Further, at the risk of repeating myself, I think that regardless of what the new charter looks like, I’d still argue for the removal of the above ops from GPU-related dialects.
Dead or unutilized code is always a maintenance burden, either now or in the long run.
For the sparse ops, I genuinely believe they are better placed in a different dialect, as there’s nothing about them that is intrinsic to GPUs.
Now, if the consensus is that we need to clarify semantics first and declare this RFC moot until then, that’s okay. However, I think these can be separate conversations.
It would help if this RFC defined clear, objective criteria for deprecating ops; without shared criteria, the list of ops proposed for removal seems arbitrary.
In parallel, we can explore splitting the dialect as an orthogonal RFC, assuming there’s consensus?
Such an alternative lowering of the GPU-specific part would make less sense, since the MLIR sparsifier already provides many ways to implement alternative lowering mechanisms elsewhere. The gpu part was specifically introduced to bridge the gap to a high-performance implementation such as cuSPARSE.
But, since removal is now off the table, I am perfectly okay with moving this to a more sparse-gpu-specific place.
As an XeGPU rep, I just wanted to share a few quick thoughts on the GPU dialect cleanup:
I think the cleanup is needed and really appreciate the initiative. Making the dialect more cohesive will definitely help maintainability and clarity.
AFAIK, the subgroup_mma_* ops were mainly used to construct target-agnostic code sequences for GEMM. They are useful but not critical for high-performance compiler development. The issue I see is that the current lowering from the vector dialect is not SIMT-flavored, while the lowering for subgroup_mma_extract/insert must be implemented in a SIMT flavor (maybe this is why they are missing). This seems to me to indicate that the operation semantics require revisiting.
Also, this seems like a good chance to revisit naming conventions. Right now there’s a mix of subgroup_, warp_, and no prefixes, which is a bit inconsistent.
The “PTX→cubin” compilation aspect of this “MLIR-to-execution” model keeps popping up as an issue during practical use of MLIR. There’s more than one way to do it, and MLIR’s idea of how to compile PTX→cubin and then launch it does not always match how the application that uses MLIR wants it done.
Specifically:
ptxas may not be available at runtime, at all.
ptxas found by MLIR may not be the right ptxas to use (e.g. we want to use ptxas from the CUDA version the app was built with, not the one installed in the default location on the system). Even if this ptxas works, it makes things non-hermetic.
It’s arguably a problem with how MLIR is used rather than with MLIR itself, but it keeps coming back, because users want their stuff to just work, and MLIR provides a convenient way that mostly works, until it doesn’t.
It may be useful to change the “we’ll do everything, at once” promise into something more granular that clearly separates the MLIR functionality into generic compilation to PTX, PTX→cubin, and execution phases. The compilation-to-PTX part must be buildable all by itself without pulling in any runtime dependencies. Providing the complete MLIR-to-launch pipeline is fine, but using it should be a conscious choice on the user’s end rather than the default, IMO. Structuring the MLIR-to-PTX part as an intentionally independent (from CUDA SDK bits) reusable component would hopefully prevent unintentionally resurrecting extra runtime dependencies for applications that do not want them.
I guess the TL;DR of my ask is “please consider deployment logistics”.
Just to toss in my two cents: when it comes to the device-level abstractions (so, uncontroversially, something like gpu.thread_id), I’d recommend the dialect adopt a two-target rule: that is, ops in (the device-side part of) gpu should have lowerings to at least two targets to justify their existence as a multi-vendor abstraction that needs this sort of general wrapper.
(And yes, SPIR-V + direct lowering to whoever proposed the relevant SPIR-V extension counts, because there you’re at least expecting other people to pick up said extension)
So under this framing, some of the subgroup MMA ops could exist just fine (in that they’d be wrappers around cooperative matrix + the Nvidia primitives) but I’m less sure about the sparse ones.
I haven’t tested the NVIDIA bits in a while, and there’s always room for cleanup. However, with gpu-module-to-binary{toolkit="/path/to/toolkit"} you can pass any toolkit that you want at runtime; it doesn’t have to be the default.
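For example, a hypothetical mlir-opt invocation passing the option quoted above (the input file name is made up):

```
mlir-opt app.mlir --gpu-module-to-binary='toolkit=/path/to/toolkit'
```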
There’s also the option to stop compilation at PTX and let the driver compile PTX->cubin at runtime; this is how the buildbots do it.
You can also use -DMLIR_ENABLE_NVPTXCOMPILER=ON and ship a compiler that uses the static NVPTX compiler library, in which case there’s never a runtime dependency on ptxas. I think @mehdi_amini also added something to embed libdevice in the compiler, so there are no dependencies at any point at runtime.
Having said that, docs can always be improved, and it has been on my mind to add something like “stop compilation at PTX” and “resume compilation from PTX and stop at cubin”, to make things more fine-grained, as you are proposing.
I’ll also say that the whole gpu runtime needs to be reimagined, which would pair well with splitting the gpu dialect into primitives and runtime. But all of that is out of scope for this RFC. However, I’ll add it to the list of TODOs.
I generally agree with this, however, let’s keep that conversation for a future RFC on the evolution of the dialect.
My main issue with the MMA ops is that I don’t know who is using them. AFAIK, neither NVIDIA nor AMD uses them; they were added many years ago, and we have come up with better abstractions since then. So I would prefer to remove them, and reintroduce them in the future if needed.