[RFC] Add NV-GPU dialect (HW specific extension of GPU dialect for Nvidia GPUs)

@herhut Since this thread seems to have split into a few directions (and since I’m not entirely sure what your question was), here’s my attempt to summarize the situation our MLIR-based code generator is in:

  • In our final output (LLVM IR), we want to use certain AMD-specific intrinsics. The two that have come up here are the mfma intrinsics (which are SIMT matrix multiplication instructions) and the buffer load/store/atomic intrinsics. (The newer forms of the buffer intrinsics are one of the changes in D122765, “[MLIR][AMDGPU] Add AMDGPU dialect, wrappers around raw buffer intrinsics”.)
  • These intrinsics do not generally have signatures that fit well into the MLIR model. For example, the buffer intrinsics take a four-word “resource descriptor” to specify the buffer to operate on, its size, and so on, while MLIR uses the memref type for this purpose. The mfma intrinsics, meanwhile, have one intrinsic per supported size, whereas in MLIR we would have a single generic mfma op with the choice of operation specified as an attribute.
  • We have historically put such wrappers into the gpu dialect within our copy of MLIR, going off the apparent precedent set by ops such as gpu.subgroup_mma_compute
  • We are seeking to minimize the number of patches we maintain on top of MLIR and to contribute parts of our codebase that aren’t specific to our code generator upstream if they’d be a good fit there.
  • While we’ve been putting our intrinsic wrappers in gpu, which dialect they end up in isn’t particularly important to us
  • However, since these wrappers will have general utility for others looking to generate code for AMD and since they’re closely tied to the *ToROCDL conversions already present upstream, we would like to upstream them
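To make the signature mismatch concrete, here is a rough sketch of the kind of lowering involved. The exact op names, attribute spellings, and types here are illustrative rather than the actual upstream syntax; the point is that one generic wrapper op over standard MLIR types stands in for a family of size-specific intrinsics:

```mlir
// Hypothetical wrapper op: a single generic mfma, with the supported
// size selected by attributes, operating on ordinary MLIR vector values.
%d = amdgpu.mfma %a * %b + %c
       { m = 32 : i32, n = 32 : i32, k = 2 : i32 }
       : vector<2xf32>, vector<2xf32>, vector<32xf32>

// During the *ToROCDL conversion, the attributes select the matching
// size-specific LLVM intrinsic (one intrinsic exists per m/n/k combination),
// roughly along the lines of:
%d2 = rocdl.mfma.f32.32x32x2f32 %a2, %b2, %c2, %z, %z, %z
       : (f32, f32, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
```

The buffer intrinsics pose the analogous problem in the other direction: the wrapper op can take a memref, but the conversion must materialize the raw four-word resource descriptor the intrinsic actually expects.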

Since the GPU dialect is, as you said, meant to contain things everyone could implement, I personally think it makes sense to have nvgpu, amdgpu, and so on to hold vendor-specific wrappers, on the basis that, if an op goes into gpu, there should be more than one vendor/target that could realistically support it. For example, even though gpu.printf only had lowerings for AMD platforms when it landed, other targets, like SPIR-V, have facilities for implementing printf that could be used to write a plausible lowering.

(also @whchung since we’re the two folks most likely to be dealing with the results of this discussion on our side)

I was just trying to understand the motivation for carving out a separate dialect now, and the criterion that separates the gpu dialect from the nvgpu dialect. I see the patches have already landed. SGTM - I’ll be happy to contribute to reviewing here.