[RFC] Add NV-GPU dialect (HW specific extension of GPU dialect for Nvidia GPUs)

Note that the GPU dialect includes both host-side and device-side operations. This proposal only considers the device-side operations.

Goal of the dialect

The GPU dialect is used to target GPUs while being agnostic to APIs and targets. It is therefore designed to abstract away hardware-specific details.

To target CUDA/NVIDIA hardware we go through the GPU and Vector dialects and lower to the NVVM dialect. This means hardware-specific information can only be exposed at the NVVM level. In some cases this leaves a large abstraction gap, making the lowering non-trivial and preventing us from doing higher-level transformations based on hardware details.

This proposal is for adding a new NVIDIA/PTX dialect that would directly model NVVM intrinsics using memref and n-D vector types before lowering to the NVVM/LLVM dialects and types. This is analogous to the existing AMX dialect, but for NVIDIA GPUs.

First milestone/use case

The first use case for this dialect would be to add support for the ld.matrix and mma.sync intrinsics, which are used respectively to load a warp-distributed matrix and to compute a matrix multiply-accumulate.
These operations are similar to the existing SubgroupMmaComputeOp and SubgroupMmaLoadMatrixOp; however, they would explicitly define how the matrices map onto the warp lanes.
This removes the need for an opaque matrix type and allows generating more efficient memory access patterns.

Example

The following patch provides an example of what the newly added ops would look like.
The new ld.matrix op would represent a warp-level operation, using memref and n-D vector types to model a warp-distributed access that can then be lowered to nvvm.ldmatrix:

def GPU_MmaLdMatrixOp : GPU_Op<"mma.ldmatrix", 
                                [MemoryEffects<[MemRead]>]> {
  let description = [{
  The `gpu.mma.ldmatrix` op represents loading a matrix fragment from
  memory. The load source and result type must be compatible with lowering 
  to the `nvvm.ldmatrix` instruction. This op is meant to represent 
  the distributed version of a `vector.transfer_read` as an intermediate 
  step between lowering from `vector.transfer_read` to `nvvm.ldmatrix`.
  
  Example:

  ```mlir
  gpu.mma.ldmatrix %shm_buffer[%c0, %c0] {transpose = false, numTiles = 4 : i32}
      : memref<16x16xf16, 3> -> vector<4x2xf16>
  ```
  }];                                

  let arguments = (ins Arg<AnyMemRef, "", [MemRead]>:$srcMemref,
                      Variadic<Index>:$indices, BoolAttr:$transpose,
                      I32Attr:$numTiles);
  let results = (outs AnyVector:$res);
  let assemblyFormat = [{
    $srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)
  }];                      
}

FYI @nicolasvasilache @MaheshRavishankar @herhut @mehdi_amini @bondhugula

Thanks for sharing @ThomasRaoux.

The vector abstractions in MLIR specifically carve out room for this type of hardware-specific dialects and extensions; it is great to see a push in this direction for GPUs!

The memory / n-d vector interface is the perfect place to encode abstractions related to distribution, load/store and packing/unpacking with reshape.

One question that comes to mind is whether you see some of these distribution abstractions potentially also percolating up to the vector dialect?

Big +1 from me.

I’m a bit confused here: are you really adding a new dialect or just carving out a set of operations in the gpu dialect that you’ll prefix with mma. to somehow conceptually “namespace” them?

I think so. This is still in progress, but I’m trying to build some distribution abstractions. The distribution part is really at the border between the GPU and Vector dialects, since warp distribution only makes sense for GPUs.

The proposal is to add a new dialect. A new namespace where target-specific ops can be added would be enough if you think that is better, but I haven’t seen any precedent for it. What part is unclear?

Could you explain a bit more what the specific differences would be? The ops you mention have a single lowering currently and I wonder why they cannot be evolved to fit your needs.

The difference is that the mapping of which warp lane holds which part of the matrix is explicitly defined. This allows generating either standard vector.load/vector.store ops for the data or using a special ld.matrix operation when possible.
Exposing this early has several advantages:

  • It allows potentially doing more complex memory access patterns and not necessarily loading a 2D block like SubgroupMmaLoadMatrixOp does. This is critical to be able to avoid shared memory bank conflicts.
  • The logic to pick the right set of load/store operations can become pretty complex; this lets us separate it from the lowering to LLVM. Right now the current path relies on wmma, which picks a simple but suboptimal lowering of the memory accesses (it doesn’t take advantage of vector loads or ld.matrix).
  • One extra side effect is that it makes fusing the mma ops with other operations easier, since we don’t need to deal with opaque types anymore.

Doing more in the conversion to NVVM is possible, but the logic there would become pretty complex, and it wouldn’t give us a chance to run transformations that try to avoid bank conflicts before lowering to the NVVM dialect.
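To make this concrete, here is a rough, hedged sketch (not part of the patch) contrasting the two ways the same warp-distributed fragment could be materialized; the nvgpu.ldmatrix form mirrors the example given later in this thread, and %row/%col stand in for per-lane index arithmetic that is omitted here:

// Plain per-lane loads: %row and %col are computed per lane from the fragment
// layout; four such vector<2xf16> loads per lane assemble the fragment.
%part = vector.load %sm[%row, %col] : memref<?x?xf16, 3>, vector<2xf16>

// Specialized op: a single ldmatrix when the layout matches what the hardware
// instruction produces.
%frag = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
    memref<?x?xf16, 3> -> vector<4x2xf16>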

OK! The part that wasn’t clear to me was that your example showed gpu.mma.ldmatrix (dialect names can’t contain . IIRC).

Ah yes, my bad: I pasted the version of the draft that currently lives in the GPU dialect, but it would indeed use a different dialect prefix.

Since there doesn’t seem to be any objections so far, I sent a patch introducing the new dialect for review:
https://reviews.llvm.org/D123266

This part (“preventing us from doing higher-level transformation”) isn’t entirely clear to me. An example here would help. In the past, we’ve added wmma-level ops to the gpu dialect (although these were specific to NVIDIA GPUs) for the lack of an nvgpu dialect. These ops, as you know, use memrefs and GPU dialect-specific types, and it has so far been considered okay to add certain hardware-specific ops to the gpu dialect itself: the key is that these ops still worked on neutral (MLIR builtin) types, even though their “actions” were GPU-specific (NVIDIA or AMD). Examples for general reference:

%C = gpu.subgroup_mma_load_matrix %22[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
...
%R = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">

PTX in the name would appear to be out of place for a dialect like this. It looks like you want to have a specialized GPU dialect for NVIDIA GPUs: nvgpu instead?

It’ll be good to have more discussion here before we create it. I don’t think nvptx is the right name here (being the name of the final LLVM backend); there is a big jump in abstraction through GPU → nvvm → LLVM → NVPTX.

One example is supporting a way to access swizzled shared memory like cutlass does (slides 37-48).

As mentioned, this also allows decoupling the logic of how to bring in the data (potentially using the ldmatrix op) from the lowering to LLVM.

My understanding was that we were okay with the mma ops because they didn’t require exposing implementation details. Therefore we can imagine that they could be used to abstract AMD matrix core operations or those of other vendors. The representation is also compatible with SPIR-V cooperative matrices.

Here it feels like we are crossing a line, as we need to expose how the hardware is expected to map the data onto the warp lanes, so it is unlikely that any other target could reuse these ops.

The types are still generic MLIR types; this doesn’t change, indeed. Are you saying those ops should go directly in the GPU dialect? If so, why do you think that is better? I have gotten pushback in the past when some ops were too specific to a single target.

Sure, nvgpu sounds fine to me.

I can change the name. Do you have any fundamental concerns? What else do you think needs to be discussed? I believe this is consistent with how other targets are handled on the CPU side, and the only alternative would be to add those target-specific ops, with all the HW-specific details, into the GPU dialect.

Can you explain this aspect a little more? As it is proposed, I see that we only have an operation to load a matrix but no way to compute on these. The new formulation is incompatible with the gpu dialect ops, as it uses a different type.

So how would you model the computation itself? By exposing the result of the load as a vector, the IR creates the impression that you can actually access it like any regular vector. A similar approach is taken in the AMX dialect with its tiles, so maybe we should not treat this as related to the gpu dialect with its aim to abstract over hardware and instead make it part of the vector dialect family.

I do not see how this is exposed in the IR. How is this conceptually different from optimizing memory layout for your cache hierarchy when tiling, even though the ops do not expose the specifics of the hardware you are targeting?
Is the goal to use a specific operation so that it is visible which target the IR is being compiled for by choice of operation?

I added only one op as a placeholder to keep the patch small. The full flow of how it would work is in the patch I mentioned in the original post. There would be another compute op taking vectors and returning vectors.

There are more details in the patch, but basically things would look like:

%a = gpu.ldmatrix ... : vector<4x2xf16>
%b = gpu.ldmatrix ... : vector<2x2xf16>
%c = vector.load ... 
%d = gpu.mma.sync(%a, %b, %c) {mmaShape = [8, 8, 4]} : vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16> -> vector<2x2xf16>
vector.store %d 

Correct, the result can be used as a regular vector. I agree this dialect is much closer to a Vector kind of dialect; I tried to make that explicit in the comments mentioning both the GPU and Vector dialects. How would you explicitly make it part of the vector dialect family? Note that the reason it is still also a GPU kind of dialect is that the vectors are “SIMT-like”, meaning they are distributed onto the warp lanes.

The goal is to be able to load a matrix by assembling data from different parts of memory. This isn’t possible with wmma.load, which requires loading a 2D shape with a constant stride.

ldmatrix just exposes a special way to distribute the loaded data onto the warp lanes that allows generating fewer memory accesses. This is why it is being used. Note that the same thing can be done with multiple vector.load ops, and in some cases that’s what would be done.

Since this feels relevant: I currently have D122765 ([MLIR][AMDGPU] Add AMDGPU dialect, wrappers around raw buffer intrinsics) pending review, which adds some rather AMD-specific ops alongside the GPU dialect. And (once it’s been cleaned up) we’re fixing to send the op we’ve added for AMD’s mfma instructions (which is an MLIR wrapper around an LLVM intrinsic) upstream as well.

I think there’s a reasonable argument for going gpu.nvvm and gpu.rocm and so on when dealing with ops that are, in practice, vendor-specific, so that it’s clear that’s what they are. The LLVM dialect already has a bunch of vendor-specific extension dialects, for example.

Interesting, thanks for sharing. What kind of op lowering would generate the mfma operations?

Yes, it can be in the same dialect with a namespace or in a different dialect. I think a different dialect is more aligned with what has been done on the CPU side with AMX, X86Vector, etc., but I don’t think it makes a big difference.

@ThomasRaoux These mfma operations are instructions for matrix multiplication built into certain AMD GPUs. We emit them in our kernel generator when we’re performing GEMM on hardware that supports the instructions.

So you are planning to have an mfma op in the GPU dialect as well as in the rocdl dialect?

@herhut, to clarify this part further: the ops in this dialect would use a SIMT representation. That means the vector type would represent a per-thread type, even though the ops would be warp-level operations. This is similar to the modeling of the existing gpu.shuffle op.

For example, the op below (from the patch) loads a vector<4x2xf16>, so each lane will hold 8 values, and all the lanes collectively load a 16x16xf16 matrix that can be used by an mma.sync op.

nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
    memref<?x?xf16, 3> -> vector<4x2xf16>
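
As a rough sanity check on the shapes (my arithmetic, not from the patch): 32 lanes × 8 f16 elements per lane (vector<4x2xf16>) = 256 elements, which is exactly the 16 × 16 tile consumed by mma.sync.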

I don’t read any objections to having a target-specific dialect so far. If the part above is clear and there are no other objections, I’ll move forward once the patch is approved.

Thanks for the explanation. I missed that these are still SIMT.

My thinking w.r.t. hardware-specific operations is that they make sense if they make it easier to model a certain property and if there is a generic operation at some higher-level IR that makes it easy to target this operation.

If I understand this right, this operation would be generated from a SIMD-level vector abstraction in the vector dialect and, if we wanted to target e.g. AMD hardware, we would implement the corresponding lowering at that level.

The general gpu dialect instruction is still useful (and we should keep it) for code that does not come from the vector abstraction but wants to remain platform-independent.

@krzysz00 do you plan to add the corresponding AMD instructions and potentially vector lowering?

Yes exactly.