I have some questions about the MMA operations.

Here is my code.

func.func @main() {
  %cst0 = arith.constant 0 : index 
  %cst1 = arith.constant 1 : index 
  %cst2 = arith.constant 2 : index
  %cst3 = arith.constant 3 : index
  %cst10 = arith.constant 10 : index  
  %cst32 = arith.constant 32 : index 
  %cst16 = arith.constant 16 : index 
  %f0 = arith.constant 0.0 : f16
  %f1 = arith.constant 1.0 : f16
  %f2 = arith.constant 2.0 : f16
  %f2f32 = arith.constant 2.0 : f32
  %input0 = memref.alloc() : memref<3x3xf16>
  %input1 = memref.alloc() : memref<3x3xf16>
  %output0 = memref.alloc() : memref<3x3xf32>
  %input_cast0 = memref.cast %input0 : memref<3x3xf16> to memref<*xf16>
  %input_cast1 = memref.cast %input1 : memref<3x3xf16> to memref<*xf16>
  %output_cast0 = memref.cast %output0 : memref<3x3xf32> to memref<*xf32>
  scf.for %i = %cst0 to %cst3 step %cst1 {
    scf.for %j = %cst0 to %cst3 step %cst1 {
      memref.store %f2, %input0[%i, %j] : memref<3x3xf16>
    }
  }   
  scf.for %i = %cst0 to %cst3 step %cst1 {
    scf.for %j = %cst0 to %cst3 step %cst1 {
      memref.store %f2, %input1[%i, %j] : memref<3x3xf16>
    }
  }   

  scf.for %i = %cst0 to %cst3 step %cst1 {
    scf.for %j = %cst0 to %cst3 step %cst1 {
      memref.store %f2f32, %output0[%i, %j] : memref<3x3xf32>
    }
  }
  call @printMemrefF32(%output_cast0) : (memref<*xf32>) -> ()
  gpu.host_register %input_cast0 : memref<*xf16>  
  gpu.host_register %input_cast1 : memref<*xf16>
  gpu.host_register %output_cast0 : memref<*xf32>
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %cst1, %grid_y = %cst1, %grid_z = %cst1)
             threads(%tx, %ty, %tz) in (%block_x = %cst32, %block_y = %cst1, %block_z = %cst1) {
    %A = gpu.subgroup_mma_load_matrix %input0[%cst0, %cst0] {leadDimension = 3 : index} : memref<3x3xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
    %B = gpu.subgroup_mma_load_matrix %input1[%cst0, %cst0] {leadDimension = 3 : index} : memref<3x3xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
    %C = gpu.subgroup_mma_load_matrix %output0[%cst0, %cst0] {leadDimension = 3 : index} : memref<3x3xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
    %D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16,"AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">
    gpu.subgroup_mma_store_matrix %D, %output0[%cst0, %cst0] {leadDimension = 3 : index} : !gpu.mma_matrix<16x16xf32, "COp">, memref<3x3xf32>
    gpu.terminator
  }
  call @printMemrefF32(%output_cast0) : (memref<*xf32>) -> ()
  memref.dealloc %input0 : memref<3x3xf16>
  memref.dealloc %input1 : memref<3x3xf16>
  memref.dealloc %output0 : memref<3x3xf32>
  return
}

func.func private @printMemrefF32(%ptr : memref<*xf32>)

When I run the code with mlir-cpu-runner, it shows:

Unranked Memref base@ = 0x5627f120d310 rank = 2 offset = 0 sizes = [3, 3] strides = [3, 1] data = 
[[2,   2,   2], 
 [2,   2,   2], 
 [2,   2,   2]]
Unranked Memref base@ = 0x5627f120d310 rank = 2 offset = 0 sizes = [3, 3] strides = [3, 1] data = 
[[22.0599,   -20670.1,   214.967], 
 [-20674,   211.054,   -41366.1], 
 [403.841,   -41258,   364.34]]
free(): invalid next size (fast)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ../../llvm/build/bin/mlir-cpu-runner -entry-point-result=void -shared-libs=../../llvm/build/lib/libmlir_runner_utils.so -shared-libs=../../llvm/build/lib/libmlir_cuda_runtime.so -shared-libs=../../llvm/build/lib/libmlir_async_runtime.so
 #0 0x00005627ed567154 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x00005627ed56433b SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f6f54769a00 (/usr/lib/libc.so.6+0x38a00)
 #3 0x00007f6f547b949c (/usr/lib/libc.so.6+0x8849c)
 #4 0x00007f6f54769958 raise (/usr/lib/libc.so.6+0x38958)
 #5 0x00007f6f5475353d abort (/usr/lib/libc.so.6+0x2253d)
 #6 0x00007f6f547ad63e (/usr/lib/libc.so.6+0x7c63e)
 #7 0x00007f6f547c322c (/usr/lib/libc.so.6+0x9222c)
 #8 0x00007f6f547c515a (/usr/lib/libc.so.6+0x9415a)
 #9 0x00007f6f547c79f3 cfree (/usr/lib/libc.so.6+0x969f3)
#10 0x00007f6f5629743e 
#11 0x00007f6f5629745d 
#12 0x00005627eda6fa5c compileAndExecute((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**) JitRunner.cpp:0:0
#13 0x00005627eda6ffd1 compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig) JitRunner.cpp:0:0
#14 0x00005627eda7401b mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (../../llvm/build/bin/mlir-cpu-runner+0x90901b)
#15 0x00005627ed4d40fc main (../../llvm/build/bin/mlir-cpu-runner+0x3690fc)
#16 0x00007f6f54754290 (/usr/lib/libc.so.6+0x23290)
#17 0x00007f6f5475434a __libc_start_main (/usr/lib/libc.so.6+0x2334a)
#18 0x00005627ed5504f5 _start /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:117:0
make: *** [makefile:62:gpu-mma-run] Error 134

I don't know what the problem is. I have some questions, and I hope someone can help me. Thanks!

  • gpu.mma_matrix
    I guess the dimensions of gpu.mma_matrix can only be 16x16, because everything else I tried failed, but I'm not sure.
  • the attribute leadDimension
    Although I have read the official MLIR documentation, I still have trouble understanding the meaning of leadDimension.
  • do the MMA operations need to be nested in gpu.launch?
    I tested using the MMA operations without gpu.launch (following llvm-project/mlir/test/Integration/GPU/CUDA/TensorCore at main · llvm/llvm-project · GitHub), but I couldn't lower them, so I still want to ask this question.
    I really want to know how to use the MMA operations properly. Thanks!

If you're using convert-gpu-to-nvvm in your compilation pipeline, then these operations model CUDA WMMA intrinsics. Compilation should fail unless you use a size that matches an available CUDA WMMA shape.
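In your case, 16x16x16 with f16 operands and an f32 accumulator is one of the standard WMMA shapes, which is presumably why only 16x16 worked for you. It also explains the crash: the 16x16 loads read, and the 16x16 store writes, well past the end of your 3x3 allocations, which is consistent with the garbage values and the free(): invalid next size heap corruption in your output. As for leadDimension, it is the distance, in elements, between the starts of consecutive rows of the source memref, so for a contiguous 16x16 buffer it is 16. Here is a minimal, untested sketch of a shape-correct kernel (the names and structure are mine, not your code):

// A sketch, not tested: 16x16 buffers so a full 16x16 tile fits in bounds.
func.func @wmma_16x16(%a : memref<16x16xf16>, %b : memref<16x16xf16>, %c : memref<16x16xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c32 = arith.constant 32 : index
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
             threads(%tx, %ty, %tz) in (%block_x = %c32, %block_y = %c1, %block_z = %c1) {
    // leadDimension = row stride of the underlying memref: 16 elements here.
    %A = gpu.subgroup_mma_load_matrix %a[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
    %B = gpu.subgroup_mma_load_matrix %b[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
    %C = gpu.subgroup_mma_load_matrix %c[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
    %D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">
    gpu.subgroup_mma_store_matrix %D, %c[%c0, %c0] {leadDimension = 16 : index} : !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
    gpu.terminator
  }
  return
}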

If you use this operation for anything other than a trivial example, it is going to be very verbose. There is a VectorToGPU (convert-vector-to-gpu) pass that allows you to lower from vector dialect operations to gpu and nvgpu MMA intrinsics. See for example the tests in test/Conversion/VectorToGPU.
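As a sketch of what those tests look like (based on my reading of vector-to-mma-ops.mlir, so treat the details as approximate): the pass wants a row-major matmul where A is indexed (m, k), B is (k, n), C is (m, n), and the reduction dimension k is the last iterator:

#mapA = affine_map<(m, n, k) -> (m, k)>
#mapB = affine_map<(m, n, k) -> (k, n)>
#mapC = affine_map<(m, n, k) -> (m, n)>
func.func @matmul(%a : memref<16x16xf16>, %b : memref<16x16xf16>, %c : memref<16x16xf16>) {
  %c0 = arith.constant 0 : index
  %f0 = arith.constant 0.0 : f16
  %A = vector.transfer_read %a[%c0, %c0], %f0 {in_bounds = [true, true]} : memref<16x16xf16>, vector<16x16xf16>
  %B = vector.transfer_read %b[%c0, %c0], %f0 {in_bounds = [true, true]} : memref<16x16xf16>, vector<16x16xf16>
  %C = vector.transfer_read %c[%c0, %c0], %f0 {in_bounds = [true, true]} : memref<16x16xf16>, vector<16x16xf16>
  // The reduction (k) must be the last iterator for the pass to match.
  %D = vector.contract {indexing_maps = [#mapA, #mapB, #mapC],
                        iterator_types = ["parallel", "parallel", "reduction"],
                        kind = #vector.kind<add>}
       %A, %B, %C : vector<16x16xf16>, vector<16x16xf16> into vector<16x16xf16>
  vector.transfer_write %D, %c[%c0, %c0] {in_bounds = [true, true]} : vector<16x16xf16>, memref<16x16xf16>
  return
}

The transfer reads and writes feeding the contraction become gpu.subgroup_mma_load_matrix / gpu.subgroup_mma_store_matrix, and the contraction itself becomes gpu.subgroup_mma_compute.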

You don't have to use the GPU dialect abstractions. For example, you could use a plain func.func to represent your device code. However, you would then need a different mlir-opt command line than the one given in the test you linked to. That's probably why you can't lower. I'm guessing you are using (from the test you linked)
mlir-opt [filename] -gpu-kernel-outlining -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin{chip=sm_70})'. Try playing with it to see what each pass is doing. gpu-kernel-outlining takes everything in gpu.launch and outlines it into a gpu function nested in a gpu module. Then convert-gpu-to-nvvm, which does the MMA lowering, only applies to gpu.module, per the pass pipeline specification "-pass-pipeline='gpu.module(…)'".
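For what it's worth, the end-to-end flow in those integration tests looks roughly like the following (the file name, library paths, and chip=sm_70 are placeholders for your setup, and the first step only applies if you start from vector ops):

mlir-opt input.mlir -pass-pipeline='func.func(convert-vector-to-gpu)' \
| mlir-opt -gpu-kernel-outlining \
| mlir-opt -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin{chip=sm_70})' \
| mlir-opt -convert-scf-to-cf -gpu-to-llvm \
| mlir-cpu-runner \
    -shared-libs=path/to/libmlir_runner_utils.so \
    -shared-libs=path/to/libmlir_cuda_runtime.so \
    -entry-point-result=void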


Thanks!

Sorry, I've run into some new problems. I hope you can give me some guidance.

#map0 = affine_map<(i, j, k) -> (i, j)>
#map1 = affine_map<(i, j, k) -> (j, k)>
#map2 = affine_map<(i, j, k) -> (i, k)> 
func.func @main() {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index 
  %c4 = arith.constant 4 : index
  %f0 = arith.constant 0.0 : f16 
  %f2 = arith.constant 2.0 : f16
  %input0 = memref.alloc() : memref<4x4xf16>
  %input1 = memref.alloc() : memref<4x4xf16>
  %output0 = memref.alloc() : memref<4x4xf16>
  scf.for %i = %c0 to %c4 step %c1 {
    scf.for %j = %c0 to %c4 step %c1 {
      memref.store %f0, %output0[%i, %j] : memref<4x4xf16>
      memref.store %f2, %input0[%i, %j] : memref<4x4xf16>
      memref.store %f2, %input1[%i, %j] : memref<4x4xf16>
    }
  } 
  %input_v0 = vector.transfer_read %input0[%c0, %c0], %f0 : memref<4x4xf16>, vector<4x4xf16>
  %input_v1 = vector.transfer_read %input1[%c0, %c0], %f0 : memref<4x4xf16>, vector<4x4xf16>
  %output_v0 = vector.transfer_read %output0[%c0, %c0], %f0 : memref<4x4xf16>, vector<4x4xf16>
  %output_v1 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "reduction", "parallel"], kind = #vector.kind<add>} %input_v0, %input_v1, %output_v0 : vector<4x4xf16>, vector<4x4xf16> into vector<4x4xf16>
  vector.transfer_write %output_v1, %output0[%c0, %c0] : vector<4x4xf16>, memref<4x4xf16>
  return 
}

When I use
mlir-opt [filename] -pass-pipeline="func.func(convert-vector-to-gpu)", it shows:

#map0 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map1 = affine_map<(d0, d1, d2) -> (d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d2)>
module {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c4 = arith.constant 4 : index
    %cst = arith.constant 0.000000e+00 : f16
    %cst_0 = arith.constant 2.000000e+00 : f16
    %0 = memref.alloc() : memref<4x4xf16>
    %1 = memref.alloc() : memref<4x4xf16>
    %2 = memref.alloc() : memref<4x4xf16>
    scf.for %arg0 = %c0 to %c4 step %c1 {
      scf.for %arg1 = %c0 to %c4 step %c1 {
        memref.store %cst, %2[%arg0, %arg1] : memref<4x4xf16>
        memref.store %cst_0, %0[%arg0, %arg1] : memref<4x4xf16>
        memref.store %cst_0, %1[%arg0, %arg1] : memref<4x4xf16>
      }
    }
    %3 = vector.transfer_read %0[%c0, %c0], %cst {in_bounds = [true, true]} : memref<4x4xf16>, vector<4x4xf16>
    %4 = vector.transfer_read %1[%c0, %c0], %cst {in_bounds = [true, true]} : memref<4x4xf16>, vector<4x4xf16>
    %5 = vector.transfer_read %2[%c0, %c0], %cst {in_bounds = [true, true]} : memref<4x4xf16>, vector<4x4xf16>
    %6 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "reduction", "parallel"], kind = #vector.kind<add>} %3, %4, %5 : vector<4x4xf16>, vector<4x4xf16> into vector<4x4xf16>
    vector.transfer_write %6, %2[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf16>, memref<4x4xf16>
    return
  }
}

The vector dialect doesn't lower to the gpu dialect. I also tried some other things: I changed the name of the function and made the vectors and memrefs 16x16, and I referred to llvm-project/vector-to-mma-ops.mlir at 3512721d52b3380ea4d3f5b2419d0b7b072e7797 · llvm/llvm-project · GitHub,
but it still doesn't work. It looks like I haven't found the trick for lowering the vector ops. Thanks!

I found the reason for the lowering failure by reading VectorToGPU.cpp: I changed the order of the iterator_types, and then it lowered successfully.
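For reference, the pass only matches the canonical row-major matmul form, where the reduction dimension is the last iterator, so the working contraction looks something like this (only the changed lines, using the names from my code above):

#map0 = affine_map<(i, j, k) -> (i, k)>
#map1 = affine_map<(i, j, k) -> (k, j)>
#map2 = affine_map<(i, j, k) -> (i, j)>
...
%output_v1 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %input_v0, %input_v1, %output_v0 : vector<4x4xf16>, vector<4x4xf16> into vector<4x4xf16>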

Edit: looks like you found the reason as I was typing. Even if 4x4 converts, I think you will need to use 16x16 to lower it further.

Looks like you’re making progress.

The issue here is that your vector operations are not following the proper convention expected by the pass. The conversion is "all or nothing", meaning that if the entire chain of vector operations can't all convert to gpu operations, then none of them will. We could definitely improve the error reporting/feedback here, because even running with -debug isn't going to tell you why the slice of vector ops isn't valid (sorry).

You shouldn't feel sorry about this; I think you've helped me a lot. Thanks! I learned a lot by studying the gpu dialect, and in the future, when I run into this kind of problem, I can read the source code to see how it is implemented; this problem gave me a chance to do exactly that. I am interested in MLIR.