I have some questions about the MMA operations.

Here is my code.

func.func @main() {
  %cst0 = arith.constant 0 : index 
  %cst1 = arith.constant 1 : index 
  %cst2 = arith.constant 2 : index
  %cst3 = arith.constant 3 : index
  %cst10 = arith.constant 10 : index  
  %cst32 = arith.constant 32 : index 
  %cst16 = arith.constant 16 : index 
  %f0 = arith.constant 0.0 : f16
  %f1 = arith.constant 1.0 : f16
  %f2 = arith.constant 2.0 : f16
  %f2f32 = arith.constant 2.0 : f32
  %input0 = memref.alloc() : memref<3x3xf16>
  %input1 = memref.alloc() : memref<3x3xf16>
  %output0 = memref.alloc() : memref<3x3xf32>
  %input_cast0 = memref.cast %input0 : memref<3x3xf16> to memref<*xf16>
  %input_cast1 = memref.cast %input1 : memref<3x3xf16> to memref<*xf16>
  %output_cast0 = memref.cast %output0 : memref<3x3xf32> to memref<*xf32>
  scf.for %i = %cst0 to %cst3 step %cst1 {
    scf.for %j = %cst0 to %cst3 step %cst1 {
      memref.store %f2, %input0[%i, %j] : memref<3x3xf16>
    }
  }   
  scf.for %i = %cst0 to %cst3 step %cst1 {
    scf.for %j = %cst0 to %cst3 step %cst1 {
      memref.store %f2, %input1[%i, %j] : memref<3x3xf16>
    }
  }   

  scf.for %i = %cst0 to %cst3 step %cst1 {
    scf.for %j = %cst0 to %cst3 step %cst1 {
      memref.store %f2f32, %output0[%i, %j] : memref<3x3xf32>
    }
  }
  call @printMemrefF32(%output_cast0) : (memref<*xf32>) -> ()
  gpu.host_register %input_cast0 : memref<*xf16>  
  gpu.host_register %input_cast1 : memref<*xf16>
  gpu.host_register %output_cast0 : memref<*xf32>
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %cst1, %grid_y = %cst1, %grid_z = %cst1)
             threads(%tx, %ty, %tz) in (%block_x = %cst32, %block_y = %cst1, %block_z = %cst1) {
    %A = gpu.subgroup_mma_load_matrix %input0[%cst0, %cst0] {leadDimension = 3 : index} : memref<3x3xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
    %B = gpu.subgroup_mma_load_matrix %input1[%cst0, %cst0] {leadDimension = 3 : index} : memref<3x3xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
    %C = gpu.subgroup_mma_load_matrix %output0[%cst0, %cst0] {leadDimension = 3 : index} : memref<3x3xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
    %D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16,"AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">
    gpu.subgroup_mma_store_matrix %D, %output0[%cst0, %cst0] {leadDimension = 3 : index} : !gpu.mma_matrix<16x16xf32, "COp">, memref<3x3xf32>
    gpu.terminator
  }
  call @printMemrefF32(%output_cast0) : (memref<*xf32>) -> ()
  memref.dealloc %input0 : memref<3x3xf16>
  memref.dealloc %input1 : memref<3x3xf16>
  memref.dealloc %output0 : memref<3x3xf32>
  return
}

func.func private @printMemrefF32(%ptr : memref<*xf32>)

When I run the code with mlir-cpu-runner, it shows:

Unranked Memref base@ = 0x5627f120d310 rank = 2 offset = 0 sizes = [3, 3] strides = [3, 1] data = 
[[2,   2,   2], 
 [2,   2,   2], 
 [2,   2,   2]]
Unranked Memref base@ = 0x5627f120d310 rank = 2 offset = 0 sizes = [3, 3] strides = [3, 1] data = 
[[22.0599,   -20670.1,   214.967], 
 [-20674,   211.054,   -41366.1], 
 [403.841,   -41258,   364.34]]
free(): invalid next size (fast)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ../../llvm/build/bin/mlir-cpu-runner -entry-point-result=void -shared-libs=../../llvm/build/lib/libmlir_runner_utils.so -shared-libs=../../llvm/build/lib/libmlir_cuda_runtime.so -shared-libs=../../llvm/build/lib/libmlir_async_runtime.so
 #0 0x00005627ed567154 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x00005627ed56433b SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f6f54769a00 (/usr/lib/libc.so.6+0x38a00)
 #3 0x00007f6f547b949c (/usr/lib/libc.so.6+0x8849c)
 #4 0x00007f6f54769958 raise (/usr/lib/libc.so.6+0x38958)
 #5 0x00007f6f5475353d abort (/usr/lib/libc.so.6+0x2253d)
 #6 0x00007f6f547ad63e (/usr/lib/libc.so.6+0x7c63e)
 #7 0x00007f6f547c322c (/usr/lib/libc.so.6+0x9222c)
 #8 0x00007f6f547c515a (/usr/lib/libc.so.6+0x9415a)
 #9 0x00007f6f547c79f3 cfree (/usr/lib/libc.so.6+0x969f3)
#10 0x00007f6f5629743e 
#11 0x00007f6f5629745d 
#12 0x00005627eda6fa5c compileAndExecute((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**) JitRunner.cpp:0:0
#13 0x00005627eda6ffd1 compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig) JitRunner.cpp:0:0
#14 0x00005627eda7401b mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (../../llvm/build/bin/mlir-cpu-runner+0x90901b)
#15 0x00005627ed4d40fc main (../../llvm/build/bin/mlir-cpu-runner+0x3690fc)
#16 0x00007f6f54754290 (/usr/lib/libc.so.6+0x23290)
#17 0x00007f6f5475434a __libc_start_main (/usr/lib/libc.so.6+0x2334a)
#18 0x00005627ed5504f5 _start /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:117:0
make: *** [makefile:62: gpu-mma-run] Error 134

I don't know what the problem is. I have some questions; I hope someone can help me. Thanks!

  • gpu.mma_matrix
    I guess the dimensions of gpu.mma_matrix can only be 16x16, because everything else I tried failed, but I'm not sure.
  • the attribute leadDimension
    Although I have read the official MLIR documentation, I still have trouble understanding what leadDimension means.
  • Do the MMA operations have to be nested inside gpu.launch?
    Following llvm-project/mlir/test/Integration/GPU/CUDA/TensorCore at main · llvm/llvm-project · GitHub, I tested the MMA operations without gpu.launch, but I couldn't lower them; I still want to ask this question.
    I really want to know how to use the MMA operations properly. Thanks!

If you're using convert-gpu-to-nvvm in your compilation pipeline, then these operations model CUDA WMMA intrinsics. Compilation should fail unless you use a size that matches one of the available CUDA WMMA shapes (16x16x16 is one such shape for f16).
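For reference, a minimal sketch of a load with a WMMA-compatible shape (the value names here are illustrative, not from your code). leadDimension is the leading dimension of the source memref, i.e. the distance in elements between the starts of consecutive rows; for a row-major memref<16x16xf16> that is 16:

// Hypothetical example: a 16x16 "AOp" tile loaded from a 16x16 row-major buffer,
// so the row stride (leadDimension) is 16 elements.
%A = gpu.subgroup_mma_load_matrix %buf[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">

Note that in your 3x3 example each load and store still touches a full 16x16 tile, so it runs past the end of the 3x3 buffers; that would explain both the garbage values you printed and the heap corruption on free().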

If you use these operations for anything other than a trivial example, it is going to be very verbose. There is a VectorToGPU pass (convert-vector-to-gpu) that lets you lower from vector dialect operations to gpu and nvgpu MMA intrinsics. See for example the tests in test/Conversion/VectorToGPU.

You don't have to use the GPU dialect abstractions. For example, you could use a plain func.func to represent your device code. However, you would then need a different mlir-opt command line from the one given in the test you linked to; that's probably why you can't lower. I'm guessing you are using (from the test you linked):

mlir-opt [filename] -gpu-kernel-outlining -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin{chip=sm_70})'

Try playing with it to see what each pass does. gpu-kernel-outlining takes everything inside gpu.launch and outlines it into a GPU function nested in a gpu.module. convert-gpu-to-nvvm, which does the MMA lowering, then only applies to gpu.module, per the pass-pipeline specification -pass-pipeline='gpu.module(...)'.
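A rough sketch of the structure that gpu-kernel-outlining produces (names are illustrative, and the real gpu.func takes the values captured by the launch body as arguments):

// Skeleton of the outlined kernel; the body of your gpu.launch ends up
// where gpu.return is.
gpu.module @main_kernel {
  gpu.func @main_kernel(%arg0: memref<3x3xf16>) kernel {
    gpu.return
  }
}

Only code inside such a gpu.module is visited by the passes listed in -pass-pipeline='gpu.module(...)'.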


Thanks!

Sorry, I've run into some new problems. I hope you can give me some guidance.

#map0 = affine_map<(i, j, k) -> (i, j)>
#map1 = affine_map<(i, j, k) -> (j, k)>
#map2 = affine_map<(i, j, k) -> (i, k)> 
func.func @main() {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index 
  %c4 = arith.constant 4 : index
  %f0 = arith.constant 0.0 : f16 
  %f2 = arith.constant 2.0 : f16
  %input0 = memref.alloc() : memref<4x4xf16>
  %input1 = memref.alloc() : memref<4x4xf16>
  %output0 = memref.alloc() : memref<4x4xf16>
  scf.for %i = %c0 to %c4 step %c1 {
    scf.for %j = %c0 to %c4 step %c1 {
      memref.store %f0, %output0[%i, %j] : memref<4x4xf16>
      memref.store %f2, %input0[%i, %j] : memref<4x4xf16>
      memref.store %f2, %input1[%i, %j] : memref<4x4xf16>
    }
  } 
  %input_v0 = vector.transfer_read %input0[%c0, %c0], %f0 : memref<4x4xf16>, vector<4x4xf16>
  %input_v1 = vector.transfer_read %input1[%c0, %c0], %f0 : memref<4x4xf16>, vector<4x4xf16>
  %output_v0 = vector.transfer_read %output0[%c0, %c0], %f0 : memref<4x4xf16>, vector<4x4xf16>
  %output_v1 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "reduction", "parallel"], kind = #vector.kind<add>} %input_v0, %input_v1, %output_v0 : vector<4x4xf16>, vector<4x4xf16> into vector<4x4xf16>
  vector.transfer_write %output_v1, %output0[%c0, %c0] : vector<4x4xf16>, memref<4x4xf16>
  return 
}

When I use
mlir-opt [filename] -pass-pipeline="func.func(convert-vector-to-gpu)"
it shows:

#map0 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map1 = affine_map<(d0, d1, d2) -> (d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d2)>
module {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c4 = arith.constant 4 : index
    %cst = arith.constant 0.000000e+00 : f16
    %cst_0 = arith.constant 2.000000e+00 : f16
    %0 = memref.alloc() : memref<4x4xf16>
    %1 = memref.alloc() : memref<4x4xf16>
    %2 = memref.alloc() : memref<4x4xf16>
    scf.for %arg0 = %c0 to %c4 step %c1 {
      scf.for %arg1 = %c0 to %c4 step %c1 {
        memref.store %cst, %2[%arg0, %arg1] : memref<4x4xf16>
        memref.store %cst_0, %0[%arg0, %arg1] : memref<4x4xf16>
        memref.store %cst_0, %1[%arg0, %arg1] : memref<4x4xf16>
      }
    }
    %3 = vector.transfer_read %0[%c0, %c0], %cst {in_bounds = [true, true]} : memref<4x4xf16>, vector<4x4xf16>
    %4 = vector.transfer_read %1[%c0, %c0], %cst {in_bounds = [true, true]} : memref<4x4xf16>, vector<4x4xf16>
    %5 = vector.transfer_read %2[%c0, %c0], %cst {in_bounds = [true, true]} : memref<4x4xf16>, vector<4x4xf16>
    %6 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "reduction", "parallel"], kind = #vector.kind<add>} %3, %4, %5 : vector<4x4xf16>, vector<4x4xf16> into vector<4x4xf16>
    vector.transfer_write %6, %2[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf16>, memref<4x4xf16>
    return
  }
}

The vector dialect doesn't lower to the gpu dialect. I also tried some other methods: I changed the name of the function and made the vectors and memrefs 16x16. I also referred to the following link: llvm-project/vector-to-mma-ops.mlir at 3512721d52b3380ea4d3f5b2419d0b7b072e7797 · llvm/llvm-project · GitHub, but it doesn't work. It looks like I haven't found the trick for lowering the vector ops. Thanks!

I found the reason for the lowering failure by reading the code in VectorToGPU.cpp: after I changed the order of the iterator_types, it lowered successfully.

Edit: looks like you found the reason as I was typing. Even if 4x4 converts, I think you will need to use 16x16 to lower further.

Looks like you're making progress.

The issue here is that your vector operations are not following the proper convention expected by the pass. The conversion is "all or nothing", meaning that if the entire chain of vector operations can't all convert to gpu operations, then none of them will. We could definitely improve the error reporting/feedback here, because even running with -debug isn't going to tell you why the slice of vector ops isn't valid (sorry).
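For reference, the convention the pass looks for is the canonical row-major matmul form used in the test/Conversion/VectorToGPU tests; a sketch (value names illustrative):

#mapA = affine_map<(m, n, k) -> (m, k)>   // A indexed by (row, contraction)
#mapB = affine_map<(m, n, k) -> (k, n)>   // B indexed by (contraction, column)
#mapC = affine_map<(m, n, k) -> (m, n)>   // C indexed by (row, column)
%d = vector.contract {indexing_maps = [#mapA, #mapB, #mapC],
                      iterator_types = ["parallel", "parallel", "reduction"],
                      kind = #vector.kind<add>}
       %a, %b, %c : vector<16x16xf16>, vector<16x16xf16> into vector<16x16xf16>

Note that the reduction iterator comes last, which is what your original ["parallel", "reduction", "parallel"] ordering did not satisfy.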

You shouldn't feel sorry about this; I think you've helped me a lot. Thanks! I learned a lot by studying the gpu dialect, and in the future, when I encounter such problems, I can read the source code to see how things are implemented. This problem gave me a chance to read the source code. I am interested in MLIR.