Run linalg.matmul on GPU

Hi! Does anyone know how to run linalg.matmul on a GPU device? I'm not familiar with it, and I can't find any documentation about it. Here is my code:

module {
  func.func @forward(%arg0: tensor<4x2xf32>, %arg1: tensor<2x3xf32>) -> tensor<4x3xf32> {
    %cst = arith.constant dense<0.000000e+00> : tensor<4x3xf32>
    %0 = linalg.matmul {cast = #linalg.type_fn<cast_signed>} ins(%arg0, %arg1 : tensor<4x2xf32>, tensor<2x3xf32>) outs(%cst : tensor<4x3xf32>) -> tensor<4x3xf32>
    return %0 : tensor<4x3xf32>
  }
}

And, if possible, I'd also like to know how to measure its runtime.

Currently, there is no end-to-end lowering that can handle arbitrary input. However, many of the necessary pieces are available and, with some tweaks, what you're asking for can be done.

There are essentially three steps required to get it running:

  • map and lower the workload to a GPU kernel
  • lower the GPU kernel to a target device (serialize to a binary)
  • feed the created binary to a runtime

For the first step, an example pipeline that can get you to a GPU kernel:

mlir-opt -one-shot-bufferize="bufferize-function-boundaries=1 function-boundary-type-conversion=identity-layout-map" -convert-linalg-to-parallel-loops -canonicalize -gpu-map-parallel-loops -convert-parallel-loops-to-gpu -gpu-kernel-outlining -canonicalize -cse

At the heart of this lowering, the linalg operation is converted into a parallel loop, which makes it possible to create and outline a GPU kernel. Of course, it is a naive lowering, so performance might be lacking :wink:
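To give a rough idea of where that pipeline lands, here is a sketch of the kind of outlined kernel it produces. The module and function names are made up for illustration, and the body is written out by hand as a one-output-element-per-thread loop; the actual generated IR will differ in details:

```mlir
// Illustrative shape of the IR after -gpu-kernel-outlining (names are hypothetical)
gpu.module @forward_kernel {
  gpu.func @forward_kernel(%A: memref<4x2xf32>, %B: memref<2x3xf32>,
                           %C: memref<4x3xf32>) kernel {
    // Each invocation computes one element C[i, j]
    %i = gpu.block_id x
    %j = gpu.block_id y
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %init = memref.load %C[%i, %j] : memref<4x3xf32>
    // Reduce over the shared dimension k = 0..2
    %acc = scf.for %k = %c0 to %c2 step %c1 iter_args(%a = %init) -> (f32) {
      %lhs = memref.load %A[%i, %k] : memref<4x2xf32>
      %rhs = memref.load %B[%k, %j] : memref<2x3xf32>
      %mul = arith.mulf %lhs, %rhs : f32
      %sum = arith.addf %a, %mul : f32
      scf.yield %sum : f32
    }
    memref.store %acc, %C[%i, %j] : memref<4x3xf32>
    gpu.return
  }
}
```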

Depending on your runtime, you might need to add some data movement to make the inputs accessible on the GPU, for example: gpu.host_register, or gpu.alloc + gpu.memcpy.
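A minimal sketch of the second option, assuming host-side memrefs `%hA` and the like already exist (all names here are hypothetical):

```mlir
// Allocate device memory, copy the input over, and free it afterwards
%dA = gpu.alloc () : memref<4x2xf32>
gpu.memcpy %dA, %hA : memref<4x2xf32>, memref<4x2xf32>
// ... launch the kernel with %dA, copy results back the same way ...
gpu.dealloc %dA : memref<4x2xf32>
```

With gpu.host_register instead, you cast the host memref to an unranked memref and register it, so the device can access it directly without explicit copies.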

The next two steps depend on your target. Have a look at some of the existing integration tests in the MLIR repository:

There are a few more integration tests for other runners and more GPU examples. Hopefully, you can find one that works for your setup.
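As for measuring runtime, the GPU integration tests typically time a launch from the host using helpers from the runner utilities library. A sketch, assuming the final module is run with a runner that links libmlir_runner_utils (which provides rtclock, printF64, and printNewline):

```mlir
// External helpers from mlir_runner_utils / c_runner_utils
func.func private @rtclock() -> f64
func.func private @printF64(f64)
func.func private @printNewline()

func.func @main() {
  %t0 = func.call @rtclock() : () -> f64
  // ... invoke the lowered @forward computation here ...
  %t1 = func.call @rtclock() : () -> f64
  // Print the elapsed wall-clock time in seconds
  %dt = arith.subf %t1, %t0 : f64
  func.call @printF64(%dt) : (f64) -> ()
  func.call @printNewline() : () -> ()
  return
}
```

For such a tiny matmul the launch overhead will dominate, so expect the measurement to say more about the runtime than about the kernel itself.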


Thanks!