Use MLIR/IREE for GPU CodeGen

Hey,

I am working on GPU CodeGen for the PowerVR platform. I started with IREE because IREE has implemented many interesting conversions for LinalgToSPIRV. I have a question regarding IREE (5a51d2a57, Jul 18): it looks like LinalgTilingOnBuffers skips ‘gpu.func’. For example:

  gpu.module @kernels {
    gpu.func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
      linalg.matmul %arg0, %arg1, %ret0 :
        (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>)
      gpu.return
    }
  }

I run the command

iree-opt -split-input-file \
         -iree-codegen-linalg-tile-and-fuse \
         -iree-codegen-convert-to-gpu -canonicalize -cse \
         test.mlir

But the output is unchanged. So I changed my workflow to this: I first use iree-opt to compile the non-GPU MLIR function, then use mlir-opt to lower affine IR to standard IR. For example:

iree-opt -split-input-file \
         -iree-codegen-linalg-tile-and-fuse \
         -iree-codegen-convert-to-gpu -canonicalize -cse \
         test.mlir | \
   mlir-opt -lower-affine 
// test.mlir
module attributes {
  spv.target_env =
    #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>,
    {max_compute_workgroup_invocations = 128 : i32,
     max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
    linalg.matmul %arg0, %arg1, %ret0 :
      (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>)
    return
  }
}

Then I manually copy the result into stub code, for example:

#map0 = affine_map<(d0, d1)[s0] -> (d0 * 16 + s0 + d1)>
module attributes {
  gpu.container_module, spv.target_env = #spv.target_env<#spv.vce<v1.3, [Shader],
  [SPV_KHR_storage_buffer_storage_class]>,
  {max_compute_workgroup_invocations = 128 : i32,
   max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  gpu.module @kernels {
    gpu.func @matmul(..) kernel .. {

      .. (manual copy paste) ..

      gpu.return
    }
  }

  func @main() {
    ...
    "gpu.launch_func"(%cst4, %cst4, %cst1, %cst1, %cst1, %cst1,
                      %arg0, %arg1, %arg2)
      { kernel = @kernels::@matmul }
        : (index, index, index, index, index, index,
           memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()
    return
  }
}

And execute the stub code with vulkan runner

mlir-vulkan-runner \
  --shared-libs=libvulkan-runtime-wrappers.so,libmlir_runner_utils.so \
  --entry-point-result=void -print-ir-after-all \
  stub.mlir

I wonder if there is a better way to avoid this manual copy? Or is there any option in MLIR or IREE to mark certain functions as gpu.func?

Thanks :slight_smile:


If your question is specific to IREE, you likely should have a look at the dedicated communication channels they have: GitHub - iree-org/iree

Hi @cycheng. As @mehdi_amini mentioned, this is an IREE-specific question better addressed on the IREE forums. Reach out to us on the IREE Discord server (Google IREE) and we can walk through what you want.

Specifically here, IREE codegen does not work with gpu.module or gpu.func.

If you are interested in looking into how IREE's codegen works, you can look at the description here.
It shows what the IREE codegen pipeline expects as input and all the transformations applied to generate SPIR-V dialect code, and it uses the linalg.matmul example itself.

If you want to see the entire IREE pipeline, you should be able to try this

func @matmul(%arg0: tensor<16x16xf32>, %arg1 : tensor<16x16xf32>) -> tensor<16x16xf32> {
  %0 = "mhlo.dot"(%arg0, %arg1) : (tensor<16x16xf32>, tensor<16x16xf32>) -> tensor(16x16xf32)
  return %0 : tensor<16x16xf32>
}

Then

iree-translate -iree-mlir-to-vm-bytecode-module -iree-hal-target-backends=vulkan-spirv -print-ir-after-all -mlir-disable-threading test.mlir

Hey @cycheng, thanks for your interest in IREE and GPU CodeGen! It's interesting to see that you are trying to mix mlir-vulkan-runner together with IREE. One thing to point out is that IREE has its own way of handling the runtime-kernel ABI, which is different from the GPU dialect and the various runners in MLIR core. The passes in IREE assume that ABI, which is why they stay in IREE's codebase (some patterns might be upstreamable and we are gradually upstreaming them), so they might not be directly applicable to the contract expected by the GPU dialect. mlir-vulkan-runner follows the convention of the GPU dialect; that's why you need the "manual" copying to make it work. @MaheshRavishankar just landed a very nice write-up on the code generation flow in IREE today that you might be interested to give a read.

If you'd like to leverage the existing functionality in IREE, I'd recommend starting with IREE. At the moment it's typical to enter IREE's flow at the HLO level; you can find various examples by following the Get Started pages. It's not easy to enter IREE's system at an arbitrary layer right now, but we have plans to improve that, and hopefully later it will also be easy to just drop in some SPIR-V code and leverage IREE's runtime to run it and see how it performs.

OTOH, if you are interested in starting at the GPU level and using mlir-vulkan-runner because of its simplicity, etc., you are also certainly very welcome to play with it and extend it in the way you'd like. :slight_smile:

The flow in the GPU dialect is different. We do not start out with a gpu.func at the Linalg level. What the above code essentially says is to run a linalg.matmul on every thread of the GPU. What you likely want is to run a single linalg.matmul using all threads of the GPU.

To get closer to what you want, you would take your test input file and transform it into GPU code using mlir-opt. I used the passes -convert-linalg-to-parallel-loops --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --gpu-kernel-outlining, which yields

#map0 = affine_map<()[s0, s1, s2] -> ((s0 - s1) ceildiv s2)>
#map1 = affine_map<(d0)[s0, s1] -> (d0 * s0 + s1)>


module attributes {gpu.container_module, spv.target_env = #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>, {max_compute_workgroup_invocations = 128 : i32, max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %arg2: memref<16x16xf32>) {
    %c16 = constant 16 : index
    %c0 = constant 0 : index
    %c1 = constant 1 : index
    %c1_0 = constant 1 : index
    %0 = affine.apply #map0()[%c16, %c0, %c1]
    %1 = affine.apply #map0()[%c16, %c0, %c1]
    "gpu.launch_func"(%0, %1, %c1_0, %c1_0, %c1_0, %c1_0, %arg0, %arg1, %arg2) {kernel = @matmul_kernel::@matmul_kernel} : (index, index, index, index, index, index, memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()
    return
  }
  gpu.module @matmul_kernel {
    gpu.func @matmul_kernel(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %arg2: memref<16x16xf32>) kernel {
      %0 = "gpu.block_id"() {dimension = "x"} : () -> index
      %1 = "gpu.block_id"() {dimension = "y"} : () -> index
      %2 = "gpu.block_id"() {dimension = "z"} : () -> index
      %3 = "gpu.thread_id"() {dimension = "x"} : () -> index
      %4 = "gpu.thread_id"() {dimension = "y"} : () -> index
      %5 = "gpu.thread_id"() {dimension = "z"} : () -> index
      %6 = "gpu.grid_dim"() {dimension = "x"} : () -> index
      %7 = "gpu.grid_dim"() {dimension = "y"} : () -> index
      %8 = "gpu.grid_dim"() {dimension = "z"} : () -> index
      %9 = "gpu.block_dim"() {dimension = "x"} : () -> index
      %10 = "gpu.block_dim"() {dimension = "y"} : () -> index
      %11 = "gpu.block_dim"() {dimension = "z"} : () -> index
      br ^bb1
    ^bb1:  // pred: ^bb0
      %c1 = constant 1 : index
      %c0 = constant 0 : index
      %c16 = constant 16 : index
      %12 = affine.apply #map1(%0)[%c1, %c0]
      %13 = affine.apply #map1(%1)[%c1, %c0]
      scf.for %arg3 = %c0 to %c16 step %c1 {
        %14 = load %arg0[%12, %arg3] : memref<16x16xf32>
        %15 = load %arg1[%arg3, %13] : memref<16x16xf32>
        %16 = load %arg2[%12, %13] : memref<16x16xf32>
        %17 = mulf %14, %15 : f32
        %18 = addf %16, %17 : f32
        store %18, %arg2[%12, %13] : memref<16x16xf32>
      }
      gpu.return
    }
  }
}

The next step would be the mapping to SPIR-V. However, when I used the --convert-gpu-to-spirv pass, I got an error about a missing ABI attribute. I don't know whether we have a pass to insert these. @antiagainst, do you know?

That would be the final step using the above pipeline, too.

Hey @mehdi_amini, @MaheshRavishankar, thanks for the reminder, I will move to the IREE community for IREE-specific questions!

@MaheshRavishankar, @antiagainst, thanks for pointing out the references:

  1. IREE CPU/GPU Code Generation Pipeline
    (or) https://google.github.io/iree/design-docs/codegen-passes
  2. Progressive lowering of Matmul operation from HLO to SPIR-V
  3. Improve and demonstrate hackability at different levels of IR

They are very useful for helping me understand the IREE codegen pipeline :grin:

@antiagainst, thanks for the explanation! There are two reasons that led me to use mlir-vulkan-runner:

  1. Yes! For simplicity :wink:
  2. I have been trying to cross-compile IREE for AArch64 on an x86 PC but still can't make it work. (I can cross-compile MLIR for AArch64 in two stages: first build for x86, then build for AArch64 with the help of the x86 llvm/mlir tablegen tools.)

Our device runs AArch64 Linux, so I need an AArch64-based runtime, but I also want to leverage IREE's transformation passes :P
I will follow your advice and start from IREE, and try to figure out how to cross-build IREE in the IREE community.

Many thanks for your help :slight_smile:

Oh… right, I understand, then it is not the right level to perform tiling!

The idea is to first tile the linalg operation and then map it to GPU. If you wanted to perform some tiling, the invocation would be mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=0,2" -convert-linalg-to-parallel-loops --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --gpu-kernel-outlining.

A summary of the passes would be

  1. tile the linalg operation using parallel loops
  2. transform the tiled linalg operations to loops, as well
  3. distribute to GPU blocks/threads (this is done using annotations on the loops)
  4. convert the annotated loop nest to a GPU function
  5. outline a kernel

You can run it with your example step by step to see the tiling happening.
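
For example, something like the following lets you inspect the IR after each step (these are just the flags from the invocation above, applied to your test.mlir, adding one pass at a time):

mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=0,2" test.mlir
mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=0,2" \
         -convert-linalg-to-parallel-loops test.mlir
mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=0,2" \
         -convert-linalg-to-parallel-loops \
         --test-gpu-greedy-parallel-loop-mapping \
         --convert-parallel-loops-to-gpu --gpu-kernel-outlining test.mlir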


Not right now. We had discussions in a previous thread and agreed on adding an attribute to gpu.func for the workgroup size, which will address the issue. It's just that nobody has gotten around to implementing it yet. :wink:

Actually, supporting mobile and edge devices is one of IREE's primary goals, so we are certainly aligned here, and IREE cares about cross compilation. :slight_smile: At the moment we have a cross-compilation flow for Android working; you can find the steps on the Get Started on Android with CMake page. If it's general AArch64 Linux, I wouldn't expect it to be super hard to get it working there too by selecting the proper toolchains and probably tweaking some configurations. We have some extra documentation on cross compilation here. Please feel free to ask on IREE's channels and tweak things to make your case work!


@cycheng, to follow up on @herhut's comment earlier.

From here, to generate SPIR-V code you need the following:

  1. There is one manual step: the generated gpu.func needs to have a spv.entry_point_abi attribute added to specify the workgroup size to use in the lowering to SPIR-V. So you would have to do something like
gpu.func (...)
attributes {spv.entry_point_abi = {local_size = dense<[32, 4, 1]> : vector<3xi32>}} {
  ...
}

(more info about the ABI attributes here).

  2. Then you can generate the spv.module by appending -lower-affine -canonicalize -legalize-std-for-spirv -convert-gpu-to-spirv to the flags above. Though while trying this out, I realized there is a bug in the -lower-affine pass, which is fixed by this patch.

Now that I have tried this out, I think it is worth adding a pass to remove the manual step (1) above (i.e. allowing the workgroup size to be set through the command line). I will take a stab at doing that and will update this post.

Thanks for bringing this up. This part of the codebase hasn't been in active use, so things were not integrated as well as they should be. I will update this post as I try to streamline this.


With this patch (and its dependent patches) the following module

$ cat test.mlir
module attributes {
  spv.target_env =
    #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>,
    {max_compute_workgroup_invocations = 128 : i32,
     max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
    linalg.matmul %arg0, %arg1, %ret0 : (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()
    return
  }
}

can be lowered to spir-v with the following command line

$ mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=8,4" -convert-linalg-to-parallel-loops \
    -test-gpu-greedy-parallel-loop-mapping -convert-parallel-loops-to-gpu -gpu-kernel-outlining \
    -lower-affine -canonicalize -legalize-std-for-spirv \
    -test-spirv-entry-point-abi="workgroup-size=4,8" -convert-gpu-to-spirv test.mlir

The additional flags are for

  • Lowering affine.apply instructions inserted by the convert-parallel-loops-to-gpu pass.
  • Canonicalization to get around some missing lowering for the std.br instruction
  • Some ops like subview cannot be lowered directly to SPIR-V, so they are folded into their load/store uses
  • Set workgroup size for the gpu.func kernel function (here [8, 4, 1])
  • Lower gpu.func and its body to spirv dialect.

Thanks @MaheshRavishankar for adding this. Nice to see this work end-to-end again.

In this example where the operands to the launch are static, couldn’t we derive the ABI attributes from the call, as well? Not saying you should do this, just curious.

On a different note, if you only want to experiment with kernel-side code generation, you can use the ModelBuilder code which is in IREE. It allows using IREE's lowering passes directly and running the generated code through the Vulkan runner without any extra steps. The following example, for instance, generates code for linalg.matmul: https://github.com/google/iree/blob/main/experimental/ModelBuilder/test/TestMatMulVulkan.cpp

You can build the example into a standalone app; you just need to add -DIREE_BUILD_EXPERIMENTAL=ON to the CMake command line and follow the normal IREE build process:

cmake -G Ninja -B build/ -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DIREE_BUILD_EXPERIMENTAL=ON .


cmake --build build/

and run:

./build/experimental/ModelBuilder/test/test-matmul-vulkan -vulkan-wrapper=build/third_party/llvm-project/llvm/lib/libvulkan-runtime-wrappers.so

On a slightly different topic, if you have some ideas for a good matmul code generation strategy that would work well for PowerVR, I would be interested to discuss it. I'm currently looking at different mobile GPU architectures to try to identify the different strategies needed to generate efficient code for each architecture.

@herhut and @MaheshRavishankar Thanks for sharing another approach with me, and for coming up with the patches so quickly, the MLIR community is awesome!!

I will try this approach and compare it against IREE's transformations (sorry, I am still struggling to cross-build IREE for AArch64).

@antiagainst I am very glad to be heading in the same direction as IREE/MLIR (unofficially, not representing PowerVR :stuck_out_tongue:). Thanks for your info; I am still working on cross-building IREE for AArch64. Could you have a look and give me some hints? Many thanks :slight_smile:

@ThomasRaoux Yes, I only want to experiment with Linalg dialect codegen, so I assume the graph compiler has generated the Linalg dialect for me.
The cooperative matrix patch and MatmulCodegenStrategy are quite interesting!
My plan for the following weeks:
(1) Benchmark a single linalg.matmul on PowerVR.
(2) Benchmark producer/consumer linalg.matmul on PowerVR.
(3) Tune the performance:
(3.1) Come up with an optimization strategy for PowerVR, and try to feed back (patches or discussions) to the community if my bosses agree (sorry I have to be conservative here, but I think PowerVR will become more open-source friendly, and I am also working on that).
That said, MLIR/IREE has already done so much fantastic work that I am not sure how much we can actually contribute :wink:

Aww, thanks! Do know that more work and focus is extremely helpful! Since these are fairly generic solutions, there is no substitute for heart and expertise on specific platforms, and that is the difference between a mediocre outcome and something great for each one. So thank you for your perseverance, and don't be too intimidated by all that appears to be going on. There is a lot still undone, especially for platform-specific tuning and specialization.


Yes, the cooperative matrix benchmark test is probably the most interesting; it only works for GPUs supporting the cooperative matrix extension right now. I'm planning to add an alternative lowering for other GPUs using subgroup operations, or falling back to naive code generation. I'll hopefully get to it in the next few weeks.

Some interesting characteristics that influence the lowering so far are whether the GPU has fast shared local memory, whether it supports efficient subgroup shuffle instructions, and whether it has any dedicated hardware (tensor-core kind of thing).
If you are able to share, it would be great to know whether PowerVR can take advantage of any of those features.

The patches have landed and the command line (here) should work. In due course I will add a test for that pipeline. We can probably add a GPUCodegenPassPipeline somewhere, at least for testing purposes.

I am not sure how we would do that. In general there is a non-trivial relationship between tile sizes and workgroup sizes. For the purposes of having a working flow, I just allow setting this from the command line for now.

There is definitely a lot of work to be done. @ThomasRaoux's work on using cooperative matrix instructions is what we are focusing on. Applying the same strategy (that @nicolasvasilache has previously demonstrated on CPU) to PowerVR would be really awesome!

Hey Folks,

Sorry for the late update. Thank you for all of your help; I can now run and benchmark matmul on our platform using MLIR or IREE. I am moving on to performance tuning.

For linalg.matmul, the optimization we are looking for will look like this (sorry for using C/C++ to describe the idea):

Tiling (LinalgTilingToParallelLoops)

// C += A x B
for (uint32_t i = 0; i < C.row_size(); i += TileRowSize)   // 2d block, which
  for (uint32_t j = 0; j < C.col_size(); j += TileColSize) // maps to scf.parallel
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t ti = i; ti < i + TileRowSize; ++ti)
        for (uint32_t tj = j; tj < j + TileColSize; ++tj)
          for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
            C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);
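
(If I follow the earlier posts correctly, the upstream tiling step for this would be invoked with something like the command below, with the three tile sizes standing in for TileRowSize/TileColSize/TileStepSize; I have not verified the exact size order for matmul.)

mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=32,32,4" test.mlir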

Map outer parallel loops to Grids (TestGpuGreedyParallelLoopMappingPass)
The top two loops are mapped to a 2D grid, and the grid size is:

// GridSize  = [C.row_size() / TileRowSize, 
//              C.col_size() / TileColSize,
//              1] 
for (uint32_t i = 0; i < C.row_size(); i += TileRowSize)
  for (uint32_t j = 0; j < C.col_size(); j += TileColSize)

Map inner parallel loops to Thread Blocks/Workgroup (LinalgLowerToParallelLoops+TestGpuGreedyParallelLoopMappingPass)
This example uses a 1D workgroup, so each thread computes TileColSize results.

    // TileColSize, for example 16, which means one thread computes 16 results.
    // BlockSize = [TileRowSize, 1, 1]
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
 =>   for (uint32_t ti = i; ti < i + TileRowSize; ++ti)
        for (uint32_t tj = j; tj < j + TileColSize; ++tj)
          for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
            C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);

Map ti to Thread Blocks/Workgroup

    // BlockSize = [TileRowSize, 1, 1]
      for (uint32_t ti = i; ti < i + TileRowSize; ++ti)
    // TileColSize, for example 16, which means one thread computes 16 results.
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t tj = j; tj < j + TileColSize; ++tj)
        for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
          C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);

Local memory promotion for matrix B (LinalgPromotionOptions)

    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize) {
      // Loaded by all threads in a workgroup, each thread loads "TileStepSize" values
      LocalB[TileStepSize][TileColSize] =
        B[k + 0 .. TileStepSize][j + 0 .. TileColSize];

      for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
        for (uint32_t tk = k, ltk = 0; tk < k + TileStepSize; ++tk, ++ltk)
          C.at(ti, tj) += A.at(ti, tk) * LocalB.at(ltk, ltj);
    }

Local memory for tile Outputs (LinalgPromotionOptions)

    // lti = ti - i, the row of this thread within the tile.
    LocalC[TileRowSize][TileColSize] = { 0, .., 0 };

    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize) {
      LocalB[TileStepSize][TileColSize] =
        B[k + 0 .. TileStepSize][j + 0 .. TileColSize];

      for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
        for (uint32_t tk = k, ltk = 0; tk < k + TileStepSize; ++tk, ++ltk)
          LocalC.at(lti, ltj) += A.at(ti, tk) * LocalB.at(ltk, ltj);
    }

    for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
      C.at(ti, tj) += LocalC.at(lti, ltj);

My first question is, can I tile and map the grid/block in this way:

    TileSizes = [TileRowSize, TileColSize, TileStepSize]
    GridSize  = [C.row_size() / TileRowSize, C.col_size() / TileColSize, 1]
    BlockSize = [TileRowSize, 1, 1]

My second question: I tried to run some examples from TestLinalgTransforms.cpp

mlir-opt -test-linalg-transform-patterns=test-matmul-to-vector-patterns-tile-2d \
  -convert-linalg-to-parallel-loops -test-gpu-greedy-parallel-loop-mapping \
  -convert-parallel-loops-to-gpu -gpu-kernel-outlining -lower-affine -canonicalize \
  -legalize-std-for-spirv  -split-input-file \
  ./test/Dialect/Linalg/transform-patterns-matmul-to-vector.mlir

But I got the message:

./test/Dialect/Linalg/transform-patterns-matmul-to-vector.mlir split at line #1:8:3: error: semi-affine expressions (division by non-const) are not supported

I tried this because in our case we will probably need to create our own custom transformation flow, so I wrote something like this:

    LinalgLoopDistributionOptions cyclicNprocsDefault;
    cyclicNprocsDefault.distributionMethod.resize(2,
                                                  DistributionMethod::Cyclic);
    cyclicNprocsDefault.procInfo =
        getGpuProcIds<gpu::BlockIdOp, gpu::GridDimOp>;
    patterns.insert<LinalgTilingPattern<MatmulOp>>(
        ctx,
        LinalgTilingOptions()
            .setTileSizes({32, 32, 4})
            .setLoopType(LinalgTilingLoopType::ParallelLoops)
            .setDistributionOptions(cyclicNprocsDefault)
        , LinalgMarker({},
                      Identifier::get("after_distribute3", ctx))
                     );

But the error blocks me :disappointed_relieved:

Please give me hints or directions, thanks :slight_smile:

CY

This is very timely @cycheng. Within IREE/ModelBuilder we have been doing similar things.

The exact steps you are looking to do have been done in IREE. This is an example of an IREE test that does the same thing: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/test/workgroup_memory_promotion.mlir

(I am in the process of doing this for conv operation as well).

The mapping to parallel loops happens in a later pass in IREE. The distribution options that you mentioned are part of the effort to move the distribution logic from IREE to MLIR.

@ThomasRaoux is looking at using cooperative matrix instructions and is in the process of pushing this upstream.

All this to say that it would be great if

  1. This is useful for you
  2. We can collaborate on improving performance of matmul and other operations.

I think I hit a similar error when I was adding support for workgroup memory on convolutions. I'll try to see what's happening.


Hi CY,

The optimizations you are trying to do make a lot of sense. As @MaheshRavishankar mentioned, I am working on a similar flow, although not all the pieces are quite functional yet. I have been taking a bit of a detour from the current IREE codegen that does linalg → SPIR-V, as we have been working on an experimental flow: linalg → vector → SPIR-V. If you look at this patch, I'm using the MatmulCodegenStrategy, which is shared between CPU and GPU, to do very similar transformations. (BTW, @nicolasvasilache is presenting this work on Thursday in the open design meeting; it will be mostly about vector but it should touch on the GPU part too.) I'm planning to send a PR for it very soon, but I'm waiting on a dependency in MLIR.

If you look at this code, it describes a strategy similar to what you are describing.

    strategy
        .tile<linalg::MatmulOp>(
            linalg::LinalgTilingOptions()
                .setLoopType(linalg::LinalgTilingLoopType::ParallelLoops)
                .setTileSizes({tileM, tileN, tileK})
                .setInterchange({1, 0, 2})
                .setDistributionOptions(WGDistribute))
        .setHoistInvariantCode(true)
        .promote<linalg::MatmulOp>(
              linalg::LinalgPromotionOptions()
                  .setAllocationDeallocationFns(allocateWorkgroupMemory,
                                                deallocateWorkgroupMemory)
                  .setCopyInOutFns(copyToFromWorkgroupMemory,
                                   copyToFromWorkgroupMemory)
                  .setOperandsToPromote({0, 1})
                  .setUseFullTileBuffers({false, false}))
          .tile<linalg::MatmulOp>(
              linalg::LinalgTilingOptions()
                  .setLoopType(linalg::LinalgTilingLoopType::ParallelLoops)
                  .setTileSizes(
                      {tileM / numSubgroupX, tileN / numSubgroupY, tileK})
                  .setDistributionOptions(SGDistribute))
        .vectorize<linalg::MatmulOp>().unrollVector<vector::ContractionOp>(
        {cooperativeMatrixM, cooperativeMatrixN, cooperativeMatrixK});

That should work. Basically, it sounds like the main difference from the above is that you are not trying to use subgroup operations. You could do the second tiling and map the M dimension to threadID.x, while the N dimension would become a loop (ideally unrolled).
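
In the same C-style pseudocode as above (reusing your TileRowSize/TileColSize/TileStepSize names, with acc as a hypothetical per-thread accumulator array), that mapping would look roughly like this:

    // ti = i + threadID.x: the M dimension is mapped to threads, so each
    // thread owns one row of the C tile and accumulates it in registers.
    float acc[TileColSize] = {0};
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t tj = j; tj < j + TileColSize; ++tj)      // N stays a loop,
        for (uint32_t tk = k; tk < k + TileStepSize; ++tk)   // ideally unrolled
          acc[tj - j] += A.at(ti, tk) * B.at(tk, tj);
    // Write the accumulated tile row back to C once at the end.
    for (uint32_t tj = j; tj < j + TileColSize; ++tj)
      C.at(ti, tj) += acc[tj - j];

Keeping the accumulator in registers like this is also what my question below is about.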

Why do you want to promote the output to local memory? If you unroll the loop along j you can keep C in registers. Are you worried about register pressure? Doing a load and store to shared memory for every operation sounds expensive.

1 Like