Use MLIR/IREE for GPU CodeGen

Hey @mehdi_amini, @MaheshRavishankar, thanks for the reminder, I will move to the IREE community for IREE-specific questions!

@MaheshRavishankar, @antiagainst, thanks for pointing out the references:

  1. IREE CPU/GPU Code Generation Pipeline
    (or) https://google.github.io/iree/design-docs/codegen-passes
  2. Progressive lowering of Matmul operation from HLO to SPIR-V
  3. Improve and demonstrate hackability at different levels of IR

These are very useful for helping me understand the IREE codegen pipeline :grin:

@antiagainst, thanks for the explanation! There are two reasons that led me to use mlir-vulkan-runner:

  1. Yes! For simplicity :wink:
  2. I managed to cross-compile IREE for aarch64 on an x86 PC but still can't make it work. (I can cross-compile MLIR for aarch64 in two stages: first build for x86, then build for aarch64 with the help of the x86 llvm/mlir tablegen binaries.)

Our device runs aarch64 Linux, so I need an aarch64-based runtime, but I also want to leverage IREE's transformation passes :P
I will follow your advice and start from IREE, and try to figure out how to cross-build IREE in the IREE community.

Many thanks for your help :slight_smile:

Oh… right, I understand, then it is not the right level to perform tiling!

The idea is to first tile the linalg operation and then map it to GPU. If you wanted to perform some tiling, the invocation would be:

mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=0,2" -convert-linalg-to-parallel-loops --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --gpu-kernel-outlining

A summary of the passes would be

  1. tile the linalg operation using parallel loops
  2. transform the tiled linalg operations to loops, as well
  3. distribute to GPU blocks/threads (this is done using annotations on the loops)
  4. convert the annotated loop nest to a GPU function
  5. outline a kernel

You can run the passes on your example one step at a time to see the tiling happen.


Right now, no. We had discussions in a previous thread and agreed on adding an attribute to gpu.func for the workgroup size, which will address the issue. It's just that nobody has gotten around to implementing it yet. :wink:

Actually, supporting mobile and edge devices is one of IREE's primary goals, so we are certainly aligned here and IREE cares about cross compilation. :slight_smile: At the moment we have a cross-compilation flow for Android working, and you can find the steps in the Get Started on Android with CMake page. If it's general AArch64 Linux, I wouldn't expect it to be super hard to get it working there too by selecting the proper toolchains and probably tweaking some configurations. We have some extra documentation on cross compilation here. Please certainly feel free to ask on IREE's channel and tweak things to make your case work!


@cycheng, to follow up on the comment from @herhut earlier.

From here, to generate SPIR-V code you need to:

  1. There is one manual step: the generated gpu.func needs a spv.entry_point_abi attribute added to it to specify the workgroup size to use in the lowering to SPIR-V. So you would have to do something like
gpu.func (...)
attributes {spv.entry_point_abi = {local_size = dense<[32, 4, 1]> : vector<3xi32>}} {
  ...
}

(more info about the ABI attributes here).

  2. Then you can generate the spv.module by appending -lower-affine -canonicalize -legalize-std-for-spirv -convert-gpu-to-spirv to the flags above. Though while trying this out, I realized there is a bug in the -lower-affine pass, which is fixed by this patch.

Now that I have tried this out, I think it is worth adding a pass to remove the manual step (1) above (i.e., allowing the workgroup size to be set through the command line). I will take a stab at doing that and will update this post.

Thanks for bringing this up. This part of the codebase hasn't been in active use, so things were not integrated as well as they should be. I will update this post as I try to streamline this.


With this patch (and its dependent patches) the following module

$ cat test.mlir
module attributes {
  spv.target_env =
    #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>,
    {max_compute_workgroup_invocations = 128 : i32,
     max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
    linalg.matmul %arg0, %arg1, %ret0 : (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()
    return
  }
}

can be lowered to SPIR-V with the following command line:

$ mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=8,4" -convert-linalg-to-parallel-loops
    -test-gpu-greedy-parallel-loop-mapping -convert-parallel-loops-to-gpu -gpu-kernel-outlining
    -lower-affine -canonicalize -legalize-std-for-spirv
    -test-spirv-entry-point-abi="workgroup-size=4,8" -convert-gpu-to-spirv test.mlir

The additional flags are for

  • Lowering affine.apply instructions inserted by the convert-parallel-loops-to-gpu pass.
  • Canonicalization, just to get around some missing lowering for the std.br instruction
  • Some ops like subview cannot be lowered directly to SPIR-V, so they are folded into their load/store uses
  • Set workgroup size for the gpu.func kernel function (here [8, 4, 1])
  • Lower gpu.func and its body to spirv dialect.

Thanks @MaheshRavishankar for adding this. Nice to see this work end-to-end again.

In this example where the operands to the launch are static, couldn’t we derive the ABI attributes from the call, as well? Not saying you should do this, just curious.

On a different note, if you only want to experiment with kernel-side code generation you can use the ModelBuilder code which is in IREE. It allows directly using IREE lowering passes and running the generated code through the Vulkan runner without having to do any extra steps. The following example, for instance, generates code for linalg.matmul: https://github.com/google/iree/blob/main/experimental/ModelBuilder/test/TestMatMulVulkan.cpp

You can build the example into a standalone app; you just need to add -DIREE_BUILD_EXPERIMENTAL=ON to the CMake command line and follow the normal IREE build process:

cmake -G Ninja -B build/ -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DIREE_BUILD_EXPERIMENTAL=ON .


cmake --build build/

and run:

./build/experimental/ModelBuilder/test/test-matmul-vulkan -vulkan-wrapper=build/third_party/llvm-project/llvm/lib/libvulkan-runtime-wrappers.so

On a slightly different topic, if you have some ideas for a good matmul code generation strategy that would work well for PowerVR, I would be interested to discuss them. I'm currently looking at different mobile GPU architectures to try to identify the different strategies needed to get efficient code on each architecture.

@herhut and @MaheshRavishankar Thanks for sharing another approach and coming up with the patches so quickly; the MLIR community is awesome!!

I will try this approach and compare it against IREE's transformations (sorry, I am still struggling to cross-build IREE for aarch64).

@antiagainst I am very glad to be heading in the same direction as IREE/MLIR (unofficially, not representing PowerVR :stuck_out_tongue:). Thanks for your info; I am still working on cross-building IREE for aarch64. Could you have a look and give me some hints? Many thanks :slight_smile:

@ThomasRaoux Yes, I only want to experiment with linalg dialect codegen, so I assume the graph compiler has generated the linalg dialect for me.
The cooperative matrix patch and MatmulCodegenStrategy are quite interesting!
My plan for the following weeks:
(1) Benchmark single linalg.matmul on PowerVR.
(2) Benchmark producer/consumer linalg.matmul on PowerVR.
(3) Tune the performance:
(3.1) Come up with an optimization strategy for PowerVR and try to feed it back (as patches or discussions) to the community if my bosses agree. (Sorry I have to be conservative here, but I think PowerVR will become more open-source friendly, and I am also working on that.)
Unfortunately, MLIR/IREE has already done so much fantastic work that I am not sure how much we can actually contribute :wink:

Aww, thanks! Do know that more work and focus are extremely helpful! For fairly generic solutions like these, there is no making up for heart and expertise on specific platforms, and that is the difference between a mediocre outcome and something great for each one. So thank you for your perseverance, and don't be too intimidated by all that appears to be going on. There is a lot still undone, especially for platform-specific tuning and specialization.


Yes, the cooperative matrix benchmark test is probably the most interesting; it only works for GPUs supporting the cooperative matrix extension right now. I'm planning to add an alternative lowering for other GPUs using subgroup operations, or falling back to naive code generation. I'll hopefully get to it in the next few weeks.

Some interesting characteristics that influence the lowering so far are whether the GPU has fast shared local memory, whether it supports efficient subgroup shuffle instructions, and whether it has any dedicated hardware (tensor-core kind of thing).
If you are able to share, it would be great to know whether PowerVR can take advantage of any of those features.

The patches have landed and the command line (here) should work. In due course I will add a test for that pipeline. We can probably add a GPUCodegenPassPipeline somewhere, at least for testing purposes.

I am not sure how we would do that. In general there is a non-trivial relationship between tile sizes and workgroup sizes. For the purposes of having some working flow, we just allow setting this from the command line for now.

There is definitely a lot of work to be done. @ThomasRaoux's work on using cooperative matrix instructions is what we are focusing on. Applying the same strategy (that has previously been demonstrated on CPU by @nicolasvasilache) to PowerVR would be really awesome!

Hey Folks,

Sorry for the late update. Thank you for all of your help; I can now run and benchmark matmul on our platform using MLIR or IREE. I am moving on to performance tuning.

For linalg.matmul, the optimization we are looking for will look like this (sorry for using C/C++ to describe the idea):

Tiling (LinalgTilingToParallelLoops)

// C += A x B
for (uint32_t i = 0; i < C.row_size(); i += TileRowSize)   // 2d block, which
  for (uint32_t j = 0; j < C.col_size(); j += TileColSize) // maps to scf.parallel
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t ti = i; ti < i + TileRowSize; ++ti)
        for (uint32_t tj = j; tj < j + TileColSize; ++tj)
          for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
            C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);

Map outer parallel loops to Grids (TestGpuGreedyParallelLoopMappingPass)
The top two loops are mapped to a 2D grid, and the grid size is:

// GridSize  = [C.row_size() / TileRowSize, 
//              C.col_size() / TileColSize,
//              1] 
for (uint32_t i = 0; i < C.row_size(); i += TileRowSize)
  for (uint32_t j = 0; j < C.col_size(); j += TileColSize)

Map inner parallel loops to Thread Blocks/Workgroup (LinalgLowerToParallelLoops+TestGpuGreedyParallelLoopMappingPass)
This example uses a 1D workgroup, so each thread computes TileColSize results.

    // TileColSize, for example 16, which means one thread computes 16 results.
    // BlockSize = [TileRowSize, 1, 1]
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
 =>   for (uint32_t ti = i; ti < i + TileRowSize; ++ti)
        for (uint32_t tj = j; tj < j + TileColSize; ++tj)
          for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
            C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);

Map ti to Thread Blocks/Workgroup

    // BlockSize = [TileRowSize, 1, 1]
      for (uint32_t ti = i; ti < i + TileRowSize; ++ti)
    // TileColSize, for example 16, which means one thread computes 16 results.
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t tj = j; tj < j + TileColSize; ++tj)
        for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
          C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);
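
Putting the two mappings together, each GPU thread then executes roughly the following (CUDA-style pseudocode purely for illustration; blockIdx/threadIdx stand in for the gpu.block_id/gpu.thread_id values in the IR, and the x block dimension equals TileRowSize):

    // GridSize  = [C.row_size() / TileRowSize, C.col_size() / TileColSize, 1]
    // BlockSize = [TileRowSize, 1, 1]
    uint32_t i  = blockIdx.x * TileRowSize;  // tile row origin
    uint32_t j  = blockIdx.y * TileColSize;  // tile column origin
    uint32_t ti = i + threadIdx.x;           // this thread's row within the tile

    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t tj = j; tj < j + TileColSize; ++tj)
        for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
          C.at(ti, tj) += A.at(ti, tk) * B.at(tk, tj);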

Local memory promotion for matrix B (LinalgPromotionOptions)

    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      // Load by all threads in a workgroup; each thread loads "TileStepSize" values
      LocalB[TileStepSize][TileColSize] = 
        B[k + 0 .. TileStepSize][j + 0 .. TileColSize];

      for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
        for (uint32_t tk = k, ltk = 0; tk < k + TileStepSize; ++tk, ++ltk)
          C.at(ti, tj) += A.at(ti, tk) * LocalB.at(ltk, ltj);
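
On an actual GPU, the cooperative copy into LocalB also needs workgroup barriers around it; sketching that in the same pseudocode (the copy helper and barrier() below are just illustrative placeholders, not generated code):

    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize) {
      // All threads in the workgroup cooperatively fill the LocalB tile.
      copy_B_tile_to_local(LocalB, /*rowBase=*/k, /*colBase=*/j);  // hypothetical helper
      barrier();  // e.g. OpControlBarrier / __syncthreads()

      for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
        for (uint32_t tk = k, ltk = 0; tk < k + TileStepSize; ++tk, ++ltk)
          C.at(ti, tj) += A.at(ti, tk) * LocalB.at(ltk, ltj);

      barrier();  // before LocalB is overwritten in the next k iteration
    }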

Local memory for tile Outputs (LinalgPromotionOptions)

    LocalC[TileRowSize][TileColSize] = { 0, .., 0 };

    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      LocalB[TileStepSize][TileColSize] = 
        B[k + 0 .. TileStepSize][j + 0 .. TileColSize];

      for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
        for (uint32_t tk = k, ltk = 0; tk < k + TileStepSize; ++tk, ++ltk)
          LocalC.at(lti, ltj) += A.at(ti, tk) * LocalB.at(ltk, ltj);

    for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
      C.at(ti, tj) += LocalC.at(lti, ltj)

My first question is, can I tile and map the grid/block in this way:

    TileSizes = [TileRowSize, TileColSize, TileStepSize]
    GridSize  = [C.row_size() / TileRowSize, C.col_size() / TileColSize, 1]
    BlockSize = [TileRowSize, 1, 1]
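
To make the question concrete with some purely hypothetical numbers:

    // Illustrative sizes only, not from a real benchmark:
    // C: 1024 x 1024, TileRowSize = 32, TileColSize = 32, TileStepSize = 4
    TileSizes = [32, 32, 4]
    GridSize  = [1024 / 32, 1024 / 32, 1]  // = [32, 32, 1]
    BlockSize = [32, 1, 1]                 // one thread per row of the tile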

My second question: I tried to run some of the examples in TestLinalgTransforms.cpp:

mlir-opt -test-linalg-transform-patterns=test-matmul-to-vector-patterns-tile-2d \
  -convert-linalg-to-parallel-loops -test-gpu-greedy-parallel-loop-mapping \
  -convert-parallel-loops-to-gpu -gpu-kernel-outlining -lower-affine -canonicalize \
  -legalize-std-for-spirv  -split-input-file \
  ./test/Dialect/Linalg/transform-patterns-matmul-to-vector.mlir

But I got the message:

./test/Dialect/Linalg/transform-patterns-matmul-to-vector.mlir split at line #1:8:3: error: semi-affine expressions (division by non-const) are not supported

I tried this because in our case we will probably need to create our own custom transformation flow, so I wrote something like this:

    LinalgLoopDistributionOptions cyclicNprocsDefault;
    cyclicNprocsDefault.distributionMethod.resize(2,
                                                  DistributionMethod::Cyclic);
    cyclicNprocsDefault.procInfo =
        getGpuProcIds<gpu::BlockIdOp, gpu::GridDimOp>;
    patterns.insert<LinalgTilingPattern<MatmulOp>>(
        ctx,
        LinalgTilingOptions()
            .setTileSizes({32, 32, 4})
            .setLoopType(LinalgTilingLoopType::ParallelLoops)
            .setDistributionOptions(cyclicNprocsDefault)
        , LinalgMarker({},
                      Identifier::get("after_distribute3", ctx))
                     );

But this error is blocking me :disappointed_relieved:

Please give me hints or directions, thanks :slight_smile:

CY

This is very timely, @cycheng. Within IREE/ModelBuilder we have been doing similar things.

The exact steps you are looking to do have been done in IREE. This is an example of an IREE test that does the same thing: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/test/workgroup_memory_promotion.mlir

(I am in the process of doing this for the conv operation as well.)

The mapping to parallel loops happens in a subsequent pass in IREE. The distribution options that you mentioned are part of the effort to move the distribution logic from IREE to MLIR.

@ThomasRaoux is looking at using cooperative matrix instructions and is in the process of pushing this upstream.

All this to say that it would be great if

  1. This is useful for you
  2. We can collaborate on improving performance of matmul and other operations.

I think I hit a similar error when I was adding support for workgroup memory on convolutions. I'll try to see what's happening.


Hi CY,

The optimizations you are trying to do make a lot of sense. As @MaheshRavishankar mentioned, I am working on a similar flow, although not all the pieces are quite functional yet. I have been taking a bit of a detour from the current IREE codegen that does linalg -> SPIR-V, as we have been working on an experimental flow doing linalg -> vector -> SPIR-V. If you look at this patch, I'm using the MatmulCodegenStrategy, which is shared between CPU and GPU, to do a very similar transformation. (BTW, @nicolasvasilache is presenting this work on Thursday in the open design meeting; it will be mostly about vector but should touch on the GPU part too.) I'm planning to send a PR for it very soon, but I'm waiting on a dependency in MLIR.

If you look at this code, it describes a similar strategy to what you are describing:

    strategy
        .tile<linalg::MatmulOp>(
            linalg::LinalgTilingOptions()
                .setLoopType(linalg::LinalgTilingLoopType::ParallelLoops)
                .setTileSizes({tileM, tileN, tileK})
                .setInterchange({1, 0, 2})
                .setDistributionOptions(WGDistribute))
        .setHoistInvariantCode(true)
        .promote<linalg::MatmulOp>(
              linalg::LinalgPromotionOptions()
                  .setAllocationDeallocationFns(allocateWorkgroupMemory,
                                                deallocateWorkgroupMemory)
                  .setCopyInOutFns(copyToFromWorkgroupMemory,
                                   copyToFromWorkgroupMemory)
                  .setOperandsToPromote({0, 1})
                  .setUseFullTileBuffers({false, false}))
          .tile<linalg::MatmulOp>(
              linalg::LinalgTilingOptions()
                  .setLoopType(linalg::LinalgTilingLoopType::ParallelLoops)
                  .setTileSizes(
                      {tileM / numSubgroupX, tileN / numSubgroupY, tileK})
                  .setDistributionOptions(SGDistribute))
        .vectorize<linalg::MatmulOp>().unrollVector<vector::ContractionOp>(
        {cooperativeMatrixM, cooperativeMatrixN, cooperativeMatrixK});

That should work; basically it sounds like the main difference from the above is that you are not trying to use subgroup operations. You could do the second tiling and map the M dimension to threadID.x, while the N dimension would become a loop (ideally unrolled).

Why do you want to promote the output to local memory? If you unroll the loop along j you can keep C in registers. Are you worried about register pressure? Doing a load and store to shared memory for every operation sounds expensive.
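
For illustration, keeping the C tile in registers with the j loop unrolled would look roughly like this in the same C-style pseudocode as above (the accumulator array is just for the sketch):

    // Each thread owns one output row ti and TileColSize accumulators in registers.
    float acc[TileColSize] = {0.0f};
    for (uint32_t k = 0; k < B.row_size(); k += TileStepSize)
      for (uint32_t tk = k; tk < k + TileStepSize; ++tk)
        for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)  // unrolled
          acc[ltj] += A.at(ti, tk) * B.at(tk, tj);   // or LocalB if B is promoted
    for (uint32_t tj = j, ltj = 0; tj < j + TileColSize; ++tj, ++ltj)
      C.at(ti, tj) += acc[ltj];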


Thanks @ThomasRaoux for the detailed explanation (I was just about to post the same).

From my side I am very happy to work with you guys! GPU computing + polyhedral codegen is what I really want to do but am not able to do by myself. The whole work (MLIR+IREE) also benefits us, because SPIR-V+Vulkan is a standard backend, which means any vendor that supports Vulkan can directly leverage MLIR+IREE to generate high-performance device code for BLAS applications (although it may need some parameter tweaking for different targets). Moreover, we can support TFLite or even TensorFlow directly if the MLIR ecosystem generates Vulkan-based blobs for us.

Thanks for open-sourcing the work and putting so much effort into it :wink:

The best way to do this is to build it up in MLIR (or the LLVM project) itself and not use IREE directly. Until now, I think most of the MLIR work has been about building up the pieces needed for the full stack, but there is no place within the MLIR or LLVM ecosystem where the full stack can be built. The IREE codegen has the full stack, and some work is happening within the TF stack; all of these put together the different pieces from core. It would be good if all of this could be consolidated in one place so that we could use it from there directly and everyone could pick it up from, and contribute to, that place. There are questions about the goals of such a full stack (what models/ops, etc. and what devices to target) that make it hard to see what this common infra would look like. The matmul strategy (and other similar things) definitely helps along that direction for single ops.

Thanks =) I think this matches my requirements.
I updated my IREE and used your patch. I saw these messages:

error: 'std.mulf' op operand #0 must be floating-point-like, but got 'i8'
error: failed to legalize operation 'vector.transfer_read'
error: 'func' op unsupported module-level operation

Our Vulkan implementation does not support cooperative matrix load/store/mad :(
I would like to test your patch with f32 inputs/outputs; could you give me some guidance?

Yes, you are right! Our implementation for GX6250 uses registers for C.
We faced some tradeoffs here and were thinking of reducing register pressure by using local memory, but keeping C in registers might be the best strategy for GX6250.

It's great to know that the stack is helpful to you! It's a fantastic collaboration among many talented contributors, and all contributions are very welcome, no matter how large or small; so please certainly feel free to propose changes to make your use cases better. :slight_smile:

Agreed. We decided to invest in Vulkan compute because it's a standards-based approach that gives us great reach across different platforms, hardware architectures, and form factors. The approach we have taken thus far is to start with widely applicable generic paths and then move to more vendor-specific specialized optimizations. So, first make it work and then make it performant.

Heterogeneity is baked into Vulkan/SPIR-V, so the infrastructure is set up to handle that. We have target environment modelling pertaining to Vulkan/SPIR-V, and all conversion utilities are target-environment aware. Patterns should not be worried about the particular target environment that much; the awesome dialect conversion framework together with the target environment should be able to handle that under the hood. It's our hope to be able to auto-scale to different configurations too, so you just need to throw in your configuration and the patterns should adjust to support cases like with/without dedicated workgroup memory, different subgroup sizes, etc. As said by others, we've been prototyping the overall flow (runtime + kernel) in IREE. To support PowerVR, hopefully we just need a different vulkaninfo dump. :slight_smile:

Thomas is also working on enabling fast matmul using subgroup ops for the cases where one does not have access to cooperative matrix instructions.