[MLIR][GPU][Linalg] How to find a valid lowering pipeline for workgroup memory usage via `transform.structured.promote`?

Hi! We’re trying to figure out how to use workgroup memory when lowering (as an example) a linalg.matmul for execution on an NVIDIA GPU.

We boiled it down to the following example, an extended version of a Linalg memory-promotion test (which suggests to us that a valid lowering path should exist):

func.func @main() {

    %A = memref.alloc() : memref<1024x1024xf32>
    %B = memref.alloc() : memref<1024x1024xf32>
    %C = memref.alloc() : memref<1024x1024xf32>

    %cf0 = arith.constant 0.00000e+00 : f32
    %cf1 = arith.constant 1.00000e+00 : f32

    linalg.fill ins(%cf1 : f32) outs(%A : memref<1024x1024xf32>)
    linalg.fill ins(%cf1 : f32) outs(%B : memref<1024x1024xf32>)
    linalg.fill ins(%cf0 : f32) outs(%C : memref<1024x1024xf32>)

    linalg.matmul ins(%A, %B: memref<1024x1024xf32>, memref<1024x1024xf32>)
               outs(%C: memref<1024x1024xf32>)
    return
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {

    %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    %func = transform.get_parent_op %matmul : (!transform.any_op) -> !transform.any_op

    %block_tiled_matmul, %for_blocks = transform.structured.tile_using_forall %matmul tile_sizes [4, 4, 0] (mapping = [#gpu.block<x>, #gpu.block<y>]) : (!transform.any_op) -> (!transform.any_op, !transform.any_op)

    %gpu_launch = transform.gpu.map_forall_to_blocks %func { generate_gpu_launch } : (!transform.any_op) -> !transform.any_op
    %matmul2 = transform.structured.match ops{["linalg.matmul"]} in %gpu_launch : (!transform.any_op) -> !transform.any_op
    %0 = transform.structured.promote %matmul2 { operands_to_promote = [0, 1], mapping = [#gpu.memory_space<workgroup>] } : (!transform.any_op) -> !transform.any_op
    transform.structured.convert_to_loops %0 : !transform.any_op

    transform.yield
  }
}

Assuming that transform.structured.convert_to_loops deletes its target op (which it currently does only on the main branch of LLVM), we can apply the schedule and obtain two memref.copy ops that write into workgroup memory:
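
For reference, we apply the schedule with the transform interpreter roughly as follows (the input file name is just a placeholder for the IR above, both payload and named sequence live in the same file):

mlir-opt test_promote.mlir --transform-interpreter --canonicalize --cse -o test_promote_lowered.mlir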

Lowered Code
#map = affine_map<(d0) -> (d0 * 4)>
module {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1024 = arith.constant 1024 : index
    %c4 = arith.constant 4 : index
    %c256 = arith.constant 256 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() : memref<1024x4xf32, #gpu.address_space<workgroup>>
    %alloc_1 = memref.alloc() : memref<4x1024xf32, #gpu.address_space<workgroup>>
    %alloc_2 = memref.alloc() : memref<1024x1024xf32>
    %alloc_3 = memref.alloc() : memref<1024x1024xf32>
    %alloc_4 = memref.alloc() : memref<1024x1024xf32>
    linalg.fill ins(%cst : f32) outs(%alloc_2 : memref<1024x1024xf32>)
    linalg.fill ins(%cst : f32) outs(%alloc_3 : memref<1024x1024xf32>)
    linalg.fill ins(%cst_0 : f32) outs(%alloc_4 : memref<1024x1024xf32>)
    gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c256, %arg7 = %c256, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
      %0 = gpu.block_id  x
      %1 = gpu.block_id  y
      %2 = affine.apply #map(%0)
      %3 = affine.apply #map(%1)
      %subview = memref.subview %alloc_2[%2, 0] [4, 1024] [1, 1] : memref<1024x1024xf32> to memref<4x1024xf32, strided<[1024, 1], offset: ?>>
      %subview_5 = memref.subview %alloc_3[0, %3] [1024, 4] [1, 1] : memref<1024x1024xf32> to memref<1024x4xf32, strided<[1024, 1], offset: ?>>
      %subview_6 = memref.subview %alloc_4[%2, %3] [4, 4] [1, 1] : memref<1024x1024xf32> to memref<4x4xf32, strided<[1024, 1], offset: ?>>
      gpu.barrier
      memref.copy %subview, %alloc_1 : memref<4x1024xf32, strided<[1024, 1], offset: ?>> to memref<4x1024xf32, #gpu.address_space<workgroup>>
      gpu.barrier
      memref.copy %subview_5, %alloc : memref<1024x4xf32, strided<[1024, 1], offset: ?>> to memref<1024x4xf32, #gpu.address_space<workgroup>>
      gpu.barrier
      scf.for %arg12 = %c0 to %c4 step %c1 {
        scf.for %arg13 = %c0 to %c4 step %c1 {
          scf.for %arg14 = %c0 to %c1024 step %c1 {
            %4 = memref.load %alloc_1[%arg12, %arg14] : memref<4x1024xf32, #gpu.address_space<workgroup>>
            %5 = memref.load %alloc[%arg14, %arg13] : memref<1024x4xf32, #gpu.address_space<workgroup>>
            %6 = memref.load %subview_6[%arg12, %arg13] : memref<4x4xf32, strided<[1024, 1], offset: ?>>
            %7 = arith.mulf %4, %5 : f32
            %8 = arith.addf %6, %7 : f32
            memref.store %8, %subview_6[%arg12, %arg13] : memref<4x4xf32, strided<[1024, 1], offset: ?>>
          }
        }
      }
      gpu.terminator
    }
    return
  }
}

Now a further lowering to LLVM eludes us. We tried

mlir-opt test_promote_lowered.mlir --gpu-kernel-outlining --convert-linalg-to-loops --convert-scf-to-cf -expand-strided-metadata -lower-affine -convert-arith-to-llvm --finalize-memref-to-llvm -convert-func-to-llvm --canonicalize --gpu-lower-to-nvvm-pipeline="cubin-chip=sm_75 cubin-features=+ptx75 opt-level=3"

which leaves us with the following messages:

error: conversion of memref memory space #gpu.address_space<workgroup> to integer address space failed. Consider adding memory space conversions.
error: conversion of memref memory space #gpu.address_space<workgroup> to integer address space failed. Consider adding memory space conversions.
error: 'llvm.call' op 'memrefCopy' does not reference a symbol in the current scope

Adding -convert-gpu-to-nvvm (which should already be part of gpu-lower-to-nvvm-pipeline) anywhere in the pipeline to get the mentioned memory space conversions, while leaving gpu-lower-to-nvvm-pipeline out, leads to this error:

error: 'llvm.call' op 'memrefCopy' does not reference a symbol in the current scope
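
As far as we understand it, the memrefCopy symbol comes from the lowering of memref.copy in --finalize-memref-to-llvm: copies involving non-identity (strided) layouts are turned into a call to an external memrefCopy function that is normally provided by the MLIR runner-utils libraries on the host, so inside the GPU kernel that call cannot be resolved. One workaround we have been considering (but have not verified end-to-end) is to rewrite the copies into explicit loop nests before the memref-to-LLVM conversion, e.g. for the first copy:

      // Untested sketch: the first memref.copy written out as a loop nest,
      // so that nothing needs to lower to a memrefCopy runtime call.
      scf.for %i = %c0 to %c4 step %c1 {
        scf.for %j = %c0 to %c1024 step %c1 {
          %v = memref.load %subview[%i, %j] : memref<4x1024xf32, strided<[1024, 1], offset: ?>>
          memref.store %v, %alloc_1[%i, %j] : memref<4x1024xf32, #gpu.address_space<workgroup>>
        }
      }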

Trying to use transform.structured.vectorize_children_and_apply_patterns on those two memref.copy ops directly (to go down a vector lowering path instead) yields:

error: Unsupported Op, cannot vectorize

Trying to use transform.structured.vectorize_children_and_apply_patterns on the entire module instead just leaves those two memref.copy ops untouched:

Appended to schedule
    // Convert memref.copy to vector.transfer_read/write:
    %1 = transform.structured.match ops{["memref.copy"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    %copy1, %copy2 = transform.split_handle %1 : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    %2 = transform.get_parent_op %copy1 {isolated_from_above} : (!transform.any_op) -> !transform.any_op
    %3 = transform.structured.vectorize_children_and_apply_patterns %2 : (!transform.any_op) -> !transform.any_op
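
For reference, this is roughly the IR we were hoping vectorization would produce for the first copy (a hand-written sketch, not something any pass produced for us):

      %pad = arith.constant 0.000000e+00 : f32
      %v = vector.transfer_read %subview[%c0, %c0], %pad {in_bounds = [true, true]} : memref<4x1024xf32, strided<[1024, 1], offset: ?>>, vector<4x1024xf32>
      vector.transfer_write %v, %alloc_1[%c0, %c0] {in_bounds = [true, true]} : vector<4x1024xf32>, memref<4x1024xf32, #gpu.address_space<workgroup>>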

It may be that we are simply missing a single pass that would resolve our problem, so if anybody is able to give pointers, that would be much appreciated. :slight_smile:

I’m actually trying to do something very similar on the same sm_75 architecture (though mine has no tensor cores) and am trying out mlir-python-extras. The author posted a Colab notebook in the GitHub issue discussion that might help: how can I run the cuda code on my gtx 1650? · Issue #78 · makslevental/mlir-python-extras · GitHub

The CUDA notebook mentioned there is here: mlir-python-extras/examples/cuda_e2e.ipynb at main · makslevental/mlir-python-extras · GitHub

If you figure it out, can you share?

Hi! I only found more indications that this is currently not possible when starting at the Linalg level (with just upstream MLIR, no other projects involved). Writing/generating code starting at the gpu dialect seems to be the better approach, and that is currently good enough for our project. If anything changes, I will update this thread. :slight_smile:
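
For anyone finding this later, a rough (untested) sketch of what we mean by starting at the gpu dialect (module and function names are made up): workgroup memory is declared directly as a memory attribution on the kernel instead of being introduced via transform.structured.promote.

module attributes {gpu.container_module} {
  gpu.module @kernels {
    // Workgroup buffer declared as a memory attribution on the kernel.
    gpu.func @copy_row(%src: memref<4x1024xf32>)
        workgroup(%tile: memref<4x1024xf32, #gpu.address_space<workgroup>>)
        kernel {
      %c0 = arith.constant 0 : index
      %tid = gpu.thread_id x
      %v = memref.load %src[%c0, %tid] : memref<4x1024xf32>
      memref.store %v, %tile[%c0, %tid] : memref<4x1024xf32, #gpu.address_space<workgroup>>
      gpu.barrier
      gpu.return
    }
  }
}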