Match patterns based on an MLIR input

TL;DR: Is there a way to make the rewriter match a pattern based on MLIR code, instead of using PDL or traversing the IR in a custom pass with an anchor op? Can it be done without recompiling the compiler for each new pattern?

Hi,

I am working on generating accelerators and driver code for different (linalg) algorithms. As such, linalg.generic is great, as it can capture the class of operations I am interested in.
However, the accelerators are not derived from MLIR, and the compiler only knows which operations they can cover if I create a specific match/rewrite for each accelerator_type × dataflow combination.

With the goal of enabling the user to write a simple config file for driving their “complex” accelerator, I started playing with the idea of exposing enough accelerator parameters (size, instructions, operation, etc.) as pass options or as an input config file that controls the codegen. All of this works alright as long as I know the operation ahead of time (anchor-op style).

What I would like is for the user to write a single linalg.generic + trait (containing the usual generic information plus the accelerator parameters) in a config.mlir file that controls the transformations.

Questions

Is it possible to pattern match based on this external .mlir config file written in MLIR syntax?
Are there any examples in tree?

Considering that the file the user writes round-trips OK with mlir-opt, would it make sense for me to compile it separately and try to extract the operation information from its symbol table and use it in the target code?

Are there any PDL examples used with linalg.generic patterns?

I appreciate any suggestions.
Thank you!

My example:

// Original MLIR
...
    linalg.generic {indexing_maps = [#map5, #map6, #map5], iterator_types = ["parallel"]} ins(%104, %2 : memref<16xf32>, memref<f32>) outs(%166 : memref<16xf32>) {
    ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):  // no predecessors
      %513 = arith.addf %arg2, %arg3 : f32
      linalg.yield %513 : f32
    }
    %167 = memref.alloc() : memref<16xf32>
    linalg.generic {indexing_maps = [#map5, #map5], iterator_types = ["parallel"]} ins(%166 : memref<16xf32>) outs(%167 : memref<16xf32>) {
    ^bb0(%arg2: f32, %arg3: f32):  // no predecessors
      %513 = math.rsqrt %arg2 : f32
      linalg.yield %513 : f32
    }
    %168 = memref.expand_shape %167 [[0, 1]] : memref<16xf32> into memref<1x16xf32>
    %169 = memref.alloc() : memref<1x80x80x16xf32>
    linalg.generic {indexing_maps = [#map2, #map4, #map2], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%165, %168 : memref<1x80x80x16xf32>, memref<1x16xf32>) outs(%169 : memref<1x80x80x16xf32>) {
    ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):  // no predecessors
      %513 = arith.mulf %arg2, %arg3 : f32
      linalg.yield %513 : f32
    }
...
// User config file - used to match the pattern below, in the code above
#map2 = ...
#map4 = ...
#trait = {
  indexing_maps = [#map2, #map4, #map2], 
  iterator_types = ["parallel", "parallel", "parallel", "parallel"],
  opcodes = {(..,..),..},
  flow_pattern = ...,
  etc = ...,
} 

linalg.generic #trait ins(%A, %B : memref<?x?x?x?xf32>, memref<?x?xf32>) 
                       outs(%C : memref<?x?x?x?xf32>) {
^bb0(%0: f32, %1: f32, %2: f32): 
  %3 = arith.mulf %0, %1 : f32
  linalg.yield %3 : f32
}
// Transformed code
...
    linalg.generic {indexing_maps = [#map5, #map6, #map5], iterator_types = ["parallel"]} ins(%104, %2 : memref<16xf32>, memref<f32>) outs(%166 : memref<16xf32>) {
    ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):  // no predecessors
      %513 = arith.addf %arg2, %arg3 : f32
      linalg.yield %513 : f32
    }
    %167 = memref.alloc() : memref<16xf32>
    linalg.generic {indexing_maps = [#map5, #map5], iterator_types = ["parallel"]} ins(%166 : memref<16xf32>) outs(%167 : memref<16xf32>) {
    ^bb0(%arg2: f32, %arg3: f32):  // no predecessors
      %513 = math.rsqrt %arg2 : f32
      linalg.yield %513 : f32
    }
    %168 = memref.expand_shape %167 [[0, 1]] : memref<16xf32> into memref<1x16xf32>
    %169 = memref.alloc() : memref<1x80x80x16xf32>
    
    // Replaced the last linalg.generic that matched the pattern with accelerator driver code.
    accel.send %165 : memref<1x80x80x16xf32>
    accel.send %168 : memref<1x16xf32>
    accel.recv %169 : memref<1x80x80x16xf32>
...

The selling point of PDL is exactly what you want: matching without recompiling the compiler for new patterns. It is arguably still a work in progress, and I would encourage you to consider contributing to and extending it instead of rolling a custom solution. Unless it was added recently, PDL currently lacks support for matching regions and their internals, which is a big blocker for linalg.generic-style operations.
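For illustration, a standalone PDL pattern lives in its own .mlir file, so no recompilation is needed. The sketch below matches a plain arith.mulf rather than a linalg.generic, precisely because of the region-matching limitation mentioned above; the pattern name and the choice of replacement op are purely illustrative, and the exact assembly syntax may vary between MLIR versions:

```mlir
// Hypothetical sketch: match a two-operand f32 arith.mulf and replace it
// with an arith.addf on the same operands.
pdl.pattern @swap_mulf_for_addf : benefit(1) {
  %f32 = pdl.type : f32
  %lhs = pdl.operand
  %rhs = pdl.operand
  %mul = pdl.operation "arith.mulf"(%lhs, %rhs : !pdl.value, !pdl.value) -> (%f32 : !pdl.type)
  pdl.rewrite %mul {
    %add = pdl.operation "arith.addf"(%lhs, %rhs : !pdl.value, !pdl.value) -> (%f32 : !pdl.type)
    pdl.replace %mul with %add
  }
}
```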

In the meantime, the “structured” subset of the transform dialect has a simple match operation (Transform Dialect - MLIR) capable of finding operations with specified attributes. It can arguably be extended as long as it remains within the structured realm, but it likely won’t be as efficient as PDL on large inputs. Looking at the Transform dialect in general could be interesting for your case: some of its stated goals are driving codegen at a finer granularity and updating heuristics without recompiling the compiler, which sounds close to what you want. It does not, however, aim for the same level of generality in IR matching and rewriting as PDL.
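As a rough sketch (the exact syntax has changed between MLIR versions, and the attribute filter shown here is an assumption on my part), a transform script matching linalg.generic ops by their iterator types could look like:

```mlir
transform.sequence failures(propagate) {
^bb0(%module: !pdl.operation):
  // Find every linalg.generic whose iteration space is 4-D parallel.
  %matched = transform.structured.match ops{["linalg.generic"]}
      attributes{iterator_types = ["parallel", "parallel", "parallel", "parallel"]}
      in %module
  // %matched is a handle to all matched ops; further transform ops
  // (tiling, fusion, ...) can consume it.
}
```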

It’s not very clear to me what you mean here: the user writes some IR with holes/wildcards that gets matched against the “fully instantiated” IR? I have a hard time imagining how those holes would be expressed without breaking all sorts of IR invariants: how does one express an unknown number of operations that feed each other? How does one express the subset of values an attribute of an operation may take? Etc.

There shouldn’t be a material difference between compiling two unrelated modules located in separate files or in the same file, as long as they don’t directly refer to each other through, e.g., symbols.

IREE has an example of using a separate file with a schedule expressed using the transform dialect: https://github.com/iree-org/iree/blob/main/tests/transform_dialect/cuda/reduction.mlir, https://github.com/iree-org/iree/blob/main/tests/transform_dialect/cuda/reduction_codegen_spec.mlir

Not upstream; everything seems to have switched to transform.structured.match.

Thank you for the reply!

I am new to the transform dialect. I will check it out.

Currently the only wildcard we support is the size of inputs and outputs, expressed with ?x?x?. The accelerator implements a specific tile_size of said operation, so after the linalg.generic (a named op in the current implementation) is matched, tiling takes place based on user input. SSA value names don't matter as long as the flow is the same.
All other information has to be explicit: if the inner operations of the generic region change, the user has to write a new configuration, because the instruction flow in the accelerator might change.

An example of a user defined accelerator configuration/trait for a f32 linalg.matmul accelerator would look like this:

// User config file:
#matmul_trait = {
  // Initialization
  dma_init_config = {
    dmaAddress = ...,
    dmaInputAddress = ...,
    ...
  },

  // Opcodes to send once, uses tokens defined in opcode_map
  init_opcodes = "(reset)",

  // Tiling
  accel_dim = map<(m, n, k) -> (4, 4, 4)>,

  // Permutation and who can be stationary
  permutation_map = affine_map<(m, n, k) -> (m, k, n)>,

  // Flow/Opcodes
  opcode_map = [
       map<(sA)      ->(0x22)[send(0)]>,
       map<(sB)      ->(0x23)[send(1)]>,
       map<(cC_rC)   ->(0x24)[recv(2)]>,
       map<(sB_cC_rC)->(0x25)[send(1),recv(2)]>,
       map<(reset)   ->(0xFF)[]>,
  ],

  // Flow to implement, uses tokens defined in opcode_map
  flow_pattern="(sA (sB_cC_rC)+)",
  // flow_pattern="(sA (sB   cC_rC)+)", // also valid, but with different opcodes

  acc_on_cpu=[2],

  // Includes the normal linalg.generic traits: 
  // Original generic information
  iterator_types = ["parallel", "parallel", "reduction"],
  indexing_maps = [
    affine_map<(m, n, k) -> (m, k)>, // A
    affine_map<(m, n, k) -> (k, n)>, // B
    affine_map<(m, n, k) -> (m, n)>  // C
  ]
}
...
  linalg.generic #matmul_trait
    ins (%A, %B : memref<?x?xf32>, memref<?x?xf32>)
    outs(%C : memref<?x?xf32>) {
    ^bb0(%a: f32, %b: f32, %c: f32):
      %0 = arith.mulf %a, %b : f32
      %1 = arith.addf %c, %0 : f32
      linalg.yield %1 : f32
    }
...

Not supported yet. Support is currently planned for a single op with nested ops in its region (a single linalg.generic), not for a sequence of ops in the outermost scope.

This is very helpful, thank you for sharing.

I will look a little more at PDL and transform.structured as suggested. If I extend them, I will be sure to create a pull request.

Thank you