I am new to MLIR and looking for a way to port my DSL (namely T2S, a Halide-based FPGA programming language) to MLIR/CIRCT as a dialect. I happened to see this post, and the Transform dialect is exactly what I am looking for. As far as I can see, the Transform dialect enables "embedded IR" in the MLIR infrastructure: one part of the IR (the payload) is embedded inside the whole IR, and another part (the transformer) transforms the payload into the final IR that actually executes. With this functionality, all kinds of Halide-style embedded DSLs that separate the concerns of algorithm and schedule could be built. It can be very powerful. It can also be very general: any optimization (like CSE or loop unrolling) can be given a payload IR to transform.
For me, a simplified version of the Transform idea could be enough. For example (not a mature idea, just a rough sketch):
#cmap = affine_map<(i, j, k) -> (i, j) >
#amap = affine_map<(i, j, k) -> (i, k) >
#bmap = affine_map<(i, j, k) -> (k, j) >
// A specification for the payload, e.g. matrix multiply C = A * B
%0 = t2s.func GEMM(%C : tensor<100x100xf32>, %A : tensor<100x200xf32>, %B : tensor<200x100xf32>) {
%res = t2s.multiply %A, %B {sink = #cmap, srcs = [#amap, #bmap]} : tensor<100x200xf32>, tensor<200x100xf32> -> tensor<100x100xf32>
return %res : tensor<100x100xf32>
}
// The transform part: from the specification, isolate the reference to %A into another function named ALoader
t2s.transform.isolate_producer(%0, %A, ALoader)
It expresses the following transformation: the reference to %A in GEMM is pulled out into a new function, ALoader, which loads A and feeds it to GEMM through a channel.

The requirement here is simpler than in the general Transform dialect because no pattern-matching rule is needed: %0 (the function GEMM) is named directly as the payload to manipulate.
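For comparison, a rough sketch of how the same request might be phrased with the upstream Transform dialect follows (the exact sequence syntax has changed across releases, and transform.t2s.isolate_producer is a hypothetical extension op invented here for illustration, not an existing one):
transform.sequence failures(propagate) {
^bb0(%root: !transform.any_op):
  // No pattern matching inside the function body is needed: the t2s.func op is
  // matched as a whole (here simply by op name) and handed to the extension op.
  %gemm = transform.structured.match ops{["t2s.func"]} in %root : (!transform.any_op) -> !transform.any_op
  // Hypothetical t2s extension op: isolate the reference to %A into a new function ALoader.
  transform.t2s.isolate_producer %gemm {operand = "A", func_name = "ALoader"} : !transform.any_op
}
Either way, the payload function is addressed as a whole through a handle, and the schedule-like directives live in a separate part of the IR.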
The IR after the transform might look like this:
#cmap = affine_map<(i, j, k) -> (i, j) >
#amap = affine_map<(i, j, k) -> (i, k) >
#bmap = affine_map<(i, j, k) -> (k, j) >
%AChannel = t2s.func ALoader(%A : tensor<100x200xf32>) {
%1 = t2s.load %A {srcs = [#amap]} : tensor<100x200xf32> -> tensor<100x200xf32>
t2s.write_channel(%AChannel, %1)
}
%2 = t2s.func GEMM(%C : tensor<100x100xf32>, %AChannel, %B : tensor<200x100xf32>) {
%3 = t2s.read_channel(%AChannel)
%res = t2s.multiply %3, %B {sink = #cmap, srcs = [#bmap]} : tensor<100x200xf32>, tensor<200x100xf32> -> tensor<100x100xf32>
return %res : tensor<100x100xf32>
}
After the isolation, the two functions can be further transformed separately (e.g. to insert buffers, remove redundant loads, etc.), but in a similar way.
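For instance, the follow-up steps could stay in the same style, with each directive addressing one function handle (again, these ops are hypothetical and only sketch the idea):
// Hypothetical follow-up transforms, applied to each function separately.
t2s.transform.insert_buffer(%AChannel)       // insert a buffer inside ALoader
t2s.transform.remove_redundant_loads(%2)     // remove redundant loads inside GEMM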
In short, I really like this Transform idea, and believe it can be a very useful feature.