[RFC] Interfaces and dialects for precise IR transformation control

You can inject more operations at load-time by implementing a DialectExtension:

class TransformDialectExtension
    : public DialectExtension<TransformDialectExtension, TransformDialect> {
public:
  void apply(MLIRContext *, TransformDialect *dialect) const final {
    for (const auto &opInitializer : opInitializers)
      opInitializer(dialect);
  }

  template <typename OpT>
  void registerOp() {
    opInitializers.push_back([](TransformDialect *dialect) {
      RegisteredOperationName::insert<OpT>(*dialect);
    });
  }

private:
  // Deferred registration callbacks, run when the extension is applied.
  std::vector<std::function<void(TransformDialect *)>> opInitializers;
};

And register the extension ops with the dialect registry.
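To make the deferred-registration pattern concrete, here is a self-contained sketch. Every type in it (MockTransformDialect, MockTransformExtension, MyTileOp) is a stand-in I made up for the real MLIR classes (TransformDialect, DialectExtension, RegisteredOperationName), so the example can run without MLIR:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Mock of a dialect: real MLIR dialects hold registered op names too,
// though not as a plain vector of strings.
struct MockTransformDialect {
  std::vector<std::string> registeredOps;
};

class MockTransformExtension {
public:
  // Runs all deferred registrations against the dialect instance.
  void apply(MockTransformDialect *dialect) const {
    for (const auto &opInitializer : opInitializers)
      opInitializer(dialect);
  }

  // In real MLIR this would call RegisteredOperationName::insert<OpT>;
  // here we only record the op name.
  template <typename OpT>
  void registerOp() {
    opInitializers.push_back([](MockTransformDialect *dialect) {
      dialect->registeredOps.push_back(OpT::name());
    });
  }

private:
  std::vector<std::function<void(MockTransformDialect *)>> opInitializers;
};

// A hypothetical op to register.
struct MyTileOp {
  static std::string name() { return "my.tile"; }
};
```

The point of deferring the registration in callbacks is that the extension can be built up (registerOp calls) long before any dialect instance exists; apply replays the callbacks once the dialect is loaded into a context.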

Actually, now that you mention it, it might be interesting to experiment with making the transform dialect dynamically extensible (e.g. a dynamic op that calls into the Python interpreter?) if you have a potential use case. I think you just set a flag in the ODS definition.

EDIT: now that I think about it, a dialect extension like that would work for any dialect…

Interesting. Yet some dialects are specifically designed to limit expressiveness to ops and types with stronger semantic guarantees. Wouldn’t adding dynamic extensibility to these be counterproductive?

Yes, it would be strange depending on the dialect.

We discussed this in the ODM today. I’ll try to (maybe poorly) summarize some of the resolution points:

  • The “Transform” dialect and related utilities are treated as types of “infrastructure” and we want to have a similar bar of quality, usability, etc. that other parts of the MLIR infrastructure have.
  • The high-level user entry points are not fully scoped at this point in time, and to build these effectively we need actual use cases to validate and improve aspects of the design.
  • Given the previous point, the “Transform” dialect (+ surrounding infra) is not ready for general unfettered use, and instead will be scoped to a set of initial use cases to help validate the design before opening up to general use.
  • We need proper high level documents that detail the up-to-date status and design of the “Transform” dialect, current intended use cases, some guidelines on interaction, etc.
    • This is an improvement over past similar exercises (e.g. PDL)

I likely missed some points, but huge thanks to @ftynse for answering my annoying questions and for driving this effort in general.

– River


Thanks @River707! This is actually a pretty good summary.

There isn’t a clear place for documentation about “work-in-progress”, so I added a big disclaimer at the top of the dialect documentation and a longer explanation of the initially intended use scenarios at the bottom of it. We will strive to keep the dialect documentation up to date with design decisions and larger changes. The additional documentation, which hopefully clarifies some of the points raised during the discussion, is also pasted below for better visibility.


The transformation control infrastructure provided by this dialect is positioned roughly between rewrite patterns and passes. A transformation executed by a transform operation is likely to be sufficiently complex to require at least a set of patterns to be implemented. It is also expected to be more focused than a pass: a pass typically applies identical transformations everywhere in the IR, whereas a transform dialect-controlled transformation applies to a small subset of operations selected, e.g., by a pattern-matching operation or generated by a previous transformation. It is discouraged, although technically possible, to run a pass pipeline as part of the transform op implementation.

One of the main scenarios for using this dialect is fine-grained chaining of transformations. For example, a loop-like operation may see its iteration domain split into two parts, implemented as separate loops (a transformation known as index-set splitting), each of which is then transformed differently (e.g., the first loop is tiled and the second unrolled), with the necessary enabling and cleanup patterns around the main transformation:

// <generate %loop, e.g., by pattern-matching>
// ...
%parts:2 = transform.loop.split %loop { upper_bound_divisible_by = 8 }
transform.loop.tile %parts#0 { tile_sizes = [8] }
transform.loop.unroll %parts#1 { full }

This composition would have been difficult to implement as separate passes, since the hypothetical “tiling” and “unrolling” passes would need to somehow differentiate between the parts of the loop produced by the previous pass (both are the same operation, and it is likely undesirable to pollute the operation with pass-specific information). Implementing passes that run the combined transformation would have run into a combinatorial explosion due to the multiple possible transform compositions, or into the need for deep pass parameterization, the ultimate form of which is an ad-hoc dialect to specify which transformations the pass should run. The transform dialect provides a uniform, extensible mechanism for controlling transformations in such cases.

The transform dialect is supposed to be consumed by an “interpreter” pass that drives the application of transformations. To ensure extensibility and composability, this pass is not expected to actually perform the transformations specified by the ops. Instead, the transformations are implemented by the transform ops themselves via TransformOpInterface. The pass serves as the entry point, handles the flow of transform operations and takes care of bookkeeping. As such, the transform dialect does not provide the interpreter pass. Instead, it provides a set of utilities that clients can use to define their own interpreter passes, or as part of a more complex pass. Examples include the mapping between values in the transform IR and operations in the payload IR, and the function that applies the transformations specified by the ops in a given block sequentially. Note that a transform op may have regions with further transform ops in them, with the op itself guiding how to dispatch the transformation control flow to those regions. This approach allows clients to decide on the relative location of the transform IR in their input (e.g., nested modules, separate modules, optional regions to certain operations, etc.), register additional transform operations and perform client-specific bookkeeping.
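The two utilities mentioned above can be sketched in self-contained C++. This is an illustration of the concept only: TransformHandle and PayloadOp are stand-ins for mlir::Value and mlir::Operation*, and TransformState/applySequence are my invented names, not the actual MLIR API:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Mock handles: transform IR values and payload IR operations.
using TransformHandle = int;
using PayloadOp = std::string;

// The interpreter's bookkeeping: each transform IR value maps to the
// list of payload operations it refers to.
class TransformState {
public:
  void setPayloadOps(TransformHandle handle, std::vector<PayloadOp> ops) {
    mapping[handle] = std::move(ops);
  }
  const std::vector<PayloadOp> &getPayloadOps(TransformHandle handle) const {
    return mapping.at(handle);
  }

private:
  std::map<TransformHandle, std::vector<PayloadOp>> mapping;
};

// Applies "transform ops" (modeled here as callbacks) sequentially,
// stopping at the first failure, mirroring the utility that applies the
// transformations specified by the ops of a block in order.
bool applySequence(
    TransformState &state,
    const std::vector<std::function<bool(TransformState &)>> &transforms) {
  for (const auto &transform : transforms)
    if (!transform(state))
      return false;
  return true;
}
```

A client-defined interpreter pass would own the state, seed it with the initially matched payload ops, and run the sequence over the transform ops it finds in its chosen location (nested module, separate module, etc.).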

Although scoped to a single dialect, this functionality conceptually belongs to the MLIR infrastructure. It aims to be minimally intrusive and opt-in.

Some infrastructural components may grow extra functionality to support the transform dialect. In particular, the pattern infrastructure may add extra hooks to identify the “main results” of a transformation or to notify external observers about changes made to certain operations. These are not expected to affect the existing uses of the infrastructure.

For the sake of reusability, transformations should be implemented as utility functions that are called from the interface methods of transform ops rather than having the methods directly act on the payload IR.
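As a minimal sketch of that separation (all names here are illustrative, not the real TransformOpInterface API, and LoopOp mocks a payload operation):

```cpp
#include <cassert>

// A mock payload operation standing in for an mlir::Operation.
struct LoopOp {
  int unrollFactor = 1;
};

// The reusable transformation lives in a free utility function, callable
// from a pass, a rewrite pattern, or a transform op alike.
bool unrollLoop(LoopOp &loop, int factor) {
  if (factor <= 0)
    return false;
  loop.unrollFactor *= factor;
  return true;
}

// A transform op whose interface method merely delegates to the utility
// instead of manipulating the payload IR directly.
struct UnrollTransformOp {
  int factor;
  bool apply(LoopOp &target) const { return unrollLoop(target, factor); }
};
```

Keeping the logic in unrollLoop means a conventional pass can reuse the exact same transformation without going through the transform dialect at all.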


I am new to MLIR, and looking for a way to port my DSL (namely T2S, a Halide-based FPGA programming language) to MLIR/CIRCT as a dialect. I happened to see this post, and this Transform dialect is exactly what I am looking for. As far as I can see, the Transform dialect would enable “embedded IR” in the MLIR infrastructure – i.e. a part of IR (the payload) is embedded inside the whole IR, and another part of the IR (the transformer) would transform the payload into the final IR that really executes. With this functionality, all kinds of Halide-style embedded DSLs that separate concerns of algorithm and schedule could be enabled. It can be very powerful. It can also be very general: any optimization (like CSE or loop unrolling) can be given a payload IR to transform.

To me, a simplified transform idea could be enough. For example (not a mature idea, but roughly):

#cmap = affine_map<(i, j, k) -> (i, j) >
#amap = affine_map<(i, j, k) -> (i, k) >
#bmap = affine_map<(i, j, k) -> (k, j) >
// A specification for the payload, e.g. matrix multiply C = A * B
%0 = t2s.func GEMM(%C : tensor<100x100xf32>, %A : tensor<100x200xf32>, %B : tensor<200x100xf32>) {
    %out = t2s.multiply %A, %B {sink = #cmap, srcs = [#amap, #bmap]} : tensor<100x200xf32>, tensor<200x100xf32> -> tensor<100x100xf32>
    return %out
}
// The transform part: from the specification, isolate the reference to %A into another function named ALoader
t2s.transform.isolate_producer(%0, %A, ALoader)

It expresses the following transformation:
[figure: the reference to %A in GEMM is isolated into a separate ALoader function that feeds GEMM through a channel]

The requirement here is simpler than the general Transform dialect because no pattern-matching rule is needed. Instead, %0 (the GEMM function) is matched in its entirety and manipulated directly.

The IR after transform might look like

#cmap = affine_map<(i, j, k) -> (i, j) >
#amap = affine_map<(i, j, k) -> (i, k) >
#bmap = affine_map<(i, j, k) -> (k, j) >
%AChannel = t2s.func ALoader(%A : tensor<100x200xf32>) { 
    %1 = t2s.load %A {srcs = [#amap]} : tensor<100x200xf32> -> tensor<100x200xf32>
    t2s.write_channel(%AChannel, %1)
} 
%2 = t2s.func GEMM(%C : tensor<100x100xf32>, %AChannel, %B : tensor<200x100xf32>) {
    %3 = t2s.read_channel(%AChannel)
    %out = t2s.multiply %3, %B {sink = #cmap, srcs = [#bmap]} : tensor<100x200xf32>, tensor<200x100xf32> -> tensor<100x100xf32>
    return %out
}

After the isolation, the two functions can be further transformed separately (e.g. to insert buffers, remove redundant loads, etc.), but in a similar way.

In short, I really like this Transform idea, and believe it can be a very useful feature.