[RFC] Add an op to group/cluster operations

I think there is no operation in MLIR core that can model grouping/clustering/scoping of operations without creating a function.

Users of MLIR have already defined such operations. There might be some other examples of clustering ops that I am not aware of. It would be nice to learn about them.

Examples

%1 = "tf_device.cluster"() ({
  %2 = ... -> tensor<1x8x2xf32>
  tf_device.return %2 : tensor<1x8x2xf32>
}) : () -> tensor<*xf32>
%0 = "chlo.rank_specialization_cluster"(%arg0, %arg1, %arg2) ({
       ^bb0(%arg0_ : tensor<*xf32>, %arg1_ : tensor<*xf32>, 
            %arg2_ : tensor<*xf32>):
         "chlo.rank_specialization_cluster_yield"(%2) 
             : (tensor<*xf32>) -> ()
}) : (tensor<*xf32>, tensor<*xf32>, tensor<*xf32>) -> tensor<*xf32>

Both operations have a single region with one block that yields some values. However, the first operation implicitly captures values defined outside of tf_device.cluster, while the second one uses explicit block arguments for that.

I propose to add such an operation to BuiltinOps.td, because it seems fundamental enough to be defined alongside FuncOp or ModuleOp. The operation would allow both implicit and explicit capture. It would have an optional string attribute to tag the clusters. We could call it ScopeOp, ClusterOp, GroupOp, or something else.

%0 = cluster(%a = %arg0 : tensor<2xf32>,
             %b = %arg1 : i32,
             %c = %arg2 : memref<*xi32>) "some tag" {
  ...
  return %result : tensor<10xf32>
} -> tensor<10xf32>

That might be interesting for @ezhulenev and @frgossen.

Hi, thanks for the RFC.

I think it might be helpful to look at some more motivating examples.

Having worked on and with a number of these region-containing ops, I’m not quite seeing the connection between them all as much as it sounds like you are. In the cases I’m aware of, the mode of capture and the mapping between block args, terminators, etc. is a pretty fundamental property of each op, and relatively unique. They also tend to have rich canonicalizations, folders, and verifiers tied to each op, and I’m not quite seeing how to generalize that. I might be missing the forest for the trees, though.

It is interesting to think about generalizations here, and I agree that examples and precise semantics would be useful. E.g., is this equivalent to an IIFE in C++? What does the string represent? (Is it debugging info?)

@stellaraccident Yes, there are many ops that have a region inside, but I would say there is a special class of such ops whose only purpose is to collect/group other ops. This is usually done to outline a region for further transformations.

For example, chlo.rank_specialization_cluster is used to collect operations that can be converted to their ranked counterparts together. tf_device.cluster is used to collect operations that will become a kernel later.

There are other applications as well, e.g. a non-greedy fusion heuristic that groups the operations that should be fused together. All of these applications follow the same pattern: one pass clusters the operations, and a later pass either transforms each cluster into some kind of kernel or dissolves it back into the parent block, as sketched below.
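A minimal sketch of that second, dissolving pass (hypothetical IR using the cluster syntax proposed above; some.op is a made-up operation):

%0 = cluster(%a = %arg0 : tensor<10xf32>) "fusion" {
  %1 = "some.op"(%a) : (tensor<10xf32>) -> tensor<10xf32>
  return %1 : tensor<10xf32>
} -> tensor<10xf32>

// After dissolving: the body is inlined into the parent block and
// the block argument %a is replaced by the captured value %arg0.
%0 = "some.op"(%arg0) : (tensor<10xf32>) -> tensor<10xf32>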

@jpienaar I definitely agree that we need more examples from other dialects. Yes, it is equivalent to an IIFE lambda. The string is more like a “type” that distinguishes what kind of cluster it is.

Yeah, no question that these patterns come up frequently. In-tree, quant.region has similar semantics (but IIRC it carries more side information).

IREE has a few such things, and while it was tedious to have to define them, doing so also brings the ability to write op-specific verifiers and to explicitly design the load-bearing details, and the extra structure can be valuable. My fear with a builtin, very generic grouping op is that it will be a bit too easy to reach for when a dialect-specific op would actually be called for.

On the other hand, more at the ML graph/frontend level, I could see this being used for a variety of things in a relatively ad hoc fashion, which may be fine. It just seems that as you get lower level, you also get more specific and controlled, and more specificity helps keep the IR well designed.

I’m not strongly opposed but I think generally, there should be a high bar for builtin ops and we need to interrogate them.

And this would include verification, matching the op’s result types to the terminator, what effects it has (e.g., does it always have recursive side effects, or is it allowed to ignore the ops in the region), and the like. I feel like this is where the string “type” is a signal: if there will be switching on the value of that string, then separate ops seem better (conversely, if there is general behavior that would enable greater reuse, that would show the value). What we could do is make it easy to define such ops (if it isn’t already), e.g. something like a class in ODS that makes it easy to create such an op. That depends on whether the pain is in creating ops or in duplicating transformations on ops with the same structure (although an interface could potentially be the answer there too).

+1 to all this! This kind of feels like adding

std.terminator "some tag" %0, %1, ...

Instead, we can just make sure that we have the right tools in ODS, RegionBranchOpInterface, etc. to make defining ops like this easier.

It seems, already in the OP, that there is a significant distinction between the capturing and non-capturing versions, which further argues against unifying them.

Fair enough. Let’s keep it the way it is.

Could the newly introduced scf.execute_region be used to group operations generically, the way this thread intended?
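For illustration, a minimal sketch of such grouping (the body op some.op is made up; scf.execute_region captures values from the enclosing region implicitly and returns results through scf.yield):

%0 = scf.execute_region -> tensor<10xf32> {
  // The body may directly use values defined in the enclosing block.
  %1 = "some.op"(%arg0) : (tensor<2xf32>) -> tensor<10xf32>
  scf.yield %1 : tensor<10xf32>
}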

Thank you, @mehdi_amini. That’s definitely related.