Moving Operations from one Block into another Operation/Block

Hi all,

I have written a utility that I call clustering, built out of existing MLIR API calls. I admit the naming may not be the best. The API is as follows:

unsigned Cluster(std::set<Operation *> &ops,
                 Block &block,
                 cluster_op instruction, /* Enum indicating what type the clustering operation is */
                 mlir::MLIRContext *context);

The aim of the above function is to take a set of operations (usually a subgraph) present within a block, create a new op (the cluster op) in that block that has a region with blocks inside it, and then move the operations into the new op.

While moving the ops into the new op, we also need to correctly re-connect the ops that used to connect to the moved ops, so as to maintain the connectivity of the IR.

I have the above implemented and working in a number of places in our compiler. I feel I may be missing some terminology here, but is there an inbuilt mechanism within MLIR core that can perform the above, without me having to write my own implementation? It feels to me like this could be generic functionality. If not, do we see this as useful functionality we could generalise and add to MLIR?

Thanks,
Aaron

Inferring from your wording (my apologies if I misconstrue): there is a splice call in LLVM with an API that is really nasty to use but does what you want.

Alternatively, the pattern rewriter has mergeBlocks and mergeBlockBefore.
@dcaballe was asking for something similar but without a RewritePattern and I believe he is working on something.
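For the splice route, the core of it is just the intrusive-list splice on the block's operation list; a minimal sketch, with placeholder names, assuming the ops to move form a contiguous range [first, last). The nastiness is everything around it (terminators, uses crossing the boundary):

#include "mlir/IR/Block.h"

// Move a contiguous range of ops from `srcBlock` into `destBlock`.
// Block::getOperations() returns an llvm::iplist<Operation>, so the usual
// intrusive-list splice applies. Note that this only moves the ops; it does
// NOT fix up operands/uses that cross the old/new boundary.
void spliceRange(mlir::Block &srcBlock, mlir::Block &destBlock,
                 mlir::Block::iterator first, mlir::Block::iterator last) {
  destBlock.getOperations().splice(destBlock.end(),
                                   srcBlock.getOperations(), first, last);
}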

Thanks for the ping, Nicolas.

I think creating a new region op that encloses a range of existing ops is a common pattern. In addition to what Nicolas mentioned, inlineRegionBefore is another utility that can be used to move ops across regions but it falls a bit short for more generic cases and it’s part of the pattern-rewrite infra. Having a more generic and flexible utility would be super useful!
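For reference, inlineRegionBefore itself is a one-liner inside a pattern; a sketch, assuming the source op exposes its region via getRegion() (placeholder names). What it doesn't do is any of the use/operand rewiring, which is where it falls short for this use case:

// Inside matchAndRewrite(): move all blocks of `srcOp`'s region so that they
// sit immediately before `destBlock`, without cloning. Values defined outside
// the region keep their existing uses untouched.
rewriter.inlineRegionBefore(srcOp.getRegion(), destBlock);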

I haven’t thought too much about the implementation of such a utility yet but in my use case I’m trying to go from something like:

  parentOp {
    Op1
    Op2
    Op3
    Terminator
  }

to something like:

  parentOp {
    NewOp1
    NewOp2 {
      Op1
      Op2
      NewTerminator
    }
    Op3
    Terminator
  }
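I haven't written it yet, but the mechanical part of that transformation could look something like the sketch below. RegionOpTy/TerminatorOpTy and wrapOpsInRegion are placeholders, RegionOpTy is assumed to be buildable from (loc, resultTypes) and to expose a single region via getRegion(), and the sketch only moves the ops and creates the terminator:

#include "mlir/IR/Builders.h"

// Wrap the contiguous op range [first, last) of `block` into a new region op
// created at `first`'s position.
template <typename RegionOpTy, typename TerminatorOpTy>
RegionOpTy wrapOpsInRegion(mlir::OpBuilder &builder, mlir::Block &block,
                           mlir::Block::iterator first,
                           mlir::Block::iterator last,
                           mlir::TypeRange resultTypes,
                           mlir::ValueRange yieldedValues) {
  builder.setInsertionPoint(&*first);
  auto regionOp = builder.create<RegionOpTy>(first->getLoc(), resultTypes);

  // Give the new op a single entry block and splice the ops into it.
  auto *body = new mlir::Block();
  regionOp.getRegion().push_back(body);
  body->getOperations().splice(body->end(), block.getOperations(), first, last);

  // Terminate the new region with the values that should escape it.
  builder.setInsertionPointToEnd(body);
  builder.create<TerminatorOpTy>(regionOp.getLoc(), yieldedValues);
  return regionOp;
}

The part this glosses over, and what a shared utility would really need to handle, is replacing outside uses of the moved ops' results with the new op's results and threading externally defined values into the region via operands/block arguments.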

If you already have a utility working for something similar I’ll be happy to take a look. Indeed, it’s not a trivial transformation. Maybe we can generalize it and contribute it upstream!

Hi Diego

What you describe in the example above is exactly the kind of functionality I have implemented.

I am more than happy to share the implementation, but I need a couple of days to try and clean out some dependencies that would prevent it from being generic and from compiling outside our out-of-tree project.

I will try to post a patch around the middle of next week.

Thanks,
Aaron

Related to this, I was fighting the rewriter today as there is no RAUW equivalent.
In my particular case, I was moving a region into another but I was left with some RAUW needs: a value defined outside both ops needed to be replaced by a BBArg of the new op.

As a consequence I had to do post-processing that explicitly clones with a BlockAndValueMapping (bvm) and had fun tracking the self-uses that I needed to filter out.

If one of you also has a non-cloning, non BBArgs-replacing RAUW for RewritePattern, I’d be happy to review to plug this hole.
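To make the hole concrete, what I effectively need is something along the lines of the sketch below, except routed through the rewriter so the driver can track it; as written it goes behind the rewriter's back (function and argument names are illustrative):

#include "mlir/IR/Operation.h"

// Replace the uses of `outerValue` that are nested inside `regionOp` with the
// corresponding block argument of its entry block. NOTE: this mutates the IR
// directly instead of going through the PatternRewriter.
void replaceWithBlockArg(mlir::Value outerValue, mlir::Operation *regionOp,
                         unsigned argIdx) {
  mlir::Value arg = regionOp->getRegion(0).front().getArgument(argIdx);
  outerValue.replaceUsesWithIf(arg, [&](mlir::OpOperand &use) {
    return regionOp->isProperAncestor(use.getOwner());
  });
}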

I will try to post a patch around the middle of next week.

Great! Looking forward to it!

If one of you also has a non-cloning, non bbArgs-replacing RAUW for RewritePattern, I’d be happy to review to plug this hole.

I have the impression that we should be able to decompose the transformation into multiple utilities. One of them could be what you need here. We’ll see.

Are the ops contiguous in the block? (This goes towards Nicolas's suggestion, but I wasn't sure if that is the case.) And is it required that all ops are topmost in the block? (Could one, instead of passing it in, use the parent block of the head op?) [I'd probably use a predicate function rather than a std::set, to allow different sets to be used and different clustering approaches, such as clustering based on an attribute without populating a set.]

@aardeb, just for your reference, I implemented something quickly off-tree to unblock my work here. You can take a look at moveOperationsBefore, getDefsWithExternalUses and the code around their invocations to better understand my use case. Hopefully that helps!

Hi all, we (GrAI Matter Labs) have a very similar problem (and implementation) as well. First, we cluster related ops based on MLIR’s slice analysis. Something like:

// Collect all ops reachable forward from `states` that satisfy the predicate.
llvm::SetVector<Operation *> slice;
mlir::getForwardSlice(states.getOperation(),
                      &slice,
                      [&states](Operation *op) -> bool {
                          return somePredicate(op);
                      });
slice.insert(states.getOperation());
// For every op collected so far, also pull in its backward slice so the
// cluster contains all required producers.
llvm::SetVector<Operation *> backwardRoots(slice.begin(), slice.end());
for (auto *root : backwardRoots) {
    llvm::SetVector<Operation *> backwardSlice;
    ::mlir::getBackwardSlice(
            root, &backwardSlice, [&states, &root](Operation *op) -> bool {
                return someOtherPredicate(op);
            });
    slice.insert(backwardSlice.begin(), backwardSlice.end());
}

// Return the clustered ops in a dependency-respecting order.
return topologicalSort(slice);

In our case the ops need not be contiguous in a block. Afterwards these get moved into some new op region:

template <typename RegionOpTy>
RegionOpTy Cluster::makeRegionOp(OpBuilder &builder, const Cluster &cluster,
                                 BlockAndValueMapping &map) {
    // Build the body in a detached region first.
    Region body;
    mlir::IRRewriter bodyBuilder(builder.getContext());
    prepareBuilderForFloatingRegion(bodyBuilder, body);

    // Clone the clustered ops into the detached body, recording the mapping.
    for (auto *op : cluster.ops) bodyBuilder.clone(*op, map);

    // Terminate the body with the cloned values that escape the cluster.
    auto mappedValues = cluster.mappedValuesUsedOutside(map).takeVector();
    auto returnOp = bodyBuilder.create<SomeTerminator>(
        UnknownLoc::get(builder.getContext()), llvm::makeArrayRef(mappedValues));

    // Create the region op with matching result types and adopt the body.
    auto regionOp = builder.create<RegionOpTy>(
        UnknownLoc::get(builder.getContext()), returnOp.getOperandTypes());
    regionOp.body().takeBody(body);
    return regionOp;
}

followed by a bunch of updates to ops in the original region so that they use the result values of this new op.
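That last step is roughly the following, shown here as a simplified sketch (replaceExternalUses and valuesUsedOutside are illustrative names, not our exact API; valuesUsedOutside holds the original values that escape the cluster, in terminator-operand order):

#include "llvm/ADT/STLExtras.h"
#include "mlir/IR/Operation.h"

// After creating `regionOp`, redirect all remaining uses of the original
// escaping values to the matching results of `regionOp`.
void replaceExternalUses(mlir::Operation *regionOp,
                         llvm::ArrayRef<mlir::Value> valuesUsedOutside) {
  for (auto pair : llvm::zip(valuesUsedOutside, regionOp->getResults())) {
    mlir::Value oldValue = std::get<0>(pair);
    mlir::Value newResult = std::get<1>(pair);
    oldValue.replaceUsesWithIf(newResult, [&](mlir::OpOperand &use) {
      // Leave anything nested inside the new op alone.
      return !regionOp->isProperAncestor(use.getOwner());
    });
  }
}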

If something like that were available built into MLIR, we would be very happy to use it as well.

Hi all

I pushed a draft patch of my code here:
https://reviews.llvm.org/D114888

The interface is:

template <class TerminatorOp, class ClusterOp>
Operation *Cluster(std::vector<Operation *> &ops,
                   llvm::iplist<Operation>::iterator insertClusterOp,
                   Block &block, mlir::MLIRContext *context)

  • The ops in std::vector<Operation *> &ops do not need to form a contiguous range, but they must be topologically sorted and form a connected subgraph.
  • llvm::iplist<Operation>::iterator insertClusterOp is the position in the block at which the new op (into which the ops will be moved) is inserted (see the usage sketch below).
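To make the interface concrete, a call site would look roughly like this (mydialect::ClusterOp, mydialect::YieldOp and collectSubgraph are placeholder names, not what is in the patch):

// Hypothetical usage of the proposed utility.
std::vector<mlir::Operation *> opsToCluster =
    collectSubgraph(block); // topologically sorted, connected subgraph
mlir::Operation *clusterOp =
    Cluster<mydialect::YieldOp, mydialect::ClusterOp>(
        opsToCluster,
        /*insertClusterOp=*/opsToCluster.front()->getIterator(),
        block, opsToCluster.front()->getContext());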

The code I have is a bit convoluted but works. Happy to receive feedback on how to improve it.

@dcaballe / @jschuhmacher: looking through your code, it seems that a lot of the complexity in my code may be because I do not assume a linear range of ops and instead take a subgraph; your code seems to get away without having to find the external op connections and such. I may also be re-implementing some MLIR internal code, leading to a more complex solution.

Thanks for sharing your use cases and implementations! I’m trying to build a bit more expertise on my use case before providing any feedback, but my first impression is that I don’t think we can have a single utility that creates the new region op and moves all the target operations into the new region in one shot. I think that decomposing the transformation into 2 or 3 utilities would make the implementation simpler and more composable.

I’m also wondering why your implementations clone the set of operations to be moved into the new region. Would it be possible to just move them without cloning them? It was possible at least in my use case and I think that would be more efficient since the number of operations to move could be pretty large.

The reason I am cloning is that I can take an arbitrary subgraph of operations to move from the IR, not necessarily a linear range of ops. As we move the operations, we need to remap the operands of the ops to respect the connections that come from ops that also move, and any connections that, after clustering, will come in through the block arguments. At the time of writing that code I could not come up with a way to move the operations while reconnecting the operands in one single step, so cloning seemed the only option. I agree that it may be expensive, so maybe a move-and-reconfigure-operands approach is needed here. Maybe we can do that as a two-stage process, or add the functionality to do it in a single step.
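To make the two-stage idea concrete, a sketch of what it could look like (all names are placeholders; this assumes the cluster op has a single-block region whose terminator is created afterwards, and that the caller passes the collected capturedValues as operands of the cluster op):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Block.h"
#include "mlir/IR/Operation.h"

// Stage 1: move the ops into the cluster body (no cloning).
// Stage 2: rewire operands defined outside the moved set to newly added block
// arguments of the cluster's entry block.
void moveAndRemap(llvm::ArrayRef<mlir::Operation *> ops,
                  mlir::Block &clusterBody,
                  llvm::SmallVectorImpl<mlir::Value> &capturedValues) {
  llvm::SmallPtrSet<mlir::Operation *, 8> moved(ops.begin(), ops.end());
  for (mlir::Operation *op : ops)
    op->moveBefore(&clusterBody, clusterBody.end());

  llvm::DenseMap<mlir::Value, mlir::BlockArgument> capture;
  for (mlir::Operation *op : ops) {
    for (mlir::OpOperand &operand : op->getOpOperands()) {
      mlir::Value v = operand.get();
      mlir::Operation *def = v.getDefiningOp();
      // Values produced by moved ops keep their existing connections.
      if (def && moved.count(def))
        continue;
      // External value: thread it in through a block argument (added once).
      auto it = capture.find(v);
      if (it == capture.end()) {
        // Older MLIR versions take just the type here, without a location.
        mlir::BlockArgument arg =
            clusterBody.addArgument(v.getType(), v.getLoc());
        it = capture.insert({v, arg}).first;
        capturedValues.push_back(v);
      }
      operand.set(it->second);
    }
  }
}

The caller would then create the terminator, pass capturedValues as operands of the cluster op, and replace remaining outside uses of the moved ops' results with the cluster op's results.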

We’re cloning for similar reasons. Also, sometimes an operation from the source region might need to be replicated into multiple destination regions; this we could also achieve with some pre-processing though.

The transformation looks like this, from

%1 = dialect.parent_region_op -> tensor<1x8x8x2xelT>  {
  %2 = dialect.some_op0 -> memref<3x1x1x1xelT>
  %3 = dialect.some_op1(%arg0, %2) : (tensor<1x8x8x3xelT>, memref<3x1x1x1xelT>) -> tensor<1x8x8x1xelT>
  %4 = dialect.some_op2 -> memref<1x8x8x1xelT>
  %5 = dialect.some_op4(%3, %4) : (tensor<1x8x8x1xelT>, memref<1x8x8x1xelT>) -> tensor<1x8x8x1xelT>
  %6 = dialect.some_op3(%5) : (tensor<1x8x8x1xelT>) -> tensor<1x10x10x1xelT>
  %7 = dialect.some_op0 -> memref<1x3x3x2xelT>
  %8 = dialect.some_op1(%6, %7) : (tensor<1x10x10x1xelT>, memref<1x3x3x2xelT>) -> tensor<1x8x8x2xelT>
  %9 = dialect.some_op2 -> memref<1x8x8x2xelT>
  %10 = dialect.some_op4(%8, %9) : (tensor<1x8x8x2xelT>, memref<1x8x8x2xelT>) -> tensor<1x8x8x2xelT>
  dialect.parent_end %10 : tensor<1x8x8x2xelT>
}

to

%1 = dialect.parent_region_op -> tensor<1x8x8x2xelT> {
  %2 = dialect.region_op0 -> tensor<1x8x8x1xelT> {
    %4 = dialect.region_op1 -> tensor<1x8x8x1xelT> {
      %5 = dialect.some_op0 -> memref<3x1x1x1xelT>
      %6 = dialect.some_op1(%arg0, %5) : (tensor<1x8x8x3xelT>, memref<3x1x1x1xelT>) -> tensor<1x8x8x1xelT>
      %7 = dialect.some_op2 -> memref<1x8x8x1xelT>
      %8 = dialect.some_op4(%6, %7) : (tensor<1x8x8x1xelT>, memref<1x8x8x1xelT>) -> tensor<1x8x8x1xelT>
      dialect.out %8 : tensor<1x8x8x1xelT>
    }
    dialect.out %4 : tensor<1x8x8x1xelT>
  }
  %3 = dialect.region_op0 -> tensor<1x8x8x2xelT> {
    %4 = dialect.region_op1 -> tensor<1x8x8x2xelT> {
      %5 = dialect.some_op0 -> memref<1x3x3x2xelT>
      %6 = dialect.some_op3(%2) : (tensor<1x8x8x1xelT>) -> tensor<1x10x10x1xelT>
      %7 = dialect.some_op1(%6, %5) : (tensor<1x10x10x1xelT>, memref<1x3x3x2xelT>) -> tensor<1x8x8x2xelT>
      %8 = dialect.some_op2 -> memref<1x8x8x2xelT>
      %9 = dialect.some_op4(%7, %8) : (tensor<1x8x8x2xelT>, memref<1x8x8x2xelT>) -> tensor<1x8x8x2xelT>
      dialect.out %9 : tensor<1x8x8x2xelT>
    }
    dialect.out %4 : tensor<1x8x8x2xelT>
  }
  dialect.parent_end %3 : tensor<1x8x8x2xelT>
}

(but it probably seems a bit odd without the context). One of the reasons for the nesting is to avoid having to deal with mapping the block args here.