RFC: Changes to `TilingInterface`

Based on some experience with TilingInterface, I want to propose changing the signatures of the methods getTiledImplementation, getResultTilePosition, and getResultTileValue as follows:

```cpp
struct TilingResult {
  SmallVector<Operation *> generatedOperations;
  SmallVector<Value> replacements;
};

FailureOr<TilingResult> getTiledImplementation(OpBuilder &b, Operation *extractOp,
    ValueRange tileDestinationTensor);

FailureOr<Operation *> getResultTilePosition(OpBuilder &b, Operation *extractOp);

FailureOr<Value> getResultTileValue(OpBuilder &b, int resultNumber,
    Operation *extractOp, ValueRange tileDestinationTensor);
```
An explanation of each of the changes follows.

Explanation of changes to getTiledImplementation

  • The current signatures of these methods only work for orthogonal domains. In keeping with going beyond code generation for rectangular domains (see this RFC), using an Operation *extractOp instead of ArrayRef<OpFoldResult> offsets, ArrayRef<OpFoldResult> sizes allows the extractOp to carry the information about the domain. To keep the change an NFC, the implementations of the TilingInterface and the algorithms using it will initially support only tensor.extract_slice operations, but further development of tiling implementations/transformations can generalize to other operations that can specify a richer domain.
    Note: The offsets and sizes that are passed as operands to getTiledImplementation today represent the iteration space (and not a slice of a tensor). So to specify the iteration space we would need to create a “fake” tensor.extract_slice that carries the iteration domain information and gets DCE-ed in the end (see the sketch after this list).
  • The tileDestinationTensor is the destination to use for the tiled implementation. For operations that implement the DestinationStyleOpInterface, current transformations do some complex analysis to preserve the destination. Passing it explicitly will allow preserving the destination-passing-style semantics in a more robust manner.
  • It is useful to track both the operations created within the call to getTiledImplementation and the values that represent the tiled results of the original operation. Today’s implementations happen to correspond to the case where the results of a single operation represent the tiled results, but this does not have to be the case. Decoupling these in the return value will help remove this assumption.
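To make the note above concrete, here is a minimal sketch of how a caller could build such a “fake” domain-carrying slice. It is purely illustrative: the helper name is hypothetical, and the element type of the placeholder tensor is arbitrary since the tensor represents no data.

```cpp
// Hypothetical helper (not part of the proposed API): materialize a
// hyperrectangular iteration domain as a tensor.extract_slice so it can be
// passed to getTiledImplementation. Both ops are dead by construction and
// are expected to be DCE-ed after tiling.
static Operation *createIterationDomainSlice(OpBuilder &b, Location loc,
                                             ArrayRef<OpFoldResult> offsets,
                                             ArrayRef<OpFoldResult> sizes) {
  // Unit strides: the domain is a dense hyperrectangle.
  SmallVector<OpFoldResult> strides(sizes.size(), b.getIndexAttr(1));
  // A placeholder tensor that carries no data; it only anchors the slice op.
  Value placeholder = b.create<tensor::EmptyOp>(loc, sizes, b.getIndexType());
  return b.create<tensor::ExtractSliceOp>(loc, placeholder, offsets, sizes,
                                          strides);
}
```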

Explanation of changes to getResultTilePosition

  • The change to use an Operation * operand instead of ArrayRef<OpFoldResult> offsets, ArrayRef<OpFoldResult> sizes is the same as above.
  • The current signature of this function returns the offset w.r.t. the original untiled result and the size of the tiled result by reference, and only works for orthonormal domains. Returning an Operation * that carries the domain information instead will allow for future developments that go past orthonormal domains. To keep the change an NFC, the current implementations will return a tensor.extract_slice (a sketch follows below), but future iterations of the implementation/transformations can go past this limitation.
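As an illustration, here is a minimal sketch of what an implementation could return under the new signature, for an op whose result indexing is the identity of its iteration space. MyParallelOp is a hypothetical op; nothing here is part of the actual proposal.

```cpp
// Hypothetical implementation: for identity result indexing, the result tile
// position matches the iteration-domain slice one-to-one, so we return a
// tensor.extract_slice of the untiled result with the same offsets/sizes.
FailureOr<Operation *>
MyParallelOp::getResultTilePosition(OpBuilder &b, Operation *extractOp) {
  auto domainSlice = dyn_cast<tensor::ExtractSliceOp>(extractOp);
  if (!domainSlice)
    return failure(); // Only hyperrectangular domains are handled for now.
  return b
      .create<tensor::ExtractSliceOp>(
          getLoc(), getResult(0), domainSlice.getMixedOffsets(),
          domainSlice.getMixedSizes(), domainSlice.getMixedStrides())
      .getOperation();
}
```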

Explanation of changes to getResultTileValue

The changes to this method follow from the changes to getTiledImplementation.

The changes proposed here are expected to be as future-proof as we can get based on our current understanding. The aim of this RFC is to get feedback on the proposed changes so that the interface is as future-proof as possible.
Similar changes are potentially needed for the recently added PartialReductionOpInterface as well.

cc @nicolasvasilache @herhut @pifon2a @ftynse @ThomasRaoux @antiagainst @hanchung @matthias-springer


Thank you Mahesh!

What would support for other “subset” operations look like? If getTiledImplementation tiles the entire domain, wouldn’t it be better to pass some specification of how the iteration space should be tiled? For the current use case, a struct with offsets, sizes, and strides plus a callback that abstracts away the creation of extract_slice/subview-like operations would be enough (see the sketch below). I am wondering what the specification of iteration domain tiling would look like for scatter.
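To illustrate, one possible shape for such an API could be the following; all names here are hypothetical:

```cpp
// A plain descriptor of one hyperrectangular tile of the iteration space.
struct IterationSpaceTile {
  SmallVector<OpFoldResult> offsets;
  SmallVector<OpFoldResult> sizes;
  SmallVector<OpFoldResult> strides;
};

// Callback that materializes the given tile of `source`, e.g. by creating a
// tensor.extract_slice, a memref.subview, or gml_st.materialize.
using MaterializeTileFn = llvm::function_ref<Value(
    OpBuilder &b, Location loc, Value source, const IterationSpaceTile &tile)>;
```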

I think it is also important to extend emitScalarImplementation to support tensors, not only memrefs.

The layering in TilingInterface is meant to separate the tiling of the iteration space from the path that generates the code for a tile. For example, scf::tileUsingSCFForOp tiles the iteration space; the implementation of the intra-tile loop body is what getTiledImplementation returns. So I am not sure what the callback would be for. getTiledImplementation does not create the slice of the iteration domain (orthogonal or otherwise); that is done by the caller. All that is needed is to “convey” the slice, and the best way to do that is through an operation that captures the specification of the slice (tensor.extract_slice is one such operation). I am open to suggestions, though (but a callback doesn’t seem to be a good fit).

Sure, that’d make sense. I don’t think that needs a change in the signature of the function. It’s a pure implementation detail of the op implementing the interface.

getTiledImplementation creates ExtractSliceOps and the tiled operation(s) inside makeTiledShapes. link

Let me prototype what I meant by the callback.

I’m generally in favor of the change.

However, I am not a fan of creating ops just so we can pass arguments to a function in a generic way, and even less of assuming some later DCE will remove them. More nitpicky: if it’s a tensor.extract_slice that describes the iteration, what is the tensor value the slice is extracted from? Should we also create a tensor.empty with the appropriate size?

We are missing a concept to represent the iteration space. It cannot be an attribute because the space may depend on values (unless somebody is willing to introduce dependent types/attributes in MLIR). It cannot be an operation because it would need to exist somewhere and potentially follow verification rules beyond what is strictly necessary here. So we should invent a new concept. A direct suggestion is to add a new class TileableIterationSpace supporting isa/cast via TypeID. We can then have an OrthogonalDenseIterationSpace subclass of it containing sizes and offsets, and introduce more interesting shapes later. One readily available example is an AffineDenseIterationSpace that is an integer set with symbolic bounds and can represent, e.g., triangular spaces. Implementations can then dyn_cast to one of the iteration space kinds they support and emit the code accordingly. Longer term, we can build a class hierarchy if desired and introduce virtual methods or concept/model dispatch similar to how interfaces are implemented internally. FWIW, I don’t see how one can write a getTiledImplementation that supports an arbitrary iteration space, so some sort of kind checking is unavoidable.
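A minimal sketch of what this could look like, assuming the class names above; nothing like this exists upstream today:

```cpp
// Base class carrying a TypeID so that llvm::isa/dyn_cast work on it.
class TileableIterationSpace {
public:
  explicit TileableIterationSpace(TypeID typeID) : typeID(typeID) {}
  virtual ~TileableIterationSpace() = default;
  TypeID getTypeID() const { return typeID; }

private:
  TypeID typeID;
};

// Hyperrectangular (orthogonal, dense) iteration space given by offsets/sizes.
class OrthogonalDenseIterationSpace : public TileableIterationSpace {
public:
  OrthogonalDenseIterationSpace(ArrayRef<OpFoldResult> offsets,
                                ArrayRef<OpFoldResult> sizes)
      : TileableIterationSpace(TypeID::get<OrthogonalDenseIterationSpace>()),
        offsets(offsets.begin(), offsets.end()),
        sizes(sizes.begin(), sizes.end()) {}

  static bool classof(const TileableIterationSpace *space) {
    return space->getTypeID() == TypeID::get<OrthogonalDenseIterationSpace>();
  }

  ArrayRef<OpFoldResult> getOffsets() const { return offsets; }
  ArrayRef<OpFoldResult> getSizes() const { return sizes; }

private:
  SmallVector<OpFoldResult> offsets;
  SmallVector<OpFoldResult> sizes;
};
```

An implementation would then do `if (auto *dense = dyn_cast<OrthogonalDenseIterationSpace>(space)) { ... }` and bail out on kinds it does not support.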

Regarding the TilingResult structure, I suggest separating the “main” generated operations, such as the linalg.generic that applies to a tile, from all the other generated operations. This is useful for chaining, when one needs to tile the produced operation again (see the sketch below).
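Concretely, the split could look something like this (a sketch; the field names are hypothetical):

```cpp
struct TilingResult {
  // The "main" tiled operations (e.g. the tiled linalg.generic), which a
  // caller may want to tile again when chaining transformations.
  SmallVector<Operation *> tiledOps;
  // All other operations materialized along the way (slices, casts, ...).
  SmallVector<Operation *> auxiliaryOps;
  // Values replacing the results of the original untiled operation.
  SmallVector<Value> replacements;
};
```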

I’d love to see progress in such a direction that is not limited to affine stuff and would work with gather/scatter.

Separately, I have been toying around and have prototyped a notion of SubsetOpInterface / SubsetExtractOpInterface / SubsetInsertOpInterface to slap on all relevant ops.
One place I got stuck is that an insert wants to create a matching extract and vice versa, and op type / tablegen interface definition order something something fails us miserably …

⚙ D136322 [mlir] ODS: emit interface model method at the end of the header partially addressed this. Diego also outlined the work necessary to address this fully. That being said, these are still op interfaces and I don’t think we should create dummy ops to represent iteration spaces.

Not arguing for passing dummy ops, but with such an interface that knows about its companion op, the idea was to use templates + type deduction to get (a very rough sketch):

```cpp
TileInterface<SubsetExtractOpInterface>(some_args, ctor_args...)
```

and that we would be able to automatically get:

```cpp
b.create<SubsetExtractOpInterface::MatchingInsertOp>(ctor_args...)
```

This way we wouldn’t have to frontload general iteration sets, although progress in that direction is also important.

I meant something like ⚙ D138298 [mlir][DO-NOT-REVIEW-IT-IS-A-DRAFT] updated getTiledImplementation, but if we can have a hierarchy of classes with isa/cast, I would definitely go for it. I don’t like the idea of passing a dummy op.

I am not able to follow why the callback is needed. It seems to be a way to intercept how getTiledImplementation computes slices, but that is an implementation detail of the getTiledImplementation methods. I am not able to follow why that needs to go through the TilingInterface. Effectively, getTiledImplementation is a callback; this is adding a callback to a callback.

IMO, the interesting part in @pifon2a’s diff is the SubsetSpecification class, which is conceptually close to my proposed iteration domain abstraction. If I understand correctly, it was intended as a kind of “union” / sum-type abstraction that may contain a Tile or some other subset specification, but it isn’t visible in the example because we only have hyperrectangular tiles for now. The extra callback then is the equivalent of isa/dyn_cast that accesses the relevant part of SubsetSpecification.

Exactly. Also, the extra callback allows creating materialize(tile) instead of extract_slice, to unify GmlSt’s fork of the tiling interface with the upstream one.

(Looks like I missed a few of the posts here, responding to those)

That was my thought.

Can you elaborate more on why this is an issue? Agreed that a Type is not possible since we don’t have dependent types. An Operation seems fine though. Having a verifier is potentially a good thing, and it can capture any iteration space description in the future.

Everything you describe here seems to be solved by having an Operation. I actually think it is a “feature” that you can serialize all of this in the IR. I am assuming the TileableIterationSpace is some sort of serializable object… I am not really following how that is different from an operation.

Yup, that’s why I was thinking of an Operation.

In any case, if the idea of creating a dummy operation to specify the iteration space is deemed “not great”, I am happy with using the TileableIterationSpace object @ftynse describes… My only requirement here is that it is serializable in the IR.

@MaheshRavishankar I would like to provide a bit more context. There is a tiling interface in MLIR_HLO/GmlSt which is a fork of the upstream tiling interface, with the only difference at the moment being that it uses gml_st.materialize instead of tensor.extract_slice to get the tiles.

This RFC proposes changes to the tiling interface, and I thought that we could also try to unify both interfaces. I am open to suggestions on how to do that. One variant is to have a simple callback argument as in the prototype. Another is to have an interface with two parameters, one being the op that we tile and the second the subset op that we want to use to materialize the tile, but this requires a lot of infrastructure changes.

Unification of the interfaces would allow LinalgExt and tHLO to converge.

It’s kind of hard to figure out much about this since this “materialize” operation does not have any documentation as far as I can tell. I would start by going back to first principles here: why isn’t tensor.extract_slice enough? Why should the interface let the op implementation customize this? Isn’t the op used for materialization somehow orthogonal to the operation that defines the interface?

I will explain what I have in mind in terms of the separation of concerns between a transformation that uses the interface (in this case tiling and tile + fuse) and the implementation of the interface by an op.

  1. The transformation that is using the interface creates a tile of a particular shape as well as the code to iterate over these tiles. Today, scf::tileUsingSCFForOp creates orthogonal tiles that can be described using a tensor.extract_slice operation, so this operation can be passed as an argument to getTiledImplementation to specify the shape of the tile. You could have a different implementation of scf::tileUsingSCFForOp that produces tiles of a different shape, in which case you could pass any other Operation * to represent the shape of the tile that this implementation creates.
  2. Within the implementation of the TilingInterface by an operation, specifically getTiledImplementation, you can check whether the operation supports generating code for a tile of that given shape. Basically, it can do isa<tensor::ExtractSliceOp>(shape_specification_op) and generate the code that represents the orthogonally-shaped tile of the original operation (see the sketch after this list). If a different type of shape_specification_op specifies a different tile shape, the implementation can either return failure (to signal an unhandled shape) or generate different code for that tile shape. This all becomes part of the implementation of the TilingInterface.
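A rough sketch of point 2, assuming the proposed signature; MyOp and generateRectangularTile are hypothetical placeholders:

```cpp
FailureOr<TilingResult>
MyOp::getTiledImplementation(OpBuilder &b, Operation *shapeSpecificationOp,
                             ValueRange tileDestinationTensor) {
  // Dispatch on the kind of tile-shape specification the caller passed in.
  if (auto sliceOp = dyn_cast<tensor::ExtractSliceOp>(shapeSpecificationOp)) {
    // Orthogonal tile: read the iteration-domain offsets/sizes off the slice
    // op and emit the tiled body into the given destination.
    return generateRectangularTile(b, *this, sliceOp.getMixedOffsets(),
                                   sliceOp.getMixedSizes(),
                                   tileDestinationTensor);
  }
  // Any other shape specification is unhandled (for now).
  return failure();
}
```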

You can replace all uses of Operation * above with the new object hierarchy for describing shapes that Alex Zinenko proposed, which I am all for. If it isn’t implemented by the time I get to this, I am happy to take a stab at it as well.

In this picture, I am not sure a callback is needed.

This materialization is what getTiledImplementation is. It is literally meant to be the materialization of the tile that is being requested by the caller.


Technically, it can be represented as an operation. My (philosophical) concern is that conceptually it isn’t one. Operations represent, through their semantics, what is being done by the program when executed. Here, we are twisting the semantics of tensor operations to define an iteration space. So we have a fake tensor that doesn’t represent any data, and an extract_slice that only serves as a container for operands + attributes while the actual slice is ignored by the program. This feels off, even if clever.

There are also technical things that I find tricky. These operations (tensor.empty + extract_slice) are dead by construction. Given Linalg’s affection for running partial DCE/folding in many places, they may get deleted too early, e.g., during the tiled implementation generation while their attributes still need to be read later, and fun memory corruption debugging ensues. Conversely, we expect that DCE will happen before bufferization so we don’t need to allocate a buffer for the non-data tensor. Such implicit requirements are annoying to handle. Certainly, the tiling pass can subsume DCE, but IMO we shouldn’t encourage passes subsuming each other’s logic.

I don’t really understand this requirement. Could you elaborate on why you want it to be serializable? From my reading of the posts above, it is only needed to call the interface method. Do you have a setup in mind where one pass decides on the iteration space partitioning and records this in the IR, and another pass actually implements the partitioning?


Ok, I agree with all your points here. So +1 for having a new object hierarchy for representing iteration spaces.

A couple of things here. (1) It is just easier to debug if the iteration space description itself is serializable: you don’t need a debugger etc. to inspect what iteration space is being passed; it is either in the IR, or it can be inspected using dump/DBGs.
(2) Long term it might indeed be possible to progressively lower the iteration space itself… Say you have a mix of indirect and direct accesses (sliced gathers, for example): you could have a pass handle the indirect accesses first and then the direct accesses. My sense is that having the ability to defer how different parts of the iteration space are handled can be very useful. If it is not serializable in the IR, then that context has to be carried through in code, which can be awkward and has its limits. Also, maybe I over-stated how strong of a requirement I see this as; getting something in place is more important than it being serializable in the IR…

Having had some time to digest this a bit more, I tend to agree that capturing this as operations may indeed be useful for complex, data-dependent iteration spaces. In this case, I can reframe my feedback as making sure these auxiliary operations are separable from the main computational IR, e.g., by putting them in a separate region.

In any case, I’m okay with any approach that takes these concerns into consideration and documents the decisions so we can revisit them if necessary.

Just catching up on this thread fresh, I’m having trouble picturing what form might satisfy the feedback and am genuinely curious what this might look like. I get that these things “want” to be modeled as dependent types, and that in MLIR the two approaches to that tend to be struct/attribute based with careful definitions of value bindings, or an “op based” approach.

I typically find that the op-based approach leaves less to the imagination (i.e., the data dependency is modeled in IR instead of by some kind of binding convention), but the struct/attribute approach tends to spell things out how they are represented canonically elsewhere. But then, as you say, you kind of end up with two branches of pseudo-IR in one construct, and having some notion of separability becomes important to keep it all from being spaghetti.

But I’m having trouble following what things could look like in this case in order to be separable.

Pardon the noise if this is clear to everyone. Just trying to understand better.