RFC for `TilingInterface` for tiling operations that dont fit into Linalg Structured Operation definition

This is RFC to introduce an interface that allows for tiling + distribution of a class of operations that do not naturally fall within the Structured Op interface, but still have semantics that do allow them to be tiled and distributed (and fused in a manner similar to tile + fuse works on Structured Ops). The Linalg tiling algorithm uses the indexing maps, i.e. affine maps that describe the access pattern from the iteration space of the operation to data space of each individual operand, that allow Structured ops to

  1. Compute the loop bounds based on the size of the operands
  2. Allows description of a tiling algorithm (and by extension tile + fuse) that is purely based on the affine maps (and their inversion). The tiling algorithm itself uses the indexing maps to
    • Compute the tile of the iteration space that is required to compute a tile of the result of an operation
    • Given the tile of the iteration space, compute the tile of the operands needed to compute the tile of the result of the operation
    • If the producer of an operand is also a Structured op, the same algorithm can be used to compute a tile of the operand in place by tiling the producer.

Many operations do not have an easy way of specifying the indexing maps, or at least and indexing map that is invertible. In absence of this, the rest of the document describes an interface that allows each operation to specify the information needed by the tiling algorithm to implement tiling in a manner similar to Structured ops.

The interface described here only deal with generating a tiled implementation of the operation, but the same should be usable for tile + fuse but is not explicitly part of this RFC. The tiling algorithm is not described in detail here. It has been prototyped within IREE here and this RFC is to start the process of upstreaming this interface to the MLIR code base. I will also be sending out the patch to move this upstream in a few days.

The algorithm for tiling itself is fairly straight-forward. Without describing the algorithm itself, the rest of this RFC describes the interface and shows the IR that would be generated for some operations that do not fit into the Structured op criteria, i.e.

  1. scatter
  2. gather
  3. pad

Similar to Structured operations, the operations here can have either tensor operands (i.e. has tensor semantics) or memref operands (i.e. has buffer semantics). The interface is intended to support tiling and distribution of both. Note that there is already a linalg.pad_tensor operation which can be tiled and fused (even though the operation is not a Structured operation). Here it is used for illustration purposes, and the current implementation can be adapted to use this interface.

Disclaimer: This is very much an experimental interface, and is expected to evolve over time. This is being upstreamed so that different teams working on similar things can collaborate more effectively.

Intererface methods

  1. SmallVector<StringRef> getLoopIteratorTypes();

This method allows the operation to specify the number of (mutually independent) loops that are required to implement the operation, and the dependence of each loop. This is similar to the iterator_types of Structured operations.

  1. SmallVector<Range> getLoopBounds(OpBuilder &b)

This method returns the SSA values the [lower_bound, upper_bound, step] of each of the loops of the operation. The size of this list must be same as the size of the list returned by getLoopIteratorTypes()

LogicalResult getTiledImplementation(OpBuilder &b,
    ValueRange dest, ValueRange offset, ValueRange tileSizes,
    SmallVectorImpl<Value> &tiledResults,
    SmallVectorImpl<SmallVector<Value>> &resultOffsets,
    SmallVectorImpl<SmallVector<Value>> &resultSizes);

The algorithm used to the tile the operations is expected to use the first two methods to generate the inter-tile loops. Note for the initial implementation, a simplifying assumption is made that only the "parallel" loops are tiled and distributed. For the body of the tiled operation, the above method will be invoked.

Here

  • dest is the Values into which the result of the tiled operation needs to be computed into. When the operation has tensor semantics, these are the values into which the results of the tiled operations are inserted into. When the operation has buffer semantics, these are the values into which the results of the tiled operations are written into using in-place update.

  • offsets are the Values that specifies the offsets along each dimension of the iteration space that locate the tile.

  • tileSizes are the Values that specifies the sizes of along each dimension of the iteration space that define the tile.

  • tiledResults, when the tiling is performed on an operation with tensor semantics, returns the Value that represents the tile of the result that is computed by the tiled operation. This can be left empty when tiling is performed on an operation with buffer semantics.

  • resultOffsets, when the tiling is performed on an operation with buffer semantics, returns the offsets for each result that locate the tile of the result produced by the tiled operation within the final result tensor. This can be left empty when tiling is performed on an operation with buffer semantics since the update happens in-place

  • resultSizes, when the tiling is performed on an operation with buffer semantics, returns the sizes for each result specifies the size of the tile of the result produced by the tiled operation. This can be left empty when tiling is performed on an operation with buffer semantics since the update happens in-place

Using these interface methods, below are example IR of scatter, gather and pad operations after tiling

Scatter Operation

The following is a sample scatter operation (this operation does not exist in MLIR and is only for illustrative purposes).

func @scatter(
    %updates: tensor<?xi32>, %indices: tensor<?xindex>, %dest: tensor<?xi32>) 
    -> tensor<?xi32> {
  %result = scatter
      ins(%updates, %indices : tensor<?xi32>, tensor<?xindex>)
      outs(%dest : tensor<?xi32>) -> tensor<?xi32>
  return %result : tensor<?xi32>
}

Values from %update are written into %dest at locations specified by %indices. %update and %indices have the same shape.

The loop iterators types for this op is {"parallel"}
The loop bounds of this op is {[0, tensor.dim(%indices, 0), 1]}
The tiled implementation generated would be

func @scatter_update_slice(
    %updates: tensor<?xi32>, %indices: tensor<?xindex>, %dest: tensor<?xi32>)    
    -> tensor<?xi32> {
  %c0 = constant 0 : index
  %ub = tensor.dim %indices, %c0 : tensor<?x?xi32>
  %ds = tensor.dim %dest, %c0 : tensor<?x?xi32>
  %ts = “get_tilesize_op”(%c0) : (index) -> index

  %result = linalg.tiled_loop (%iv) = (%c0) to (%ub) step (%ts)
      ins(%arg0 = %updates, %arg1 = %indices : tensor<?xi32>, tensor<?xindex>)
      outs(%arg2 = %dest : tensor<?xi32>) -> tensor<?xi32> {
    %update_tile = tensor.extract_slice %arg0[%iv] [%ts] [1] :
        tensor<?xi32> into tensor<?xi32>
    %indices_tile = tensor.extract_slice %arg1[%iv] [%ts] [1] :
        tensor<?xi32> into tensor<?xi32>
   
    %scatter_tile = scatter
        ins(%update_tile, %indices_tile : tensor<?xi32>, tensor<?xindex>)
        outs(%arg2 : tensor<?xi32>) -> tensor<?xi32>
   %yield = tensor.insert_slice %scatter_tile into %arg2[0] [%ds] [1] : tensor<?xi32> into tensor<?xi32>
    linalg.yield %yield : tensor<?xf32>
  }

  return %result : tensor<?xi32>
}

The invocation of getTiledImplementation for this op would

  • specify offsets to be {%iv}
  • specify tileSizes to be {%ts}

On return from this call, the interface implementation would populate

  • tiledResults with {%scatter_tile}
  • resultOffsets with {{%c0}}
  • resultSizes with {{%ds}}

Note that the size of the result tile produced is same size as the size of %dest tensor of the untiled op. This is assuming that the semantics of the scatter operation disallows multiple updates to the same location using %indices. This is implied by the iterator types being specified as {"parallel"}.

Gather operation

The inverse of the scatter operation is a gather operation.

func @gather(%input: tensor<?xf32>, %indices: tensor<?xindex>) -> tensor<?xf32> {
  %c0 = constant 0 : index
  %d0 = tensor.dim %indices, %c0 : tensor<?xi32>
  %init = linalg.init_tensor [%d0] : tensor<?xf32>
  %result = gather
      ins(%input, %indices : tensor<?xf32>, tensor<?xindex>)
      outs(%init : tensor<?xf32>) -> tensor<?xf32>
  return %result : tensor<?xf32>
}

The loop iterators types for this op is {"parallel"}
The loop bounds of this op is {[0, tensor.dim(%indices, 0), 1]}
The tiled implementation generated would be

func @gather(%input: tensor<?xf32>, %indices: tensor<?xindex>) -> tensor<?xf32> {
  %c0 = constant 0 : index
  %d0 = tensor.dim %indices, %c0 : tensor<?xindex>
  %ts = “get_tilesize_op”(%c0) : (index) -> index
  %init = linalg.init_tensor [%d0] : tensor<?xf32>
  %result = linalg.tiled_loop (%iv) = (%c0) to (%d0) step (%ts)
      ins(%arg0 = %input, %arg1 = %indices : tensor<?xf32>, tensor<?xindex>)
      outs(%arg2 = %init : tensor<?xf32>) -> tensor<?xf32> {
    %init = linalg.init_tensor [%ts] : tensor<?xf32>
    %indices_tile = tensor.extract_slice %arg1[%iv] [%ts] [1] :
        tensor<?xi32> into tensor<?xf32>
    %result_tile = gather
        ins(%arg0, %indices_tile : tensor<?xf32>, tensor<?xindex>)
        outs(%init : tensor<?xf32>) -> tensor<?xf32>
    %result = tensor.insert_slice %result_tile into %arg2[%iv] [%ts] [1] :
        tensor<?xf32> into tensor<?xf32>
    linalg.yield %result : tensor<?xf32>
  }
  return %result : tensor<?xf32>
}

The invocation of getTiledImplementation for this op would

  • specify offsets to be {%iv}
  • specify tileSizes to be {%ts}

On return from this call, the interface implementation would populate

  • tiledResults with {%gather_tile}
  • resultOffsets with {{%iv}}
  • resultSizes with {{%ts}}

Note that converse to the scatter, the gather access all of the %input for every tile of the operation. The is not really overly arduous. The eventual generated code would just read from the location of %input needed to compute the result within each tile, i.e. there is no actual copy or redundant reads of the %input. The entire slice is extract for correct representation of the computation.

Pad operation

func @pad(
    %input: tensor<?x?xf32>, %low0: index, %low1 : index, %high0 : index,
    %high1 : index, %pad_value : f32) -> tensor<?x?xf32> {
  // Has two loops that are [“parallel”, “parallel”]
  // Loop bounds are [0, memref.dim(%result, 0)), [0, memref.dim(%result, 1))
  %result = pad %input, %pad_value,
      low [%low0, %low1], high [%high0, %high1] :
      tensor<?x?xf32> to tensor<?x?xf32>
  return %result : tensor<?x?xf32>
}

The loop iterators types for this op is {"parallel", "parallel"}
The loop bounds of this op is {[0, tensor.dim(%result, 0), 1], [0, tensor.dim(%result, 1), 1]} . Note that this could be re-rewritten in terms of shape of %input
The tiled implementation generated would be

func @pad(
    %input: tensor<?x?xf32>, %low0: index, %low1 : index, %high0 : index,
    %high1 : index, %pad_value : f32) -> tensor<?x?xf32> {
  %d0 = … : index // evaluates to memref.dim(%result, 0)
  %d1 = … : index // evaluates to memref.dim(%result, 1)
  %in_d0 = memref.dim %input, %c0 : index
  %in_d1 = memref.dim %input, %c1 : index
  %init = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
  %result = linalg.tiled_loop (%iv0, %iv1) =
      (%c0, %c0) to (%d0, %d1) step (%ts1, %ts2)
      ins(%arg0 = %input : tensor<?x?xf32>)
      outs(%arg1 = %init : tensor<?x?xf32>) {
    %init_tile = linalg.init_tensor[%ts1, %ts2] : tensor<?x?xf32>

    // The next few lines represent the "scalar" implementation of pad operation. You can also use a 
    // represent this computation using "slices" as done here :
    // https://github.com/llvm/llvm-project/blob/main/mlir/test/Dialect/Linalg/tile-pad-tensor-op.mlir
    %result_tile = scf.for %iv2 = %c0 to %ts1 step %c1 init(%arg2 = %init_tile) {
      %y0 = scf.for %iv3 = %c0 to %jt step %c1 init(%arg4 = %arg2) {
        %cond = call @is_in_bounds(...)
        %4 = scf.if %cond {
          %id0 = subi %iv2, %low0 : index
          %id1 = subi %iv3, %low1 : index
          %5 = tensor.extract %arg0[%id0, %id1] : tensor<?x?xf32>
          scf.yield %5 : f32
       } else {
          scf.yield %pad_value : f32
        } -> f32
        %5 = tensor.insert %4 into %arg4[%iv2, %iv3] : tensor<?x?xf32>
        scf.yield %5 : tensor<?x?xf32>
      }
      scf.yield %y0 : tensor<?x?xf32>
    }
    %result =
        subtensor_insert %result_tile into %arg1[%iv0, %iv1] [%ts1, %ts2] [1, 1] :
        tensor<?x?xf32> into tensor<?x?xf32>
    linalg.yield %result : tensor<?x?xf32>
  }
  return %result : tensor<?x?xf32>
}

Future enhancements

(a.k.a. things not part of the initial RFC)

  1. While fusion of such operations in a manner similar to tile + fuse of Structured ops is a prime motivation factor, the initial RFC does not cover these aspects. Some rough ideas on this

    • The tiled implementations are expected to generated tensor.extract_slices operations to get the slice of the operands needed for the tiled operations. These give the offsets and sizes needed to further tile the produces “in-place” using the same interface above, similar to how tile + fuse works on Structured operations.
    • For examples like gather above, the tensor.extract_slice is the entire input for every tile. If fusing through that using that directly is wasteful. Instead you would need some pre-preocessing to compute the “locations” of the %input needed to compute the tile of the gather. This is now approaching inspector-executor techniques and other techniques from sparse computations that are used for such purposes. Those could be leveraged here without any fundamental changes to the interface. (HT @herhut for discussions on this)
  2. The current RFC does not deal with tiling and distribution of iterator types that are not "parallel". There is another RFC that is addressing a similar shortcoming of the tiling + distribution of Structured operations. Extensions to this interface should be able to achieve the same for operations that are not structured operations.

1 Like

Thank you for writing this, Mahesh.

I am a bit confused by the gather example. There is a 1D init:

%init = linalg.init_tensor [%d0] : tensor<?xf32>

a block arg outs(%arg2 = %init : tensor<?xf32>) and also a 2D init:

%init = linalg.init_tensor [%ts, %d1] : tensor<?x?xf32>

Typo. Should be a 1D init_tensor as well. Fixed.

Cool stuff.

For the scatter operation, I find the use of tensor.insert_slice misleading, as it is not inserting a slice at all. This instead is a full copy here. Which also means, that different tiles cannot be computed in parallel anymore, as otherwise the copies would race.

An alternative here would be to, instead of returning resultOffsets and resultSizes, have the operation produce the writeback itself when tiling and always returning the full tensor result when using tensor semantics. For scatter this would just mean returning its result, as the scatter op, even when tiled, consumes and produces the full tensor. This would hide the merging behavior inside of the scatter ops semantics and thereby also enable scatter with conflict.

I think the lowering to buffer world then is also more obvious. The only requirement would be that the result of the operation has to have the same size as its output tile.

Some nits:

You confuse tensor and buffer semantics in the definition of resultOffsets and resultSizes.

In the example, you then use tensor.dim(%updates), which is inconsistent.

Couple of points on this

  1. That is an implementation detail of the “scatter” op tiling. So its orthogonal to the interface itself. There is no scatter op in MLIR core, and this RFC is not intending to add one
  2. Yes, it is true in tensor world that this would strictly mean that each tile is producing the whole tensor and therefore there is a dependence between the different iterations. But it depends on the semantics of the op. IREE has a scatter op here. This is based on the tf.scatter_nd. When bufferized the tiled op takes the whole buffer as an operand. When you lower to scalar code, based on the semantics of the operation you can update in-place. So strictly speaking, its an implementation detail of this op, where it is specifying that the loop is parallel and that each tile is “returning” the whole tensor. It is “racy”, but that is OK in this case based on the op semantics. The interface itself is not making a decision here. All of this is an implementation detail of the op itself. We can chat about this offline if this is too cryptic a description.

I considered this. Going back and forth on this. This might be the way to go eventually. I am hoping that the same interface would be usable for fusion as well (TBD, havent implemented this). Right now the getTiledImplementation returns all the information needed for the “caller” to do the tensor.insert_slice. This has the small benefit of not having to do this insert within each operation. Forcing each operation to do this insert seems like it might mixing implementation details with the description expected from the op of how it tiles itself. Right now I dont see a clear choice between the two. If I get an indication that returning the whole tensor is better, then I’ll update the interface.

Hmm, not sure I follow. The description says that resultOffsets and resultSizes can be left empty if the operation has memref operands. These values are needed only for the tiling algorithm to do the tensor.insert_slice. As I mentioned above, I dont see a need to make each op return the whole tensor, i.e. do the tensor.insert_slice itself. If I revisit that, then this would change to not return these values (by reference) at all.

Fixed . Thanks

The patch ⚙ D106406 [mlir] Add an interface to allow operations to specify how they can be tiled. implements this interface as well as the tiling algorithm to use this interface and tile + distribute ops that implement them.