## Motivation

Linalg operations on tensors need support for tiling and the related transformations, such as tile-and-fuse and tile-and-distribute.

Tiling for a Linalg operation on buffers results in an `scf.for` or `scf.parallel` op with a region that contains a number of `subview` operations to transform the original input and output arguments of the op.

Tiling for a Linalg operation on tensors results in a number of `subtensor` ops for the inputs and `subtensor_insert` ops for the outputs of the op, i.e. the elements of the output tensor are populated via destructive updates. Neither `scf.for` nor `scf.parallel` is suitable for containing `subtensor_insert` in its body.

The goal is to design a new op that has semantics similar to `scf.parallel` but also yields subtensors for the specified outputs, which makes `subtensor_insert` implicit.

## The new op

The `linalg.tile` operation represents a loop nest taking the usual lower bounds, upper bounds and steps arguments and the input/output tensor arguments like in `linalg.generic`. It has one region capturing the loop body. The body region must contain exactly one block that terminates with `linalg.subtensor_yield` with the arguments matched to the output tensors.

```
linalg.tile (%i0, %i1) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1)
    ins (%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
    outs (%C, %D : tensor<?x?xf32>, tensor<?x?xf32>) {
  ...
  %C_subtensor =
  %D_subtensor =
  ...
  linalg.subtensor_yield %C_subtensor into [offsets][sizes][strides],
                         %D_subtensor into [offsets][sizes][strides]
    : type(%C_subtensor), type(%D_subtensor)
}
```
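To make the intended semantics concrete, here is a minimal executable sketch in Python. It is only a model under stated assumptions, not the op's definition: `tile_loop` and `body` are hypothetical names, tensors are plain lists of lists, strides are assumed to be 1, the sizes are taken from the yielded tile itself, and tiles are assumed to stay in bounds. The write-back inside `tile_loop` stands in for the implicit `subtensor_insert`.

```python
from itertools import product


def tile_loop(lbs, ubs, steps, outs, body):
    """Hypothetical model of the loop nest over 2-D outputs (lists of lists).

    `body(ivs)` returns one `(tile, (row_offset, col_offset))` pair per output.
    Sizes are taken from the tile itself and strides are assumed to be 1.
    """
    # Tensors are values, so the results start as copies of the outputs.
    results = [[row[:] for row in out] for out in outs]
    ranges = [range(lb, ub, step) for lb, ub, step in zip(lbs, ubs, steps)]
    for ivs in product(*ranges):
        for result, (tile, (off0, off1)) in zip(results, body(ivs)):
            # Stand-in for the implicit `subtensor_insert` of the yielded tile.
            for r, tile_row in enumerate(tile):
                result[off0 + r][off1:off1 + len(tile_row)] = tile_row
    return results


def body(ivs):
    # Each 2x2 tile is filled with a value derived from its offsets.
    i0, i1 = ivs
    return [([[10 * i0 + i1] * 2 for _ in range(2)], (i0, i1))]


init = [[0] * 4 for _ in range(4)]
(out,) = tile_loop((0, 0), (4, 4), (2, 2), [init], body)
```

In this model the body never writes into the output itself. It only returns a tile and the place where the tile belongs, which is the property that lets the loop own the destructive update.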

## Tile-and-fuse example

Let’s consider a tiling example for a matmul:

```
func @matmul(%lhs: tensor<24x64xi8>, %rhs: tensor<64x192xi8>,
             %uninit_out: tensor<24x192xi32>) -> tensor<24x192xi32> {
  %c0 = constant 0 : i32
  %out = linalg.fill(%uninit_out, %c0)
    : tensor<24x192xi32>, i32 -> tensor<24x192xi32>
  %prod = linalg.matmul_i8_i8_i32
    ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>)
    outs(%out : tensor<24x192xi32>) -> tensor<24x192xi32>
  return %prod : tensor<24x192xi32>
}
```

Tile the first dimension by 4:

```
func @matmul(%lhs: tensor<24x64xi8>, %rhs: tensor<64x192xi8>,
             %uninit_out: tensor<24x192xi32>) -> tensor<24x192xi32> {
  %c0_i32 = constant 0 : i32
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c4 = constant 4 : index
  // The fill comes first so that %out is defined before its dims are queried.
  %out = linalg.fill(%uninit_out, %c0_i32)
    : tensor<24x192xi32>, i32 -> tensor<24x192xi32>
  %lhs_d0 = dim %lhs, %c0 : tensor<24x64xi8>
  %lhs_d1 = dim %lhs, %c1 : tensor<24x64xi8>
  %rhs_d0 = dim %rhs, %c0 : tensor<64x192xi8>
  %rhs_d1 = dim %rhs, %c1 : tensor<64x192xi8>
  %out_d0 = dim %out, %c0 : tensor<24x192xi32>
  %out_d1 = dim %out, %c1 : tensor<24x192xi32>
  %prod = linalg.tile (%i) = (%c0) to (%lhs_d0) step (%c4)
      ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>)
      outs(%out : tensor<24x192xi32>) {
    %lhs_d0_size = affine.min affine_map<(d0)[s0] -> (4, -d0 + s0)>(%i)[%lhs_d0]
    %lhs_sub = subtensor %lhs[%i, 0] [%lhs_d0_size, %lhs_d1] [1, 1]
      : tensor<24x64xi8> to tensor<?x?xi8>
    %out_d0_size = affine.min affine_map<(d0, d1) -> (4, d0 - d1)>(%out_d0, %i)
    %out_sub = subtensor %out[%i, 0] [%out_d0_size, %out_d1] [1, 1]
      : tensor<24x192xi32> to tensor<?x?xi32>
    %prod_sub = linalg.matmul_i8_i8_i32
      ins(%lhs_sub, %rhs : tensor<?x?xi8>, tensor<64x192xi8>)
      outs(%out_sub : tensor<?x?xi32>) -> tensor<?x?xi32>
    linalg.subtensor_yield
      %prod_sub into [%i, 0][%out_d0_size, %out_d1][1, 1] : tensor<?x?xi32>
  }
  return %prod : tensor<24x192xi32>
}
```
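The tiled loop above can be mirrored by a small Python sketch, useful for checking that tiling the first dimension preserves the result. This is an illustration under assumptions, not the lowering itself: `matmul` and `tiled_matmul` are hypothetical helpers, tensors are lists of lists, and the example shapes are deliberately small, with a row count that is not a multiple of the tile size so that the `min` corresponding to `affine.min` is exercised.

```python
def matmul(lhs, rhs, out):
    """Reference: returns out + lhs @ rhs as a new list of lists."""
    inner, cols = len(rhs), len(rhs[0])
    return [
        [out[i][j] + sum(lhs[i][k] * rhs[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(len(lhs))
    ]


def tiled_matmul(lhs, rhs, out, tile=4):
    """Tiles the first dimension by `tile` rows, like the loop over %i."""
    rows = len(lhs)
    result = [row[:] for row in out]
    for i in range(0, rows, tile):
        size = min(tile, rows - i)   # affine.min: (4, -d0 + s0)
        lhs_sub = lhs[i:i + size]    # subtensor of %lhs at [%i, 0]
        out_sub = out[i:i + size]    # subtensor of %out at [%i, 0]
        prod_sub = matmul(lhs_sub, rhs, out_sub)
        result[i:i + size] = prod_sub  # implicit insert of the yielded tile
    return result


lhs = [[r + k for k in range(3)] for r in range(6)]
rhs = [[2 * k + c for c in range(2)] for k in range(3)]
out = [[0] * 2 for _ in range(6)]
tiled = tiled_matmul(lhs, rhs, out)
```

With six rows and a tile size of four, the second iteration processes a partial tile of two rows, which is the case the `affine.min` expressions exist to handle.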

After the consumer operation has been tiled, we can fuse it with the `linalg.fill`:

```
func @matmul(%lhs: tensor<24x64xi8>, %rhs: tensor<64x192xi8>,
             %uninit_out: tensor<24x192xi32>) -> tensor<24x192xi32> {
  %c0_i32 = constant 0 : i32
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c4 = constant 4 : index
  %lhs_d0 = dim %lhs, %c0 : tensor<24x64xi8>
  %lhs_d1 = dim %lhs, %c1 : tensor<24x64xi8>
  %rhs_d0 = dim %rhs, %c0 : tensor<64x192xi8>
  %rhs_d1 = dim %rhs, %c1 : tensor<64x192xi8>
  // The full-size fill is gone, so the dims are taken from %uninit_out.
  %out_d0 = dim %uninit_out, %c0 : tensor<24x192xi32>
  %out_d1 = dim %uninit_out, %c1 : tensor<24x192xi32>
  %prod = linalg.tile (%i) = (%c0) to (%lhs_d0) step (%c4)
      ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>)
      outs(%uninit_out : tensor<24x192xi32>) {
    %lhs_d0_size = affine.min affine_map<(d0)[s0] -> (4, -d0 + s0)>(%i)[%lhs_d0]
    %lhs_sub = subtensor %lhs[%i, 0] [%lhs_d0_size, %lhs_d1] [1, 1]
      : tensor<24x64xi8> to tensor<?x?xi8>
    %out_d0_size = affine.min affine_map<(d0, d1) -> (4, d0 - d1)>(%out_d0, %i)
    %uninit_out_sub = subtensor
      %uninit_out[%i, 0][%out_d0_size, %out_d1][1, 1]
      : tensor<24x192xi32> to tensor<?x?xi32>
    // The fill now operates on the tile only.
    %out_sub = linalg.fill(%uninit_out_sub, %c0_i32)
      : tensor<?x?xi32>, i32 -> tensor<?x?xi32>
    %prod_sub = linalg.matmul_i8_i8_i32
      ins(%lhs_sub, %rhs : tensor<?x?xi8>, tensor<64x192xi8>)
      outs(%out_sub : tensor<?x?xi32>) -> tensor<?x?xi32>
    linalg.subtensor_yield
      %prod_sub into [%i, 0][%out_d0_size, %out_d1][1, 1] : tensor<?x?xi32>
  }
  return %prod : tensor<24x192xi32>
}
```
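As a rough check of the fused form, the following Python sketch fills each output tile inside the loop instead of filling the whole tensor up front. It is a model under assumptions rather than the actual transformation: `fused_fill_matmul` is a hypothetical helper, tensors are lists of lists, and the uninitialized output is simulated with an arbitrary garbage value (99) to show that none of it survives into the result.

```python
def fused_fill_matmul(lhs, rhs, uninit_out, fill_value=0, tile=4):
    """Fill and matmul fused into one loop over row tiles of the output."""
    rows, inner, cols = len(lhs), len(rhs), len(rhs[0])
    result = [row[:] for row in uninit_out]
    for i in range(0, rows, tile):
        size = min(tile, rows - i)  # affine.min: (4, d0 - d1)
        # linalg.fill applied to the output tile only.
        out_sub = [[fill_value] * cols for _ in range(size)]
        # Matmul on the tile, accumulating into the freshly filled tile.
        prod_sub = [
            [out_sub[r][j] + sum(lhs[i + r][k] * rhs[k][j] for k in range(inner))
             for j in range(cols)]
            for r in range(size)
        ]
        result[i:i + size] = prod_sub  # implicit insert of the yielded tile
    return result


lhs = [[r + k for k in range(3)] for r in range(6)]
rhs = [[2 * k + c for c in range(2)] for k in range(3)]
uninit_out = [[99] * 2 for _ in range(6)]  # stand-in for uninitialized data
fused = fused_fill_matmul(lhs, rhs, uninit_out)
```

Every row of the output is covered by exactly one tile, so each element is filled and then accumulated into once, which is why the result matches filling the whole tensor before the loop.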