Hi everyone,

Now that linalg on tensors has gained support for representing reductions and a fresh syntax (see https://reviews.llvm.org/D87767 and https://reviews.llvm.org/D87938), we are ready for pushing a little more in that direction.

At a high-level, the scope of the work will include developments in these directions.

# Named Ops and Spec Lang

Scaling up named ops and extending the infrastructure for lang support towards more advanced specifications (e.g. template types, rank agnostic spec, attributes or others, on a per need basis).

# Transformations on Linalg on Tensors

Implementing transformations that have traditionally been relying on memory and alias analysis to work on tensors (e.g. tiling, fusion, distribution, vectorization).

In this context, we’ll be experimenting with considering memory as a mere scratch space for optimization. The basic idea is that if one does not explicitly go through memory, LLVM will insert spills. Optimizing for memory will then consist of hoisting those LLVM-introduced fine-grained spills to coarser-grained MLIR-controlled alloc + load/stores.

A nice by-product is that use-def chains will be preserved which will is expected to help with transformations (e.g. S/L forwarding, fusion, vectorization among others) and make them future-proof to type changes (e.g. sparse).

For this, it is likely that new ops will be needed: notice the `subtensor`

and `subtensor_insert`

ops in a prototypical tiling that I have locally:

```
mlir-opt foo.mlir -linalg-tile="linalg-tile-sizes=2,3,4" -cse
func @matmul(%arg0: tensor<16x1024xf32>, %arg1: tensor<1024x33xf32>, %arg2: tensor<16x33xf32>)
-> tensor<16x33xf32> {
%0 = linalg.matmul ins(%arg0, %arg1: tensor<16x1024xf32>, tensor<1024x33xf32>)
init(%arg2: tensor<16x33xf32>)
-> tensor<16x33xf32>
return %0 : tensor<16x33xf32>
}
```

Produces:

```
func @matmul(%arg0: tensor<16x1024xf32>, %arg1: tensor<1024x33xf32>, %arg2: tensor<16x33xf32>) -> tensor<16x33xf32> {
%c0 = constant 0 : index
%c4 = constant 4 : index
%c1024 = constant 1024 : index
%c16 = constant 16 : index
%c33 = constant 33 : index
%c1 = constant 1 : index
%c2 = constant 2 : index
%c3 = constant 3 : index
%0 = scf.for %arg3 = %c0 to %c16 step %c2 iter_args(%arg4 = %arg2) -> (tensor<16x33xf32>) {
%1 = scf.for %arg5 = %c0 to %c33 step %c3 iter_args(%arg6 = %arg4) -> (tensor<16x33xf32>) {
%2 = scf.for %arg7 = %c0 to %c1024 step %c4 iter_args(%arg8 = %arg6) -> (tensor<16x33xf32>) {
%3 = subtensor %arg0[%arg3, %arg7] [2, 4] [1, 1] : tensor<16x1024xf32> to tensor<2x4xf32>
%6 = subtensor %arg1[%arg7, %arg5] [4, 3] [1, 1] : tensor<1024x33xf32> to tensor<4x3xf32>
%9 = subtensor %arg8[%arg3, %arg5] [2, 3] [1, 1] : tensor<16x33xf32> to tensor<2x3xf32>
%12 = linalg.matmul ins(%3, %6 : tensor<2x4xf32>, tensor<4x3xf32>) init(%9 : tensor<2x3xf32>) -> tensor<2x3xf32>
%13 = subtensor_insert %12 into %arg8[%arg3, %arg5] [%c2, %c3] [%c1, %c1] : tensor<?x?xf32> into tensor<16x33xf32>
scf.yield %13 : tensor<16x33xf32>
}
scf.yield %2 : tensor<16x33xf32>
}
scf.yield %1 : tensor<16x33xf32>
}
return %0 : tensor<16x33xf32>
}
```

`subtensor`

is similar to `subview`

but operates on tensors.

`subtensor_insert`

is the counterpart of `subtensor`

and behaves like `vector.insert`

but on tensors and with semantics compatible with tiling.

# Buffer Allocation and Layout Assignment

Once transformations are implemented on Linalg on tensors, buffer allocation and layout assignment can be performed in local scopes. This is expected to alleviate the phase ordering that is present in e.g. XLA where:

- a grouping of ops is determined (the FusionNode)
- buffers and layout are allocated assuming fusion will occur
- now in buffer form + loops, fusion must happen or else…

Also, if buffer allocation + layout assignment happen after vectorization, this could give us nice aligned + packed layouts almost for free: TBD.

For all this, to happen buffer allocation and placement needs to be extended to understand the reduction semantics of linalg on tensors. It is expected `subtensor`

will just lower to `subview`

, `subtensor_insert`

will become a noop and each `init`

+ `output`

tensor pair will fold into an `output_buffer`

.

A quick local prototype + a bit of extrapolation show this seems reasonable (modulo some corner cases re. tensors returned at function boundaries vs output buffers).

Lastly, note that since Linalg semantics allow mixing tensors and buffers, we can also envision progressive multi-level tiling + partial buffer allocation for various operands. This is expected to simplify the writing of greedy allocation + layout assignment heuristics.

# MHLO -> Linalg on tensors

It is expected that some of the rewrites that could not be done previously will become much simpler. In particular IREE’s DotGeneral lowering went directly from MHLO to Linalg on buffers. We should then be able to remove some of the complexity related to the multiple paths:

- MHLO -> LMHLO -> Linalg on buffers.
- MHLO -> Linalg on tensors for pointwise + region-based fusion.
- MHLO -> Linalg on buffers for some things with reductions.

Of course, things that don’t fit in Linalg will remain unchanged.

When more operations are expressed on Linalg on tensors, certain low-hanging fruits will become easy to pick (e.g. some more pathological broadcast + permutation + DotGeneral combination that would result in materialization of full intermediate memory can just be rewritten as a single `linalg.generic`

op).

# Parallelism and Distribution

This is still too early to discuss but it is expected that remaining in tensor form after transformations will simplify parallel region extraction and help interplay with low-latency runtimes such as IREE.