An update on linalg on tensors

Hi everyone,

Now that linalg on tensors has gained support for representing reductions and a fresh syntax (see https://reviews.llvm.org/D87767 and https://reviews.llvm.org/D87938), we are ready to push a little further in that direction.

At a high level, the work will include developments in the following directions.

Named Ops and Spec Lang

Scaling up named ops and extending the spec-language infrastructure towards more advanced specifications (e.g. template types, rank-agnostic specs, attributes, or others, on a per-need basis).
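
For reference, here is a rough sketch of what a named-op specification looks like in the current TC-like spec language (matmul shown; exact syntax is evolving and this is meant as illustrative only):

ods_def<MatmulOp>:
def matmul(A: f32(M, K), B: f32(K, N)) -> (C: f32(M, N)) {
  C(m, n) = std_addf<k>(std_mulf(A(m, k), B(k, n)));
}

The extensions mentioned above (template types, rank-agnostic specs, attributes) would generalize this format.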

Transformations on Linalg on Tensors

Implementing transformations that have traditionally relied on memory and alias analysis so that they work directly on tensors (e.g. tiling, fusion, distribution, vectorization).

In this context, we’ll be experimenting with treating memory as a mere scratch space for optimization. The basic idea is that if one does not explicitly go through memory, LLVM will insert spills; optimizing for memory then consists of hoisting those LLVM-introduced fine-grained spills into coarser-grained, MLIR-controlled allocs + loads/stores.

A nice by-product is that use-def chains will be preserved, which is expected to help with transformations (e.g. store/load forwarding, fusion, and vectorization, among others) and make them future-proof against type changes (e.g. sparse tensors).

For this, new ops will likely be needed: note the subtensor and subtensor_insert ops in a prototypical tiling that I have locally:

mlir-opt foo.mlir -linalg-tile="linalg-tile-sizes=2,3,4" -cse 

func @matmul(%arg0: tensor<16x1024xf32>, %arg1: tensor<1024x33xf32>, %arg2: tensor<16x33xf32>)
    -> tensor<16x33xf32> {
  %0 = linalg.matmul  ins(%arg0, %arg1: tensor<16x1024xf32>, tensor<1024x33xf32>)
                     init(%arg2: tensor<16x33xf32>)
    -> tensor<16x33xf32>
  return %0 : tensor<16x33xf32>
}

Produces:

 func @matmul(%arg0: tensor<16x1024xf32>, %arg1: tensor<1024x33xf32>, %arg2: tensor<16x33xf32>) -> tensor<16x33xf32> {
    %c0 = constant 0 : index
    %c4 = constant 4 : index
    %c1024 = constant 1024 : index
    %c16 = constant 16 : index
    %c33 = constant 33 : index
    %c1 = constant 1 : index
    %c2 = constant 2 : index
    %c3 = constant 3 : index
    %0 = scf.for %arg3 = %c0 to %c16 step %c2 iter_args(%arg4 = %arg2) -> (tensor<16x33xf32>) {
      %1 = scf.for %arg5 = %c0 to %c33 step %c3 iter_args(%arg6 = %arg4) -> (tensor<16x33xf32>) {
        %2 = scf.for %arg7 = %c0 to %c1024 step %c4 iter_args(%arg8 = %arg6) -> (tensor<16x33xf32>) {
          %3 = subtensor %arg0[%arg3, %arg7] [2, 4] [1, 1]  : tensor<16x1024xf32> to tensor<2x4xf32>
          %6 = subtensor %arg1[%arg7, %arg5] [4, 3] [1, 1]  : tensor<1024x33xf32> to tensor<4x3xf32>
          %9 = subtensor %arg8[%arg3, %arg5] [2, 3] [1, 1]  : tensor<16x33xf32> to tensor<2x3xf32>
          %12 = linalg.matmul ins(%3, %6 : tensor<2x4xf32>, tensor<4x3xf32>) init(%9 : tensor<2x3xf32>)  -> tensor<2x3xf32>
          %13 = subtensor_insert %12 into %arg8[%arg3, %arg5] [%c2, %c3] [%c1, %c1]  : tensor<?x?xf32> into tensor<16x33xf32>
          scf.yield %13 : tensor<16x33xf32>
        }
        scf.yield %2 : tensor<16x33xf32>
      }
      scf.yield %1 : tensor<16x33xf32>
    }
    return %0 : tensor<16x33xf32>
  }

subtensor is similar to subview but operates on tensors.
subtensor_insert is the counterpart of subtensor; it behaves like vector.insert, but on tensors and with semantics compatible with tiling.
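
As a minimal standalone illustration (using hypothetical values %t, %i and %j, with the same size/stride syntax as in the prototype output above):

%slice = subtensor %t[%i, %j] [4, 4] [1, 1] : tensor<8x16xf32> to tensor<4x4xf32>
%updated = subtensor_insert %slice into %t[%i, %j] [4, 4] [1, 1] : tensor<4x4xf32> into tensor<8x16xf32>

Note that, tensors being SSA values, subtensor_insert produces a new whole-tensor value rather than mutating %t.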

Buffer Allocation and Layout Assignment

Once transformations are implemented on Linalg on tensors, buffer allocation and layout assignment can be performed in local scopes. This is expected to alleviate the phase-ordering issues present in e.g. XLA, where:

  1. a grouping of ops is determined (the FusionNode)
  2. buffers and layout are allocated assuming fusion will occur
  3. now in buffer form + loops, fusion must happen or else…

Also, if buffer allocation + layout assignment happen after vectorization, this could give us nice aligned + packed layouts almost for free: TBD.

For all this to happen, buffer allocation and placement need to be extended to understand the reduction semantics of linalg on tensors. It is expected that subtensor will simply lower to subview, subtensor_insert will become a no-op, and each init + output tensor pair will fold into an output buffer.
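
As a rough sketch of this expected lowering (not actual pass output; %A, %B and %C are hypothetical memref function arguments standing in for the bufferized tensors, and the strided layout maps are spelled out only for illustration), the inner loop body of the tiled matmul above would become something like:

#mapA = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>
#mapB = affine_map<(d0, d1)[s0] -> (d0 * 33 + s0 + d1)>

%sA = subview %A[%arg3, %arg7] [2, 4] [1, 1] : memref<16x1024xf32> to memref<2x4xf32, #mapA>
%sB = subview %B[%arg7, %arg5] [4, 3] [1, 1] : memref<1024x33xf32> to memref<4x3xf32, #mapB>
%sC = subview %C[%arg3, %arg5] [2, 3] [1, 1] : memref<16x33xf32> to memref<2x3xf32, #mapB>
linalg.matmul ins(%sA, %sB : memref<2x4xf32, #mapA>, memref<4x3xf32, #mapB>)
             outs(%sC : memref<2x3xf32, #mapB>)

Here the subtensor_insert has vanished: linalg.matmul writes %sC in place, and %sC aliases the output buffer %C.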

A quick local prototype plus a bit of extrapolation suggest this is reasonable (modulo some corner cases regarding tensors returned at function boundaries vs. output buffers).

Lastly, note that since Linalg semantics allow mixing tensors and buffers, we can also envision progressive multi-level tiling + partial buffer allocation for various operands. This is expected to simplify the writing of greedy allocation + layout assignment heuristics.

MHLO -> Linalg on tensors

It is expected that some rewrites that could not be done previously will become much simpler. In particular, IREE’s DotGeneral lowering went directly from MHLO to Linalg on buffers. We should then be able to remove some of the complexity related to having multiple paths:

  1. MHLO -> LMHLO -> Linalg on buffers.
  2. MHLO -> Linalg on tensors for pointwise + region-based fusion.
  3. MHLO -> Linalg on buffers for some things with reductions.

Of course, things that don’t fit in Linalg will remain unchanged.
Once more operations are expressed as Linalg on tensors, certain low-hanging fruit will become easy to pick (e.g. a pathological broadcast + permutation + DotGeneral combination that would otherwise materialize full intermediate buffers can simply be rewritten as a single linalg.generic op).

Parallelism and Distribution

It is still too early to discuss this in detail, but it is expected that remaining in tensor form after transformations will simplify parallel-region extraction and help the interplay with low-latency runtimes such as IREE.

This is really neat. Tying Linalg on tensors to vectors directly would remove the intermediate use of Linalg on buffers that causes complications in code generation. Thanks for pushing on this!

I have yet to look at the patches you sent out, but https://reviews.llvm.org/D88435 introduces tile + fuse on buffers. Hopefully that can be adapted to work on tensors as well; there is nothing specific to buffers in the change itself. Could we coordinate on the order in which these changes land, since they intersect?

Wouldn’t it in general become a linalg.copy, with the written subtensor being the source subview and the tensor view described by the subtensor_insert becoming the subview to write into?

This is really great! There are a bunch of patterns that convert from MHLO to Linalg on buffers in IREE here. These would be prime candidates to be converted to use Linalg on tensors; from what I can see, it should be a fairly straightforward change to adapt them. The only question is how to make sure this works with buffer allocation.
@mehdi_amini, @herhut, could we figure out (with @nicolasvasilache) a place to move these patterns to? I would really like to move them out of IREE. This might be a topic of interest to bring up in an ODM when time permits. (@hanchung, @asaadaldien for visibility.)

Moving them out of IREE would be great. Most of them can be moved out easily, e.g. conv, dot, etc. I remember that the logic for lowering ops with reduction loops is different: in IREE we lower them to a Linalg indexed_generic op or to pooling ops, but in the mlir-hlo repo we lower them to parallel loops. I’m happy to work together on this if we want to put them in another place.

Also pinging @pifon2a, who I believe has already started hacking on some of this towards the MHLO repo?