Register promotion in linalg.generic lowering

Consider writing the tensor-times-vector kernel (C(i, j) = A(i, j, k) * B(k)) in MLIR:

#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1, d2) -> (d2)>
func.func @func() {
  %A = memref.alloc() : memref<100x100x100xf64>
  %B = memref.alloc() : memref<100xf64>
  %C = memref.alloc() : memref<100x100xf64>
  linalg.generic {indexing_maps = [#map1, #map3, #map2],
                  iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%A, %B : memref<100x100x100xf64>, memref<100xf64>)
      outs(%C : memref<100x100xf64>) {
  ^bb0(%in: f64, %in_0: f64, %out: f64):
    %0 = arith.addf %in, %in_0 : f64
    linalg.yield %0 : f64
  }
  return
}

Running this through the convert-linalg-to-affine-loops pass yields:

module {
  func.func @func() {
    %alloc = memref.alloc() : memref<100x100x100xf64>
    %alloc_0 = memref.alloc() : memref<100xf64>
    %alloc_1 = memref.alloc() : memref<100x100xf64>
    affine.for %arg0 = 0 to 100 {
      affine.for %arg1 = 0 to 100 {
        affine.for %arg2 = 0 to 100 {
          %0 = affine.load %alloc[%arg0, %arg1, %arg2] : memref<100x100x100xf64>
          %1 = affine.load %alloc_0[%arg2] : memref<100xf64>
          %2 = arith.addf %0, %1 : f64
          affine.store %2, %alloc_1[%arg0, %arg1] : memref<100x100xf64>
        }
      }
    }
    return
  }
}

A common optimization here would be to introduce a temporary for the value of C(i, j) inside the k loop, accumulate into that temporary instead of storing to memory on every iteration, and then store into C(i, j) once the k loop is done. While an optimization like this may be possible in the affine dialect (I didn’t find anything like it), it seems like most of the information needed to perform it is present right here in the linalg.generic operation, especially given that the iterator type for the k loop has been marked as “reduction”. I spent some time searching the code and the documentation for something like this, so any pointers would be helpful!
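
Concretely, a hand-written sketch of what I would like the lowering to produce (assuming the body actually computes the multiply-add, and using %A, %B, %C as shorthand for the allocs above) carries the partial sum through the k loop and stores it once at the end:

affine.for %i = 0 to 100 {
  affine.for %j = 0 to 100 {
    // Load C(i, j) once, before the reduction loop.
    %init = affine.load %C[%i, %j] : memref<100x100xf64>
    // Accumulate in a loop-carried value instead of storing on every iteration.
    %sum = affine.for %k = 0 to 100 iter_args(%acc = %init) -> (f64) {
      %a = affine.load %A[%i, %j, %k] : memref<100x100x100xf64>
      %b = affine.load %B[%k] : memref<100xf64>
      %p = arith.mulf %a, %b : f64
      %next = arith.addf %acc, %p : f64
      affine.yield %next : f64
    }
    // Single store into C(i, j) after the k loop completes.
    affine.store %sum, %C[%i, %j] : memref<100x100xf64>
  }
}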

I am wondering how the * (multiply) was turned into a + (arith.addf).
And as expressed (= rather than +=), C(i, j) is overwritten on each iteration of the k loop.

The sparse compiler performs this optimization as part of the “sparsification” while lowering to loops (search for “scalarized reduction”). You can see this in action, for example, if you mark A as an all-dense “sparse” tensor in the following code (note that your example seems to be missing the multiply-add part):

#Dense = #sparse_tensor.encoding<{
  dimLevelType = [ "dense", "dense", "dense" ]
}>

#trait_kernel_3d = {
  indexing_maps = [
    affine_map<(i,j,k) -> (i,j,k)>,// B
    affine_map<(i,j,k) -> (k)>,    // C
    affine_map<(i,j,k) -> (i,j)>   // A (out)
  ],
  iterator_types = ["parallel", "parallel", "reduction"],
  doc = "A(i,j) += SUM_k B(i,j,k) * C(k)"
}

func.func @kernel_3d(%arga: tensor<?x?xf32>,
                     %argb: tensor<?x?x?xf32, #Dense>,
                     %argc: tensor<?xf32>) -> tensor<?x?xf32> {
  %0 = linalg.generic #trait_kernel_3d
       ins(%argb, %argc: tensor<?x?x?xf32, #Dense>, tensor<?xf32>)
      outs(%arga: tensor<?x?xf32>) {
    ^bb(%b: f32, %c: f32, %a: f32):
      %0 = arith.mulf %b, %c : f32
      %1 = arith.addf %a, %0 : f32
      linalg.yield %1 : f32
  } -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}

This yields the following (with A linearized, which is just a side effect of the all-dense “sparsification”), where the reduction is scalarized (load before the inner loop, store after):

    scf.for %arg3 = %c0 to %dim step %c1 { 
      scf.for %arg4 = %c0 to %dim_0 step %c1 {
        %4 = arith.muli %dim_0, %arg3 : index
        %5 = arith.addi %4, %arg4 : index
        %6 = memref.load %2[%arg3, %arg4] : memref<?x?xf32>
        %7 = scf.for %arg5 = %c0 to %dim_1 step %c1 iter_args(%arg6 = %6) -> (f32) {
          %8 = arith.muli %dim_1, %5 : index
          %9 = arith.addi %8, %arg5 : index
          %10 = memref.load %0[%9] : memref<?xf32>
          %11 = memref.load %1[%arg5] : memref<?xf32>
          %12 = arith.mulf %10, %11 : f32
          %13 = arith.addf %arg6, %12 : f32
          scf.yield %13 : f32
        } {"Emitted from" = "linalg.generic"}
        memref.store %7, %2[%arg3, %arg4] : memref<?x?xf32>
      } {"Emitted from" = "linalg.generic"}
    } {"Emitted from" = "linalg.generic"}

Whoops, a typo on my part. I edited it into a mulf, but that didn’t change the behavior.

An abuse of notation on my part; I assume the += occurs as part of the einsum. Is there a different way to express this in MLIR? I adjusted the generic block to do what Aart did above, adding the addf of the output as well, but that didn’t change anything either.
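
For reference, the adjusted region body looked roughly like this (my rewrite of the block from the first post):

^bb0(%in: f64, %in_0: f64, %out: f64):
  // C(i, j) += A(i, j, k) * B(k)
  %0 = arith.mulf %in, %in_0 : f64
  %1 = arith.addf %out, %0 : f64
  linalg.yield %1 : f64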

Thanks Aart! I was asking about this because I know that TACO does it. Is this just a limitation of the dense linear algebra compiler right now?

Perhaps, but please keep in mind that convert-linalg-to-affine-loops should probably only be used to get a direct loop formulation of a linalg expression. I am pretty sure the dense algebra compiler has much better (progressive) lowering paths (and, in addition, the LLVM backend may optimize the IR further).

> I am pretty sure the dense algebra compiler has much better (progressive) lowering paths

What is the dense algebra compiler – is it a series of passes? Where is the documentation of it?

Ah, I only used that term because you used it first :wink:

There is not really one dense algebra compiler (nor one sparse compiler for that matter). MLIR is really a compiler infrastructure, and it provides all the building blocks to create such compilers. You will have to look at successful projects, like IREE, which build on MLIR’s foundation.

I see. I had left the linalg documentation page with the impression that mainline MLIR contained infrastructure for generating high-performance linear algebra kernels, in the same way that the sparse_tensor dialect provides a path from sparse linalg.generic operations to well-performing imperative code.

I’ll look into IREE and related projects.