Which dialect is suitable for describing the tiling operation?

Hi folks,

If anyone could answer the following questions or offer any insight, that would be great. (I am a newbie and not very skilled in MLIR yet.)

For my case, suppose we have an accelerator that executes tensor computations; let’s call it a matrix multiplication unit (MMU). It differs from a CPU in that its basic computation is a tile (vector/block) multiplication, not an elementwise multiplication.

So I would like to use MLIR for the high-level expression and then lower it to hardware code step by step. The question is how to express the data as tiles or vectors at a higher level of abstraction, since the matrices need to be sliced before they can be fed to the matmul unit.

For example, assume that the MMU computes [8,32]x[32,8], and we have two matrices as operands: A[48,128] and B[128,40]. The first step is to slice the two matrices into several [8,32] and [32,8] tiles, respectively, and then feed those tiles to the MMU, as illustrated in the following picture.
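
For concreteness, here is a rough sketch of the decomposition I have in mind (assuming f32 elements and that all dimensions divide evenly; the function name and the tensor-of-vector layout are just one illustrative way to write it down):

    // A[48,128] becomes a 6x4 grid of [8,32] tiles, B[128,40] a 4x5 grid of
    // [32,8] tiles; each [8,8] tile of C[48,40] accumulates 4 MMU tile products.
    func @mmu_matmul(%A: tensor<6x4xvector<8x32xf32>>,
                     %B: tensor<4x5xvector<32x8xf32>>,
                     %C: tensor<6x5xvector<8x8xf32>>) -> tensor<6x5xvector<8x8xf32>> {
      // ... a loop nest over the tile grid, issuing one MMU op per tile product ...
      return %C : tensor<6x5xvector<8x8xf32>>
    }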

I have done some investigation into expressing tile computation in MLIR, but what I found seems to differ from my situation. @bondhugula does a great job on high-performance computation using MLIR and presents a wonderful example, but he targets the CPU and the generated code maps to elementwise multiplication. The vector dialect also seems to be a potential choice, but that is not clear to me.

Overall, the questions are summarized as:

  1. Could the vector dialect abstract the tiling operation that slices a matrix into several sub-matrices?
    ExtractSlicesOp can slice an N-D vector and return a tuple of slice vectors, but I am not clear about the next step of iterating over the slice vectors, since tuple is not a well-supported type in MLIR. To go further, we would have to define the lowering pass ourselves, right?
  2. Any better ideas for solving the above issue, like defining our own dialect?

Any advice is welcome.

Thanks

It is a little unclear to me at this point what your end-to-end flow is, but I see a few options:

  1. The simplest is to directly express your IR at a higher level with e.g. A: tensor<6x4xvector<8x32x...>> or memref. The vector dialect should have enough expressiveness, and then you’d have a special lowering of vector.contract ... 8x32x8 to your HW-specific dialect (or a special LLVM intrinsic, depending on how you want to connect things); see the sketch after this list. This is the more “programmable” way in the IR; it assumes you don’t have to start from a pile of scalar data + code that you need to raise from. This is the way I’d recommend to prototype connecting the pieces and verifying correctness.
  2. If you must start from scalar matrices then it also depends on how much flexibility you have on the input code. If you start from affine loops then you can try to write a matcher that would raise an affine loop post-transformation to the proper vector ops, and then you’re back in the form of 1.
  3. Otherwise, you can also start from e.g. linalg.matmul and go through transformations that get you to vector.transfer + vector.contract. Here is a quick hack to produce the following IR for illustration purposes:
    scf.for %arg3 = %c0 to %c24 step %c12 {
      scf.for %arg4 = %c0 to %c192 step %c32 {
        %0 = subview %arg2[%arg3, %arg4] [12, 32] [1, 1] : memref<24x192xf32> to memref<12x32xf32, #map0>
        %1 = vector.transfer_read %0[%c0, %c0], %cst {masked = [false, false]} : memref<12x32xf32, #map0>, vector<12x32xf32>
        %2 = scf.for %arg5 = %c0 to %c64 step %c16 iter_args(%arg6 = %1) -> (vector<12x32xf32>) {
          %3 = subview %arg0[%arg3, %arg5] [12, 16] [1, 1] : memref<24x64xf32> to memref<12x16xf32, #map1>
          %4 = subview %arg1[%arg5, %arg4] [16, 32] [1, 1] : memref<64x192xf32> to memref<16x32xf32, #map0>
          %5 = vector.transfer_read %3[%c0, %c0], %cst {masked = [false, false]} : memref<12x16xf32, #map1>, vector<12x16xf32>
          %6 = vector.transfer_read %4[%c0, %c0], %cst {masked = [false, false]} : memref<16x32xf32, #map0>, vector<16x32xf32>
          %7 = vector.contract {indexing_maps = [#map2, #map3, #map4], iterator_types = ["parallel", "parallel", "reduction"]} %5, %6, %cst_0 : vector<12x16xf32>, vector<16x32xf32> into vector<12x32xf32>
          %8 = addf %arg6, %7 : vector<12x32xf32>
          scf.yield %8 : vector<12x32xf32>
        }
        vector.transfer_write %2, %0[%c0, %c0] {masked = [false, false]} : vector<12x32xf32>, memref<12x32xf32, #map0>
      }
    }
  4. The above assumes that you have a good memory subsystem that you can lower vector.transfer_read / vector.transfer_write to on the fly (e.g. like GPUs), but this may not be the case and it could be quite slow on your HW. Also, you may want to pack tiles for multiple 8x32x8 operations a little ahead of time and amortize that packing (i.e. something between 1. and 3.). There is work in progress for this with Linalg on tensors (see this commit and the packed form around tensor<2x4xf32> into tensor<?x2x4xf32> and tensor<4x3xf32> into tensor<?x?x4x3xf32>, here). This is the preferred, future-proof way that will also work with other data types (e.g. sparse), but it can only be exercised end-to-end in IREE for now. It will take some time to make it available in core because a bunch of abstractions are still missing around bufferization and nested linalg ops.
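
As a rough illustration of the per-tile contraction mentioned in 1. (a minimal sketch only: the affine maps, the operand names and the accumulator handling are assumptions, and the rewrite to an MMU op would be your own HW-specific pattern):

    #map_a = affine_map<(m, n, k) -> (m, k)>
    #map_b = affine_map<(m, n, k) -> (k, n)>
    #map_c = affine_map<(m, n, k) -> (m, n)>

    // One [8,32] x [32,8] tile product expressed as vector.contract; a conversion
    // pattern would match exactly this shape and rewrite it into the MMU op.
    %acc1 = vector.contract {indexing_maps = [#map_a, #map_b, #map_c],
                             iterator_types = ["parallel", "parallel", "reduction"]}
            %a_tile, %b_tile, %acc0
            : vector<8x32xf32>, vector<32x8xf32> into vector<8x8xf32>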

So TL;DR: 1. is the easy way to get things connected and tested, and likely a prerequisite for the other, more automated approaches; 2. would be valuable to add, and I am sure people more invested in the affine-only path are working towards that; 3. will work but may not suit your needs depending on your HW; 4. will take time to become available in core.

Cheers!

N


> It is a little unclear to me at this point what your end-to-end flow is

TensorFlow is expected to be the front end, which is quite like IREE, so we can get the HLO dialect from TF. The HLO would then be lowered through Linalg, or Linalg plus our own dialects, down to the ISA that runs on the target accelerator.

Option 3 is quite inspiring. I think I could create a foo dialect and define an operation foo.matmul, which could then be lowered to the Linalg IR in your post via linalg.matmul, right?
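
Just to make sure I understand, here is a minimal sketch of what I have in mind (the foo dialect and its op are hypothetical on my side, and the exact linalg.matmul spelling depends on the MLIR version):

    // Hypothetical op in a custom foo dialect, written in generic form:
    func @forward(%A: memref<48x128xf32>, %B: memref<128x40xf32>, %C: memref<48x40xf32>) {
      "foo.matmul"(%A, %B, %C) : (memref<48x128xf32>, memref<128x40xf32>, memref<48x40xf32>) -> ()
      return
    }

    // After a foo-to-linalg conversion pass: the same computation as linalg.matmul,
    // ready for the tiling / vectorization transformations shown above.
    func @forward(%A: memref<48x128xf32>, %B: memref<128x40xf32>, %C: memref<48x40xf32>) {
      linalg.matmul ins(%A, %B : memref<48x128xf32>, memref<128x40xf32>)
                   outs(%C : memref<48x40xf32>)
      return
    }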

When will the hack be merged? Please let me know; I can’t wait to try it. Moreover, I don’t understand this part of the hack, which seems tricky :). It would be wonderful if you could give some explanation:

// RUN: export M=24 && export K=64 && export N=192 && export ITERS=10 &&
// RUN: cat %s | sed 's@${M}@'"$M"'@g' | sed 's@${K}@'"$K"'@g' | sed 's@${N}@'"$N"'@g' | sed 's@${ITERS}@'"$ITERS"'@g' | \

Another question is about linalg.view. Here it is shown that there is a view operation in linalg, but there is no description of it in the Linalg dialect documentation. As far as I know, the !linalg.view type was folded into the strided memref, but there seems to be no clue about the disappearance of linalg.view. Do you have any idea?

Many thanks.

Jackwin

This is just a trick to “inject” some hyperparameters into the test file.
What this does is first export the variables to the environment; then cat %s reads the current file and pipes it through sed, which performs textual replacement (every ${M} in the file will be replaced with 24, etc.). The content is then piped to mlir-opt.


!linalg.view has evolved into the strided memref representation.
You now have std.view and std.subview operations, along with canonicalizations + type inference, that allow more general semantics than just view.
The Linalg dialect has evolved to support strided memrefs and tensors.
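
For example, a small sketch of the replacement (using the subview spelling from the IR above; the function name and the exact layout map are illustrative):

    #tile_map = affine_map<(d0, d1)[s0] -> (d0 * 128 + s0 + d1)>

    // Take an [8,32] tile of A as a strided memref; this strided view plays the
    // role that the old !linalg.view type used to play.
    func @take_tile(%A: memref<48x128xf32>, %i: index, %j: index) -> memref<8x32xf32, #tile_map> {
      %v = subview %A[%i, %j] [8, 32] [1, 1]
          : memref<48x128xf32> to memref<8x32xf32, #tile_map>
      return %v : memref<8x32xf32, #tile_map>
    }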


@mehdi_amini @nicolasvasilache Thank you for your answers. Appreciate it!