Understanding Iteration Space Inference for GenericOp

This is a follow-up to this question: http://discourse.llvm.org/t/window-iterator-type-for-linalg-genericop/3967

I’m attempting to represent a horizontal blur operation using linalg.generic. The IR I am currently generating is this:

#map0 = affine_map<(d0, d1) -> (d0, d1 - 1)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1 + 1)>
module @global_scope  {
  func @sumKernel(%arg0: tensor<1x3xi16>) -> i16 {
    %c0 = constant 0 : index
    %c1 = constant 1 : index
    %c2 = constant 2 : index
    %0 = tensor.extract %arg0[%c0, %c0] : tensor<1x3xi16>
    %1 = tensor.extract %arg0[%c0, %c1] : tensor<1x3xi16>
    %2 = addi %0, %1 : i16
    %3 = tensor.extract %arg0[%c0, %c2] : tensor<1x3xi16>
    %4 = addi %2, %3 : i16
    return %4 : i16
  }
  func @main(%arg0: tensor<480x640xi16>) -> tensor<480x640xi16> {
    %c0_i16 = constant 0 : i16
    %0 = linalg.pad_tensor %arg0 low[0, 1] high[0, 1]  {
    ^bb0(%arg1: index, %arg2: index):  // no predecessors
      linalg.yield %c0_i16 : i16
    } : tensor<480x640xi16> to tensor<480x642xi16>
    %1 = splat %c0_i16 : tensor<480x640xi16>
    %2 = linalg.generic {indexing_maps = [#map0, #map1, #map2, #map1], iterator_types = ["parallel", "window"]} ins(%0, %0, %0 : tensor<480x642xi16>, tensor<480x642xi16>, tensor<480x642xi16>) outs(%1 : tensor<480x640xi16>) {
    ^bb0(%arg1: i16, %arg2: i16, %arg3: i16, %arg4: i16):  // no predecessors
      %3 = tensor.from_elements %arg1, %arg2, %arg3 : tensor<3xi16>
      %c1 = constant 1 : index
      %c3 = constant 3 : index
      %4 = tensor.from_elements %c1, %c3 : tensor<2xindex>
      %5 = tensor.reshape %3(%4) : (tensor<3xi16>, tensor<2xindex>) -> tensor<1x3xi16>
      %6 = call @sumKernel(%5) : (tensor<1x3xi16>) -> i16
      linalg.yield %6 : i16
    } -> tensor<480x640xi16>
    return %2 : tensor<480x640xi16>
  }
}

This seems more or less in keeping with the answer from my previous post, other than the fact that it makes use of statically sized tensors. The intent of the example above is to produce a new tensor whose elements are the sum of a 1x3 window in the input tensor. It first pads the input tensor to handle the boundary conditions.
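In other words, the intended computation per output element (with out-of-range reads treated as 0 thanks to the padding) is:

out[r][c] = in[r][c - 1] + in[r][c] + in[r][c + 1]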

The problem I run into is that the linalg.generic operation above fails to legalize, and the failure seems related to the supplied affine maps. This is the error given for the example above:

error: 'linalg.generic' op unexpected result less than 0 at expression #1 in (d0, d1) -> (d0, d1 - 1)

The error seems related to the process of inferring iteration space dimensions. If I change the indexing maps to

#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0, d1 + 1)>
#map2 = affine_map<(d0, d1) -> (d0, d1 + 2)>

This results in a different error:

error: 'linalg.generic' op inferred input/output operand #1 has shape's dimension #1 to be greater than or equal to 643, but found 642

I’ve tried other variants of this, with different indexing maps given in different orders, but the end result is always one of these two errors.

Is there anything obviously wrong I’m doing in this example? Or am I running into some other limitation?

This is probably my bad. Linalg uses the indexing maps to determine the shapes of the operands given the loop extents, and it also uses the shapes of the operands to infer the iteration space. Since you have three different affine maps accessing the same tensor, the shape of that tensor as determined by Linalg op verification differs per map, even though it is the same tensor. This causes the mismatch. Roughly speaking: with the identity map on the 480x642 operand, the loop extent along d1 is taken to be 642, so the map (d0, d1 - 1) produces the index -1 at d1 = 0 (your first error); with the shifted maps (d0, d1 + 1) and (d0, d1 + 2) on that same operand, the inferred extent instead requires the dimension to be at least 643 (your second error). No single shape satisfies all three maps at once. I apologize for pointing you in the wrong direction there.

What would work is using tensor.extract_slice. I’ll try to construct the example below:

#map0 = affine_map<(d0, d1) -> (d0, d1)>
func @blur(%input : tensor<?x?xi16>) -> tensor<?x?xi16> {
  %c0 = constant 0 : index
  %cst = constant 0 : i16
  %c1 = constant 1 : index
  %0 = linalg.pad_tensor %input low[0, 1] high[0, 1] {
    ^bb0(%arg1: index, %arg2: index):  // no predecessors
      linalg.yield %cst : i16
  } : tensor<?x?xi16> to tensor<?x?xi16>
  %d0 = tensor.dim %input, %c0 : tensor<?x?xi16>
  %d1 = tensor.dim %input, %c1 : tensor<?x?xi16>
  %1 = tensor.extract_slice %0[0, 0][%d0, %d1][1, 1] : tensor<?x?xi16> to tensor<?x?xi16>
  %2 = tensor.extract_slice %0[0, 1][%d0, %d1][1, 1] : tensor<?x?xi16> to tensor<?x?xi16>
  %3 = tensor.extract_slice %0[0, 2][%d0, %d1][1, 1] : tensor<?x?xi16> to tensor<?x?xi16>
  %4 = linalg.init_tensor [%d0, %d1] : tensor<?x?xi16>
  %5 = linalg.fill(%cst, %4) : i16, tensor<?x?xi16> -> tensor<?x?xi16>
  %6 = linalg.generic {
    indexing_maps = [#map0, #map0, #map0, #map0],
    iterator_types = ["parallel", "window"]}
    ins(%1, %2, %3 : tensor<?x?xi16>, tensor<?x?xi16>, tensor<?x?xi16>)
    outs(%5 : tensor<?x?xi16>) {
   ^bb0(%arg0 : i16, %arg1 : i16, %arg2: i16, %arg3 : i16):
     /// Same body as above
   } -> tensor<?x?xi16>
  return %6 : tensor<?x?xi16>
}

In summary, you create the padded tensor, then take slices that form different “views” of the padded tensor, each offset by 1 in the blurred dimension, and then use the same affine map for all of the subtensors. This way there is no shape mismatch.

A couple of things to note:

  1. If a linalg op has a “reduction” or “window” iterator type, then it has read-modify-write semantics, so the outs operand must be initialized to 0 (as I did above with linalg.fill).
  2. tensor.extract_slice has copy semantics. But with the Linalg comprehensive bufferization pass, these tensor.extract_slice ops turn into memref.subview ops, which do not copy the data in the underlying buffer but instead read the same buffer at different offsets (see the sketch after this list). So you get what you would expect.
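As an illustration, here is roughly what one of those slices becomes after bufferization. This is a hand-written sketch, not the output of the pass; the value names are made up and the result layout map is only approximate:

// %padded is the buffer holding the padded tensor (illustrative name).
// The subview re-indexes that buffer at a column offset of 1; no elements are copied.
%view = memref.subview %padded[0, 1] [480, 640] [1, 1]
    : memref<480x642xi16> to memref<480x640xi16, affine_map<(d0, d1) -> (d0 * 642 + d1 + 1)>>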

Again, sorry about my previous post. I did not think about the bounds computation issue when I suggested that approach.

Thanks for the quick reply again @MaheshRavishankar. That fixes the tensor shape errors I was running into.

With the existing sequence of transforms I’ve been playing with, the slicing lowers to a sequence of copies. You mentioned the “comprehensive bufferization pass”, which I take to mean the LinalgComprehensiveModuleBufferize pass. I’m trying to get an idea of when/how to use this pass (I’ve not found any examples using it).

When I try placing the comprehensive bufferization pass into my pass manager, I get the following error: error: unsupported op with tensors. This leads me to believe one or more of the ops in my example are not supported. Do you have any references for how to make use of the comprehensive bufferization transform?

There are 2 tests using it; they should be easy to find with a regex looking for “comprehensive” in the .mlir files.
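The RUN lines in those tests show how the pass is invoked from mlir-opt. If I remember the flag name correctly, it looks something like this (written from memory, so double-check against the actual test files):

// RUN: mlir-opt %s -linalg-comprehensive-module-bufferize | FileCheck %s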

It is highly likely that op support is missing for your use case as the op properties for this pass are not yet factored out into first class interfaces. I can add support for what you need when I come back from vacation on Monday if you give me a minimal repro.

This is the example I am currently working with:

#map = affine_map<(d0, d1) -> (d0, d1)>
builtin.module @global_scope  {
  builtin.func @main(%arg0: tensor<480x640xi16>) -> tensor<480x640xi16> {
    %cst = constant dense<[1, 3]> : tensor<2xindex>
    %c0_i16 = constant 0 : i16
    %c2 = constant 2 : index
    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %0 = linalg.pad_tensor %arg0 low[0, 1] high[0, 1]  {
    ^bb0(%arg1: index, %arg2: index):  // no predecessors
      linalg.yield %c0_i16 : i16
    } : tensor<480x640xi16> to tensor<480x642xi16>
    %1 = tensor.extract_slice %0[0, 0] [480, 640] [1, 1] : tensor<480x642xi16> to tensor<480x640xi16>
    %2 = tensor.extract_slice %0[0, 1] [480, 640] [1, 1] : tensor<480x642xi16> to tensor<480x640xi16>
    %3 = tensor.extract_slice %0[0, 2] [480, 640] [1, 1] : tensor<480x642xi16> to tensor<480x640xi16>
    %4 = linalg.init_tensor [480, 640] : tensor<480x640xi16>
    %5 = linalg.fill(%c0_i16, %4) : i16, tensor<480x640xi16> -> tensor<480x640xi16>
    %6 = linalg.generic {indexing_maps = [#map, #map, #map, #map], iterator_types = ["parallel", "window"]} ins(%1, %2, %3 : tensor<480x640xi16>, tensor<480x640xi16>, tensor<480x640xi16>) outs(%5 : tensor<480x640xi16>) {
    ^bb0(%arg1: i16, %arg2: i16, %arg3: i16, %arg4: i16):  // no predecessors
      %7 = tensor.from_elements %arg1, %arg2, %arg3 : tensor<3xi16>
      %8 = tensor.reshape %7(%cst) : (tensor<3xi16>, tensor<2xindex>) -> tensor<1x3xi16>
      %9 = tensor.extract %8[%c0, %c0] : tensor<1x3xi16>
      %10 = tensor.extract %8[%c0, %c1] : tensor<1x3xi16>
      %11 = addi %9, %10 : i16
      %12 = tensor.extract %8[%c0, %c2] : tensor<1x3xi16>
      %13 = addi %11, %12 : i16
      linalg.yield %13 : i16
    } -> tensor<480x640xi16>
    return %6 : tensor<480x640xi16>
  }
}

tensor.from_elements is likely not yet supported.
However, looking at the body of the generic, it is unclear to me why these contortions are necessary. Are you just adding the 3 args, or am I missing something?

Strictly speaking, the tensor contortions are not necessary. The IR was generated assuming the body of the GenericOp is implemented by a function taking a tensor in the shape of the window being processed. This is a detail of the source language I am translating from.

The call to the implementing function has been inlined away already. I was actually hoping the right sequence of peepholes would get rid of the from_elements and reshaping.
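In other words, after inlining and the folds I was hoping for, I would expect the body of the generic to reduce to something like this (hand-written; not what my pipeline currently produces):

^bb0(%arg1: i16, %arg2: i16, %arg3: i16, %arg4: i16):  // no predecessors
  %sum0 = addi %arg1, %arg2 : i16
  %sum1 = addi %sum0, %arg3 : i16
  linalg.yield %sum1 : i16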

This is the initial IR I’m starting with just after translation from the source language:

#map = affine_map<(d0, d1) -> (d0, d1)>
builtin.module @global_scope  {
  builtin.func @sumKernel(%arg0: tensor<1x3xi16>) -> i16 {
    %c2 = constant 2 : index
    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %0 = tensor.extract %arg0[%c0, %c0] : tensor<1x3xi16>
    %1 = tensor.extract %arg0[%c0, %c1] : tensor<1x3xi16>
    %2 = addi %0, %1 : i16
    %3 = tensor.extract %arg0[%c0, %c2] : tensor<1x3xi16>
    %4 = addi %2, %3 : i16
    return %4 : i16
  }
  builtin.func @idiomTestEntryPoint(%arg0: tensor<480x640xi16>) -> tensor<480x640xi16> {
    %c0_i16 = constant 0 : i16
    %0 = linalg.pad_tensor %arg0 low[0, 1] high[0, 1]  {
    ^bb0(%arg1: index, %arg2: index):  // no predecessors
      linalg.yield %c0_i16 : i16
    } : tensor<480x640xi16> to tensor<480x642xi16>
    %1 = tensor.extract_slice %0[0, 0] [480, 640] [1, 1] : tensor<480x642xi16> to tensor<480x640xi16>
    %2 = tensor.extract_slice %0[0, 1] [480, 640] [1, 1] : tensor<480x642xi16> to tensor<480x640xi16>
    %3 = tensor.extract_slice %0[0, 2] [480, 640] [1, 1] : tensor<480x642xi16> to tensor<480x640xi16>
    %c0_i16_7 = constant 0 : i16
    %4 = linalg.init_tensor [480, 640] : tensor<480x640xi16>
    %5 = linalg.fill(%c0_i16_7, %4) : i16, tensor<480x640xi16> -> tensor<480x640xi16>
    %6 = linalg.generic {indexing_maps = [#map, #map, #map, #map], iterator_types = ["parallel", "window"]} ins(%1, %2, %3 : tensor<480x640xi16>, tensor<480x640xi16>, tensor<480x640xi16>) outs(%5 : tensor<480x640xi16>) {
    ^bb0(%arg1: i16, %arg2: i16, %arg3: i16, %arg4: i16):  // no predecessors
      %7 = tensor.from_elements %arg1, %arg2, %arg3 : tensor<3xi16>
      %c1 = constant 1 : index
      %c3 = constant 3 : index
      %8 = tensor.from_elements %c1, %c3 : tensor<2xindex>
      %9 = tensor.reshape %7(%8) : (tensor<3xi16>, tensor<2xindex>) -> tensor<1x3xi16>
      %10 = call @sumKernel(%9) : (tensor<1x3xi16>) -> i16
      linalg.yield %10 : i16
    } -> tensor<480x640xi16>
    return %6 : tensor<480x640xi16>
  }
}

As you can see, the contortions are there to provide the right interface to the @sumKernel function. I could probably eliminate the intermediate tensor during the translation step, since it should always be statically addressed in the kernel body, but that would complicate the translation layer quite a bit.

Yes, this hints at a bunch of missing canonicalizations and folding on tensor operations.

We should implement those on the model of the ones we already have in the vector dialect.

You could also implement the tensor manipulations as vector operations; you’d have to use vector.insert / vector.extract manually, as there is no vector from_elements operation (we could also add one).
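To make that concrete, the body of the generic could look roughly like this. This is an untested sketch, the value names are illustrative, and the exact vector op syntax may be slightly off:

^bb0(%arg1: i16, %arg2: i16, %arg3: i16, %arg4: i16):
  // Pack the three window elements into a vector instead of a tensor.
  %zero = constant 0 : i16
  %empty = splat %zero : vector<3xi16>
  %v0 = vector.insert %arg1, %empty[0] : i16 into vector<3xi16>
  %v1 = vector.insert %arg2, %v0[1] : i16 into vector<3xi16>
  %v2 = vector.insert %arg3, %v1[2] : i16 into vector<3xi16>
  // The kernel then reads the elements back with vector.extract.
  %e0 = vector.extract %v2[0] : vector<3xi16>
  %e1 = vector.extract %v2[1] : vector<3xi16>
  %s0 = addi %e0, %e1 : i16
  %e2 = vector.extract %v2[2] : vector<3xi16>
  %s1 = addi %s0, %e2 : i16
  linalg.yield %s1 : i16

Unlike the tensor version, these small vector values do not need to round-trip through memory during bufferization.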

My rule of thumb is that tensor ops are “more dynamic” and “will generally go through memory by bufferization”.

For the use case I’m seeing here, I’d probably try vector abstractions.