Index maps and tiling

rwalkr · October 31, 2022, 9:55am

Hi,

I’ve been working on some data processing which requires sub-sampling a tensor at regular intervals, with an offset within each tile. My thinking was that this could be implemented as linalg.generic with index maps used on the input to specify the interval and offset - e.g. for sampling an 8x8 tile with offsets [4, 5]:

#map0 = affine_map<(d0, d1) -> (d0 * 8 + 4, d1 * 8 + 5)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
  func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> attributes {llvm.emit_c_interface} {
    %0 = tensor.empty() : tensor<45x80xf32>
    %1 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<360x640xf32>) outs(%0 : tensor<45x80xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<45x80xf32>
    return %1 : tensor<45x80xf32>
  }
}

This works fine with a direct lowering to CPU, but we noticed we weren’t getting the right results when running the pipeline using IREE. I’ve tracked it down to the tiling and can reproduce it using the -linalg-tile pass (as I understand it, IREE uses the same pass/code from MLIR - hence why I’m raising it here).

mlir-opt -linalg-tile="tile-sizes=15,20" test.mlir produces:

#map0 = affine_map<(d0) -> (d0 * 8 + 4)>
#map1 = affine_map<(d0) -> (d0 * -8 + 353, 117)>
#map2 = affine_map<(d0) -> (d0 * 8 + 5)>
#map3 = affine_map<(d0) -> (d0 * -8 + 633, 158)>
#map4 = affine_map<(d0, d1) -> (d0 * 8 + 4, d1 * 8 + 5)>
#map5 = affine_map<(d0, d1) -> (d0, d1)>
module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
  func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> attributes {llvm.emit_c_interface} {
    %c0 = arith.constant 0 : index
    %c45 = arith.constant 45 : index
    %c15 = arith.constant 15 : index
    %c80 = arith.constant 80 : index
    %c20 = arith.constant 20 : index
    %0 = tensor.empty() : tensor<45x80xf32>
    %1 = scf.for %arg1 = %c0 to %c45 step %c15 iter_args(%arg2 = %0) -> (tensor<45x80xf32>) {
      %2 = scf.for %arg3 = %c0 to %c80 step %c20 iter_args(%arg4 = %arg2) -> (tensor<45x80xf32>) {
        %3 = affine.apply #map0(%arg1)
        %4 = affine.min #map1(%arg1)
        %5 = affine.apply #map2(%arg3)
        %6 = affine.min #map3(%arg3)
        %extracted_slice = tensor.extract_slice %arg0[%3, %5] [%4, %6] [1, 1] : tensor<360x640xf32> to tensor<?x?xf32>
        %extracted_slice_0 = tensor.extract_slice %arg4[%arg1, %arg3] [15, 20] [1, 1] : tensor<45x80xf32> to tensor<15x20xf32>
        %7 = linalg.generic {indexing_maps = [#map4, #map5], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<?x?xf32>) outs(%
extracted_slice_0 : tensor<15x20xf32>) {
        ^bb0(%in: f32, %out: f32):
          linalg.yield %in : f32
        } -> tensor<15x20xf32>
        %inserted_slice = tensor.insert_slice %7 into %arg4[%arg1, %arg3] [15, 20] [1, 1] : tensor<15x20xf32> into tensor<45x80xf32>
        scf.yield %inserted_slice : tensor<45x80xf32>
      }
      scf.yield %2 : tensor<45x80xf32>
    }
    return %1 : tensor<45x80xf32>
  }
}

It looks like the index map is split into maps for each dimension (map0 and map2) which are then applied to compute the locations of the input tile, but then the original map is still used on the linalg.generic over the tile, so the map is effectively applied twice.

Is this something that needs fixing in the tiling pass or is this an invalid use of index maps? For now, I’ve been able to work around it by using linalg.index to get the indices of the output tensor and computing the input tensor indices for tensor.extract.

Thanks

Rob

banach-space · November 4, 2022, 4:16pm

Welcome @rwalkr , thanks for posting

The output after tiling looks … horrible, and that’s a bit suspicious to me. One thing to note, as of ⚙ D135573 [mlir][Linalg] Retire LinalgStrategyTilePass and filter-based pattern., --linalg-tile is no longer available in mlir-opt. You may want to try the transform dialect instead.

Although I’ve not been able to reproduce your results using ToT (I don’t know how to use the transform dialect, yet), I’ve skimmed through your examples and noticed a couple of things.

the map is effectively applied twice

I don’t think that this is a problem here. After tiling, these maps are used to calculate the starting point in the input tensor:

#map0 = affine_map<(d0) -> (d0 * 8 + 4)>
#map2 = affine_map<(d0) -> (d0 * 8 + 5)>

This makes sense to me - why would these maps change? Obvoiusly, these maps are used differently after tiling. In particular, the input indices have to be shifted and/or scaled accordingly:

      %1 = scf.for %arg1 = %c0 to %c45 step %c15 iter_args(%arg2 = %0) -> (tensor<45x80xf32>) {
        %2 = scf.for %arg3 = %c0 to %c80 step %c20 iter_args(%arg4 = %arg2) -> (tensor<45x80xf32>) {
          %3 = affine.apply #map0(%arg1)
          %4 = affine.min #map1(%arg1)
          %5 = affine.apply #map2(%arg3)
          %6 = affine.min #map3(%arg3)

But this looks legit to me. However, these maps are confusing:

#map1 = affine_map<(d0) -> (d0 * -8 + 353, 117)>
#map3 = affine_map<(d0) -> (d0 * -8 + 633, 158)>

IIUC, these are used to calculate the slice sizes for the input tensor:

%4 = affine.min #map1(%arg1)
%6 = affine.min #map3(%arg3)
(...)
%extracted_slice = tensor.extract_slice %arg0[%3, %5] [%4, %6] [1, 1] : tensor<360x640xf32> to tensor<?x?xf32>

For arg0 and arg3 equal 0, you would get %4 = 117 and %6 = 158. And that’s confusing to me. I would always expect slices of size 120 x 160 (input tensor) and 15 x 20 (output tensor). Am I reading this OK?

Having said all this, making sure that we can reproduce this using ToT is probably the most sensible first step.

HTH,
Andrzej

banach-space · November 5, 2022, 4:51pm

OK, I’ve reproduced this using ToT (c97035c49d94). You need to update the example as follows:

  #map0 = affine_map<(d0, d1) -> (d0 * 8 + 4, d1 * 8 + 5)>
  #map1 = affine_map<(d0, d1) -> (d0, d1)>
  module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
    func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> attributes {llvm.emit_c_interface} {
     %0 = tensor.empty() : tensor<45x80xf32>
     %1 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<360x640xf32>) outs(%0 : tensor<45x80xf32>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
      } -> tensor<45x80xf32>
      return %1 : tensor<45x80xf32>
    }

    transform.sequence failures(propagate) {
      ^bb0(%arg1: !pdl.operation):
        %0 = transform.structured.match ops{["linalg.generic"]} in %arg1
        %1, %loops:2 = transform.structured.tile %0 [15, 20]
    }
  }

Next, compile like this:

mlir-opt file.mlir -test-transform-dialect-interpreter

-Andrzej

MaheshRavishankar · November 7, 2022, 2:34am

There might be a bug here indeed,

These maps do look suspicious… The affine.min is meant to handle partial tiles… dont immediately follow those numbers. I wonder if there is something off with the use of folded affine maps.

MaheshRavishankar · November 7, 2022, 2:38am

In the mean time though, it might be better to not use linalg.generic. You are probably looking for linalg.pool_* operation with dilation and strides. I think that more closely matches you are looking for. I am a bit suspicious about this map for the input.

affine_map<(d0, d1) -> (d0 * 8 + 4, d0 * 8 + 5)>

I am not convinced based on the affine map that for every element of the output, the input is in-bounds… ( I know that you say the CPU execution is fine, but that might be just by chance).
Using the pooling op with dilation and strides and comparing the indexing maps might work better. Also pooling ops are handled by IREE (through tiling), so if those fit your need, then this might be a moot point.

nicolasvasilache · November 7, 2022, 10:38am

Thanks for posting this example.

First, note that to subsample like you seem to be doing here, you could also use tensor.insert/extract_slice:

%extracted = tensor.extract_slice %t[4, 5][45, 80][8, 8] : tensor<360x640xf32> to tensor<45x80xf32>
%computed = compute(...)
%res = tensor.insert_slice %computed into %t[4, 5][45, 80][8, 8] : tensor<45x80xf32> into tensor<360x640xf32>

Then you can express the compute as a simple identity map.

Separately, this is what I am seeing at ToT running the last example:

#map = affine_map<(d0) -> (d0 * 8 + 4)>
#map1 = affine_map<(d0) -> (d0 * 8 + 5)>
#map2 = affine_map<(d0, d1) -> (d0 * 8 + 4, d1 * 8 + 5)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
  func.func @pipeline(%arg0: tensor<360x640xf32>) -> tensor<45x80xf32> attributes {llvm.emit_c_interface} {
    %0 = tensor.empty() : tensor<45x80xf32>
    %c15 = arith.constant 15 : index
    %c20 = arith.constant 20 : index
    %c0 = arith.constant 0 : index
    %c45 = arith.constant 45 : index
    %1 = scf.for %arg1 = %c0 to %c45 step %c15 iter_args(%arg2 = %0) -> (tensor<45x80xf32>) {
      %c0_0 = arith.constant 0 : index
      %c80 = arith.constant 80 : index
      %2 = scf.for %arg3 = %c0_0 to %c80 step %c20 iter_args(%arg4 = %arg2) -> (tensor<45x80xf32>) {
        %3 = affine.apply #map(%arg1)
        %4 = affine.apply #map1(%arg3)
        %extracted_slice = tensor.extract_slice %arg0[%3, %4] [117, 158] [1, 1] : tensor<360x640xf32> to tensor<117x158xf32>
        %extracted_slice_1 = tensor.extract_slice %arg4[%arg1, %arg3] [15, 20] [1, 1] : tensor<45x80xf32> to tensor<15x20xf32>
        %5 = linalg.generic {indexing_maps = [#map2, #map3], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<117x158xf32>) outs(%extracted_slice_1 : tensor<15x20xf32>) {
        ^bb0(%in: f32, %out: f32):
          linalg.yield %in : f32
        } -> tensor<15x20xf32>
        %inserted_slice = tensor.insert_slice %5 into %arg4[%arg1, %arg3] [15, 20] [1, 1] : tensor<15x20xf32> into tensor<45x80xf32>
        scf.yield %inserted_slice : tensor<45x80xf32>
      }
      scf.yield %2 : tensor<45x80xf32>
    }
    return %1 : tensor<45x80xf32>
  }
}

Some of the recent refactorings also spurred some extra foldings that were missing, so the IR should be cleaner than what it used to be up until recently.

The rationale for [117, 158] appearing in the first tensor.extract_slice is that, as we apply tiling by [15x20], the tile indices span [0 .. 14]x[0 .. 19].

There is an exclusive bounds -> (i.e. -1) inclusive bounds -> (i.e. +1) exclusive bounds conversion happening.

So when we apply the map f(i) = 8*i + 4 on i \in [0, 15) we get f(i) \in [4, 8*14 + 4] or f(i) \in [4, 116] == [4, 117) and so the upper bound for f(i) is 117.

A similar procedure applies to the other dimension.

The reason why maps such as:

#map1 = affine_map<(d0) -> (d0 * -8 + 353, 117)>
#map3 = affine_map<(d0) -> (d0 * -8 + 633, 158)>

may appear to express upper bounds comes from ceil/floor computations and inclusive/exclusive conversions: 353 == 360 - (8 - 1) and 633 == 640 - (8 - 1).

The affine.min need to ensure that at the boundary, we do not overflow.

rwalkr · November 7, 2022, 1:34pm

Hi,

Thanks for the responses. @nicolasvasilache - tensor.extract_slice does seem to do exactly what we need for this workload - thanks! (I didn’t see that it could do strided access on my first skim through the docs) - we’ll use this going forward.

I think in this case we could be sure the linalg.generic would stay within bounds as the op we were lowering from validates that the offset < stride, but yes in general using the index map like that could result in out of bounds access.

There does still seem to be something wrong with the tiled linalg.generic with these maps - if we lower to loops (see below), then we do see the maps applied to the indexes used to access the subview - so the offset portion of the map is being applied twice. However, since tensor.extract_slice seems to be a better way of implementing this, I’m happy to leave it at this point.

Rob

With @banach-space 's reproducer:

mlir-opt -test-transform-dialect-interpreter --empty-tensor-to-alloc-tensor -convert-tensor-to-linalg -one-shot-bufferize="bufferize-function-boundaries=1 allow-return-allocs=1"  --convert-linalg-to-loops test.mlir

produces:

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0) -> (d0 * 8 + 4)>
#map2 = affine_map<(d0) -> (d0 * 8 + 5)>
module attributes {llvm.target_triple = "aarch64-none-linux-gnu"} {
  func.func @pipeline(%arg0: memref<360x640xf32, strided<[?, ?], offset: ?>>) -> memref<45x80xf32> attributes {llvm.emit_c_interface} {
    %c1 = arith.constant 1 : index
    %c15 = arith.constant 15 : index
    %c20 = arith.constant 20 : index
    %c0 = arith.constant 0 : index
    %c45 = arith.constant 45 : index
    %c80 = arith.constant 80 : index
    %alloc = memref.alloc() {alignment = 128 : i64} : memref<45x80xf32>
    scf.for %arg1 = %c0 to %c45 step %c15 {
      %0 = affine.apply #map1(%arg1)
      scf.for %arg2 = %c0 to %c80 step %c20 {
        %1 = affine.apply #map2(%arg2)
        %subview = memref.subview %arg0[%0, %1] [117, 158] [1, 1] : memref<360x640xf32, strided<[?, ?], offset: ?>> to memref<117x158xf32, strided<[?, ?], offset: ?>>
        %subview_0 = memref.subview %alloc[%arg1, %arg2] [15, 20] [1, 1] : memref<45x80xf32> to memref<15x20xf32, strided<[80, 1], offset: ?>>
        scf.for %arg3 = %c0 to %c15 step %c1 {
          scf.for %arg4 = %c0 to %c20 step %c1 {
            %2 = affine.apply #map1(%arg3)
            %3 = affine.apply #map2(%arg4)
            %4 = memref.load %subview[%2, %3] : memref<117x158xf32, strided<[?, ?], offset: ?>>
            memref.store %4, %subview_0[%arg3, %arg4] : memref<15x20xf32, strided<[80, 1], offset: ?>>
          }
        }
        memref.copy %subview_0, %subview_0 : memref<15x20xf32, strided<[80, 1], offset: ?>> to memref<15x20xf32, strided<[80, 1], offset: ?>>
      }
    }
    return %alloc : memref<45x80xf32>
  }
  transform.sequence failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} in %arg0
    %tiled_linalg_op, %loops:2 = transform.structured.tile %0[15, 20] {interchange = []}
  }
}

The indexes used to create %subview already have the offsets applied and they’re applied again when creating the indexes for the memref.load - so on the d1 axis instead of 5, 13, 21, … we get 10, 18, 26, …

banach-space · November 7, 2022, 4:00pm

Ah, now I see what you meant in your original message - sorry for not realizing earlier!

This looks like a bug to me and if that’s the case, a GitHub issue would be much appreciated so that we can track this and hopefully prevent others from hitting this.

-Andrzej

Topic		Replies	Views
After Linalg tiling can't lower to llvm MLIR	4	380	November 1, 2023
Question about --affine-loop-tile MLIR	1	609	October 26, 2021
MLIR for arm SME : Reducing tile data transfers MLIR	4	123	July 19, 2024
Tile linalg.batch_matmul MLIR mlir	7	151	December 9, 2024
Linalg.generic and reading elements two-by-two MLIR	4	275	May 24, 2022

Index maps and tiling

Related topics