Redundant buffer synthesis in tf-opt

dpotop · February 19, 2021, 12:49pm

Hello,

Consider this (trivial) tensorflow code:

func @myfun(%x:tensor<2xf32>)->(tensor<2xf32>) {
  %c = "tf.Const"(){value = dense<(3.000000e+00,4.0)>:tensor<2xf32>} : () -> tensor<2xf32>
  %0 = "tf.AddV2"(%c, %x) {device = ""} : (tensor<2xf32>, tensor<2xf32>) -> tensor<2xf32>
  return %0 : tensor<2xf32>
}

Converting it to linalg+std using tf-opt --tf-to-hlo-pipeline --hlo-legalize-to-linalg --mhlo-legalize-to-std produces the following code:

#map = affine_map<(d0) -> (d0)>
module  {
  func @myfun(%arg0: tensor<2xf32>) -> tensor<2xf32> {
    %cst = constant dense<[3.000000e+00, 4.000000e+00]> : tensor<2xf32>
    %0 = linalg.init_tensor [2] : tensor<2xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %cst : tensor<2xf32>, tensor<2xf32>) outs(%0 : tensor<2xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):  // no predecessors
      %2 = addf %arg1, %arg2 : f32
      linalg.yield %2 : f32
    } -> tensor<2xf32>
    return %1 : tensor<2xf32>
  }
}

This code already has its quirks (the %0 tensor being initialized and given as input to linalg.generic), but my true problem is that I cannot lower this to nice low-level code. My best result is the following:

  global_memref "private" constant @__constant_2xf32 : memref<2xf32> = dense<[3.000000e+00, 4.000000e+00]>
  func @myfun(%arg0: memref<2xf32>, %arg1: memref<2xf32>) {
    %0 = alloc() : memref<1xf32>
    %1 = get_global_memref @__constant_2xf32 : memref<2xf32>
    %c0 = constant 0 : index
    %c2 = constant 2 : index
    %c1 = constant 1 : index
    scf.for %arg2 = %c0 to %c2 step %c1 {
      %2 = load %arg0[%arg2] : memref<2xf32>
      %3 = load %1[%arg2] : memref<2xf32>
      %4 = addf %2, %3 : f32
      store %4, %0[%c0] : memref<1xf32>
      %5 = load %0[%c0] : memref<1xf32>
      store %5, %arg1[%arg2] : memref<2xf32>
    }
    dealloc %0 : memref<1xf32>
    return
  }

Note that a buffer is created, then deallocated, and data is copied around needlessly.

My question: Is there a pipeline of transformations that does not allocate memory and copy data around?

Best,
Dumitru

herhut · February 22, 2021, 10:32am

Can you share some details how you got to the code you show? Like which passes you have used?

dpotop · February 22, 2021, 11:00am

Certainly. I assume your question concerns the second transformation step (the first I already provided in my post). I have used the following command:

mlir-opt \
    --linalg-bufferize --convert-linalg-to-affine-loops \
    --func-bufferize --buffer-results-to-out-params \
    --tensor-constant-bufferize --finalizing-bufferize \
    --convert-linalg-to-affine-loops \
    --affine-loop-fusion \
    --lower-affine \
    --cse \
    --copy-removal \
    --buffer-deallocation

You can substitute mlir-opt with tf-opt, the result is (as expected) the same.

herhut · February 24, 2021, 8:34pm

Looking at this, I think the issue is that you are running --copy-removal before --buffer-deallocation. To be able to remove copies, we rely on dealloc operations to already be inserted. I tried with

--linalg-bufferize --convert-linalg-to-affine-loops --func-bufferize --buffer-results-to-out-params --tensor-constant-bufferize --finalizing-bufferize --buffer-deallocation --copy-removal

and that gave me

module  {
  global_memref "private" constant @__constant_2xf32 : memref<2xf32> = dense<[3.000000e+00, 4.000000e+00]>
  func @myfun(%arg0: memref<2xf32>, %arg1: memref<2xf32>) {
    %0 = get_global_memref @__constant_2xf32 : memref<2xf32>
    affine.for %arg2 = 0 to 2 {
      %1 = affine.load %arg0[%arg2] : memref<2xf32>
      %2 = affine.load %0[%arg2] : memref<2xf32>
      %3 = addf %1, %2 : f32
      affine.store %3, %arg1[%arg2] : memref<2xf32>
    }
    return
  }
}

There also is no real need to use affine here, just using loops gives the same result.

--linalg-bufferize --convert-linalg-to-loops --func-bufferize --buffer-results-to-out-params --tensor-constant-bufferize --finalizing-bufferize --buffer-deallocation --copy-removal --convert-scf-to-std --convert-std-to-llvm

gives you a blurb of llvm IR.

dpotop · February 24, 2021, 10:00pm

Thanks, I’ll try this tomorrow!
D.

@qaco

dpotop · February 25, 2021, 2:03pm

An extra remark here: Making copy removal depend on buffer deallocation means that you won’t be able to remove copies in code with static memory allocation.

Dumitru

@albertcohen

albertcohen · February 25, 2021, 4:03pm

Thanks @herhut for your suggestion.

And yes I agree it is not optimal at all. There are multiple directions here. Implementing another copy-removal pass at a higher level of abstraction seems like the way to go. I’m not saying it should replace the current pass though.

herhut · February 25, 2021, 5:27pm

The current copy removal pass was created to remove the copies that are introduced by the deallocation pass. And the transformation of results to extra arguments also relies on this.

For static allocation, you would need a copy for the latter and the former would not introduce any copies for static allocations (as it won’t try to free them).

But I agree that for the general problem, the current copy removal is not good enough.

Topic		Replies	Views
Tensor to memref conversion (a.k.a. bufferize) question MLIR	17	2187	November 10, 2020
Bug in func-bufferize MLIR	1	233	November 10, 2020
Bug in --convert-linalg-to-affine-loops? MLIR	22	1262	April 29, 2021
Properly using Bufferization related passes MLIR	7	1266	March 4, 2021
Bufferization error related to ```memref.clone``` MLIR	2	460	November 19, 2021

Redundant buffer synthesis in tf-opt

Related Topics