Memory leak for structure while op when using buffer-deallocation pass?

wyzero · May 31, 2021, 11:36am

Hi all,

It seems that buffer-deallocation pass can not handle structure while op correctly.

Here is an example input:

module  {
  func @dynamic_shape_while2(%arg0: memref<8x8xf32>) -> memref<8x8xf32> {
    %0 = scf.while (%arg1 = %arg0) : (memref<8x8xf32>) -> memref<8x8xf32> {
      %1 = memref.alloc() : memref<i1>
      "lmhlo.constant"(%1) {value = dense<true> : tensor<i1>} : (memref<i1>) -> ()
      %2 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %2) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %3 = memref.alloc() : memref<8x8xf32>
      "lmhlo.subtract"(%arg1, %2, %3) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %4 = memref.alloc() : memref<8x8xi1>
      "lmhlo.compare"(%arg1, %3, %4) {comparison_direction = "LT"} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xi1>) -> ()
      %5 = memref.alloc() : memref<i1>
      "lmhlo.reduce"(%4, %1, %5) ( {
      ^bb0(%arg2: memref<i1>, %arg3: memref<i1>, %arg4: memref<i1>):  // no predecessors
        "lmhlo.and"(%arg2, %arg3, %arg4) : (memref<i1>, memref<i1>, memref<i1>) -> ()
        "lmhlo.terminator"() : () -> ()
      }) {dimensions = dense<[0, 1]> : tensor<2xi64>} : (memref<8x8xi1>, memref<i1>, memref<i1>) -> ()
      %6 = memref.load %5[] : memref<i1>
      scf.condition(%6) %arg1 : memref<8x8xf32>
    } do {
    ^bb0(%arg1: memref<8x8xf32>):  // no predecessors
      %1 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %1) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %2 = memref.alloc() : memref<8x8xf32>
      "lmhlo.concatenate"(%1, %1, %2) {dimension = 0 : i64} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      scf.yield %2 : memref<8x8xf32>
    }
    return %0 : memref<8x8xf32>
  }
}

after applying -buffer-deallocation pass:

module  {
  func @dynamic_shape_while2(%arg0: memref<8x8xf32>) -> memref<8x8xf32> {
    %0 = memref.clone %arg0 : memref<8x8xf32> to memref<8x8xf32>
    %1 = scf.while (%arg1 = %0) : (memref<8x8xf32>) -> memref<8x8xf32> {
      %2 = memref.alloc() : memref<i1>
      "lmhlo.constant"(%2) {value = dense<true> : tensor<i1>} : (memref<i1>) -> ()
      %3 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %3) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %4 = memref.alloc() : memref<8x8xf32>
      "lmhlo.subtract"(%arg1, %3, %4) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      memref.dealloc %3 : memref<8x8xf32>
      %5 = memref.alloc() : memref<8x8xi1>
      "lmhlo.compare"(%arg1, %4, %5) {comparison_direction = "LT"} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xi1>) -> ()
      memref.dealloc %4 : memref<8x8xf32>
      %6 = memref.alloc() : memref<i1>
      "lmhlo.reduce"(%5, %2, %6) ( {
      ^bb0(%arg2: memref<i1>, %arg3: memref<i1>, %arg4: memref<i1>):  // no predecessors
        "lmhlo.and"(%arg2, %arg3, %arg4) : (memref<i1>, memref<i1>, memref<i1>) -> ()
        "lmhlo.terminator"() : () -> ()
      }) {dimensions = dense<[0, 1]> : tensor<2xi64>} : (memref<8x8xi1>, memref<i1>, memref<i1>) -> ()
      memref.dealloc %5 : memref<8x8xi1>
      memref.dealloc %2 : memref<i1>
      %7 = memref.load %6[] : memref<i1>
      memref.dealloc %6 : memref<i1>
      scf.condition(%7) %arg1 : memref<8x8xf32>
    } do {
    ^bb0(%arg1: memref<8x8xf32>):  // no predecessors
      %2 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %2) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %3 = memref.alloc() : memref<8x8xf32>
      "lmhlo.concatenate"(%2, %2, %3) {dimension = 0 : i64} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      memref.dealloc %2 : memref<8x8xf32>
      %4 = memref.clone %3 : memref<8x8xf32> to memref<8x8xf32>
      memref.dealloc %3 : memref<8x8xf32>
      scf.yield %4 : memref<8x8xf32>
    }
    return %1 : memref<8x8xf32>
  }
}

The loop variable %arg1 is never gotten freed. Is this a supposed behavior? Or is it a bug?

Thanks!

wyzero · June 1, 2021, 3:12am

The problem can be reduced to following code snippet.

scf.while (%arg0 = %init) {
  %0 = "test.make_condition"(%arg0) : (memref<8x8xf32>) -> i1
  scf.condition(%0) %arg0 : memref<8x8xf32>
} do {
^bb0(%arg1 : memref<8x8xf32>):
  %0 = memref.alloc() : memref<8x8xf32>
  "test.some_element_wise_operation"(%arg1, %0)
  scf.yeild %0 : memref<8x8xf32>
}

wyzero · June 1, 2021, 3:25am

And following picture:

wyzero · June 1, 2021, 5:23am

@herhut @dfki-mako Hi, could you help to take a look?

dfki-jugr · June 11, 2021, 1:42pm

Hey,
sorry for the delayed answer…
Yes, I think you are right. We should free %arg1, since we yield another allocated memory chunk.
We will investigate this issue.

dfki-jugr · June 14, 2021, 11:07am

We took a closer look into the code to investigate the issue further. The good news: This is not an issue in the BufferDeallocation pass. This pass needs an AliasAnalysis to find the related buffers in the code. In the analysis, we check for the ReturnLike interface to get a connection between different regions. In the While-loop, there are two blocks and in the first block, there is the condition check that uses scf.condition. This operation connects the two blocks and passes the arguments to the second block. At this point, we lose the alias connection in the analysis, since the condition operation does not implement the ReturnLike interface.
In general, this is the right behavior, but for the AliasAnalysis, we need further information and a connection between the operands of the condition op and the arguments in the following body-block of the While-op. Probably, this can be done with a new interface?

herhut · June 14, 2021, 12:29pm

Reading the documentation for RegionBranchOpInterface and the ReturnLike trait, it seems to me that ConditionOp is expected implement ReturnLike from the perspective of RegionBranchOpInterface. The interface seems to assume that all operations that terminate a region should be ReturnLike.

If we want to keep ReturnLike for things that return to the parent op, then we need something like RegionTerminator. The property we are trying to model here is that the operands of a terminator operation are forwarded 1:1 to the entry of the target region or the corresponding results of the parent op.

@River707 wdyt, as the original author of the interface? I would suggest to add ReturnLike for now.

dfki-mako · June 28, 2021, 1:02pm

Regarding the issues discussed above: we came up with an RFC CL on Phabricator that introduces the notion of a RegionBranchTerminatorOpInterface. This interface allows us to define which operands (of a terminator op in the scope of a parent RegionBranchOpInterface implementation) will be passed to the individual successor regions.

We also provided an initial draft implementation for the scf.ConditionOp that solves the alias-analysis problems related to the BufferDeallocation pass. Using this CL, your code can be converted to a bufferized program without any memory leaks

wyzero · June 28, 2021, 4:16pm

Great! It will be very useful for our current work (DISC upstream). I’ll take a try once it’s landing. Thanks again.

Topic		Replies	Views
How to bufferize scf.while? MLIR	3	339	November 12, 2020
MLIR Buffer Deallocation and memref store MLIR	1	254	August 31, 2021
Facing issues with bufferization cloneOp MLIR mlir	0	67	April 9, 2024
Bufferization: function call conventions? MLIR	10	523	January 18, 2021
Bufferize dynamic shape control flow MLIR	7	771	August 27, 2021

Memory leak for structure while op when using buffer-deallocation pass?

Related Topics