Memory leak for structure while op when using buffer-deallocation pass?

Hi all,

It seems that buffer-deallocation pass can not handle structure while op correctly.

Here is an example input:

module  {
  func @dynamic_shape_while2(%arg0: memref<8x8xf32>) -> memref<8x8xf32> {
    %0 = scf.while (%arg1 = %arg0) : (memref<8x8xf32>) -> memref<8x8xf32> {
      %1 = memref.alloc() : memref<i1>
      "lmhlo.constant"(%1) {value = dense<true> : tensor<i1>} : (memref<i1>) -> ()
      %2 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %2) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %3 = memref.alloc() : memref<8x8xf32>
      "lmhlo.subtract"(%arg1, %2, %3) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %4 = memref.alloc() : memref<8x8xi1>
      "lmhlo.compare"(%arg1, %3, %4) {comparison_direction = "LT"} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xi1>) -> ()
      %5 = memref.alloc() : memref<i1>
      "lmhlo.reduce"(%4, %1, %5) ( {
      ^bb0(%arg2: memref<i1>, %arg3: memref<i1>, %arg4: memref<i1>):  // no predecessors
        "lmhlo.and"(%arg2, %arg3, %arg4) : (memref<i1>, memref<i1>, memref<i1>) -> ()
        "lmhlo.terminator"() : () -> ()
      }) {dimensions = dense<[0, 1]> : tensor<2xi64>} : (memref<8x8xi1>, memref<i1>, memref<i1>) -> ()
      %6 = memref.load %5[] : memref<i1>
      scf.condition(%6) %arg1 : memref<8x8xf32>
    } do {
    ^bb0(%arg1: memref<8x8xf32>):  // no predecessors
      %1 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %1) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %2 = memref.alloc() : memref<8x8xf32>
      "lmhlo.concatenate"(%1, %1, %2) {dimension = 0 : i64} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      scf.yield %2 : memref<8x8xf32>
    }
    return %0 : memref<8x8xf32>
  }
}

after applying -buffer-deallocation pass:

module  {
  func @dynamic_shape_while2(%arg0: memref<8x8xf32>) -> memref<8x8xf32> {
    %0 = memref.clone %arg0 : memref<8x8xf32> to memref<8x8xf32>
    %1 = scf.while (%arg1 = %0) : (memref<8x8xf32>) -> memref<8x8xf32> {
      %2 = memref.alloc() : memref<i1>
      "lmhlo.constant"(%2) {value = dense<true> : tensor<i1>} : (memref<i1>) -> ()
      %3 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %3) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %4 = memref.alloc() : memref<8x8xf32>
      "lmhlo.subtract"(%arg1, %3, %4) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      memref.dealloc %3 : memref<8x8xf32>
      %5 = memref.alloc() : memref<8x8xi1>
      "lmhlo.compare"(%arg1, %4, %5) {comparison_direction = "LT"} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xi1>) -> ()
      memref.dealloc %4 : memref<8x8xf32>
      %6 = memref.alloc() : memref<i1>
      "lmhlo.reduce"(%5, %2, %6) ( {
      ^bb0(%arg2: memref<i1>, %arg3: memref<i1>, %arg4: memref<i1>):  // no predecessors
        "lmhlo.and"(%arg2, %arg3, %arg4) : (memref<i1>, memref<i1>, memref<i1>) -> ()
        "lmhlo.terminator"() : () -> ()
      }) {dimensions = dense<[0, 1]> : tensor<2xi64>} : (memref<8x8xi1>, memref<i1>, memref<i1>) -> ()
      memref.dealloc %5 : memref<8x8xi1>
      memref.dealloc %2 : memref<i1>
      %7 = memref.load %6[] : memref<i1>
      memref.dealloc %6 : memref<i1>
      scf.condition(%7) %arg1 : memref<8x8xf32>
    } do {
    ^bb0(%arg1: memref<8x8xf32>):  // no predecessors
      %2 = memref.alloc() : memref<8x8xf32>
      "lmhlo.multiply"(%arg1, %arg1, %2) : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      %3 = memref.alloc() : memref<8x8xf32>
      "lmhlo.concatenate"(%2, %2, %3) {dimension = 0 : i64} : (memref<8x8xf32>, memref<8x8xf32>, memref<8x8xf32>) -> ()
      memref.dealloc %2 : memref<8x8xf32>
      %4 = memref.clone %3 : memref<8x8xf32> to memref<8x8xf32>
      memref.dealloc %3 : memref<8x8xf32>
      scf.yield %4 : memref<8x8xf32>
    }
    return %1 : memref<8x8xf32>
  }
}

The loop variable %arg1 is never gotten freed. Is this a supposed behavior? Or is it a bug?

Thanks!

The problem can be reduced to following code snippet.

scf.while (%arg0 = %init) {
  %0 = "test.make_condition"(%arg0) : (memref<8x8xf32>) -> i1
  scf.condition(%0) %arg0 : memref<8x8xf32>
} do {
^bb0(%arg1 : memref<8x8xf32>):
  %0 = memref.alloc() : memref<8x8xf32>
  "test.some_element_wise_operation"(%arg1, %0)
  scf.yeild %0 : memref<8x8xf32>
}

And following picture:

@herhut @dfki-mako Hi, could you help to take a look?

Hey,
sorry for the delayed answer…
Yes, I think you are right. We should free %arg1, since we yield another allocated memory chunk.
We will investigate this issue.

We took a closer look into the code to investigate the issue further. The good news: This is not an issue in the BufferDeallocation pass. This pass needs an AliasAnalysis to find the related buffers in the code. In the analysis, we check for the ReturnLike interface to get a connection between different regions. In the While-loop, there are two blocks and in the first block, there is the condition check that uses scf.condition. This operation connects the two blocks and passes the arguments to the second block. At this point, we lose the alias connection in the analysis, since the condition operation does not implement the ReturnLike interface.
In general, this is the right behavior, but for the AliasAnalysis, we need further information and a connection between the operands of the condition op and the arguments in the following body-block of the While-op. Probably, this can be done with a new interface?

Reading the documentation for RegionBranchOpInterface and the ReturnLike trait, it seems to me that ConditionOp is expected implement ReturnLike from the perspective of RegionBranchOpInterface. The interface seems to assume that all operations that terminate a region should be ReturnLike.

If we want to keep ReturnLike for things that return to the parent op, then we need something like RegionTerminator. The property we are trying to model here is that the operands of a terminator operation are forwarded 1:1 to the entry of the target region or the corresponding results of the parent op.

@River707 wdyt, as the original author of the interface? I would suggest to add ReturnLike for now.

Regarding the issues discussed above: we came up with an RFC CL on Phabricator that introduces the notion of a RegionBranchTerminatorOpInterface. This interface allows us to define which operands (of a terminator op in the scope of a parent RegionBranchOpInterface implementation) will be passed to the individual successor regions.

We also provided an initial draft implementation for the scf.ConditionOp that solves the alias-analysis problems related to the BufferDeallocation pass. Using this CL, your code can be converted to a bufferized program without any memory leaks :slight_smile:

Great! It will be very useful for our current work (DISC upstream). I’ll take a try once it’s landing. Thanks again.