Loop double-buffering/multi-buffering

We discussed software pipelining in a previous thread as a way to hide latency on targets such as GPUs.
One common transformation needed in combination with pipelining is double-buffering (or, more generally, multi-buffering). It breaks the dependencies between consecutive iterations of a loop that reuse the same temporary buffer, so that the copy feeding the next iteration can overlap with the use of the current one.

This is an optimization I need in order to improve the efficiency of software pipelining on Nvidia GPUs. I just sent a patch adding first support for such a transformation:
https://reviews.llvm.org/D119406
Right now it lives in the memref dialect transformations (there may be a better place?), since memref is the main dependency, but it also depends on the SCF dialect.

The strategy taken by this transformation is to look for AllocOps that are used within a loop and that are first fully overwritten and then read by other ops in each iteration. In that case we can prepend a new leading dimension to the AllocOp and have the loop round-robin through the different slices of the enlarged AllocOp.
For example it transforms:

func @multi_buffer(%a: memref<1024x1024xf32>) {
  %0 = memref.alloc() : memref<4x128xf32>
  %c1024 = arith.constant 1024 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  scf.for %arg2 = %c1 to %c1024 step %c3 {
    %1 = memref.subview %a[%arg2, 0] [4, 128] [1, 1] :
      memref<1024x1024xf32> to memref<4x128xf32, affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>>
    memref.copy %1, %0 : memref<4x128xf32, affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>> to memref<4x128xf32>
    "some_use"(%0) : (memref<4x128xf32>) -> ()
  }
  return
}

To

  func @multi_buffer(%arg0: memref<1024x1024xf32>) {
    %0 = memref.alloc() : memref<5x4x128xf32>
    %c1024 = arith.constant 1024 : index
    %c1 = arith.constant 1 : index
    %c3 = arith.constant 3 : index
    scf.for %arg1 = %c1 to %c1024 step %c3 {
      %1 = arith.subi %arg1, %c1 : index
      %2 = arith.divsi %1, %c3 : index
      %c5 = arith.constant 5 : index
      %3 = arith.remsi %2, %c5 : index
      %4 = memref.subview %0[%3, 0, 0] [1, 4, 128] [1, 1, 1] : memref<5x4x128xf32> to memref<4x128xf32, #map0>
      %5 = memref.subview %arg0[%arg1, 0] [4, 128] [1, 1] : memref<1024x1024xf32> to memref<4x128xf32, #map1>
      memref.copy %5, %4 : memref<4x128xf32, #map1> to memref<4x128xf32, #map0>
      "some_use"(%4) : (memref<4x128xf32, #map0>) -> ()
    }
    return
  }
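
For reference, the index computation at the top of the transformed loop body is simply the normalized iteration count taken modulo the multi-buffering factor. Annotated in isolation (an illustrative sketch reusing the bounds of the example above, not extra output of the patch):

  // slot = ((iv - lower_bound) / step) mod factor
  // Here iv = %arg1, lower_bound = 1, step = 3 and factor = 5.
  %1 = arith.subi %arg1, %c1 : index  // distance from the lower bound
  %2 = arith.divsi %1, %c3 : index    // normalized iteration count
  %3 = arith.remsi %2, %c5 : index    // round-robin slot into the 5x4x128 alloc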

Let me know if anybody is interested in discussing whether such a transformation could be helpful for their architecture, or if there are concerns with the design.

Normally, I'd just expect a %iv % 5 computation within an affine.apply op in your transformed IR above. Such affine.apply ops can fold into other ops as well (see the fold-memref-subview-ops pass for example) and in general enable better analysis and optimization (for example, they will further compose into any affine.load and affine.store ops that the subview itself folds into). You can use the higher-level abstractions instead of the lower-level ones; the transformed code will also be more compact and readable.
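
Concretely, the transformed loop could then look something like the sketch below, with the index computation expressed through a single affine.apply instead of the arith.subi/divsi/remsi sequence (a rough illustration with the layout maps written out, not what the patch currently produces):

#src_layout = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>
#dst_layout = affine_map<(d0, d1)[s0] -> (d0 * 128 + s0 + d1)>
// Same round-robin index as above: ((iv - 1) floordiv 3) mod 5.
#buf_index = affine_map<(d0) -> (((d0 - 1) floordiv 3) mod 5)>

func @multi_buffer(%arg0: memref<1024x1024xf32>) {
  %0 = memref.alloc() : memref<5x4x128xf32>
  %c1024 = arith.constant 1024 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  scf.for %arg1 = %c1 to %c1024 step %c3 {
    %1 = affine.apply #buf_index(%arg1)
    %2 = memref.subview %0[%1, 0, 0] [1, 4, 128] [1, 1, 1] : memref<5x4x128xf32> to memref<4x128xf32, #dst_layout>
    %3 = memref.subview %arg0[%arg1, 0] [4, 128] [1, 1] : memref<1024x1024xf32> to memref<4x128xf32, #src_layout>
    memref.copy %3, %2 : memref<4x128xf32, #src_layout> to memref<4x128xf32, #dst_layout>
    "some_use"(%2) : (memref<4x128xf32, #dst_layout>) -> ()
  }
  return
}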