Loop double-buffering/multi-buffering

We discussed software pipelining in a previous thread as a way to hide latency on targets such as GPUs.
One common transformation needed in combination with pipelining is double-buffering (or, more generally, multi-buffering). It breaks the dependencies between consecutive iterations of a loop that reuse the same temporary buffer, so that the copy feeding the next iteration can overlap with the use of the current one.

This is an optimization I need in order to improve the efficiency of software pipelining on Nvidia GPUs. I just sent a patch adding first support for such a transformation:
https://reviews.llvm.org/D119406
Right now it lives in the memref dialect transformations (there may be a better place?), since memref is the main dependency, but it also depends on the SCF dialect.

The strategy taken by this transformation is to look for AllocOps that are used within a loop and that are first fully overwritten and then read by other ops in each iteration. In that case we can prepend a new leading dimension to the AllocOp and have the loop round-robin through the different slices of the enlarged AllocOp.
For example it transforms:

func @multi_buffer(%a: memref<1024x1024xf32>) {
  %0 = memref.alloc() : memref<4x128xf32>
  %c1024 = arith.constant 1024 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  scf.for %arg2 = %c1 to %c1024 step %c3 {
    %1 = memref.subview %a[%arg2, 0] [4, 128] [1, 1] :
      memref<1024x1024xf32> to memref<4x128xf32, affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>>
    memref.copy %1, %0 : memref<4x128xf32, affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>> to memref<4x128xf32>
    "some_use"(%0) : (memref<4x128xf32>) -> ()
  }
  return
}

To

  func @multi_buffer(%arg0: memref<1024x1024xf32>) {
    %0 = memref.alloc() : memref<5x4x128xf32>
    %c1024 = arith.constant 1024 : index
    %c1 = arith.constant 1 : index
    %c3 = arith.constant 3 : index
    scf.for %arg1 = %c1 to %c1024 step %c3 {
      %1 = arith.subi %arg1, %c1 : index
      %2 = arith.divsi %1, %c3 : index
      %c5 = arith.constant 5 : index
      %3 = arith.remsi %2, %c5 : index
      %4 = memref.subview %0[%3, 0, 0] [1, 4, 128] [1, 1, 1] : memref<5x4x128xf32> to memref<4x128xf32, #map0>
      %5 = memref.subview %arg0[%arg1, 0] [4, 128] [1, 1] : memref<1024x1024xf32> to memref<4x128xf32, #map1>
      memref.copy %5, %4 : memref<4x128xf32, #map1> to memref<4x128xf32, #map0>
      "some_use"(%4) : (memref<4x128xf32, #map0>) -> ()
    }
    return
  }
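
For reference, the index computation at the top of the transformed loop body is simply the normalized iteration count taken modulo the multi-buffering factor. Annotated in isolation (an illustrative sketch reusing the bounds of the example above, not extra output of the patch):

  // slot = ((iv - lower_bound) / step) mod factor
  // Here iv = %arg1, lower_bound = 1, step = 3 and factor = 5.
  %1 = arith.subi %arg1, %c1 : index  // distance from the lower bound
  %2 = arith.divsi %1, %c3 : index    // normalized iteration count
  %3 = arith.remsi %2, %c5 : index    // round-robin slot into the 5x4x128 alloc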

Let me know if anybody is interested in discussing whether such a transformation could be helpful for their architecture, or if there are concerns with the design.

Normally, I'd just expect a %iv % 5 computation within an affine.apply op in your transformed IR above. Such affine.apply ops can fold into other ops as well (see the fold-memref-subview-ops pass for example) and in general enable better analysis and optimization (for example, they will further compose into any affine.load and affine.store ops that the subview itself folds into). You can use the higher-level abstractions instead of the lower-level ones; the transformed code will also be more compact and readable.
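
Concretely, the transformed loop could then look something like the sketch below, with the index computation expressed through a single affine.apply instead of the arith.subi/divsi/remsi sequence (a rough illustration with the layout maps written out, not what the patch currently produces):

#src_layout = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>
#dst_layout = affine_map<(d0, d1)[s0] -> (d0 * 128 + s0 + d1)>
// Same round-robin index as above: ((iv - 1) floordiv 3) mod 5.
#buf_index = affine_map<(d0) -> (((d0 - 1) floordiv 3) mod 5)>

func @multi_buffer(%arg0: memref<1024x1024xf32>) {
  %0 = memref.alloc() : memref<5x4x128xf32>
  %c1024 = arith.constant 1024 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  scf.for %arg1 = %c1 to %c1024 step %c3 {
    %1 = affine.apply #buf_index(%arg1)
    %2 = memref.subview %0[%1, 0, 0] [1, 4, 128] [1, 1, 1] : memref<5x4x128xf32> to memref<4x128xf32, #dst_layout>
    %3 = memref.subview %arg0[%arg1, 0] [4, 128] [1, 1] : memref<1024x1024xf32> to memref<4x128xf32, #src_layout>
    memref.copy %3, %2 : memref<4x128xf32, #src_layout> to memref<4x128xf32, #dst_layout>
    "some_use"(%2) : (memref<4x128xf32, #dst_layout>) -> ()
  }
  return
}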