We discussed software pipelining in a previous thread as a way to hide latency on targets such as GPUs.
One transformation commonly needed in combination with pipelining is double-buffering. It allows breaking the dependencies between consecutive iterations of a loop that reuse the same temporary buffer.
This is an optimization I need in order to improve the efficiency of software pipelining on Nvidia GPUs. I just sent a patch adding initial support for this transformation:
https://reviews.llvm.org/D119406
Right now it lives in the memref dialect transformations (there may be a better place?), since memref is the main dependency, but it also depends on the SCF dialect.
The strategy taken by this transformation is to look for AllocOps that are used in a loop and are first fully overwritten before being read by other ops. In that case we can prepend a new dimension to the AllocOp and have the loop round-robin through the different slices of the enlarged allocation.
For example, it transforms:
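As a plain-Python illustration (a hedged model, not the actual pass code), the effect of round-robining through the slices is that consecutive iterations write to different slots, so overwriting the buffer in iteration i+1 no longer conflicts with the use of the buffer in iteration i:

```python
NUM_SLICES = 2  # classic double-buffering; any multi-buffering factor works

# Model of the rewritten loop: each iteration gets its own slot,
# so the "fully overwrite" stage of one iteration does not clobber
# the slot the previous iteration is still using.
buffers = [None] * NUM_SLICES
slots_used = []
for i in range(4):
    slot = i % NUM_SLICES          # round-robin slice selection
    buffers[slot] = f"data {i}"    # overwrite stage (e.g. the copy)
    slots_used.append(slot)        # use stage reads buffers[slot]

print(slots_used)  # [0, 1, 0, 1]: consecutive iterations use different slots
```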
func @multi_buffer(%a: memref<1024x1024xf32>) {
  %0 = memref.alloc() : memref<4x128xf32>
  %c1024 = arith.constant 1024 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  scf.for %arg2 = %c1 to %c1024 step %c3 {
    %1 = memref.subview %a[%arg2, 0] [4, 128] [1, 1] :
      memref<1024x1024xf32> to memref<4x128xf32, affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>>
    memref.copy %1, %0 : memref<4x128xf32, affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>> to memref<4x128xf32>
    "some_use"(%0) : (memref<4x128xf32>) -> ()
  }
  return
}
into:
func @multi_buffer(%arg0: memref<1024x1024xf32>) {
  %0 = memref.alloc() : memref<5x4x128xf32>
  %c1024 = arith.constant 1024 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  scf.for %arg1 = %c1 to %c1024 step %c3 {
    %1 = arith.subi %arg1, %c1 : index
    %2 = arith.divsi %1, %c3 : index
    %c5 = arith.constant 5 : index
    %3 = arith.remsi %2, %c5 : index
    %4 = memref.subview %0[%3, 0, 0] [1, 4, 128] [1, 1, 1] : memref<5x4x128xf32> to memref<4x128xf32, #map0>
    %5 = memref.subview %arg0[%arg1, 0] [4, 128] [1, 1] : memref<1024x1024xf32> to memref<4x128xf32, #map1>
    memref.copy %5, %4 : memref<4x128xf32, #map1> to memref<4x128xf32, #map0>
    "some_use"(%4) : (memref<4x128xf32, #map0>) -> ()
  }
  return
}
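The index arithmetic generated above (the arith.subi / arith.divsi / arith.remsi chain) selects which slice of the enlarged allocation the current iteration uses. A small Python model with this example's values (lower bound 1, step 3, multi-buffering factor 5); note that %arg1 is always >= the lower bound here, so Python floor division matches divsi:

```python
def buffer_slice(iv, lb, step, factor):
    # ((iv - lb) / step) % factor, mirroring the subi/divsi/remsi chain
    return ((iv - lb) // step) % factor

# Iterations iv = 1, 4, 7, 10, 13, 16 cycle through slices 0..4, then wrap.
slices = [buffer_slice(iv, lb=1, step=3, factor=5) for iv in range(1, 17, 3)]
print(slices)  # [0, 1, 2, 3, 4, 0]
```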
Let me know if you are interested in discussing whether such a transformation could be helpful for your architecture, or if you have concerns with the design.