Hello everyone, I have recently been working on using the gpu dialect to implement asynchronous operations. However, when I tried to create a dynamic number of streams, I ran into the following problem. The example code is as follows:
```mlir
module attributes {gpu.container_module} {
  gpu.module @kernels {
    gpu.func @kernel(%mem : memref<?xi8>, %mem2 : memref<?xi8>) kernel {
      %tx = gpu.thread_id x
      %val = memref.load %mem[%tx] : memref<?xi8>
      gpu.printf "memref dev element %lld: %d\n" %tx, %val : index, i8
      memref.store %val, %mem2[%tx] : memref<?xi8>
      gpu.return
    }
  }
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %c3 = arith.constant 3 : index
    %c1i8 = arith.constant 1 : i8
    %c0i8 = arith.constant 100 : i8
    %c2i8 = arith.constant 101 : i8
    %c4i8 = arith.constant 102 : i8
    %c8i8 = arith.constant 103 : i8
    %mem = memref.alloca() : memref<4xi8>
    %mem2 = memref.alloca() : memref<4xi8>
    memref.store %c0i8, %mem[%c0] : memref<4xi8>
    memref.store %c2i8, %mem[%c1] : memref<4xi8>
    memref.store %c4i8, %mem[%c2] : memref<4xi8>
    memref.store %c8i8, %mem[%c3] : memref<4xi8>
    scf.for %arg0 = %c0 to %c3 step %c1 {
      %0 = memref.view %mem[%arg0][%c2] : memref<4xi8> to memref<?xi8>
      %t5 = gpu.wait async
      %dev2, %t7 = gpu.alloc async [%t5] (%c2) : memref<?xi8>
      %dev1, %t6 = gpu.alloc async [%t7] (%c2) : memref<?xi8>
      %t3 = gpu.memcpy async [%t6] %dev2, %0 : memref<?xi8>, memref<?xi8>
      %t4 = gpu.launch_func async [%t3] @kernels::@kernel blocks in (%c1, %c1, %c1) threads in (%c2, %c1, %c1) args(%dev2 : memref<?xi8>, %dev1 : memref<?xi8>)
      %1 = memref.view %mem2[%arg0][%c2] : memref<4xi8> to memref<?xi8>
      %t2 = gpu.memcpy async [%t4] %1, %dev1 : memref<?xi8>, memref<?xi8>
      %cast = memref.cast %mem2 : memref<4xi8> to memref<*xi8>
      func.call @printMemrefI8(%cast) : (memref<*xi8>) -> ()
      %t9 = gpu.dealloc async [%t2] %dev2 : memref<?xi8>
      %t8 = gpu.dealloc async [%t9] %dev1 : memref<?xi8>
      gpu.wait [%t8]
    }
    return
  }
  func.func private @printMemrefI8(memref<*xi8>) attributes { llvm.emit_c_interface }
}
```
In the above example, I want to create three streams so that data transfers and computation run asynchronously. The number of streams is determined by the lower and upper bounds of the `scf.for`, which means it is configurable and determined at runtime (dynamic).

However, my test results show that this kind of stream division performs even worse than the non-divided version. I suspect the `gpu.wait [%t8]` blocks the creation and execution of the next stream.

Is there a way to solve this problem, e.g. by moving the `gpu.wait` outside of the loop and collecting the `!gpu.async.token`s to be the operands of the `gpu.wait`? @csigg
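
To make the idea concrete, here is a rough sketch of what I have in mind (untested; I am not sure the async lowering supports chaining tokens through `iter_args` like this). Each iteration still starts its own chain with `gpu.wait async`, but the accumulated token from previous iterations is added as an extra dependency of the last async op, so a single blocking `gpu.wait` after the loop transitively covers all iterations:

```mlir
// Sketch only: thread an accumulated token through scf.for's iter_args
// so the blocking gpu.wait can move outside the loop.
%t0 = gpu.wait async
%done = scf.for %arg0 = %c0 to %c3 step %c1
    iter_args(%prev = %t0) -> (!gpu.async.token) {
  %t5 = gpu.wait async   // fresh chain: this iteration's own stream
  // ... gpu.alloc / gpu.memcpy / gpu.launch_func / gpu.memcpy as before ...
  %t9 = gpu.dealloc async [%t2] %dev2 : memref<?xi8>
  // extra dependency on %prev so the yielded token covers all
  // iterations seen so far
  %t8 = gpu.dealloc async [%t9, %prev] %dev1 : memref<?xi8>
  scf.yield %t8 : !gpu.async.token
}
gpu.wait [%done]         // single blocking wait for all iterations
```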