Is there a way to create a dynamic number of streams on the GPU?

Hello guys, recently I have been working with the gpu dialect to implement asynchronous operations. However, when I tried to create a dynamic number of streams, I ran into the following problem.

The example code is as follows:

module attributes {gpu.container_module} {
    gpu.module @kernels {
            gpu.func @kernel (%mem : memref<?xi8>, %mem2 : memref<?xi8>) kernel {
                %tx = gpu.thread_id x
                %val = memref.load %mem[%tx] : memref<?xi8>
                gpu.printf "memref dev element %lld: %d\n" %tx, %val : index, i8
                memref.store %val, %mem2[%tx] : memref<?xi8>
                gpu.return
            }
        }

    func.func @main() {
        %c0 = arith.constant 0 : index
        %c1 = arith.constant 1 : index
        %c2 = arith.constant 2 : index
        %c3 = arith.constant 3 : index
        %c1i8 = arith.constant 1 : i8
        %c0i8 = arith.constant 100 : i8
        %c2i8 = arith.constant 101 : i8
        %c4i8 = arith.constant 102 : i8
        %c8i8 = arith.constant 103 : i8
        %mem = memref.alloca() : memref<4xi8>
        %mem2 = memref.alloca() : memref<4xi8>
        memref.store %c0i8, %mem[%c0] : memref<4xi8>
        memref.store %c2i8, %mem[%c1] : memref<4xi8>
        memref.store %c4i8, %mem[%c2] : memref<4xi8>
        memref.store %c8i8, %mem[%c3] : memref<4xi8>
        scf.for %arg0 = %c0 to %c3 step %c1 {
            %0 = memref.view %mem[%arg0][%c2] : memref<4xi8> to memref<?xi8>
            %t5 = gpu.wait async
            %dev2, %t7 = gpu.alloc async [%t5] (%c2) : memref<?xi8>
            %dev1, %t6 = gpu.alloc async [%t7] (%c2) : memref<?xi8>
            %t3 = gpu.memcpy async [%t6] %dev2, %0 : memref<?xi8>, memref<?xi8>
            %t4 = gpu.launch_func async [%t3] @kernels::@kernel blocks in (%c1, %c1, %c1) threads in (%c2, %c1, %c1) args(%dev2 : memref<?xi8>, %dev1 : memref<?xi8>)
            %1 = memref.view %mem2[%arg0][%c2] : memref<4xi8> to memref<?xi8>
            %t2 = gpu.memcpy async [%t4] %1, %dev1 : memref<?xi8>, memref<?xi8>
            %cast = memref.cast %mem2 : memref<4xi8> to memref<*xi8>
            func.call @printMemrefI8(%cast) : (memref<*xi8>) -> ()
            %t9 = gpu.dealloc async [%t2] %dev2 : memref<?xi8>
            %t8 = gpu.dealloc async [%t9] %dev1 : memref<?xi8>
            gpu.wait [%t8]
        }
        return
    }

    func.func private @printMemrefI8(memref<*xi8>) attributes { llvm.emit_c_interface }
}

In the above example, I want to create three streams so that the data transfers and the computation run asynchronously. The number of streams is determined by the lower and upper bounds of the scf.for, which means it is configurable and only known at runtime (dynamic).

However, the test results show that this kind of stream division performs even worse than the non-divided version. I suspect the gpu.wait [%t8] blocks the creation and execution of the next stream.

Therefore, is there a way to solve this problem? For example, could the gpu.wait be moved outside of the loop, with the !gpu.async.token values collected as operands of a single gpu.wait? @csigg
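
To make the idea concrete, here is a rough, untested sketch of what I have in mind. Since scf.for cannot yield a variable number of tokens, the sketch carries a single "joined" token through iter_args, merging each iteration's final token into it with an async gpu.wait; the names %t_init, %acc, %joined and the elided body are my own illustration, not code I have verified to lower correctly:

    %t_init = gpu.wait async
    %done = scf.for %arg0 = %c0 to %c3 step %c1
        iter_args(%acc = %t_init) -> (!gpu.async.token) {
      // Fresh chain per iteration, so each one can run on its own stream.
      %t5 = gpu.wait async
      // ... alloc / memcpy / launch_func / dealloc chained off %t5 as in the
      //     loop body above, ending in that iteration's last token %t8 ...
      // Join this iteration's completion with all previous iterations.
      %joined = gpu.wait async [%acc, %t8]
      scf.yield %joined : !gpu.async.token
    }
    // Single host-side synchronization after the loop.
    gpu.wait [%done]

If the per-iteration chains each start from their own gpu.wait async and only the accumulated token is waited on at the end, the iterations should no longer be serialized by a blocking wait inside the loop body. Is this a supported pattern, or is there a better way?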