How to lower combinations of async gpu ops in the `gpu` dialect

Recently I have been working with the gpu dialect to take full advantage of NVIDIA GPUs' multi-threading capability, but I ran into some problems when I tried to use the async attribute on some ops.
When I write the following code in test.mlir:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %5, %6 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %3 = gpu.dealloc async [%2] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%6] %5 : memref<?xf32>
    gpu.wait [%3]
    return
  }
}

and lower it with the following pipeline:

mlir-opt test.mlir -llvm-request-c-wrappers | \
mlir-opt -gpu-to-llvm | \
mlir-opt -reconcile-unrealized-casts

it outputs LLVM IR that works well with my C code (hence the -llvm-request-c-wrappers pass).
However, when I try to add a gpu.memcpy op to my code:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %5, %6 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %7 = gpu.memcpy async [%2, %6] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%2] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%6] %5 : memref<?xf32>
    gpu.wait [%3]
    return
  }
}

and lower it with the same pipeline, I get the following error:

<stdin>:7:10: error: failed to legalize operation 'gpu.memcpy' that was explicitly marked illegal
    %1 = gpu.memcpy async [%asyncToken, %asyncToken_1] %memref, %memref_0 : memref<?xf32>, memref<?xf32>
         ^
<stdin>:7:10: note: see current operation: %34 = "gpu.memcpy"(%19#1, %33#1, %19#0, %33#0) : (!gpu.async.token, !gpu.async.token, memref<?xf32>, memref<?xf32>) -> !gpu.async.token
module {
}

I have tried other pipelines, but they all eventually failed. So what is the right pipeline to lower this? I also want to add a gpu.launch_func op to my code, so what is the right pipeline for lowering MLIR code containing gpu.wait, gpu.alloc, gpu.memcpy, and gpu.launch_func?
(The LLVM version I use is llvmorg-16.0.6, and the CUDA version is 11.8, on an NVIDIA GeForce RTX 2060 SUPER.)

Have you looked at the integration tests in-tree? For example, it seems that mlir/test/Integration/GPU/CUDA/async.mlir uses gpu.memcpy in async mode?

There are two main differences between the two situations:

  1. The examples in mlir/test/Integration/GPU/CUDA/async.mlir combine the async dialect with the gpu dialect, while here I want to use only the gpu dialect;
  2. The type of the async token is !async.token in the async dialect, while in the gpu dialect it is !gpu.async.token, and the two are not compatible with each other (see the sketch below).
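
To illustrate the incompatibility, here is a minimal sketch (my illustration, not taken from the test) of where each token type comes from:

// !async.token is produced by async dialect ops:
%t = async.execute {
  async.yield
}
// here %t : !async.token

// !gpu.async.token is produced by async gpu dialect ops:
%g = gpu.wait async
// here %g : !gpu.async.token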

I can successfully run the example in mlir/test/Integration/GPU/CUDA/async.mlir, and currently I'm trying to combine the two dialects to solve my problem. But I would still like to know the solution for the gpu-dialect-only situation, which would bring great convenience to my development. :thinking:

I think I have found the solution.
The problem occurs because the number of gpu.memcpy dependency tokens must equal 1, as enforced in mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp:

LogicalResult ConvertMemcpyOpToGpuRuntimeCallPattern::matchAndRewrite(
    gpu::MemcpyOp memcpyOp, OpAdaptor adaptor,
    ConversionPatternRewriter &rewriter) const {
  auto memRefType = memcpyOp.getSrc().getType().cast<MemRefType>();

  // The constraints: operand types must be LLVM-compatible, the memref
  // must have an identity layout, and the op must have exactly one async
  // dependency token (isAsyncWithOneDependency).
  if (failed(areAllLLVMTypes(memcpyOp, adaptor.getOperands(), rewriter)) ||
      !isConvertibleAndHasIdentityMaps(memRefType) ||
      failed(isAsyncWithOneDependency(rewriter, memcpyOp)))
    return failure();
  ...
}

So if gpu.memcpy is given more than one dependency, the rewriter fails to convert the op into CUDA runtime calls, leaving gpu.memcpy unchanged; and since that op is explicitly marked illegal by the pass, legalization fails with the error above.
When I reduced gpu.memcpy to a single dependency, as below, the -gpu-to-llvm pass worked perfectly:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %5 = memref.alloca(%c2) : memref<?xf32>
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %7 = gpu.memcpy async [%2] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%2] %1 : memref<?xf32>
    gpu.wait [%3]
    return
  }
}

However, this leads to another problem:
when I add the -gpu-async-region pass before the -gpu-to-llvm pass, it stops working again, because the IR generated after -gpu-async-region looks like this:

module attributes {gpu.container_module} {
  func.func @main() attributes {llvm.emit_c_interface} {
    %c2 = arith.constant 2 : index
    %alloca = memref.alloca(%c2) : memref<?xf32>
    %0 = gpu.wait async
    %memref, %asyncToken = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %1 = gpu.memcpy async [%asyncToken] %memref, %alloca : memref<?xf32>, memref<?xf32>
    %2 = gpu.dealloc async [%1, %asyncToken] %memref : memref<?xf32>
    gpu.wait [%2]
    return
  }
}

I noticed that the -gpu-async-region pass adds an extra dependency to the gpu.dealloc op, but gpu.dealloc allows only one dependency, which again causes the “illegal op” error.
So why does this happen? Is it a bug in the -gpu-async-region pass? @mehdi_amini

I haven’t looked at these flows in a very long time, and I don’t think they are actively developed right now; maybe @herhut, @csigg, or @ezhulenev remember the intent?

The -gpu-async-region pass expects non-async gpu ops; see this comment.
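
For example, the pass expects input roughly like the sketch below (an illustration, not taken from the in-tree tests): plain synchronous gpu ops, which -gpu-async-region then rewrites into their async forms chained by !gpu.async.token values.

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    // Synchronous ops; the pass itself adds the async tokens.
    %memref = gpu.alloc (%c2) : memref<?xf32>
    gpu.dealloc %memref : memref<?xf32>
    return
  }
}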

How do you expect your initial memcpy example to map to streams? If it should map to a single stream (you only have one gpu.wait async, which lowers to creating a stream), you could try:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %5, %6 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %7 = gpu.memcpy async [%0] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%7] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%7] %5 : memref<?xf32>
    gpu.wait [%3]
    gpu.wait [%4]
    return
  }
}

On the other hand, if you would like to map it to multiple streams, something like this might work:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %9 = gpu.wait async
    %5, %6 = gpu.alloc async [%9] (%c2) : memref<?xf32>
    %7 = gpu.wait async [%2, %6]
    %8 = gpu.memcpy async [%7] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%8] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%8] %5 : memref<?xf32>
    gpu.wait [%3, %4]
    return
  }
}

Generally though, the idea of gpu async is to build on top of the async dialect and map async regions to individual streams, synchronized with events across async and parent regions.
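
As a rough sketch of that style (modeled loosely on mlir/test/Integration/GPU/CUDA/async.mlir; the kernel @kernels::@kernel is a hypothetical placeholder), where -gpu-async-region can map each async.execute region to its own stream:

%c1 = arith.constant 1 : index
// Two regions that may run on separate streams; the explicit dependency
// makes the second region wait on the first, synchronized via an event.
%t0 = async.execute {
  gpu.launch_func @kernels::@kernel
      blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
  async.yield
}
%t1 = async.execute [%t0] {
  gpu.launch_func @kernels::@kernel
      blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
  async.yield
}
async.await %t1 : !async.token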

That is really helpful! But I still have some questions:

So in the first example, do the ops execute in linear order? (As there is only one gpu.wait async op, which lowers to creating only one stream, are all the ops arranged on that single stream?)

And in the second example, are there 3 streams during execution because of the 3 gpu.wait async ops? The first 2 streams are responsible for allocating memory on the GPU and are then destroyed, with the third stream taking over and doing the memcpy and deallocation ops. Did I get that right?