How to lower combinations of async gpu ops in the `gpu` dialect

Recently I have been working with the gpu dialect to take full advantage of NVIDIA GPUs' multi-threading capability, but I ran into some problems when I tried to use the async attribute on some ops.
When I write the following code in test.mlir:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %5, %6 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %3 = gpu.dealloc async [%2] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%6] %5 : memref<?xf32>
    gpu.wait [%3]
    return
  }
}

and lower it with the following pipeline:

mlir-opt test.mlir -llvm-request-c-wrappers | \
mlir-opt -gpu-to-llvm | \
mlir-opt -reconcile-unrealized-casts

it outputs LLVM IR that works well with my C code (hence the -llvm-request-c-wrappers pass).
However, when I try to add a gpu.memcpy op to my code:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %5, %6 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %7 = gpu.memcpy async [%2, %6] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%2] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%6] %5 : memref<?xf32>
    gpu.wait [%3]
    return
  }
}

and lower it with the same pipeline, I get the following error:

<stdin>:7:10: error: failed to legalize operation 'gpu.memcpy' that was explicitly marked illegal
    %1 = gpu.memcpy async [%asyncToken, %asyncToken_1] %memref, %memref_0 : memref<?xf32>, memref<?xf32>
         ^
<stdin>:7:10: note: see current operation: %34 = "gpu.memcpy"(%19#1, %33#1, %19#0, %33#0) : (!gpu.async.token, !gpu.async.token, memref<?xf32>, memref<?xf32>) -> !gpu.async.token
module {
}

I have tried other pipelines, but they all eventually failed. So what is the right pipeline to lower this? I also want to add a gpu.launch_func op to my code, so what is the right pipeline for lowering MLIR code containing gpu.wait, gpu.alloc, gpu.memcpy, and gpu.launch_func?
(The LLVM version I use is llvmorg-16.0.6, and the CUDA version is 11.8, on an NVIDIA GeForce RTX 2060 SUPER.)

Have you looked at the integration tests in-tree? For example, it seems that mlir/test/Integration/GPU/CUDA/async.mlir uses gpu.memcpy in async mode?

There are two main differences between the two situations:

  1. The examples in mlir/test/Integration/GPU/CUDA/async.mlir combine the async dialect with the gpu dialect, while here I want to use only the gpu dialect;
  2. The type of the async token is !async.token in the async dialect, while in the gpu dialect it is !gpu.async.token, and the two are not compatible with each other (see the sketch below).
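
To illustrate the incompatibility, here is a minimal sketch (my illustration, not taken from the test) of where each token type comes from:

// !async.token is produced by async dialect ops:
%t = async.execute {
  async.yield
}
// here %t : !async.token

// !gpu.async.token is produced by async gpu dialect ops:
%g = gpu.wait async
// here %g : !gpu.async.token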

I can successfully run the example in mlir/test/Integration/GPU/CUDA/async.mlir, and currently I'm trying to combine the two dialects to solve my problem. But I would still like to know the solution for the gpu-dialect-only situation, which would bring great convenience to my development. :thinking:

I think I have found the solution.
The problem occurs because the number of gpu.memcpy dependency tokens must equal 1, as enforced in mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp:

LogicalResult ConvertMemcpyOpToGpuRuntimeCallPattern::matchAndRewrite(
    gpu::MemcpyOp memcpyOp, OpAdaptor adaptor,
    ConversionPatternRewriter &rewriter) const {
  auto memRefType = memcpyOp.getSrc().getType().cast<MemRefType>();

  // The constraints: operand types must be LLVM-compatible, the memref
  // must have an identity layout, and the op must have exactly one async
  // dependency token (isAsyncWithOneDependency).
  if (failed(areAllLLVMTypes(memcpyOp, adaptor.getOperands(), rewriter)) ||
      !isConvertibleAndHasIdentityMaps(memRefType) ||
      failed(isAsyncWithOneDependency(rewriter, memcpyOp)))
    return failure();
  ...
}

So if gpu.memcpy is given more than one dependency, the rewriter fails to convert the op into CUDA runtime calls, leaving gpu.memcpy unchanged; and since that op is explicitly marked illegal by the pass, legalization fails with the error above.
When I reduced gpu.memcpy to a single dependency, as below, the -gpu-to-llvm pass worked perfectly:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %5 = memref.alloca(%c2) : memref<?xf32>
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %7 = gpu.memcpy async [%2] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%2] %1 : memref<?xf32>
    gpu.wait [%3]
    return
  }
}

However, this leads to another problem:
when I add the -gpu-async-region pass before the -gpu-to-llvm pass, it stops working again, because the IR generated after -gpu-async-region looks like this:

module attributes {gpu.container_module} {
  func.func @main() attributes {llvm.emit_c_interface} {
    %c2 = arith.constant 2 : index
    %alloca = memref.alloca(%c2) : memref<?xf32>
    %0 = gpu.wait async
    %memref, %asyncToken = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %1 = gpu.memcpy async [%asyncToken] %memref, %alloca : memref<?xf32>, memref<?xf32>
    %2 = gpu.dealloc async [%1, %asyncToken] %memref : memref<?xf32>
    gpu.wait [%2]
    return
  }
}

I noticed that the -gpu-async-region pass adds an extra dependency to the gpu.dealloc op, but gpu.dealloc allows only one dependency, which again causes the “illegal op” error.
So why does this happen? Is it a bug in the -gpu-async-region pass? @mehdi_amini

I haven’t looked at these flows in a very long time, and I don’t think they are actively developed right now; maybe @herhut, @csigg, or @ezhulenev remember the intent?

The -gpu-async-region pass expects non-async gpu ops; see this comment.
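
For example, the pass expects input roughly like the sketch below (an illustration, not taken from the in-tree tests): plain synchronous gpu ops, which -gpu-async-region then rewrites into their async forms chained by !gpu.async.token values.

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    // Synchronous ops; the pass itself adds the async tokens.
    %memref = gpu.alloc (%c2) : memref<?xf32>
    gpu.dealloc %memref : memref<?xf32>
    return
  }
}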

How do you expect your initial memcpy example to map to streams? If it should map to a single stream (you only have one gpu.wait async, which lowers to creating a stream), you could try:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %5, %6 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %7 = gpu.memcpy async [%0] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%7] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%7] %5 : memref<?xf32>
    gpu.wait [%3]
    gpu.wait [%4]
    return
  }
}

On the other hand, if you would like to map it to multiple streams, something like this might work:

module attributes {gpu.container_module} {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %0 = gpu.wait async
    %1, %2 = gpu.alloc async [%0] (%c2) : memref<?xf32>
    %9 = gpu.wait async
    %5, %6 = gpu.alloc async [%9] (%c2) : memref<?xf32>
    %7 = gpu.wait async [%2, %6]
    %8 = gpu.memcpy async [%7] %1, %5 : memref<?xf32>, memref<?xf32>
    %3 = gpu.dealloc async [%8] %1 : memref<?xf32>
    %4 = gpu.dealloc async [%8] %5 : memref<?xf32>
    gpu.wait [%3, %4]
    return
  }
}

Generally though, the idea of gpu async is to build on top of the async dialect and map async regions to individual streams, synchronized with events across async and parent regions.
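
As a rough sketch of that style (modeled loosely on mlir/test/Integration/GPU/CUDA/async.mlir; the kernel @kernels::@kernel is a hypothetical placeholder), where -gpu-async-region can map each async.execute region to its own stream:

%c1 = arith.constant 1 : index
// Two regions that may run on separate streams; the explicit dependency
// makes the second region wait on the first, synchronized via an event.
%t0 = async.execute {
  gpu.launch_func @kernels::@kernel
      blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
  async.yield
}
%t1 = async.execute [%t0] {
  gpu.launch_func @kernels::@kernel
      blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
  async.yield
}
async.await %t1 : !async.token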

That is really helpful! But I still have some questions:

So in the first example, do the ops execute in linear order? (As there is only one gpu.wait async op, which lowers to creating only one stream, are all the ops arranged on that single stream?)

And in the second example, are there 3 streams during execution because of the 3 gpu.wait async ops? The first 2 streams are responsible for allocating memory on the GPU and are then destroyed, with the third stream taking over and doing the memcpy and deallocation ops. Did I get that right?