gpu.memcpy does not support generic memref layouts

It seems that the gpu.memcpy op lowering in the gpu-to-llvm pass only supports memrefs with an identity layout.

Here’s a small example that takes a subview of a GPU-allocated memref and copies that to the host:

module {
  func.func @foo() -> (memref<128xf32>, memref<1xf32>) {
    %memref = gpu.alloc() : memref<128xf32>
    %subview = memref.subview %memref[10] [1] [1] : memref<128xf32> to memref<1xf32, strided<[1], offset: 10>>
    %alloc = memref.alloc() {alignment = 8 : i64} : memref<1xf32>
  gpu.memcpy %alloc, %subview : memref<1xf32>, memref<1xf32, strided<[1], offset: 10>>
    return %memref, %alloc : memref<128xf32>, memref<1xf32>
  }
}

Lowering with

mlir-opt -gpu-async-region -expand-strided-metadata -gpu-to-llvm test.mlir

leaves the gpu.memcpy op intact:

    ...
    %60 = builtin.unrealized_conversion_cast %59 : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> to memref<1xf32>
    %61 = llvm.call @mgpuStreamCreate() : () -> !llvm.ptr
    %62 = builtin.unrealized_conversion_cast %61 : !llvm.ptr to !gpu.async.token
    %63 = gpu.memcpy async [%62] %60, %37 : memref<1xf32>, memref<1xf32, strided<[1], offset: 10>>
    %64 = builtin.unrealized_conversion_cast %63 : !gpu.async.token to !llvm.ptr
    ...

The above example works if the subview offset is set to zero.

The relevant identity layout check is here: llvm-project/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp at fbf81e300489f0489edab20493f1db02e2a3bc74 (llvm/llvm-project on GitHub).

Would it be possible to generalize this? What is missing?

gpu.memcpy is a wrapper around cudaMemcpy and its equivalents on other platforms (llvm-project/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp at 13008aa45d406a65ee7adfc7672a038e4def1ad3, llvm/llvm-project on GitHub), similarly to some other ops in the gpu dialect. It is not intended to support the full generality of MLIR types; that would be unpredictable performance-wise.

The offset situation is rather unfortunate here. I think we should be able to support non-zero offsets specifically, as long as the memref is contiguous otherwise. (Though my opinion is that the offset should just be removed from the layout.)
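
For illustration (hypothetical names and shapes), a subview that takes whole rows of a 2D buffer picks up only an offset and otherwise stays contiguous, so it could in principle still lower to a single pointer-plus-offset memcpy:

    %rows = memref.subview %buf[10, 0] [4, 32] [1, 1]
        : memref<16x32xf32> to memref<4x32xf32, strided<[32, 1], offset: 320>>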

Thanks, that makes sense. Supporting only the offset does not help much as quite often you’d need strides as well (e.g., with a subview of a 2D buffer).
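
For instance (hypothetical names and shapes), a subview that selects a window out of a 2D buffer yields rows that are not contiguous in memory, so an offset alone cannot describe the copy:

    %window = memref.subview %buf[2, 3] [4, 5] [1, 1]
        : memref<16x32xf32> to memref<4x5xf32, strided<[32, 1], offset: 67>>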

There seem to be cuMemcpy2DAsync and cuMemcpy3DAsync, which do support strides and offsets. Adding these would cover the 1D, 2D, and 3D cases, but I guess there is no solution for the generic case.

Those could potentially become new ops. One should look into whether equivalents are available on different platforms.

Given that the underlying copy operation on both CUDA and HIP is a memcpy()-like operation (that is, it takes a source pointer, a destination pointer, and a length), copying out of anything other than an identity-layout memref (or perhaps a memref with non-trivial strides that is otherwise contiguous) is an unreasonable primitive to expect from the GPU dialect.

And the tricky thing with offsets is that your offsetting will need to reduce to memcpy(gpuBasePtr, cpuBasePtr + offset, length - offset), which is a rather hard condition to guarantee, given that you can’t getelementptr a memref because that’s not how those work.

I’d argue that the correct solution is to memref.copy into a fresh allocation and then gpu.memcpy over to the device, since that’s the only way the general case can be supported.
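
A minimal sketch of that staging pattern for the host-to-device direction (hypothetical names and shapes): the strided host data is first copied into a contiguous buffer, so gpu.memcpy only ever sees identity layouts.

    %src = memref.subview %host[2, 3] [4, 5] [1, 1]
        : memref<16x32xf32> to memref<4x5xf32, strided<[32, 1], offset: 67>>
    // Stage the strided data into a fresh, contiguous host allocation.
    %staging = memref.alloc() : memref<4x5xf32>
    memref.copy %src, %staging
        : memref<4x5xf32, strided<[32, 1], offset: 67>> to memref<4x5xf32>
    // Both operands of gpu.memcpy now have an identity layout.
    %dev = gpu.alloc() : memref<4x5xf32>
    gpu.memcpy %dev, %staging : memref<4x5xf32>, memref<4x5xf32>

(For the device-to-host case in the original example, the staging copy would presumably have to happen on the device side instead.)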