gpu.memcpy does not support generic memref layouts

It seems that the gpu.memcpy op lowering in the gpu-to-llvm pass only supports memrefs with an identity layout.

Here’s a small example that takes a subview of a GPU-allocated memref and copies that to the host:

module {
  func.func @foo() -> (memref<128xf32>, memref<1xf32>) {
    %memref = gpu.alloc() : memref<128xf32>
    %subview = memref.subview %memref[10] [1] [1] : memref<128xf32> to memref<1xf32, strided<[1], offset: 10>>
    %alloc = memref.alloc() {alignment = 8 : i64} : memref<1xf32>
  gpu.memcpy %alloc, %subview : memref<1xf32>, memref<1xf32, strided<[1], offset: 10>>
    return %memref, %alloc : memref<128xf32>, memref<1xf32>
  }
}

Lowering with

mlir-opt -gpu-async-region -expand-strided-metadata -gpu-to-llvm test.mlir

leaves the gpu.memcpy op intact:

    ...
    %60 = builtin.unrealized_conversion_cast %59 : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> to memref<1xf32>
    %61 = llvm.call @mgpuStreamCreate() : () -> !llvm.ptr
    %62 = builtin.unrealized_conversion_cast %61 : !llvm.ptr to !gpu.async.token
    %63 = gpu.memcpy async [%62] %60, %37 : memref<1xf32>, memref<1xf32, strided<[1], offset: 10>>
    %64 = builtin.unrealized_conversion_cast %63 : !gpu.async.token to !llvm.ptr
    ...

The above example works if the subview offset is set to zero.

The relevant identity layout check is here: llvm-project/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp at fbf81e300489f0489edab20493f1db02e2a3bc74 (llvm/llvm-project on GitHub).

Would it be possible to generalize this? What is missing?

gpu.memcpy is a wrapper around cudaMemcpy and its equivalents on other platforms (llvm-project/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp at 13008aa45d406a65ee7adfc7672a038e4def1ad3, llvm/llvm-project on GitHub), similarly to some other ops in the gpu dialect. It is not intended to support the full generality of MLIR types; that would be unpredictable performance-wise.

The offset situation is rather unfortunate here. I think we should be able to support non-zero offsets specifically, as long as the memref is contiguous otherwise. (Though my opinion is that the offset should just be removed from the layout.)
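
For illustration (hypothetical names and shapes), a subview that takes whole rows of a 2D buffer picks up only an offset and otherwise stays contiguous, so it could in principle still lower to a single pointer-plus-offset memcpy:

    %rows = memref.subview %buf[10, 0] [4, 32] [1, 1]
        : memref<16x32xf32> to memref<4x32xf32, strided<[32, 1], offset: 320>>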

Thanks, that makes sense. Supporting only the offset does not help much as quite often you’d need strides as well (e.g., with a subview of a 2D buffer).
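
For instance (hypothetical names and shapes), a subview that selects a window out of a 2D buffer yields rows that are not contiguous in memory, so an offset alone cannot describe the copy:

    %window = memref.subview %buf[2, 3] [4, 5] [1, 1]
        : memref<16x32xf32> to memref<4x5xf32, strided<[32, 1], offset: 67>>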

There seem to be cuMemcpy2DAsync and cuMemcpy3DAsync, which do support strides and offsets. Adding these would cover the 1D, 2D, and 3D cases, but I guess there is no solution for the generic case.

Those could potentially become new ops. One should look into whether equivalents are available on different platforms.

Given that the underlying copy operation on both CUDA and HIP is a memcpy()-like operation (that is, it takes a source pointer, a destination pointer, and a length), copying out of anything other than an identity-layout memref (or perhaps a memref with non-trivial strides that is otherwise contiguous) is an unreasonable primitive to expect from the GPU dialect.

And the tricky thing with offsets is that your offsetting will need to reduce to memcpy(gpuBasePtr, cpuBasePtr + offset, length - offset), which is a rather hard condition to guarantee, given that you can’t getelementptr a memref because that’s not how those work.

I’d argue that the correct solution is to memref.copy into a fresh allocation and then gpu.memcpy over to the device, since that’s the only way the general case can be supported.
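
A minimal sketch of that staging pattern for the host-to-device direction (hypothetical names and shapes): the strided host data is first copied into a contiguous buffer, so gpu.memcpy only ever sees identity layouts.

    %src = memref.subview %host[2, 3] [4, 5] [1, 1]
        : memref<16x32xf32> to memref<4x5xf32, strided<[32, 1], offset: 67>>
    // Stage the strided data into a fresh, contiguous host allocation.
    %staging = memref.alloc() : memref<4x5xf32>
    memref.copy %src, %staging
        : memref<4x5xf32, strided<[32, 1], offset: 67>> to memref<4x5xf32>
    // Both operands of gpu.memcpy now have an identity layout.
    %dev = gpu.alloc() : memref<4x5xf32>
    gpu.memcpy %dev, %staging : memref<4x5xf32>, memref<4x5xf32>

(For the device-to-host case in the original example, the staging copy would presumably have to happen on the device side instead.)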