The offset situation is rather unfortunate here. I think we should be able to support non-zero offsets specifically, as long as the memref is contiguous otherwise. (Though my opinion is that the offset should just be removed from the layout.)
Thanks, that makes sense. Supporting only the offset does not help much as quite often you’d need strides as well (e.g., with a subview of a 2D buffer).
There seem to be cuMemcpy2DAsync and cuMemcpy3DAsync, which do support strides and offsets. Adding these would cover the 1D, 2D and 3D cases. But there’s no solution for the generic case, I guess.
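To make the 2D case concrete, here is a rough host-only sketch of what a pitched 2D copy does. It is not the CUDA API: `Copy2D`, `copy2D` and `copySubview2x3` are made-up names, and the fields only mirror the roles that `srcXInBytes`, `srcY`, `srcPitch`, `WidthInBytes` and `Height` play in the `CUDA_MEMCPY2D` descriptor. Plain `std::memcpy` stands in for the device transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for a pitched 2D copy descriptor.
// Not the CUDA type; the fields only mirror the roles of the real ones.
struct Copy2D {
  const unsigned char *src;
  std::size_t srcXInBytes; // byte offset of the first column in each row
  std::size_t srcY;        // index of the first row
  std::size_t srcPitch;    // bytes between consecutive source rows
  unsigned char *dst;
  std::size_t dstPitch;    // bytes between consecutive destination rows
  std::size_t widthInBytes;
  std::size_t height;
};

// Copies `height` rows of `widthInBytes` bytes each. Every row must be
// contiguous; only the distance between rows (the pitch) may vary.
void copy2D(const Copy2D &c) {
  assert(c.widthInBytes <= c.srcPitch && c.widthInBytes <= c.dstPitch);
  for (std::size_t row = 0; row < c.height; ++row) {
    const unsigned char *s =
        c.src + (c.srcY + row) * c.srcPitch + c.srcXInBytes;
    unsigned char *d = c.dst + row * c.dstPitch;
    std::memcpy(d, s, c.widthInBytes);
  }
}

// Example: the 2x3 subview starting at (row 1, column 2) of a row-major
// 4x8 float buffer, copied into a dense 2x3 destination.
std::vector<float> copySubview2x3(const std::vector<float> &buf) {
  assert(buf.size() == 4 * 8);
  std::vector<float> out(2 * 3);
  Copy2D c{reinterpret_cast<const unsigned char *>(buf.data()),
           2 * sizeof(float), // column offset of the subview
           1,                 // row offset of the subview
           8 * sizeof(float), // row stride of the parent buffer
           reinterpret_cast<unsigned char *>(out.data()),
           3 * sizeof(float), // dense destination rows
           3 * sizeof(float), // three floats per row
           2};                // two rows
  copy2D(c);
  return out;
}
```

Note that this only covers views whose innermost stride is 1: the row stride maps to the pitch, but a non-unit inner stride has no counterpart in this scheme.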
Given that the underlying copy operation is, on both CUDA and HIP, a memcpy()-like operation (that is, it takes a source pointer, a destination pointer, and a length), trying to copy out of anything other than an identity-layout memref (or maybe a memref that’s got weird strides but is otherwise contiguous) is an unreasonable primitive to expect out of the GPU dialect.
And the tricky thing with offsets is that your offsetting will need to reduce to memcpy(gpuBasePtr, cpuBasePtr + offset, length - offset), which is a rather hard condition to guarantee, given that you can’t getelementptr a memref because that’s not how those work.
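For illustration, the condition under which an offset view can reduce to a single flat copy might be checked roughly like this. This is a sketch under assumptions, not how the GPU dialect does it: `StridedView`, `isContiguousRowMajor` and `flatCopyIfContiguous` are hypothetical names, offsets and strides are counted in elements, only row-major density is recognized, and `std::memcpy` stands in for the device copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical simplified view of a strided memref of floats.
// Offset and strides are in elements, not bytes.
struct StridedView {
  const float *base;
  std::int64_t offset;
  std::vector<std::int64_t> sizes;
  std::vector<std::int64_t> strides;
};

std::int64_t numElements(const StridedView &v) {
  std::int64_t n = 1;
  for (std::int64_t s : v.sizes) n *= s;
  return n;
}

// True when the view's elements form one dense row-major run that starts
// at base + offset. Permuted-but-dense layouts are not recognized here.
bool isContiguousRowMajor(const StridedView &v) {
  assert(v.sizes.size() == v.strides.size());
  std::int64_t expected = 1;
  for (std::size_t i = v.sizes.size(); i-- > 0;) {
    if (v.sizes[i] == 1) continue; // stride of a unit dimension is irrelevant
    if (v.strides[i] != expected) return false;
    expected *= v.sizes[i];
  }
  return true;
}

// Performs the single flat copy only when the check above passes.
// Returns false (and copies nothing) for any other layout.
bool flatCopyIfContiguous(const StridedView &v, float *dst) {
  if (!isContiguousRowMajor(v)) return false;
  std::int64_t n = numElements(v);
  if (n > 0)
    std::memcpy(dst, v.base + v.offset,
                static_cast<std::size_t>(n) * sizeof(float));
  return true;
}
```

With a 4x8 parent buffer, a view of two full rows (offset 8, sizes 2x8, strides 8 and 1) passes the check, while a 2x3 subview (offset 10, sizes 2x3, strides 8 and 1) does not, even though both carry a non-zero offset.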
I’d argue that the correct solution is to memref.copy into a fresh allocation and then gpu.memcpy over to the device, since that’s the only way the general case would be supported.
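As a host-side model of that two-step approach, something like the following could work. All names are hypothetical: `packContiguous` plays the role of memref.copy into a fresh identity-layout buffer, and `flatCopyToDevice` uses `std::memcpy` as a stand-in for gpu.memcpy. Offsets and strides are assumed to be in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical simplified strided source of floats (element units).
struct StridedSource {
  const float *base;
  std::int64_t offset;
  std::vector<std::int64_t> sizes;
  std::vector<std::int64_t> strides;
};

// Step 1: gather an arbitrarily strided view into a fresh dense,
// row-major buffer (the role memref.copy would play).
std::vector<float> packContiguous(const StridedSource &v) {
  assert(v.sizes.size() == v.strides.size());
  const std::size_t rank = v.sizes.size();
  std::int64_t total = 1;
  for (std::int64_t s : v.sizes) total *= s;

  std::vector<float> packed(static_cast<std::size_t>(total));
  std::vector<std::int64_t> idx(rank, 0);
  for (std::int64_t linear = 0; linear < total; ++linear) {
    // Linearize the current multi-index using the source strides.
    std::int64_t src = v.offset;
    for (std::size_t d = 0; d < rank; ++d) src += idx[d] * v.strides[d];
    packed[static_cast<std::size_t>(linear)] = v.base[src];
    // Advance the multi-index in row-major order.
    for (std::size_t d = rank; d-- > 0;) {
      if (++idx[d] < v.sizes[d]) break;
      idx[d] = 0;
    }
  }
  return packed;
}

// Step 2: one flat copy of the dense buffer (stand-in for gpu.memcpy).
void flatCopyToDevice(float *deviceBuf, const std::vector<float> &packed) {
  if (packed.empty()) return;
  std::memcpy(deviceBuf, packed.data(), packed.size() * sizeof(float));
}
```

The cost is an extra allocation and an extra pass over the data on the host, but the layout handling stays out of the device copy entirely, so any offset and stride combination works.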