When using MLIR to generate GPU code, I noticed that a lot of host-side global data is passed to gpu.launch_func as input parameters, which seems to cause some issues. (Someone kindly answered a question about this for me before.)
After examining the examples in the test directory, it looks like this is only resolved by manually adding data transfers from the host side to the device side. Is there a way to avoid modifying the auto-generated IR by hand, perhaps through some passes or another mechanism?
In other words, I am looking for a more automated way to handle host-to-device data transfers when lowering MLIR to GPU, rather than having to modify the IR manually. Has anyone encountered similar issues or developed passes to handle this automatically?
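For reference, this is roughly what the manual workaround looks like in my case. It is only a simplified sketch: the @global_data symbol, the @kernels::@my_kernel name, and the sizes are placeholders, and the gpu.module that defines the kernel is omitted.

```mlir
// Host-side global (placeholder).
memref.global "private" @global_data : memref<10xf32> = dense<0.0>

func.func @run() {
  %c1 = arith.constant 1 : index
  %c10 = arith.constant 10 : index
  %host = memref.get_global @global_data : memref<10xf32>

  // Manually allocate device memory and copy the host data over.
  %dev = gpu.alloc () : memref<10xf32>
  gpu.memcpy %dev, %host : memref<10xf32>, memref<10xf32>

  // Launch the kernel on the device copy instead of the host global.
  gpu.launch_func @kernels::@my_kernel
      blocks in (%c1, %c1, %c1) threads in (%c10, %c1, %c1)
      args(%dev : memref<10xf32>)

  // Copy the results back and release the device buffer.
  gpu.memcpy %host, %dev : memref<10xf32>, memref<10xf32>
  gpu.dealloc %dev : memref<10xf32>
  return
}
```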
Additionally, this does not seem to be limited to global data: a value created with memref.alloc cannot be passed directly to gpu.launch_func as an input parameter either; doing so results in errors like the following:
```
'cuStreamSynchronize(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuStreamDestroy(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuModuleUnload(module)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
```
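For completeness, here is a minimal sketch of the pattern that fails for me; the kernel name is a placeholder and the gpu.module defining it is omitted:

```mlir
%c1 = arith.constant 1 : index
%c10 = arith.constant 10 : index

// Plain host allocation, not visible to the device.
%buf = memref.alloc() : memref<10xf32>

// Passing it straight to the kernel leads to CUDA_ERROR_ILLEGAL_ADDRESS
// once the stream is synchronized.
gpu.launch_func @kernels::@my_kernel
    blocks in (%c1, %c1, %c1) threads in (%c10, %c1, %c1)
    args(%buf : memref<10xf32>)
```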
Managing GPU-CPU data transfers automatically can require a significant amount of compiler work, and sometimes it is not even possible without full visibility into the entire program. It depends on what you want to support. My 5 cents.
One way of handling copies automatically is to use the vendor’s unified memory or managed memory solutions. Unified memory might require specific systems, but managed memory is widely available.
In MLIR, you can allocate a memref with `%memref = gpu.alloc host_shared () : memref<10xf32>`, which gives both the GPU and the CPU read-write access to the buffer. The underlying driver and hardware manage the virtual pages and handle the data transfers automatically.
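A minimal sketch of how that fits together (the @kernels::@my_kernel name and sizes are placeholders, and the gpu.module defining the kernel is omitted):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c10 = arith.constant 10 : index
%cst = arith.constant 1.0 : f32

// One allocation that both the host and the device can access.
%memref = gpu.alloc host_shared () : memref<10xf32>

// The host writes to it directly; no explicit gpu.memcpy is needed.
memref.store %cst, %memref[%c0] : memref<10xf32>

// The same memref is passed to the kernel; the driver migrates pages as needed.
gpu.launch_func @kernels::@my_kernel
    blocks in (%c1, %c1, %c1) threads in (%c10, %c1, %c1)
    args(%memref : memref<10xf32>)

// After the launch completes, the results are readable from the host.
%v = memref.load %memref[%c0] : memref<10xf32>
gpu.dealloc %memref : memref<10xf32>
```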