This link provided a detailed path to execute MLIR on GPUs without JIT. However, there is still one performance issue. The device part is compiled into a cubin, and the host part calls cuModuleLoadData/cuModuleUnload to load the cubin before launching the kernel and to unload it afterwards. The overhead of these two runtime API calls cannot be ignored when the kernel itself is very small and the host function is called inside a loop.
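Schematically, each call into the generated host function performs the following CUDA driver sequence. This is a simplified C++ sketch of what the lowered code shown below does; error handling is omitted, and cubinData/args are stand-ins for the embedded constants and packed kernel arguments:

#include <cuda.h>

// What one call to the generated host function boils down to
// (simplified sketch; launch configuration taken from my example):
void launchOnce(const void *cubinData, void **args) {
  CUmodule module;
  cuModuleLoadData(&module, cubinData);              // expensive, paid on every call
  CUfunction fn;
  cuModuleGetFunction(&fn, module, "laplacian_kernel");
  CUstream stream;
  cuStreamCreate(&stream, CU_STREAM_DEFAULT);
  cuLaunchKernel(fn, /*grid=*/8, 1, 1, /*block=*/512, 1, 1,
                 /*sharedMemBytes=*/0, stream, args, /*extra=*/nullptr);
  cuStreamSynchronize(stream);
  cuStreamDestroy(stream);
  cuModuleUnload(module);                            // expensive, paid on every call
}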
For example, I have the following MLIR:
module attributes {gpu.container_module} {
  func @laplacian(%arg0: memref<66x66xf64>, %arg1: memref<66x66xf64>) attributes {} {
    ...
    gpu.launch_func @laplacian_kernel::@laplacian_kernel blocks in (...) threads in (...) args(...)
    return
  }
  gpu.module @laplacian_kernel {
    gpu.func @laplacian_kernel(...) kernel {
      ...
      gpu.return
    }
  }
}
It is translated into LLVM IR:
@laplacian_kernel_laplacian_kernel_kernel_name = internal constant [17 x i8] c"laplacian_kernel\00"
@laplacian_kernel_gpubin_cst = internal constant [9384 x i8] c"\00\00\00 A HUGE BINARY STRING \00\00\00"

define void @laplacian(...) !dbg !3 {
  ...
  %29 = call i8* @mgpuModuleLoad(i8* getelementptr inbounds ([9384 x i8], [9384 x i8]* @laplacian_kernel_gpubin_cst, i64 0, i64 0)), !dbg !22
  %30 = call i8* @mgpuModuleGetFunction(i8* %29, i8* getelementptr inbounds ([17 x i8], [17 x i8]* @laplacian_kernel_laplacian_kernel_kernel_name, i64 0, i64 0)), !dbg !23
  %31 = call i8* @mgpuStreamCreate(), !dbg !24
  ...
  call void @mgpuLaunchKernel(i8* %30, i64 8, i64 1, i64 1, i64 512, i64 1, i64 1, i32 0, i8* %31, i8** %47, i8** null), !dbg !136
  call void @mgpuStreamSynchronize(i8* %31), !dbg !137
  call void @mgpuStreamDestroy(i8* %31), !dbg !138
  call void @mgpuModuleUnload(i8* %29), !dbg !139
  ret void, !dbg !140
}
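For reference, the mgpu* functions are thin shims over the CUDA driver API; they live in MLIR's mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp. As far as I understand, the relevant ones reduce to roughly the following (a simplified sketch, not the exact source; actual signatures may differ, and error handling is omitted):

#include <cuda.h>

extern "C" CUmodule mgpuModuleLoad(void *data) {
  CUmodule module;
  cuModuleLoadData(&module, data);        // the ~170us call in the profile below
  return module;
}

extern "C" CUfunction mgpuModuleGetFunction(CUmodule module, const char *name) {
  CUfunction function;
  cuModuleGetFunction(&function, module, name);
  return function;
}

extern "C" void mgpuModuleUnload(CUmodule module) {
  cuModuleUnload(module);                 // the ~70us call in the profile below
}

So every invocation of @laplacian pays for a full module load and unload.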
Then there is a main.cpp which calls the host function laplacian:
extern "C" void laplacian(/* ... */);
...
for (int i = 0; i < iters; ++i) {
  laplacian(in, out);
}
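As an aside, the extern declaration (and the laplacian(in, out) call, which is correspondingly simplified here) has to match MLIR's memref calling convention. With the default lowering (no llvm.emit_c_interface), each memref<66x66xf64> argument is expanded into (allocated pointer, aligned pointer, offset, 2 sizes, 2 strides), so the declaration looks roughly like this. This is a sketch under that assumption; the parameter names are mine:

#include <cstdint>

extern "C" void laplacian(
    double *inAlloc, double *inAligned, int64_t inOffset,
    int64_t inSize0, int64_t inSize1, int64_t inStride0, int64_t inStride1,
    double *outAlloc, double *outAligned, int64_t outOffset,
    int64_t outSize0, int64_t outSize1, int64_t outStride0, int64_t outStride1);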
The profiling result from nvprof looks like this:
==16995== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   22.64%  2.8947ms    100  28.947us  28.256us  29.983us  laplacian_kernel
      API calls:    3.86%  17.042ms    100  170.42us  153.24us  285.28us  cuModuleLoadData
                     1.59%  7.0116ms    100  70.115us  64.703us  86.837us  cuModuleUnload
cuModuleLoadData/cuModuleUnload together take about 240us per iteration (170.42us + 70.12us on average), while the kernel itself (laplacian_kernel) only takes about 29us.
Is there any way to avoid these runtime calls, or to have the module loaded only once for the whole program?
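One workaround I can think of is to provide my own caching versions of the wrapper functions at link time, so the cubin is loaded on first use and never unloaded. A minimal sketch, assuming single-threaded host code, that these definitions take precedence over the stock wrappers, and that the signatures match the i8* types in the IR above:

#include <cuda.h>
#include <unordered_map>

// Cache modules by the address of the embedded cubin constant, so each
// distinct gpu.module is loaded at most once per process.
static std::unordered_map<void *, CUmodule> cachedModules;

extern "C" CUmodule mgpuModuleLoad(void *data) {
  auto it = cachedModules.find(data);
  if (it == cachedModules.end()) {
    CUmodule module;
    cuModuleLoadData(&module, data);   // real load happens only on first use
    it = cachedModules.emplace(data, module).first;
  }
  return it->second;
}

extern "C" void mgpuModuleUnload(CUmodule module) {
  // Deliberately a no-op: keep the module alive for the next iteration.
  // The cached modules leak until process exit, when the driver cleans up.
  (void)module;
}

Would something along these lines be reasonable, or is there a supported way to hoist the module load/unload out of the loop entirely?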