MLIR GPU execution without runtime load/unload

This link provided a detailed path to execute MLIR on GPUs without JIT. However, there is still a performance issue. The device part is compiled into a cubin, and the host part uses cuModuleLoadData/cuModuleUnload to load the cubin before launching kernels and to unload it at the end. The overhead of these two runtime API calls cannot be ignored if the kernel is very small and the host function is called in a loop.

For example, I have MLIR like this:

module attributes {gpu.container_module} {
  func @laplacian(%arg0: memref<66x66xf64>, %arg1: memref<66x66xf64>) attributes {} {
    ...
    gpu.launch_func  @laplacian_kernel::@laplacian_kernel blocks in (...) threads in (...) args(...)
    return
  }
  gpu.module @laplacian_kernel {
    gpu.func @laplacian_kernel(...) kernel {
      ...
      gpu.return
    }
  }
}

It is translated into LLVM IR:


@laplacian_kernel_laplacian_kernel_kernel_name = internal constant [17 x i8] c"laplacian_kernel\00"
@laplacian_kernel_gpubin_cst = internal constant [9384 x i8] \00\00\00 A HUGE BINARY STRING \00\00\00

define void @laplacian(...) !dbg !3 {
  ...
  %29 = call i8* @mgpuModuleLoad(i8* getelementptr inbounds ([9384 x i8], [9384 x i8]* @laplacian_kernel_gpubin_cst, i64 0, i64 0)), !dbg !22
  %30 = call i8* @mgpuModuleGetFunction(i8* %29, i8* getelementptr inbounds ([17 x i8], [17 x i8]* @laplacian_kernel_laplacian_kernel_kernel_name, i64 0, i64 0)), !dbg !23
  %31 = call i8* @mgpuStreamCreate(), !dbg !24
  ....
  call void @mgpuLaunchKernel(i8* %30, i64 8, i64 1, i64 1, i64 512, i64 1, i64 1, i32 0, i8* %31, i8** %47, i8** null), !dbg !136
  call void @mgpuStreamSynchronize(i8* %31), !dbg !137
  call void @mgpuStreamDestroy(i8* %31), !dbg !138
  call void @mgpuModuleUnload(i8* %29), !dbg !139
  ret void, !dbg !140
}

Then there is a main.cpp which calls the host function laplacian:

extern "C" void laplacian(....);
...
for (int i = 0; i < iters; ++i) {
  laplacian(in, out);
}

The profiling data (from nvprof) looks like this:

==16995== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   22.64%  2.8947ms       100  28.947us  28.256us  29.983us  laplacian_kernel
      API calls:    3.86%  17.042ms       100  170.42us  153.24us  285.28us  cuModuleLoadData
                    1.59%  7.0116ms       100  70.115us  64.703us  86.837us  cuModuleUnload

cuModuleLoadData/cuModuleUnload take about 240us per call (roughly 170us + 70us on average), while the kernel (laplacian_kernel) only takes 28.947us.

Is there any way to avoid these runtime calls, or to make them happen only once for the whole program?

There are a few solutions to this, none of which is available in upstream MLIR afaik.

If you are certain that you will never need to unload loaded modules before process termination, you can implement lazy loading of modules in the runtime library. This is done in https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc for example. This works with the existing lowering for gpu.launch_func.
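
Roughly, such a caching wrapper could look like this. This is only a minimal sketch of the idea, not the upstream implementation; it assumes the mgpuModuleLoad/mgpuModuleUnload signatures visible in the LLVM IR above and omits error handling:

#include <mutex>
#include <unordered_map>
#include <cuda.h>

// Load each embedded cubin blob at most once, keyed by the blob's address
// (each gpu.module produces its own global constant, so the address is a
// stable key for the lifetime of the process).
extern "C" void *mgpuModuleLoad(void *data) {
  static std::mutex mutex;
  static std::unordered_map<void *, CUmodule> cache;
  std::lock_guard<std::mutex> lock(mutex);
  auto it = cache.find(data);
  if (it != cache.end())
    return it->second;
  CUmodule module = nullptr;
  cuModuleLoadData(&module, data); // error handling omitted for brevity
  cache.emplace(data, module);
  return module;
}

// Modules stay resident until process termination, so unloading is a no-op.
extern "C" void mgpuModuleUnload(void * /*module*/) {}

With this, cuModuleLoadData runs once per distinct kernel module instead of once per call to the host function, and cuModuleUnload disappears from the hot loop entirely.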

You can also change the lowering of gpu.launch_func to emit the module loading into some initializer function and keep the loaded module handle in a global. This is reasonably easy if you already have an initializer function in your setup.
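
In C++ terms the generated code would then have roughly this shape. The helper names come from the LLVM IR above; the constructor mechanism, variable names, and everything else here are illustrative only, since the real lowering would emit LLVM IR directly:

extern "C" void *mgpuModuleLoad(void *data);
extern "C" void *mgpuModuleGetFunction(void *module, const char *name);

// The cubin blob embedded by the compiler (see @laplacian_kernel_gpubin_cst).
extern const unsigned char laplacian_kernel_gpubin_cst[];

// Module and function handles live in globals, filled in exactly once.
static void *laplacian_module = nullptr;
static void *laplacian_kernel_fn = nullptr;

__attribute__((constructor)) static void init_laplacian_module() {
  laplacian_module = mgpuModuleLoad((void *)laplacian_kernel_gpubin_cst);
  laplacian_kernel_fn =
      mgpuModuleGetFunction(laplacian_module, "laplacian_kernel");
}

The lowered gpu.launch_func then just reads laplacian_kernel_fn instead of loading and unloading the module around every launch.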

If you want to be more sophisticated and reason about lifetimes, I'd suggest modeling loading and unloading in the IR. Doing this at the LLVM level is painful, so I'd suggest lowering gpu.launch_func to something that makes the individual steps explicit and then hoisting loads/unloads out of loops, etc.
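
As an illustration of what that buys you, written as C++ pseudo-host-code rather than IR (launchLaplacian is a purely hypothetical stand-in for the lowered launch sequence): once loading, function lookup, launching, and unloading are separate operations, ordinary loop-invariant code motion can pull everything except the launch out of the loop:

extern "C" void *mgpuModuleLoad(void *data);
extern "C" void *mgpuModuleGetFunction(void *module, const char *name);
extern "C" void mgpuModuleUnload(void *module);
extern const unsigned char laplacian_kernel_gpubin_cst[];

void launchLaplacian(void *fn, double *in, double *out); // hypothetical helper

void run(double *in, double *out, int iters) {
  // Hoisted above the loop: module load and function lookup.
  void *module = mgpuModuleLoad((void *)laplacian_kernel_gpubin_cst);
  void *fn = mgpuModuleGetFunction(module, "laplacian_kernel");
  for (int i = 0; i < iters; ++i)
    launchLaplacian(fn, in, out); // only the launch remains in the loop body
  // Sunk below the loop: a single unload once all launches are done.
  mgpuModuleUnload(module);
}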