Repeated calls to a ROCm kernel are very slow


Repeated calls to a kernel that has been lowered for AMDGPUs are very slow, especially compared to the same code lowered for CUDA. Here’s an example:

// File test.mlir
module attributes {gpu.container_module} {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %max = arith.constant 1000 : index
    scf.for %arg0 = %c0 to %max step %c1 {
      gpu.launch_func @test_func::@test_func blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    }
    return
  }
  gpu.module @test_func {
    gpu.func @test_func() kernel {
      gpu.return
    }
  }
}
This code calls an empty kernel 1000 times. Lowering it to LLVM IR for ROCm and then compiling it with clang produces an executable that is quite slow:

$ mlir-opt -gpu-kernel-outlining -convert-scf-to-cf -convert-func-to-llvm='use-bare-ptr-memref-call-conv' -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-rocdl,reconcile-unrealized-casts)' -convert-math-to-llvm -gpu-to-hsaco='chip=gfx906' -gpu-to-llvm -reconcile-unrealized-casts test.mlir | mlir-translate -mlir-to-llvmir -o test_rocm.ll
$ clang test_rocm.ll -lmlir_rocm_runtime -o test_rocm
$ time ./test_rocm
./test_rocm  1.28s user 4.47s system 105% cpu 5.445 total

The same thing for CUDA:

$ mlir-opt -gpu-kernel-outlining -convert-scf-to-cf -convert-func-to-llvm='use-bare-ptr-memref-call-conv' -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts)' -convert-math-to-llvm -gpu-to-cubin -gpu-to-llvm -reconcile-unrealized-casts test.mlir | mlir-translate -mlir-to-llvmir -o test_cuda.ll
$ clang test_cuda.ll -lmlir_cuda_runtime -o test_cuda
$ time ./test_cuda
./test_cuda  0.06s user 0.42s system 94% cpu 0.508 total

(I might add that CUDA doesn’t seem to optimize out the kernel calls, since nvprof reports that the test_func kernel was called 1000 times.)

I investigated a little and found that, according to ltrace -c, the majority of the execution time in the ROCm case is spent creating streams:

$ ltrace -c ./test_rocm
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 64.22    5.010827        5010      1000 mgpuStreamCreate
 18.93    1.477283        1477      1000 mgpuModuleLoad
 11.49    0.896801         896      1000 mgpuStreamDestroy
  2.30    0.179130         179      1000 mgpuModuleUnload
  1.09    0.085361          85      1000 mgpuStreamSynchronize
  1.08    0.084516          84      1000 mgpuLaunchKernel
  0.88    0.068589          68      1000 mgpuModuleGetFunction
------ ----------- ----------- --------- --------------------
100.00    7.802507                  7000 total

Here’s the result for the CUDA version as well:

$ ltrace -c ./test_cuda
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 45.86    0.519534         519      1000 mgpuModuleLoad
  9.59    0.108639         108      1000 mgpuStreamDestroy
  9.55    0.108217         108      1000 mgpuStreamCreate
  9.00    0.101964         101      1000 mgpuLaunchKernel
  8.98    0.101758         101      1000 mgpuModuleUnload
  8.57    0.097057          97      1000 mgpuModuleGetFunction
  8.46    0.095816          95      1000 mgpuStreamSynchronize
------ ----------- ----------- --------- --------------------

The difference is pretty clear.

Modifying the mlir_rocm_runtime library to prevent streams from being destroyed, and caching the first created stream, indeed improves the execution time by a lot.

So, in the end, my questions are:

  • Do you have an idea for a workaround that doesn’t involve modifying the mlir_rocm_runtime library?
  • Do you think this is an optimization problem in the HIP library? (The mlir_cuda_runtime code seems very similar to its ROCm equivalent, so I don’t think the improvements should be at this level.)

Thank you for reading, let me know if you have any questions.

For a use case we have with TensorFlow, we have essentially modified the runtime layer to reuse streams and even cache modules. This is achieved via a custom lowering of gpu.launch_func and a runtime implementation that has module caching.

The upstream lowering of gpu.launch_func is essentially a bare-minimum implementation. If you do not want to modify the runtime, you could add an optimization pass that hoists the module loading into an initializer function. You could also replace the calls to mgpuStreamCreate and mgpuStreamDestroy with a shared stream that is created once (perhaps in that same initializer function).

Depending on your use case, you can also get away with not unloading modules/destroying streams. Otherwise, I’d create a destructor-like function, as well, that is called on program termination.
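A minimal sketch of that idea in C++. Everything here is hypothetical scaffolding: the `Stream`/`Module` structs and the `getSharedStream`/`getCachedModule` helpers are stand-ins for the real HIP handles and runtime entry points (a real implementation would wrap `hipStreamCreate` and `hipModuleLoadData`, and make the corresponding destroy/unload calls no-ops or defer them to program termination):

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the real HIP handles (hipStream_t, hipModule_t).
struct Stream { int id; };
struct Module { std::string name; };

// Counts how often the expensive creation path actually runs.
static int streamCreations = 0;

// Create the stream once and hand out the same handle on every later call.
// In a real runtime this would call hipStreamCreate on first use, and the
// matching mgpuStreamDestroy would become a no-op.
Stream *getSharedStream() {
  static Stream *shared = nullptr;
  if (!shared)
    shared = new Stream{++streamCreations};
  return shared;
}

// Cache loaded modules by a key identifying the code object, so repeated
// launches hit the cache instead of reloading the module each iteration.
Module *getCachedModule(const std::string &key) {
  static std::unordered_map<std::string, Module *> cache;
  auto it = cache.find(key);
  if (it != cache.end())
    return it->second;
  Module *m = new Module{key}; // real runtime: hipModuleLoadData
  cache.emplace(key, m);
  return m;
}

// A destructor-like hook registered with std::atexit could tear these down
// at program termination, e.g.:
//   std::atexit([] { /* hipStreamDestroy / hipModuleUnload here */ });
```

With this pattern, calling `getSharedStream()` and `getCachedModule()` 1000 times performs the expensive creation only once, which is exactly the cost the ltrace profile above attributes to mgpuStreamCreate and mgpuModuleLoad.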

Ok, that makes perfect sense, and having a custom runtime and lowering seems like the cleanest solution.

It just seems weird to me that the same thing with CUDA is much faster, and since this looks like a pretty common case, I’m wondering if it would be beneficial to implement this upstream, whether it’s in MLIR or reaching to the HIP people directly.

Thanks for reporting this! I’ve passed it along; if there are any updates I can share, I’ll let you know.