Complicated bug encountered when trying to use `memref.alloca` inside a GPU kernel function

Hello guys!

Since I raised this discussion several months ago, I have been keeping an eye on the progress of the “Allocating memory inside GPU kernel” topic. Recently I found that the latest release of LLVM (llvmorg-17.0.6) supports memref.alloca inside GPU kernels, which satisfies most use cases with the gpu dialect!

However, for my personal use case I still hit a tricky corner-case bug, which I will show below.

Here is a minimal piece of code, named test.mlir, that reproduces the bug:

module {
    func.func @main() {
        %c0 = arith.constant 0 : index
        %c1 = arith.constant 1 : index
        %c20 = arith.constant 20 : index
        %c1i8 = arith.constant 1 : i8
        %dev0 = gpu.alloc (%c20) : memref<?xi8>
        gpu.launch blocks(%bx, %by, %bz) in (%x = %c1, %y = %c1, %z = %c1)
                   threads(%tx, %ty, %tz) in (%o = %c1, %p = %c1, %q = %c1) {
            // memref.copy %dev0, %dev1 : memref<?xindex> to memref<?xindex>
            %alloca = memref.alloca (%c20) : memref<?xi8>
            %t = vector.splat %c1i8 : vector<20xi8>
            %20 = scf.for %arg16 = %c0 to %c20 step %c1 iter_args(%arg17 = %t) -> (vector<20xi8>) {
                %34 = vector.load %alloca[%arg16] : memref<?xi8>, vector<20xi8>
                scf.yield %34 : vector<20xi8>
            }
            %30 = vector.extractelement %20[%c0 : index] : vector<20xi8>
            memref.store %30, %dev0[%c0] : memref<?xi8>   // <== store method 1
            // vector.store %20, %dev0[%c1] : memref<?xi8>, vector<20xi8>    // <== store method 2
            gpu.terminator
        }
        return
    }
}

And I’m trying to lower it to LLVM-IR with the following pass pipeline:

$ mlir-opt test.mlir --gpu-kernel-outlining \
| mlir-opt -convert-arith-to-llvm \
| mlir-opt -convert-scf-to-cf \
| mlir-opt -convert-vector-to-llvm \
| mlir-opt -finalize-memref-to-llvm \
| mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))' \
| mlir-opt -gpu-async-region -gpu-to-llvm \
| mlir-opt -convert-func-to-llvm \
| mlir-opt -reconcile-unrealized-casts 

It produces the following error:

LLVM ERROR: Cannot select: t34: i64,ch = dynamic_stackalloc t0, t33, Constant:i64<0>
  t33: i64 = and t31, Constant:i64<-8>
    t31: i64 = add nuw t152, Constant:i64<7>
      t152: i64,ch = load<(dereferenceable invariant load (s64) from `ptr addrspace(101) null`, addrspace 101)> t0, TargetExternalSymbol:i64'main_kernel_param_0', undef:i64
        t1: i64 = TargetExternalSymbol'main_kernel_param_0'
        t3: i64 = undef
      t30: i64 = Constant<7>
    t32: i64 = Constant<-8>
  t2: i64 = Constant<0>
In function: main_kernel
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: mlir-opt -pass-pipeline=builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))
1.      Running pass 'Function Pass Manager' on module 'LLVMDialectModule'.
2.      Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@main_kernel'
...
[1]    3361632 done                           wafer-opt dev-alloc.mlir --gpu-kernel-outlining | 
       3361633 done                           wafer-opt -convert-arith-to-llvm | 
       3361634 done                           wafer-opt -convert-scf-to-cf | 
       3361635 done                           wafer-opt -convert-vector-to-llvm | 
       3361636 done                           wafer-opt -finalize-memref-to-llvm | 
       3361637 IOT instruction (core dumped)  wafer-opt  | 
       3361638 done                           wafer-opt -gpu-async-region -gpu-to-llvm | 
       3361639 done                           wafer-opt -convert-func-to-llvm | 
       3361640 done                           wafer-opt -reconcile-unrealized-casts |

The trigger of this bug is quite intricate; reproducing it seems to require all of the following conditions:

  • a memref.alloca op must be present inside the kernel, giving us a memref named %alloca;
  • a loop (scf.for) with loop-carried values (iter_args) must be present inside the kernel;
  • inside the loop, some loads must be performed from %alloca, and the loaded values must somehow form the result of the loop;
  • the bug is triggered when the result produced by the loop is stored (with either memref.store or vector.store) into device memory allocated by gpu.alloc.

I am really confused about how this could happen and am looking forward to your help. Thanks in advance! (LLVM version llvmorg-17.0.6, CUDA version release 12.0, V12.0.76)

To investigate further: note first that this will be an LLVM bug, not an MLIR one. More precisely, this is either
a) a bug or missing feature in the LLVM NVPTX backend, or
b) a case where you have requested something that isn’t possible, and you get these error messages because your IR can’t be compiled.

Now, as to debugging:

Firstly, note that the gpu-to-cubin pass has been deprecated upstream for several months now - see 1828deb7524f8d371825dfa6bd39cc17b7247e54 and f204aee1b9173ed9ae72017808f0a379c3a8de7a. Please move to, for example, -gpu-lower-to-nvvm-pipeline if it is present in your release; if not, you are looking to replace gpu-to-cubin with nvvm-attach-target,gpu-module-to-binary in that pipeline, as sketched below.
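
For concreteness, here is a rough sketch of what the changed pipeline could look like, assuming nvvm-attach-target and gpu-module-to-binary exist in your build (the exact pass placement, nesting, and options may differ between releases):

$ mlir-opt test.mlir --gpu-kernel-outlining \
| mlir-opt -convert-arith-to-llvm \
| mlir-opt -convert-scf-to-cf \
| mlir-opt -convert-vector-to-llvm \
| mlir-opt -finalize-memref-to-llvm \
| mlir-opt -pass-pipeline='builtin.module(nvvm-attach-target,gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts))' \
| mlir-opt -gpu-async-region -gpu-to-llvm \
| mlir-opt -gpu-module-to-binary \
| mlir-opt -convert-func-to-llvm \
| mlir-opt -reconcile-unrealized-casts

With -gpu-lower-to-nvvm-pipeline, most of this collapses into a single invocation, e.g. mlir-opt test.mlir -gpu-lower-to-nvvm-pipeline, with details such as the target chip passed as pipeline options.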

Once you’ve done that, you can use --debug-only=serialize-to-llvm to get a dump of the LLVM IR before any optimizations or backend compilation are run. That will let you move debugging of this into LLVM.
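
For example, something like the following (a sketch: the earlier passes are elided, llvm-ir-dump.txt is just an arbitrary file name, --debug-only requires an assertions-enabled build of mlir-opt, and the debug output goes to stderr):

$ ... | mlir-opt -gpu-module-to-binary --debug-only=serialize-to-llvm 2> llvm-ir-dump.txt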

(Alternatively, you can patch SerializeToBlob.cpp to sneak in an llvmModule.dump() right before auto error = transformer(&llvmModule);, which will also get you the pre-optimization LLVM IR.)

PTX did not provide any way to dynamically allocate stack memory until PTX 7.3 (CUDA 11.3), and the NVPTX back-end in LLVM still does not implement that instruction. So the failure to lower dynamic_stackalloc is expected.

On a side note, one of the reasons nobody is in a rush to implement it is that using the stack on a GPU is probably one of the easiest ways to kill performance. While adding support for dynamic_stackalloc would be nice for completeness’ sake, if you intend to use it somewhere in a hot path, it may not work all that well in practice. We almost always do everything possible to eliminate stack use wherever GPU performance matters.


Also, I’d like to point out an easy workaround here: memref.alloca() : memref<20xi8>

The size is constant, might as well put it in the type.

And if you need that to look like a memref<?xi8> for some reason, memref.cast is right there.
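
Applied to the reproducer, the relevant lines inside the kernel would look roughly like this (a sketch; everything else in test.mlir stays unchanged, and the vector.load inside the loop can keep using the memref<?xi8> view):

            // The size is part of the type, so this lowers to an alloca with a
            // constant size rather than a dynamic stack allocation.
            %alloca_static = memref.alloca() : memref<20xi8>
            // Only needed if later code really wants a dynamically-shaped memref.
            %alloca = memref.cast %alloca_static : memref<20xi8> to memref<?xi8>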

Thanks for your reply!

This is a really detailed guide for me! However, the current latest LLVM release does not support either -gpu-lower-to-nvvm-pipeline or the nvvm-attach-target,gpu-module-to-binary passes. Still, it is really helpful for preparing for the next LLVM release.

Thanks for the explanation of the motivation. It shows that dynamically allocating memory inside a GPU kernel is discouraged and not performance-friendly, and that stack memory should be avoided as much as possible. Unfortunately, this usage does not seem to be something we can eliminate in our scenario yet, although we are trying hard to work around the issue.

I have tried this approach and it works! Thanks for the solution! I am also curious about the underlying behavior of the different ways of allocating: what is the behavior of the llvm.alloca instruction when the size is constant?