Hello everyone!
Since I raised this discussion several months ago, I have been keeping an eye on progress on the topic of "Allocating memory inside a GPU kernel". Recently I found that the latest LLVM release (`llvmorg-17.0.6`) supports `memref.alloca` inside GPU kernels, which covers most use cases of the `gpu` dialect!
However, for my own use case I still hit a tricky corner-case bug, which I will show below.
I have a minimal piece of code, `test.mlir`, that reproduces the bug:
```mlir
module {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c20 = arith.constant 20 : index
    %c1i8 = arith.constant 1 : i8
    %dev0 = gpu.alloc (%c20) : memref<?xi8>
    gpu.launch blocks(%bx, %by, %bz) in (%x = %c1, %y = %c1, %z = %c1)
               threads(%tx, %ty, %tz) in (%o = %c1, %p = %c1, %q = %c1) {
      // memref.copy %dev0, %dev1 : memref<?xindex> to memref<?xindex>
      %alloca = memref.alloca (%c20) : memref<?xi8>
      %t = vector.splat %c1i8 : vector<20xi8>
      %20 = scf.for %arg16 = %c0 to %c20 step %c1 iter_args(%arg17 = %t) -> (vector<20xi8>) {
        %34 = vector.load %alloca[%arg16] : memref<?xi8>, vector<20xi8>
        scf.yield %34 : vector<20xi8>
      }
      %30 = vector.extractelement %20[%c0 : index] : vector<20xi8>
      memref.store %30, %dev0[%c0] : memref<?xi8> // <== store method 1
      // vector.store %20, %dev0[%c1] : memref<?xi8>, vector<20xi8> // <== store method 2
      gpu.terminator
    }
    return
  }
}
```
I am trying to lower it to LLVM IR with the following pass pipeline:
```shell
$ mlir-opt test.mlir --gpu-kernel-outlining \
  | mlir-opt -convert-arith-to-llvm \
  | mlir-opt -convert-scf-to-cf \
  | mlir-opt -convert-vector-to-llvm \
  | mlir-opt -finalize-memref-to-llvm \
  | mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))' \
  | mlir-opt -gpu-async-region -gpu-to-llvm \
  | mlir-opt -convert-func-to-llvm \
  | mlir-opt -reconcile-unrealized-casts
```
This produces the following error:
```
LLVM ERROR: Cannot select: t34: i64,ch = dynamic_stackalloc t0, t33, Constant:i64<0>
  t33: i64 = and t31, Constant:i64<-8>
    t31: i64 = add nuw t152, Constant:i64<7>
      t152: i64,ch = load<(dereferenceable invariant load (s64) from `ptr addrspace(101) null`, addrspace 101)> t0, TargetExternalSymbol:i64'main_kernel_param_0', undef:i64
        t1: i64 = TargetExternalSymbol'main_kernel_param_0'
        t3: i64 = undef
      t30: i64 = Constant<7>
    t32: i64 = Constant<-8>
  t2: i64 = Constant<0>
In function: main_kernel
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: mlir-opt -pass-pipeline=builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))
1.      Running pass 'Function Pass Manager' on module 'LLVMDialectModule'.
2.      Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@main_kernel'
...
```
The shell job status confirms which stage of the pipeline crashed:
```
[1]  3361632 done       wafer-opt dev-alloc.mlir --gpu-kernel-outlining |
     3361633 done       wafer-opt -convert-arith-to-llvm |
     3361634 done       wafer-opt -convert-scf-to-cf |
     3361635 done       wafer-opt -convert-vector-to-llvm |
     3361636 done       wafer-opt -finalize-memref-to-llvm |
     3361637 IOT instruction (core dumped)  wafer-opt |
     3361638 done       wafer-opt -gpu-async-region -gpu-to-llvm |
     3361639 done       wafer-opt -convert-func-to-llvm |
     3361640 done       wafer-opt -reconcile-unrealized-casts |
```
The trigger of the bug is really complicated and tricky; it seems to require all of the following conditions:

- a `memref.alloca` op must be present inside the kernel, giving us a memref named `%alloca`;
- a loop (`scf.for`) with loop-carried values (`iter_args`) must be present inside the kernel;
- inside the loop, some load operations must read from `%alloca`, and the loaded values must somehow form the result of the loop;
- when the result delivered by the loop is stored (with either `memref.store` or `vector.store`) into device memory (allocated by `gpu.alloc`), the bug is triggered.
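If it helps with diagnosis: my own guess (untested, so please correct me) is that the dynamic allocation size is the culprit. After `--gpu-kernel-outlining`, the `%c20` feeding `memref.alloca` becomes a kernel argument, which seems to match the `main_kernel_param_0` symbol in the DAG dump above, so the alloca size is no longer a compile-time constant, and NVPTX instruction selection apparently has no pattern for the resulting `dynamic_stackalloc` node. If that is right, giving the alloca a static size should sidestep the crash; a sketch of what I mean:

```mlir
// Hypothetical workaround (untested): make the alloca size static so the
// NVPTX backend never sees a dynamic_stackalloc node.
%alloca_s = memref.alloca() : memref<20xi8>
// If later ops still expect a dynamically shaped memref, cast it back:
%alloca = memref.cast %alloca_s : memref<20xi8> to memref<?xi8>
```

Of course this only works when the size is known at compile time, which happens to be true in my reproducer; I would still like to know whether a dynamically sized `memref.alloca` inside `gpu.launch` is supposed to be supported at all.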
I am really confused about how this could happen, and I am looking forward to your help. Thanks in advance! (LLVM version `llvmorg-17.0.6`, CUDA version `release 12.0, V12.0.76`.)