How to allocate memory inside gpu kernel function

Hello guys, I have been working with the gpu dialect for a while. A new problem hit me when I tried to allocate memory inside a gpu.launch op. The test code is below:

module {
    func.func @main() {
        %c1 = arith.constant 1 : index
        %c20 = arith.constant 20 : index
        gpu.launch blocks(%bx, %by, %bz) in (%x = %c1, %y = %c1, %z = %c1)
                   threads(%tx, %ty, %tz) in (%o = %c1, %p = %c1, %q = %c1) {
            %mem0 = memref.alloc (%c20) : memref<?xindex>
            %mem1 = memref.alloc (%c20) : memref<?xindex>     
            gpu.terminator
        }
        return
    }
}

when I lower it with the following pipeline:

mlir-opt dev-alloc.mlir --gpu-kernel-outlining | mlir-opt -convert-memref-to-llvm | mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))' | mlir-opt -gpu-async-region -gpu-to-llvm | mlir-opt -convert-func-to-llvm | mlir-cpu-runner --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_cuda_runtime.so --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_c_runner_utils.so --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_runner_utils.so --entry-point-result=void -O0

everything works fine. However, when I tried to add a memref.store or memref.copy after the memref.alloc op, like this:

module {
    func.func @main() {
        %c1 = arith.constant 1 : index
        %c20 = arith.constant 20 : index
        gpu.launch blocks(%bx, %by, %bz) in (%x = %c1, %y = %c1, %z = %c1)
                   threads(%tx, %ty, %tz) in (%o = %c1, %p = %c1, %q = %c1) {
            %mem0 = memref.alloc (%c20) : memref<?xindex>
            memref.store %c1, %mem0[%c1] : memref<?xindex>    // added here
            gpu.terminator
        }
        return
    }
}

an error occurred:

<stdin>:24:16: error: 'llvm.call' op 'malloc' does not reference a symbol in the current scope
      %alloc = memref.alloc(%arg0) : memref<?xindex>
               ^
<stdin>:24:16: note: see current operation: %18 = "llvm.call"(%17) {callee = @malloc, fastmathFlags = #llvm.fastmath<none>} : (i64) -> !llvm.ptr<i8>
Error: entry point not found

when I replaced memref.alloc with gpu.alloc (which also worked fine when there was no memref.store op):

module {
    func.func @main() {
        %c1 = arith.constant 1 : index
        %c20 = arith.constant 20 : index
        gpu.launch blocks(%bx, %by, %bz) in (%x = %c1, %y = %c1, %z = %c1)
                   threads(%tx, %ty, %tz) in (%o = %c1, %p = %c1, %q = %c1) {
            %mem0 = gpu.alloc (%c20) : memref<?xindex>   // replaced here
            memref.store %c1, %mem0[%c1] : memref<?xindex>
            gpu.terminator
        }
        return
    }
}

another error occurred:

<unknown>:0: error: failed to legalize operation 'gpu.alloc' that was explicitly marked illegal
<unknown>:0: note: see current operation: %1 = "gpu.alloc"(<<UNKNOWN SSA VALUE>>) {operand_segment_sizes = array<i32: 0, 1, 0>} : (index) -> memref<?xindex>
Error: entry point not found

It seems that memref.alloc inside the kernel still tries to call the malloc function provided by the host, and gpu.alloc is something like cudaMalloc, which is also provided by the CUDA library for the host to call.

So what is the proper way to allocate a piece of memory inside a gpu kernel function? (I’m using llvmorg-16.0.6)

What kind of allocation are you trying to do here? In general, dynamically allocating memory from the GPU is not possible in CUDA or other GPU APIs. You can allocate stack/private memory using memref.alloca, or you can use shared memory by creating a global variable in the right address space.

Neither of those will allow you to do malloc kind of allocations though.
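
For instance, both options can be expressed with memory attributions on a gpu.func. This is a sketch I haven’t run; 3 and 5 are the numeric workgroup and private address spaces the gpu dialect uses:

```mlir
gpu.module @kernels {
  // `workgroup` attributions lower to shared memory, `private` attributions
  // to per-thread memory; both require static shapes.
  gpu.func @kernel()
      workgroup(%shared : memref<20xindex, 3>)
      private(%priv : memref<20xindex, 5>)
      kernel {
    %c1 = arith.constant 1 : index
    memref.store %c1, %shared[%c1] : memref<20xindex, 3>
    memref.store %c1, %priv[%c1] : memref<20xindex, 5>
    gpu.return
  }
}
```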

I want some private temporary memory on GPU to store some results.

Actually, I wrote a piece of CUDA C++ code that dynamically allocates memory inside a kernel using the new keyword, and it works fine (CUDA 11.8, NVIDIA GeForce RTX 2060 SUPER):

#include <stdio.h>

__global__ void myKernel(int n) {
    int* test = new int[n];
    test[threadIdx.x] = 1024;
    printf("test[0]=%d from block %d, thread %d\n", test[0], blockIdx.x, threadIdx.x);
    printf("test[1]=%d from block %d, thread %d\n", test[1], blockIdx.x, threadIdx.x);
    delete[] test;
}

int main() {
    myKernel<<<1, 2>>>(100);
    cudaDeviceSynchronize();
    return 0;
}

and it outputs:

test[0]=1024 from block 0, thread 0
test[0]=0 from block 0, thread 1
test[1]=0 from block 0, thread 0
test[1]=1024 from block 0, thread 1

which means it does dynamically allocate private memory on the GPU.

I tried the memref.alloca op inside the kernel, but it crashed:

LLVM ERROR: Cannot select: t8: i64,ch = dynamic_stackalloc t0, Constant:i64<1280>, Constant:i64<0>
  t4: i64 = Constant<1280>
  t1: i64 = Constant<0>
In function: main_kernel
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: wafer-opt -pass-pipeline=builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))
1.      Running pass 'Function Pass Manager' on module 'LLVMDialectModule'.
2.      Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@main_kernel'
 #0 0x000056187c5e6ab6 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x2bcbab6)
 #1 0x000056187c5e4404 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f6f52242520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x00007f6f52296a7c __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #4 0x00007f6f52296a7c __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #5 0x00007f6f52296a7c pthread_kill ./nptl/pthread_kill.c:89:10
 #6 0x00007f6f52242476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #7 0x00007f6f522287f3 abort ./stdlib/abort.c:81:7
 #8 0x0000561879bf0d2e llvm::json::OStream::value(llvm::json::Value const&) (.cold) JSON.cpp:0:0
 #9 0x000056187b420b1d llvm::SelectionDAGISel::CannotYetSelect(llvm::SDNode*) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1a05b1d)
#10 0x000056187b4230ea llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1a080ea)
#11 0x000056187b41dd2a llvm::SelectionDAGISel::DoInstructionSelection() (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1a02d2a)
#12 0x000056187b42b609 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1a10609)
#13 0x000056187b42f414 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1a14414)
#14 0x000056187b4312c2 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (.part.0) SelectionDAGISel.cpp:0:0
#15 0x000056187b6488c5 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (.part.0) MachineFunctionPass.cpp:0:0
#16 0x000056187c1f4263 llvm::FPPassManager::runOnFunction(llvm::Function&) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x27d9263)
#17 0x000056187c1f4499 llvm::FPPassManager::runOnModule(llvm::Module&) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x27d9499)
#18 0x000056187c1f4a05 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x27d9a05)
#19 0x0000561879fd666b mlir::gpu::SerializeToBlobPass::translateToISA[abi:cxx11](llvm::Module&, llvm::TargetMachine&) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x5bb66b)
#20 0x0000561879fd68cc mlir::gpu::SerializeToBlobPass::runOnOperation() (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x5bb8cc)
#21 0x000056187c345881 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x292a881)
#22 0x000056187c345f59 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x292af59)
#23 0x000056187c344681 mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x2929681)
#24 0x000056187c34572e mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x292a72e)
#25 0x000056187c345f59 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x292af59)
#26 0x000056187c346bb9 mlir::PassManager::run(mlir::Operation*) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x292bbb9)
#27 0x000056187b091efb performActions(llvm::raw_ostream&, bool, bool, std::shared_ptr<llvm::SourceMgr> const&, mlir::MLIRContext*, llvm::function_ref<mlir::LogicalResult (mlir::PassManager&)>, bool, bool) (.constprop.0) MlirOptMain.cpp:0:0
#28 0x000056187b09250f processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, bool, bool, bool, bool, bool, bool, llvm::function_ref<mlir::LogicalResult (mlir::PassManager&)>, mlir::DialectRegistry&, llvm::ThreadPool*) MlirOptMain.cpp:0:0
#29 0x000056187b0927ab mlir::LogicalResult llvm::function_ref<mlir::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>::callback_fn<mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<mlir::LogicalResult (mlir::PassManager&)>, mlir::DialectRegistry&, bool, bool, bool, bool, bool, bool, bool)::'lambda'(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>(long, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&) MlirOptMain.cpp:0:0
#30 0x000056187c510e8e mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<mlir::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, bool, bool) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x2af5e8e)
#31 0x000056187b090ba0 mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&, bool, bool, bool, bool, bool, bool, bool, bool) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1675ba0)
#32 0x000056187b092c0d mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&, bool) (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1677c0d)
#33 0x0000561879bf0f2c main (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x1d5f2c)
#34 0x00007f6f52229d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#35 0x00007f6f52229e40 call_init ./csu/../csu/libc-start.c:128:20
#36 0x00007f6f52229e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#37 0x0000561879c44485 _start (/wafer/tzy/wafer-compiler/build/bin/wafer-opt+0x229485)
Error: entry point not found

Yes, for private memory it is simpler, as long as this is memory which is only live during one kernel invocation.

I think the crash points to NVPTX not supporting dynamically sized alloca. Can you try with a constant size defined within the kernel?
If it works and you need dynamically sized alloca, it will need to be fixed in the NVPTX backend. CUDA C++ uses a different backend which is not open source.

It seems that even with a constant size, it still crashes:

module {
    func.func @main() {
        %c1 = arith.constant 1 : index
        %c20 = arith.constant 20 : index
        gpu.launch blocks(%bx, %by, %bz) in (%x = %c1, %y = %c1, %z = %c1)
                   threads(%tx, %ty, %tz) in (%o = %c1, %p = %c1, %q = %c1) {
            %mem0 = memref.alloca () : memref<20xindex>   // constant size
            memref.store %c1, %mem0[%c1] : memref<20xindex>
            gpu.terminator
        }
        return
    }
}

LLVM ERROR: Cannot select: t8: i64,ch = dynamic_stackalloc t0, Constant:i64<1280>, Constant:i64<0>
  t4: i64 = Constant<1280>
  t1: i64 = Constant<0>
In function: main_kernel
[same stack dump as the first crash]
Error: entry point not found

Thanks for telling me. No wonder the two programs act differently.

I’m surprised that would fail. It should actually trivially become dead at the LLVM IR level. Can you share the generated LLVM IR?

Can you elaborate on the “not open source” part?

CUDA compilation in clang is as open source as it gets.
Dynamic memory allocation on the GPU side is possible – just call malloc there. It may be prohibitively expensive, as, presumably, the allocation would need to be synchronized with the host-side CUDA runtime’s idea of the GPU memory allocation state.
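
For reference, a minimal sketch of device-side malloc (my own illustration, not a snippet from any particular SDK sample; device-side malloc/free have been available since compute capability 2.0, and the device heap size can be raised with cudaDeviceSetLimit if needed):

```cuda
#include <cstdio>

__global__ void mallocKernel(int n) {
    // Per-thread heap allocation from the device-side heap.
    int *buf = (int *)malloc(n * sizeof(int));
    if (buf == nullptr) return;  // the device-side allocation can fail
    buf[threadIdx.x] = 1024;
    printf("buf[%d]=%d\n", threadIdx.x, buf[threadIdx.x]);
    free(buf);  // freed on the device as well
}

int main() {
    // Optionally grow the device heap before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
    mallocKernel<<<1, 2>>>(100);
    cudaDeviceSynchronize();
    return 0;
}
```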

Dynamic stack allocation is technically possible with recent CUDA versions, but it’s not currently implemented in the NVPTX back-end. See PTX ISA 8.3.

I was talking about NVIDIA’s CUDA stack, which uses the nvvm compiler. I assumed that’s what was being used there, but I could be wrong.

Can malloc be called from the kernel side? I thought this was a host-only API, but I tend to only use the low-level CUDA API, so maybe this has been added and I don’t know about it.

Good to know; that’s probably what is causing the crash mentioned above.

Yes. It’s one of the very few GPU-side APIs provided by NVIDIA. 1. Introduction — PTX Interoperability 12.3 documentation

Cool! That’s what we should lower memref.alloc to then.

It cannot be lowered to LLVM IR; the process failed when applying the -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))' pass.

So is it a confirmed bug? Should I open an issue or something to help you?

There are two issues at play here:

  1. Using cudaMalloc inside a kernel requires nvlink, and there’s no support for it in MLIR right now. It’s on my todo list, but I don’t know when I’ll get to it.
  2. AFAIK there’s no support for hipMalloc on AMD devices, so I don’t know if we’ll ever have gpu.alloc inside device code upstreamed; right now the CUDA & HIP wrappers only have host wrappers.

Patches are always welcome.

Thanks for your reply!

I had been treating cudaMalloc as a host API and never expected it to be usable inside a kernel with the help of nvlink. What’s the difference between “cudaMalloc inside a kernel” and “new inside a kernel”?

In the gpu dialect, I still treat gpu.alloc as a host API, so maybe using memref.alloc or memref.alloca inside a kernel sounds more reasonable?

My knowledge is still too shallow in this area; maybe I need to dive deeper before committing patches.

cudaMalloc can be called inside a kernel, and that requires linking with nvlink, but you’re right, malloc should be available too. My comments above apply only to cudaMalloc.

Making memref.alloc work should be doable. I’ll take a look later to see what’s happening.

If you want private memory, you’ll want to use memref.alloca while explicitly setting the memory space to #gpu.address_space&lt;private&gt;.

Or you can add a private memory attribution to the function.
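
An untested sketch of the first option, assuming your MLIR is recent enough to have the #gpu.address_space attribute:

```mlir
%c1 = arith.constant 1 : index
// Static-shape alloca tagged with the gpu private address space, so the
// lowering can place it in the right address space instead of a generic one.
%priv = memref.alloca() : memref<20xindex, #gpu.address_space<private>>
memref.store %c1, %priv[%c1] : memref<20xindex, #gpu.address_space<private>>
```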

Thanks a lot for checking it out! It would be a great help if memref.alloc could be made available!

Thanks for your suggestion! But I’m not so familiar with private memory attributions. Could you please tell me where I can find an example, or give me an example code snippet?

See LaunchOp::addPrivateAttribution(Type type, Location loc);

The type will need to be a memref built as follows:

auto privateAttr = builder.getAttr<gpu::AddressSpaceAttr>(
        gpu::GPUDialect::getPrivateAddressSpace());
Type allocaType = MemRefType::get(shape, elementType, {}, privateAttr);

You could also pass allocaType as the type for memref::AllocaOp

The important thing is the indication that you’re going for the private memory space.

Thanks a lot! I will try it in my code!