memref.alloca in AMD GPU kernels seems to lower to llvm.alloca with an incorrect address space

Hi, I hope the title is accurate and explicit enough; if it isn’t, feel free to correct it.

When lowering a gpu kernel with a memref.alloca inside of it to rocdl, MLIR seems to generate llvm.alloca operations for the wrong address space.

For example, for this code :

// File test.mlir
module attributes {gpu.container_module} {
  llvm.func @main() {
    %1 = arith.constant 1 : index
    gpu.launch_func @test_func::@test_func blocks in (%1, %1, %1) threads in (%1, %1, %1)
    llvm.return
  }
  gpu.module @test_func {
    gpu.func @test_func () kernel {
      %0 = memref.alloca() : memref<1xi8>
      %1 = arith.constant 0 : i8
      %2 = arith.constant 0 : index
      memref.store %1, %0[%2] : memref<1xi8>
      gpu.return
    }
  }
}

I’m using mlir-opt and mlir-translate like this to get LLVMIR :

mlir-opt -convert-gpu-to-rocdl -gpu-to-hsaco='chip=gfx906' -gpu-to-llvm test.mlir -gpu-to-llvm | mlir-translate -mlir-to-llvmir -o test.ll

Which produces the following main function in test.ll:

// ...
define void @main() !dbg !3 {                                                                                                                                                                  
  %1 = call ptr @mgpuModuleLoad(ptr @test_func_gpubin_cst), !dbg !7                                                                                                                            
  %2 = call ptr @mgpuModuleGetFunction(ptr %1, ptr @test_func_test_func_kernel_name), !dbg !9                                                                                                  
  %3 = call ptr @mgpuStreamCreate(), !dbg !10                                                                                                                                                  
  %4 = alloca %0, align 8, !dbg !11                                                                                                                                                            
  %5 = alloca ptr, i32 0, align 8, !dbg !12                                                                                                                                                    
  call void @mgpuLaunchKernel(ptr %2, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i32 0, ptr %3, ptr %5, ptr null), !dbg !13                                                                     
  call void @mgpuStreamSynchronize(ptr %3), !dbg !14                                                                                                                                           
  call void @mgpuStreamDestroy(ptr %3), !dbg !15
  call void @mgpuModuleUnload(ptr %1), !dbg !16
  ret void, !dbg !17                                                                                                                                                                           
} 
// ...

Then, using clang to compile test.ll, I get a segfault with the following stack trace :

clang-15: /home/racolin/microcard/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:497: void getCopyToParts(llvm::SelectionDAG&, const llvm::SDLoc&, llvm::SDValue, llvm::SDValue*, unsigned int, llvm::MVT, const llvm::Value*, llvm::Optional<unsigned int>, llvm::ISD::NodeType): Assertion `NumParts == 1 && "No-op copy with multiple parts!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.   
Stack dump:
0.      Program arguments: /home/racolin/microcard/builds/llvm/bin/clang-15 -cc1 -triple amdgcn -S -disable-free -clear-ast-before-backend -main-file-name test.final.ll -mrelocation-model pic -pic-level 1 -fhalf-no-semantic-interposition -mframe-pointer=all -fmath-errno -ffp-contract=on -fno-rounding-math -fno-verbose-asm -no-integrated-as -mconstructor-aliases -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -fno-dwarf-directory-asm -resource-dir /home/racolin/microcard/builds/llvm/lib/clang/15.0.2 -fdebug-compilation-dir=/home/racolin/microcard/tmp -ferror-limit 19 -fgnuc-version=4.2.1 -fcolor-diagnostics -o /tmp/test-f7a861.s -x ir test.final.ll
1.      Code generation                                                                                                                                                                        
2.      Running pass 'CallGraph Pass Manager' on module 'test.final.ll'.                                                                                                                       
3.      Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@main'                                                                                                       
 #0 0x00007f3e13514f41 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x00007f3e13512754 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f3e17cce140 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x13140)
 #3 0x00007f3e12e0ece1 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:51:1                                                                                                                  
 #4 0x00007f3e12df8537 abort ./stdlib/abort.c:81:7                                                                                                                                             
 #5 0x00007f3e12df840f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8                                                                                                                      
 #6 0x00007f3e12df840f _nl_load_domain ./intl/loadmsgcat.c:970:34                                                                                                                              
 #7 0x00007f3e12e07662 (/lib/x86_64-linux-gnu/libc.so.6+0x31662)                                                                                                                               
 #8 0x00007f3e128a08d1 getCopyToParts(llvm::SelectionDAG&, llvm::SDLoc const&, llvm::SDValue, llvm::SDValue*, unsigned int, llvm::MVT, llvm::Value const*, llvm::Optional<unsigned int>, llvm::ISD::NodeType) SelectionDAGBuilder.cpp:0:0
 #9 0x00007f3e128a5092 llvm::TargetLowering::LowerCallTo(llvm::TargetLowering::CallLoweringInfo&) const (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x23b092)
#10 0x00007f3e128b20af llvm::SelectionDAGBuilder::lowerInvokable(llvm::TargetLowering::CallLoweringInfo&, llvm::BasicBlock const*) (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x2480af)
#11 0x00007f3e128d07de llvm::SelectionDAGBuilder::LowerCallTo(llvm::CallBase const&, llvm::SDValue, bool, bool, llvm::BasicBlock const*) (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x2667de)
#12 0x00007f3e128bfa5d llvm::SelectionDAGBuilder::visitCall(llvm::CallInst const&) (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x255a5d)
#13 0x00007f3e128ed5a9 llvm::SelectionDAGBuilder::visit(llvm::Instruction const&) (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x2835a9)
#14 0x00007f3e1296dd28 llvm::SelectionDAGISel::SelectBasicBlock(llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void>, false, true>, llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void>, false, true>, bool&) (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x303d28)
#15 0x00007f3e1296ef78 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (/home/racolin/microcard/builds/llvm/lib/libLLVMSelectionDAG.so.15+0x304f78)
#16 0x00007f3e12970f82 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (.part.0) SelectionDAGISel.cpp:0:0
#17 0x00007f3e167d2af5 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (.part.0) MachineFunctionPass.cpp:0:0
#18 0x00007f3e1403416b llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/racolin/microcard/builds/llvm/lib/libLLVMCore.so.15+0x25e16b)
#19 0x00007f3e1461dbf2 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) CallGraphSCCPass.cpp:0:0
#20 0x00007f3e14034c13 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/racolin/microcard/builds/llvm/lib/libLLVMCore.so.15+0x25ec13)
#21 0x00007f3e16e9ecbc clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) (/home/racolin/microcard/builds/llvm/lib/libclangCodeGen.so.15+0xdecbc)
#22 0x00007f3e172a121e clang::CodeGenAction::ExecuteAction() (/home/racolin/microcard/builds/llvm/lib/libclangCodeGen.so.15+0x4e121e)
#23 0x00007f3e15becb59 clang::FrontendAction::Execute() (/home/racolin/microcard/builds/llvm/lib/libclangFrontend.so.15+0x11ab59)
#24 0x00007f3e15b6ae16 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/racolin/microcard/builds/llvm/lib/libclangFrontend.so.15+0x98e16)
#25 0x00007f3e17cb7428 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/racolin/microcard/builds/llvm/lib/libclangFrontendTool.so.15+0x4428)
#26 0x0000000000413f0d cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/racolin/microcard/builds/llvm/bin/clang-15+0x413f0d)
#27 0x000000000040dae0 ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) driver.cpp:0:0
#28 0x0000000000410679 clang_main(int, char**) (/home/racolin/microcard/builds/llvm/bin/clang-15+0x410679)
#29 0x00007f3e12df9d0a __libc_start_main ./csu/../csu/libc-start.c:308:16
#30 0x000000000040d06a _start (/home/racolin/microcard/builds/llvm/bin/clang-15+0x40d06a)
clang-15: error: unable to execute command: Aborted
clang-15: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 15.0.2 (https://github.com/llvm/llvm-project.git 4bd3f3759259548e159aeba5c76efb9a0864e6fa)
Target: amdgcn
Thread model: posix
InstalledDir: /home/racolin/microcard/builds/llvm/bin
clang-15: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.

Modifying the above function by adding addrspace(5) to the alloca operation and then casting the pointers solves the segfault :

// ...
define void @main() !dbg !3 {                                                                                                                                                                  
  %1 = call ptr @mgpuModuleLoad(ptr @test_func_gpubin_cst), !dbg !7                                                                                                                            
  %2 = call ptr @mgpuModuleGetFunction(ptr %1, ptr @test_func_test_func_kernel_name), !dbg !9                                                                                                  
  %3 = call ptr @mgpuStreamCreate(), !dbg !10
  ; HERE
  %a = alloca %0, align 8, addrspace(5), !dbg !11
  %b = alloca ptr, i32 0, align 8, addrspace(5), !dbg !12
  %4 = addrspacecast ptr addrspace(5) %a to ptr
  %5 = addrspacecast ptr addrspace(5) %b to ptr                                                                                                                                                
  call void @mgpuLaunchKernel(ptr %2, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i32 0, ptr %3, ptr %5, ptr null), !dbg !13                                                                     
  call void @mgpuStreamSynchronize(ptr %3), !dbg !14                                                                                                                                           
  call void @mgpuStreamDestroy(ptr %3), !dbg !15
  call void @mgpuModuleUnload(ptr %1), !dbg !16
  ret void, !dbg !17                                                                                                                                                                           
} 
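
For anyone who wants to reproduce the manual edit above on a larger file, here is a minimal text-level sketch in Python. It is only an illustration of the workaround, not a substitute for fixing the lowering: the function name `move_allocas_to_private` and the `%priv.` prefix are made up for this example, and it assumes one instruction per line, unquoted value names, and opaque `ptr` types as in the IR shown here.

```python
import re

# Matches a line that defines a value with alloca, e.g.
#   "  %4 = alloca %0, align 8, !dbg !11"
ALLOCA_RE = re.compile(r"^(?P<indent>\s*)%(?P<name>[-\w.$]+) = alloca (?P<rest>.*)$")

# An alloca's own address space is written after a comma: ", addrspace(5)".
HAS_ADDRSPACE_RE = re.compile(r",\s*addrspace\(")


def move_allocas_to_private(ir_text: str, addrspace: int = 5) -> str:
    """Rewrite default-address-space allocas (hypothetical helper).

    Each alloca is renamed to %priv.<name>, placed in `addrspace`, and followed
    by an addrspacecast that re-creates the original value name as a plain ptr.
    """
    out = []
    for line in ir_text.splitlines():
        m = ALLOCA_RE.match(line)
        if m is None or HAS_ADDRSPACE_RE.search(m.group("rest")):
            out.append(line)  # not an alloca, or it already has an address space
            continue
        indent, name = m.group("indent"), m.group("name")
        # Keep trailing metadata such as "!dbg !11" after the new addrspace clause.
        body, sep, meta = m.group("rest").rstrip().partition(", !")
        private = f"priv.{name}"
        new_alloca = f"{indent}%{private} = alloca {body}, addrspace({addrspace})"
        if sep:
            new_alloca += f", !{meta}"
        out.append(new_alloca)
        out.append(
            f"{indent}%{name} = addrspacecast ptr addrspace({addrspace}) %{private} to ptr"
        )
    return "\n".join(out)
```

The cast reuses the original value name, so later uses such as the `ptr %5` argument of `mgpuLaunchKernel` do not need to be touched.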

I have several questions :

  • am I doing something wrong ?
  • should MLIR generate alloca operations that use addrspace(5) ?
  • do you know why LLVM crashes when compiling code for the AMDGPU target with allocas in addrspace(0) instead of 5 ?

Thank you for your time. I hope the post is clear enough; if it isn’t, feel free to ask for clarification.

The alloca certainly should use addrspace(5). However, the DAG shouldn’t crash the way it does here just because the address space is wrong.


This looks like a selection DAG isel pass or an underlying utility bug: using the default address space should have led to a clean error somewhere.


Yes, I think so too. Do you think I should post an issue on the llvm github about this ? Or is there a more appropriate place ?

If I understand correctly, the convert-gpu-to-rocdl pass should lower memref.alloca to use addrspace(5). Should I post an issue about this on the github ?

Yes, a github issue would be the right way for the LLVM pass issue. For convert-gpu-to-rocdl, its authors/maintainers might already be on this forum (MLIR) - if you can tag them, that’ll help.

Hi @krzysz00, excuse me for tagging you but I looked a bit at the contributors for convert-gpu-to-rocdl and it seems that you could maybe help with this issue.

I can’t comment on the crash - I’m not familiar with the backend.

I do agree that an alloca should use address space 5 on AMDGPU to represent that sort of register/stack allocated value. I’ll be willing to approve a patch to that effect - the reason it never came up upstream is because we defined our own alloc operation that lowered to function attributions, not alloca.


Also, if you want to debug this a bit more, I’d suggest extracting the llvm.func from within the GPU module after -convert-gpu-to-rocdl and then running it through mlir-translate, opt, and llc
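
A rough sketch of that debugging flow, written as a small Python helper that only builds the command lines. The helper name `kernel_debug_commands` and the file names are hypothetical, extracting the llvm.func into its own file is still a manual step, and the choice of `-passes=verify` for the opt step is just an assumption to keep the example minimal.

```python
from typing import List


def kernel_debug_commands(kernel_mlir: str, chip: str = "gfx906") -> List[List[str]]:
    """Build the mlir-translate, opt and llc invocations for one kernel file.

    `kernel_mlir` is assumed to hold the llvm.func copied out of the gpu.module
    after -convert-gpu-to-rocdl has run.
    """
    stem = kernel_mlir.rsplit(".", 1)[0]
    return [
        # MLIR LLVM dialect -> LLVM IR
        ["mlir-translate", "--mlir-to-llvmir", kernel_mlir, "-o", f"{stem}.ll"],
        # Run the IR verifier only; swap in other passes as needed
        ["opt", "-S", "-passes=verify", f"{stem}.ll", "-o", f"{stem}.opt.ll"],
        # LLVM IR -> AMDGPU assembly for the requested chip
        ["llc", "-mtriple=amdgcn-amd-amdhsa", f"-mcpu={chip}", f"{stem}.opt.ll", "-o", f"{stem}.s"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, which makes it easy to see which of the three tools is the first to complain.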

Also, the LLVM IR you showed there is the host code - running that through the AMDGPU backend feels weird?

Thank you for your replies !

Now I’m curious about how you do this alloc operation, do you know where I can find the code for that ?

If you want a little more detail, the reason we stumbled upon this problem is that we obtain the MLIR GPU dialect code by converting an affine.for loop (convert-affine-for-to-gpu), and this loop contains memref.alloca operations.
We have a version for NVIDIA GPUs that comes from this converted affine.for loop and works fine, and it seemed weird that the same code but converted to AMDGPU wouldn’t work.

Yes, I agree that it does feel weird (even though the compiler shouldn’t just crash, but that’s already fixed: see llvm/llvm-project issue #59250 on GitHub, “Segfault when compiling for AMDGPU backend with alloca instructions using the default addrspace”).
In our actual application, this crash occurs during the gpu-to-hsaco pass, and at this point the alloca instruction is indeed in the gpu kernel function. I chose this simpler example because it crashes with the same stack trace.

Ok, so, first notes. The test.mlir file you posted runs fine with mlir-cpu-runner as follows

./bin/mlir-opt --convert-gpu-to-rocdl --gpu-to-hsaco=chip=gfx906 --gpu-to-llvm /tmp/external-bug.mlir | ./bin/mlir-cpu-runner --shared-libs=./lib/libmlir_rocm_runtime.so --entry-point-result=void

The code you were trying to run through clang is host code - feeding it to amdgcn may produce unexpected results. For all I know, the fact that there aren’t any kernels in test.ll could mean that alloca-related passes don’t run and so the crash happens.

… and now I’ve looked at the issue against LLVM and so most of the above is irrelevant.

Now, looking at GPUOpsLowering, the alloca calls inserted there explicitly do not specify addrspace(5), because NVPTX doesn’t like it when you do that. So, we may want some divergence between the NVPTX and AMDGPU lowerings so that an alloca going to AMDGPU is placed in the correct address space, which would prevent these sorts of bugs. (While you can pass around pointers to scratch space, it involves ORing in some constants, and it appears the compiler either isn’t built to handle that code or requires an explicit address space cast.)

So, I think the fix is to adjust the GPU lowering to set address space 5 on allocas that don’t have one … but only on AMDGPU, like you suggested.

Ok, on further investigation, the gpu.func to llvm.func converter does add address space 5 if you used a private memory attribution instead of alloca. So we probably want a pass on memref that adds address spaces to allocas and then to call it in --convert-gpu-to-rocdl

You should create correct IR in the first place. The datalayout tells you the address space to use for allocas if you’re creating them without additional context. Otherwise you have to have a hack like NVPTX does where it inserts addrspacecast pairs later
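
To make the datalayout point concrete: in an LLVM datalayout string, the `A<n>` component names the address space used for allocas, and address space 0 is assumed when it is absent. Below is a minimal Python sketch that reads it back out. The function name is made up for this example, and the strings in the usage note are abbreviated for illustration rather than copied from a real target.

```python
def alloca_address_space(datalayout: str) -> int:
    """Return the alloca address space named by an LLVM datalayout string.

    Components are separated by '-'. An uppercase 'A' followed by a number
    selects the alloca address space; lowercase 'a' is the aggregate alignment
    spec and is deliberately ignored. Address space 0 is the default.
    """
    for spec in datalayout.split("-"):
        if spec.startswith("A") and spec[1:].isdigit():
            return int(spec[1:])
    return 0
```

With an abbreviated AMDGPU-style string such as `"e-p:64:64-p5:32:32-S32-A5-G1"` this returns 5, while a string with no `A` component, such as `"e-m:e-i64:64-n8:16:32:64-S128"`, gives 0.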

Yeah, agreed that feeding the correct IR to LLVM needs to happen, it’s just that when someone calls for memref.alloca, what the correct IR is may well be platform-dependent if they didn’t explicitly consider address spaces.

Thanks to both of you again for taking the time to answer this :).

So should I (or somebody more knowledgeable) create an issue about this pass ?

Also, @krzysz00, could you please show me an example using a private memory attribution instead of alloca ? I’m curious about this and don’t exactly know what it looks like.

Might not be a bad idea to create an issue about the pass.

Now, as to those attributions, they look like this

gpu.func @func(%arg0: ...) workgroup(%arg1: memref<..., 3>) private(%arg2: memref<..., 5>) kernel {
   ...
}

and get rewritten into shared memory arguments and allocas, respectively, when lowering to LLVM.

Yes, in the meantime I read through the docs a bit more and now understand this better.

However, like I mentioned, we actually use the convert-affine-for-to-gpu pass to get the initial gpu.func, and the affine.for loop already contains the memref.allocas, which means the gpu.func we generate also contains those memref.allocas, and they never get converted to addrspace(5) obviously, hence the need to do that for AMD GPUs.

I will create an issue to suggest this and link to this topic.

In the meantime, I think we might write our own pass to convert to addrspace(5). We can’t just change the type of the original memrefs, since we also use the same base MLIR code to generate code for other targets besides AMDGPU.

Thanks again for your useful answers :)

On changing the memref type, the tricky part is handling the operations that use the changed type. Just search-and-replacing every memref.alloca to give its type addrspace(5) is liable to cause issues. For instance, view-like operations (such as subview) would also need an address space update. For calls, on the other hand, you probably want to add an addrspacecast, which doesn’t currently exist in memref, or maybe you don’t.
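
To illustrate the shape of that problem, here is a toy Python model. It does not use the MLIR API at all: the `Op` record, the `VIEW_LIKE` set and the decision to flag every `func.call` operand for a cast are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Ops assumed, for this toy model, to forward their source's address space.
VIEW_LIKE = {"memref.subview", "memref.view", "memref.reinterpret_cast"}


@dataclass
class Op:
    name: str               # operation name, e.g. "memref.alloca"
    result: Optional[str]   # SSA name of the single result, if any
    operands: List[str]     # SSA names of the operands, source memref first


def propagate_address_space(
    ops: List[Op], addrspace: int = 5
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Walk ops in definition order and track which values end up in `addrspace`.

    Returns the address space per retyped value and a list of (op name, operand)
    pairs where a cast back to the default space would be needed.
    """
    spaces: Dict[str, int] = {}
    casts: List[Tuple[str, str]] = []
    for op in ops:
        if op.name == "memref.alloca" and op.result is not None:
            spaces[op.result] = addrspace
        elif op.name in VIEW_LIKE and op.result is not None and op.operands[:1]:
            if op.operands[0] in spaces:
                # The view must be retyped along with its source.
                spaces[op.result] = spaces[op.operands[0]]
        elif op.name == "func.call":
            # The callee signature is fixed, so retyped operands need a cast.
            casts.extend((op.name, v) for v in op.operands if v in spaces)
    return spaces, casts
```

On a sequence like alloca, subview, call, store, both the alloca and the subview results are retyped, the call operand is reported as needing a cast, and the store needs nothing because it just consumes whatever type it is given.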