How to avoid duplicate code in gpu.launch

Hi, everyone
I am using gpu.launch and have run into a problem. In my application I define a dialect named SNN, which has many SNN.iaf operations. Each SNN.iaf is converted into a gpu.launch during lowering, but each one is lowered to a different function; these functions have the same code but different names. After the lowering pipeline converts each gpu.func into a cubin, the output is very large for this reason (there are binary blobs named "main_kernel_29_gpubin_cst", "main_kernel_28_gpubin_cst", …). When I instead define the GPU function myself and call it with gpu.launch_func, I get this error: error: redefinition of symbol named 'gpu_kernels_iaf_gpu_func_kernel_name'. Here is my code:

module attributes {gpu.container_module} {
  gpu.module @gpu_kernels {
    gpu.func @iaf_gpu_func() kernel attributes {sym_visibility = "private"} {
      gpu.return
    }
  }
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    gpu.launch_func  @gpu_kernels::@iaf_gpu_func blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1) 
    %c1_2 = arith.constant 1 : index
    gpu.launch_func  @gpu_kernels::@iaf_gpu_func blocks in (%c1_2, %c1_2, %c1_2) threads in (%c1_2, %c1_2, %c1_2) 
    return
  }
}

Here is my pipeline:

lower-test:
	@${BUDDY_OPT} ${INPUT} \
		-gpu-kernel-outlining \
		-gpu-async-region \
		-func-bufferize -buffer-deallocation \
		-lower-affine -memref-expand \
		-convert-gpu-to-nvvm -gpu-to-cubin  \
		-convert-index-to-llvm -finalize-memref-to-llvm -convert-arith-to-llvm -convert-cf-to-llvm -convert-func-to-llvm --gpu-to-llvm\
		-reconcile-unrealized-casts -o ./log.mlir

How can I avoid the duplicate binary code?
Thanks for your help!

Hi there, I think I have run into this problem too. MLIR does not seem to support using gpu.launch_func to launch the same kernel function more than once (I did not dig into the reason), so I also got the "redefinition of symbol named" error.

My solution was to inline the kernel function by using gpu.launch instead of gpu.launch_func, and then use scf.for to launch the kernel multiple times. So maybe you should add an additional -gpu-kernel-outlining pass to your pipeline.
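A minimal sketch of that approach, assuming two launches and 1x1x1 grid and block sizes purely for illustration. The kernel body is a placeholder, not the real SNN.iaf lowering. Since the IR contains only one gpu.launch op, -gpu-kernel-outlining should outline it into a single gpu.func:

```mlir
func.func @main() {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  // Assumed launch count, for illustration only.
  %c2 = arith.constant 2 : index
  // The loop runs the same inline kernel once per iteration.
  scf.for %i = %c0 to %c2 step %c1 {
    gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
               threads(%tx, %ty, %tz) in (%sx = %c1, %sy = %c1, %sz = %c1) {
      // Placeholder: the kernel body would go here.
      gpu.terminator
    }
  }
  return
}
```

This only fits when every launch can share the same operands or derive them from the loop index.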

Hope my experience can help a little :smiling_face:

Thanks for your suggestion, but I have trouble using scf.for: the arguments of this operation are memrefs, and they may have different dimensions, so I don't know how to store them in a memref. Anyway, thanks for your help. Maybe I will try calling an external C function instead.
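One possible way around the differing shapes, sketched under assumptions rather than tested against this pipeline: put the single gpu.launch in a helper function that takes a dynamically shaped memref, and memref.cast each statically shaped buffer before calling it. The names @iaf_wrapper, %a and %b, the f32 element type, and the sizes 16 and 32 are all hypothetical. The sketch assumes the buffers have the same rank and are already accessible from the GPU, and that the helper is not inlined back into its callers.

```mlir
// Hypothetical helper: the only gpu.launch in the module lives here.
func.func @iaf_wrapper(%buf: memref<?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  // Read the runtime size instead of baking it into the type.
  %n = memref.dim %buf, %c0 : memref<?xf32>
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %n, %sy = %c1, %sz = %c1) {
    // Placeholder body: load and store back one element per thread.
    %v = memref.load %buf[%tx] : memref<?xf32>
    memref.store %v, %buf[%tx] : memref<?xf32>
    gpu.terminator
  }
  return
}

func.func @main(%a: memref<16xf32>, %b: memref<32xf32>) {
  // Erase the static sizes so both buffers share one argument type.
  %a_dyn = memref.cast %a : memref<16xf32> to memref<?xf32>
  %b_dyn = memref.cast %b : memref<32xf32> to memref<?xf32>
  func.call @iaf_wrapper(%a_dyn) : (memref<?xf32>) -> ()
  func.call @iaf_wrapper(%b_dyn) : (memref<?xf32>) -> ()
  return
}
```

With one gpu.launch op in the module, kernel outlining would be expected to emit one gpu.func and therefore one binary, while the host side calls the helper as an ordinary function.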