Need guidance on optimizing LLVM IR for GPU targets

Hey guys! :smiling_face_with_three_hearts:

I have been diving into LLVM IR for GPU targets, aiming to port some compute-heavy algorithms to the GPU. I've made some headway, but optimizing memory access and taking full advantage of GPU-specific optimizations are proving tricky.

I am hoping the awesome LLVM community can lend a hand:

  • Does anyone have best practices for optimizing LLVM IR to get the most performance out of GPUs?
  • Specifically, I'm struggling with optimizing data movement between global GPU memory and registers (see the sketch after this list). Any tips or tricks?
  • Are there specific LLVM IR constructs or optimization passes that matter most for GPU performance?
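
To make the second point concrete, here is a minimal sketch of the kind of kernel I'm dealing with (the function name `@scale` and the one-float-per-thread layout are just placeholders, and I'm assuming an NVPTX target):

```llvm
; Sketch: each thread loads one float from global memory (address space 1),
; scales it in a register, and stores the result back to global memory.
target triple = "nvptx64-nvidia-cuda"

define void @scale(ptr addrspace(1) %in, ptr addrspace(1) %out, float %k) {
entry:
  ; Thread index; lowers to the PTX special register %tid.x.
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %idx = zext i32 %tid to i64
  %src = getelementptr inbounds float, ptr addrspace(1) %in, i64 %idx
  %v   = load float, ptr addrspace(1) %src, align 4    ; global -> register
  %r   = fmul float %v, %k
  %dst = getelementptr inbounds float, ptr addrspace(1) %out, i64 %idx
  store float %r, ptr addrspace(1) %dst, align 4       ; register -> global
  ret void
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

; Mark @scale as a kernel entry point for the NVPTX backend.
!nvvm.annotations = !{!0}
!0 = !{ptr @scale, !"kernel", i32 1}
```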

I also checked this thread: https://discourse.llvm.org/t/how-to-get-the-passed-argument-of-every-gpu-kernel-callinst-in-llvm-ir-producedbyhipalteryx but I have not found a solution there. Could anyone point me in the right direction?
I'm looking forward to any tips or resources you can share on GPU optimization with LLVM IR.

Thanks in advance :innocent:

https://dl.acm.org/doi/pdf/10.1145/3570638 is a decent starting point. There's not much special about LLVM IR versus just knowing the target, in this case PTX. The only specific suggestion I can make is to read the PTX ISA 8.5 documentation and check which LLVM intrinsics map to those instructions.
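
As an illustration (a rough sketch; exact intrinsic names vary between LLVM versions, so check the NVPTX backend docs for yours), a few NVVM intrinsics and the PTX they roughly lower to:

```llvm
; A handful of NVVM intrinsics and their approximate PTX counterparts (not exhaustive).
declare i32   @llvm.nvvm.read.ptx.sreg.tid.x()    ; mov.u32 %r, %tid.x
declare i32   @llvm.nvvm.read.ptx.sreg.ctaid.x()  ; mov.u32 %r, %ctaid.x
declare void  @llvm.nvvm.barrier0()               ; bar.sync 0
declare float @llvm.nvvm.sqrt.rn.f(float)         ; sqrt.rn.f32

define float @example(float %x) {
  ; Block-wide barrier, then a round-to-nearest square root.
  call void @llvm.nvvm.barrier0()
  %s = call float @llvm.nvvm.sqrt.rn.f(float %x)
  ret float %s
}
```

Once you know which PTX instruction you want, it is usually straightforward to work backwards to the intrinsic or IR pattern that produces it.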

Hello @Aurora :smiley: :smiley:
To make progress here, focus on data layout, memory management, and GPU-specific optimizations: use vectorization, appropriate synchronization, and careful register management. Experiment with different optimization passes and with kernel fusion, and refer to the LLVM documentation for guidance.
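
For example, on vectorization: a single `<4 x float>` load from global memory typically lowers to one `ld.global.v4.f32` instead of four scalar loads, which cuts the number of memory transactions. A minimal sketch, assuming an NVPTX target and 16-byte-aligned pointers:

```llvm
; Sketch: one vector load/store per thread instead of four scalar ones.
target triple = "nvptx64-nvidia-cuda"

define void @copy4(ptr addrspace(1) %in, ptr addrspace(1) %out, i64 %i) {
  %src = getelementptr inbounds <4 x float>, ptr addrspace(1) %in, i64 %i
  ; With 16-byte alignment this usually becomes a single ld.global.v4.f32.
  %v   = load <4 x float>, ptr addrspace(1) %src, align 16
  %dst = getelementptr inbounds <4 x float>, ptr addrspace(1) %out, i64 %i
  store <4 x float> %v, ptr addrspace(1) %dst, align 16
  ret void
}
```

You can check whether LLVM already produces this shape for your scalar code by running the LoadStoreVectorizer, e.g. `opt -passes=load-store-vectorizer` on your IR.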