How to implement a vector load/store pass for a new GPU backend?

My goal is to optimize the performance of a softmax kernel.

When comparing the LLVM IR generated by the NVPTX backend and by my own GPU backend, I found that NVPTX vectorizes load and store instructions.

Part of the NVPTX LLVM IR:

  %31 = load <4 x half>, ptr addrspace(1) %scevgep91, align 8, !invariant.load !5
  %32 = fmul <4 x half> %31, <half 0xH211F, half 0xH211F, half 0xH211F, half 0xH211F>
  %33 = fcmp oge <4 x half> %19, %32
  %34 = or <4 x i1> %30, %33
  %35 = select <4 x i1> %34, <4 x half> %19, <4 x half> %32
  %scevgep85 = getelementptr i8, ptr addrspace(1) %scevgep89, i64 %27
  %scevgep86 = getelementptr i8, ptr addrspace(1) %scevgep85, i64 -524288

Further analysis showed that the NVPTX backend runs a separate LoadStoreVectorizer pass. However, when I directly added this pass to my backend, I got incorrect results. My question is: what other passes or dependencies need to be enabled to implement load and store vectorization?

NVPTX is a somewhat odd back-end, so there’s a chance that not everything there is applicable to other back-ends. AMDGPU may be a somewhat better reference.
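For reference, both NVPTX and AMDGPU schedule the vectorizer from their target's pass config. A minimal sketch, assuming a hypothetical `MyTargetPassConfig` (the exact placement relative to the other IR passes varies by target, so check the NVPTX/AMDGPU `*TargetMachine.cpp` for the details):

```cpp
#include "llvm/Transforms/Vectorize/LoadStoreVectorizer.h"

// Hypothetical pass config for your target; mirrors how the GPU
// backends wire the generic vectorizer into their IR pass pipeline.
void MyTargetPassConfig::addIRPasses() {
  TargetPassConfig::addIRPasses();
  // Run the IR-level load/store vectorizer at -O1 and above.
  if (getOptLevel() != CodeGenOptLevel::None)
    addPass(createLoadStoreVectorizerPass());
}
```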

That said, the LoadStoreVectorizer is a generic IR pass (I assume we’re talking about https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp) and it should’ve produced correct results. So, when you’re saying “added this pass, I got incorrect results”, it’s not clear what IR you gave to the pass, what you got back and what you expected to see.
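One way to narrow it down is to run the pass in isolation with `opt` on the exact IR your pipeline feeds it, then diff the scalar and vectorized versions (the pass name below is the upstream new-pass-manager name):

```
opt -passes=load-store-vectorizer -S before.ll -o after.ll
diff before.ll after.ll
```

If the output of this standalone run is already wrong, the problem is in the input IR (e.g. alignment or aliasing info); if it is correct, the miscompile is happening later in your backend's lowering.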

Is that the IR you want to lower to vector instructions for your target? It looks pretty well vectorized already, so there's not much for the IR-level LoadStoreVectorizer pass to do here.

If your problem is that `load <4 x half>` gets lowered to four individual loads, then you need to make sure your `<target>ISelLowering.cpp` marks the operation as Legal or Custom, and that the rest of your back-end handles the lowering of the vector load.
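A minimal sketch of the legalization side, with hypothetical target and register-class names. (Separately, note that the LoadStoreVectorizer itself consults TTI hooks such as `getLoadStoreVecRegBitWidth` and `allowsMisalignedMemoryAccesses`, so those should also reflect what your hardware supports.)

```cpp
// In the hypothetical MyTargetTargetLowering constructor:
// map v4f16 onto a 64-bit vector register class so the type
// is legal and vector loads/stores survive instruction selection.
addRegisterClass(MVT::v4f16, &MyTarget::VReg64RegClass);

// Declare the vector memory ops legal (or Custom, if you need to
// split or realign them yourself during lowering).
setOperationAction(ISD::LOAD,  MVT::v4f16, Legal);
setOperationAction(ISD::STORE, MVT::v4f16, Legal);
```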