My goal is to optimize the performance of a softmax kernel. While comparing the LLVM IR generated by the NVPTX backend with that generated by my own GPU backend, I found that NVPTX produces vectorized load and store instructions.
Part of the NVPTX LLVM IR:
%31 = load <4 x half>, ptr addrspace(1) %scevgep91, align 8, !invariant.load !5
%32 = fmul <4 x half> %31, <half 0xH211F, half 0xH211F, half 0xH211F, half 0xH211F>
%33 = fcmp oge <4 x half> %19, %32
%34 = or <4 x i1> %30, %33
%35 = select <4 x i1> %34, <4 x half> %19, <4 x half> %32
%scevgep85 = getelementptr i8, ptr addrspace(1) %scevgep89, i64 %27
%scevgep86 = getelementptr i8, ptr addrspace(1) %scevgep85, i64 -524288
Further analysis showed that the NVPTX backend runs a separate LoadStoreVectorizer pass. However, when I added this pass directly to my own pipeline, I got incorrect results. My question is: what other passes or dependencies need to be enabled to make load and store vectorization work correctly?
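For context, here is a rough sketch of how a back-end can register the pass and answer the queries the vectorizer makes, loosely modeled on what NVPTX does in its TargetMachine. This is a hedged illustration, not a drop-in fix: the "MyGPU" class names are placeholders, and the key point is that LoadStoreVectorizer consults TargetTransformInfo hooks (such as getLoadStoreVecRegBitWidth) to decide how wide a merged access may be. If those hooks, or the target's datalayout alignment rules, overstate what the hardware supports, the pass can form accesses the target cannot actually perform, which is one plausible source of incorrect results.

// Hypothetical sketch (LLVM back-end fragment; "MyGPU" names are placeholders).
#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/Transforms/Vectorize/LoadStoreVectorizer.h"

void MyGPUPassConfig::addIRPasses() {
  // Run the same IR-level vectorizer NVPTX uses.
  addPass(createLoadStoreVectorizerPass());
  TargetPassConfig::addIRPasses();
}

// One of the TargetTransformInfo hooks the vectorizer consults:
unsigned MyGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {
  // Widest vector memory access the hardware supports, in bits.
  // Claiming more than the hardware can do is a recipe for miscompiles.
  return 128;
}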
NVPTX is a somewhat odd back-end, so there’s a chance that not everything there is applicable to other back-ends. AMDGPU may be a somewhat better reference.
Is that the IR you want to lower to vector instructions for your target? It looks pretty well vectorized already, so there's not much for the IR-level LoadStoreVectorizer pass to do here.
If your problem is that load <4 x half> gets lowered to four individual loads, then you need to make sure your <target>ISelLowering.cpp marks the operation as Legal or Custom, and that the rest of your back-end handles the lowering of the vector load.
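As a hedged sketch of that last point, the legality of a v4f16 load/store is typically declared in the target's TargetLowering constructor. The "MyGPU" names and the register class are placeholders; the real class, subtarget, and register class names depend on your back-end:

// Hypothetical sketch (LLVM back-end fragment; "MyGPU" names are placeholders).
MyGPUTargetLowering::MyGPUTargetLowering(const TargetMachine &TM,
                                         const MyGPUSubtarget &STI)
    : TargetLowering(TM) {
  // Tell the legalizer that v4f16 values live in some register class...
  addRegisterClass(MVT::v4f16, &MyGPU::VReg64RegClass);
  // ...and that vector loads/stores of that type must not be scalarized.
  setOperationAction(ISD::LOAD,  MVT::v4f16, Legal);
  setOperationAction(ISD::STORE, MVT::v4f16, Legal);
  // Use Custom instead of Legal if the hardware needs special handling,
  // and implement that handling in LowerOperation().
}

With neither Legal nor Custom set, the type legalizer will split the <4 x half> access back into scalar loads, which would undo whatever the IR-level vectorizer achieved.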