My goal is to optimize the performance of a softmax kernel.
While comparing the LLVM IR produced via the NVPTX backend with that of my own GPU backend, I found that NVPTX vectorizes load and store instructions.
Here is part of the NVPTX LLVM IR:
%31 = load <4 x half>, ptr addrspace(1) %scevgep91, align 8, !invariant.load !5
%32 = fmul <4 x half> %31, <half 0xH211F, half 0xH211F, half 0xH211F, half 0xH211F>
%33 = fcmp oge <4 x half> %19, %32
%34 = or <4 x i1> %30, %33
%35 = select <4 x i1> %34, <4 x half> %19, <4 x half> %32
%scevgep85 = getelementptr i8, ptr addrspace(1) %scevgep89, i64 %27
%scevgep86 = getelementptr i8, ptr addrspace(1) %scevgep85, i64 -524288
Further analysis showed that the NVPTX pipeline runs a separate LoadStoreVectorizer pass. However, when I added this pass directly to my own pipeline, I got incorrect results. My question is: what other passes or dependencies need to be enabled to make load and store vectorization work correctly?
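For reference, a minimal sketch of how this could be wired up, modeled on how NVPTX adds the pass in its `TargetPassConfig` (see `NVPTXTargetMachine.cpp`). `MyGPUPassConfig` and `MyGPUTTIImpl` are placeholder names for your target's classes, not real LLVM APIs; the exact header for `createLoadStoreVectorizerPass` varies between LLVM versions:

```cpp
// Sketch only: add the LoadStoreVectorizer to the target's IR pass
// pipeline, the same way NVPTX does. "MyGPUPassConfig" is hypothetical.
#include "llvm/Transforms/Vectorize/LoadStoreVectorizer.h"

void MyGPUPassConfig::addIRPasses() {
  // Only vectorize when optimizing; the pass relies on a correct
  // DataLayout and on TargetTransformInfo answers from your target.
  if (getOptLevel() != CodeGenOptLevel::None)
    addPass(createLoadStoreVectorizerPass());
  TargetPassConfig::addIRPasses();
}

// The pass queries TTI to decide how wide it may go per address space,
// so your TTI implementation needs to report a sensible width
// (NVPTX returns 128 bits here). The value below is an assumption
// about your hardware, not a given.
unsigned MyGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AS) const {
  return 128;
}
```

One common source of "incorrect results" is the pass merging accesses into a vector width or alignment the backend cannot actually lower, so checking the TTI hooks (`getLoadStoreVecRegBitWidth`, `isLegalToVectorizeLoadChain`, and friends) is a good first step.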
Is that the IR you want to lower to vector instructions for your target? It looks pretty well vectorized already, so there’s not much for the IR-level LoadStoreVectorizer pass to do here.
If your problem is that the load <4 x half> gets lowered to individual scalar loads, then you need to make sure your <target>ISelLowering.cpp marks the operation as Legal or Custom, and that the rest of your backend handles the lowering of the vector load.
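A rough sketch of what that looks like in the target's `TargetLowering` constructor. The class and register-class names (`MyGPUTargetLowering`, `MyGPU::VReg64RegClass`) are placeholders for whatever your backend defines:

```cpp
// Sketch only: teach the target that v4f16 is a legal type and that
// vector loads/stores of it can be selected directly.
MyGPUTargetLowering::MyGPUTargetLowering(const TargetMachine &TM,
                                         const MyGPUSubtarget &STI)
    : TargetLowering(TM) {
  // A type is only legal once it has a register class; the register
  // class name here is an assumption about your target.
  addRegisterClass(MVT::v4f16, &MyGPU::VReg64RegClass);

  // If the instruction selector can match v4f16 loads/stores, mark
  // them Legal; otherwise use Custom and handle them in LowerOperation.
  setOperationAction(ISD::LOAD,  MVT::v4f16, Legal);
  setOperationAction(ISD::STORE, MVT::v4f16, Legal);
}
```

If you use `Custom` instead, `LowerOperation` must expand the vector load into something the rest of your backend can select, e.g. a target-specific wide-load node or a pair of narrower loads.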