[RFC] Cleaning up the NVIDIA (and potentially AMD) GPU backend

Disclaimer: I wanted early feedback on something I have been thinking about. I don’t have a timeline yet, but if nobody objects, it might happen “soon-ish”.

The NVPTX backend has some oddities and (IMHO) one outright wrong transformation. At the same time, we have the more evolved AMDGPU backend, which is also not free of oddities and which duplicates some of the same ideas (or the other way around).

My general idea is to minimize the differences between our GPU code paths (from Clang all the way to the backend) and to move passes into the middle end where we have reason to believe that running them early will actually help. (We can still run them late to verify that the target’s preconditions are met, if necessary.)

  1. Emit address spaces for the NVPTX target early, mirroring our AMDGPU handling (a sketch of the rewrite follows this list). This should help us in many ways: we get more coverage for address spaces in the middle end, better AA results, and better address space deduction (@shiltian is now actively working to finish https://reviews.llvm.org/D120586); we get less divergence between GPU targets; and we can remove the NVPTX passes that attach address spaces “locally” late in the game [1], [2], …
    [1] llvm/lib/Target/NVPTX/NVPTXGenericToNVVM.cpp (llvm/llvm-project @ main)
    [2] llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp (llvm/llvm-project @ main)

  2. Hoist allocas into the entry block properly, as part of a middle-end pass. What the NVPTX backend does is certainly wrong for allocas in loops [3] (see the second sketch after this list).
    [3] llvm/lib/Target/NVPTX/NVPTXAllocaHoisting.cpp (llvm/llvm-project @ main)

  3. Unify the AAs in the backends. The rules are not totally different; we could have an AddrSpaceAwareAA that uses target knowledge for most of it (see the third sketch after this list).
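
For (1), here is a minimal sketch of the rewrite that NVPTXGenericToNVVM currently performs late and that we would emit early instead. The IR is illustrative only, assuming NVPTX’s convention that addrspace(1) is global memory:

```llvm
; Before: the global lives in the generic address space, so the
; middle end learns nothing from its type.
@g = internal global i32 0

define i32 @read() {
  %v = load i32, ptr @g
  ret i32 %v
}

; After: the global is placed in addrspace(1) and uses are cast back
; to the generic address space, so address space information is
; visible to (and can be propagated by) the middle end.
@g.as1 = internal addrspace(1) global i32 0

define i32 @read.early() {
  %p = addrspacecast ptr addrspace(1) @g.as1 to ptr
  %v = load i32, ptr %p
  ret i32 %v
}
```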
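
For (2), a small example of why naive hoisting is wrong: each execution of an alloca inside a loop conceptually yields a fresh allocation, so moving it to the entry block is only sound when no two iterations’ allocations are observably distinct. The @use callee below is just a placeholder:

```llvm
declare void @use(ptr)

define void @f(i32 %n) {
entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; Every iteration gets a distinct allocation. Hoisting this alloca
  ; to %entry would give all iterations the same address, which is
  ; observable here because the pointer escapes into @use.
  %p = alloca i32
  call void @use(ptr %p)
  %i.next = add i32 %i, 1
  %c = icmp slt i32 %i.next, %n
  br i1 %c, label %loop, label %exit

exit:
  ret void
}
```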
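
For (3), the shared logic is essentially a verdict over pairs of address spaces. A trivial example of the query both backend AAs answer today, again assuming NVPTX numbering (addrspace(3) is shared, addrspace(1) is global):

```llvm
; Shared and global memory are disjoint on NVPTX (AMDGPU has
; analogous disjoint spaces), so an address-space-aware AA can
; return NoAlias for these two stores and unblock reordering.
define void @stores(ptr addrspace(3) %s, ptr addrspace(1) %g) {
  store i32 0, ptr addrspace(3) %s
  store i32 1, ptr addrspace(1) %g
  ret void
}
```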

I am certain there are other things we could move and share, e.g., simplification of atomic accesses to local memory.

I just wanted to get some feedback on this: reading over the backend passes, I was not happy that we have so many of them, that they duplicate work, and that some of them only run late, which can cause performance degradation.

(Tag @arsenm @Artem-B)

I’m not that bothered whether AMDGPU and NVPTX share passes or not, but I am always in favour of moving work to the middle end. We’ve got good mileage out of minimising Clang and runtime differences for GPU OpenMP.

This would be 80% solved by BasicAA using TTI::addrspacesMayAlias. There was resistance to doing it this way a long time ago, for some reason. Besides that, TargetAA not being first in the AA pipeline (it is currently last) is not ideal.

I think changing that as you suggest is very sensible. We should look at the compile-time impact on non-GPU targets for both changes, and if there is none, we should go for it.

Another piece of target AA is constant-address handling. IMO, the constant address space should not exist; it mostly covers up weaknesses in the IR handling of !invariant. For example, last I checked you can’t mark a memcpy as invariant for the load half. Besides that, we use it as a weak hint for addressing-mode matching about what offset sizes work.
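
To illustrate the gap (assuming !invariant above refers to the !invariant.load load metadata): a plain load can carry the annotation, but a memcpy is a single read-write operation, so there is no way to tag only its read side. A minimal sketch:

```llvm
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

define void @example(ptr %dst, ptr %src) {
  ; A plain load can be marked invariant...
  %v = load i32, ptr %src, !invariant.load !0
  store i32 %v, ptr %dst
  ; ...but there is no way to mark only the source (load) half of
  ; the memcpy as invariant.
  call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 16, i1 false)
  ret void
}

!0 = !{}
```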

SGTM. A lot of these oddities are from the early days when we didn’t have better options. Your plan sounds sensible.