Disclaimer: I wanted early feedback for something I have been thinking about. I don’t have a timeline yet, but if nobody objects it might happen “soon-ish”.
The NVPTX backend has some oddities and (IMHO) one outright wrong transformation. At the same time, we have the more evolved AMDGPU backend, which is not free of oddities either and duplicates some of the same ideas (or the other way around).
My general idea would be to minimize the differences between our GPU code paths (from Clang all the way down) and to move passes into the middle end wherever we have reason to believe running them early actually helps. (We can still run them late to verify that the target's preconditions are met, if necessary.)
Emit address spaces for the NVPTX target early, mirroring our AMDGPU handling. This should help us in many ways: we get more coverage for address spaces in the middle end, we should get better AA results, we should get better address space deduction (@shiltian is now actively working to finish https://reviews.llvm.org/D120586), we get less divergence between GPU targets, and we can remove the NVPTX passes that attach address spaces "locally" late in the game:
llvm/lib/Target/NVPTX/NVPTXGenericToNVVM.cpp
llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp
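To illustrate why having address spaces early helps deduction, here is a toy C++ model (hypothetical names, not LLVM API) of the core idea behind InferAddressSpaces: a generic pointer that provably originates from a concrete address space can have its uses rewritten to that space.

```cpp
#include <cassert>
#include <optional>

// Toy model: each pointer value records the address space it currently has
// and, if known, the concrete space it was addrspacecast from. A "generic"
// pointer whose source is concrete can be specialized, which is what
// InferAddressSpaces does on real IR. All names here are illustrative.
struct PtrValue {
    unsigned addrSpace;                 // current address space (0 = generic)
    std::optional<unsigned> castFrom;   // concrete space it originated in
};

unsigned inferAddrSpace(const PtrValue &p, unsigned genericAS = 0) {
    if (p.addrSpace != genericAS)
        return p.addrSpace;             // already concrete, nothing to do
    if (p.castFrom)
        return *p.castFrom;             // strip the cast to generic
    return genericAS;                   // nothing known, stay generic
}
```

The earlier the frontend attaches concrete address spaces, the more such casts the middle end can see through, instead of the NVPTX backend reconstructing them late.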
Hoist allocas into the entry block properly, as part of a middle-end pass. What the NVPTX backend does is certainly wrong for allocas in loops.
llvm/lib/Target/NVPTX/NVPTXAllocaHoisting.cpp
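A toy C++ model (not LLVM code) of why naively hoisting an alloca out of a loop is wrong once the address escapes: an alloca executed inside the loop conceptually yields a fresh slot per iteration, while a hoisted alloca is one slot shared by every iteration.

```cpp
#include <array>

// Per-iteration allocation: models the alloca staying inside the loop.
std::array<int, 3> perIterationSlots() {
    int slots[3];                 // three distinct "stack" slots
    int *escaped[3];
    for (int i = 0; i < 3; ++i) {
        slots[i] = i;             // each iteration writes its own slot
        escaped[i] = &slots[i];
    }
    return {*escaped[0], *escaped[1], *escaped[2]}; // 0, 1, 2
}

// Hoisted allocation: models the alloca moved to the entry block.
std::array<int, 3> hoistedSlot() {
    int slot;                     // one slot reused by all iterations
    int *escaped[3];
    for (int i = 0; i < 3; ++i) {
        slot = i;
        escaped[i] = &slot;       // every iteration escapes the same address
    }
    return {*escaped[0], *escaped[1], *escaped[2]}; // 2, 2, 2
}
```

The two functions return different results, so a transformation that turns the first shape into the second without further checks is not semantics-preserving.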
Unify the alias analyses (AAs) in the backends. The rules are not all that different; we could have an AddrSpaceAwareAA that covers most of it using target knowledge.
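The shared rule such an AddrSpaceAwareAA could encode can be sketched as follows (a hedged toy model, not LLVM code; the address-space numbers follow the NVPTX convention of 0 = generic, 1 = global, 3 = shared, and AMDGPU would plug in its own numbering via target knowledge):

```cpp
// Toy alias rule: two pointers in distinct, non-generic address spaces
// refer to disjoint memory regions and therefore cannot alias; a generic
// pointer may point anywhere, so it may alias anything.
enum AliasResult { NoAlias, MayAlias };

AliasResult addrSpaceAlias(unsigned asA, unsigned asB, unsigned genericAS = 0) {
    if (asA == genericAS || asB == genericAS)
        return MayAlias;        // generic may point into any space
    return asA == asB ? MayAlias : NoAlias; // distinct concrete spaces are disjoint
}
```

The only target-specific inputs are which address space is "generic" and which concrete spaces exist, which is exactly the kind of knowledge a shared pass could query from the target.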
I am certain there are other things we could move and share, e.g., simplification of atomic accesses to local memory.
Just wanted to get some feedback on this. After reading over the backend passes, I was not happy that we have so many of them, that they duplicate each other, and that some only run late, which can cause performance degradation.