[RFC] Cleaning up the NVIDIA (and potentially AMD) GPU backend

jdoerfert · June 29, 2023, 4:44pm

Disclaimer: I wanted early feedback for something I have been thinking about. I don’t have a timeline yet, but if nobody objects it might happen “soon-ish”.

The NVPTX backend has some oddities and (IMHO) a wrong transformation. At the same time, we have the more evolved AMDGPU backend which is also not free of oddities and duplicates some ideas (or the other way around).

My general idea would be to minimize the difference between our GPU code paths (from Clang through to the end) and to move passes into the middle end if we have reason to believe that running them early might actually help. (We can still run them late, if necessary, to verify that the preconditions for the target are met.)

1. Emit address spaces for the NVPTX target early, mirroring our AMDGPU handling. This should help us in many ways: we get more coverage for address spaces in the middle end, we should get better AA results, we should get better address space deduction (@shiltian is now actively working to finish https://reviews.llvm.org/D120586), we get less divergence between GPU targets, we can remove the NVPTX passes that attach address spaces "locally" late in the game [1], [2], …
   [1] llvm/lib/Target/NVPTX/NVPTXGenericToNVVM.cpp (llvm/llvm-project, main)
   [2] llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp (llvm/llvm-project, main)
2. Hoist allocas into the entry block as part of a middle end pass, properly. What the NVPTX backend does is certainly wrong for allocas in loops. [3]
   [3] llvm/lib/Target/NVPTX/NVPTXAllocaHoisting.cpp (llvm/llvm-project, main)
3. Unify the AAs in the backends. The rules are not totally different; we could have an AddrSpaceAwareAA for most of it that uses target knowledge.
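The post does not spell out why hoisting is wrong for allocas in loops. One plausible reading is that an alloca executed inside a loop yields a fresh object on every iteration, whereas a hoisted alloca yields a single object shared by all iterations. The sketch below models that difference in plain C++; `Frame`, `allocaSlot`, `allocaInLoop`, and `allocaHoisted` are hypothetical names made up for illustration and have nothing to do with the actual pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a stack frame: each call to allocaSlot() stands in
// for executing one `alloca` instruction and returns the index of a new slot.
struct Frame {
  std::vector<int> slots;
  std::size_t allocaSlot() {
    slots.push_back(0);
    return slots.size() - 1;
  }
};

// Reads back the values stored in the slots that the loop remembered.
static std::vector<int> readBack(const Frame &frame,
                                 const std::vector<std::size_t> &saved) {
  std::vector<int> result;
  for (std::size_t slot : saved)
    result.push_back(frame.slots[slot]);
  return result;
}

// Original program: the alloca sits inside the loop, so every iteration gets
// its own slot and values written in earlier iterations survive.
std::vector<int> allocaInLoop(int n) {
  assert(n >= 0);
  Frame frame;
  std::vector<std::size_t> saved;
  for (int i = 0; i < n; ++i) {
    std::size_t slot = frame.allocaSlot(); // executed n times
    frame.slots[slot] = i;
    saved.push_back(slot); // the "pointer" escapes the iteration
  }
  return readBack(frame, saved);
}

// After naive hoisting: the alloca runs once in the entry block, so all
// iterations write to the same slot and overwrite each other.
std::vector<int> allocaHoisted(int n) {
  assert(n >= 0);
  Frame frame;
  std::size_t slot = frame.allocaSlot(); // executed once
  std::vector<std::size_t> saved;
  for (int i = 0; i < n; ++i) {
    frame.slots[slot] = i;
    saved.push_back(slot);
  }
  return readBack(frame, saved);
}
```

With `n = 3`, `allocaInLoop` returns `{0, 1, 2}` while `allocaHoisted` returns `{2, 2, 2}`. The two only agree when no slot outlives the iteration that created it, which is the kind of precondition a proper hoisting pass would presumably have to establish.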

I am certain there are other things we could move and share, e.g., simplification of atomic accesses to local memory.

Just wanted to get some feedback on this since I read over the backend passes and I was not happy that we have so many of them, that there is duplication, and that we run some of them only late, which can cause performance degradation.

(Tag @arsenm @Artem-B)

JonChesterfield · June 29, 2023, 5:01pm

I’m not that bothered whether amdgpu and nvptx share passes or not but am always in favour of moving work to the middle end. We’ve got good mileage out of minimising clang and runtime differences for GPU openmp.

arsenm · June 29, 2023, 6:41pm

This would be 80% solved by BasicAA using TTI::addrspacesMayAlias. There was resistance to doing it this way a long time ago for some reason. Besides that, TargetAA not being first (and instead being last) in the AA pipeline is not ideal.
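A minimal standalone sketch of the kind of address-space check being discussed, under a simplified rule assumed here: the generic address space may alias anything, and two different specific address spaces never alias. The names `addrSpacesMayAliasSketch` and `aliasByAddrSpace` are hypothetical, the address space numbers are only modeled on NVPTX, and none of this is the actual LLVM hook or any backend's real alias table.

```cpp
// Illustrative address space numbers (modeled on NVPTX; an assumption here).
enum AddrSpace : unsigned { Generic = 0, Global = 1, Shared = 3, Local = 5 };

enum class AliasResult { NoAlias, MayAlias };

// Hypothetical stand-in for a target-provided address space query.
// Simplified rule: generic may alias everything; otherwise only pointers in
// the same address space may alias.
constexpr bool addrSpacesMayAliasSketch(unsigned a, unsigned b) {
  return a == Generic || b == Generic || a == b;
}

// Hypothetical early check a generic alias analysis could run before any
// more expensive reasoning: a cheap NoAlias whenever the spaces are disjoint.
constexpr AliasResult aliasByAddrSpace(unsigned a, unsigned b) {
  return addrSpacesMayAliasSketch(a, b) ? AliasResult::MayAlias
                                        : AliasResult::NoAlias;
}

// A global pointer and a shared pointer cannot overlap under this rule.
static_assert(aliasByAddrSpace(Global, Shared) == AliasResult::NoAlias,
              "distinct specific address spaces do not alias");
// A generic pointer could point into shared memory, so stay conservative.
static_assert(aliasByAddrSpace(Generic, Shared) == AliasResult::MayAlias,
              "generic may alias any address space");
```

The check needs only the two address space numbers, so answering it early in the AA pipeline is cheap; a `MayAlias` result here would simply fall through to the remaining analyses.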

jdoerfert · June 29, 2023, 6:45pm

I think changing that as you suggest is very sensible. We should look at the compile time impact for non-GPU targets for both changes and, if there is none, we should go for it.

arsenm · June 29, 2023, 6:49pm

Another piece of target AA is constant address handling. IMO constant address space should not exist. It mostly covers up weaknesses in the IR handling of !invariant. For example, last I checked you can't mark a memcpy as invariant for the load half. Besides that, we use it as a weak hint for addressing mode matching about what offset sizes work.

Artem-B · June 29, 2023, 7:18pm

SGTM. A lot of these oddities are from the early days when we didn’t have better options. Your plan sounds sensible.
