I’ve been seeing that IR generated by upstream tools (such as Triton) targeting NVPTX is flattened so that a single CUDA core (thread) handles multiple elements. Often the flattening goes beyond what vectorization can actually use and increases register pressure, which eventually reduces thread occupancy. So I’m wondering whether a loop reroller that rerolls consecutive scalar code back into a loop could help here. Besides the register pressure, reducing the IR size, which in turn improves compiler throughput, is also important for JIT compilation.
It would be great to see a specific example.
In general, GPU code generation relies on very aggressive inlining and loop unrolling, at least for the IR produced from CUDA. Linear chunks of code with limited lifetimes of in-register data should be optimized by ptxas reasonably well. You may need an occasional hint to tell ptxas not to get too enthusiastic about register use, but most of the time it’s not needed. If there’s a lot of live data, then re-rolling will likely result in saving some of it into an alloca, and that tends to be very expensive on GPUs.
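For reference, the usual form of that hint is the __launch_bounds__ qualifier on the kernel (or nvcc’s -maxrregcount=N flag). A minimal sketch, with purely illustrative bounds:

```cuda
// Declaring the intended launch shape lets ptxas budget registers for it:
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor).
// The numbers below are made up for illustration.
__global__ void __launch_bounds__(256, 4)
scale(const float* in, float* out, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a * in[i];
}
```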
I’m doubtful that re-rolling the loops will be a clear win. It may be useful in some cases, but I’d like to have a better idea of what it is exactly that we want to fix.
It would be great to see a specific example.
Here is an example on Compiler Explorer, which comes from a PyTorch OSS benchmark. The PyTorch/Triton autotuner somehow decides to hand 8 elements to a thread at a time. But from the GPU’s point of view, a single core does not have enough resources to process those elements in parallel, and I’m not sure that spending that many scalar registers on that chunk of code is optimal.
Would it be better if we rerolled that scalar code into a loop while still keeping some parallelism inside each loop iteration for vectorization? Some heuristics would be needed, though.
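To make the idea concrete, here is a rough sketch (not the actual Triton output; the names and the element-wise op are made up) of the flattened form versus a rerolled form that keeps a 128-bit-wide body per iteration:

```cuda
// Flattened form: 8 elements per thread as straight-line code, so all
// intermediates are live in registers at the same time.
__global__ void elementwise_unrolled(const float* in, float* out, int n) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
  if (base + 7 < n) {
    float x0 = in[base + 0], x1 = in[base + 1], x2 = in[base + 2], x3 = in[base + 3];
    float x4 = in[base + 4], x5 = in[base + 5], x6 = in[base + 6], x7 = in[base + 7];
    out[base + 0] = x0 * x0; out[base + 1] = x1 * x1;
    out[base + 2] = x2 * x2; out[base + 3] = x3 * x3;
    out[base + 4] = x4 * x4; out[base + 5] = x5 * x5;
    out[base + 6] = x6 * x6; out[base + 7] = x7 * x7;
  }
}

// Rerolled form: a 2-iteration loop whose body is still float4-wide, so the
// 128-bit load/store path stays busy but far fewer values are live at once.
__global__ void elementwise_rerolled(const float4* in, float4* out, int n4) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
  #pragma unroll 1  // keep the loop rolled
  for (int k = 0; k < 2; ++k) {
    int i = base + k;
    if (i < n4) {
      float4 v = in[i];
      v.x *= v.x; v.y *= v.y; v.z *= v.z; v.w *= v.w;
      out[i] = v;
    }
  }
}
```

The vectorized body is meant to keep the memory path utilized while shrinking the set of simultaneously live values; whether ptxas keeps the loop rolled and how much the back edge costs is what I’d want to measure.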
Can you elaborate on what you mean by “not enough resources” in this context?
Usually it’s a combination of the resources needed by the individual kernel and the launch grid size specified by the user. E.g. if the kernel needs 64 registers, the block can have no more than 1024 threads.
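(To spell out the arithmetic, assuming the usual 64K-register budget per thread block: 64 registers/thread × 1024 threads = 65536, which exhausts that budget exactly.)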
In this case, your example compiles on sm_80 to use just 32 registers, so register-wise it could support up to 2K threads per block. I do not think it’s reasonable to re-roll this particular code.
If this particular code did run into a runtime failure due to insufficient resources, most likely the issue is on the side of the code that launches the kernel. Either it miscalculates the grid size, or it just assumes some specific kernel resource usage.
Typical ways to deal with this are to either calculate the launch grid based on the exact kernel resource requirements queried via CUDA APIs, or explicitly tell the compiler about the expected launch grid size and let it attempt to constrain register use to make that possible.
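For the first option, a minimal sketch using the CUDA runtime attribute/occupancy APIs (the kernel here is just a placeholder, not the Triton-generated one):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the generated one.
__global__ void my_kernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * in[i];
}

void launch(const float* in, float* out, int n) {
  // Query the compiled kernel's actual resource requirements.
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, (const void*)my_kernel);
  printf("regs/thread=%d smem/block=%zu max threads/block=%d\n",
         attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);

  // Let the occupancy API suggest a block size those resources can sustain,
  // then size the grid from the problem size.
  int minGridSize = 0, blockSize = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel);
  int gridSize = (n + blockSize - 1) / blockSize;
  my_kernel<<<gridSize, blockSize>>>(in, out, n);
}
```

The second option is the __launch_bounds__ hint sketched earlier in the thread.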
I mean the vector load unit (up to 128-bit loads on A100) and the FPU/ALU are not enough to execute that IR fully in parallel. But I’m not sure a rerolled loop can have on-par performance with the scalar code. Do you think the back jumps can hurt performance a lot?
In this case, your example compiles on sm_80 to use just 32 registers, so register-wise it could support up to 2K threads per block. I do not think it’s reasonable to re-roll this particular code.
In this case, yes, loop rerolling doesn’t help. But elsewhere I did see a kernel use up to 55 registers, and thread occupancy went down.
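(Roughly: with the 64K-register file per SM on sm_80, 55 registers/thread caps residency at about 65536 / 55 ≈ 1190 threads, well below the 2048-thread maximum per SM, so occupancy drops before any block-size limit is reached.)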
If this particular code did run into a runtime failure due to insufficient resources, most likely the issue is on the side of the code that launches the kernel. Either it miscalculates the grid size, or it just assumes some specific kernel resource usage.
I did fix that issue by increasing the grid size so that each thread gets less work. But I’m not sure a bigger grid size is always better, since it launches more waves of thread blocks.
So in general, which scheme is better: a thread doing more work via loops and thus fewer waves, or the other way around?
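For context, this is roughly how I’m counting waves (a sketch, assuming the runtime occupancy API):

```cuda
#include <cuda_runtime.h>

// Rough wave count for a given kernel / block size / grid size: how many
// rounds of resident blocks the GPU needs to drain the whole grid.
int count_waves(const void* kernel, int blockSize, int gridBlocks) {
  int device = 0, numSMs = 0, blocksPerSM = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel,
                                                blockSize, /*dynSmem=*/0);
  int residentBlocks = numSMs * blocksPerSM;
  return (gridBlocks + residentBlocks - 1) / residentBlocks;  // ceil
}
```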
Typical ways to deal with this are to either calculate the launch grid based on the exact kernel resource requirements queried via CUDA APIs
Could you give me a pointer to those APIs? Thanks.