I’m trying to better understand how to promote/remove allocas with LLVM’s opt for the NVPTX target. Here’s a little toy example I’m using:
toy.cu:
__global__ void axpy(float a, float* x, float* y)
{
    y[threadIdx.x] = a * x[threadIdx.x];
}
Generate IR (note that -O0 is on purpose):
clang++ -O0 --cuda-device-only -emit-llvm -S toy.cu --cuda-gpu-arch=sm_75
Generate PTX:
llc -march=nvptx64 -mcpu=sm_75 < toy-cuda-nvptx64-nvidia-cuda-sm_75.ll > toy-cuda-nvptx64-nvidia-cuda-sm_75.ptx
Check for stack usage:
ptxas -v -arch sm_75 toy-cuda-nvptx64-nvidia-cuda-sm_75.ptx
ptxas info : 1 bytes gmem
ptxas info : Compiling entry function '_Z4axpyfPfS_' for 'sm_75'
ptxas info : Function properties for _Z4axpyfPfS_
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
This results in 24 bytes stack usage.
If I repeat the same sequence but call clang with -O3, the stack usage comes out to zero.
I am trying to understand if there is an optimization pass I could call which can eliminate the stack usage. I tried
opt -S -O3 < toy-cuda-nvptx64-nvidia-cuda-sm_75.ll > toy-cuda-nvptx64-nvidia-cuda-sm_75.ptx
and also
opt --mem2reg -S …
but neither command changed the IR at all.
To be clear: I’d like to start by calling clang with -O0 and then use opt to remove the stack usage. Is that possible, or is clang doing something opt can’t do here?
Maybe it would already help to see which optimization pipeline clang runs at -O3.
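
A possible explanation, offered as an assumption rather than a confirmed diagnosis for this setup: at -O0 clang tags every function with the optnone attribute, and opt skips optional passes on such functions, which would match the unchanged IR. The sketch below regenerates the -O0 IR without optnone, runs mem2reg, and prints the pipeline that opt's O3 default expands to. The file names follow the ones used above; ir_has_optnone and toy-mem2reg.ll are hypothetical names chosen for illustration, and the -print-pipeline-passes flag assumes a reasonably recent LLVM with the new pass manager.

```shell
#!/bin/sh
# Sketch only: file names follow the post; ir_has_optnone is a hypothetical helper.
IR=toy-cuda-nvptx64-nvidia-cuda-sm_75.ll

# Succeeds if the IR file mentions the optnone function attribute.
ir_has_optnone() {
    grep -q 'optnone' "$1"
}

# Run the toolchain steps only when the source file and tools are available.
if [ -f toy.cu ] && command -v clang++ >/dev/null 2>&1 && command -v opt >/dev/null 2>&1; then
    # Assumption: -disable-O0-optnone keeps -O0 codegen but omits optnone.
    clang++ -O0 -Xclang -disable-O0-optnone --cuda-device-only -emit-llvm -S toy.cu --cuda-gpu-arch=sm_75

    if ir_has_optnone "$IR"; then
        echo "optnone still present; opt will skip those functions"
    fi

    # Promote allocas to SSA registers (new pass manager syntax).
    opt -S -passes=mem2reg "$IR" -o toy-mem2reg.ll

    # Print the textual pipeline that opt's O3 default expands to.
    opt -passes='default<O3>' -print-pipeline-passes -disable-output "$IR"
fi
```

If the assumption holds, toy-mem2reg.ll should no longer contain alloca instructions for the kernel arguments, and feeding it through llc and ptxas as above would show whether the stack frame is gone.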