How to reduce stack usage in PTX using opt

I’m trying to better understand how I can promote or remove allocas with LLVM/opt for the NVPTX target. Here’s the little toy example I’m using:

__global__ void axpy(float a, float* x, float* y) {
  y[threadIdx.x] = a * x[threadIdx.x];
}

Generate IR (note that -O0 is on purpose):

clang++ -O0 --cuda-device-only -emit-llvm -S --cuda-gpu-arch=sm_75

Generate PTX:

llc -march=nvptx64 -mcpu=sm_75 < toy-cuda-nvptx64-nvidia-cuda-sm_75.ll > toy-cuda-nvptx64-nvidia-cuda-sm_75.ptx

Check for stack usage:

ptxas -v -arch sm_75 toy-cuda-nvptx64-nvidia-cuda-sm_75.ptx

ptxas info : 1 bytes gmem
ptxas info : Compiling entry function '_Z4axpyfPfS_' for 'sm_75'
ptxas info : Function properties for _Z4axpyfPfS_
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]

This results in 24 bytes stack usage.

If I repeat the same sequence but call Clang with -O3, the stack usage comes out to zero.

I am trying to understand if there is an optimization pass I could call which can eliminate the stack usage. I tried

opt -S -O3 < toy-cuda-nvptx64-nvidia-cuda-sm_75.ll > toy-cuda-nvptx64-nvidia-cuda-sm_75.ptx

and also

opt --mem2reg -S …

but it didn’t change the IR at all.

To be clear: I’d like to start by calling clang with -O0 and then use opt to remove the stack usage. Is that possible, or is clang doing something here that opt can’t do?
Maybe it would already help to see what clang’s opt pipeline is when using -O3.

Code compiled at -O0 has the optnone attribute added to every function, which prevents optimisation passes from touching those functions even if you request the passes via opt. You’ll need to either patch your IR to remove the attribute or compile with -Xclang -disable-O0-optnone for that to work.
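
As a rough illustration of the "patch your IR" option, here is a minimal sketch that strips optnone from a textual IR file with sed. The file names and the tiny stand-in IR are hypothetical (your real input would be the .ll file clang produced at -O0), and the sed pattern assumes optnone appears as a space-separated keyword in the attribute list, which is how it normally shows up in printed IR:

```shell
#!/bin/sh
# Sketch: remove optnone from -O0 IR so that opt passes may touch the functions.
set -eu

workdir=$(mktemp -d)
input="$workdir/toy.ll"            # hypothetical stand-in for the clang -O0 output
patched="$workdir/toy.patched.ll"  # hypothetical name for the patched IR

# Minimal stand-in IR; only the attribute group matters for this sketch.
cat > "$input" <<'EOF'
define void @f() #0 {
  ret void
}

attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone }
EOF

# Drop the optnone keyword wherever it appears between spaces.
# noinline is deliberately left in place; remove it too if inlining is wanted.
sed 's/ optnone / /g' "$input" > "$patched"

echo "patched IR written to $patched"
```

After patching, something like opt -S -passes=mem2reg on the patched file should be able to promote the allocas (recent LLVM releases expect the -passes= spelling; older ones accepted opt -mem2reg). The alternative is to skip the patching and add -Xclang -disable-O0-optnone to the original clang++ -O0 command, so the attribute is never emitted in the first place.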
