Can the AMDGPU backend introduce LDS (shared memory) instructions during compilation in a kernel that does not explicitly use shared memory?
I have a benchmark which shows this strange behavior.
When I compile the program with two different sets of optimizations, one version seems to issue LDS instructions and the other does not, and the version that uses LDS is faster.
I profiled the program with rocprof to get the number of LDS instructions issued (the SQ_INSTS_LDS counter).
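
One way to cross-check the profiler counter is to look at the generated ISA directly, since LDS instructions use the `ds_` mnemonic prefix in AMDGPU assembly. Below is a minimal sketch that counts them in an assembly dump. The function name `count_lds_instructions` and the sample text are made up for illustration, and it assumes the assembly was already obtained some other way (for example from the compiler's saved temporary files). Note that this is a static count per mnemonic, so it will not match the dynamic SQ_INSTS_LDS value; it only shows whether the compiler emitted any `ds_` instructions at all.

```python
import re
from collections import Counter

# An instruction mnemonic beginning with "ds_" at the start of a line.
DS_MNEMONIC = re.compile(r"^\s*(ds_[a-z0-9_]+)\b")


def count_lds_instructions(isa_text):
    """Return a Counter of ds_* mnemonics found in an AMDGPU assembly dump.

    Hypothetical helper: counts static occurrences only, not executions.
    """
    counts = Counter()
    for line in isa_text.splitlines():
        # Drop trailing assembler comments, which start with ';'.
        code = line.split(";", 1)[0]
        match = DS_MNEMONIC.match(code)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample text, not taken from the benchmark in question.
    sample = """
        v_add_u32 v0, v1, v2
        ds_write_b32 v3, v0
        s_waitcnt lgkmcnt(0)
        ds_read_b32 v4, v3
        ds_read_b32 v5, v3 offset:4
    """
    for mnemonic, n in sorted(count_lds_instructions(sample).items()):
        print(mnemonic, n)
```

Running this on the assembly of both builds should show whether only one of them contains `ds_` instructions, and which ones.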