I just found that when a program contains multiple offload regions, all of the final assembled kernels use the same number of registers, equal to that of the kernel that uses the most. This causes all of my kernels to spill registers, which kills performance. This is surprising, and I didn't see this behavior with the IBM XL compiler.
The reproducer is provided at https://bugs.llvm.org/show_bug.cgi?id=46450
I also noticed the same issue with AOMP.
So I'm wondering where in the compiling/linking flow the bug could be.
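
For anyone who doesn't want to open the bug report, the situation looks roughly like the sketch below. This is not the reproducer attached to the bug, only an illustration of its shape: the function names `light_kernel` and `heavy_kernel` and the loop bodies are made up, and I haven't measured register counts for this exact code. The point is just that there are two independent offload regions with very different register pressure, and I would expect the light one to be assembled with far fewer registers than the heavy one.

```cpp
#include <vector>

// Hypothetical "light" offload region: a single multiply-add per element,
// so it should need only a handful of registers.
void light_kernel(std::vector<double>& y, const std::vector<double>& x, double a) {
    double* yp = y.data();
    const double* xp = x.data();
    const int n = static_cast<int>(y.size());
    #pragma omp target teams distribute parallel for map(tofrom: yp[0:n]) map(to: xp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += a * xp[i];
}

// Hypothetical "heavy" offload region: keeps 16 temporaries live per
// iteration, so it should need noticeably more registers than light_kernel.
void heavy_kernel(std::vector<double>& out, const std::vector<double>& x) {
    double* op = out.data();
    const double* xp = x.data();
    const int n = static_cast<int>(out.size());
    #pragma omp target teams distribute parallel for map(from: op[0:n]) map(to: xp[0:n])
    for (int i = 0; i < n; ++i) {
        double t[16];
        for (int k = 0; k < 16; ++k)
            t[k] = xp[i] * (k + 1);
        double s = 0.0;
        for (int k = 0; k < 16; ++k)
            s += t[k] * t[15 - k];
        op[i] = s;
    }
}
```

Without OpenMP enabled the pragmas are ignored and both functions run on the host, which is handy for checking the results. If the target is an NVIDIA GPU, the verbose mode of `ptxas` (`-v`) reports the registers used per kernel; how to pass that option through depends on the compiler driver.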