Kernel launch error using amdclang++ targeting amdgpu

Hi, I’m trying to compile a simple vectoradd program using hipcc and amdclang++. The code has already been converted to HIP using hipify.
I get the correct output using
hipcc vectoradd.hip -o vectoradd
However, when I use
amdclang++ -x hip vectoradd.hip -o vectoradd --offload-arch=gfx1032 --hip-link
I get an error: Failed to launch vectorAdd kernel (error code the operation cannot be performed in the present state)! My environment is Ubuntu 20.04 with ROCm 5.6.0. After I switched to ROCm 5.4.3, there was no kernel launch error, but the computation was wrong. In fact, the error was the same as with 5.6.0.
My GPU is a 6650 XT. I have tried many compile options but still cannot resolve the issue. What should I do? Both the amdclang++ released by AMD and a clang++ I built myself produce the same error!

If the hipcc call works, try to add -v or -### to see what they do. It’s just a wrapper, so you should be able to collect all the necessary options and such.

Tag: @yxsamliu.

Thank you for the reply! I finally got the correct results using the -O3 option. -O3, -O2, and -O1 all work, while -g and -O0 do not. This error is quite strange indeed.

I can imagine that hipclang defaults to -O2/-O3, while “clang” defaults to -O0. What exactly went wrong is hard to say w/o an error trace, the IR, etc.

Thanks a lot. hipcc defaults to -O3. I tried hipcc -O0 and it also failed, while hipcc -g works normally; -g still uses -O3 optimization. On another machine with Ubuntu 18.04 and ROCm 5.1.2, hipcc -O0 works normally.

@arsenm This issue can be reproduced with trunk with a simpler kernel

It only happens at -O0. Basically, the instructions for loading the kernel arg seem wrong. According to the kernel descriptor, 4 SGPRs are used for the private segment buffer, 2 for the dispatch ptr, and 2 for the queue ptr, so the kernel arg segment ptr should take s[8:9], but the kernel arg does not seem to be loaded using s[8:9]. As a result, the kernel is not able to save the result.
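
For reference, the register count behind that claim can be written out. This is only a sketch: UserSgprs and kernargSegmentPtrBase are hypothetical names, not LLVM or ROCm APIs, and it models just the three fields listed above, assuming they are allocated in that order ahead of the kernel arg segment ptr.

```cpp
// Hypothetical model of the user SGPRs that precede the kernarg segment
// pointer. Only the three fields mentioned in the post are modeled; this
// is an illustration of the arithmetic, not code from LLVM or ROCm.
struct UserSgprs {
    bool privateSegmentBuffer; // occupies 4 SGPRs when enabled
    bool dispatchPtr;          // occupies 2 SGPRs when enabled
    bool queuePtr;             // occupies 2 SGPRs when enabled
};

// Index of the first SGPR of the (2-register) kernarg segment pointer,
// assuming the enabled fields are packed in the order declared above.
constexpr int kernargSegmentPtrBase(UserSgprs u) {
    return (u.privateSegmentBuffer ? 4 : 0)
         + (u.dispatchPtr ? 2 : 0)
         + (u.queuePtr ? 2 : 0);
}

// With all three enabled, 4 + 2 + 2 = 8, so the pointer sits in s[8:9].
static_assert(kernargSegmentPtrBase(UserSgprs{true, true, true}) == 8,
              "kernarg segment ptr expected at s[8:9]");
```

If a kernel descriptor leaves one of these fields disabled, the base index shifts down accordingly, so the s[8:9] expectation only holds for the configuration described here.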

But it is? I don’t see the problem. There’s a bunch of chasing through flat access to a stack slot but it looks correct?

Currently, the ISA emitted from trunk is:

kernel(int*): ; @kernel(int*)
s_mov_b32 s33, 0
s_add_u32 s12, s12, s17
s_addc_u32 s13, s13, 0
s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s12
s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s13
s_add_u32 s0, s0, s17
s_addc_u32 s1, s1, 0
s_load_dwordx2 s[4:5], s[8:9], 0x0 ; p => s[4:5]
s_mov_b64 s[6:7], src_private_base
s_mov_b32 s8, 32
s_lshr_b64 s[6:7], s[6:7], s8
v_mov_b32_e32 v2, 8
v_mov_b32_e32 v0, s6
v_mov_b32_e32 v3, v0
v_mov_b32_e32 v0, 16
v_mov_b32_e32 v4, s6
v_mov_b32_e32 v1, v4
v_mov_b32_e32 v5, v3
v_mov_b32_e32 v4, v2
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v7, s5
v_mov_b32_e32 v6, s4 ; p is now in v[6:7]
flat_store_dwordx2 v[4:5], v[6:7] ; p is stored to v[4:5]
flat_load_dwordx2 v[4:5], v[2:3] ; p is now in v[2:3]
v_mov_b32_e32 v3, v1
v_mov_b32_e32 v2, v0 ; v[2:3] is overwritten, p is never used again
s_waitcnt vmcnt(0) lgkmcnt(0)
flat_store_dwordx2 v[2:3], v[4:5]
flat_load_dwordx2 v[0:1], v[0:1]
v_mov_b32_e32 v2, 7
s_waitcnt vmcnt(0) lgkmcnt(0)
flat_store_dword v[0:1], v2 ; This is store of 7, but v[0:1] is not p