On this Godbolt instance I have a small kernel with vector size 2 from mfakto. It uses these OpenCL builtins: convert_uint2, convert_float2, as_uint2, mad24. However, when I compile it for either gfx906 or nvptx, the compiler sets up call sequences for convert_\* and mad24 rather than emitting instructions:
__private float2 qf = convert_float2(mad24(q.d4, 32768u, q.d3));
qf = qf * 32768.0f;
__private uint2 qi = convert_uint2(qf*nf);
s_add_u32 s16, s16, _Z14convert_float2Dv2_j@rel32@lo+4
s_addc_u32 s17, s17, _Z14convert_float2Dv2_j@rel32@hi+12
s_mov_b64 s[4:5], s[48:49]
s_mov_b64 s[6:7], s[38:39]
s_mov_b64 s[8:9], s[36:37]
s_mov_b64 s[10:11], s[34:35]
s_mov_b32 s12, s53
s_mov_b32 s13, s52
s_mov_b32 s14, s51
s_mov_b32 s15, s50
v_mov_b32_e32 v31, v59
s_swappc_b64 s[30:31], s[16:17]
v_mul_f32_e32 v1, 0x47000000, v1
v_mul_f32_e32 v0, 0x47000000, v0
v_mul_f32_e32 v0, v57, v0
v_mul_f32_e32 v1, v56, v1
(you get the point)
{ // callseq 1, 0
st.param.v2.b32 [param0], {%r7, %r8};
call.uni (retval0), _Z14convert_float2Dv2_j, (param0);
ld.param.v2.b32 {%r9, %r10}, [retval0];
} // callseq 1
To investigate further, I added -emit-llvm. The call to as_uint2 is eliminated in the final IR output, but the calls to convert_\* and mad24 remain.
The question is: why is LLVM doing this? The hardware has instructions for u32/f32 conversion and LLVM is known to use them for shaders. In fact, I can get LLVM to emit them by using __builtin_convertvector:
v_mov_b32_e32 v0, v9
s_swappc_b64 s[30:31], s[54:55]
v_cvt_f32_u32_e32 v0, v0
v_cvt_f32_u32_e32 v1, v1
s_getpc_b64 s[16:17] # this is for a later mul24, irrelevant here
s_add_u32 s16, s16, _Z5mul24Dv2_jS_@rel32@lo+4
s_addc_u32 s17, s17, _Z5mul24Dv2_jS_@rel32@hi+12
s_mov_b64 s[4:5], s[48:49]
v_mul_f32_e32 v0, 0x47000000, v0
v_mul_f32_e32 v1, 0x47000000, v1 # end mul24 stuff
v_mul_f32_e32 v1, v46, v1
v_mul_f32_e32 v0, v47, v0
v_cvt_u32_f32_e32 v46, v0
v_cvt_u32_f32_e32 v47, v1
So… no. I don’t get it. Is there some edge case that I’m not thinking about? Can I in any way assure LLVM that these edge cases will not happen?
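To make the semantics I expect concrete, here is a minimal host-side C sketch of the three kernel lines, using GCC/Clang vector extensions in place of the OpenCL builtins. All of the names in it (`u32x2`, `f32x2`, `to_f32x2`, `to_u32x2`, `mad24_u32x2`, `estimate_qi`) are hypothetical and not part of mfakto. The plain multiply-add standing in for mad24 is an assumption: it should only match when both multiplicands fit in 24 bits, which is the range where OpenCL defines mad24 in the first place.

```c
#include <stdint.h>

/* Two-lane vector types standing in for OpenCL's uint2 and float2. */
typedef uint32_t u32x2 __attribute__((vector_size(8)));
typedef float    f32x2 __attribute__((vector_size(8)));

/* Stand-in for convert_float2(uint2): lane-wise u32 -> f32. */
static inline f32x2 to_f32x2(u32x2 v) {
    return __builtin_convertvector(v, f32x2);
}

/* Stand-in for convert_uint2(float2): lane-wise f32 -> u32, truncating.
   This sketch does not define what happens for NaN or out-of-range lanes. */
static inline u32x2 to_u32x2(f32x2 v) {
    return __builtin_convertvector(v, u32x2);
}

/* Stand-in for mad24(x, y, z) on uint2.
   Assumption: every lane of x and y is below 2^24. */
static inline u32x2 mad24_u32x2(u32x2 x, u32x2 y, u32x2 z) {
    return x * y + z;
}

/* Mirrors the kernel lines:
   qf = float(d4 * 32768 + d3) * 32768; qi = uint(qf * nf). */
static inline u32x2 estimate_qi(u32x2 d4, u32x2 d3, f32x2 nf) {
    const u32x2 k     = {32768u, 32768u};
    const f32x2 scale = {32768.0f, 32768.0f};
    f32x2 qf = to_f32x2(mad24_u32x2(d4, k, d3));
    qf = qf * scale;
    return to_u32x2(qf * nf);
}
```

If this reading is right, the only edge cases I can see are the out-of-range ones noted in the comments, and none of them should force an actual function call.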
Oh. I also tried the spirv64 target, where the output did have the desired OpExtInst %6 %1 u_mad24 .... A similar call is present in -emit-llvm, so perhaps this should really be the job of a later stage?
Or perhaps something already exists that turns these calls into instructions after the gfx9 assembly / nvptx code is generated?