OpenCL conversion operations: why call?

On this Godbolt instance I have a small kernel with vector size 2 from mfakto. It uses these OpenCL builtins: convert_uint2, convert_float2, as_uint2, mad24. However, when compiled for either gfx906 or nvptx, the output sets up call sequences for convert_* and mad24:

  __private float2 qf = convert_float2(mad24(q.d4, 32768u, q.d3));
  qf = qf * 32768.0f;
  
  __private uint2 qi = convert_uint2(qf*nf);

        s_add_u32 s16, s16, _Z14convert_float2Dv2_j@rel32@lo+4
        s_addc_u32 s17, s17, _Z14convert_float2Dv2_j@rel32@hi+12
        s_mov_b64 s[4:5], s[48:49]
        s_mov_b64 s[6:7], s[38:39]
        s_mov_b64 s[8:9], s[36:37]
        s_mov_b64 s[10:11], s[34:35]
        s_mov_b32 s12, s53
        s_mov_b32 s13, s52
        s_mov_b32 s14, s51
        s_mov_b32 s15, s50
        v_mov_b32_e32 v31, v59
        s_swappc_b64 s[30:31], s[16:17]
        v_mul_f32_e32 v1, 0x47000000, v1
        v_mul_f32_e32 v0, 0x47000000, v0
        v_mul_f32_e32 v0, v57, v0
        v_mul_f32_e32 v1, v56, v1
        (you get the point)

        { // callseq 1, 0
        st.param.v2.b32         [param0], {%r7, %r8};
        call.uni (retval0), _Z14convert_float2Dv2_j, (param0);
        ld.param.v2.b32         {%r9, %r10}, [retval0];
        } // callseq 1

To investigate further I added -emit-llvm. The call to as_uint2 is gone from the final IR output, but the calls to convert_* and mad24 remain.

The question is: why is LLVM doing this? The hardware has instructions for u32/f32 conversion and LLVM is known to use them for shaders. In fact, I can get LLVM to emit them by using __builtin_convertvector:

        v_mov_b32_e32 v0, v9
        s_swappc_b64 s[30:31], s[54:55]
        v_cvt_f32_u32_e32 v0, v0
        v_cvt_f32_u32_e32 v1, v1
        s_getpc_b64 s[16:17] # this is for a later mul24, irrelevant here
        s_add_u32 s16, s16, _Z5mul24Dv2_jS_@rel32@lo+4
        s_addc_u32 s17, s17, _Z5mul24Dv2_jS_@rel32@hi+12
        s_mov_b64 s[4:5], s[48:49]
        v_mul_f32_e32 v0, 0x47000000, v0
        v_mul_f32_e32 v1, 0x47000000, v1 # end mul24 stuff
        v_mul_f32_e32 v1, v46, v1
        v_mul_f32_e32 v0, v47, v0
        v_cvt_u32_f32_e32 v46, v0
        v_cvt_u32_f32_e32 v47, v1

So… no. I don’t get it. Is there some edge case that I’m not thinking about? Can I in any way assure LLVM that these edge cases will not happen?


Oh. I also tried the spirv64 target, where the output did have the desired OpExtInst %6 %1 u_mad24 .... A similar call is present in -emit-llvm, so perhaps this should really be the job of a later stage?

Or, perhaps there already is something that turns the calls into instructions after the gfx9 assembly / nvptx code is generated?

as_uint2 is a macro over a built-in intrinsic that clang recognizes, whereas convert_float2 is an external function that has to be lowered further by backend passes. They are handled in different phases. If no optimization pass imports or lowers the call, it is kept as is.

You can check the IR that clang emits before the optimization pipeline runs by passing -Xclang -disable-llvm-passes; it contains no as_uint2 call at all.

So I really should’ve put this in “AMDGPU” category then. In any case, I should benchmark first, now that I do know how to write a version without those calls…

A full dump by mfakto’s own pipeline via rocm reveals no real calls. No idea who or what’s removing them, but I guess there’s a lesson somewhere about what to trust. And a small interpretability bug.