Hi Arthur, Craig,

Thanks for your comments about GCC/Clang intrinsics. I had never considered using them, but they might be a better alternative to inline assembly.

Is there one for regular MUL?

Anyway, I want to go in the opposite direction: where I can, I rely on the compiler's optimizations. If I want to use MULX in Clang, I do it like this:

unsigned long mulx(unsigned long x, unsigned long y, unsigned long* hi)
{
    auto p = (unsigned __int128){x} * y;
    *hi = static_cast<unsigned long>(p >> 64);
    return static_cast<unsigned long>(p);
}

https://godbolt.org/z/PbgFb9

If compiled with -mbmi2 -mtune=generic, it just uses the MULX instruction:

mulx(unsigned long, unsigned long, unsigned long*):
        mov     rcx, rdx
        mov     rdx, rsi
        mulx    rdx, rax, rdi
        mov     qword ptr [rcx], rdx
        ret

What I want to do is take this one step further: rewrite the above mulx() helper without using the __int128 type, in a way that the compiler would still recognize and lower to a MUL/MULX instruction.

A possible implementation looks like this:

uint64_t mul_full_64_generic(uint64_t x, uint64_t y, uint64_t* hi)
{
    uint64_t xl = x & 0xffffffff;
    uint64_t xh = x >> 32;
    uint64_t yl = y & 0xffffffff;
    uint64_t yh = y >> 32;

    uint64_t t = xl * yl;
    uint64_t l = t & 0xffffffff;
    uint64_t h = t >> 32;

    t = xh * yl;
    t += h;
    h = t >> 32;

    t = xl * yh + (t & 0xffffffff);
    l |= t << 32;
    *hi = xh * yh + h + (t >> 32);
    return l;
}

As expected, Clang is currently not able to match this pattern.

If we want to implement this optimization in Clang, I have some questions:

- Can we prove this pattern is equivalent to a full 64x64 → 128 MUL?
- Which pass should this optimization be added to?
- Can this pattern be split into smaller ones, e.g. UMULH?

Paweł