Hi Arthur, Craig,
Thanks for your comments about GCC/Clang intrinsics. I had never considered using them, but they might be a better alternative to inline assembly.
Is there one for regular MUL as well?
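For MULX specifically, I understand there is _mulx_u64 in <immintrin.h> (a BMI2 intrinsic, so it needs -mbmi2). A minimal sketch of how it would be used (mulx_intrin is just an illustrative name):

#include <immintrin.h>

// Sketch only: full 64x64 -> 128 multiply via the BMI2 intrinsic.
// _mulx_u64 returns the low 64 bits of the product and stores the
// high 64 bits through the pointer argument.
unsigned long mulx_intrin(unsigned long x, unsigned long y, unsigned long* hi)
{
    unsigned long long h;
    unsigned long long lo = _mulx_u64(x, y, &h);
    *hi = h;
    return lo;
}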
Anyway, I want to go in the opposite direction: if I can, I rely on the compiler's optimizations. If I want to use MULX in Clang, I do it like this:
unsigned long mulx(unsigned long x, unsigned long y, unsigned long* hi)
{
    auto p = (unsigned __int128){x} * y;
    *hi = static_cast<unsigned long>(p >> 64);
    return static_cast<unsigned long>(p);
}
https://godbolt.org/z/PbgFb9
If compiled with -mbmi2 -mtune=generic, it compiles down to just the MULX instruction:
mulx(unsigned long, unsigned long, unsigned long*):
        mov     rcx, rdx
        mov     rdx, rsi
        mulx    rdx, rax, rdi
        mov     qword ptr [rcx], rdx
        ret
What I want to do is take it one step further: rewrite the above mulx() helper without using the __int128 type, in a way that a compiler would still recognize that it should use the MUL/MULX instruction.
A possible implementation looks like this:
#include <cstdint>

uint64_t mul_full_64_generic(uint64_t x, uint64_t y, uint64_t* hi)
{
    // Split both operands into 32-bit halves.
    uint64_t xl = x & 0xffffffff;
    uint64_t xh = x >> 32;
    uint64_t yl = y & 0xffffffff;
    uint64_t yh = y >> 32;

    // Low partial product: xl*yl.
    uint64_t t = xl * yl;
    uint64_t l = t & 0xffffffff;
    uint64_t h = t >> 32;

    // First middle partial product, xh*yl, plus the carry from below.
    t = xh * yl;
    t += h;
    h = t >> 32;

    // Second middle partial product, xl*yh, plus the low half of the previous sum.
    t = xl * yh + (t & 0xffffffff);
    l |= t << 32;

    // High word: xh*yh plus both middle carries.
    *hi = xh * yh + h + (t >> 32);
    return l;
}
As expected, Clang is currently not able to match this pattern.
If we want to implement this optimization in Clang, I have a few questions:
- Can we prove this pattern is equivalent to a full MUL 64x64 → 128? (A quick randomized sanity check, not a proof, is sketched after this list.)
- Which pass should this optimization be added to?
- Can this pattern be split into smaller ones, e.g. UMULH (just the high half)?
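On the first and last questions, here is a quick randomized sanity check (a smoke test, not a proof) comparing mul_full_64_generic above against the __int128 version, together with a hypothetical umulh_generic helper split out of it:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical UMULH-style helper: only the high 64 bits,
// reusing mul_full_64_generic defined above.
uint64_t umulh_generic(uint64_t x, uint64_t y)
{
    uint64_t hi;
    mul_full_64_generic(x, y, &hi);
    return hi;
}

int main()
{
    for (int i = 0; i < 1000000; ++i) {
        // rand() is only sketch-quality randomness, but enough for a smoke test.
        uint64_t x = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        uint64_t y = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        unsigned __int128 p = (unsigned __int128)x * y;
        uint64_t hi;
        uint64_t lo = mul_full_64_generic(x, y, &hi);
        if (lo != (uint64_t)p || hi != (uint64_t)(p >> 64) ||
            umulh_generic(x, y) != (uint64_t)(p >> 64)) {
            printf("mismatch: x=%llx y=%llx\n",
                   (unsigned long long)x, (unsigned long long)y);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}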
Paweł