on division of __int128 bit integer

Hi Team,

I observer that division of __int128 bit is very heavy operation.
It internally call a routine ‘__udivti3’, which internally call ‘__udivmodti4’.

Due to it the overall performance is much much slower (almost 15 time slower than if I do it via a combination of 64-bit or microsoft ‘_udiv128’).

Also what to know if I can directly call below routine directly from my code :
unsigned long long __udivmodti4 (unsigned long long , unsigned long long , unsigned long long *)

This is against statement “Conceptually,in operation, like ‘/’ or ‘%’, no function is called, so no header is provided for __udivmodti4 etc. – and we should not call them explicitly in our code.” but wanted to know more on it.


First of all, the code in compiler-rt/builtins is rather confusing for me. No idea what tu_int is but I’m guessing this must be unsigned __int128, not unsigned long long.

Secondly the “slow” case proceeds single bit at a time, so this loop may need ~64 iterations in worst case:

From measurements, it looks GCC has it implemented more efficiently although I have no idea where do they have the code for it.

If you are interested, I have division implemented for unsigned 128-bit integer here: https://github.com/chfast/intx/blob/master/include/intx/int128.hpp#L632-L687

From quick check it takes 16 ns, GCC runtime lib: 25 ns, GMP: 20 ns.

It follows “Improved division by invariant integers” https://gmplib.org/~tege/division-paper.pdf

If make sense I can contribute this implementation as replacement for __udivmodti4. Although I’m not sure what requirements for __udivmodti4 are.

Dear Paweł,

Thanks a lot! The solution helped!