Bignums

Hello all!

I'm working on a library with bignum support, and I wanted to try LLVM as an apparently simpler and more portable alternative to my current design (a Haskell script which spits out mixed C and assembly). Porting the script to use the LLVM bindings instead of the current hack was pretty easy. But I have a few remaining questions:

(1) Are bignums exposed to any higher-level language? It would be nice to be able to write all the code in C and compile it with Clang... but I need 256-bit integer support.
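For illustration: the 64x64 -> 128 step is already expressible in C, since Clang and GCC expose a 128-bit type on x86-64; it's exactly the wider types that are missing:

#include <stdint.h>
/* 64x64 -> 128 works today with Clang/GCC's 128-bit type: */
typedef unsigned __int128 u128;
u128 widening_mul(uint64_t a, uint64_t b) {
    return (u128)a * b;   /* compiles down to a single mulq */
}
/* ...but there is no analogous 256-bit type to write the rest with. */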

(2) Is there a way to convince LLVM's register allocator to do the right thing on x86? I'm getting swaths of code like:
         movq 88(%rsp), %rax
         mulq 112(%rsp)
         movq %rax, %r15
         addq %r11, %r15
         movq %rdx, %r14
         adcq %rcx, %r14
         adcq $0, %r9
(That's a 64x64 -> 128-bit multiply with a 192-bit accumulate.) The problem is that %r11 and %rcx are dead here: it should have just added %rax and %rdx into them directly. The result is more movs, more spills, more code, and less performance.

(3) Is there a way to avoid this whole mess? I'm using a script to spit out ugly code in large part because GCC and Clang both refuse to unroll loops that look like

int i,j;
// accumulator uses inline asm to do the 64x64->128->192-bit accumulate above (see the sketch below)
accumulator acc(a[0], b[0]);
tmp[0] = acc.shift();
for (j=1; j<n; j++) {
   for (i=0; i<=j; i++)
     acc.mac(a[i], b[j-i]);
   tmp[j] = acc.shift();
}
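For concreteness, here's a minimal sketch of what such an accumulator looks like (names and the exact asm are illustrative; the shape is a 192-bit accumulator in three 64-bit words, with mac() doing the mulq/addq/adcq sequence from question 2):

#include <stdint.h>

struct accumulator {
    uint64_t lo, mid, hi;   // 192-bit accumulator in three 64-bit words

    accumulator(uint64_t a, uint64_t b) : lo(0), mid(0), hi(0) { mac(a, b); }

    // acc += (128-bit product of a and b), carrying into the top word
    void mac(uint64_t a, uint64_t b) {
        uint64_t l, h;
        // 64x64 -> 128: product ends up in rdx:rax
        asm("mulq %3" : "=a"(l), "=d"(h) : "a"(a), "rm"(b) : "cc");
        // 192-bit accumulate: add l, h, and the carry up the chain
        asm("addq %3, %0\n\t"
            "adcq %4, %1\n\t"
            "adcq $0, %2"
            : "+r"(lo), "+r"(mid), "+r"(hi)
            : "rm"(l), "rm"(h)
            : "cc");
    }

    // return the low 64 bits and shift the accumulator right one word
    uint64_t shift() {
        uint64_t out = lo;
        lo = mid; mid = hi; hi = 0;
        return out;
    }
};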

where n is a template parameter, and thus known at compile time. Is there some clang pass which will unroll this properly? -O3 -funroll-loops doesn't seem to (in fact, -funroll-loops makes it worse... it tries to unroll the loops by a factor of 4 instead of completely unwinding them). Is there some opt pass which can fix it? This matters more once clang does something sane with the registers (its performance is half that of GCC right now), but it'd be nice to know either way.

Thanks for your time!
-- Mike Hamburg

With some more tweaking, LLVM comes within ~10% of GCC. Still, there's a difference. Here's a test case, which does a 128-bit x 128-bit to 256-bit multiply.

define private i192 @mac(i192 %accum, i64 %x, i64 %y) alwaysinline {
entry:
   ; widen to 128 bits, multiply, then extend to 192 bits and accumulate
   %xx = sext i64 %x to i128
   %yy = sext i64 %y to i128
   %zz = mul i128 %xx, %yy
   %zzz = zext i128 %zz to i192
   %out = add i192 %accum, %zzz
   ret i192 %out
}

define [4 x i64] @fullmul([2 x i64] %x, [2 x i64] %y) {
entry:
   %xlo = extractvalue [2 x i64] %x, 0
   %xhi = extractvalue [2 x i64] %x, 1
   %ylo = extractvalue [2 x i64] %y, 0
   %yhi = extractvalue [2 x i64] %y, 1
   %m0 = call i192 @mac(i192 0, i64 %xlo, i64 %ylo)
   %o0 = trunc i192 %m0 to i64
   %out0 = insertvalue [4 x i64] undef, i64 %o0, 0
   %m1 = lshr i192 %m0, 64
   %m10 = call i192 @mac(i192 %m1, i64 %xhi, i64 %ylo)
   %m11 = call i192 @mac(i192 %m10, i64 %xlo, i64 %yhi)
   %o1 = trunc i192 %m11 to i64
   %out1 = insertvalue [4 x i64] %out0, i64 %o1, 1
   %m2 = lshr i192 %m11, 64
   %m20 = call i192 @mac(i192 %m2, i64 %xhi, i64 %yhi)
   %o2 = trunc i192 %m20 to i64
   %out2 = insertvalue [4 x i64] %out1, i64 %o2, 2
   %m3 = lshr i192 %m20, 64
   %o3 = trunc i192 %m3 to i64
   %out3 = insertvalue [4 x i64] %out2, i64 %o3, 3
   ret [4 x i64] %out3
}
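For reference, here's the same schoolbook structure written directly in C with 128-bit temporaries (unsigned multiplies, where the IR above uses sext; the sums are split so that nothing overflows 128 bits, which is what the i192 accumulator sidesteps):

#include <stdint.h>
typedef unsigned __int128 u128;

/* 128x128 -> 256 schoolbook multiply; the middle-column sum is split
   so that no intermediate overflows 128 bits. */
void fullmul_c(uint64_t out[4], const uint64_t x[2], const uint64_t y[2]) {
    u128 m0 = (u128)x[0] * y[0];
    u128 t1 = (u128)x[1] * y[0];
    u128 t2 = (u128)x[0] * y[1];
    out[0] = (uint64_t)m0;
    u128 m1 = (uint64_t)t1 + (u128)(uint64_t)t2 + (uint64_t)(m0 >> 64);
    out[1] = (uint64_t)m1;
    u128 m2 = (t1 >> 64) + (t2 >> 64) + (m1 >> 64) + (u128)x[1] * y[1];
    out[2] = (uint64_t)m2;
    out[3] = (uint64_t)(m2 >> 64);
}

This compiles today; the trouble is persuading the compiler to turn it into the adc chain above.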

The driver:

#include <sys/types.h>
/* This prototype matches how the IR signature gets lowered on x86-64:
   the [4 x i64] result is stored through a pointer passed in %rdi, and
   the two [2 x i64] arguments arrive in %rsi/%rdx and %rcx/%r8 (visible
   in the assembly below). */
extern void fullmul(u_int64_t out[4], u_int64_t a, u_int64_t b, u_int64_t c, u_int64_t d);
int main(int argc, char **argv) {
   u_int64_t x[4] = {1, 2, 3, 4}, n;
   for (n = 0; n < 100000000; n++)
     fullmul(x, x[0], x[1], x[2], x[3]);
   return 0;
}
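(For reference, the build is something like: run the IR through llc to get assembly, then link it with this driver, e.g. "llc -O3 fullmul.ll" and then "clang -O3 driver.c fullmul.s". File names are illustrative.)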

The assembly is:
     movq %rdx, %r9
     movq %r9, %rax
     imulq %rcx
     movq %rax, %r10
     movq %rdx, %r11
     movq %rsi, %rax
     imulq %rcx
     movq %rax, (%rdi)
     movq %rdx, %rcx ;;; extraneous
     addq %r10, %rcx
     adcq $0, %r11
     sbbq %r10, %r10
     andq $1, %r10
     movq %rsi, %rax
     imulq %r8
     addq %rcx, %rax
     movq %rax, 8(%rdi)
     movq %rdx, %rcx ;;; extraneous
     adcq %r11, %rcx
     adcq $0, %r10
     movq %r9, %rax
     imulq %r8
     addq %rcx, %rax
     movq %rax, 16(%rdi)
     adcq %r10, %rdx
     movq %rdx, 24(%rdi)
     ret

... which has 2 extraneous mov instructions. On my laptop (1.66GHz Conroe), removing the extra movs and renaming registers drops the time from 1.95s to 1.75s, of which 0.4s is loop overhead. The difference is less significant on newer hardware (Nehalem).

It's hard to say what impact such extraneous movs have on a real program. My current testing code uses 4x4 Montgomery multiplies, which are longer but have many more extraneous movs. So far I've only tested the GCC version against the LLVM version, and the GCC version is 5-10% faster, but this time the difference is more significant on newer hardware, perhaps because the muls and adcs are less of a bottleneck. The difference is further skewed by the fact that the GCC and LLVM versions use slightly different algorithms, and the LLVM code seems to optimize slightly better in other ways... so it may be that the extra movs are hurting it by more than 5-10%.

Cheers,
Mike