Generating Big Num addition code which uses ADC (add with carry) instructions

I’m trying to work out LLVM code which generates something similar to the following when adding large multiword numbers stored as separate words:

ADD x1 x1
ADC x2 y2
ADC x3 y3

etc, where such a three argument add like ADC on x86 (which includes a carry in the addition) is available as a machine op.

The background to this is that I’m trying to implement fast multiword addition in Haskell, which can compile via LLVM, so I thought if I knew what LLVM code generates the ideal assembly I can work backwards and use the appropriate Haskell prim-ops.

Of course one can use GMP, but I’d suspect with two-four word additions the loss of inlining opportunities and the cost of the external call would be greater than the addition itself. So it would be great to be able to work out how to do it in LLVM and then either implement it with current GHC or add an extra GHC prim-op which allows GHC to access the appropriate LLVM code.

Sorry I don’t know much about LLVM besides my brief readings of the docs, and I’ve posted here because I couldn’t find an llvm-users list. There’s a clang-users list, but I didn’t think that would be appropriate because GHC compiles via LLVM directly, not via clang.

Any help or guidance appreciated.

Clinton

This might “just work” on many targets by using the appropriate integral type.

e.g. on x86_64, something like

%a = add i128 %b, %c

will lower to

add rax,rbx

adc rcx,rdx

Some caveats, based on my experience with LLVM 3.3, so things may have changed by then:

  • Not all targets support operations on integers wider than the “native width”. e.g. on x86_64, trying to multiply i256 x i256 will net you an assert. I believe i128 x i128 is handled, perhaps by a function call, but nothing larger.

  • At least on x86_64, everything is fully unrolled. So adding i2048 + i2048 will generate an ADD and 31 ADC instructions, with no looping. This makes sense for “small” multiprecision integers, but can get silly as things get wider.

In my project, I am using i128 in a few places (add, subtract, shift) to get efficient code, and relying on my own library functions for anything wider.

I’m looking for code that produces the results that the i128 code produces but by instead using multiple i64s. Like I said, this is Haskell primops generating the llvm code, so it would really good if I could just point LLVM in the right direction instead of having to write a new primop for GHC. If I can see working LLVM code that does an efficient addition with carry on multiple machine size words I can then perhaps work out the Haskell primops that would produce that code.

You can make up the i128 by shifting and masking.

Here’s a snippet of IR from my compiler showing how i64 quantities are loaded from memory, combined to make an i128, added, and decomposed back to i64. LLVM is smart enough to figure out that the shifting isn’t really required.

%17 = getelementptr inbounds i64* %13, i32 1
%18 = load i64* %17
%19 = zext i64 %18 to i128
%20 = shl i128 %19, 64
%21 = getelementptr inbounds i64* %13, i32 0
%22 = load i64* %21
%23 = zext i64 %22 to i128
%24 = or i128 %20, %23
%25 = getelementptr inbounds i64* %15, i32 1
%26 = load i64* %25
%27 = zext i64 %26 to i128
%28 = shl i128 %27, 64
%29 = getelementptr inbounds i64* %15, i32 0
%30 = load i64* %29
%31 = zext i64 %30 to i128
%32 = or i128 %28, %31
%33 = add i128 %24, %32
%34 = trunc i128 %33 to i64
%35 = getelementptr inbounds i64* %16, i32 0
store i64 %34, i64* %35
%36 = lshr i128 %33, 64
%37 = trunc i128 %36 to i64
%38 = getelementptr inbounds i64* %16, i32 1
store i64 %37, i64* %38

And here’s the x64_64 assembly that LLVM produced from the above:

0000000000000000 <LMtop.Fadd>:
0: 48 8b 47 18 mov 0x18(%rdi),%rax
4: 48 8b 48 40 mov 0x40(%rax),%rcx
8: 48 8b 50 48 mov 0x48(%rax),%rdx
c: 48 03 48 30 add 0x30(%rax),%rcx
10: 48 13 50 38 adc 0x38(%rax),%rdx
14: 48 89 48 50 mov %rcx,0x50(%rax)
18: 48 89 50 58 mov %rdx,0x58(%rax)
1c: c3 retq

The approach looks horrid, but the results are what you want, and work with LLVM as-is today.