Bignum development

Hi all,

After searching for a decent compiler backend for ages (google
sometimes isn't helpful), I recently stumbled upon LLVM. Woot!!

I work on bignum arithmetic (I'm a professional mathematician) and
have recently decided to switch from developing GPL'd bignum code to
BSD licensed code. (See http://www.mpir.org/ which I contributed to
for a while - a fork of GMP).

Please bear with me if these are stupid questions (I am a
mathematician and have only a glancing knowledge of compiler internals
- though not quite zero knowledge).

Looking through the language reference, I see you have an int type
which can be any number of bits. On x86_64, however, I see that mul
only "works" for up to 128 bits. In all cases it appears to return
only the low half of the product.

For addition, I still haven't understood the documentation well enough
to know whether a carry out from an int64 + int64 addition can be
retrieved, e.g. through a trap, and how to do ADC. There are also some
other primitives which might be useful for bignum....

Obviously many non-C languages have optimised bignum capabilities
these days, and you do NOT get good performance by writing such
primitives in C!!

Anyhow, to my questions:

a) What plans are there to support addition, subtraction,
multiplication, division, shifting, logical not and other logic ops
for >= 64/128 bits (machine word size, whatever)? The documentation
mentions integers of millions of bits....

b) Would it be desirable to add primitives for retrieving both the low
and high half of a multiplication and optimisations to combine where
wide multiplications are available on the chip?

c) I want to ask something about retrieving the carry or borrow from
an addition or subtraction and using it in an ADC or SBB..... not sure
what to ask.....

d) Are you guys at all interested in supporting languages that provide
multiprecision arithmetic? I have a vague notion that I'd like to sign
on to help with that (and may be able to bring some other experienced
devels with me if there was some interest).

I did see APInt and understand the necessity of such a library for
constant folding etc. but my questions aren't specifically about that
(though perhaps I can help out there too, if anything still remains to
be done....) Specifically I am talking about providing so-called
"native" bignum support, i.e. highly optimised,
riding-on-the-bare-metal-type bignum stuff.

Please understand this is just an initial email to enquire about
current status, plans, etc, so I can gauge how my existing expertise
might or might not be of use.

Bill Hart.

P.S: I'm excited about finding LLVM. It is without a doubt one of the
most outstanding Open Source projects I have encountered to date!! But
you already knew that.

a) What plans are there to support addition, subtraction,
multiplication, division, shifting, logical not and other logic ops
for >= 64/128 bits (machine word size, whatever)? The documentation
mentions integers of millions of bits....

Pretty much everything besides multiplication and division should work
at any width; the result is generally not especially efficient,
though.

b) Would it be desirable to add primitives for retrieving both the low
and high half of a multiplication and optimisations to combine where
wide multiplications are available on the chip?

This should already work where applicable; for example, the following
C code gives an appropriately optimized result on x86-32:

int a(int x, int y) { return ((long long)x)*y >> 32; }

c) I want to ask something about retrieving the carry or borrow from
an addition or subtraction and using it in an ADC or SBB..... not sure
what to ask.....

LLVM automatically expands wide addition/subtraction with ADC/SBB.
There's no other general way to write something which optimizes to an
ADC/SBB, at least at the moment. What do you want here?
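
For example, this trivial C function (a minimal sketch) already
compiles to an ADD followed by an ADC on x86-64:

/* 128-bit addition; the backend expands this to ADD/ADC. */
__uint128_t add128(__uint128_t a, __uint128_t b)
{
   return a + b;
}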

d) Are you guys at all interested in supporting languages that provide
multiprecision arithmetic? I have a vague notion that I'd like to sign
on to help with that (and may be able to bring some other experienced
devels with me if there was some interest).

By that, you mean languages that provide arbitrary-width integers?
You can already support that by lowering such operations to calls to a
multi-precision library instead of IR operations in your frontend. It
would be an interesting experiment to write a pass which converts IR
operations which are too wide to calls to a multi-precision library,
though.

-Eli

Hi Eli,

a) What plans are there to support addition, subtraction,
multiplication, division, shifting, logical not and other logic ops
for >= 64/128 bits (machine word size, whatever)? The documentation
mentions integers of millions of bits....

Pretty much everything besides multiplication and division should work
at any width; the result is generally not especially efficient,
though.

OK, I only tried multiplication last night when I first had a decent
fiddle (been marking exams, so haven't had much time to really get
my hands dirty, sorry).

Of course efficiency is imperative for bignum stuff as it underpins
absolutely behemoth projects like Sage (http://www.sagemath.org).
Every cycle counts, so to speak. One cycle per limb difference in
multiplication makes a nearly 40% difference in absolutely everything
for literally many thousands of developers!

Right; improvements are definitely welcome. You might want to
take a look at what we currently generate here, though, to see what
improvements are most necessary.

b) Would it be desirable to add primitives for retrieving both the low
and high half of a multiplication and optimisations to combine where
wide multiplications are available on the chip?

This should already work where applicable; for example, the following
C code gives an appropriately optimized result on x86-32:

int a(int x, int y) { return ((long long)x)*y >> 32; }

Right. I'll take a closer look at that, but I don't think it is what I mean.

I'm specifically talking about x86_64 and returning the high 64 bits
of a 64x64 bit multiply. I don't think this is done, or done
efficiently in LLVM currently. Or I might be missing something.

Should work the same as the previous example; try the following on x86-64:
long long a(long long x, long long y) { return ((__int128_t)x)*y >> 64; }

A basic operation in bignum stuff is addmul_1, i.e. multiply a
multiple precision integer with a single "limb" (machine word) and add
the resulting multiple precision result to another bignum. This is not
best achieved by doing lots of wide multiplies, dealing with carries,
then adding. On AMD x86_64 we get 2.5 cycles per limb in direct
assembly for addmul_1 (actually achieving the maximum retirement rate
for macro ops on AMD), but more like 10 cycles per limb with C. Even
the assembly version of addmul_1 is not efficient enough for a
multiplication basecase though, where you essentially need the whole
basecase multiplication done in assembly!
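
For reference, the portable C version of addmul_1 looks something like
this (a sketch using __uint128_t; the hand-written assembly versions
beat it handily):

/* r[0..n-1] += a[0..n-1] * b; returns the carry limb.
   The product plus the two additions cannot overflow 128 bits. */
unsigned long addmul_1(unsigned long *r, const unsigned long *a,
                       long n, unsigned long b)
{
   unsigned long cy = 0;
   long i;

   for (i = 0; i < n; i++) {
      __uint128_t u = (__uint128_t) a[i] * b + r[i] + cy;
      r[i] = (unsigned long) u;
      cy = (unsigned long) (u >> 64);
   }

   return cy;
}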

It is this assembly basecase which I would like to be able to code
directly in LLVM assembly.

Hmm... might be interesting, but you'll likely get much better
practical results by just

c) I want to ask something about retrieving the carry or borrow from
an addition or subtraction and using it in an ADC or SBB..... not sure
what to ask.....

LLVM automatically expands wide addition/subtraction with ADC/SBB.
There's no other general way to write something which optimizes to an
ADC/SBB, at least at the moment. What do you want here?

I should take a look at what happens for a multiprecision
addition/subtraction before I answer that one. I merely assumed that
this also only worked up to 128 bits on x86_64 as per multiplication.

It should work at any width (although once you get to extremely wide
integers, the number of spills makes the code get messy).

Also, this isn't something we do at the moment, but it would be
possible to teach the backend to transform code using the i1 output of
llvm.uadd.with.overflow into an ADC, etc.
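
For illustration, here is the usual portable C idiom for an add with
carry (a sketch; as noted above, we don't currently pattern-match this
to ADC either):

/* cy_in must be 0 or 1. An unsigned sum wraps iff the result is
   smaller than one of the addends; the two partial carries can
   never both be 1, so the carry out is 0 or 1. */
unsigned long add_with_carry(unsigned long a, unsigned long b,
                             unsigned long cy_in, unsigned long *cy_out)
{
   unsigned long s = a + cy_in;
   unsigned long c = (s < cy_in);
   s += b;
   c += (s < b);
   *cy_out = c;
   return s;
}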

d) Are you guys at all interested in supporting languages that provide
multiprecision arithmetic? I have a vague notion that I'd like to sign
on to help with that (and may be able to bring some other experienced
devels with me if there was some interest).

By that, you mean languages that provide arbitrary-width integers?

Right. E.g. Haskell, python, ECL, many others.

You can already support that by lowering such operations to calls to a
multi-precision library instead of IR operations in your frontend.

Indeed that is what they do.... currently. :-)

I work on the bignum libraries though, not language front ends. So, I
want to know how to implement a bignum library using LLVM. At the
moment, I can't get close enough to the machine to do anything
efficient. These libraries usually have large amounts of assembly, not
C. But LLVM is so capable, I think we've been doing this wrong all
along. Instead of writing assembly for each machine, we should be
using LLVM assembly code and writing part of the bignum library on the
"other side" of this interface.

If you have time, check out the cminusminus language reference. They
had a bits1 type for flags, like carry from addition, and a mulhi and
mullo instruction. Now cminusminus never took off and died an untimely
death (largely because they wrote it in some bizarre ML in my
opinion), but from a bignum point of view, they got this one minor
detail right.

Hmm... LLVM currently uses i1 pretty extensively, and you can express
equivalents of mulhi/mullo as mentioned above.

There are a few ways this could be done though. I mean, imagine if LLVM
optimised around bignum calls by, say, breaking bignum calls up into
more basic primitives and rearranging and coalescing those, etc. I'm
not saying it *should* do that. It would be a multi-year effort, and
probably a large grant application to do it (I'm seriously toying with
the idea...). But I am interested to know what the existing intentions
of the LLVM devs are for supporting arbitrary precision stuff, as it
seemed, at least at first blush, as though something non-trivial may
have been planned all along.

It interested me greatly when I noticed you had these intn types for
an arbitrary number of bits, and it almost made me think some
planning had gone into supporting arbitrary precision arithmetic as
basic instructions in LLVM. Bizarre, but not stupid, perhaps.

There really isn't any grand plan here; LLVM switched a while back
from named integer types to a generalized integer type as part of a
large type-system rewrite; the types were left unrestricted just
because there wasn't any particular reason to restrict them. The
support for such types has gradually improved since then. And most
recently, we've started generating such types in the optimizers to
allow transforming complex structures into SSA form.

There have really only been sporadic queries about practical support
for real multi-precision arithmetic, but we'd welcome any
improvements.

-Eli

Hi Eli,

a) What plans are there to support addition, subtraction,
multiplication, division, shifting, logical not and other logic ops
for >= 64/128 bits (machine word size, whatever)? The documentation
mentions integers of millions of bits....

Pretty much everything besides multiplication and division should work
at any width; the result is generally not especially efficient,
though.

OK, I only tried multiplication last night when I first had a decent
fiddle (been marking exams, so haven't had much time to really get
my hands dirty, sorry).

Of course efficiency is imperative for bignum stuff as it underpins
absolutely behemoth projects like Sage (http://www.sagemath.org).
Every cycle counts, so to speak. One cycle per limb difference in
multiplication makes a nearly 40% difference in absolutely everything
for literally many thousands of developers!

Right; improvements are definitely welcome. You might want to
take a look at what we currently generate here, though, to see what
improvements are most necessary.

Roger that.

b) Would it be desirable to add primitives for retrieving both the low
and high half of a multiplication and optimisations to combine where
wide multiplications are available on the chip?

This should already work where applicable; for example, the following
C code gives an appropriately optimized result on x86-32:

int a(int x, int y) { return ((long long)x)*y >> 32; }

Right. I'll take a closer look at that, but I don't think it is what I mean.

I'm specifically talking about x86_64 and returning the high 64 bits
of a 64x64 bit multiply. I don't think this is done, or done
efficiently in LLVM currently. Or I might be missing something.

Should work the same as the previous example; try the following on x86-64:
long long a(long long x, long long y) { return ((__int128_t)x)*y >> 64; }

Ah, I looked for this last night for some time. I tried uint128_t,
u_int128_t, uint128, int128, everything you can imagine... except of
course .... that. Should've read the manual.

Damn, it does do this in one mul for unsigned multiplication! I feel
embarrassed now. The stupid thing is I actually know this is what
happens in gcc 4.4.x when you do something like the above in C. I
should've thought of this.

That totally solves my "problem".
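
For the record, the unsigned variant (a minimal sketch) compiles to a
single MUL on x86-64, with the high half of the product coming back in
RDX:

/* High 64 bits of an unsigned 64x64-bit multiply: one MULQ. */
unsigned long mulhi(unsigned long x, unsigned long y)
{
   return (unsigned long) (((__uint128_t) x * y) >> 64);
}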

A basic operation in bignum stuff is addmul_1, i.e. multiply a
multiple precision integer with a single "limb" (machine word) and add
the resulting multiple precision result to another bignum. This is not
best achieved by doing lots of wide multiplies, dealing with carries,
then adding. On AMD x86_64 we get 2.5 cycles per limb in direct
assembly for addmul_1 (actually achieving the maximum retirement rate
for macro ops on AMD), but more like 10 cycles per limb with C. Even
the assembly version of addmul_1 is not efficient enough for a
multiplication basecase though, where you essentially need the whole
basecase multiplication done in assembly!

It is this assembly basecase which I would like to be able to code
directly in LLVM assembly.

Hmm... might be interesting, but you'll likely get much better
practical results by just

... using GMP? Right. Good plan. Been there, done that.

Going to work on a project which for the time being is called bsdnt
(virtually vaporware at this stage). Definitely don't want to just
replicate GMP with a BSD license. We're interested in parallel code
and lots of other goodies. Implementing on top of LLVM is (probably)
perfect for this.

c) I want to ask something about retrieving the carry or borrow from
an addition or subtraction and using it in an ADC or SBB..... not sure
what to ask.....

LLVM automatically expands wide addition/subtraction with ADC/SBB.
There's no other general way to write something which optimizes to an
ADC/SBB, at least at the moment. What do you want here?

I should take a look at what happens for a multiprecision
addition/subtraction before I answer that one. I merely assumed that
this also only worked up to 128 bits on x86_64 as per multiplication.

It should work at any width (although once you get to extremely wide
integers, the number of spills makes the code get messy).

Also, this isn't something we do at the moment, but it would be
possible to teach the backend to transform code using the i1 output of
llvm.uadd.with.overflow into an ADC, etc.

d) Are you guys at all interested in supporting languages that provide
multiprecision arithmetic? I have a vague notion that I'd like to sign
on to help with that (and may be able to bring some other experienced
devels with me if there was some interest).

By that, you mean languages that provide arbitrary-width integers?

Right. E.g. Haskell, python, ECL, many others.

You can already support that by lowering such operations to calls to a
multi-precision library instead of IR operations in your frontend.

Indeed that is what they do.... currently. :-)

I work on the bignum libraries though, not language front ends. So, I
want to know how to implement a bignum library using LLVM. At the
moment, I can't get close enough to the machine to do anything
efficient. These libraries usually have large amounts of assembly, not
C. But LLVM is so capable, I think we've been doing this wrong all
along. Instead of writing assembly for each machine, we should be
using LLVM assembly code and writing part of the bignum library on the
"other side" of this interface.

If you have time, check out the cminusminus language reference. They
had a bits1 type for flags, like carry from addition, and a mulhi and
mullo instruction. Now cminusminus never took off and died an untimely
death (largely because they wrote it in some bizarre ML in my
opinion), but from a bignum point of view, they got this one minor
detail right.

Hmm... LLVM currently uses i1 pretty extensively, and you can express
equivalents of mulhi/mullo as mentioned above.

What is i1? Sorry for the really daft question. These short
abbreviations are sometimes hard to look up.....

There are a few ways this could be done though. I mean, imagine if LLVM
optimised around bignum calls by, say, breaking bignum calls up into
more basic primitives and rearranging and coalescing those, etc. I'm
not saying it *should* do that. It would be a multi-year effort, and
probably a large grant application to do it (I'm seriously toying with
the idea...). But I am interested to know what the existing intentions
of the LLVM devs are for supporting arbitrary precision stuff, as it
seemed, at least at first blush, as though something non-trivial may
have been planned all along.

It interested me greatly when I noticed you had these intn types for
an arbitrary number of bits, and it almost made me think some
planning had gone into supporting arbitrary precision arithmetic as
basic instructions in LLVM. Bizarre, but not stupid, perhaps.

There really isn't any grand plan here; LLVM switched a while back
from named integer types to a generalized integer type as part of a
large type-system rewrite; the types were left unrestricted just
because there wasn't any particular reason to restrict them. The
support for such types has gradually improved since then. And most
recently, we've started generating such types in the optimizers to
allow transforming complex structures into SSA form.

OK. Thanks. That's interesting to know. I'll see if I can now get a
better understanding of the evolution of this.

There have really only been sporadic queries about practical support
for real multi-precision arithmetic, but we'd welcome any
improvements.

Sure. I'm mainly thinking of LLVM's stated aim of "offering support
for other non-C languages", which is (a paraphrase of) what you have
somewhere, listed in the open projects on the LLVM site. It occurs to
me that bignum support is one thing that many languages require, but
currently have to use GMP for. If your language is BSD licensed
though, that is out of the question, hence some pretty poor bignum
implementations out there (I mean relatively speaking
performance-wise).

There are obviously numerous ways we might use LLVM to aid development
of "bsdnt". I'll keep exploring those options. It sounds like, for the
time being, analysing existing code output and looking for ways to
improve it on certain arches is perhaps one way we may be of
assistance.

By the way, the LLVM website is absolutely spectacular. The attention
to detail is amazing. There were some things that I found oddly
difficult to find, and I am still mystified why I googled for such a
long time for something like LLVM (days of googling, not minutes) and
didn't hit upon it. I wonder if I simply overlooked it because I
thought it was "just" a JIT or virtual machine.

Heh, ever thought of putting lots of terms like "faster than C",
"high level assembly language", "lower level than C", "modern compiler
back end", "better than gcc's RTL", "similar to RTL", etc. on the
site, to make things more visible to a certain class of google
searches? (Better than gcc, faster than gcc, BSD licensed C compiler,
beats gcc in the programming language shootout - heh, naughty me). :-)

Like the /. article today. Congrats!!

Bill.

If you have time, check out the cminusminus language reference. They
had a bits1 type for flags, like carry from addition, and a mulhi and
mullo instruction. Now cminusminus never took off and died an untimely
death (largely because they wrote it in some bizarre ML in my
opinion), but from a bignum point of view, they got this one minor
detail right.

Hmm... LLVM currently uses i1 pretty extensively, and you can express
equivalents of mulhi/mullo as mentioned above.

What is i1? Sorry for the really daft question. These short
abbreviations are sometimes hard to look up.....

Don't worry, someone got this question.

I was focused on the ML comment and imagined that i1 was some bizarre
low level ML!

One bit integer. Now I certainly get that!! (reminds me of a friend
who claimed to have a 1 bit hard drive :-))

I now see that LLVM assembly has been designed with C in mind. If I
think like a C compiler I think I'm right. Time to get my front end to
spit out some appropriate IR and see how we go!

Bill.

What is i1? Sorry for the really daft question. These short
abbreviations are sometimes hard to look up.....

1-bit integer; in practice, usually used like a boolean, although it
supports the full complement of integer operations.

There have really only been sporadic queries about practical support
for real multi-precision arithmetic, but we'd welcome any
improvements.

Sure. I'm mainly thinking of LLVM's stated aim of "offering support
for other non-C languages", which is (a paraphrase of) what you have
somewhere, listed in the open projects on the LLVM site. It occurs to
me that bignum support is one thing that many languages require, but
currently have to use GMP for. If your language is BSD licensed
though, that is out of the question, hence some pretty poor bignum
implementations out there (I mean relatively speaking
performance-wise).

There are obviously numerous ways we might use LLVM to aid development
of "bsdnt". I'll keep exploring those options. It sounds like, for the
time being, analysing existing code output and looking for ways to
improve it on certain arches is perhaps one way we may be of
assistance.

Sounds like an interesting project. We're always happy to answer
questions and add projects to the http://llvm.org/ProjectsWithLLVM/
page.

By the way, the LLVM website is absolutely spectacular. The attention
to detail is amazing. There were some things that I found oddly
difficult to find, and I am still mystified why I googled for such a
long time for something like LLVM (days of googling, not minutes) and
didn't hit upon it. I wonder if I simply overlooked it because I
thought it was "just" a JIT or virtual machine.

Heh, ever thought of putting lots of terms like "faster than C",
"high level assembly language", "lower level than C", "modern compiler
back end", "better than gcc's RTL", "similar to RTL", etc. on the
site, to make things more visible to a certain class of google
searches? (Better than gcc, faster than gcc, BSD licensed C compiler,
beats gcc in the programming language shootout - heh, naughty me). :-)

I believe Chris wrote most of the copy on the homepage; perhaps he can comment.

-Eli

There are obviously numerous ways we might use LLVM to aid development
of "bsdnt". I'll keep exploring those options. It sounds like, for the
time being, analysing existing code output and looking for ways to
improve it on certain arches is perhaps one way we may be of
assistance.

Sounds like an interesting project. We're always happy to answer
questions and add projects to the http://llvm.org/ProjectsWithLLVM/
page.

As soon as we decide whether we can actually do what we hope, and when
we actually have some code which is not imaginary, definitely I'll
announce it here in the hope of getting a plug. :-)

Believe it or not, one of the critical issues for one of our devs is
Win64 support. I know that's not available at this point. If anyone
has any comments on that, it would be great to know what the plans
are, if any, for this.

By the way, is it generally helpful if I post relevant C or IR
sequences here which do not optimise as one would hope/expect? I think
I have found some, but need to check....

Bignum stuff historically explores interesting corners of compiler
technology. :-)

Bill.

There are obviously numerous ways we might use LLVM to aid development
of "bsdnt". I'll keep exploring those options. It sounds like, for the
time being, analysing existing code output and looking for ways to
improve it on certain arches is perhaps one way we may be of
assistance.

Sounds like an interesting project. We're always happy to answer
questions and add projects to the http://llvm.org/ProjectsWithLLVM/
page.

As soon as we decide whether we can actually do what we hope, and when
we actually have some code which is not imaginary, definitely I'll
announce it here in the hope of getting a plug. :-)

Believe it or not, one of the critical issues for one of our devs is
Win64 support. I know that's not available at this point. If anyone
has any comments on that, it would be great to know what the plans
are, if any, for this.

The only issue I know of is http://llvm.org/bugs/show_bug.cgi?id=5005;
perhaps someone else knows more.

By the way, is it generally helpful if I post relevant C or IR
sequences here which do not optimise as one would hope/expect? I think
I have found some, but need to check....

Sure; send a bunch in a single email, and I'll make sure to go through
them. Likely some are already in Bugzilla and/or the various
README.txt files (lib/Target/README.txt, lib/Target/X86/README.txt,
etc.).

-Eli

There are obviously numerous ways we might use LLVM to aid development
of "bsdnt". I'll keep exploring those options. It sounds like, for the
time being, analysing existing code output and looking for ways to
improve it on certain arches is perhaps one way we may be of
assistance.

Sounds like an interesting project. We're always happy to answer
questions and add projects to the http://llvm.org/ProjectsWithLLVM/
page.

As soon as we decide whether we can actually do what we hope, and when
we actually have some code which is not imaginary, definitely I'll
announce it here in the hope of getting a plug. :-)

Believe it or not, one of the critical issues for one of our devs is
Win64 support. I know that's not available at this point. If anyone
has any comments on that, it would be great to know what the plans
are, if any, for this.

The only issue I know of is http://llvm.org/bugs/show_bug.cgi?id=5005;
perhaps someone else knows more.

Wait, you *do* support Win64? You seriously need to put this on your webpage!!

It's not just me that missed this!!

By the way, is it generally helpful if I post relevant C or IR
sequences here which do not optimise as one would hope/expect? I think
I have found some, but need to check....

Sure; send a bunch in a single email, and I'll make sure to go through
them. Likely some are already in Bugzilla and/or the various
README.txt files (lib/Target/README.txt, lib/Target/X86/README.txt,
etc.).

OK, will read those first (and save them up as I find them).

Bill.

Support is in progress.
http://llvm.org/releases/2.7/docs/ReleaseNotes.html#portability is the
set of architectures which have gone through release verification for
the most recent release, and is a subset of usable architectures.
Perhaps it would be appropriate to put a list of in-progress
architectures somewhere.

-Eli

I'd prefer to avoid inflammatory stuff on the web page, and stick to quietly building better compilers and tools. The blog is a good place for advocacy though.

-Chris

I think I have an example of why it is important to be able to
retrieve the carry and do an add with carry. Consider this short C
program:

#include <stdlib.h>
#include <stdio.h>

#define BITS 64

/****************************************

   Types

****************************************/

typedef unsigned long ul;
typedef __uint128_t ull;
typedef ul mp_size;
typedef const ul * mp_src;
typedef ul * mp_dst;
typedef ul * mp_ptr;

/****************************************

   Random routines

****************************************/

ull __randval = (ull) 13993185049168412078UL;
const ull __randprime = (ull) 9223372036854775814UL * 2 + 1;
const ull __randmult = 18148508189596611927UL;

ul ul_randlimb(void)
{
   __randval = (__randval * __randmult) % __randprime;
   return (ul) __randval;
}

/****************************************

   Unsigned multiple precision routines

****************************************/

mp_ptr mp_init(mp_size n)
{
   return malloc(n*sizeof(ul));
}

static inline
ul mp_add_nc(mp_dst r, mp_src a, mp_src b, mp_size n)
{
   long i;

   ul cy;

   const __uint128_t v = (__uint128_t) a[0] + b[0];
   r[0] = (ul) v;
   cy = v >> BITS;

   for (i = 1; i < n; i++) {
      __uint128_t u = (__uint128_t) a[i] + b[i] + cy;
      r[i] = (ul) u;
      cy = u >> BITS;
   }

   return cy;
}

void mp_rand_n(mp_dst r, mp_size n)
{
   mp_size i;

   for (i = 0; i < n; i++)
      r[i] = ul_randlimb();
}

int main(void)
{

   mp_ptr a, b, c;
   ul i;

   a = mp_init(1000);
   b = mp_init(1000);
   c = mp_init(1000);

   mp_rand_n(a, 1000);
   mp_rand_n(b, 1000);

   for (i = 0; i < 2400000; i++)
      mp_add_nc(c, a, b, 1000);

   return 0;
}

Ignore all of it except the mp_add_nc function. Now this runs at 4
cycles per int64 addition on AMD K10. If I fiddle a bit and loop
unroll, I get 2.5 cycles. But optimal is actually 1.6 cycles.

The part of the loop in question becomes:

  %tmp.i = add i64 %indvar.i, 1 ; <i64> [#uses=2]
  %22 = load i64* %scevgep.i, align 8 ; <i64> [#uses=1]
  %23 = zext i64 %22 to i128 ; <i128> [#uses=1]
  %24 = load i64* %scevgep3.i, align 8 ; <i64> [#uses=1]
  %25 = zext i64 %24 to i128 ; <i128> [#uses=1]
  %26 = zext i64 %cy.02.i to i128 ; <i128> [#uses=1]
  %27 = add i128 %23, %26 ; <i128> [#uses=1]
  %28 = add i128 %27, %25 ; <i128> [#uses=2]
  %29 = trunc i128 %28 to i64 ; <i64> [#uses=1]
  store i64 %29, i64* %scevgep4.i, align 8
  %30 = lshr i128 %28, 64 ; <i128> [#uses=1]
  %31 = trunc i128 %30 to i64 ; <i64> [#uses=1]
  %exitcond = icmp eq i64 %tmp.i, 999 ; <i1> [#uses=1]

In other words, it just extends everything to an i128 and adds.
There's no way to tell it that it can add a[i], b[i] and cy with a
single adc. (Well it could if the loop iteration wasn't messing with
the carry flag).

Indeed, this compiles to:

.LBB1_7: # %bb.i
                                        # Parent Loop BB1_6 Depth=1
                                        # => This Inner Loop Header: Depth=2
        addq (%rbx,%rsi,8), %rdi
        movl $0, %r8d
        adcq $0, %r8
        addq (%r14,%rsi,8), %rdi
        adcq $0, %r8
        movq %rdi, (%r15,%rsi,8)
        incq %rsi
        cmpq $1000, %rsi # imm = 0x3E8
        movq %r8, %rdi
        jne .LBB1_7

So it basically tries to keep track of the carry in %r8 instead of in
the carry flag.

As hinted, the other optimisation missed here is that instead of
comparing with $1000 it can start %rsi at $-1000, increment it each
iteration, and simply do a jnz .LBB1_7, doing away with the cmpq.
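
In C, the idiom I'd like it to discover looks something like this (a
sketch reusing the types from the program above; the pointers are
pre-advanced past the ends so the index can run from -n up to 0):

static inline
ul mp_add_nc_neg(mp_dst r, mp_src a, mp_src b, mp_size n)
{
   ul cy = 0;
   long i;

   r += n; a += n; b += n;

   /* The increment sets the zero flag, so no cmpq is needed. */
   for (i = -(long) n; i != 0; i++) {
      __uint128_t u = (__uint128_t) a[i] + b[i] + cy;
      r[i] = (ul) u;
      cy = (ul) (u >> BITS);
   }

   return cy;
}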

I've tried tricking it into doing this in every way I can think of,
but without success so far.

Bill.

I was able to get the loop to increment from -999 to 0 using IR
directly. That got rid of the cmpq.

The carry I was after could be obtained using the intrinsic
@llvm.uadd.with.overflow.i64; however, there is no way to add with
carry and have it realise that the resulting *carry out* cannot exceed
1. It actually writes the carry to a byte and then uses logical
operations on it, which slows things down much more.

I guess what is needed is an intrinsic @llvm.uadc.with.overflow.i64,
which would take three arguments: the two i64's being added and an
i1, the carry from before.

This and a similar usbb might be the only things missing to make IR
efficient for developing low level routines for a bignum library!

Bill.

Hi Bill-

I think, ideally, the backend would be able to match arbitrary-precision arithmetic to add-with-carry or subtract-with-borrow through i65/i33. That would remove the need for the overflow intrinsics entirely.

Alistair

Yeah I had a think about it, and I think intrinsics are the wrong way
to do it. So I'd say you are likely right.

Bill.

Yeah I had a think about it, and I think intrinsics are the wrong way
to do it. So I'd say you are likely right.

For this to work well, the way the code generators handle flags will need
to be improved: currently it is suboptimal, in fact kind of a hack.

Ciao,

Duncan.

I think, from the C compiler's point of view, you are going to want
this to work for any size above an i64, i.e. all the way up to an
i128, so that if the user of the C compiler does this computation with
__uint128_t's then it will Do The Right Thing TM.

Basically, you want

unsigned long a, b, c, d;

....

const __uint128_t u = (__uint128_t) a + b;
const unsigned long v = u >> 64;
const __uint128_t w = (__uint128_t) c + d + v;

to do the right thing, namely do the first addition with an ADD, the
carry flag holding the shift value v, and do the third addition with a
single ADC if available (assuming there are no instructions between
which kill the carry flag, in which case a byte would have to be
used).

In IR, this is going to look like:

  %0 = load i64* %a, align 8 ; <i64> [#uses=1]
  %1 = zext i64 %0 to i128 ; <i128> [#uses=1]
  %2 = load i64* %b, align 8 ; <i64> [#uses=1]
  %3 = zext i64 %2 to i128 ; <i128> [#uses=1]
  %4 = add i128 %1, %3 ; <i128> [#uses=1]

  %5 = lshr i128 %4, 64 ; <i128> [#uses=1]
  %6 = trunc i128 %5 to i64 ; <i64> [#uses=1]

  %7 = load i64* %c, align 8 ; <i64> [#uses=1]
  %8 = zext i64 %7 to i128 ; <i128> [#uses=1]
  %9 = load i64* %d, align 8 ; <i64> [#uses=1]
  %10 = zext i64 %9 to i128 ; <i128> [#uses=1]
  %11 = add i128 %8, %10 ; <i128> [#uses=1]

  %12 = zext i64 %6 to i128 ; <i128> [#uses=1]
  %13 = add i128 %11, %12 ; <i128> [#uses=1]
.....

But as I mentioned, some care needs to be taken with loop counters so
that comparing them to their limit does not destroy the carry flag.
And of course the whole optimisation needs to work for the similar
thing involving uint64_t's on 32-bit machines.

It's one hell of an optimisation. But also a radical speedup (33% in a
loop without unrolling, better than 100% with unrolling).

Bill.

I should quickly add that the whole thing should be made to work for
subtraction and borrow too.

I think multiplication is OK. There you do a full i64xi64
multiplication by casting to i128's. It currently does the right thing
for that.

You then add the low i64 to the result with any previous carry,
yielding a carry. You then add the high limb to the result, yielding a
carry.

I think that all works, just by making use of the optimisation we've
been discussing.

Division currently works as expected, i.e. div and mod get
coalesced into a single DIV. So it really is just carry handling that
needs to be fixed, I think.
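
To illustrate the division point, a pair like the following (a minimal
sketch, reusing the ul typedef from earlier) comes out as a single
DIV:

/* Quotient and remainder computed together; the two operations
   are coalesced into one DIV instruction. */
void ul_divmod(ul a, ul b, ul *q, ul *r)
{
   *q = a / b;
   *r = a % b;
}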

Everything else, like shifting i64's by first casting to i128's, no
doubt works as expected. Perhaps there are some code sequences here
or there that could be improved slightly, but otherwise I think,
technically, everything is there to produce efficient code for bignum
stuff.

Bill.

Hello,
I have a question regarding the analysis pass that generates loop info from .ll code. My previous understanding was that there will be just one loop header (in the loop info) for a particular loop. But when I use the isLoopHeader() member function from the loop info class, I get a 'true' return value for two different basic blocks. Note that both basic blocks are loop conditional blocks (the break condition of the loop), but only one starts the loop. Is this the right behavior for the isLoopHeader() member function?

Thanks for the help,
Hisham

Hi Hisham,
Most likely the basic blocks are the headers of two different loops. Try running viewCFG() on the function in question to see if this is the case.
Tom