multiprecision add/sub

I suggest that LLVM needs intrinsics for add/sub with carry, e.g.

  declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)

The current multiprecision clang intrinsics example:
  void foo(unsigned *x, unsigned *y, unsigned *z)
  { unsigned carryin = 0;
    unsigned carryout;
    z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
    carryin = carryout;
    z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
    carryin = carryout;
    z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
    carryin = carryout;
    z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
  }
uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
horrible code that doesn't use the "adc" x86 instruction.

What is the current thinking on improving multiprecision arithmetic?

Why do you think this requires new intrinsics instead of teaching the optimizer what to do with the existing intrinsics?

– Steve

> I suggest that LLVM needs intrinsics for add/sub with carry, e.g.
>
> declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)
>
> The current multiprecision clang intrinsics example:
> void foo(unsigned *x, unsigned *y, unsigned *z)
> { unsigned carryin = 0;
> unsigned carryout;
> z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
> carryin = carryout;
> z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
> carryin = carryout;
> z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
> carryin = carryout;
> z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
> }
> uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
> horrible code that doesn't use the "adc" x86 instruction.
>
> What is the current thinking on improving multiprecision arithmetic?

> Why do you think this requires new intrinsics instead of teaching the
> optimizer what to do with the existing intrinsics?

In general, it is harder to reason about memory. Also, you are forced to
allocate memory for the carryout even if you are not interested in using it.

While the clang builtins conceptually access memory, just like every variable access in C, mem2reg removes those accesses. After mem2reg has run, the optimizer doesn't have to reason about memory in this example.

IMO, as a multiprecision math library maker, the "teaching the
optimizer what to do with the existing intrinsics" approach is much
better as long as it can be made to work. If one is careful, MSVC does
optimize its intrinsics into ADC instructions in a reasonable way, so
I think it is probably doable.

(Below, all math is multiprecision.)

There are actually a few different operations besides pure addition
and subtraction, e.g. `a - (b >> 1)` instead of just `a - b`. Also
consider that we sometimes want "a - b" to be side-channel free, and
other times we'd rather "a - b" be as fast as possible, optimized for
the case where carries are unlikely to propagate (far). That's already
three different operations, just for subtraction.

Cheers,
Brian

I feel like there are two things being conflated here:

  1. Does LLVM need new LLVM intrinsics to model this? The add-with-overflow intrinsics seem likely fine as-is and don’t require memory, but for some reason we don’t lower the code to use adc. That seems like an LLVM bug that should be fixed in the x86 backend. If optimizing the LLVM intrinsics is hard, then maybe we need better ones, but that’s likely about the (somewhat ridiculous) dance we have to do to “sum” the overflow flags rather than anything to do with memory.

  2. Does Clang need new C builtins so the carry-in / carry-out don’t have to live in memory? I don’t have a strong opinion about this. But the example in the original email already gets optimized to have no unnecessary memory accesses: https://godbolt.org/g/GFij97

#1 seems like the most interesting problem to solve. File a bug?

It takes two "llvm.uadd.with.overflow" instances to model an add-with-carry
when there is a carry-in. Look at the IR generated for the example.
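Per limb, the pattern looks roughly like this (a hand-written sketch of the shape clang emits, not verbatim compiler output; %cin stands for the previous limb's carry, zero-extended to i32):

```llvm
; add a + b + cin: two overflow checks per limb, and the two carries
; must be combined ("or" works because they can never both be set)
%t  = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
%s1 = extractvalue { i32, i1 } %t, 0
%c1 = extractvalue { i32, i1 } %t, 1
%u  = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %s1, i32 %cin)
%s  = extractvalue { i32, i1 } %u, 0
%c2 = extractvalue { i32, i1 } %u, 1
%cout = or i1 %c1, %c2
```

The backend would have to recognize this whole cluster to fold it into one adc.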

I figured that this optimization would be difficult (else it would have
already been done :-)). And would the optimization have to be done for every
architecture?

I was suggesting #1. I don't care about clang.

Am I supposed to file a bug against every backend?

Don’t make this assumption. There’s lots of opportunities for optimization scattered around. Some of them are left because they’re genuinely difficult, but most either simply haven’t been noticed by anyone, or are known but simply haven’t been done because no one has pushed for them (I’m fairly sure that this particular optimization falls into that category—folks have known the codegen wasn’t great since these intrinsics were added, but no one has really complained about it, so people have worked on higher-impact stuff).

– Steve

But I am still curious: is the optimization architecture-independent, or
must it be done again for each backend?

bagel

I believe that providing additional intrinsics that would directly produce the ISD::ADDC/ISD::SUBC nodes would provide the additional advantage of being able to directly produce these nodes for code that doesn’t have anything to do with multiprecision addition/subtraction. I am not aware of any way currently available to produce IR that will generate these nodes directly.

But what is the use case? What IR pattern would it replace?

Sorry for the delay in responding.

The use case is basically being able to emit the instructions that correspond to these nodes. For example, on Power we use instructions that set and use the Carry bit to perform comparisons without compare instructions.

One example where we would be able to make use of these would be in https://reviews.llvm.org/D28637.

My use case is multi-precision arithmetic which is key to efficient
implementation of public-key encryption. Take the follow routine:

void addm(unsigned *x, unsigned *y, unsigned *z, unsigned n)
{ unsigned carryin;
  unsigned carryout = 0;
  unsigned i;

  for (i = 0; i < n; i++)
  { carryin = carryout;
    z[i] = __builtin_addc(x[i], y[i], carryin, &carryout);
  }
}

If one looks at the clang IR generated, there are *four* instances of
"llvm.uadd.with.overflow" in the loop. Somehow this has to be optimized into a
*single* add-with-carry machine instruction.

This isn't currently done (the code for x86-64 is horrible). I think the
optimization would be much simpler if the four "llvm.uadd.with.overflow"
calls were replaced with a single "llvm.addc".