multiprecision add/sub

I suggest that LLVM needs intrinsics for add/sub with carry, e.g.

  declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)

The current multiprecision clang intrinsics example:
  void foo(unsigned *x, unsigned *y, unsigned *z)
  { unsigned carryin = 0;
    unsigned carryout;
    z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
    carryin = carryout;
    z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
    carryin = carryout;
    z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
    carryin = carryout;
    z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
  }
uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
horrible code that doesn't use the "adc" x86 instruction.

What is the current thinking on improving multiprecision arithmetic?

Why do you think this requires new intrinsics instead of teaching the optimizer what to do with the existing intrinsics?

– Steve

> I suggest that LLVM needs intrinsics for add/sub with carry, e.g.
>
> declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)
>
> The current multiprecision clang intrinsics example:
> void foo(unsigned *x, unsigned *y, unsigned *z)
> { unsigned carryin = 0;
> unsigned carryout;
> z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
> carryin = carryout;
> z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
> carryin = carryout;
> z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
> carryin = carryout;
> z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
> }
> uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
> horrible code that doesn't use the "adc" x86 instruction.
>
> What is the current thinking on improving multiprecision arithmetic?

> Why do you think this requires new intrinsics instead of teaching the
> optimizer what to do with the existing intrinsics?

In general, it is harder to reason about memory. Also, you are forced to
allocate memory for the carryout even if you are not interested in using it.

While the clang builtins conceptually access memory, just like every variable access in C, mem2reg removes those accesses. After mem2reg has run, the optimizer doesn't have to reason about memory in this example.

IMO, as a multiprecision math library maker, the "teaching the
optimizer what to do with the existing intrinsics" approach is much
better as long as it can be made to work. If one is careful, MSVC does
optimize its intrinsics into ADC instructions in a reasonable way, so
I think it is probably doable.

(Below, all math is multiprecision.)

There are actually a few different operations besides pure addition
and subtraction, e.g. `a - (b >> 1)` instead of just `a - b`. Also
consider that we sometimes want "a - b" to be side-channel free, and
other times we'd rather "a - b" be as fast as possible, optimized for
the case where carries are unlikely to propagate (far). That's already
three different operations, just for subtraction.

Cheers,
Brian

I feel like there are two things being conflated here:

  1. Does LLVM need new LLVM intrinsics to model this? The add-with-overflow intrinsics seem likely fine as-is and don’t require memory, but for some reason we don’t lower the code to use adc. That seems like an LLVM bug that should be fixed in the x86 backend. If optimizing the LLVM intrinsics is hard, then maybe we need better ones, but that’s likely about the (somewhat ridiculous) dance we have to do to “sum” the overflow flags rather than anything to do with memory.

  2. Does Clang need new C builtins so the carry-in / carry-out don’t have to live in memory? I don’t have a strong opinion about this. But the example in the original email already gets optimized to have no unnecessary memory accesses: https://godbolt.org/g/GFij97

#1 seems like the most interesting problem to solve. File a bug?

It takes two "llvm.uadd.with.overflow" instances to model an add-with-carry
when there is a carry-in. Look at the IR generated for the example.
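Per limb, the pattern looks roughly like this (a hand-written sketch of the shape clang emits, not verbatim compiler output; %cin stands for the previous limb's carry, zero-extended to i32):

```llvm
; add a + b + cin: two overflow checks per limb, and the two carries
; must be combined ("or" works because they can never both be set)
%t  = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
%s1 = extractvalue { i32, i1 } %t, 0
%c1 = extractvalue { i32, i1 } %t, 1
%u  = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %s1, i32 %cin)
%s  = extractvalue { i32, i1 } %u, 0
%c2 = extractvalue { i32, i1 } %u, 1
%cout = or i1 %c1, %c2
```

The backend would have to recognize this whole cluster to fold it into one adc.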

I figured that this optimization would be difficult (else it would have
already been done :-)). And would the optimization have to be done for every
architecture?

I was suggesting #1. I don't care about clang.

Am I supposed to file a bug against every backend?

Don’t make this assumption. There’s lots of opportunities for optimization scattered around. Some of them are left because they’re genuinely difficult, but most either simply haven’t been noticed by anyone, or are known but simply haven’t been done because no one has pushed for them (I’m fairly sure that this particular optimization falls into that category—folks have known the codegen wasn’t great since these intrinsics were added, but no one has really complained about it, so people have worked on higher-impact stuff).

– Steve

But I am still curious: is the optimization architecture-independent, or
must it be done again for each backend?

bagel

I believe that providing additional intrinsics that would directly produce the ISD::ADDC/ISD::SUBC nodes would provide the additional advantage of being able to directly produce these nodes for code that doesn’t have anything to do with multiprecision addition/subtraction. I am not aware of any way currently available to produce IR that will generate these nodes directly.

But what is the use case? What IR pattern would it replace?

Sorry for the delay in responding.

The use case is basically being able to emit the instructions that correspond to these nodes. For example, on Power we use instructions that set and use the Carry bit to perform comparisons without compare instructions.

One example where we would be able to make use of these would be in https://reviews.llvm.org/D28637.

My use case is multi-precision arithmetic which is key to efficient
implementation of public-key encryption. Take the follow routine:

void addm(unsigned *x, unsigned *y, unsigned *z, unsigned n)
{ unsigned carryin;
  unsigned carryout = 0;
  unsigned i;

  for (i = 0; i < n; i++)
  { carryin = carryout;
    z[i] = __builtin_addc(x[i], y[i], carryin, &carryout);
  }
}

If one looks at the clang IR generated, there are *four* instances of
"llvm.uadd.with.overflow" in the loop. Somehow this has to be optimized into a
*single* add-with-carry machine instruction.

This isn't currently done (the code for x86-64 is horrible). I think the
optimization would be much simpler if the four "llvm.uadd.with.overflow"
calls were replaced with a single "llvm.addc".