> I suggest that LLVM needs intrinsics for add/sub with carry, e.g.
>
> declare {T, i1} @llvm.addc.T(T %a, T %b, i1 c)
>
> The current multiprecision clang intrinsics example:
> void foo(unsigned *x, unsigned *y, unsigned *z)
> { unsigned carryin = 0;
> unsigned carryout;
> z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
> carryin = carryout;
> z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
> carryin = carryout;
> z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
> carryin = carryout;
> z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
> }
> uses the LLVM intrinsic "llvm.uadd.with.overflow" and generates
> horrible code that doesn't use the "adc" x86 instruction.
>
> What is the current thinking on improving multiprecision arithmetic?
Why do you think this requires new intrinsics instead of teaching the
optimizer what to do with the existing intrinsics?
In general, it is harder to reason about memory. Also, you are forced to
allocate memory for the carryout even if you are not interested in using it.
While the clang builtins access memory conceptually, similar to every variable access in C, this is removed by mem2reg. The optimizer doesn't have to reason about memory in the example after mem2reg was run.
IMO, as a multiprecision math library maker, the "teaching the
optimizer what to do with the existing intrinsics" approach is much
better as long as it can be made to work. If one is careful, MSVC does
optimize its intrinsics into ADC instructions in a reasonable way, so
I think it is probably doable.
(Below, all math is multiprecision.)
There are actually a few different operations besides pure addition
and subtraction, e.g. `a - (b >> 1)` instead of just `a - b`. Also
consider that we sometimes want "a - b" to be side-channel free and
other times we'd rather "a - b" to be as fast as possible and
optimized for the case where carries are unlikely to propagate (far).
That's already three different operations, just for subtraction.
I feel like there are two things being conflated here:
Does LLVM need new LLVM intrinsics to model this. The add-with-overflow intrinsics seem likely fine for as-is, don’t require memory, but we don’t lower the code to use adc for some reason. That seems like an LLVM bug that should be fixed in the x86 backend. If optimizing the LLVM intrinsics is hard, then maybe we need better ones, but that’s likely about the (somewhat ridiculous) dance we have to do to “sum” the overflow flags rather than anything to do with memory.
Does Clang need new C builtins to avoid needing the carry-in / carry-out being in memory? I don’t have a strong opinion about this. But the example in the original email gets optimized to have no unnecessary memory accesses already: https://godbolt.org/g/GFij97
#1 seems like the most interesting problem to solve. File a bug?
It takes two "llvm.uadd.with.overflow" instances to model the add-with-carry
when there is a carry-in. Look at the IR generated by the example.
I figured that the optimization of this would bedifficult (else it would have
already been done :-)). And would this optimization have to be done for every
architecture?
Don’t make this assumption. There’s lots of opportunities for optimization scattered around. Some of them are left because they’re genuinely difficult, but most either simply haven’t been noticed by anyone, or are known but simply haven’t been done because no one has pushed for them (I’m fairly sure that this particular optimization falls into that category—folks have known the codegen wasn’t great since these intrinsics were added, but no one has really complained about it, so people have worked on higher-impact stuff).
I believe that providing additional intrinsics that would directly produce the ISD::ADDC/ISD::SUBC nodes would provide the additional advantage of being able to directly produce these nodes for code that doesn’t have anything to do with multiprecision addition/subtraction. I am not aware of any way currently available to produce IR that will generate these nodes directly.
The use case is basically for being able to emit instructions that correspond to these nodes. For example on Power, we use these instructions that set and use the Carry bit for comparisons without using compare instructions.
If one looks at the clang IR generated, there are *four* instances of
"llvm.uadd.with.overflow" in the loop. Somehow this has to be optimized to a
*single* machine instruction add-with-carry.
This isn't currently done (the code for x86-64 is horrible). I think
optimization would be much more simple if the four "llvm.uadd.with.overflow"
were replaced with a single "llvm.addc".