[RFC] carry-less multiplication instruction

Carry-less multiplication[1] instructions exist (at least optionally) on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly more.

This proposal is to add an llvm.clmul intrinsic, or, if that is contentious, an llvm.experimental.bitmanip.clmul intrinsic. It takes two integer operands of the same width and returns an integer with twice the width of the operands. (Is there a good reason to make the result the same width as the operands, as all the other operations do even when that doesn’t really fit the mathematical operation, like multiplication or ctpop/ctlz/cttz?)
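To pin down the semantics, here is a minimal reference model in plain C (illustrative only; the function name and the 8-bit width are just for the example): clmul XORs together shifted copies of one operand, selected by the set bits of the other, so no carries propagate between bit positions.

  #include <stdint.h>

  /* Reference model of an 8x8=>16 carry-less multiply: like ordinary
   * multiplication, but the partial products are combined with XOR
   * instead of ADD, so no carries cross between bit positions. */
  static uint16_t clmul8_ref(uint8_t a, uint8_t b)
  {
      uint16_t r = 0;
      for (int i = 0; i < 8; i++) {
          if ((b >> i) & 1)           /* bit i of b selects (a << i) */
              r ^= (uint16_t)a << i;
      }
      return r;
  }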

If the CPU does not have a dedicated clmul instruction, the operation can be lowered to regular multiplication by spacing the bits out with holes so that no carries cross between result bits.

==Where is clmul used?==

While clmul is somewhat specialized, the RISC-V bitmanip draft documents many uses: [2]

The classic applications for clmul are Cyclic Redundancy Check (CRC) [11, 26] and Galois/Counter Mode (GCM), but more applications exist, including the following examples. There are obvious applications in hashing and pseudo random number generation. For example, it has been reported that hashes based on carry-less multiplications can outperform Google’s CityHash [17].

clmul of a number with itself inserts zeroes between each input bit. This can be useful for generating Morton code [23].

clmul of a number with -1 calculates the prefix XOR operation. This can be useful for decoding Gray codes. Another application of XOR prefix sums calculated with clmul is branchless tracking of quoted strings in high-performance parsers. [16]

Carry-less multiply can also be used to implement erasure code efficiently. [14]
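As a concrete illustration of the prefix-XOR identity quoted above (illustrative code, not taken from the manual; it repeats the reference model from the earlier sketch): multiplying by an all-ones operand makes result bit k the XOR of input bits 0 through k.

  #include <stdint.h>

  /* Reference 8x8=>16 carry-less multiply (same as the earlier sketch). */
  static uint16_t clmul8_ref(uint8_t a, uint8_t b)
  {
      uint16_t r = 0;
      for (int i = 0; i < 8; i++)
          if ((b >> i) & 1)
              r ^= (uint16_t)a << i;
      return r;
  }

  /* Prefix XOR via clmul with -1 (all ones): bit k of the result is
   * x[0] ^ x[1] ^ ... ^ x[k].  Only the low half of the product is needed. */
  static uint8_t prefix_xor8(uint8_t x)
  {
      return (uint8_t)clmul8_ref(x, 0xFF);
  }

This is the same identity the quoted-string parser trick relies on: XOR-prefix-summing a bitmask of quote characters yields a mask of the bytes that sit inside quoted regions.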

==clmul lowering without hardware support==
An 8x8=>16 clmul can also be lowered to a single 32x32=>64 multiplication when there is no specialized instruction (likewise 15x15=>30 to a 60x60=>120 multiplication, or, if bitreverse is available, 16x16=>32 to TWO 64x64=>64 multiplications) [3].
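As a rough sketch of the 8x8=>16 case (the helper names and exact bit-twiddling below are mine, not copied from [3], which uses a related masked-multiply construction): spread each 8-bit operand so its bits sit four positions apart, do one ordinary 32x32=>64 multiply, and read the carry-less product out of every fourth bit of the wide result. No column of the schoolbook product can collect more than 8 ones, so each column sum fits in its own nibble and no carry reaches a neighbouring result bit; the 15x15=>30 variant works the same way because 15 ones still fit in a nibble.

  #include <stdint.h>

  /* Put bit i of an 8-bit value at bit position 4*i (bits 0,4,...,28). */
  static uint32_t spread4(uint8_t v)
  {
      uint32_t x = v;
      x = (x | (x << 12)) & 0x000F000Fu;
      x = (x | (x << 6))  & 0x03030303u;
      x = (x | (x << 3))  & 0x11111111u;
      return x;
  }

  /* Gather every fourth bit (positions 0,4,...,60) back into 16 bits. */
  static uint16_t gather4(uint64_t x)
  {
      x &= 0x1111111111111111ull;
      x = (x | (x >> 3))  & 0x0303030303030303ull;
      x = (x | (x >> 6))  & 0x000F000F000F000Full;
      x = (x | (x >> 12)) & 0x000000FF000000FFull;
      x = (x | (x >> 24)) & 0xFFFFull;
      return (uint16_t)x;
  }

  /* 8x8=>16 carry-less multiply built on one 32x32=>64 integer multiply:
   * each column of the product holds at most 8 ones, so it stays inside
   * its own 4-bit hole, and bit 4*k of the wide product is exactly bit k
   * of the carry-less product. */
  static uint16_t clmul8_via_mul(uint8_t a, uint8_t b)
  {
      uint64_t p = (uint64_t)spread4(a) * spread4(b);
      return gather4(p);
  }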

[1] https://en.wikipedia.org/wiki/Carry-less_product
[2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
[3] https://www.bearssl.org/constanttime.html

(First posted to Discord.)

--
Shawn Landden


What benefit would this intrinsic bring to the middle-end IR over its current naive expanded form?

Note that teaching backends to produce it, or even adding it to the backend (as an ISD opcode) and matching it in DAGCombiner, has a much lower barrier of entry; I would suggest starting there.

Roman


What benefit would this intrinsic bring to the middle-end IR over its current naive expanded form?

Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive...

Cheers,
Nicolai

It’d be useful in your proposal to note which of the existing LLVM target-specific intrinsics this generic intrinsic can effectively supersede. (E.g. llvm.x86.pclmulqdq for x86.)

When we are already supporting a given function via target-specific intrinsics for a number of different targets, that seems a pretty good argument for making it available as a more generic, target-independent intrinsic.
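For example (a sketch; the wrapper name is mine, and it assumes the usual <wmmintrin.h> header and -mpclmul), the operation behind llvm.x86.pclmulqdq is what C code reaches today through _mm_clmulepi64_si128, and it is exactly the kind of thing a target-independent llvm.clmul could express directly:

  #include <stdint.h>
  #include <wmmintrin.h>   /* PCLMULQDQ intrinsics; build with -mpclmul */

  /* 64x64=>128 carry-less multiply via the x86-specific intrinsic.
   * The immediate selects which 64-bit lane of each vector is used;
   * 0x00 picks the low lane of both operands. */
  static __m128i clmul64x64(uint64_t a, uint64_t b)
  {
      __m128i va = _mm_set_epi64x(0, (long long)a);
      __m128i vb = _mm_set_epi64x(0, (long long)b);
      return _mm_clmulepi64_si128(va, vb, 0x00);
  }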

Shawn,

Are you able to summarize the different instructions from the various targets? It looks like there are different implementation choices made for each target. For example, X86 takes two v2i64 inputs and picks either an even or odd element from each to multiply to produce a v1i128 result. It looks like RISC-V has instructions to produce either the high half of the result or the low half of the result. Those are the only two I checked.

Will a common intrinsic need custom handling for each target or is there a common version that multiple targets use that we should choose for the intrinsic?

No, the typical expansion (used by the existing LLVM code generator) uses “grade school math” to decompose large multiplications into smaller ones.

Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example. This is all already handled by the existing selection dag infra; my understanding is that it works well in practice.

-Chris

Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example.

That is what I meant, yes.


Roman

“Carry-less multiplication” isn’t a wider-bit-width multiplication; it’s an entirely different operation. (See the Wikipedia page linked in the proposal.)

Are you able to summarize the different instructions from the various targets?

Only the POWER8 instruction is different, as it can do two 64x64=>128 multiplications at the same time, with the results XORed together (Karatsuba-style); if you don’t want that, you have to make sure you are multiplying by zero for one of them. So POWER would require special lowering for a 128x128=>256 multiply. With POWER you get three multiplies, but you have to zero some registers, while on RISC-V you would have four right after each other and then have to XOR the middle two.


Isn’t a “naive” expansion of NxN carryless multiply extremely involved?

Yes, it is. In practice this is sped up with a table (such as in the official GCM spec); however, using a table can introduce key-dependent loads and security problems. The 32x32->64 or 64x64->64 multiplication lowering is generally constant-time and does not have these security problems.


FWIW, this seems like a no-brainer to me (as llvm.experimental initially), assuming that it can be designed in such a way that it would eliminate the need for intrinsics on at least two targets (I think it should be possible to do so, with a small amount of back-end work).

– Steve

(As per IRC discussion)

I understand that the carry-less multiplication algorithm has its uses, since it is implemented as an instruction in many architectures, and that adding it as a general-purpose intrinsic will allow us to drop target-specific intrinsics as a by-product.

What I do *NOT* understand is: what is the actual/main goal/driving factor of adding an LLVM intrinsic for it?

The use that was mentioned is crypto, and I'm personally not really registering anything else. Am I just misreading it? The crypto use case doesn't make sense to me, because as of this moment LLVM "explicitly" has zero constant-time guarantees for LLVM IR instructions/intrinsics.

I feel like it's a really important question. If there isn't interest in crypto/constant-time here, I think it would be best to explicitly state so. If there is, I think it may be good to hear from Chandler (who IIRC is driving some constant-time work for C++/LLVM that is not yet widely public).

Roman

CLMUL is absolutely useful outside of “crypto” contexts that want/require “constant time” operation.

To name just two families of uses, it’s the backbone of many hash/checksum algorithms and error-correcting codes, where the goal is often simply to go as fast as possible, and uArch side-channel resistance is not a concern.

– Steve


+1

See, e.g., https://lemire.me/blog/2015/10/26/crazily-fast-hashing-with-carry-less-multiplications/ -- and also, https://en.wikipedia.org/wiki/CLMUL_instruction_set, "One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2^k) multiplication. Another application is the fast calculation of CRC values, including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush."

-Hal


Note that teaching backends to produce it, or even adding it to the backend (as an ISD opcode) and matching it in DAGCombiner, has a much lower barrier of entry; I would suggest starting there.

It cannot be matched.

FWIW, Hexagon has a pass to recognize polynomial multiplication:
  llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
See "PolynomialMultiplyRecognize"

I think I’d prefer to have the output type match the input type; that makes it more similar to other intrinsics and binary operators. We should be able to zero-extend the inputs if you want something like 64x64->128.

Are you planning to expose this to C through clang? What types would we expose?

~Craig

I didn’t have any specific plans. What happens next depends on whether people find this code useful. It’s kind of complicated because, since the time it was written, the “canonical form” of LLVM IR has changed multiple times, and I added more and more stuff to “recanonicalize” it back into the form the idiom recognition code is used to seeing.

I was thinking about inventing a different way of recognizing the idiom, but I haven’t had time to spend on it. Whatever it is, it should be immune to ongoing changes in instcombine.