[RFC] carry-less multiplication instruction

Carry-less multiplication[1] instructions exist (at least optionally) on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly more.

This proposal is to add an llvm.clmul intrinsic, or, if that is contentious, an llvm.experimental.bitmanip.clmul intrinsic. It takes two integer operands of the same width and returns an integer with twice the width of the operands. (Is there a good reason to make the result the same width as the operands, as all the other operations do even when that doesn’t really fit the mathematical operation, like multiplication or ctpop/ctlz/cttz?)
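To pin down the semantics, here is a minimal reference model in plain C (illustrative only; the function name and the 8-bit width are just for the example): clmul XORs together shifted copies of one operand, selected by the set bits of the other, so no carries propagate between bit positions.

  #include <stdint.h>

  /* Reference model of an 8x8=>16 carry-less multiply: like ordinary
   * multiplication, but the partial products are combined with XOR
   * instead of ADD, so no carries cross between bit positions. */
  static uint16_t clmul8_ref(uint8_t a, uint8_t b)
  {
      uint16_t r = 0;
      for (int i = 0; i < 8; i++) {
          if ((b >> i) & 1)           /* bit i of b selects (a << i) */
              r ^= (uint16_t)a << i;
      }
      return r;
  }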

If the CPU does not have a dedicated clmul instruction, the operation can be lowered to regular multiplication by spacing the bits out with holes so that no carries cross between result bits.

==Where is clmul used?==

While clmul is somewhat specialized, the RISC-V bitmanip draft documents many uses: [2]

The classic applications for clmul are Cyclic Redundancy Check (CRC) [11, 26] and Galois/Counter Mode (GCM), but more applications exist, including the following examples. There are obvious applications in hashing and pseudo random number generation. For example, it has been reported that hashes based on carry-less multiplications can outperform Google’s CityHash [17].

clmul of a number with itself inserts zeroes between each input bit. This can be useful for generating Morton code [23].

clmul of a number with -1 calculates the prefix XOR operation. This can be useful for decoding Gray codes. Another application of XOR prefix sums calculated with clmul is branchless tracking of quoted strings in high-performance parsers. [16]

Carry-less multiply can also be used to implement erasure code efficiently. [14]
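As a concrete illustration of the prefix-XOR identity quoted above (illustrative code, not taken from the manual; it repeats the reference model from the earlier sketch): multiplying by an all-ones operand makes result bit k the XOR of input bits 0 through k.

  #include <stdint.h>

  /* Reference 8x8=>16 carry-less multiply (same as the earlier sketch). */
  static uint16_t clmul8_ref(uint8_t a, uint8_t b)
  {
      uint16_t r = 0;
      for (int i = 0; i < 8; i++)
          if ((b >> i) & 1)
              r ^= (uint16_t)a << i;
      return r;
  }

  /* Prefix XOR via clmul with -1 (all ones): bit k of the result is
   * x[0] ^ x[1] ^ ... ^ x[k].  Only the low half of the product is needed. */
  static uint8_t prefix_xor8(uint8_t x)
  {
      return (uint8_t)clmul8_ref(x, 0xFF);
  }

This is the same identity the quoted-string parser trick relies on: XOR-prefix-summing a bitmask of quote characters yields a mask of the bytes that sit inside quoted regions.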

==clmul lowering without hardware support==
An 8x8=>16 clmul can also be lowered to a single 32x32=>64 multiplication when there is no specialized instruction (likewise 15x15=>30 to a 60x60=>120 multiplication, or, if bitreverse is available, 16x16=>32 to TWO 64x64=>64 multiplications) [3].
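As a rough sketch of the 8x8=>16 case (the helper names and exact bit-twiddling below are mine, not copied from [3], which uses a related masked-multiply construction): spread each 8-bit operand so its bits sit four positions apart, do one ordinary 32x32=>64 multiply, and read the carry-less product out of every fourth bit of the wide result. No column of the schoolbook product can collect more than 8 ones, so each column sum fits in its own nibble and no carry reaches a neighbouring result bit; the 15x15=>30 variant works the same way because 15 ones still fit in a nibble.

  #include <stdint.h>

  /* Put bit i of an 8-bit value at bit position 4*i (bits 0,4,...,28). */
  static uint32_t spread4(uint8_t v)
  {
      uint32_t x = v;
      x = (x | (x << 12)) & 0x000F000Fu;
      x = (x | (x << 6))  & 0x03030303u;
      x = (x | (x << 3))  & 0x11111111u;
      return x;
  }

  /* Gather every fourth bit (positions 0,4,...,60) back into 16 bits. */
  static uint16_t gather4(uint64_t x)
  {
      x &= 0x1111111111111111ull;
      x = (x | (x >> 3))  & 0x0303030303030303ull;
      x = (x | (x >> 6))  & 0x000F000F000F000Full;
      x = (x | (x >> 12)) & 0x000000FF000000FFull;
      x = (x | (x >> 24)) & 0xFFFFull;
      return (uint16_t)x;
  }

  /* 8x8=>16 carry-less multiply built on one 32x32=>64 integer multiply:
   * each column of the product holds at most 8 ones, so it stays inside
   * its own 4-bit hole, and bit 4*k of the wide product is exactly bit k
   * of the carry-less product. */
  static uint16_t clmul8_via_mul(uint8_t a, uint8_t b)
  {
      uint64_t p = (uint64_t)spread4(a) * spread4(b);
      return gather4(p);
  }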

[1] https://en.wikipedia.org/wiki/Carry-less_product
[2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
[3] https://www.bearssl.org/constanttime.html

(First posted to Discord.)

--
Shawn Landden


What benefit would this intrinsic bring to the middle-end IR over its current naive expanded form?

Note that teaching backends to produce it, or even adding it to the backend (as an ISD opcode) and matching it in DAGCombiner, has a much lower barrier of entry; I would suggest starting there.

Roman


What benefit would this intrinsic bring to the middle-end IR over its current naive expanded form?

Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive...

Cheers,
Nicolai

It’d be useful in your proposal to note which of the existing LLVM target-specific intrinsics this generic intrinsic can effectively supersede. (E.g. llvm.x86.pclmulqdq for x86.)

When we are already supporting a given function via target-specific intrinsics for a number of different targets, that seems a pretty good argument for making it available as a more generic, target-independent intrinsic.
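For example (a sketch; the wrapper name is mine, and it assumes the usual <wmmintrin.h> header and -mpclmul), the operation behind llvm.x86.pclmulqdq is what C code reaches today through _mm_clmulepi64_si128, and it is exactly the kind of thing a target-independent llvm.clmul could express directly:

  #include <stdint.h>
  #include <wmmintrin.h>   /* PCLMULQDQ intrinsics; build with -mpclmul */

  /* 64x64=>128 carry-less multiply via the x86-specific intrinsic.
   * The immediate selects which 64-bit lane of each vector is used;
   * 0x00 picks the low lane of both operands. */
  static __m128i clmul64x64(uint64_t a, uint64_t b)
  {
      __m128i va = _mm_set_epi64x(0, (long long)a);
      __m128i vb = _mm_set_epi64x(0, (long long)b);
      return _mm_clmulepi64_si128(va, vb, 0x00);
  }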

Shawn,

Are you able to summarize the different instructions from the various targets? It looks like there are different implementation choices made for each target. For example, X86 takes two v2i64 inputs and picks either an even or odd element from each to multiply to produce a v1i128 result. It looks like RISC-V has instructions to produce either the high half of the result or the low half of the result. Those are the only two I checked.

Will a common intrinsic need custom handling for each target or is there a common version that multiple targets use that we should choose for the intrinsic?

No, the typical expansion (used by the existing LLVM code generator) uses “grade school math” to decompose large multiplications into smaller ones.

Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example. This is all already handled by the existing selection dag infra; my understanding is that it works well in practice.

-Chris

Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example.

That is what I meant, yes.


Roman

“Carry-less multiplication” isn’t a wider-bit-width multiplication; it’s an entirely different operation. (See the Wikipedia page linked in the proposal.)

Are you able to summarize the different instructions from the various targets?

Only the POWER8 instruction is different, as it can do two 64x64=>128 multiplications at the same time, with the results XORed together (Karatsuba-style); if you don’t want that, you have to make sure you are multiplying by zero for one of them. So POWER would require special lowering for a 128x128=>256 multiply. With POWER you get three multiplies, but you have to zero some registers, while on RISC-V you would have four right after each other and then have to XOR the middle two.


Isn’t a “naive” expansion of NxN carryless multiply extremely involved?

Yes, it is. In practice this is sped up with a table (such as in the official GCM spec); however, using a table can introduce key-dependent loads and security problems. The 32x32->64 or 64x64->64 multiplication lowering is generally constant-time and does not have these security problems.


FWIW, this seems like a no-brainer to me (as llvm.experimental initially), assuming that it can be designed in such a way that it would eliminate the need for intrinsics on at least two targets (I think it should be possible to do so, with a small amount of back-end work).

– Steve

(As per IRC discussion)

I understand that the carry-less multiplication algorithm has its uses, since it is implemented as an instruction in many architectures, and that adding it as a general-purpose intrinsic will allow us to drop target-specific intrinsics as a by-product.

What I do *NOT* understand is: what is the actual/main goal/driving factor of adding an LLVM intrinsic for it?

The use that was mentioned is crypto, and I'm personally not really registering anything else. Am I just misreading it? The crypto use case doesn't make sense to me, because as of this moment LLVM "explicitly" has zero constant-time guarantees for LLVM IR instructions/intrinsics.

I feel like it's a really important question. If there isn't interest in crypto/constant-time here, I think it would be best to explicitly state so. If there is, I think it may be good to hear from Chandler (who IIRC is driving some constant-time work for C++/LLVM that is not yet widely public).

Roman

CLMUL is absolutely useful outside of “crypto” contexts that want/require “constant time” operation.

To name just two families of uses, it’s the backbone of many hash/checksum algorithms and error-correcting codes, where the goal is often simply to go as fast as possible, and uArch side-channel resistance is not a concern.

– Steve


+1

See, e.g., https://lemire.me/blog/2015/10/26/crazily-fast-hashing-with-carry-less-multiplications/ -- and also, https://en.wikipedia.org/wiki/CLMUL_instruction_set, "One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2^k) multiplication. Another application is the fast calculation of CRC values, including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush."

-Hal


Note that teaching backends to produce it, or even adding it to the backend (as an ISD opcode) and matching it in DAGCombiner, has a much lower barrier of entry; I would suggest starting there.

It cannot be matched.

FWIW, Hexagon has a pass to recognize polynomial multiplication:
  llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
See "PolynomialMultiplyRecognize"

I think I’d prefer to have the output type match the input type; that makes it more similar to other intrinsics and binary operators. We should be able to zero-extend the inputs if you want something like 64x64->128.

Are you planning to expose this to C through clang? What types would we expose?

~Craig

I didn’t have any specific plans. What happens next depends on whether people find this code useful. It’s kind of complicated because, since the time it was written, the “canonical form” of LLVM IR has changed multiple times, and I added more and more stuff to “recanonicalize” it back into the form the idiom recognition code is used to seeing.

I was thinking about inventing a different way of recognizing the idiom, but I haven’t had time to spend on it. Whatever it is, it should be immune to ongoing changes in instcombine.