[RFC] CRC Recognition in LoopIdiomRecognizer

jofa · January 24, 2024, 2:30pm

Hi all,

I have a patch consisting of a few commits implementing CRC recognition in the Loop Idiom Recognizer, replacing the loop with a lookup table. I would like to establish three things in this thread:

Whether this something that is suitable for LLVM
How it should be reviewed
Volunteers for reviewers

The initial motivation was CoreMark, to which this patch increases performance by over 10% on our 32-bit RISC-V core. However, the application is broad enough to apply to any project implementing a bitwise CRC loop. A similar project was presented at the gnu couldron, to which claimed there are more than 200 bitwise implementations in Fedora packages. I have not verified this, but have collected a handful of crc implementations from the web (including the linux kernel), and adapted these into the unit tests.

The patch currently consists of seven commits, each containing a distinct piece of functionality. Would this be a separate review for each one, or one large PR for them all?

Current limitations:

Only works on byte loops
CRC size can be any, but the data is limited to one byte. i.e. a loop with iteration count 8.
Only works on single-block loops
most CRC loops would have been flattened to one block with select instructions this far into the pipeline.

Both limitations were in effort to reduce complexity, especially for a first patch. The code can be fairly easily extended to overcome these limitations.

And lastly, a little bit about implementation details in a crude five step process!

Check if the loop looks like CRC and extract some useful information
Execute one iteration of the instruction of the loop to see what happens to our output value
Ensure the output value is predicated on the value of the M/LSB of our input
Construct an expected output value of one iteration of CRC using the extracted information from step one and compare
Construct a lookup table and replace the output value with a lookup

Thanks all!
Joe

kparzysz · January 24, 2024, 3:48pm

It’s better to recognize polynomial multiplication, since it’s more general.

To share our experiences with that—
Hexagon has some old code trying to handle it (motivated by coremark as well). One issue we were having was that matching the code was hard since the actual IR was changing (due to changes in instcombine). To deal with that we added a local simplifier that was massaging the code to look like what we expected.

In hindsight, tracking individual bits through the loop seems like a much better approach, since it would be immune to instcombine. Plus tracking based on LLVM IR would be target-independent. On Hexagon we had a MI-level bit tracker, and I have wanted to write one for LLVM IR, but I never did.

jofa · January 24, 2024, 4:36pm

I’m only semi-confident on the maths here, and I haven’t read through the whole of PolynomialMultiplyRecognize, but patterns I keep seeing in that pass that it’s attemping to match, e.g

  // for i = 0..31
  //   R = phi (P, R')
  //   if (R & 1)
  //     R' = (R >> 1) ^ Q
  //   else
  //     R' = R >> 1

match polynomial division over GF(2^n). I understand polynomial division can be part of polynomial multiplication, i.e. multiplication in a finite field is the multiplication of your two polynomials modulo your reducing polynomial (or long division using this as divisior).

So wouldn’t it make more sense to recognise the polynomial division rather than multiplication? I suppose that’s what this patch is kind of doing.

kparzysz · January 24, 2024, 5:22pm

Some architectures have polynomial multiplication instructions (Hexagon, for example). I’m not aware of any specific instructions for division though. If they exist, recognizing both would be nice.

CRC is a remainder P(x) mod Q(x) for some input polynomial P, and a fixed polynomial Q. If Q is irreducible, then the set of all such remainders forms a field, and so division will be equivalent to multiplication by (likely other) element of that field. If the polynomial Q is compile-time constant, it may be possible to extract the bits of the inverse by looking at the loop. IIRC, this is what the Hexagon code is doing[1].

[1] And the reason for that is the ISA—the presence of pmpy instruction (but not pdiv).

JeffreyALaw · January 24, 2024, 6:53pm

Note that Misha has already adjusted the old Hexagon code to work for RISC-V and generate clmul instructions. The general feedback when we last discussed this was the code should be generalized rather than being implemented by each port (quite reasonable). If someone wanted to start that task it’d be greatly appreciated (I’ve got Misha doing stuff that is higher priority for Ventana right now).

I’ve also got an engineer working on a GCC implementation which might provide some inspiration if someone wanted to look at it. That GCC implementation also has numerous tests that might be helpful.

Note the LLVM implementation is geared towards generating clmul. The GCC implementation has two paths once it hits target expansion. It can generate a table based CRC (including constructing larger CRCs from a 256 entry table) as well as clmul based sequences for RISC-V.

I don’t usually participate in LLVM reviews/discussions, so I don’t have pointers handy to Misha’s work. But don’t hesitate to reach out and I can get a copy of that work easily. jlaw@ventanamicro.com.

jofa · January 25, 2024, 9:34am

I did also extend this patch to generate a crc intrinsic if the target supported it, instead of the table lookup. In the case of RISC-V this would be a clmul expansion in the backend. I haven’t finished that work, but also it was performing worse than the table based in my experiments. If you already have clmul expansion working for RISC-V, then this patch could be extended using your expansion!

The general feedback when we last discussed this was the code should be generalized rather than being implemented by each port (quite reasonable). If someone wanted to start that task it’d be greatly appreciated

Could this patch not be described as such? This should work for any target.

mgudim · January 29, 2024, 8:23pm

Hi all,

As Krzysztof mentioned, the CRC recognition is pretty fragile. I saw the same in my experience too. I would like to suggest this idea on how to improve this.

Let us think about polynomials of fixed size as vectors of bits. The bitwise addition is the xor operation. Then, for a fixed divisor D(x) the operation r: P(x) -> P(x) mod Q(x) is linear. That means we can recognize crc (and maybe other linear things) by the following method.

Let us say we have a function of one input f(x).

(1) Prove that f is linear, i.e. satisfies this identity f(x ^ y) = f(x) ^ f(y)
(2) Compute values of f on basis (f(1), f(1 << 1), f(1 << 2), ...)
(3) Compare the computed values to known CRC(1), CRC(1 << 1), ... CRC(1 << 2), .... If all the values are equal, then f computes a crc. Note that we would either have to recoginze D(x) either from scanning code or just try some popular values.

@kparzysz @jofa What do you think?

jofa · January 30, 2024, 11:07am

Thanks for the reply Mikhail. When you say ‘CRC recognition is pretty fragile’, are you referring to the Hexagon recognition, or my patch? Because I don’t quite understand how the method in the patch is any more fragile than the method you outlined.

Also, I’m not clear on some of the steps. How would you calculate f(n)? We would surely have to look at each instruction and apply them.

kparzysz · January 30, 2024, 2:06pm

After thinking about it for a while—I think if @jofa’s code detects CRC then I think it’s worth checking it in.

Extending or generalizing would be nice, but that will take time, and we can get this CRC right now.

@mgudim: This sounds like an interesting approach. Do you think there are other examples of idiomatic functions that could be detected this way?

mgudim · January 31, 2024, 6:13am

@jofa

When you say ‘CRC recognition is pretty fragile’, are you referring to the Hexagon recognition, or my patch?

I am referring to any method that scans the code and looks for particular patterns in it.

Because I don’t quite understand how the method in the patch is any more fragile than the method you outlined.

The method I outlined is based on proving some property of the code (f(x ^ y) = f(x) ^ f(y)) which is more stable to different variations in IR. With my method you don’t need to know any possible variations of CRC, you just need to know what irreducible polynomials are used and what the crc values should be on 32 numbers:
1 << 0, 1 << 1, ... , 1 << 31. This can be precomputed offline.

Also, I’m not clear on some of the steps. How would you calculate f(n)?

Let us say n = 11. Then to calculate f(n) knowing that f and if you know values f(1 << 0) = 2, f(1 << 1) = 3 and f(1 << 3) = 4:

f(11) = f((1 << 0) ^ (1 << 1) ^ (1 << 3)) = f(1 << 0) ^ f(1 << 1) ^ f(1 << 3) = 2 ^ 3 ^ 4 = ...

Actually we don’t need to “calculate f(n)”, we need to prove that for all n we have that f(n) = crc(n). If f is linear you just need to prove f(1 << 0) = crc(1 << 0), f(1 << 1) = crc(1 << 1), ...

@kparzysz

and we can get this CRC right now.

Yes, I agree. Maybe later we can extend it.

Do you think there are other examples of idiomatic functions that could be detected this way?

I honestly don’t know. Perhaps once we can prove that some code is linear we can run some huge code base through the recognizer and see what it finds.

jofa · January 31, 2024, 2:30pm

Thanks Mikhail, I appreciate the explanation! This makes sense to me. Thinking about how this would look, your outlined method won’t be too difficult to implement given the helper functions in my patch. i.e. we could use the output of the symbolic execution to calculate f(1 << 0), f (1 << 1) …

Nonetheless, I’m happy to hear you both think this patch is something that can go in. Can anybody help with regards to how it should be reviewed and volunteers for reviewers? Additionally, anybody else who should have visibility of this?

mgudim · February 1, 2024, 8:25am

@jofa
I looked very briefly and added some comments mostly asking to add explanations.

@jofa @kparzysz
It would be great this could take advantage of clmul instructions if the target has them. My suggestion is that in LoopRecognition part we emit polynomial division intrinsics. Later, after all inlining happens, these intrinsics may combine to something bigger. I think in coremark 8-bit crc’s combine to 16-bit and to 32-bit. During CodeGenPrepare if the target has clmul instructions, we can use Barret’s formula to simpify that using polynomial multiplication. If target doesn’t have polynomial multiplicatoin, we can emit a table.

jofa · February 1, 2024, 9:21am

@mgudim Yes, that’s exactly what the plan was. I propose some sort of call to TLI to check if the backend has custom lowering for crc. If it does, then emit an intrinsic. And as you said, we can combine intrinsic calls together later in the pipeline. If the target does not support custom lowering, I think it’s best to leave the table based expansion in LoopIdiomRecognize, as we already have the CRC information.

Topic		Replies	Views
Rather poor code optimisation of current clang/LLVM targeting Intel x86 (both -64 and -32) LLVM Dev List Archives	8	110	December 3, 2018
[RFC] Strlen loop idiom recognition/folding LLVM Dev List Archives	0	153	July 8, 2020
Improve the performance of JamCRC LLVM Dev List Archives	3	66	April 13, 2017
[AArch64] Target-specific loop idiom recognition IR & Optimizations	11	960	August 18, 2023
RFC: Generic IR reductions LLVM Dev List Archives	18	126	February 1, 2017

[RFC] CRC Recognition in LoopIdiomRecognizer

Related topics