[RFC] CRC Recognition in LoopIdiomRecognizer

Hi all,

I have a patch consisting of a few commits implementing CRC recognition in the Loop Idiom Recognizer, replacing the loop with a lookup table. I would like to establish three things in this thread:

  1. Whether this something that is suitable for LLVM
  2. How it should be reviewed
  3. Volunteers for reviewers

The initial motivation was CoreMark, to which this patch increases performance by over 10% on our 32-bit RISC-V core. However, the application is broad enough to apply to any project implementing a bitwise CRC loop. A similar project was presented at the gnu couldron, to which claimed there are more than 200 bitwise implementations in Fedora packages. I have not verified this, but have collected a handful of crc implementations from the web (including the linux kernel), and adapted these into the unit tests.

The patch currently consists of seven commits, each containing a distinct piece of functionality. Would this be a separate review for each one, or one large PR for them all?

Current limitations:

  • Only works on byte loops
    CRC size can be any, but the data is limited to one byte. i.e. a loop with iteration count 8.
  • Only works on single-block loops
    most CRC loops would have been flattened to one block with select instructions this far into the pipeline.

Both limitations were in effort to reduce complexity, especially for a first patch. The code can be fairly easily extended to overcome these limitations.

And lastly, a little bit about implementation details in a crude five step process!

  1. Check if the loop looks like CRC and extract some useful information
  2. Execute one iteration of the instruction of the loop to see what happens to our output value
  3. Ensure the output value is predicated on the value of the M/LSB of our input
  4. Construct an expected output value of one iteration of CRC using the extracted information from step one and compare
  5. Construct a lookup table and replace the output value with a lookup

Thanks all!
Joe

1 Like

Itā€™s better to recognize polynomial multiplication, since itā€™s more general.

To share our experiences with thatā€”
Hexagon has some old code trying to handle it (motivated by coremark as well). One issue we were having was that matching the code was hard since the actual IR was changing (due to changes in instcombine). To deal with that we added a local simplifier that was massaging the code to look like what we expected.

In hindsight, tracking individual bits through the loop seems like a much better approach, since it would be immune to instcombine. Plus tracking based on LLVM IR would be target-independent. On Hexagon we had a MI-level bit tracker, and I have wanted to write one for LLVM IR, but I never did.

Iā€™m only semi-confident on the maths here, and I havenā€™t read through the whole of PolynomialMultiplyRecognize, but patterns I keep seeing in that pass that itā€™s attemping to match, e.g

  // for i = 0..31
  //   R = phi (P, R')
  //   if (R & 1)
  //     R' = (R >> 1) ^ Q
  //   else
  //     R' = R >> 1

match polynomial division over GF(2^n). I understand polynomial division can be part of polynomial multiplication, i.e. multiplication in a finite field is the multiplication of your two polynomials modulo your reducing polynomial (or long division using this as divisior).

So wouldnā€™t it make more sense to recognise the polynomial division rather than multiplication? I suppose thatā€™s what this patch is kind of doing.

Some architectures have polynomial multiplication instructions (Hexagon, for example). Iā€™m not aware of any specific instructions for division though. If they exist, recognizing both would be nice.

CRC is a remainder P(x) mod Q(x) for some input polynomial P, and a fixed polynomial Q. If Q is irreducible, then the set of all such remainders forms a field, and so division will be equivalent to multiplication by (likely other) element of that field. If the polynomial Q is compile-time constant, it may be possible to extract the bits of the inverse by looking at the loop. IIRC, this is what the Hexagon code is doing[1].

[1] And the reason for that is the ISAā€”the presence of pmpy instruction (but not pdiv).

Note that Misha has already adjusted the old Hexagon code to work for RISC-V and generate clmul instructions. The general feedback when we last discussed this was the code should be generalized rather than being implemented by each port (quite reasonable). If someone wanted to start that task itā€™d be greatly appreciated (Iā€™ve got Misha doing stuff that is higher priority for Ventana right now).

Iā€™ve also got an engineer working on a GCC implementation which might provide some inspiration if someone wanted to look at it. That GCC implementation also has numerous tests that might be helpful.

Note the LLVM implementation is geared towards generating clmul. The GCC implementation has two paths once it hits target expansion. It can generate a table based CRC (including constructing larger CRCs from a 256 entry table) as well as clmul based sequences for RISC-V.

I donā€™t usually participate in LLVM reviews/discussions, so I donā€™t have pointers handy to Mishaā€™s work. But donā€™t hesitate to reach out and I can get a copy of that work easily. jlaw@ventanamicro.com.

I did also extend this patch to generate a crc intrinsic if the target supported it, instead of the table lookup. In the case of RISC-V this would be a clmul expansion in the backend. I havenā€™t finished that work, but also it was performing worse than the table based in my experiments. If you already have clmul expansion working for RISC-V, then this patch could be extended using your expansion!

The general feedback when we last discussed this was the code should be generalized rather than being implemented by each port (quite reasonable). If someone wanted to start that task itā€™d be greatly appreciated

Could this patch not be described as such? This should work for any target.

Hi all,

As Krzysztof mentioned, the CRC recognition is pretty fragile. I saw the same in my experience too. I would like to suggest this idea on how to improve this.

Let us think about polynomials of fixed size as vectors of bits. The bitwise addition is the xor operation. Then, for a fixed divisor D(x) the operation r: P(x) -> P(x) mod Q(x) is linear. That means we can recognize crc (and maybe other linear things) by the following method.

Let us say we have a function of one input f(x).

(1) Prove that f is linear, i.e. satisfies this identity f(x ^ y) = f(x) ^ f(y)
(2) Compute values of f on basis (f(1), f(1 << 1), f(1 << 2), ...)
(3) Compare the computed values to known CRC(1), CRC(1 << 1), ... CRC(1 << 2), .... If all the values are equal, then f computes a crc. Note that we would either have to recoginze D(x) either from scanning code or just try some popular values.

@kparzysz @jofa What do you think?

Thanks for the reply Mikhail. When you say ā€˜CRC recognition is pretty fragileā€™, are you referring to the Hexagon recognition, or my patch? Because I donā€™t quite understand how the method in the patch is any more fragile than the method you outlined.

Also, Iā€™m not clear on some of the steps. How would you calculate f(n)? We would surely have to look at each instruction and apply them.

After thinking about it for a whileā€”I think if @jofaā€™s code detects CRC then I think itā€™s worth checking it in.

Extending or generalizing would be nice, but that will take time, and we can get this CRC right now.

@mgudim: This sounds like an interesting approach. Do you think there are other examples of idiomatic functions that could be detected this way?

@jofa

When you say ā€˜CRC recognition is pretty fragileā€™, are you referring to the Hexagon recognition, or my patch?

I am referring to any method that scans the code and looks for particular patterns in it.

Because I donā€™t quite understand how the method in the patch is any more fragile than the method you outlined.

The method I outlined is based on proving some property of the code (f(x ^ y) = f(x) ^ f(y)) which is more stable to different variations in IR. With my method you donā€™t need to know any possible variations of CRC, you just need to know what irreducible polynomials are used and what the crc values should be on 32 numbers:
1 << 0, 1 << 1, ... , 1 << 31. This can be precomputed offline.

Also, Iā€™m not clear on some of the steps. How would you calculate f(n)?

Let us say n = 11. Then to calculate f(n) knowing that f and if you know values f(1 << 0) = 2, f(1 << 1) = 3 and f(1 << 3) = 4:

f(11) = f((1 << 0) ^ (1 << 1) ^ (1 << 3)) = f(1 << 0) ^ f(1 << 1) ^ f(1 << 3) = 2 ^ 3 ^ 4 = ...

Actually we donā€™t need to ā€œcalculate f(n)ā€, we need to prove that for all n we have that f(n) = crc(n). If f is linear you just need to prove f(1 << 0) = crc(1 << 0), f(1 << 1) = crc(1 << 1), ...

@kparzysz

and we can get this CRC right now.

Yes, I agree. Maybe later we can extend it.

Do you think there are other examples of idiomatic functions that could be detected this way?

I honestly donā€™t know. Perhaps once we can prove that some code is linear we can run some huge code base through the recognizer and see what it finds.

Thanks Mikhail, I appreciate the explanation! This makes sense to me. Thinking about how this would look, your outlined method wonā€™t be too difficult to implement given the helper functions in my patch. i.e. we could use the output of the symbolic execution to calculate f(1 << 0), f (1 << 1) ā€¦

Nonetheless, Iā€™m happy to hear you both think this patch is something that can go in. Can anybody help with regards to how it should be reviewed and volunteers for reviewers? Additionally, anybody else who should have visibility of this?

@jofa
I looked very briefly and added some comments mostly asking to add explanations.

@jofa @kparzysz
It would be great this could take advantage of clmul instructions if the target has them. My suggestion is that in LoopRecognition part we emit polynomial division intrinsics. Later, after all inlining happens, these intrinsics may combine to something bigger. I think in coremark 8-bit crcā€™s combine to 16-bit and to 32-bit. During CodeGenPrepare if the target has clmul instructions, we can use Barretā€™s formula to simpify that using polynomial multiplication. If target doesnā€™t have polynomial multiplicatoin, we can emit a table.

@mgudim Yes, thatā€™s exactly what the plan was. I propose some sort of call to TLI to check if the backend has custom lowering for crc. If it does, then emit an intrinsic. And as you said, we can combine intrinsic calls together later in the pipeline. If the target does not support custom lowering, I think itā€™s best to leave the table based expansion in LoopIdiomRecognize, as we already have the CRC information.