Loop identification

Good afternoon,

I’m working on modifying the Mips backend in order to add new functionality. I’ve successfully implemented the intrinsics, but now I want to recognize a pattern like this:

int seq[max];
int cnt = 0;

for (int i = 0; i < max; i++)
{

for (int j = 0; j < 16; j++)
{
char hn = (seq[i] & (3 << (j*2))) >> (j*2); /* j-th 2-bit field of seq[i] */
if (hn == 2)
{
cnt++;
}
}

}

and transform it into something like:

int seq[max];
int cnt = 0;

for (int i = 0; i < max; i++)
{

cnt += intrinsic(seq[i], 2);

}

Do you know what I can use to transform the loop? Or whether something similar already exists in LLVM?

Thanks,

Catello

Hi Catello,

LLVM does have a "loop idiom recognition" pass which, in principle, does exactly that kind of a thing: it recognizes loops that perform memcpy/memset operations. It does not recognize any target-specific idioms though and there isn't really much in it that would make such recognition easier. We have some cases like yours on Hexagon, where we want to replace certain loops with Hexagon-specific intrinsics, and the way we do it is that we have (in our own compiler) a separate pass that runs at around the same time, but which does "Hexagon-specific loop idiom recognition". That pass is not present in llvm.org, mostly because it hooks up a target specific pass in a way that is not "officially supported".

If LLVM supported adding such target-specific passes at that point in the optimization pipeline, you could just write your own pass and plug it in there.
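In case a starting point helps, a bare-bones legacy LoopPass skeleton for this kind of recognition could look something like the following. The pass name, option string, and the "match and rewrite" part are placeholders for illustration; this is not our actual Hexagon pass:

// Minimal legacy-PM loop pass skeleton for target-specific idiom recognition.
// All names here are made up for the example.
#include "llvm/Analysis/LoopPass.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
  class MyLoopIdiomRecognize : public LoopPass {
  public:
    static char ID;
    MyLoopIdiomRecognize() : LoopPass(ID) {}

    bool runOnLoop(Loop *L, LPPassManager &LPM) override {
      // The idiom of interest is an innermost loop, so skip loops with children.
      if (!L->getSubLoops().empty())
        return false;

      // Inspect the loop here (trip count, the instructions in the body, etc.)
      // and, if it matches, emit a call to the target intrinsic and delete the
      // original loop. The actual matching/rewriting is omitted from this sketch.
      return false;
    }
  };
}

char MyLoopIdiomRecognize::ID = 0;
static RegisterPass<MyLoopIdiomRecognize>
    X("my-loop-idiom", "Example target-specific loop idiom recognition");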

-Krzysztof

This certainly seems like a reasonable thing to support, but the question is: Why should your pass run early in the mid-level optimizer (i.e. in the part of the pipeline we generally consider canonicalization) instead of as an early IR pass in the backend? Adding IR-level passes early in the backend is well supported. There are plenty of potential answers here for why earlier is better (e.g. affecting inlining decisions, idioms might be significantly more difficult to recognize after vectorization, etc.) but I think we need to discuss the use case.
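To illustrate what I mean by "well supported": a target's pass configuration can override addIRPasses() and schedule an extra IR-level pass there, before the generic backend IR passes. The class and factory names below are placeholders, and the exact TargetPassConfig header location and constructor signature have varied between LLVM versions, so treat this only as a sketch of the mechanism:

#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/Pass.h"

using namespace llvm;

// Hypothetical factory for the target-specific idiom pass.
FunctionPass *createMyLoopIdiomPass();

namespace {
  class MyTargetPassConfig : public TargetPassConfig {
  public:
    MyTargetPassConfig(LLVMTargetMachine &TM, PassManagerBase &PM)
        : TargetPassConfig(TM, PM) {}

    void addIRPasses() override {
      // Run the target-specific idiom recognition before the common
      // backend IR passes.
      if (getOptLevel() != CodeGenOpt::None)
        addPass(createMyLoopIdiomPass());
      TargetPassConfig::addIRPasses();
    }
  };
}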

  -Hal

The reason is that the idiom code may end up looking different each time one of the preceding optimizations is changed. Also, some of the optimizations (the instruction combiner, for example) have a tendency to greatly obfuscate the code, making it really hard to extract useful data from the idiom code. It is not always enough to simply recognize a pattern: to replace it with an intrinsic, some additional parameters may need to be obtained from the original code. When the code is mangled by the combiner, this process may be a lot harder. Also, the combiner is one of those things that changes quite often. For recognizing loop idioms, the loop optimizations may be the main problem. The idiom code may end up getting unrolled, rotated, or otherwise rendered unrecognizable.

-Krzysztof

This seems to contradict what I was constantly told about the goal of inst-combine, i.e. that it is supposed to canonicalize the IR to make it easier for further passes to recognize patterns.

That part (the loop optimizations) was acknowledged in Hal's answer, IIUC.

Chris has described the "canonical form" as one that has been "maximally strength-reduced". The problem with this is that it is very vague:
- Whether you can simplify something further often depends on how complex the simplification code can become. Also, the equivalency problem is provably unsolvable, so there isn't a hard stop at which it can reliably determine that "this is the final form".
- It doesn't tell the consumers what they should really expect to see in the reduced code, and what should be folded/simplified. Sure, you can read the combiner source, but that may change in the next commit.

I've seen the same problem with SimplifyCFG a few years back. It would go bonkers trying to simplify branches, and as a result it would create some degenerate loop nest out of a single loop with a simple conditional "continue" statement in it. We had to add some code to throttle it back, or otherwise the outcome was just too complex to analyze without writing a loop nest optimizer. (I don't know what it's like now, I haven't looked at it lately).

We have the same thing yet again in the DAG combiner, where the rules are listed in the comments, instead of being listed in the documentation. People write selection patterns that need to match the actual DAGs, and knowing what form the code will take would certainly be of help.

A canonicalizer should, above all, make the code look predictable instead of trying hard to make it optimal, and that's not something that I have confidence in at the moment.

-Krzysztof

The main case for us was recognizing polynomial multiplications. Hexagon has instructions that do that, and the goal was to replace loops that do that with intrinsics. The problem is that these loops often get unrolled and intertwined with other code, making the replacement hard or impossible. This is especially true if some of the multiplication code is combined with instructions that were not originally part of it (I don't remember 100% whether that was happening, but the loop optimizations were the main culprit in making it hard).

-Krzysztof

Is this integer multiplication or floating-point multiplication? If it is integer multiplication, I'd expect that using SCEV would be the easiest way to recognize the relevant patterns. SCEV is supposed to understand all of the unobfuscation tricks.
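To make that concrete, the sort of query I have in mind looks roughly like this (just an illustration of the SCEV API, not code from any existing recognizer; SE and L would come from the usual analysis passes):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"

using namespace llvm;

// Returns true if V evolves as an add recurrence {Start,+,Step} in loop L with
// a constant step, no matter how the IR computing it has been rewritten.
static bool isSimpleRecurrence(Value *V, Loop *L, ScalarEvolution &SE) {
  if (!SE.isSCEVable(V->getType()))
    return false;
  const SCEV *S = SE.getSCEV(V);
  const auto *AR = dyn_cast<SCEVAddRecExpr>(S);
  if (!AR || AR->getLoop() != L)
    return false;
  return isa<SCEVConstant>(AR->getStepRecurrence(SE));
}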

Do these instructions contain an implicit loop (of runtime trip count) or are you trying to match loops of some fixed trip count?

  -Hal

It's binary polynomial multiplication. The loops usually have a compile-time constant iteration count.

I posted a specific patch with that code:
https://reviews.llvm.org/D28694

-Krzysztof

The result in the example I made doesn’t need to be exactly like that; that is, I want to recognize the inner loop and translate it into a machine instruction. I thought that translating it into an intrinsic would be easier, but it seems not.
What I was thinking of doing is modifying the Mips backend, adding a Loop Pass or a Function Pass. Do you think that this could work?

The polynomial multiplication discussed earlier in the thread is actually exactly that---it converts a loop into a single intrinsic that performs polynomial multiplication. The polynomials are over {0, 1}, i.e. the coefficients are bits, and a 32-bit integer can represent a polynomial of degree 31. Hexagon (among other architectures) has an instruction that takes two such inputs (two integers representing polynomials) and produces their polynomial product.
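In other words, the scalar semantics are carry-less multiplication over GF(2). A small scalar model of the computation, just to show what "polynomial product" means here (this is not the actual Hexagon intrinsic or its signature), would be:

#include <stdint.h>

// Multiply two polynomials over GF(2), each packed into a 32-bit integer
// (bit i is the coefficient of x^i). The product can have degree up to 62,
// so it needs 64 bits. Adding coefficients is XOR, so there are no carries.
uint64_t pmul32(uint32_t a, uint32_t b) {
  uint64_t acc = 0;
  for (int i = 0; i < 32; i++)
    if ((b >> i) & 1)            // coefficient of x^i in b
      acc ^= (uint64_t)a << i;   // add a(x) * x^i
  return acc;
}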

Generally, target-specific passes run after all target-independent passes. This may be a problem for loop idiom recognition, since all the target-independent optimizations may transform the loop in a way that makes the idiom recognition harder. The PassManagerBuilder supports the concept of insertion points, where extra passes can be added, and I'm hoping that an insertion point can be added at the place where the target-independent idiom recognition pass runs, which would help both you and me (with the polynomial multiplication, and a few other things, which are not yet in the trunk).

-Krzysztof

You can now add your own loop pass that could do loop idiom recognition for your target. I just did that for Hexagon in:
https://reviews.llvm.org/rL293213.
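For anyone following along, the shape of the hook is roughly the following: the target registers a callback at one of the PassManagerBuilder extension points (typically from its TargetMachine::adjustPassManager override), and the callback adds the loop pass to the pipeline. The factory name below is a placeholder, and the particular extension point is my assumption; see the revision above for what Hexagon actually does:

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Pass.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

using namespace llvm;

Pass *createMyLoopIdiomPass(); // hypothetical target pass factory

// Register the target's loop idiom pass at a late loop-optimization
// extension point of the (legacy) PassManagerBuilder.
static void registerMyLoopIdiom(PassManagerBuilder &PMB) {
  PMB.addExtension(
      PassManagerBuilder::EP_LateLoopOptimizations, // assumed extension point
      [](const PassManagerBuilder &, legacy::PassManagerBase &PM) {
        PM.add(createMyLoopIdiomPass());
      });
}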

-Krzysztof