[RFC] Framework for Finding and Using Similarity at the IR Level

Hello,

I’m Andrew Litteken, and I am working on a framework for defining, detecting, and deduplicating similar code sections at the IR level.

Programmers can introduce sections of code that perform nearly the same operation. By copying and pasting code throughout a code base, or by using a piece of code as a reference when writing new code with nearly the same structure, redundancies are inadvertently introduced throughout programs. Furthermore, compilers can introduce similarity through sets of instructions that require the same code generation strategies, or by optimizing different code to the same patterns.

For example, these two pieces of code:

int fn(const std::vector<int> &myVec) {
  for (auto it = myVec.begin(),
            et = myVec.end(); it != et; ++it) {
    if (*it & 1)
      return 0;
  }
  return 1;
}

And

int fn(const std::vector<int> &myVec) {
  for (const int &x : myVec) {
    if (x % 2)
      return 1;
  }
  return 0;
}

have a conditional that checks whether the current element (*it and x, respectively) is odd. And they generate the same piece of IR for that conditional:

%11 = load i32, i32* %10, align 4
%12 = and i32 %11, 1
%13 = icmp eq i32 %12, 0
%14 = getelementptr inbounds i32, i32* %10, i64 1
br i1 %13, label %7, label %15

Such sections of code can be called similar. Two sections of code are considered similar when they perform the same operations on a given set of inputs; this does not require them to be identical. In particular, this definition allows the same sequence of instructions to use different operand names. For example:

%1 = add i32 %a, 10
%2 = add i32 %a, %1
%3 = icmp slt i32 %1, %2

and

%1 = add i32 11, %a
%2 = sub i32 %a, %1
%3 = icmp sgt i32 %2, %1

are clearly not identical: the constants differ, the comparison predicate is different, and some operands are flipped. But if the constant 10, the constant 11, and the register %a are considered inputs into these sections, then the two regions could reduce to:

%1 = add i32 %input1, %input2
%2 = sub i32 %input2, %1
%3 = icmp slt i32 %1, %2

Ultimately, these are sets of operations performing the same action, even though their operands differ. This patch focuses on this sort of “structural” similarity.

However, this concept can be easily generalized. For example, two sections of code could be considered similar when they contain the same instructions, but not necessarily in the same order, and, because of the way the operands are used under that different ordering, still compute the same results for the given inputs.

If these similarities are recognized, they could be used to reduce code size, analyze redundancies in a large project, or build tools that help a programmer refactor code.

This new framework offers an interface for detecting similarity at the IR level. It has an internal representation that can be accessed via an analysis, allowing easy access by other transformation and analysis passes that want to leverage detected similarity within a program. It can also be easily serialized into a file format like JSON, or emitted as OptRemarks. In the future, a more advanced notion of similarity could be used, and it would be relatively simple to swap a new detection algorithm into the similarity framework.

At present, two different applications have been developed on top of this framework. The first is a similarity visualizer tool called LLVM-Sim that ingests a module, uses the framework to find the similarity in that module, and outputs a report for the programmer in a serialized form, as mentioned above. The second is an IR Outliner that uses the analysis pass as the basis for its outlining candidates. I will first talk briefly about the framework before moving on to the two applications.

Finding Similarity:

The framework for finding similarity is centered around a new analysis pass that finds similar regions of code that do not cross basic blocks, and groups them into different groups of similarity based on their structure. These groups are then made available to other passes, such as an outliner, or tools, such as LLVM-Sim.

In more detail, finding similarity works by hashing instructions to integers and placing them in a long list. The list can be thought of as a string, with each unsigned integer acting as a character. This string can then be processed by a Suffix Tree to find repeated substrings in the program. Following this, the instances of each repeated substring are analyzed for structural similarity: if a one-to-one mapping can be found between the operands of two different instances of a repeated substring, such that each set of operands is operated on in the same way, then those two regions of instructions are considered structurally similar to one another. These tasks are handled by three new classes: the IRInstructionMapper, the IRSimilarityCandidate, and the IRSimilarityIdentifier.
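As a rough illustration of the detection step only (LLVM's implementation uses a Suffix Tree for efficiency; `findRepeats` and the sliding-window approach here are made up for this sketch), repeated substrings over the integer "string" can be found like this:

```cpp
#include <iterator>
#include <map>
#include <vector>

// Find all substrings of length Len that occur more than once in the
// "string" of instruction IDs. Each repeated substring corresponds to a
// set of candidate regions for similarity analysis.
std::map<std::vector<unsigned>, std::vector<size_t>>
findRepeats(const std::vector<unsigned> &IDs, size_t Len) {
  std::map<std::vector<unsigned>, std::vector<size_t>> Occurrences;
  for (size_t I = 0; Len <= IDs.size() && I + Len <= IDs.size(); ++I) {
    // Record the start index of every window of length Len.
    std::vector<unsigned> Window(IDs.begin() + I, IDs.begin() + I + Len);
    Occurrences[Window].push_back(I);
  }
  // Keep only windows that repeat; each one is a candidate substring.
  for (auto It = Occurrences.begin(); It != Occurrences.end();)
    It = It->second.size() > 1 ? std::next(It) : Occurrences.erase(It);
  return Occurrences;
}
```

A suffix tree does the same job in linear time over all substring lengths at once; this quadratic sketch only conveys what "repeated substring" means here.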

Finding Similarity Implementation:

IRInstructionMapper:
Since matching only identical instructions does not provide much information when looking for structural similarity, the value of the instruction and the values of its operands are, for most instructions, not considered. Instead, only the opcode, types, and parameters are considered: if the opcodes and types match, the instructions are considered to be performing the same operation. However, for some instructions, like shufflevector or getelementptr, it is also ensured that the parameters match, in keeping with the “isSameOperationAs” instruction method. Each instruction that is unique under this scheme is mapped to an unsigned integer. Mapping the entire program to integers treats the program like a string with each integer as a character, and allows the use of data structures such as the Suffix Tree (recently refactored into LibSupport in review D80586) to process the string for repeated sequences of instructions in the program.

This maps instructions such as

%1 = add i32 %a, 1
%2 = add i32 %b, %c

to the same value, as they have the same opcode and type, while instructions such as

%1 = add i32 %a, 1
%2 = add i64 %b, %c
%3 = mul i64 %b, %c

are mapped to different values, since they have different opcodes and types.
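A minimal sketch of this numbering scheme, using a toy instruction type instead of llvm::Instruction (ToyInst and ToyMapper are invented names for illustration; the real IRInstructionMapper also considers parameters for instructions like getelementptr):

```cpp
#include <map>
#include <string>
#include <utility>

// Toy stand-in for an IR instruction: just an opcode and a type name.
struct ToyInst {
  std::string Opcode;
  std::string Type;
};

// Assign the same unsigned ID to instructions with the same opcode and
// type, ignoring operand values, and a fresh ID to each new combination.
class ToyMapper {
  std::map<std::pair<std::string, std::string>, unsigned> Seen;
  unsigned NextID = 0;

public:
  unsigned mapInst(const ToyInst &I) {
    auto Key = std::make_pair(I.Opcode, I.Type);
    auto It = Seen.find(Key);
    if (It != Seen.end())
      return It->second;
    Seen[Key] = NextID;
    return NextID++;
  }
};
```

Running the two examples above through this mapper gives the add-i32 pair one shared ID, and the add-i64 and mul-i64 instructions distinct IDs.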

IRSimilarityIdentifier:
The list of integers representing the program is then passed to the Suffix Tree, which quickly finds repeated substrings and creates an IRSimilarityCandidate for each instance of those repeated substrings. The IRSimilarityCandidate provides an interface for comparing the operand use and instructions of one IRSimilarityCandidate against another to determine structural similarity.

Structural similarity is determined by attempting to find a one-to-one mapping from each register in one candidate to each register in another. If this mapping is found, then the two candidates are performing the same operations on a given set of inputs and are structurally similar. For example:

%1 = add %2, %3
%4 = add %1, %2

is equivalent to

%a = add %b, %c
%d = add %b, %a

And the IRSimilarityCandidates containing these sections of code would be considered structurally similar. But

%1 = add %2, %3
%4 = add %1, %2

is not equivalent to

%a = add %b, %c
%d = add %a, %e

The use of %2 in the first section does not match the use of %b in the second, so the IRSimilarityCandidates containing these sections would not be considered structurally similar.
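The one-to-one mapping check can be sketched as follows. This toy version (ToyRegion and structurallySimilar are made-up names) compares operands purely positionally, treats the result register as just another operand, and ignores commutativity, so it is only an illustration of the idea, not of the real IRSimilarityCandidate logic:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A region as a list of (opcode, register-list) pairs, where the first
// register in each list is the result and the rest are operands.
using ToyRegion =
    std::vector<std::pair<std::string, std::vector<std::string>>>;

// Structural similarity: same opcodes in the same order, and a consistent
// one-to-one mapping between the registers of the two regions.
bool structurallySimilar(const ToyRegion &A, const ToyRegion &B) {
  if (A.size() != B.size())
    return false;
  std::map<std::string, std::string> AToB, BToA;
  for (size_t I = 0; I < A.size(); ++I) {
    if (A[I].first != B[I].first || A[I].second.size() != B[I].second.size())
      return false;
    for (size_t J = 0; J < A[I].second.size(); ++J) {
      const std::string &RA = A[I].second[J], &RB = B[I].second[J];
      auto F = AToB.insert({RA, RB});
      auto G = BToA.insert({RB, RA});
      // Reject if either register was already mapped to something else:
      // the mapping must be one-to-one in both directions.
      if (F.first->second != RB || G.first->second != RA)
        return false;
    }
  }
  return true;
}
```

On the negative example above, the check fails exactly because %2 would have to map to both %b and %e.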

The IRSimilarityIdentifier takes the IRSimilarityCandidates created from the instances of the repeated substrings and uses the IRSimilarityCandidate interface described above to determine which instances of a given repeated substring are structurally similar to one another.

Once it has been determined which instances of a repeated substring are structurally similar, the sets of IRSimilarityCandidates found to be compatible are grouped together into a Similarity Group. The IRSimilarityCandidates contained in each group are structurally similar to one another. In terms of the example above, the first two sections of code would be in the same Similarity Group, but the third is not structurally similar to either; it would therefore be in its own group, with no matching structure elsewhere in the program.
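A hypothetical sketch of this grouping step, with integers standing in for IRSimilarityCandidates and the pairwise similarity check abstracted into a predicate (groupBySimilarity is an invented name, not the identifier's actual API):

```cpp
#include <functional>
#include <vector>

// Greedily partition candidates into similarity groups: each candidate
// joins the first group whose representative it is similar to, or starts
// a new group of its own, mirroring the Similarity Group construction.
std::vector<std::vector<int>>
groupBySimilarity(const std::vector<int> &Candidates,
                  const std::function<bool(int, int)> &Similar) {
  std::vector<std::vector<int>> Groups;
  for (int C : Candidates) {
    bool Placed = false;
    for (auto &G : Groups)
      if (Similar(G.front(), C)) {
        G.push_back(C);
        Placed = true;
        break;
      }
    if (!Placed)
      Groups.push_back({C}); // a structure seen for the first time
  }
  return Groups;
}
```

A candidate with no match ends up alone in its group, just like the third section in the example above.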

Application: Visualizing Similarity

One potential application is simply exporting any similarities found for viewing by a user. LLVM-Sim is a tool that accepts a module, passes it to the IRSimilarityIdentifier, and retrieves the Similarity Groups for the module. These are then exported to a JSON file with an entry for each Similarity Group and a list of ranges for each region of similarity contained in that group. This tool could also be implemented with remarks rather than JSON for faster, more streamlined processing of program information.

If a module had a Similarity Group with two sections, instructions 5 to 10 and instructions 15 to 20, and a different Similarity Group with two sections, instructions 1 to 4 and instructions 21 to 24, the output JSON would be:

{
  "1": [{"s": 5, "e": 10}, {"s": 15, "e": 20}],
  "2": [{"s": 1, "e": 4}, {"s": 21, "e": 24}]
}
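A sketch of what serializing Similarity Groups into this JSON shape might look like; emitGroupsAsJSON is a made-up helper for illustration, not LLVM-Sim's actual serialization code, though the "s"/"e" field names follow the example above:

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One similarity group is a list of [start, end] instruction ranges.
using Range = std::pair<unsigned, unsigned>;

// Emit groups keyed "1", "2", ... with their ranges as {"s": .., "e": ..}.
std::string emitGroupsAsJSON(const std::vector<std::vector<Range>> &Groups) {
  std::ostringstream OS;
  OS << "{";
  for (size_t G = 0; G < Groups.size(); ++G) {
    OS << (G ? ", " : "") << "\"" << G + 1 << "\": [";
    for (size_t R = 0; R < Groups[G].size(); ++R)
      OS << (R ? ", " : "") << "{\"s\": " << Groups[G][R].first
         << ", \"e\": " << Groups[G][R].second << "}";
    OS << "]";
  }
  OS << "}";
  return OS.str();
}
```

Feeding it the two groups from the example reproduces the JSON object shown above on a single line.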

This information could be used to create interfaces for viewing and comparing the similarity present in a program, such as the example interface here: https://reviews.llvm.org/D86974, which shows two instances of the program alongside its IR, with two different similar sections highlighted.

Application: Deduplicating Similarity with Outlining:

Alternatively, after identifying the similar sections of the program, the similar regions of code can be deduplicated. The IR Outliner uses the Code Extractor and the new similarity analysis pass to extract each of the similar ranges individually, and then deduplicates the individually extracted functions for each region, ensuring that the information needed for each extraction is not lost.

Outlining at the IR level can benefit any target architecture, and reduces the overall size of a program before any target-specific decisions are made. The Machine Outliner in LLVM, on the other hand, runs after the register allocator and must be implemented for each target separately. The concept of structural code similarity is easier to implement in a general way at the IR level.

The IR Outliner is off by default. There is a flag, -mllvm -ir-outliner, that uses the defined cost model to attempt to outline sections of code that will lead to an overall decrease in the size of the program. There is also a debugging option, -mllvm -ir-outliner-no-cost, that outlines the Similarity Groups greedily based on how many instructions will be removed, without considering the number of instructions added.

There are more details about the implementation of the IR Outliner at the end of the RFC.

Results:
The LLVM Test Suite, compiled with the IR Outliner on top of -Oz, has seen code size reductions of up to 60%, and an average reduction of 1%, for both X86 and AArch64. In these tests, the IR Outliner is placed as the last stage before code generation for the given target. There are some regressions. These occur because our cost model cannot fully predict what happens during assembly generation: it is difficult to estimate, at the IR level, how many instructions will be added during code generation. For instance, if many arguments need to be created to handle constants, the decisions made when creating a call at the assembly level are not directly obvious, so our cost model may estimate that fewer instructions are added than actually are. Using target hooks may be a way to get a better sense of what will happen during code generation than a general framework for calling conventions.
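To make the shape of such a benefit/cost estimate concrete, here is a deliberately simplified sketch; OutlineEstimate, worthOutlining, and every constant in it are invented for illustration, whereas the real cost model consults target information and is considerably more detailed:

```cpp
// Outlining N occurrences of a region of L instructions removes roughly
// N*L instructions, and adds back one call per occurrence plus the
// outlined function body and overhead for passing arguments.
struct OutlineEstimate {
  unsigned Occurrences; // how many similar regions share this structure
  unsigned RegionLen;   // instructions per region
  unsigned NumArgs;     // inputs/outputs that become call arguments
};

// Returns true if outlining is estimated to shrink the program overall.
bool worthOutlining(const OutlineEstimate &E) {
  unsigned Removed = E.Occurrences * E.RegionLen;
  unsigned CallCost = 1 + E.NumArgs;        // call + argument setup (guess)
  unsigned Added = E.Occurrences * CallCost // a call at each site
                   + E.RegionLen + 1;       // outlined body + return
  return Added < Removed;
}
```

The regressions described above correspond to cases where the real Added ends up larger than an estimate of this kind predicts.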

CTMark Code Size (x86_64):

Test | Original | With | Reduction

This looks like a very worthwhile project. Often we are looking for faster execution, with code size being a secondary issue, though some uses will have the reverse priority. Hopefully, the project will provide execution times along with code-size reductions.

Neil Nelson

Indeed, an awesome project and an excellent report!

Code size doesn’t really get much attention, so the level of detail and the strong roadmap are refreshing.

> Hopefully, the project will provide execution times along with code-size reductions.

I doubt it.

Outlining will (almost?) always make for slower code, due to a lot more calls being made. But that’s ok for embedded targets, which don’t have strong speculative machinery.

Also, if the program fits versus doesn’t, then it will run infinitely faster. :smiley:

Kristof, do you know who at Arm would be good to review those patches?

cheers,
–renato

Thanks for the positive feedback!

All the patches are up now:

Elevating different constants to arguments: [IRSim][IROutliner] Adding support for consolidating functions with different output arguments.
Extracting required outputs: [IRSim][IROutliner] Adding support for consolidating functions with different output arguments.
Merging Output Blocks: [IRSim][IROutliner] Merging output blocks for extracted functions with outputs
Cost Model: [IRSim][IROutliner] Adding a cost model, and debug option to turn the model off.
OptRemarks: [IRSim][IROutliner] Adding OptRemarks for the IROutliner.
Function Attribute Merging: [IRSim][IROutliner] Adding consistent function attribute merging
DebugInfo Handling: [IRSim][IROutliner] Adding DebugInfo handling for IR outlined functions.
Outlining from linkonceodr: [IRSim][IROutliner] Adding option to enable outlining from linkonceodr functions
Flexibility for Isomorphic Predicates: [IRSim] Adding support for isomorphic predicates
Flexibility for Commutative Instructions: [IRSim] Adding commutativity matching to structure checking
Matching call instructions for similarity identification: [IRSim] Letting call instructions be legal for similarity identification.
Outlining Calls: [IRSim][IROutliner] Allowing call instructions to be outlined.
Matching GEP instructions for similarity identification: [IRSim] Letting gep instructions be legal for similarity identification.
Outlining GEP instructions: [IRSim][IROutliner] Allowing GEP instructions to be outlined.

Feedback, comments and reviews are welcome!

Andrew

Hi Andrew,

Thanks for this great work and presentation. I also have a huge interest in code size reduction. I already looked at the first IR Outliner proposal, and I really like your more general approach.

I wanted to look at the impact on ARM targets, and managed to apply your patchset up to D87311 (the remaining ones, which are related to call instructions and GEP, need to be reworked to apply or work correctly). I looked at CTMark code size over -Oz for both ARM and X86 and got a huge code size increase on both targets (around 30% or 40% in geomean) when compiling with -mllvm -ir-outliner.

Does it ring any bells on your side? Is it expected given that the full series is not applied, or did I just miss something? From what I can observe, and as you mentioned in your first mail, a lot of code is introduced to handle parameter passing to outlined functions, and the biggest code size increase was observed after applying D87296.

Thanks a lot for this work. Any hint on how to reproduce your results is welcome; I'm eager to play with it and give some help if needed.

Cheers,
Yvan

Hi Yvan,

Glad to hear you’re interested! I’m not sure in which order you’re applying the patches, but a cost model was not added to the outliner until D87299. So, if that hasn’t been added, I would expect some pretty big code size increases, since otherwise it just outlines whatever similarity it finds first.

The cost model has also been slightly reworked since the initial results to be more general and to use target hooks, but not all of the target hooks that we would really want exist yet. So, if it has been applied, I’ll definitely need to take a look.

Andrew

> Hi Yvan,
>
> Glad to hear you’re interested! I’m not sure in which order you’re applying the patches, but a cost model was not added to the outliner until D87299. So, if that hasn’t been added, I would expect some pretty big code size increases, since otherwise it just outlines whatever similarity it finds first.
>
> The cost model has also been slightly reworked since the initial results to be more general and to use target hooks, but not all of the target hooks that we would really want exist yet. So, if it has been applied, I’ll definitely need to take a look.

I applied the patches in order on top of your commit of D86974 (15645d044bc), that is:
D86975 - D86976 - D86977 - D86978 - D87294 - D87296 - D87298 - D87299 - D87300 - D87301 - D87302 - D87309 - D87310

So, the cost model was part of it. I don't remember if I had to resolve some conflicts for it, but I'll re-check and see if it was applied correctly.

Yvan

Hi Yvan,

I went through and found a bug, introduced during some refactoring, that caused removed instructions to be added repeatedly to the estimated benefit. I’ve uploaded the corrected diff to the cost model patch: https://reviews.llvm.org/D87299.

This should take care of the large increases you’re seeing. However, you should see smaller improvements than the results I provided in the RFC, since the call and GEP instructions allow the regions to be larger, making it easier to achieve a higher benefit-to-cost ratio.

Additionally, the current set of patches does not outline most intrinsics, since we weren’t sure how safe it was to outline every type of intrinsic. The initial data did not enforce this restriction; we adjusted the patches afterwards to help land them more quickly. The hope is that the different intrinsics will be added piece by piece, to make sure that outlining does not cause any miscompiles.

Let me know if there are any more complications.

Andrew

Hi Andrew,

> Hi Yvan,
>
> I went through and found a bug, introduced during some refactoring, that caused removed instructions to be added repeatedly to the estimated benefit. I’ve uploaded the corrected diff to the cost model patch: https://reviews.llvm.org/D87299.

Great, it fixes the issue on my side as well, on both X86 and ARM targets.

> This should take care of the large increases you’re seeing. However, you should see smaller improvements than the results I provided in the RFC, since the call and GEP instructions allow the regions to be larger, making it easier to achieve a higher benefit-to-cost ratio.

Yes, the effect is now almost flat without calls and GEP. It would be great if you could rebase those patches on top of the changes you made in the previous ones, but this is already a good basis.

> Additionally, the current set of patches does not outline most intrinsics, since we weren’t sure how safe it was to outline every type of intrinsic. The initial data did not enforce this restriction; we adjusted the patches afterwards to help land them more quickly. The hope is that the different intrinsics will be added piece by piece, to make sure that outlining does not cause any miscompiles.

Ok, it makes sense. I can help at some point with that, and/or at least test it on ARM.

Thanks a lot
Yvan