[RFC] PT.2 Add IR level interprocedural outliner for code size.

River707 · September 5, 2017, 11:16pm

Hey Everybody,

A little while ago I posted an RFC(http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html) with the proposition of adding a new outliner at the IR level. There was some confusion and many questions regarding the proposal which I’d like to address here:

Note about nomenclature:

Candidate: A repeated sequence of instructions within a module.

Occurrence: One instance of a candidate sequence.

– Accompanied Graph Data –

Graph data is referenced in the sections below, any reference to Graph[Number] is referencing the numbered graph in the following document:

https://goo.gl/QDiVHU

---- Performance ----

I have tested the IR outliner and current Machine outliner on a wide variety of benchmarks. The results include total % reduction of both geomean and total size. It also includes individual results for each test in each respective benchmark.

The configurations tested are:

· Early+Late IR outlining

· Late IR outlining

· Machine outlining

· Early+Late+Machine outlining

· Late+Machine outlining

NOTE: For fairness in comparisons with the Machine Outliner, all IR outliner runs also include (-mno-red-zone, outlining from linkonce_odr/weak_odr functions).

The code size benchmarking results provided are:

LLVM Test Suite

X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*

Spec 2006

X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*

Clang

X86_64(Mac OS)

llvm-tblgen

X86_64(Mac OS)

CSiBE

AArch64

The machine outliner currently only supports X86_64 and AArch64.

Full Code Size Results:

https://goo.gl/ZBjHCG

— Algorithmic differences with the Machine Outliner ----

There was a lot of confusion on how exactly the algorithm I am proposing differs from what is available in the Machine Outliner. The similarities of the two outliners lie in the usage of a string matching algorithm and candidate pruning. The first step in the algorithm is to basically do the same common substring / pruning algorithm the post-RA MO uses but with a specially chosen congruence relation. I’d like to delve into the differences between the two:

Congruence Detection:

Machine Outliner

The machine outliner has the advantage of having this problem already taken care of by the register allocator, it simply checks to see if the two machine instructions are identical.

IR Outliner

In the IR outliner we work on semantic equivalence, i.e. we care the operations being performed are equivalent and not the values. This creates a need to add verification that we do have exact equivalence when we need it, e.g. ShuffleVector’s shuffle mask, not taking the address of InlineASM, etc.

A quick example of semantic equivalence:

%1 = add i32 1, i32 2

%2 = add i32 2, i32 3

These two instructions are not identical because the values of the operands are not identical. They are, however, semantically equivalent because they both perform the add operation.

This can be seen by simply removing the operand values used in the calculations:

%1 = add i32 , i32

%2 = add i32 , i32

Occurrence Verification:

Machine Outliner

At the post RA level you don’t need to do any kind of special verification for candidate occurrences because you don’t have to deal with the concept of inputs.

IR Outliner

At the IR/preRA level we need to do complex verification to make sure that the occurrences within a candidate have the same internal inputs. If two occurrences have different internal inputs then we need some form of control flow to maintain correctness. By internal inputs I mean the operands of instructions that come from an instruction within the occurrence, e.g.

%2 = …

// Start outlining occurrence.

%3 = …

%4 = sub %3, %2 // The first operand is an internal input, the second is external.

If there is any confusion about why we need control flow for internal inputs I am more than happy to provide examples and more detailed explanations.

Aside from internal inputs we also need to verify that the functions we are outlining from have compatible attributes.

Cost Modeling:

Machine Outliner

At the MIR level the cost information is extremely accurate. So cost modeling is composed of effectively counting the number of instructions and adding some frame/setup cost.

IR Outliner

At the IR level we are working with estimates for the costs of certain instructions. We try to match the IR cost to the MIR cost as closely as possible and in practice we can get fairly close(Graph[1]).

Taking this a step further we need to estimate the cost/setup of having x amount of parameters and y outputs, as well as the register pressure from both the call and the potentially outlined function.

Parameterization Optimizations:

Machine Outliner

The Machine outliner uses exact equivalence, which does not allow for any form of parameterization.

IR Outliner

Being at the IR level requires us to tackle parameterization, which then brings several optimizations to help lower the cost of parameterizing a sequence.

Constant Folding

The IR outliner will identify constant inputs and fold them.

Congruent Input Condensing

The outliner identifies the congruent sets of parameters for a function. Example:

void fn(int, int); → void fn(int);

fn(1, 1); → fn(1);

fn(%1, %1); → fn(%1);

Parameters 1 and 2 were found to be the same for each callsite of the function, so we condensed the congruent parameters.

Input Partitioning

The outliner partitions candidates that have a parameter that can be constant folded. Example:

fn(1);

fn(%1);

Occurrences 1 and 2 in the above candidate can have parameter 1 folded. We create a new candidate containing just occurrences 1 and 2 as it may be more profitable than the original candidate.

Constant int condensing

The outliner identifies constant int parameters and checks to see if, for each occurrence, they are an equal distance from other constant int parameters. If so it removes all but one of the parameters and represents the others as an add from the base. Example:

void fn(int a, int b);

fn(1, 2);

fn(3, 4);

In the above, parameters 1 and 2 are always a distance of 1 apart. We can redefine our function as:

void fn(int a) {

int b = a + 1;

…

}

Register Usage:

Machine Outliner

The MO works post RA with exact equivalence, so the most it will compute is if it needs to save the link register on arm64.

IR Outliner

The IR outliner needs to compute register usage for the new outlined function as well as the usage after generating a function call with x parameters and y outputs at each program point z.

Outlining:

Machine Outliner

At the MIR level we clone the outlined instructions into a new function, create some prologue/epilogue for the function, and then generate a call.

IR Outliner

At the IR level we also have to handle the parameters/outputs of the candidate. Here we need to merge all of the metadata of outlined instructions/outlined functions. We also need to identify congruent sets of parameters between call sites and then folding the amount of parameters that are needed for the call.

Suffix Array vs Suffix Tree+LCP:

The two structures should compute the same result, but there is a non obvious benefit that we get from the suffix array. With the suffix array approach we identify candidates that shares common occurrences albeit with a different length. This is very useful for complex verification/analysis, e.g. at the IR or pre RA level. This allows us to cache the work when we calculating inputs or verifying the internal inputs of occurrences. Although this won’t be an issue if/when we switch to a common interface for candidate selection.

---- A replacement for the Machine Outliner? Not exactly ----

The IR outliner was never intended as a replacement for the machine outliner and the two can coexist. The outliners tend to catch very different cases: the machine outliner tends to favor very small candidate lengths. Using a build of llvm-tblgen, the machine outliner gets ~52% of its benefit from outlined functions of 2-3 instructions. The IR outliner tends to favor large candidate lengths(2-20+), often composed of function calls. 52% of the benefit for the IR outliner in the llvm-tblgen example is found in outlined functions with final lengths up to 17. Data for example runs of both can be found in the graph data file and is summarized in Graph[2].

Included in the performance data are metrics showing the performance of using both the IR outliner and machine outliner. The data indicates that you can achieve up to, and exceed, 2% reduction of both geomean and total size by using both.

---- Pros/Cons of IR----

The current algorithm is implemented at the IR level, but there are trade offs to placing this transformation anywhere in the pipeline(IR/preRa/postRA).

– Less Precise Cost Modeling:

Being at the IR level creates a need to estimate the size cost of any given instruction.

How much does this imprecision affect the benefit estimation?
Included in the data : Graph[1]: is the difference between our estimated function size and the actual size in the binary. It shows that we get very close and tend to be on the conservative side.
Estimation causes the IR outliner to be conservative. Which means that we are losing out on potential benefit by overestimating cost.

– Higher Level of Abstraction:

The outliners are essentially string matching algorithms. Being at a higher level of abstraction naturally gives more opportunities for equivalence. As an example, call instructions are handled naturally at the IR level.
Will a preRA outliner be able to have the same relaxation in congruence matching? E.g will it be able to match tail and non tail function calls?
Being at the IR level means that we lose out on some instruction lowering idioms, e.g. constant expressions, bitwise rotation([shl, lshr, or] → [rot]), etc.
This is evident in the results for test suite for aarch64, in which the machine outliner outperforms the IR outliner due in part to the large amount of global accesses in the tests.

– Maintainability:

The IR level in general is much more maintainable.
We don’t have to be as conservative about certain ABI characteristics. This allows for the IR outliner to work without the need for any extra work(special options) from the users. For example, the machine outliner requires ‘noredzone’ but the IR outliner does not.

– Pipeline Flexibility:

As shown in the performance data below, we can get up to 2x performance by working pre function simplification. Though working pre simplification means the outliner must gamble between the benefits of outlining vs simplification.

– Loss of control:

The machine level can have more control over the outlining process. We could have optimized parameterization, alignment handling, etc.

---- Adapting the algorithm to pre-RA IR ----

The analysis portion of the IR outliner is already IR agnostic for the most part. It works on indices into the congruency vector for instructions and their inputs/outputs. This would mean that a preRA outliner would only have to define the MIR specific portions: Congruency detection, cost analysis, parameter/output optimizations, and the outlining of beneficial candidates.

– Implementation –

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp

All feedback/comments/discussion welcome and appreciated!

Thanks,

River Riddle

ghoflehner · September 22, 2017, 2:10am

In general I would love to see an outliner at the IR level also. But rather than a comparison vs. the machine outliner I would like to learn more about how the core data structures between the outliners will be shared. In particular for matching/pruning it seems to be a reasonable approach. A few more remarks/questions are below also.

Thanks
Gerolf

Hey Everybody,
A little while ago I posted an RFC(http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html) with the proposition of adding a new outliner at the IR level. There was some confusion and many questions regarding the proposal which I’d like to address here:

Note about nomenclature:
Candidate: A repeated sequence of instructions within a module.
Occurrence: One instance of a candidate sequence.

– Accompanied Graph Data –

Graph data is referenced in the sections below, any reference to Graph[Number] is referencing the numbered graph in the following document:

https://goo.gl/QDiVHU

---- Performance ----

I have tested the IR outliner and current Machine outliner on a wide variety of benchmarks. The results include total % reduction of both geomean and total size. It also includes individual results for each test in each respective benchmark.

The configurations tested are:
· Early+Late IR outlining
· Late IR outlining
· Machine outlining
· Early+Late+Machine outlining
· Late+Machine outlining

NOTE: For fairness in comparisons with the Machine Outliner, all IR outliner runs also include (-mno-red-zone, outlining from linkonce_odr/weak_odr functions).

The code size benchmarking results provided are:

LLVM Test Suite

X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*

Spec 2006

X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*

Clang

X86_64(Mac OS)

llvm-tblgen

X86_64(Mac OS)

CSiBE

AArch64

The machine outliner currently only supports X86_64 and AArch64.

Full Code Size Results:
https://goo.gl/ZBjHCG

— Algorithmic differences with the Machine Outliner ----

There was a lot of confusion on how exactly the algorithm I am proposing differs from what is available in the Machine Outliner. The similarities of the two outliners lie in the usage of a string matching algorithm and candidate pruning. The first step in the algorithm is to basically do the same common substring / pruning algorithm the post-RA MO uses but with a specially chosen congruence relation. I’d like to delve into the differences between the two:

Congruence Detection:

Machine Outliner
The machine outliner has the advantage of having this problem already taken care of by the register allocator, it simply checks to see if the two machine instructions are identical.

IR Outliner
In the IR outliner we work on semantic equivalence, i.e. we care the operations being performed are equivalent and not the values. This creates a need to add verification that we do have exact equivalence when we need it, e.g. ShuffleVector’s shuffle mask, not taking the address of InlineASM, etc.

A quick example of semantic equivalence:
%1 = add i32 1, i32 2
%2 = add i32 2, i32 3

These two instructions are not identical because the values of the operands are not identical. They are, however, semantically equivalent because they both perform the add operation.
This can be seen by simply removing the operand values used in the calculations:
%1 = add i32 , i32
%2 = add i32 , i32

Occurrence Verification:

Machine Outliner
At the post RA level you don’t need to do any kind of special verification for candidate occurrences because you don’t have to deal with the concept of inputs.

IR Outliner
At the IR/preRA level we need to do complex verification to make sure that the occurrences within a candidate have the same internal inputs. If two occurrences have different internal inputs then we need some form of control flow to maintain correctness. By internal inputs I mean the operands of instructions that come from an instruction within the occurrence, e.g.

%2 = …
// Start outlining occurrence.
%3 = …
%4 = sub %3, %2 // The first operand is an internal input, the second is external.

If there is any confusion about why we need control flow for internal inputs I am more than happy to provide examples and more detailed explanations.

Aside from internal inputs we also need to verify that the functions we are outlining from have compatible attributes.

Cost Modeling:

Machine Outliner
At the MIR level the cost information is extremely accurate. So cost modeling is composed of effectively counting the number of instructions and adding some frame/setup cost.

IR Outliner
At the IR level we are working with estimates for the costs of certain instructions. We try to match the IR cost to the MIR cost as closely as possible and in practice we can get fairly close(Graph[1]).
Taking this a step further we need to estimate the cost/setup of having x amount of parameters and y outputs, as well as the register pressure from both the call and the potentially outlined function.

Parameterization Optimizations:

Machine Outliner
The Machine outliner uses exact equivalence, which does not allow for any form of parameterization.

IR Outliner
Being at the IR level requires us to tackle parameterization, which then brings several optimizations to help lower the cost of parameterizing a sequence.

Constant Folding
The IR outliner will identify constant inputs and fold them.

Congruent Input Condensing
The outliner identifies the congruent sets of parameters for a function. Example:
void fn(int, int); → void fn(int);
fn(1, 1); → fn(1);
fn(%1, %1); → fn(%1);
Parameters 1 and 2 were found to be the same for each callsite of the function, so we condensed the congruent parameters.

Input Partitioning
The outliner partitions candidates that have a parameter that can be constant folded. Example:
fn(1);
fn(1);
fn(%1);
Occurrences 1 and 2 in the above candidate can have parameter 1 folded. We create a new candidate containing just occurrences 1 and 2 as it may be more profitable than the original candidate.

Constant int condensing
The outliner identifies constant int parameters and checks to see if, for each occurrence, they are an equal distance from other constant int parameters. If so it removes all but one of the parameters and represents the others as an add from the base. Example:

void fn(int a, int b);
fn(1, 2);
fn(3, 4);

In the above, parameters 1 and 2 are always a distance of 1 apart. We can redefine our function as:
void fn(int a) {
int b = a + 1;
…
}

Register Usage:

Machine Outliner
The MO works post RA with exact equivalence, so the most it will compute is if it needs to save the link register on arm64.

IR Outliner
The IR outliner needs to compute register usage for the new outlined function as well as the usage after generating a function call with x parameters and y outputs at each program point z.

Outlining:

Machine Outliner
At the MIR level we clone the outlined instructions into a new function, create some prologue/epilogue for the function, and then generate a call.

IR Outliner
At the IR level we also have to handle the parameters/outputs of the candidate. Here we need to merge all of the metadata of outlined instructions/outlined functions. We also need to identify congruent sets of parameters between call sites and then folding the amount of parameters that are needed for the call.

Suffix Array vs Suffix Tree+LCP:

The two structures should compute the same result, but there is a non obvious benefit that we get from the suffix array. With the suffix array approach we identify candidates that shares common occurrences albeit with a different length. This is very useful for complex verification/analysis, e.g. at the IR or pre RA level. This allows us to cache the work when we calculating inputs or verifying the internal inputs of occurrences. Although this won’t be an issue if/when we switch to a common interface for candidate selection.

---- A replacement for the Machine Outliner? Not exactly ----

The IR outliner was never intended as a replacement for the machine outliner and the two can coexist. The outliners tend to catch very different cases: the machine outliner tends to favor very small candidate lengths. Using a build of llvm-tblgen, the machine outliner gets ~52% of its benefit from outlined functions of 2-3 instructions. The IR outliner tends to favor large candidate lengths(2-20+), often composed of function calls. 52% of the benefit for the IR outliner in the llvm-tblgen example is found in outlined functions with final lengths up to 17. Data for example runs of both can be found in the graph data file and is summarized in Graph[2].

Included in the performance data are metrics showing the performance of using both the IR outliner and machine outliner. The data indicates that you can achieve up to, and exceed, 2% reduction of both geomean and total size by using both.

---- Pros/Cons of IR----

The current algorithm is implemented at the IR level, but there are trade offs to placing this transformation anywhere in the pipeline(IR/preRa/postRA).

– Less Precise Cost Modeling:
Being at the IR level creates a need to estimate the size cost of any given instruction.

How much does this imprecision affect the benefit estimation?

Included in the data : Graph[1]: is the difference between our estimated function size and the actual size in the binary. It shows that we get very close and tend to be on the conservative side.

Estimation causes the IR outliner to be conservative. Which means that we are losing out on potential benefit by overestimating cost.

– Higher Level of Abstraction:

The outliners are essentially string matching algorithms. Being at a higher level of abstraction naturally gives more opportunities for equivalence. As an example, call instructions are handled naturally at the IR level.

Will a preRA outliner be able to have the same relaxation in congruence matching? E.g will it be able to match tail and non tail function calls?

Being at the IR level means that we lose out on some instruction lowering idioms, e.g. constant expressions, bitwise rotation([shl, lshr, or] → [rot]), etc.

This is evident in the results for test suite for aarch64, in which the machine outliner outperforms the IR outliner due in part to the large amount of global accesses in the tests.

– Maintainability:

The IR level in general is much more maintainable.

Why so? How did you measure that? What is your measure for “maintainable?"

We don’t have to be as conservative about certain ABI characteristics. This allows for the IR outliner to work without the need for any extra work(special options) from the users. For example, the machine outliner requires ‘noredzone’ but the IR outliner does not.

– Pipeline Flexibility:

As shown in the performance data below, we can get up to 2x performance by working pre function simplification. Though working pre simplification means the outliner must gamble between the benefits of outlining vs simplification.

– Loss of control:

The machine level can have more control over the outlining process. We could have optimized parameterization, alignment handling, etc.

---- Adapting the algorithm to pre-RA IR ----
The analysis portion of the IR outliner is already IR agnostic for the most part. It works on indices into the congruency vector for instructions and their inputs/outputs. This would mean that a preRA outliner would only have to define the MIR specific portions: Congruency detection, cost analysis, parameter/output optimizations, and the outlining of beneficial candidates.

That should apply the other way around too: take the MO outliner and adapt. No?

Daniel_Berlin1 · September 22, 2017, 2:45am

--- Algorithmic differences with the Machine Outliner ----

          There was a lot of confusion on how exactly the algorithm I am
proposing differs from what is available in the Machine Outliner. The
similarities of the two outliners lie in the usage of a string matching
algorithm and candidate pruning. The first step in the algorithm is to
basically do the same common substring / pruning algorithm the post-RA MO
uses but with a specially chosen congruence relation. I’d like to delve
into the differences between the two:

Congruence Detection:

- Machine Outliner

  The machine outliner has the advantage of having this problem already
taken care of by the register allocator, it simply checks to see if the two
machine instructions are identical.

- IR Outliner

  In the IR outliner we work on semantic equivalence, i.e. we care the
operations being performed are equivalent and not the values. This creates
a need to add verification that we do have exact equivalence when we need
it, e.g. ShuffleVector’s shuffle mask, not taking the address of InlineASM,
etc.

A quick example of semantic equivalence:

%1 = add i32 1, i32 2

%2 = add i32 2, i32 3

These two instructions are not identical because the values of the
operands are not identical. They are, however, semantically equivalent
because they both perform the add operation.

This can be seen by simply removing the operand values used in the
calculations:

%1 = add i32 , i32

%2 = add i32 , i32

FWIW, you could use the core NewGVN structures (GVNExpression.h) to do

this and just not fill in the operands.
GVNSink does a variant of this by using them.
In particular, the variant it does is: "do these instructions contribute to
their uses in an equivalent way". This is the same problem, but if you
weren't going to be willing to add any function arguments to fill in
operand values.

IE

/// [ %a1 = add i32 %b, 1 ] [ %c1 = add i32 %d, 1 ]
/// [ %a2 = xor i32 %a1, 1 ] [ %c2 = xor i32 %c1, 1 ]
/// \ /
/// [ %e = phi i32 %a2, %c2 ]
/// [ add i32 %e, 4 ]

These would value number differently using a normal value numbering
algorithm.
GVNSink instead computes "could i common a1 and c1, could i do that by
adding a phi".

(Though i realize now the computation it performs can now be performed in a
single pass. It's just the reverse of what we do in NewGVN. We look for
things we can convert into phi of ops, it looks for things it can convert
into op of phis to save instructions)

River707 · September 22, 2017, 3:02am

Hey Gerolf,

River707 · September 22, 2017, 3:09am

Hey Dan,
The intent from the beginning was to use the NewGVN infrastructure to help do the numbering, partially inspired by your responses to the original Machine Outliner RFC (That is actually how this started to some degree), but the infrastructure wasn’t quite ready when I started.

Thanks,
River Riddle

andreybokhanko · September 25, 2017, 12:00pm

Hi River,

Sorry for a late response – busy with other things.

I believe this is a well prepared follow-up – indeed, you covered a lot of contention points from the previous discussion. I know you’ll be giving a talk on the upcoming DevMeeting – looking forward visiting it! (BTW, the DevMeeting is an excellent opportunity to meet with people who asked questions / expressed criticism on your RFC and check that their concerns are fully covered.)

A few notes:

One big change from the prior RFC is that you expressly elaborated on why both IR and MI outliners should co-exist. Given this comments on which of two is more maintainable are probably not relevant and just serve as a source of more (needless) contention.
I would still love to see a single specific real-world (e.g. from the benchmarks) example of a case covered by MI but not IR outliner with an explanation of why IR outliner can’t do it (why we can’t estimate profitability on IR level). But this is really “nice to have”, not “must have”.
IMHO, the obvious common part (analysis) that many people suggested to implement in a generic way to re-use between IR/MI outliners, can be done as a second step. We should really start by accepting IR outliner in its current form. But I had this opinion already, so better to check with others during DevMeeting.

Yours,

Andrey

ghoflehner · September 27, 2017, 12:24am

Hey Gerolf,

Is this how you propose to move forward with the commit? Start with the utilities, then work with Jessica on the MO port, and finally work on IR equivalence, signatures etc.? This would seem like a sensitive approach to me, build upon what is there already and thus demonstrate the robustness of the new framework

I understand your urge since the problems are hard and interesting, but they are also solvable. Sharing (big) parts of the source base among the outliners reduces maintenance overhead. This is easier to achieve when tackled from the start.

Jessica_Paquette1 · September 27, 2017, 4:28pm

I think that, given previous discussion on the topic, we might want a split like this:

(1) Search structure

Suffix tree or suffix array.

(2) Numbering/mapping/congruence scheme

Every outliner should implement a function that maps instructions (or whatever structure you want to outline, crazy thoughts…) to integers. That should be passed to the search structure, which will return a list of repeated sequences.

The MachineOutliner currently has an “InstructionMapper” struct in it. I think we can probably build on that struct so that we can choose different mapping schemes.

(3) Cost model/candidate selection scheme

Every outliner should implement a function that returns the outlining cost or benefit of a candidate. In the MachineOutliner, this is a joint effort between the pass and the target.

(4) Outlining scheme

Every outliner should implement a function that actually performs outlining. That is, we should have a function that replaces each instance of a repeated sequence of instructions with a call to a newly-created function. In the MachineOutliner, the method by which we do this is decided on by the target.

So, we at the very least might want an interface that implements something like

unsigned mapInstructions(…);
unsigned getOutliningCost(…);
unsigned findCandidates(…);
bool outline(…);

The MachineOutliner is almost set up like this. So, I think it should be pretty easy to build the split this way without a lot of adaptation. The only place I can see it being somewhat tricky is maybe with all of the target-specified information in the MachineOutliner. I think this is pretty similar to what River has already suggested, though.

I think it would probably be easier to move forward by pulling stuff out of the MachineOutliner instead of replacing all of that code with new stuff. However, that being said, if there’s an improvement to both passes by changing the infrastructure, I’m not opposed to experimenting with changes!

It would probably be easier to try to mostly pull stuff from the MachineOutliner though, since it’s already been in tree for a while and has been tested quite a bit.

Jessica

Davide_Italiano · September 27, 2017, 10:23pm

Hi Jessica, I tend to agree we should try to maximize code reuse in
this context, and I think yours is a good step forward. In particular,
I'm particularly confident the `findCandidates()` bits should be
split.
That said, do we really want encapsulate the logic for finding
candidates into an interface? It's unclear whether it should live but
it seems much more akin to the stuff living in Transforms/Utils than a
proper interface.
So, IMHO it's much more suitable for an helper/utility.

For what it concerns `mapInstructions()` [please correct me if I'm wrong].
This bit of the interface should be responsible for value numbering
the instructions, as far as I can tell.
To the best of my understanding the way your outliner value numbers
using a nominal approach, i.e. two instructions get the same value
number if their string representation is the same (or something like
that).

With the notion above,
xorl %eax %eax
movl %eax, $0

aren't really considered equivalent (although, in practice, they are).

The way I imagined outlining to work at the IR level (and how I once
discussed with Dan) is leveraging the result of a value number
analysis (NewGVN) to assign numbers.

While the problem NewGVN solves is that of finding all the Herbrand
equivalences (structural equivalence, same operation and operand
structurally congruent), it also does a canonicalization step calling
InstSimplify to canonicalize (so, in this sense it catches more
stuffs, e.g.:

%x = add %a, 0
%y = sub %b, 0
not equivalent

instSimplify(%y) -> add %a, 0
now they're equivalent.

Note: The current IR outlliner doesn't really use the approach I just
described, yet, but it's not hard to imagine extending (or rewriting
it to use a real Global Value number analysis to get instructions that
are equivalent, for some definition of structural equivalency)

So, I wonder whether you plan to enhance the logic for value numbering
in your current scheme. If you do, and you plan to share more code
between IR and MI outliner, then it's a good idea I guess to have a
common interface (I'd love to hear your thoughts on this one).

About the `getOutliningCost()` API, what do you think to do in the MI outliner?
Currently, if I remember correctly, (almost) all instructions have
unit cost except for something.
It's not unreasonable to think this model will evolve. A relatively
obvious improvement might be that of taking in account the length
encoding for the target architecture when outlining. The IR outliner
uses TargetTransformInfo to do estimation costs. I think we may want
to evaluate whether (or not) it's feasible to have a base class for
the cost model that doesn't introduce too much boilerplate, or whether
we should bite the bullet and pay a small cost in duplication for this
logic across the two outliners.

Thanks,

Jessica_Paquette1 · September 27, 2017, 11:01pm

Hi Davide,

Thanks! I think that this is a really great thing to be pushing forward in general.

That said, do we really want encapsulate the logic for finding
candidates into an interface? It’s unclear whether it should live but
it seems much more akin to the stuff living in Transforms/Utils than a
proper interface.
So, IMHO it’s much more suitable for an helper/utility.

This sounds good to me. We could probably simplify it down to a utility that either

Finds all the repeated sequences in the program and returns them to the given outliner, which then applies its cost model to it
Takes in a (possibly optional) cost model function, finds all repeated sequences that satisfy that model, and then returns those sequences. If no model is provided, then it will return all sequences.

For what it concerns mapInstructions() [please correct me if I’m wrong].
This bit of the interface should be responsible for value numbering
the instructions, as far as I can tell.

That’s correct.

To the best of my understanding the way your outliner value numbers
using a nominal approach, i.e. two instructions get the same value
number if their string representation is the same (or something like
that).

Yep. They have to match exactly.

So, I wonder whether you plan to enhance the logic for value numbering
in your current scheme. If you do, and you plan to share more code
between IR and MI outliner, then it’s a good idea I guess to have a
common interface (I’d love to hear your thoughts on this one).

Absolutely. There are plenty of opportunities that the MachineOutliner misses due to simple things like register allocation, on top of things like the example you gave. A more powerful scheme, and moving the MachineOutliner pre-regalloc would be highly beneficial. Having a common interface here would be great. (This would be a really interesting way to analyze the overall structure of programs as well.)

Jessica

Matthias_Braun3 · September 28, 2017, 1:07am

I would also note that we already have InstCombine, GVN and other passes. Right now I see little value in teaching the outliner about those equivalences if we can just rely on earlier passes to normalize. This case also doesn’t look like a pass ordering problems as I don’t see the outliner enabling further optimizations in other passes.

Matthias

Davide_Italiano · September 28, 2017, 1:22am

I'm in agreement with you, for the outliner working in the mid-level optimizer.
FWIW, if you get the congruence classes formed by a value numbering
analysis you should get these for free.
Teaching the outliner how to catch congruences as-it-goes instead of
relying on an analysis is madness, and not really fruitful, IMHO. My
question is more how you can get the same effect when you're working
in the backend as the representation you're working on doesn't really
contain that much information.

I guess the answer it's just it's really hard without a Global VN to
understand if `xor %eax, %eax` and `andl %eax, $0` are the same, and
you don't have that information that far in the pipeline.
I'm not an expert in code generation so I may be completely off on
this one, but I assume that you somehow rely on SelectionDAG (e.g.
DAGCombine) to do the canonicalization similarly to what you would do
with InstCombine in the IR optimizer, and cases where you emit
equivalent instructions that have different encodings is relatively
rare, (in the case above, you always emit the XOR idiom).

Matthias_Braun3 · September 28, 2017, 1:35am

Yes SelectionDAG does some normalization and hopefully equivalent things will get pattern matched to the same instructions.
But in the end CodeGen is usually a place where we neither have a good understanding of instruction semantics and where transforming instructions is really hard as you typically already spent several passes doing resource allocation (registers, stack space, …) and constraint handling for the concrete instructions, changing them may bring new constraints/require new resources that you just cannot handle late in the pipeline. Taking your example: Switching MOV %eax, $0 to XOR eax, eax would suddenly start setting the eflags register which at the very least needs legality checks, or ways to spill/restore it, …

For the sake of maintainability and keeping a clean separation of concerns into passes we should not try to do too much that late in the CodeGen Pipeline.

Matthias

Daniel_Berlin1 · September 28, 2017, 2:05am

They normalize, but at least NewGVN (and GVNSink, which is closer to how the outliners work) there are opportunities it can discover for outlining that won’t be CSE’d.

In particular, if you have two groups of instructions with the same value, but one doesn’t dominate the other, you could outline them, but we wouldn’t eliminate them.

(We could speculate them in some cases if we wanted to get to one copy in a single function, but they can always be outlined into a function call).

preames · February 23, 2018, 5:16am

I’ll just note that this really sounds like an iterator over sequences. Could you arrange it that way? That would allow the cost modelling to be done externally to the sequence enumeration while still allowing early exits.

Topic		Replies	Views
[RFC] Interprocedural MIR-level outlining pass LLVM Dev List Archives	37	134	November 14, 2016
[RFC] Enhanced Machine Outliner – Part 1: FullLTO (Part 2: ThinLTO / NoLTO to come) Code Generation	4	157	May 2, 2024
[MachineOutliner][ARM][RFC] Add Machine Outliner support for ARM LLVM Dev List Archives	0	82	January 22, 2019
Path forward on profile guided inlining? LLVM Dev List Archives	1	78	September 1, 2015
FW: [RFC] Abstract Parallel IR Optimizations LLVM Dev List Archives	1	100	June 7, 2018

[RFC] PT.2 Add IR level interprocedural outliner for code size.

Related Topics