[RFC] Add IR level interprocedural outliner for code size.

I’m River, a compiler engineer at PlayStation. Recently, I’ve been working on an interprocedural outlining (code folding) pass for code size improvement at the IR level. We hit a couple of use cases that the current code size solutions didn’t handle well enough, and outlining is one of the avenues that seemed potentially beneficial.

– Algorithmic Approach –

The general implementation can be explained in stages:

  • Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon relaxed equivalence constraints. For most instructions this is simply the type, opcode, and operand types. The constraints are tightened for instructions with special state that require more exact equivalence (e.g. ShuffleVector requires a constant mask vector). Candidates are then found by constructing a suffix array and an LCP (longest common prefix) array. Walking the LCP array gives us congruent chains of instructions within the module.
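The selection stage can be sketched as follows. This is an illustrative reconstruction, not the pass’s actual implementation: instructions are abstracted to congruence-class IDs, and the suffix array is built with a deliberately naive O(n² log n) sort (a real implementation would use a faster algorithm such as SA-IS):

```python
def build_suffix_array(ids):
    """Naive suffix-array construction over a sequence of congruence-class
    IDs: sort suffix start positions by the suffix contents."""
    return sorted(range(len(ids)), key=lambda i: ids[i:])

def build_lcp(ids, sa):
    """lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while a + n < len(ids) and b + n < len(ids) and ids[a + n] == ids[b + n]:
            n += 1
        lcp[i] = n
    return lcp

# Seven instructions whose congruence classes form the sequence A B C x A B C:
# the chain A B C repeats, so it surfaces as a long common prefix between
# two adjacent suffixes in the sorted order.
ids = list("ABCxABC")
sa = build_suffix_array(ids)
lcp = build_lcp(ids, sa)
print(max(lcp))  # 3: a repeated, congruent 3-instruction chain (an outlining candidate)
```

Walking the LCP array in this fashion yields every repeated chain, each one a candidate for the analysis stage below.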

  • Candidate Analysis:

A verification step splits candidates whose occurrences have different internal input sequences or incompatible parent-function attributes. An example of incompatible internal input sequences is:

Occurrence 1:    Occurrence 2:
X = W + 6;       X = W + 6;
Y = X + 4;       Y = W + 4;

The above two occurrences would need special control flow to exist within the same outlined function.

During analysis, candidates have their inputs and outputs computed along with an estimated benefit from extraction. During input calculation, a constant-folding step removes all inputs that are the same amongst all occurrences.
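A minimal sketch of that constant-folding step, assuming each occurrence’s inputs are given as a plain list (the representation is invented for illustration):

```python
def fold_constant_inputs(occurrence_inputs):
    """Drop every input position whose value is identical across all
    occurrences of a candidate; those values can be baked into the outlined
    function body instead of being passed as parameters."""
    positions = range(len(occurrence_inputs[0]))
    keep = [p for p in positions
            if len({occ[p] for occ in occurrence_inputs}) > 1]
    return [[occ[p] for p in keep] for occ in occurrence_inputs]

# Both occurrences add the constant 6 as their second input, so that input
# is folded away and only the varying first input remains a parameter.
print(fold_constant_inputs([["W", 6], ["V", 6]]))  # [['W'], ['V']]
```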

  • Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that picks occurrences starting from the most beneficial candidates.
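The pruning step might look roughly like this; the interval representation and field names are assumptions for illustration, not the pass’s actual data structures:

```python
def prune_overlaps(candidates):
    """Greedy pruning: visit candidates from most to least beneficial and
    keep each occurrence (a [start, end] instruction interval) only if it
    doesn't overlap an already-claimed interval. A candidate survives only
    if at least two of its occurrences survive (outlining one occurrence
    cannot shrink the code)."""
    taken = []  # intervals already claimed by a more beneficial candidate
    kept = []
    for cand in sorted(candidates, key=lambda c: c["benefit"], reverse=True):
        occs = [(s, e) for (s, e) in cand["occurrences"]
                if all(e < ts or s > te for (ts, te) in taken)]
        if len(occs) >= 2:
            taken.extend(occs)
            kept.append((cand["name"], occs))
    return kept

cands = [
    {"name": "big", "benefit": 10, "occurrences": [(0, 4), (10, 14)]},
    {"name": "small", "benefit": 3, "occurrences": [(2, 3), (20, 21)]},
]
# "small" loses its (2, 3) occurrence to "big" and, left with a single
# occurrence, is pruned entirely.
print(prune_overlaps(cands))  # [('big', [(0, 4), (10, 14)])]
```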

  • Candidate Outlining:

Non-pruned candidates are then outlined. Outputs from a candidate are returned via an output parameter, except for one output that is promoted to a return value. During outlining, the inputs to the candidate are condensed by computing the equivalences between the arguments at each occurrence. An example of this is:

outlinedFn(1,6,1); → outlinedFn(1,6);

outlinedFn(2,4,2); → outlinedFn(2,4);

In the above, parameters 1 and 3 were found to be equivalent at all occurrences, so the number of inputs was reduced to two.
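A hedged sketch of that condensing step over the call-site argument lists (the representation is invented for illustration):

```python
def condense_arguments(call_args):
    """Parameter positions whose arguments are equal at every occurrence can
    share a single parameter inside the outlined function."""
    n = len(call_args[0])
    rep = list(range(n))  # representative position for each parameter
    for i in range(n):
        for j in range(i):
            if all(args[i] == args[j] for args in call_args):
                rep[i] = rep[j]  # position i always mirrors position j
                break
    keep = [i for i in range(n) if rep[i] == i]
    return [[args[i] for i in keep] for args in call_args]

# Matches the example above: parameters 1 and 3 are equal at every call site,
# so the three inputs condense down to two.
print(condense_arguments([[1, 6, 1], [2, 4, 2]]))  # [[1, 6], [2, 4]]
```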

  • Debug Info:

Debug information is preserved for the calls to functions which have been outlined, but all debug info from the original outlined portions is removed, making them harder to debug.

  • Profile Info:

If the pass is running at Os, the outliner will only consider cold blocks, whereas at Oz it considers all blocks that are not marked as hot.
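That gating can be sketched as below; the hotness labels are invented for illustration (the real pass would query profile/frequency information):

```python
def may_outline_from(block_hotness, opt_level):
    """Profile gating as described above: at Os only provably-cold blocks
    are considered; at Oz, everything not marked hot is fair game.
    block_hotness is one of "hot", "cold", "unknown" (hypothetical labels)."""
    if opt_level == "Os":
        return block_hotness == "cold"
    return block_hotness != "hot"  # Oz

# An unprofiled ("unknown") block is skipped at Os but considered at Oz.
print(may_outline_from("unknown", "Os"), may_outline_from("unknown", "Oz"))  # False True
```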

  • Location in Pipeline:

The pass is currently configured to run very late in the optimization pipeline. It is intended to run at Oz, but will also run at Os if profile data is available. The pass can optionally be run twice: once before function simplification and then again at the default location. This second run is optional because it trades the potential benefits of early redundancy elimination against the potential benefits of function simplification, which can lead to large benefits or regressions depending on the benchmark (though the benefits tend to outnumber the regressions). The performance section contains data for both configurations on a variety of benchmarks.

– Why at the IR level –

The decision to place the outliner at the IR level comes from a couple of major factors:

  • Desire to be target independent

  • Most opportunities for congruency

The downside to having this type of transformation at the IR level is reduced accuracy in the cost model: we can somewhat accurately model the cost per instruction, but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, so to help alleviate it there are several tunable parameters for the cost model.

– Performance –

More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below.

  • Size Reduction:
  • Test Suite (X86_64):

  • Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%.

  • Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%.

  • SPEC 2006 (X86_64)

  • Early+Late outlining provides a geomean reduction of 2.08%.

  • Late outlining provides 2.09%.

  • CSiBE (AArch64)

  • Early+Late outlining shows geomean reduction of around 3.5%.

  • Late outlining shows 3.1%.

  • Compile Time:

Compile time was tested under test-suite with a multisample of 5.

  • Early+Late outlining

  • Many improvements with > 40% compile time reduction.

  • Few regressions.

  • Late outlining

  • Greatest improvement is ~7.8%

  • Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction with early outlining is that the redundancy elimination early in the optimization pipeline reduces the amount of instruction processing during the rest of the compilation.

  • Execution Time:

Ran with a multisample of 5.

  • Test Suite:

  • Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%.

  • Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%.

  • Spec:

  • Early+Late has a geomean regression of 3.5%.

  • Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.

– Tested Improvements –

  • MergeFunctions:
  • No noticeable benefit.
  • LTO:
  • LTO doesn’t have a code size pipeline, but % reductions over LTO are comparable to non-LTO.
  • Input/Output Partitioning:

This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal for the computations required.

  • Similar Candidate Merging:
  • The benefit to be gained is currently not worth the large complexity required to catch the desired cases.

– Potential Improvements –

  • Suffix & LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms, and some can construct both arrays simultaneously. We will investigate this as a potential compile-time improvement in the future.

  • Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to be a little more lax on less frequently executed blocks that aren’t cold.

  • Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see if shortening a sequence produces a larger benefit (e.g. fewer inputs/outputs). This would likely lead to an increase in compile time but should remain less than the baseline.

  • Non Exact Functions: The outliner currently ignores functions that do not have an exact definition.


  • CSiBE (Code Size Benchmark):

www.csibe.org

  • More detailed performance data:

goo.gl/5k6wsP

  • Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp

Hi River,

Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)?

It would be interesting to compare your approach with it.

Thanks,

Evgeny Astigeevich

Hi Evgeny,
I know of the current machine outliner in LLVM. If you look at the “More detailed performance data” section at the end, it includes performance comparisons to the machine outliner.
As for the algorithmic approach, the two are somewhat similar.
Machine Outliner:

  • Builds a suffix tree based on identical equivalence between machine instrs.
  • Uses target specific cost model for benefit estimation.
  • Uses a greedy algorithm for candidate pruning.
  • Currently supports X86, AArch64 targets.
  • Requires NoRedZone on supported targets.

IR Outliner:

  • Builds a suffix array based upon relaxed equivalence between IR instrs.
  • Allows for inputs/outputs from an outlined function.
  • Complex verification of outlining candidates due to the above two points (relaxed equivalence, inputs/outputs).
  • Tunable, generic, target-independent cost model (estimates constant-folded inputs, output condensing, etc.). Note it uses the target info available (TargetTransformInfo).
  • Uses a greedy interval-based algorithm for candidate pruning.
  • Supports profile data.
  • Supports emitting opt remarks.
  • Target independent, thus no required ABI constraints or hooks.

Though some parts of the approach are similar, the implementation is very different. Also of note, the IR outliner can be expanded to outline from multiple basic blocks; I’ve already prototyped the infrastructure for outlining regions, so it’s possible.

Thanks,
River Riddle

Hi River,

Very impressive! – thanks for working on this.

A few questions, if you don’t mind.

First, on results (from goo.gl/5k6wsP). Some of them are quite surprising. In theory, “top improvements” should be quite similar in all three approaches (“Early&Late Outlining”, “Late Outlining” and “Machine Outliner”), with E&LO capturing most of the cases. Yet, they are very different:

Test Suite, top improvements:
E&LO:

  • enc-3des: 66.31%

  • StatementReordering-dbl: 51.45%

  • Symbolics-dbl: 51.42%

  • Recurrences-dbl: 51.38%

  • Packing-dbl: 51.33%

LO:

  • enc-3des: 50.7%

  • ecbdes: 46.27%

  • security-rjindael:45.13%

  • ControlFlow-flt: 25.79%

  • ControlFlow-dbl: 25.74%

MO:

  • ecbdes: 28.22%

  • Expansion-flt: 22.56%

  • Recurrences-flt: 22.19%

  • StatementReordering-flt: 22.15%

  • Searching-flt: 21.96%

SPEC, top improvements:
E&LO:

  • bzip2: 9.15%

  • gcc: 4.03%

  • sphinx3: 3.8%

  • H264ref: 3.24%

  • Perlbench: 3%

LO:

  • bzip2: 7.27%

  • sphinx3: 3.65%

  • Namd: 3.08%

  • Gcc: 3.06%

  • H264ref: 3.05%

MO:

  • Namd: 7.8%

  • bzip2: 7.27%

  • libquantum: 2.99%

  • h264ref: 2%

Do you understand why so?

I’m especially interested in cases where MO managed to find redundancies while E&O+LO didn’t. For example, 2.99% on libquantum (or is it simply below the “top 5 results” for E&O+LO?) – did you investigate this?

Also, it would be nice to specify full options list for SPEC (I assume SPEC CPU2006?), similar to how results are reported on spec.org.

And a few questions on the RFC:

Hi Andrey,
Questions and feedback are very much welcome.

  • The explanation as to why the improvements can vary between the IR and MIR outliners mainly boils down to the level of abstraction each works at. The MIR level has very accurate heuristics and is effectively the last post-ISel target-independent codegen pass. The IR outliner, on the other hand, has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results; it’s important to remember the differences that being at a higher abstraction level can cause.
  • As for the SPEC (it is CPU2006, sorry I missed that) command line options, as well as for any other benchmark, I only added “-Oz -mno-red-zone (to keep comparisons fair) -(whatever enables each transform)” to the default flags needed for compilation. I’ll try to get the exact command line options used and add them.
  • Debug information (line numbers) is currently only kept if it is the same across all occurrences. This was simply a design choice and can be changed if keeping one instance is the desired behavior.
  • The behavior described with the profile data is already implemented; I will update the numbers to include the results after including PGO data.
  • The LTO results are very experimental given that there isn’t a size pipeline for LTO yet (there should be). The % improvements can be similar to non-LTO, but because the LTO binary is generally much smaller, the actual decrease in size is also much smaller. I’ll add more detailed LTO numbers as well.

Thanks,
River Riddle

River,

Thanks for the reply!

On 23 July 2017, at 1:05, River Riddle <riddleriver@gmail.com> wrote:

- The explanation as to why the improvements can vary between the IR and MIR outliner mainly boil down to the level of abstraction that each are working at. The MIR level has very accurate heuristics and is effectively the last post ISel target independent code gen pass. The IR outliner on the other hand has more estimation in the cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results.

To clarify, I'm surprised not with % differences (this is understandable), but with differences in what benchmarks got improved. It seems odd that MO, working on lower abstraction level, managed to find redundancies (say, in libquantum) that EO+LO missed. But indeed -- perhaps EO+LO cost model just considered these cases to be non-profitable. It would be interesting to know precisely.

Yours,
Andrey

Andrey,
Currently it will not catch this case because congruence is determined by types/opcode/etc., but cases like this can be caught by redefining what it means to be congruent. We can redefine congruence for add/sub/icmp/gep/etc. to no longer care about types, or even opcodes, in certain cases, but this may create the need for extra control flow.

As for your example, the machine outliner can handle the case when the addition amount and register are the same:
int x = (int) x2 + 4;
int p = (int)p2 + 1;

If we do relax the congruency restrictions, being at the IR level allows us to identify that we could outline more than just the simple case. Assuming sizeof(int) == sizeof(int *) == sizeof(long long *) == sizeof(char *):

int x = (int)x2 + 1;
int *p = (int *)p2 + (int)ipidx;
long long *lp = (long long *)lp2 + 3;
char *cp = (char *)cp2 + (int)cpidx;

– could outline to –

int x = outlinedFn(1, x2, 1);
int *p = (int *)outlinedFn(4, (int)p2, ipidx);
long long *lp = (long long *)outlinedFn(8, (int)lp2, 3);
char *cp = (char *)outlinedFn(1, (int)cp2, cpidx);

int outlinedFn(int SizePatchup, int Var, int Amount) {
  return Var + (SizePatchup * Amount);
}

In the above we need some extra patch-up to account for the different sizes of the pointee types.

This is just one opportunity that can be caught when we start to redefine equivalence, something that is really powerful at the IR level. We wanted the initial submission to have a compact and efficient implementation. Extensions like these can easily be added, but they are not part of the initial design.
Thanks,
River Riddle

Hi River,

Thank you for the information. Do you have your code in Phabricator?

As I understand, your pass always runs after the Inliner. At the moment the Inliner is too aggressive at Os/Oz. I am trying to tune Inliner heuristics for Oz. It would be interesting to see if there are any code size improvements when the Inliner is disabled. This would demonstrate whether the Outliner can improve code size when duplicated code is not due to the Inliner. Another problem I have is that C++ programs get code size regressions due to my changes in the inline heuristics. Have you seen this problem?

Thanks,

Evgeny Astigeevich

Hi Evgeny,
The code hasn’t been posted to Phabricator yet, we wanted to get feedback on the current design and approach first. The code is forked on GitHub currently and a link was included in the original email. As for the inliner having an effect on the potential benefits, I just rebuilt llvm-tblgen with “-fno-inline -fno-inline-functions” to see if there was any change in results:

  • The inlining-enabled Oz binary was ~20% smaller than the no-inline binary.
  • LO still gave ~2% benefit (down from 2.5%).
  • E+LO gave ~2.5% benefit (down from 3.82%).

During testing I have noticed that E+LO functions often have other functions inlined into them, which possibly leads to some of the benefit observed. I haven’t really played around with the inlining heuristics myself, so I likely won’t be able to help much further in that regard.

Thanks,
River Riddle

Hi River,

I’m working on the MachineOutliner pass at the MIR level. Working at the IR level sounds interesting! It also seems like our algorithms are similar. I was thinking of taking the suffix array route with the MachineOutliner in the future.

Anyway, I’d like to ask about this:

Hi Jessica,
The comparison to the inliner is an interesting one, but we think it’s important to note the difference in the use of heuristics. The inliner is juggling many different tasks at the same time (execution speed, code size, etc.), which can cause the parameters to be very sensitive depending on the benchmark/platform/etc. The outliner’s heuristics are focused solely on the potential code size savings from outlining, and are thus only sensitive to the current platform. This only creates a problem when we overestimate the potential cost of a set of instructions for a particular target.

The cost model parameters are only minimums: instruction sequence length, estimated benefit, occurrence amount. The heuristics themselves are conservative and based upon all of the target information available at the IR level; the parameters just set a lower bound to weed out outliers.

You are correct that being at the machine level, before or after RA, will give the most accurate heuristics, but we feel there’s an advantage to being at the IR level. At the IR level we can do many more things that are either too difficult or too complex for the machine level (e.g. parameterization, outputs, etc.). Not only can we do these things, but they are available on all targets immediately, without the need for target hooks.

The caution on the use of heuristics is understandable, but there comes a point when trade-offs need to be made. We made the trade-off of a loss in exact cost modeling to gain flexibility, coverage, and potential for further features. This trade-off is the same one made for quite a few IR-level optimizations, including inlining. As for the worry about code size regressions, so far the results seem to support our hypothesis.
Thanks,
River Riddle

Hi River,

Hi Jessica,
The comparison to the inliner is an interesting one but we think it’s important to note the difference in the use of heuristics. [...] As for the worry about code size regressions, so far the results seem to support our hypothesis.

Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO. Target hooks are cheap and choosing an implementation because it is simpler might not be the right long term solution.
At the very least, to know what trade-off we are making, having prototypes with the different approaches sounds sensible.
In particular, all the heuristics about cost for parameter passing (haven’t checked how you did it) sounds already complex enough and would require target hooks. Therefore, I am not seeing a clear win with an IR approach here.

Finally, having heuristics solely focused on code size does not seem realistic. Indeed, I am guessing you have some thresholds to avoid outlining pieces of code so small that they would end up adding a whole lot of indirection, and I don’t like magic numbers in general :).

To summarize, I wanted to point out that an IR approach is not as a clear win as you describe and would thus deserve more discussion.

Cheers,
-Quentin

Hi River,

I think outlining at the IR level certainly has a lot of potential. I also think that working at a more generic representation such as IR opens lots of doors for interesting code size improvements; however, I’m still iffy on heuristics in general.

The inliner is juggling many different tasks at the same time, execution speed, code size, etc. which can cause the parameters to be very sensitive depending on the benchmark/platform/etc.

Strongly agree here; it’s unclear what a heuristic should look like for an inliner.

The outliners heuristics are focused solely on the potential code size savings from outlining, and is thus only sensitive to the current platform.

This is where I get a little concerned.

Firstly, as previously mentioned, we don’t know how an instruction will be lowered. So, if we say that each IR instruction has a cost of “1 somethings”, we haven’t really solved a code size problem, but more of a code reduction problem. That is, we’ve performed a reduction to the overall structure of the program, but we don’t know that that overall structural reduction will produce an equivalent code size reduction. I personally would be inclined to figure out how far off such a cost model could be, given a known architecture.

Secondly, suppose we have a cost model like in the inliner, where we have some heuristic constants which are employed by some cost model. If we change the model, then how do we know our heuristic constants are still good? Even if we tune these constants for some specific target to try and make good decisions, if we change the outliner at all, then we have to retune the heuristics. Changes in later passes could also introduce the need for retuning.

Finally, the big issue here is that with code size problems, or at least with outlining, we’re trying to solve something which is inherently tied to the architecture we’re working with. A good outlining decision for x86 will look different from a good outlining decision for ARM64. The issue isn’t purely structural; the actual instructions we’re dealing with matter.

Once again, I think this stuff is really interesting, and I think your results are looking great so far. It’s really awesome to see this method being successful at a high level. I’m also making a lot of assumptions about what your cost model might look like. I’ll probably have to look at the code to make a sound judgement. :)

- Jessica

Firstly, as previously mentioned, we don’t know how an instruction will be lowered. So, if we say that each IR instruction has a cost of “1 somethings”, we haven’t really solved a *code size problem*, but more of a *code reduction problem*. That is, we’ve performed a reduction to the overall structure of the program, but we don’t know that that overall structural reduction will produce an equivalent code size reduction. I personally would be inclined to figure out how far off such a cost model could be, given a known architecture.

“code size” vs. “code reduction” seems like a really nice way to think about this. I don’t know a lot about the areas in discussion, but one thing that occurred to me while reading this was: a code reduction couldn’t produce a worse code size than without the reduction, right?

— Meador

Hi Quentin,
I appreciate the feedback. When I reference the cost of target hooks, it’s mainly the maintainability cost on a target author: we want to keep the intrusion into target information minimized. The heuristics used by the outliner are the same as those used by any other IR-level pass seeking target information, i.e. TTI for the most part. I can see where you are coming from with “having heuristics solely focused on code size does not seem realistic”, but I don’t agree with that statement. I think there is a disconnect on heuristics: the only user-tunable parameters are the lower-bound parameters to the cost model; the actual analysis (the heuristic calculation) is based upon TTI information.

When you say “Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO,” I am slightly confused as to what you mean. I do not intend to implement this algorithm at the MIR level given that it already exists in the MachineOutliner. There are several comparison benchmarks given in the “More detailed performance data” section of the original RFC, including comparisons to the MachineOutliner when possible (I can’t build clang on Linux with the MachineOutliner). I welcome any and all discussion on the placement of the outliner in LLVM.
Thanks,
River Riddle

Hi Jessica,
When we model the cost of an instruction, it is largely based upon an estimation from TTI. It is still an estimation, but the results prove to be somewhat close to the exact costs. If you are interested in the results for some known platforms, check the “More detailed results” section of the original RFC. It provides comparisons between the two configurations of outlining at the IR level as well as the MachineOutliner. It includes results for X86_64 (Linux and Mac) as well as AArch64 (the only two platforms that the MachineOutliner supports). The results tend to be quite favorable in regards to the cost model.
As for the tunable heuristic constants, these are very similar to those that the MachineOutliner would have if it were tunable: minimum benefit, minimum occurrences, minimum instruction length. Those are the only constants that really exist within the cost model.
I appreciate all of the feedback and discussion!
Thanks,
River Riddle

(Cc llvm-dev, whoops!)

Intuitively, no, but suppose you have a sequence of instructions that look like this (in some weird pseudocode I just made up)

a
b
c
x
a
b
c

Now, this could be reduced to something that looks like this by an outliner:

call my_function
x
call my_function

my_function:
a
b
c
return

Structurally-speaking, this is a reduction.

Now, suppose we’re using an architecture like AArch64 where we care about something like a link register. Then to be correct we may have to actually emit something like this:

save special_register
call my_function
restore special_register
x
save special_register
call my_function
restore special_register

my_function:
a
b
c
return

In this case, we’ve actually produced more instructions than before. Although we’ve simplified the structure of the code, we’ve still produced more of it. So, if a high-level outliner isn’t aware of this sort of thing, you could actually end up making binaries larger by removing similarity from the code.
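The accounting above can be written out explicitly. A rough size model, purely illustrative (units are instructions; a negative delta means the binary shrank):

```python
def outlining_size_delta(seq_len, occurrences, call_overhead):
    """Outlining replaces each occurrence of a seq_len-instruction sequence
    with one call plus any per-call overhead (e.g. link-register save and
    restore), and adds one copy of the sequence plus a return."""
    before = seq_len * occurrences
    after = (1 + call_overhead) * occurrences + seq_len + 1
    return after - before

# The 3-instruction sequence (a b c) appearing twice, with a plain call:
# break-even, as in the first listing above.
print(outlining_size_delta(3, 2, 0))  # 0
# Same sequence, but each call needs a save and a restore around it:
# the binary grows by 4 instructions, as in the AArch64 listing above.
print(outlining_size_delta(3, 2, 2))  # 4
```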

- Jessica

Hi Quentin,

I really don’t have a dog in this race, but… we can theorize on the strengths and weaknesses of the different approaches ad infinitum – or just accept that “the proof is in the pudding”, as they say.

And the proof is here! – look at the numbers.

If anything, I believe it makes sense to at least accept River’s pass into the trunk and let the two approaches evolve in parallel. It is probably silly to have both enabled at the same time, so the one that currently provides the greater benefit should run at -Os/-Oz. With time, evolution will decide which one should survive – as it always does.

Yours,
Andrey

Hi River,

Hi Quentin,
I appreciate the feedback. When I reference the cost of Target Hooks it’s mainly for maintainability and cost on a target author. We want to keep the intrusion into target information minimized. The heuristics used for the outliner are the same used by any other IR level pass seeking target information, i.e TTI for the most part. I can see where you are coming from with “having heuristics solely focused on code size do not seem realistic”, but I don’t agree with that statement.

If you only want code size, I agree it makes sense, but I believe that even at Oz we probably don’t want to slow the code down by a big factor for a couple of bytes. That’s what I wanted to say, and what I wanted to point out is that you need some kind of model for performance to avoid those worst cases. Unless we don’t care :).

I think there is a disconnect on heuristics. The only user tunable parameters are the lower bound parameters(to the cost model), the actual analysis(heuristic calculation) is based upon TTI information.

I don’t see how you can get around adding more hooks to know how a specific function prototype is going to be lowered (e.g., i64 needs to be split into two registers, fourth and onward parameters need to be pushed on the stack and so on). Those change the code size benefit.

When you say “Would still be interesting to see how well this could perform on some exact model (i.e., at the Machine level), IMO.” I am slightly confused as to what you mean. I do not intend to try and implement this algorithm at the MIR level given that it exists in Machine Outliner.

Of course, I don’t expect you to do that :). What I meant is that the claim that IR offers the better trade-off is not based on hard evidence. I actually don’t buy it.

My point was to make sure I understand what you are trying to solve and, given you have mentioned the MachineOutliner, why you are not working on improving it instead of suggesting a new framework.
Don’t take me wrong, maybe creating a new framework at the IR level is the right thing to do, but I still didn’t get that from your comments.

There are several comparison benchmarks given in the “More detailed performance data” of the original RFC. It includes comparisons to the Machine Outliner when possible(I can’t build clang on Linux with Machine Outliner). I welcome any and all discussion on the placement of the outliner in LLVM.

My fear with a new framework is that we are going to split the effort for pushing the outliner technology forward and I’d like to avoid that if at all possible.

Now, to be more concrete on your proposal, could you describe the cost model for deciding what to outline? (Really the cost model, not the suffix algo.)
Are outlined functions pushed into the list of candidates for further outlining?

Cheers,
-Quentin