LLVM Inliner

Xinliang_David_Li · November 24, 2010, 1:07am

Hi, I browsed the LLVM inliner implementation, and it seems there is room for improvement. (I have not read it too carefully, so correct me if what I observed is wrong).

First the good side of the inliner – the function level summary and inline cost estimation is more elaborate and complete than gcc. For instance, it considers callsite arguments and the effects of optimization enabled by inlining.

Now more to the weakness of the inliner:

It is bottom up. The inlining is not done in the order based on the priority of the callsites. It may leave important callsites (at top of the cg) unlined due to higher cost after inline cost update. It also eliminates the possibility of inline specialization. To change this, the inliner pass may not use the pass manager infrastructure . (I noticed a hack in the inliner to workaround the problem – for static functions avoid inlining its callees if it causes it to become too big …)
There seems to be only one inliner pass. For calls to small functions, it is better to perform early inlining as one of the local (per function) optimizations followed by scalar opt clean up. This will sharpen the summary information. (Note the inline summary update does not consider the possible cleanup)
recursive inlining is not supported
function with indirect branch is not inlined. What source construct does indirect branch instr correspond to ? variable jump?
fudge factor prefers functions with vector instructions – why is that?
There is one heuristc used in inline-cost computation seems wrong:

// Calls usually take a long time, so they make the inlining gain smaller.
InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;

Does it try to block inlining of callees with lots of calls? Note inlining such a function only increase static call counts.

Thanks,

David

Reid_Kleckner · November 24, 2010, 3:40am

Hi, I browsed the LLVM inliner implementation, and it seems there is room
for improvement. (I have not read it too carefully, so correct me if what I
observed is wrong).
First the good side of the inliner -- the function level summary and inline
cost estimation is more elaborate and complete than gcc. For instance, it
considers callsite arguments and the effects of optimization enabled by
inlining.
Now more to the weakness of the inliner:
1) It is bottom up. The inlining is not done in the order based on the
priority of the callsites. It may leave important callsites (at top of the
cg) unlined due to higher cost after inline cost update. It also eliminates
the possibility of inline specialization. To change this, the inliner pass
may not use the pass manager infrastructure . (I noticed a hack in the
inliner to workaround the problem -- for static functions avoid inlining its
callees if it causes it to become too big ..)
2) There seems to be only one inliner pass. For calls to small functions,
it is better to perform early inlining as one of the local (per function)
optimizations followed by scalar opt clean up. This will sharpen the summary
information. (Note the inline summary update does not consider the possible
cleanup)
3) recursive inlining is not supported
4) function with indirect branch is not inlined. What source construct does
indirect branch instr correspond to ? variable jump?

This corresponds to functions that use GCC's labels-as-values
extension, not things like switch statements or virtual calls, so this
really only applies to things like interpreter loops, which you don't
usually want to inline.

5) fudge factor prefers functions with vector instructions -- why is that?

I'm guessing that vectorized instructions are an indicator that the
function is a hotspot, and should therefore be inlined.

6) There is one heuristc used in inline-cost computation seems wrong:

// Calls usually take a long time, so they make the inlining gain smaller.
InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;
Does it try to block inlining of callees with lots of calls? Note inlining
such a function only increase static call counts.

I'm guessing the heuristic is that functions with lots of calls are
heavy and spend a lot of time in their callees, so eliminating the
overhead of one call isn't worth it.

Reid

Xinliang_David_Li · November 24, 2010, 4:29am

Hi, I browsed the LLVM inliner implementation, and it seems there is room
for improvement. (I have not read it too carefully, so correct me if what I
observed is wrong).
First the good side of the inliner – the function level summary and inline
cost estimation is more elaborate and complete than gcc. For instance, it
considers callsite arguments and the effects of optimization enabled by
inlining.
Now more to the weakness of the inliner:

It is bottom up. The inlining is not done in the order based on the
priority of the callsites. It may leave important callsites (at top of the
cg) unlined due to higher cost after inline cost update. It also eliminates
the possibility of inline specialization. To change this, the inliner pass
may not use the pass manager infrastructure . (I noticed a hack in the
inliner to workaround the problem – for static functions avoid inlining its
callees if it causes it to become too big …)

There seems to be only one inliner pass. For calls to small functions,
it is better to perform early inlining as one of the local (per function)
optimizations followed by scalar opt clean up. This will sharpen the summary
information. (Note the inline summary update does not consider the possible
cleanup)

recursive inlining is not supported

function with indirect branch is not inlined. What source construct does
indirect branch instr correspond to ? variable jump?

This corresponds to functions that use GCC’s labels-as-values
extension, not things like switch statements or virtual calls, so this
really only applies to things like interpreter loops, which you don’t
usually want to inline.

fudge factor prefers functions with vector instructions – why is that?

I’m guessing that vectorized instructions are an indicator that the
function is a hotspot, and should therefore be inlined.

This is a reasonable heuristic.

There is one heuristc used in inline-cost computation seems wrong:

// Calls usually take a long time, so they make the inlining gain smaller.
InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;
Does it try to block inlining of callees with lots of calls? Note inlining
such a function only increase static call counts.

I’m guessing the heuristic is that functions with lots of calls are
heavy and spend a lot of time in their callees, so eliminating the
overhead of one call isn’t worth it.

Those calls may sit in a cold paths – penalizing inlining to those functions can miss opportunities.

David

Nick_Lewycky · November 24, 2010, 8:37pm

Xinliang David Li wrote:

Hi, I browsed the LLVM inliner implementation, and it seems there is
room for improvement. (I have not read it too carefully, so correct me
if what I observed is wrong).

First the good side of the inliner -- the function level summary and
inline cost estimation is more elaborate and complete than gcc. For
instance, it considers callsite arguments and the effects of
optimization enabled by inlining.

Now more to the weakness of the inliner:

1) It is bottom up. The inlining is not done in the order based on the
priority of the callsites. It may leave important callsites (at top of
the cg) unlined due to higher cost after inline cost update. It also
eliminates the possibility of inline specialization. To change this, the
inliner pass may not use the pass manager infrastructure . (I noticed a
hack in the inliner to workaround the problem -- for static functions
avoid inlining its callees if it causes it to become too big ..)

2) There seems to be only one inliner pass. For calls to small
functions, it is better to perform early inlining as one of the local
(per function) optimizations followed by scalar opt clean up. This will
sharpen the summary information. (Note the inline summary update does
not consider the possible cleanup)

Hi David,

You're correct that there's only one inliner pass, but I don't understand how an early inliner would help. In LLVM's case, after we inline one function into another, we run the entire stack of per-function optimizations on it before deciding whether to inline it into its caller. This sharpens the summary information before the inliner looks at the next SCC, as we go bottom up. Is there something more that early inlining would do for us that isn't achieved with this model?

(Actually, there is another inliner pass, but it's for __attribute__((always_inline)) so it's rather uninteresting.)

3) recursive inlining is not supported

4) function with indirect branch is not inlined. What source construct
does indirect branch instr correspond to ? variable jump?

Right. It's for "goto *ptr;".

Nick

Xinliang_David_Li · November 24, 2010, 10:17pm

Xinliang David Li wrote:

Hi, I browsed the LLVM inliner implementation, and it seems there is
room for improvement. (I have not read it too carefully, so correct me
if what I observed is wrong).

First the good side of the inliner – the function level summary and
inline cost estimation is more elaborate and complete than gcc. For
instance, it considers callsite arguments and the effects of
optimization enabled by inlining.

Now more to the weakness of the inliner:

It is bottom up. The inlining is not done in the order based on the
priority of the callsites. It may leave important callsites (at top of
the cg) unlined due to higher cost after inline cost update. It also
eliminates the possibility of inline specialization. To change this, the
inliner pass may not use the pass manager infrastructure . (I noticed a
hack in the inliner to workaround the problem – for static functions
avoid inlining its callees if it causes it to become too big …)

There seems to be only one inliner pass. For calls to small
functions, it is better to perform early inlining as one of the local
(per function) optimizations followed by scalar opt clean up. This will
sharpen the summary information. (Note the inline summary update does
not consider the possible cleanup)

Hi David,

You’re correct that there’s only one inliner pass, but I don’t understand how an early inliner would help. In LLVM’s case, after we inline one function into another, we run the entire stack of per-function optimizations on it before deciding whether to inline it into its caller. This sharpens the summary information before the inliner looks at the next SCC, as we go bottom up. Is there something more that early inlining would do for us that isn’t achieved with this model?

Interesting – LLVM does perform on the fly cleanups during inlining transformation – this will make summary update precise. One thing I notice from the debug pass dump is that the ‘deduce function attribute’ pass happens before the clean up – Is it intended?

Thanks,

David

Chris_Lattner · November 28, 2010, 10:37pm

Hi, I browsed the LLVM inliner implementation, and it seems there is room for improvement. (I have not read it too carefully, so correct me if what I observed is wrong).

First the good side of the inliner -- the function level summary and inline cost estimation is more elaborate and complete than gcc. For instance, it considers callsite arguments and the effects of optimization enabled by inlining.

Yep, as others pointed out, this is intended to interact closely with the per-function optimizations that get mixed in due to the inliner being a callgraphscc pass. This is actually a really important property of the inliner. If you have a function foo that calls a leaf function bar, the sequence of optimization is:

1. Run the inliner on bar (noop, since it has no call sites)
2. Run the per-function passes on bar. This generally shrinks it, and prevents "abstraction penalty" from making bar look too big to inline.
3. Run the inliner on foo. Since foo calls bar, we consider inlining bar into foo and do so if profitable.
4. Run the per-function passes on foo. If bar got inlined, this means that we're running the per-function passes over the inlined contents of bar again.

In a traditional optimizer like GCC's, you end up with problems where you have to set a high inline threshold due to inlining-before-optimizing causing "abstraction penalty problems". An early inliner is a hack that tries to address this. Another problem with this approach from the compile time perspective is that you end up repeating work multiple times. For example, if there is a common subexpression in a small function, you end up inlining it into many places, then having to eliminate the common subexpression in each copy.

The LLVM inliner avoids these problems, but (as you point out) this really does force it to being a bottom-up inliner. This means that the bottom-up inliner needs to make decisions in strange ways in some cases: for example if qux called foo, and foo were static, then (when processing foo) we may decide not to inline bar into foo because it would be more profitable to inline foo into qux.

Now more to the weakness of the inliner:

1) It is bottom up. The inlining is not done in the order based on the priority of the callsites. It may leave important callsites (at top of the cg) unlined due to higher cost after inline cost update. It also eliminates the possibility of inline specialization. To change this, the inliner pass may not use the pass manager infrastructure . (I noticed a hack in the inliner to workaround the problem -- for static functions avoid inlining its callees if it causes it to become too big ..)

This is true, but I don't think it's a really large problem in practice. We don't have a "global inline threshold limit" (which I've never understood, except as a hack to prevent run-away inlining) so not visiting in priority order shouldn't prevent high-priority-but-processed-late candidates from being inlined.

The only potential issue I'm aware of is if we have A->B->C and we decide to inline C into B when it would be more profitable to inline B into A and leave C out of line. This can be handled with a heuristic like the one above.

2) There seems to be only one inliner pass. For calls to small functions, it is better to perform early inlining as one of the local (per function) optimizations followed by scalar opt clean up. This will sharpen the summary information. (Note the inline summary update does not consider the possible cleanup)

Yep. This is a feature

3) recursive inlining is not supported

This is a policy decision. It's not clear whether it is really a good idea, though I have seen some bugzilla or something about it. I agree that it should be revisited.

4) function with indirect branch is not inlined. What source construct does indirect branch instr correspond to ? variable jump?

See:

for more details.

6) There is one heuristc used in inline-cost computation seems wrong:

// Calls usually take a long time, so they make the inlining gain smaller.
InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;

Does it try to block inlining of callees with lots of calls? Note inlining such a function only increase static call counts.

I think that this is a heuristic that Jakob came up with, but I think it's a good one, also discussed elsewhere on the thread.

When talking about inlining and tuning thresholds and heuristics, it is a good idea to quantify what the expected or possible wins of inlining a function are. Some of the ones I'm aware of:

1. In some cases, inlining shrinks code.

2. Inlining a function exposes optimization opportunities on the inlined code, because constant propagation and other simplifications can take place.

3. Inlining a function exposes optimizations in the caller because address-taken values can be promoted to registers.

4. Inlining a function can improve optimization in a caller because interprocedural side-effect analysis isn't needed. For example, load/call dependence may not be precise. This is something we should continue to improve in the optimizer though.

5. Inlining code with indirect call sites and switches can improve branch prediction if some callers of the function are biased differently than other callers. This is pretty hard to predict without profile info though.

The "punish functions containing lots of calls" is based on the assumption that functions which are mostly calls (again, this decision happens after the callee has been inlined and simplified) aren't themselves doing much work.

-Chris

Xinliang_David_Li · November 29, 2010, 7:39am

Hi, I browsed the LLVM inliner implementation, and it seems there is room for improvement. (I have not read it too carefully, so correct me if what I observed is wrong).

First the good side of the inliner – the function level summary and inline cost estimation is more elaborate and complete than gcc. For instance, it considers callsite arguments and the effects of optimization enabled by inlining.

Yep, as others pointed out, this is intended to interact closely with the per-function optimizations that get mixed in due to the inliner being a callgraphscc pass. This is actually a really important property of the inliner. If you have a function foo that calls a leaf function bar, the sequence of optimization is:

Run the inliner on bar (noop, since it has no call sites)

Run the per-function passes on bar. This generally shrinks it, and prevents “abstraction penalty” from making bar look too big to inline.

Run the inliner on foo. Since foo calls bar, we consider inlining bar into foo and do so if profitable.

Run the per-function passes on foo. If bar got inlined, this means that we’re running the per-function passes over the inlined contents of bar again.

On-the-fly clean up (optimization) while doing bottom up inlining is nice as you described. Many other compilers chose not to do this way due to scalability concerns (with IPO) – this can make the IPO the biggest bottom neck in terms of compile time (as it is serialized). Memory many not be a big issue for LLVM as I can see the good locality in pass manager. (Just curious, what is biggest application LLVM can build with IPO?)

In a traditional optimizer like GCC’s, you end up with problems where you have to set a high inline threshold due to inlining-before-optimizing causing “abstraction penalty problems”. An early inliner is a hack that tries to address this.

It is a hack in some sense (but a common practice) – but enables other flexibilities.

Another problem with this approach from the compile time perspective is that you end up repeating work multiple times. For example, if there is a common subexpression in a small function, you end up inlining it into many places, then having to eliminate the common subexpression in each copy.

Early inlining + scalar opt can do the same, right?

The LLVM inliner avoids these problems, but (as you point out) this really does force it to being a bottom-up inliner. This means that the bottom-up inliner needs to make decisions in strange ways in some cases: for example if qux called foo, and foo were static, then (when processing foo) we may decide not to inline bar into foo because it would be more profitable to inline foo into qux.

Now more to the weakness of the inliner:

It is bottom up. The inlining is not done in the order based on the priority of the callsites. It may leave important callsites (at top of the cg) unlined due to higher cost after inline cost update. It also eliminates the possibility of inline specialization. To change this, the inliner pass may not use the pass manager infrastructure . (I noticed a hack in the inliner to workaround the problem – for static functions avoid inlining its callees if it causes it to become too big …)

This is true, but I don’t think it’s a really large problem in practice. We don’t have a “global inline threshold limit” (which I’ve never understood, except as a hack to prevent run-away inlining) so not visiting in priority order shouldn’t prevent high-priority-but-processed-late candidates from being inlined.

global threshold can be used to control the unnecessary size growth. In some cases, the size increase may also cause increase in icache footprint leading to poor performance. In fact, with IPO/CMO, icache footprint can be modeled in some way and be used as one kind of global limit.

The only potential issue I’m aware of is if we have A->B->C and we decide to inline C into B when it would be more profitable to inline B into A and leave C out of line. This can be handled with a heuristic like the one above.

There seems to be only one inliner pass. For calls to small functions, it is better to perform early inlining as one of the local (per function) optimizations followed by scalar opt clean up. This will sharpen the summary information. (Note the inline summary update does not consider the possible cleanup)

Yep. This is a feature

recursive inlining is not supported

This is a policy decision. It’s not clear whether it is really a good idea, though I have seen some bugzilla or something about it. I agree that it should be revisited.

function with indirect branch is not inlined. What source construct does indirect branch instr correspond to ? variable jump?

See:
http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html

for more details.

There is one heuristc used in inline-cost computation seems wrong:

// Calls usually take a long time, so they make the inlining gain smaller.
InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;

Does it try to block inlining of callees with lots of calls? Note inlining such a function only increase static call counts.

I think that this is a heuristic that Jakob came up with, but I think it’s a good one, also discussed elsewhere on the thread.

When talking about inlining and tuning thresholds and heuristics, it is a good idea to quantify what the expected or possible wins of inlining a function are. Some of the ones I’m aware of:

In some cases, inlining shrinks code.

Inlining a function exposes optimization opportunities on the inlined code, because constant propagation and other simplifications can take place.

Inlining a function exposes optimizations in the caller because address-taken values can be promoted to registers.

Inlining a function can improve optimization in a caller because interprocedural side-effect analysis isn’t needed. For example, load/call dependence may not be precise. This is something we should continue to improve in the optimizer though.

Inlining code with indirect call sites and switches can improve branch prediction if some callers of the function are biased differently than other callers. This is pretty hard to predict without profile info though.

Besides – 1) reducing call overhead; 2) scheduling freedom; 3) enabling optimizations across inline instances of callee(s); 4) sharpening local analysis (mainly aliasing) results – such as points to, malloc etc.

It may also lose aliasing assertion (such as restrict aliasing) if not done properly.

The “punish functions containing lots of calls” is based on the assumption that functions which are mostly calls (again, this decision happens after the callee has been inlined and simplified) aren’t themselves doing much work.

My point is that using static count of callsites as a indicator for this can be misleading. All the calls may be calls to cold external functions for instance.

Thanks,

David

Duncan_Sands · November 29, 2010, 9:17am

Hi David,

Interesting -- LLVM does perform on the fly cleanups during inlining
transformation -- this will make summary update precise. One thing I notice from
the debug pass dump is that the 'deduce function attribute' pass happens before
the clean up -- Is it intended?

you are correct. This is due to a limitation of the pass manager: it would
clearly be better to run the function attributes pass after the per-function
passes that follow the inliner, but currently if you try what happens is that
first the inliner and on the fly cleanups get run on every function, and only
when they have finished the function attributes pass gets run on every
function. That means that function attributes of callees will not have been
calculated when the inliner (and the subsequent function passes) process a
function, which is bad. I think the pass manager should be fixed to not do
this, at which point the function attributes pass could be moved later.

Ciao,

Duncan.

Rafael_Avila_de_Espi · November 29, 2010, 12:07pm

On-the-fly clean up (optimization) while doing bottom up inlining is nice as
you described. Many other compilers chose not to do this way due to
scalability concerns (with IPO) -- this can make the IPO the biggest bottom
neck in terms of compile time (as it is serialized). Memory many not be a
big issue for LLVM as I can see the good locality in pass manager. (Just
curious, what is biggest application LLVM can build with IPO?)

I am not sure what is the biggest one, but the one I tried was clang
itself with LTO:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-October/035584.html

Cheers,
Rafael

Xinliang_David_Li · November 29, 2010, 4:30pm

Hi David,

Interesting – LLVM does perform on the fly cleanups during inlining
transformation – this will make summary update precise. One thing I notice from
the debug pass dump is that the ‘deduce function attribute’ pass happens before
the clean up – Is it intended?

you are correct. This is due to a limitation of the pass manager: it would
clearly be better to run the function attributes pass after the per-function
passes that follow the inliner, but currently if you try what happens is that
first the inliner and on the fly cleanups get run on every function, and only
when they have finished the function attributes pass gets run on every
function. That means that function attributes of callees will not have been
calculated when the inliner (and the subsequent function passes) process a
function, which is bad. I think the pass manager should be fixed to not do
this, at which point the function attributes pass could be moved later.

So bottom up traversal of the (scc node of) callgraph is not sufficient to guarantee all callees are processed before the caller?

Thanks,

David

Duncan_Sands · November 29, 2010, 4:40pm

Hi David,

     > Interesting -- LLVM does perform on the fly cleanups during inlining
     > transformation -- this will make summary update precise. One thing I
    notice from
     > the debug pass dump is that the 'deduce function attribute' pass happens
    before
     > the clean up -- Is it intended?

    you are correct. This is due to a limitation of the pass manager: it would
    clearly be better to run the function attributes pass after the per-function
    passes that follow the inliner, but currently if you try what happens is that
    first the inliner and on the fly cleanups get run on every function, and only
    when they have finished the function attributes pass gets run on every
    function. That means that function attributes of callees will not have been
    calculated when the inliner (and the subsequent function passes) process a
    function, which is bad. I think the pass manager should be fixed to not do
    this, at which point the function attributes pass could be moved later.

So bottom up traversal of the (scc node of) callgraph is not sufficient to
guarantee all callees are processed before the caller?

that's not the problem, the problem is that the pass manager schedules two
bottom up traversals, one for the inliner + cleanup passes and another for
the function attributes pass, if you place place the function attributes
pass later in the pass list. I don't know why, but it seems to be a pass
manager bug (perhaps it is a feature...).

Ciao,

Duncan.

Chris_Lattner · November 29, 2010, 6:46pm

I'm not sure what the issue is, but that is clearly a bug.

-Chris

Chris_Lattner · November 29, 2010, 6:56pm

Run the inliner on bar (noop, since it has no call sites)

Run the per-function passes on bar. This generally shrinks it, and prevents “abstraction penalty” from making bar look too big to inline.

Run the inliner on foo. Since foo calls bar, we consider inlining bar into foo and do so if profitable.

Run the per-function passes on foo. If bar got inlined, this means that we’re running the per-function passes over the inlined contents of bar again.

On-the-fly clean up (optimization) while doing bottom up inlining is nice as you described. Many other compilers chose not to do this way due to scalability concerns (with IPO) – this can make the IPO the biggest bottom neck in terms of compile time (as it is serialized). Memory many not be a big issue for LLVM as I can see the good locality in pass manager. (Just curious, what is biggest application LLVM can build with IPO?)

I don’t really know, and I agree with you that LLVM’s LTO isn’t very scalable (it currently loads all the IR into memory). I haven’t thought a lot about this, but I’d tackle that problem in three stages:

Our LTO model runs optimizations at both compile and link time, the compile-time optimizations should work as they do now IMO. This is controversial though, because doing so could cause (e.g.) an inlining to happen “early” that would be seen as a bad idea with full LTO information. The advantage of doing compile-time optimizations is that it both shrinks the IR, and speeds up an incremental rebuild by avoiding having to do simple optimizations again.
At LTO time, the bottom-up processing of the callgraph is still goodness and presents good locality (unless you have very very large SCC’s). The tweak that we’d have to implement is lazy deserialization (already implemented) and reserialization to disk (which is missing). With this, you get much better memory footprint than “hold everything in memory at once”.
To support multiple cores/machines, you break the callgraph SCC DAG into parallel chunks that can be farmed out. There is a lot of parallelism in a DAG.

I don’t know of anyone planning on working on LTO at the moment though.

In a traditional optimizer like GCC’s, you end up with problems where you have to set a high inline threshold due to inlining-before-optimizing causing “abstraction penalty problems”. An early inliner is a hack that tries to address this.

It is a hack in some sense (but a common practice) – but enables other flexibilities.

The hack I’m referring to is the “raise the inline threshold”. If the inliner has any language specificity to its inline threshold, I consider it a hack. There is no reason the inliner should have to know if it’s building C or C++ code. It should be guided based on the structure of the code.

Another problem with this approach from the compile time perspective is that you end up repeating work multiple times. For example, if there is a common subexpression in a small function, you end up inlining it into many places, then having to eliminate the common subexpression in each copy.

Early inlining + scalar opt can do the same, right?

In some cases, but not in general, because you run into phase ordering problems.

This is true, but I don’t think it’s a really large problem in practice. We don’t have a “global inline threshold limit” (which I’ve never understood, except as a hack to prevent run-away inlining) so not visiting in priority order shouldn’t prevent high-priority-but-processed-late candidates from being inlined.

global threshold can be used to control the unnecessary size growth. In some cases, the size increase may also cause increase in icache footprint leading to poor performance. In fact, with IPO/CMO, icache footprint can be modeled in some way and be used as one kind of global limit.

I understand that, but that implies that you have some model for code locality. Setting a global code growth limit is (in my opinion) a hack unless you are aiming for the whole program to fit in the icache (which I don’t think anyone tries to do :).

With any other limit that is higher than your icache size, you are basically picking an arbitrary limit that is not based on the machine model or the instruction locality of the program.

The “punish functions containing lots of calls” is based on the assumption that functions which are mostly calls (again, this decision happens after the callee has been inlined and simplified) aren’t themselves doing much work.

My point is that using static count of callsites as a indicator for this can be misleading. All the calls may be calls to cold external functions for instance.

Absolutely true. It may also be completely wrong for some functions. It’s a heuristic

-Chris

nlopes · November 29, 2010, 11:05pm

BTW, why not learn the heuristic automatically?.. It's actually not very
difficult. Last semester, for a class project, I did some experiments with
using statistical classifier boosting to automatically derive new heuristic
functions for LLVM's inliner (for -O2 and -Os).
The results are a bit biased since I've used a very small training program set (just SPEC 2006). I just made the report available on-line in case someone is interested: http://web.ist.utl.pt/nuno.lopes/pubs/inline-stats10.pdf

Nuno

Devang_Patel1 · November 30, 2010, 1:01am

I am not sure I understand this. Here is what I am seeing when I run
'opt -inline -instcombine -functionattrs'

Target Data Layout
No Alias Analysis (always returns 'may' alias)
  ModulePass Manager
    Basic CallGraph Construction
    Call Graph SCC Pass Manager
      Function Integration/Inlining
      FunctionPass Manager
        Combine redundant instructions
    Basic CallGraph Construction
    Call Graph SCC Pass Manager
      Deduce function attributes
      FunctionPass Manager
        Preliminary module verification
        Dominator Tree Construction
        Module Verifier

Is this not what you expect ?

Devang_Patel1 · November 30, 2010, 1:04am

And if you run 'opt -inline -functionattrs -instcombine' you get

Target Data Layout
No Alias Analysis (always returns 'may' alias)
  ModulePass Manager
    Basic CallGraph Construction
    Call Graph SCC Pass Manager
      Function Integration/Inlining
      Deduce function attributes
      FunctionPass Manager
        Combine redundant instructions
        Preliminary module verification
        Dominator Tree Construction
        Module Verifier

Chris_Lattner · November 30, 2010, 1:15am

I'd expect (ignoring the verifier):

Target Data Layout
No Alias Analysis (always returns 'may' alias)
ModulePass Manager
   Basic CallGraph Construction
   Call Graph SCC Pass Manager
     Function Integration/Inlining
     FunctionPass Manager
       Combine redundant instructions
     Deduce function attributes

-Chris

Devang_Patel1 · November 30, 2010, 1:28am

Right now PassManager is not doing this because instcombine does not state that it preserves Inlining and Basic CallGraph Construction.

Chris_Lattner · November 30, 2010, 5:07am

I'm pretty sure that all CGSCC passes have to tolerate any function pass.

-Chris

Devang_Patel1 · November 30, 2010, 5:52am

OK, then will have to dig deeper.

Topic		Replies	Views
Evaluate and Expand the Module-Level Inliner GSoC	6	1007	April 15, 2022
Recursion inlining LLVM Dev List Archives	1	78	December 5, 2014
function inlining threshold ? LLVM Dev List Archives	2	121	July 5, 2005
RFC: Recursion inlining LLVM Dev List Archives	1	127	December 22, 2014
[GSoC] Inlining: using locally optimal inlining decisions LLVM Dev List Archives	1	112	May 25, 2021

LLVM Inliner

Related Topics