inlining with O3 and O4

I am wondering how O4 vs O3 do inlining. With O4 it looks like inlining is done first on each file and then again at the link phase. Wouldn't it be better to delay all inlining decisions until the link stage?

Ram

Yes and no.
Yes in the sense that you may make some better decisions.
No in the sense that you will end up with larger modules (even assuming
some simple early CSE etc. is also done), and, as a result of having
done no inlining, the link-stage inliner may make worse decisions,
depending on what IPA analysis you base it on and when that analysis
runs.

It's certainly possible to have a link-phase-only early inliner and a
link-phase-only late inliner, and you will, in general, get better
decisions than with a local inliner plus a link-phase inliner, but the
cost you pay is more memory usage, more disk usage, etc.

I'm curious -- where do you draw these conclusions from?

With the current LLVM inliner (significant portions of which are quite new)
I would not expect bad decisions from delaying inlining until link time. In
fact, there are a large number of heuristics we use during per-module
inlining which make *zero* sense if you eventually perform LTO.

A very long-standing todo of mine is to build a per-module set of passes
for LTO builds that is very carefully chosen to be information-preserving
and to avoid decisions which can be better made at LTO time. I suspect that
we would see significantly better LTO results from this, but of course only
an experiment will tell. My hunch stems from the fact that the optimization
passes in LLVM have been heavily tuned for the information available in the
per-module pipeline, and many of them will be ineffective if run afterward.
The inliner is a good example here: we specifically evaluate potential
future inlining opportunities when making a particular inlining decision.
Doing that per-module, when you will eventually have total information,
seems flawed.

> I am wondering how O4 vs O3 do inlining. With O4 it looks like inlining
> is done first on each file and then at linking phase. Wouldn't it be a
> better alternative to delay inlining decisions until the link stage?
> Yes and no.
> Yes in the sense that you may make some better decisions.
> No in the sense that you will end up with larger modules (assuming
> some simple early CSE/etc is also done), and as a result of having
> done no inlining, may make worse decisions at the link stage inlining,
> depending on what IPA analysis you base your link stage inlining on
> and when it runs.
>
> It's certainly possible to have a link-phase only early inliner, and a
> link-phase only later inliner, and you will, in general, get better
> decisions than a local inliner + link phase inliner, but the cost you
> pay is more memory usage, more disk usage, etc.

> I'm curious -- where do you draw these conclusions from?

Watching 4 compilers (ICC, XLC, GCC, Open64) go through about 10 years'
worth of rewriting their inliners every few years :wink:

> With the current LLVM inliner (significant portions of which are quite new)
> I would not expect bad decisions by delaying inlining until link time. In
> fact, there are a large number of heuristics we use during per-module
> inlining which make *zero* sense if you eventually perform LTO.

Sure. The heuristics tend to become more complex over time, however,
and require more analysis (i.e., "oh, I'm statistically likely to be able
to eliminate large parts of this function because it will become
constant", or "oh, inlining this performance-critical function will
enable us to eliminate loads through otherwise undecidable-aliasing
pointers"). That analysis is usually stymied by the lack of inlining and
of simple CSE/dead-code elimination (because, in order to be fast, it is
usually not flow-sensitive and has no concept of whether the code will
ever be executed).

I wouldn't disagree that, with the exact current heuristics I see in
the inliner, you could delay all decisions until later and get better
results.

> A very long-standing todo of mine is to build a per-module set of passes for
> LTO builds that is very carefully chosen to be information preserving and
> avoid decisions which can be better made at LTO-time. I suspect that we
> would see significantly better LTO results from this, but of course only an
> experiment will show. My hunch is because the optimization passes in LLVM
> have been heavily tuned for the information available in the per-module
> pass, and many of them will be ineffective if run after. The inliner is a
> good example here. We specifically evaluate potential future inlining
> opportunities when making a particular inlining decision. Doing that
> per-module when you will eventually have total information seems flawed.

You are assuming that the analysis you will run to evaluate future inlining
opportunities would not itself affect what that total information says :slight_smile:

In a perfect world, you are right: it's always better to delay decisions
to the latest possible point, until you have all possible information.
In the practical world, the analyses and passes that this "all possible
information" comes from are themselves affected by the decisions you are
now delaying.