Intended behavior of CGSCC pass manager.

Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager)

At the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) and which of those are reasonably implementable.

AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out.

The main difficulty which Chandler described is the apparently quite complex logic surrounding needing to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve.

(
Why is this “function passes inside CGSCC passes” stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (the SCC visitation order is post-order)
For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers so that we get an accurate evaluation of the inline cost.
Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don’t do this (which is why we can’t just look at how other compilers solve this problem; they don’t have this problem (maybe they should? or maybe we shouldn’t?)). For example, he described that GCC uses different inlining “phases”; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does.
)

As background for what is below, the LazyCallGraph tracks two graphs: the “call graph” and the “ref graph”.
Conceptually, the call graph is the graph of direct calls, where indirect calls and calls to external functions do not appear (or are connected to dummy nodes). The ref graph is basically the graph of all functions transitively accessible based on the globals/constants/etc. referenced by a function (e.g. if a function foo references a vtable that is defined in the module, there is an edge in the ref graph from foo to every function in the vtable).
The call graph is a strict subset of the ref graph.

Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of modifications that can occur:

  • a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into a direct call. This requires adding an edge in the CG but not in the ref graph.
  • a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not in the ref graph.
  • a pass may delete a direct call. This removes an edge in the CG and also in the ref graph.

From the perspective of the CGSCC pass manager, these operations can affect the SCC structure. Adding an edge might merge SCC’s and deleting an edge might split SCC’s. Chandler mentioned that apparently the issues of splitting and merging SCC’s within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation issues, and that is what he is working on.

(
The ref graph is important to guide the overall SCC visitation order because it basically represents “the largest graph that the CG may turn into due to our static analysis of this module”. I.e. no transformation we can statically make in the CGSCC passes can ever cause us to need to merge SCC’s in the ref graph.
)

I have a couple overall questions/concerns:

  1. The ref graph can easily go quadratic. E.g.

typedef void (*fp)();
fp funcs[] = {
  &foo1,
  &foo2,
  ...
  &fooN
};
void foo1() { funcs[something](); }

void foo2() { funcs[something](); }
...
void fooN() { funcs[something](); }

One real-world case where this might come about is in the presence of vtables.

The existing CGSCC pass manager does not have this issue AFAIK because it does not consider the ref graph.

Does anybody have any info/experience about how densely connected the ref graph can get in programs that might reasonably be fed to the compiler?
I just did a quick sanity check with LLD/ELF using `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate / not run out of memory. Based on some rough calculations looking at the profile, it seems like the entire run of the inliner in the old LTO pipeline (which is about 5% of total LTO time on this particular example I looked at) is only 2-3x as expensive as just `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction is certainly not free.

  2. What is the intended behavior of CGSCC passes when SCC’s are split or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after inlining). Consider:

a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass (say, the inliner) on this larger SCC?

b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC’s?

btw:

One way that I have found it useful to think about this is in terms of the visitation during Tarjan’s SCC algorithm. I’ll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. Inside the “strongconnect” routine, when we have identified an SCC (the true branch of the `if (v.lowlink = v.index)` test) we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the variable S in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so its entry can’t remain on the stack). Then, we can run function passes as we pop individual functions off the stack; IMO this is easier to think about than merging SCC data structures: if we add edges to the CG then we have to do more DFS on the new edges, and if we delete edges then the DFS order of the stack gives us certain guarantees.
Personally I find this much easier to reason about than the description in terms of splitting and merging SCC’s in the CG and ref graph (which the LazyCallGraph API forces one to think about, since it hides the underlying Tarjan’s algorithm).
The LazyCallGraph API makes the current loop in http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 very clean, but at least for my thinking about the problem, it seems like the wrong abstraction (and most of the LazyCallGraph API seems to be unused, so it seems like it may be overly heavyweight).
E.g. I think that maybe the easiest thing to do is to turn the current approach inside out: instead of having the pass manager logic be the “normal code” and forcing the Tarjan algorithm to become a state machine of iterators, use an open-coded Tarjan algorithm with some callbacks and make the pass management logic be the state machine.
This will also open the door to avoiding the potentially quadratic size of the ref graph, since e.g. in the example I gave above, we can mark the funcs array itself as already having been visited during the walk. In the current LazyCallGraph, this would require adding some sort of notion of hyperedge.

Since this is such a high priority (due to blocking PGO inlining), I will probably try my hand at implementing the CGSCC pass manager sometime soon unless somebody beats me to it. (I’ll probably try the “open-coded SCC visit” approach).

Another possibility is implementing the new CGSCC pass manager that uses the same visitation semantics as the one in the old PM, and then we can refactor that as needed. In fact, that may be the best approach so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., need benchmarks, measurements …) changes in separate commits.

Sorry for the wall of text.

– Sean Silva

From: "Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Wednesday, June 8, 2016 6:19:03 AM
Subject: [llvm-dev] Intended behavior of CGSCC pass manager.

Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the
last LLVM bay area social, and partially a discussion about the
direction of the CGSCC pass manager)

A the last LLVM social we discussed the progress on the CGSCC pass
manager. It seems like Chandler has a CGSCC pass manager working,
but it is still unresolved exactly which semantics we want (more
about this below) that are reasonably implementable.

AFAICT, there has been no public discussion about what exact
semantics we ultimately want to have. We should figure that out.

The main difficulty which Chandler described is the apparently quite
complex logic surrounding needing to run function passes nested
within an SCC pass manager, while providing some guarantees about
exactly what order the function passes are run. The existing CGSCC
pass manager just punts on some of the problems that arise (look in
CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and
CGPassManager::RunPassOnSCC in
llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems
that Chandler has been trying to solve.

(
Why is this "function passes inside CGSCC passes" stuff interesting?
Because LLVM can do inlining on an SCC (often just a single
function) and then run function passes to simplify the function(s)
in the SCC before it tries to inline into a parent SCC. (the SCC
visitation order is post-order)
For example, we may inline a bunch of code, but after inlining we can
tremendously simplify the function, and we want to do so before
considering this function for inlining into its callers so that we
get an accurate evaluation of the inline cost.
Based on what Chandler said, it seems that LLVM is fairly unique in
this regard and other compilers don't do this (which is why we can't
just look at how other compilers solve this problem; they don't have
this problem (maybe they should? or maybe we shouldn't?)). For
example, he described that GCC uses different inlining "phases";
e.g. it does early inlining on the entire module, then does
simplifications on the entire module, then does late inlining on the
entire module; so it is not able to incrementally simplify as it
inlines like LLVM does.

This incremental simplification is an important feature of our inliner, and one we should endeavor to keep. We might also want different phases at some point (e.g. a top-down and a bottom-up phase), but that's another story.

)

As background for what is below, the LazyCallGraph tracks two graphs:
the "call graph" and the "ref graph".
Conceptually, the call graph is the graph of direct calls, where
indirect calls and calls to external functions do not appear (or are
connected to dummy nodes). The ref graph is basically the graph of
all functions transitively accessible based on the
globals/constants/etc. referenced by a function (e.g. if a function
`foo` references a vtable that is defined in the module, there is an
edge in the ref graph from `foo` to every function in the vtable).
The call graph is a strict subset of the ref graph.

Chandler described that he had a major breakthrough in that the CGSCC
pass manager only had to deal with 3 classes of modifications that
can occur:
- a pass may e.g. propagate a load of a function pointer into an
indirect call, turning it into an direct call. This requires adding
an edge in the CG but not in the ref graph.
- a pass may take a direct call and turn it into an indirect call.
This requires removing an edge from the CG, but not in the ref
graph.
- a pass may delete a direct call. This removes an edge in the CG and
also in the ref graph.

From the perspective of the CGSCC pass manager, these operations can
affect the SCC structure. Adding an edge might merge SCC's and
deleting an edge might split SCC's. Chandler mentioned that
apparently the issues of splitting and merging SCC's within the
current infrastructure are actually quite challenging and lead to
e.g. iterator invalidation issues, and that is what he is working
on.

(
The ref graph is important to guide the overall SCC visitation order
because it basically represents "the largest graph that the CG may
turn into due to our static analysis of this module". I.e. no
transformation we can statically make in the CGSCC passes can ever
cause us to need to merge SCC's in the ref graph.
)

I have a couple overall questions/concerns:

1. The ref graph can easily go quadratic. E.g.

typedef void (*fp)();
fp funcs = {
&foo1,
&foo2,
...
&fooN
}
void foo1() { funcs[something](); }

void foo2() { funcs[something](); }
...
void fooN() { funcs[something](); }

One real-world case where this might come about is in the presence of
vtables.

The existing CGSCC pass manager does not have this issue AFAIK
because it does not consider the ref graph.

Does anybody have any info/experience about how densely connected the
ref graph can get in programs that might reasonably be fed to the
compiler?
I just did a quick sanity check with LLD/ELF using
`--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to
terminate / not run out of memory. Based on some rough calculations
looking at the profile, it seem like the entire run of the inliner
in the old LTO pipeline (which is about 5% of total LTO time on this
particular example I looked at) is only 2-3x as expensive as just
`--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph
construction is certainly not free.

2. What is the intended behavior of CGSCC passes when SCC's are split
or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now
we run some function passes nested inside the CGSCC pass manager
(e.g. to simplify things after inlining). Consider:

This is not how I thought the current scheme worked ;) -- I was under the impression that we had a call graph with conservatively-connected dummy nodes for external/indirect functions. As a result, there is no semantics-preserving optimization that will merge SCCs, only split them. In that case, I'd expect that once an SCC is split, we re-run the CGSCC passes over the newly-separated SCCs. But this corresponds to running over the "ref graph", as you describe it. I don't understand why we want the non-conservative graph.

a) These function passes are e.g. now able to devirtualize a call,
adding an edge to the CG, forming a larger CG SCC. Do you re-run the
CGSCC pass (say, the inliner) on this larger SCC?

b) These function passes are e.g. able to DCE a call, removing an
edge from the CG. This converts, say, a CG SCC which is a cycle
graph (like a->b->c->a) into a path graph (a->b->c, with no edge
back to a). The inliner had already visited a, b, and c as a single
SCC. Now does it have to re-visit c, then b, then a, as single-node
SCC's?

btw:

One way that I have found it useful to think about this is in terms
of the visitation during Tarjan's SCC algorithm. I'll reference the
pseudocode in
https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
Inside the "strongconnect" routine when we have identified an SCC
(the true branch of `if (v.lowlink = v.index)` test ) we can visit
stack[v.index:stack.size()] as an SCC. This may or may not
invalidate some things on the stack (the variable `S` in the
pseudocode) and we may need to fix it up (e.g. inlining deleted a
function, so we can't have an entry on the stack). Then, we can run
function passes as we pop individual functions off the stack, but it
is easier to think about IMO than merging of SCC data structures: if
we add edges to the CG then we have to do more DFS on the new edges
and if we delete edges then the DFS order of the stack gives us
certain guarantees.
Personally I find this much easier to reason about than the
description in terms of splitting and merging SCC's in the CG and
ref graph (which the LazyCallGraph API makes one to think about
since it hides the underlying Tarjan's algorithm).
The LazyCallGraph API makes the current loop in
http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
very clean, but at least for my thinking about the problem, it seems
like the wrong abstraction (and most of the LazyCallGraph API seems
to be unused, so it seems like it may be overly heavyweight).
E.g. I think that maybe the easiest thing to do is to turn the
current approach inside out: instead of having the pass manager
logic be the "normal code" and forcing the Tarjan algorithm to
become a state machine of iterators, use an open-coded Tarjan
algorithm with some callbacks and make the pass management logic be
the state machine.
This will also open the door to avoiding the potentially quadratic
size of the ref graph, since e.g. in the example I gave above, we
can mark the `funcs` array itself as already having been visited
during the walk. In the current LazyCallGraph, this would require
adding some sort of notion of hyperedge.

FWIW, I see no purpose in abstracting one algorithm, especially if that makes things algorithmically harder. Also, the LazyCallGraph abstraction and the iterator abstraction seem like separate issues. Iterator abstractions are often useful because you can use them in generic algorithms, etc.

Since this is such a high priority (due to blocking PGO inlining), I
will probably try my hand at implementing the CGSCC pass manager
sometime soon unless somebody beats me to it. (I'll probably try the
"open-coded SCC visit" approach).

Another possibility is implementing the new CGSCC pass manager that
uses the same visitation semantics as the one in the old PM, and
then we can refactor that as needed. In fact, that may be the best
approach so that porting to the new PM is as NFC as possible and we
can isolate the functional (i.e., need benchmarks, measurements ...)
changes in separate commits.

I'm in favor of this approach for exactly the reason you mention. Being able to bisect regressions to the algorithmic change, separate from the infrastructure change, will likely make things easier in the long run (and will avoid the problem, to the extent possible, of performance regressions blocking the pass-manager work).

Sorry for the wall of text.

No problem; much appreciated.

-Hal

I can state that almost all call graphs of compilers include edges for
indirect calls and external functions, so they are already quadratic in
this sense.

If what you state is correct, and we don't have a conservatively correct
call graph, that would be ... interesting.
The solution that most other compilers take to the issues that exist here
(large SCCs, etc.) is to try to make the call graph more precise, not to
avoid indirect/external calls in the call graph.

In turn, this means the solution they often take is to not have two graphs
at all.


I want to clarify a little more on GCC's behavior. GCC has two types of
IPA passes: one is called a simple_ipa pass and the other a regular
ipa_pass. A simple IPA pass does not transform IR by itself, but it can
have function-level sub-passes that do transformations. If it has function
sub-passes, they will be executed in bottom-up order, just
like LLVM's CGSCC passes. A regular IPA pass is somewhat like a module
pass in LLVM.

GCC's pass pipeline is actually similar to LLVM's pipeline overall: it has
the following components:

(1) lowering passes
(2) small/simple IPA passes including the early optimization pipeline
(bottom-up)
(3) full ipa pipelines
(4) post-ipa optimization and code generation.

The difference is that LLVM's (2) includes only instcombine and
simplifycfg, but GCC's (2) is larger and also includes an early inlining
pass. GCC's (2) is similar to LLVM's CGSCC pipeline but with fewer passes.
Its early inliner also targets tiny functions whose inlining reduces size
overall. LLVM's LTO pipeline is kind of like this too.

GCC's (3) includes a full-blown regular inlining pass. This inliner uses
a priority-order-based algorithm, so bottom-up traversal does not make
sense. It requires (2) to have a pass that collects summaries to drive the
decisions (as well as profile data).

David

Hi Sean,

Sean Silva wrote:
> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>
> (this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a
> discussion about the direction of the CGSCC pass manager)

Thanks for writing this up! This sort of thing is very helpful for
people who don't attend the social, or had to leave early. :)

> Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of
> modifications that can occur:
> - a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into an direct call. This
> requires adding an edge in the CG but not in the ref graph.
> - a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not
> in the ref graph.
> - a pass may delete a direct call. This removes an edge in the CG and also in the ref graph.

At what granularity are we modeling these things? E.g. if SimplifyCFG
deletes a basic block, will we remove call edges that start from that
block?

Note there is a fourth kind of modification that isn't modeled here:
devirtualization, in which we transform IR like

   %fnptr = compute_fnptr(%receiver, i32 <method id>)
   call %fnptr(%receiver)

to (via a pass that understands the semantics of compute_fnptr):

   call some.specific.Class::method(%receiver)

However, at this time I don't think modeling "out of thin air"
devirtualization of this sort is important, since nothing upstream
does things like this (we do have ModulePasses downstream that do
this).

> From the perspective of the CGSCC pass manager, these operations can affect the SCC structure. Adding an edge might
> merge SCC's and deleting an edge might split SCC's. Chandler mentioned that apparently the issues of splitting and
> merging SCC's within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation
> issues, and that is what he is working on.
>
> (
> The ref graph is important to guide the overall SCC visitation order because it basically represents "the largest graph
> that the CG may turn into due to our static analysis of this module". I.e. no transformation we can statically make in
> the CGSCC passes can ever cause us to need to merge SCC's in the ref graph.
> )

Except in the above "out of thin air" devirtualization case.

I think cross-function store forwarding can also be problematic:

void foo(fnptr* bar) {
   if (bar)
     (*bar)();
}

void baz() { foo(null); }

void caller() {
   fnptr *t = malloc();
   *t = baz;
   foo(t);
}

// RefGraph is
// caller -> baz
// caller -> foo
// baz -> foo

Now the RefSCCs are {foo}, {caller}, {baz} but if we forward the store
to *t in caller into the load in foo (and are smart about the null
check) we get:

void foo(fnptr* bar) {
   if (bar)
     baz();
}

void baz() { foo(null); }

void caller() {
   fnptr *t = malloc();
   *t = baz;
   foo(t);
}

// RefGraph is
// foo -> baz
// baz -> foo
// caller -> foo
// caller -> baz

and now the RefSCCs are {foo, baz}, {caller}

But again, I think this is fine since nothing upstream does this at
this time; and we can cross that bridge when we come to it.

> 2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g.
> the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after
> inlining). Consider:
>
> a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do
> you re-run the CGSCC pass (say, the inliner) on this larger SCC?
>
> b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which
> is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited
> a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's?

Okay, so this is the same question I wrote above: "At what granularity
are we modeling these things?", phrased more eloquently. :)

> One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm.
> I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
> Inside the "strongconnect" routine when we have identified an SCC (the true branch of `if (v.lowlink = v.index)` test )
> we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the
> variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an
> entry on the stack). Then, we can run function passes as we pop individual functions off the stack, but it is easier to
> think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new
> edges and if we delete edges then the DFS order of the stack gives us certain guarantees.

I'm not sure how this will be easier. E.g., consider:

X -> A
A -> B
B -> C
C -> B
B -> A

Now you're going to pop A,B,C together, as an SCC; and you start
optimizing them, and the edge from B to A falls out. Don't you have
the same problem now (i.e. we've removed an edge from an SCC we're
iterating over)? Perhaps I misunderstood your scheme?

-- Sanjoy

Since we have separate nodes for “calling into an external function” and “being called from an external function”, these don’t form an SCC. So it means we can end up merging SCCs, IIUC.

Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the last
LLVM bay area social, and partially a discussion about the direction of the
CGSCC pass manager)

A the last LLVM social we discussed the progress on the CGSCC pass
manager. It seems like Chandler has a CGSCC pass manager working, but it is
still unresolved exactly which semantics we want (more about this below)
that are reasonably implementable.

AFAICT, there has been no public discussion about what exact semantics we
ultimately want to have. We should figure that out.

The main difficulty which Chandler described is the apparently quite
complex logic surrounding needing to run function passes nested within an
SCC pass manager, while providing some guarantees about exactly what order
the function passes are run. The existing CGSCC pass manager just punts on
some of the problems that arise (look in CGPassManager::runOnModule,
CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
Chandler has been trying to solve.

(
Why is this "function passes inside CGSCC passes" stuff interesting?
Because LLVM can do inlining on an SCC (often just a single function) and
then run function passes to simplify the function(s) in the SCC before it
tries to inline into a parent SCC. (the SCC visitation order is post-order)
For example, we may inline a bunch of code, but after inlining we can
tremendously simplify the function, and we want to do so before considering
this function for inlining into its callers so that we get an accurate
evaluation of the inline cost.
Based on what Chandler said, it seems that LLVM is fairly unique in this
regard and other compilers don't do this (which is why we can't just look
at how other compilers solve this problem; they don't have this problem
(maybe they should? or maybe we shouldn't?)). For example, he described
that GCC uses different inlining "phases"; e.g. it does early inlining on
the entire module, then does simplifications on the entire module, then
does late inlining on the entire module; so it is not able to incrementally
simplify as it inlines like LLVM does.
)

As background for what is below, the LazyCallGraph tracks two graphs: the
"call graph" and the "ref graph".
Conceptually, the call graph is the graph of direct calls, where indirect
calls and calls to external functions do not appear (or are connected to
dummy nodes). The ref graph is basically the graph of all functions
transitively accessible based on the globals/constants/etc. referenced by a
function (e.g. if a function `foo` references a vtable that is defined in
the module, there is an edge in the ref graph from `foo` to every function
in the vtable).
The call graph is a strict subset of the ref graph.

Chandler described that he had a major breakthrough in that the CGSCC pass
manager only had to deal with 3 classes of modifications that can occur:
- a pass may e.g. propagate a load of a function pointer into an indirect
call, turning it into a direct call. This requires adding an edge in the
CG but not in the ref graph.
- a pass may take a direct call and turn it into an indirect call. This
requires removing an edge from the CG, but not in the ref graph.
- a pass may delete a direct call. This removes an edge in the CG and also
in the ref graph.
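These three classes of modification can be mirrored in a toy sketch (the method names are made up, not the LazyCallGraph API) that maintains the invariant that the call-edge set stays a subset of the ref-edge set:

```python
class ToyGraphs:
    """Toy bookkeeping for the two edge sets, maintaining the invariant
    that the call edges are a subset of the ref edges. The three methods
    mirror the three classes of modification above (names are made up)."""
    def __init__(self):
        self.call = set()   # direct-call edges
        self.ref = set()    # ref edges (superset of the call edges)

    def devirtualize(self, a, b):
        # Indirect call becomes direct: add a CG edge; the ref edge must
        # already be there (b was already referenced by a).
        assert (a, b) in self.ref
        self.call.add((a, b))

    def make_indirect(self, a, b):
        # Direct call becomes indirect: remove the CG edge, keep the ref edge.
        self.call.discard((a, b))

    def delete_call(self, a, b):
        # Direct call deleted: remove the edge from both graphs.
        self.call.discard((a, b))
        self.ref.discard((a, b))

g = ToyGraphs()
g.ref.add(("f", "g"))
g.devirtualize("f", "g")
after_devirt = (set(g.call), set(g.ref))
g.make_indirect("f", "g")
after_indirect = (set(g.call), set(g.ref))
g.devirtualize("f", "g")
g.delete_call("f", "g")
after_delete = (set(g.call), set(g.ref))
```

Note that only the third operation touches the ref graph, which is why the ref graph's SCC structure is stable under everything except call deletion.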

From the perspective of the CGSCC pass manager, these operations can
affect the SCC structure. Adding an edge might merge SCC's and deleting an
edge might split SCC's. Chandler mentioned that apparently the issues of
splitting and merging SCC's within the current infrastructure are actually
quite challenging and lead to e.g. iterator invalidation issues, and that
is what he is working on.

(
The ref graph is important to guide the overall SCC visitation order
because it basically represents "the largest graph that the CG may turn
into due to our static analysis of this module". I.e. no transformation we
can statically make in the CGSCC passes can ever cause us to need to merge
SCC's in the ref graph.
)

I have a couple overall questions/concerns:

1. The ref graph can easily go quadratic. E.g.

typedef void (*fp)();
fp funcs[] = {
  &foo1,
  &foo2,
  ...
  &fooN
};
void foo1() { funcs[something](); }
void foo2() { funcs[something](); }
...
void fooN() { funcs[something](); }

One real-world case where this might come about is in the presence of
vtables.

The existing CGSCC pass manager does not have this issue AFAIK because it
does not consider the ref graph.

Does anybody have any info/experience about how densely connected the ref
graph can get in programs that might reasonably be fed to the compiler?
I just did a quick sanity check with LLD/ELF using
`--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
/ not run out of memory. Based on some rough calculations looking at the
profile, it seems like the entire run of the inliner in the old LTO pipeline
(which is about 5% of total LTO time on this particular example I looked
at) is only 2-3x as expensive as just
`--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction
is certainly not free.

Conceptually, the reference graph should also include variable nodes. With
variable nodes introduced, the quadratic behavior mentioned above can be
avoided.
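To put numbers on the example above: with N functions each referencing an array holding all N function addresses, flattening the array's contents onto every function yields N*N ref edges, while an explicit node for the array needs only 2N. A trivial sketch of the count:

```python
def ref_edge_counts(n):
    """Edge counts for the example above: n functions fooI, each
    referencing an array `funcs` that holds the addresses of all n."""
    # Flattened into the ref graph: every fooI gets an edge to every fooJ.
    without_var_node = n * n
    # With a node for `funcs`: fooI -> funcs (n edges), funcs -> fooJ (n edges).
    with_var_node = n + n
    return without_var_node, with_var_node
```

For a module with a thousand such functions that is a million edges versus two thousand.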

2. What is the intended behavior of CGSCC passes when SCC's are split or
merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
some function passes nested inside the CGSCC pass manager (e.g. to simplify
things after inlining). Consider:

a) These function passes are e.g. now able to devirtualize a call, adding
an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass
(say, the inliner) on this larger SCC?

b) These function passes are e.g. able to DCE a call, removing an edge
from the CG. This converts, say, a CG SCC which is a cycle graph (like
a->b->c->a) into a path graph (a->b->c, with no edge back to a). The
inliner had already visited a, b, and c as a single SCC. Now does it have
to re-visit c, then b, then a, as single-node SCC's?

btw:

One way that I have found it useful to think about this is in terms of the
visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in
the Wikipedia article "Tarjan's strongly connected components algorithm".
Inside the "strongconnect" routine, when we have identified an SCC (the true
branch of the `if (v.lowlink = v.index)` test) we can visit
stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
things on the stack (the variable `S` in the pseudocode) and we may need to
fix it up (e.g. inlining deleted a function, so we can't have an entry on
the stack). Then, we can run function passes as we pop individual functions
off the stack. IMO this is easier to think about than merging of SCC data
structures: if we add edges to the CG then we have to do more DFS on the
new edges, and if we delete edges then the DFS order of the stack gives us
certain guarantees.
Personally I find this much easier to reason about than the description in
terms of splitting and merging SCC's in the CG and ref graph (which the
LazyCallGraph API makes one think about, since it hides the underlying
Tarjan's algorithm).
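As a concrete illustration of this "open-coded SCC visit" idea, here is a Python sketch (not the proposed LLVM implementation) of Tarjan's algorithm with a callback invoked for each SCC as it is identified. The callback naturally fires in post-order, and running it on the a->b->c graphs from question 2 shows the cycle reported as one SCC and the path (after the back edge is deleted) as three singletons, callees first:

```python
# Tarjan's SCC algorithm with an on_scc callback, sketching the
# "open-coded SCC visit" idea: pass-manager logic would live in the
# callback instead of in an iterator-based state machine.

def visit_sccs(graph, on_scc):
    """Invoke on_scc(members) for each SCC as it is identified.
    SCCs are reported in post-order: an SCC is reported only after
    every SCC it has edges into has been reported."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v roots an SCC: pop stack[v.index:] and hand it to the callback.
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            on_scc(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)

# Cycle graph a->b->c->a: a single SCC.
order_cycle = []
visit_sccs({"a": ["b"], "b": ["c"], "c": ["a"]},
           lambda s: order_cycle.append(sorted(s)))

# Path graph a->b->c (back edge deleted): three singleton SCCs, callees first.
order_path = []
visit_sccs({"a": ["b"], "b": ["c"], "c": []},
           lambda s: order_path.append(s))
```

In this shape, the pass-manager logic would live in `on_scc`, and edge additions/deletions become extra DFS work or stack fixups rather than explicit SCC merges and splits.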
The LazyCallGraph API makes the current loop in
http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
very clean, but at least for my thinking about the problem, it seems like
the wrong abstraction (and most of the LazyCallGraph API seems to be
unused, so it seems like it may be overly heavyweight).
E.g. I think that maybe the easiest thing to do is to turn the current
approach inside out: instead of having the pass manager logic be the
"normal code" and forcing the Tarjan algorithm to become a state machine of
iterators, use an open-coded Tarjan algorithm with some callbacks and make
the pass management logic be the state machine.
This will also open the door to avoiding the potentially quadratic size of
the ref graph, since e.g. in the example I gave above, we can mark the
`funcs` array itself as already having been visited during the walk. In the
current LazyCallGraph, this would require adding some sort of notion of
hyperedge.

Since this is such a high priority (due to blocking PGO inlining), I will
probably try my hand at implementing the CGSCC pass manager sometime soon
unless somebody beats me to it. (I'll probably try the "open-coded SCC
visit" approach).

Another possibility is implementing the new CGSCC pass manager that uses
the same visitation semantics as the one in the old PM, and then we can
refactor that as needed. In fact, that may be the best approach so that
porting to the new PM is as NFC as possible and we can isolate the
functional (i.e., need benchmarks, measurements ...) changes in separate
commits.

A very high level comment: why do we need to update the call graph on the
fly? Can we have more general support for iterative SCC pass invocation?

something like:

1) build the callgraph
2) cache the post-order traversal order

3) if the order list is empty -- done
4) traversal: invoke function passes for each function in the order (from
step 2 or 5). The call graph gets updated on the fly (with new edges, or
new nodes for cloned functions)
5) update the function traversal order with new nodes and new edges created
in 4)
6) go to step 3).
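The loop above might be sketched as follows (a hedged illustration; `process` stands in for running the function passes on one function and reporting any new work, e.g. cloned functions, that step 5 should pick up):

```python
from collections import deque

def run_iterative(postorder, process):
    """Sketch of steps 2-6 above: consume a cached post-order list,
    appending any new work (e.g. cloned functions, or functions behind
    newly discovered direct call edges) that `process` reports."""
    order = deque(postorder)          # step 2: cached post-order traversal
    visited = []
    while order:                      # step 3: done when the order is empty
        fn = order.popleft()          # step 4: run function passes on fn
        visited.append(fn)
        new_work = process(fn)        # call graph updated on the fly
        order.extend(new_work)        # step 5: extend the traversal order
    return visited                    # step 6 is the loop itself

# Hypothetical example: processing "caller" creates one clone of its callee.
cloned = []
def process(fn):
    if fn == "caller" and not cloned:
        cloned.append(fn)
        return ["callee.clone"]
    return []

visited = run_iterative(["callee", "caller"], process)
```

The appeal of this scheme is that it never splits or merges SCC data structures in place; it just keeps extending a worklist until a fixed point.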

David

------------------------------

*From: *"Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
*To: *"llvm-dev" <llvm-dev@lists.llvm.org>
*Sent: *Wednesday, June 8, 2016 6:19:03 AM
*Subject: *[llvm-dev] Intended behavior of CGSCC pass manager.

> ...so it is not able to incrementally simplify as it inlines like LLVM
> does.

This incremental simplification is an important feature of our inliner,
and one we should endeavor to keep. We might also want different phases at
some point (e.g. a top-down and a bottom-up phase), but that's another
story.

> 2. What is the intended behavior of CGSCC passes when SCC's are split or
> merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
> some function passes nested inside the CGSCC pass manager (e.g. to
> simplify things after inlining). Consider:

This is not how I thought the current scheme worked :wink: -- I was under the
impression that we had a call graph with conservatively-connected dummy
nodes for external/indirect functions.

Since we have separate nodes for calling into an external function and for
"being called" from an external function, these don't form an SCC. So it
means we can end up merging SCC's, IIUC.

Thanks for clarifying this.

My intuition for this behavior (by looking at a limiting case) is that e.g.
during FullLTO we have `main`, which is "externally called", so that any
function containing an external function call or indirect call may end up
calling back into `main` from the compiler's perspective.
This would result in a huge SCC containing `main` and all functions that
transitively call a function containing an external function call /
indirect call; this doesn't seem like the SCC we want to look at during
inlining.

I.e. the conservatively correct call graph is not the graph that we
necessarily want during inlining. The LazyCallGraph "potential direct call"
approach seems a better fit for what the inliner wants.

The comments in LazyCallGraph.h do clarify this somewhat:

E.g. it says:

/// NB: This is *not* a traditional call graph! It is a graph which models
/// both the current calls and potential calls. As a consequence there are
/// many edges in this call graph that do not correspond to a 'call' or
/// 'invoke' instruction.

-- Sean Silva

> Does anybody have any info/experience about how densely connected the
> ref graph can get in programs that might reasonably be fed to the
> compiler?

I can state that almost all call graphs in compilers include edges for
indirect calls and external functions, so they are already quadratic in
this sense.

If what you state is correct, and we don't have a conservatively correct
call graph, that would be ... interesting.

Mehdi clarified the situation downthread (at least for me).

Now that I look more closely at the comments in LazyCallGraph.h, it
intentionally does not model a call graph in this sense. I.e.
- in a traditional CG, an edge means something like "at runtime there may
be a call from A->B"
- in the LazyCallGraph an edge (a "ref edge" as it calls it) represents
something like "during optimization of this module, we may discover the
existence of a direct call from A->B". There is also a distinguished
subgraph of the ref graph (which I think LazyCallGraph calls just the "call
graph") which represents the actual direct calls that are present in the
module currently.

The comments in LazyCallGraph.h are quite good. The existing CallGraph.h
doesn't touch on this up front in its comments in quite the same way, but
it does at least say that it models calls with two external nodes, like
Mehdi mentioned.

So its edges don't represent "at runtime there may be a call from A->B".
But since it doesn't maintain a "ref graph" I'm not sure what the edges
exactly represent.

-- Sean Silva

From: "Sean Silva" <chisophugis@gmail.com>
To: "Mehdi Amini" <mehdi.amini@apple.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Wednesday, June 8, 2016 4:25:09 PM
Subject: Re: [llvm-dev] Intended behavior of CGSCC pass manager.

> My intuition for this behavior (by looking at a limiting case) is that
> e.g. during FullLTO we have `main`, which is "externally called", so that
> any function containing an external function call or indirect call may
> end up calling back into `main` from the compiler's perspective.
> This would result in a huge SCC containing `main` and all functions that
> transitively call a function containing an external function call /
> indirect call; this doesn't seem like the SCC we want to look at during
> inlining.

No, I thought we special-cased main. Or at least we did until we figured
out that we couldn't do that for all languages, and so we introduced some
kind of "no recurse" attribute?

-Hal

I thought of it as: the edges are "there is a direct call from A → B",
which is a subset of "at runtime there may be a call from A → B".

I think that with all this discussion, it is important to note that (I
think) there is no "correctness" issue at stake (we won't miscompile
anything), but there may be missed optimizations in some cases. I think the
current scheme catches most cases, and when it does not we are just missing
potential inlining. The question may be how many (more) cases we really
need to catch with the new pass manager?
And could a first implementation not catch everything and be improved
incrementally? This comes back somehow to what Hal was mentioning
(reproducing the current behavior before improving it).

------------------------------

*From: *"Sean Silva" <chisophugis@gmail.com>
*To: *"Mehdi Amini" <mehdi.amini@apple.com>
*Cc: *"Hal Finkel" <hfinkel@anl.gov>, "llvm-dev" <llvm-dev@lists.llvm.org>
*Sent: *Wednesday, June 8, 2016 4:25:09 PM
*Subject: *Re: [llvm-dev] Intended behavior of CGSCC pass manager.

------------------------------

*From: *"Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
*To: *"llvm-dev" <llvm-dev@lists.llvm.org>
*Sent: *Wednesday, June 8, 2016 6:19:03 AM
*Subject: *[llvm-dev] Intended behavior of CGSCC pass manager.

Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the
last LLVM bay area social, and partially a discussion about the direction
of the CGSCC pass manager)

A the last LLVM social we discussed the progress on the CGSCC pass
manager. It seems like Chandler has a CGSCC pass manager working, but it is
still unresolved exactly which semantics we want (more about this below)
that are reasonably implementable.

AFAICT, there has been no public discussion about what exact semantics we
ultimately want to have. We should figure that out.

The main difficulty which Chandler described is the apparently quite
complex logic surrounding needing to run function passes nested within an
SCC pass manager, while providing some guarantees about exactly what order
the function passes are run. The existing CGSCC pass manager just punts on
some of the problems that arise (look in CGPassManager::runOnModule,
CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
Chandler has been trying to solve.

(
Why is this "function passes inside CGSCC passes" stuff interesting?
Because LLVM can do inlining on an SCC (often just a single function) and
then run function passes to simplify the function(s) in the SCC before it
tries to inline into a parent SCC. (the SCC visitation order is post-order)
For example, we may inline a bunch of code, but after inlining we can
tremendously simplify the function, and we want to do so before considering
this function for inlining into its callers so that we get an accurate
evaluation of the inline cost.
Based on what Chandler said, it seems that LLVM is fairly unique in this
regard and other compilers don't do this (which is why we can't just look
at how other compilers solve this problem; they don't have this problem
(maybe they should? or maybe we shouldn't?)). For example, he described
that GCC uses different inlining "phases"; e.g. it does early inlining on
the entire module, then does simplifications on the entire module, then
does late inlining on the entire module; so it is not able to incrementally
simplify as it inlines like LLVM does.

This incremental simplification is an important feature of our inliner,
and one we should endeavor to keep. We might also want different phases at
some point (e.g. a top-down and a bottom-up phase), but that's another
story.

)

As background for what is below, the LazyCallGraph tracks two graphs: the
"call graph" and the "ref graph".
Conceptually, the call graph is the graph of direct calls, where indirect
calls and calls to external functions do not appear (or are connected to
dummy nodes). The ref graph is basically the graph of all functions
transitively accessible based on the globals/constants/etc. referenced by a
function (e.g. if a function `foo` references a vtable that is defined in
the module, there is an edge in the ref graph from `foo` to every function
in the vtable).
The call graph is a strict subset of the ref graph.

Chandler described that he had a major breakthrough in that the CGSCC
pass manager only had to deal with 3 classes of modifications that can
occur:
- a pass may e.g. propagate a load of a function pointer into an indirect
call, turning it into a direct call. This requires adding an edge in the
CG but not in the ref graph.
- a pass may take a direct call and turn it into an indirect call. This
requires removing an edge from the CG, but not in the ref graph.
- a pass may delete a direct call. This removes an edge in the CG and
also in the ref graph.

From the perspective of the CGSCC pass manager, these operations can
affect the SCC structure. Adding an edge might merge SCC's and deleting an
edge might split SCC's. Chandler mentioned that apparently the issues of
splitting and merging SCC's within the current infrastructure are actually
quite challenging and lead to e.g. iterator invalidation issues, and that
is what he is working on.

(
The ref graph is important to guide the overall SCC visitation order
because it basically represents "the largest graph that the CG may turn
into due to our static analysis of this module". I.e. no transformation we
can statically make in the CGSCC passes can ever cause us to need to merge
SCC's in the ref graph.
)

I have a couple overall questions/concerns:

1. The ref graph can easily go quadratic. E.g.

typedef void (*fp)();
fp funcs[] = {
  &foo1,
  &foo2,
  ...
  &fooN
};
void foo1() { funcs[something](); }
void foo2() { funcs[something](); }
...
void fooN() { funcs[something](); }

One real-world case where this might come about is in the presence of
vtables.

The existing CGSCC pass manager does not have this issue AFAIK because it
does not consider the ref graph.

Does anybody have any info/experience about how densely connected the ref
graph can get in programs that might reasonably be fed to the compiler?
I just did a quick sanity check with LLD/ELF using
`--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
/ not run out of memory. Based on some rough calculations looking at the
profile, it seems like the entire run of the inliner in the old LTO pipeline
(which is about 5% of total LTO time on this particular example I looked
at) is only 2-3x as expensive as just
`--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction
is certainly not free.

2. What is the intended behavior of CGSCC passes when SCC's are split or
merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
some function passes nested inside the CGSCC pass manager (e.g. to simplify
things after inlining). Consider:

This is not how I thought the current scheme worked :wink: -- I was under the
impression that we had a call graph with conservatively-connected dummy
nodes for external/indirect functions.

The fact that we have separate nodes for calling into an external
function and "being called" from an external function means these don't
form an SCC. So it means we can end up merging SCCs, IIUC.

Thanks for clarifying this.

My intuition for this behavior (by looking at a limiting case) is that
e.g. during FullLTO we have `main`, which is "externally called", so that
any function containing an external function call or indirect call may
end up calling back into main from the compiler's perspective.
This would result in a huge SCC containing `main` and all functions that
transitively call a function containing an external function call /
indirect call; this doesn't seem like the SCC we want to look at during
inlining.

No, I thought we special-cased main. Or at least we did until we figured
out that we couldn't do that for all languages, and so we introduced some
kind of "no recurse" attribute?

This is just a thought experiment I did for the intuition of why modeling
the call graph purely in the sense of "an edge A->B in the call graph means
that at runtime a call from A to B may occur" is not what we want during
inlining. I have no idea what we actually do (I'm still exploring).

Looking naively in

and

I don't see any handling of the norecurse attribute that would prevent us
from forming SCC's, though.
We do have some logic for main, but it doesn't seem like it would affect my
thought experiment:

Like Mehdi pointed out, I think that we avoid forming the kind of giant
SCC implied by "an edge A->B in the call graph means that at runtime a
call from A to B may occur" by having two separate external nodes. This
means an external/indirect function call is fundamentally not tracked as
potentially calling back into `main` (or other external functions in this
module).

-- Sean Silva


I want to clarify a little more on GCC's behavior. GCC has two types of
IPA passes: one is called a simple IPA pass and the other a regular IPA
pass. A simple IPA pass does not transform IR by itself, but it can
have function-level sub-passes that do transformations. If it has function
sub-passes, the function passes will be executed in bottom-up order, just
like LLVM's CGSCC passes. A regular IPA pass is somewhat like the module
pass in LLVM.

GCC's pass pipeline is actually similar to LLVM's pipeline overall: it has
the following components:

(1) lowering passes
(2) small/simple IPA passes including the early optimization pipeline
(bottom-up)
(3) full ipa pipelines
(4) post-ipa optimization and code generation.

The difference is that LLVM's (2) includes only inst-combine and simplify
CFG, but GCC's (2) is larger and also includes an early inlining pass.

So the early inlining is a function pass? In LLVM I think that the function
pass contract would not allow inlining to occur in a function pass (the
contract is theoretically designed to allow function passes to run
concurrently on functions within a module).

GCC's (2) is similar to LLVM's CGSCC pass but with fewer passes. The
inliner also targets tiny functions, which reduces size overall.
LLVM's LTO pipeline is kind of like this too.

GCC's (3) includes a full-blown regular inlining pass. This inliner uses a
priority-order based algorithm, so bottom-up traversal does not make
sense. It requires (2) to have a pass to collect summaries to drive the
decisions (as well as profile data).

Thanks for this clarification; it is always good to know what other
compilers do for comparison.

-- Sean Silva

+cc Easwaran, who is an expert on GCC’s inliner. See my reply below.

------------------------------

*From: *"Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
*To: *"llvm-dev" <llvm-dev@lists.llvm.org>
*Sent: *Wednesday, June 8, 2016 6:19:03 AM
*Subject: *[llvm-dev] Intended behavior of CGSCC pass manager.


2. What is the intended behavior of CGSCC passes when SCC's are split or
merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
some function passes nested inside the CGSCC pass manager (e.g. to simplify
things after inlining). Consider:

This is not how I thought the current scheme worked :wink: -- I was under the
impression that we had a call graph with conservatively-connected dummy
nodes for external/indirect functions. As a result, there is no
semantics-preserving optimization that will merge SCCs, only split them. In
that case, I'd expect that once an SCC is split, we re-run the CGSCC passes
over the newly-separated SCCs. But this corresponds to running over the
"ref graph", as you describe it. I don't understand why we want the
non-conservative graph.

a) These function passes are e.g. now able to devirtualize a call, adding
an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass
(say, the inliner) on this larger SCC?

b) These function passes are e.g. able to DCE a call, removing an edge
from the CG. This converts, say, a CG SCC which is a cycle graph (like
a->b->c->a) into a path graph (a->b->c, with no edge back to a). The
inliner had already visited a, b, and c as a single SCC. Now does it have
to re-visit c, then b, then a, as single-node SCC's?

btw:

One way that I have found it useful to think about this is in terms of the
visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in
Tarjan's strongly connected components algorithm - Wikipedia.
Inside the "strongconnect" routine, when we have identified an SCC (the true
branch of the `if (v.lowlink = v.index)` test), we can visit
stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
things on the stack (the variable `S` in the pseudocode) and we may need to
fix it up (e.g. inlining deleted a function, so we can't have an entry on
the stack). Then, we can run function passes as we pop individual functions
off the stack, but it is easier to think about IMO than merging of SCC data
structures: if we add edges to the CG then we have to do more DFS on the
new edges and if we delete edges then the DFS order of the stack gives us
certain guarantees.
Personally I find this much easier to reason about than the description in
terms of splitting and merging SCC's in the CG and ref graph (which the
LazyCallGraph API makes one think about, since it hides the underlying
Tarjan's algorithm).
The LazyCallGraph API makes the current loop in
http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
very clean, but at least for my thinking about the problem, it seems like
the wrong abstraction (and most of the LazyCallGraph API seems to be
unused, so it seems like it may be overly heavyweight).
E.g. I think that maybe the easiest thing to do is to turn the current
approach inside out: instead of having the pass manager logic be the
"normal code" and forcing the Tarjan algorithm to become a state machine of
iterators, use an open-coded Tarjan algorithm with some callbacks and make
the pass management logic be the state machine.
This will also open the door to avoiding the potentially quadratic size of
the ref graph, since e.g. in the example I gave above, we can mark the
`funcs` array itself as already having been visited during the walk. In the
current LazyCallGraph, this would require adding some sort of notion of
hyperedge.

FWIW, I see no purpose in abstracting one algorithm, especially if that
makes things algorithmically harder. Also, the LazyCallGraph abstraction
and the iterator abstraction seem like separate issues. Iterator
abstractions are often useful because you can use them in generic
algorithms, etc.

Since this is such a high priority (due to blocking PGO inlining), I will
probably try my hand at implementing the CGSCC pass manager sometime soon
unless somebody beats me to it. (I'll probably try the "open-coded SCC
visit" approach).

Another possibility is implementing the new CGSCC pass manager that uses
the same visitation semantics as the one in the old PM, and then we can
refactor that as needed. In fact, that may be the best approach so that
porting to the new PM is as NFC as possible and we can isolate the
functional (i.e., need benchmarks, measurements ...) changes in separate
commits.

I'm in favor of this approach for exactly the reason you mention. Being
able to bisect regressions to the algorithmic change, separate from the
infrastructure change, will likely make things easier in the long run (and
will avoid the problem, to the extent possible, of performance regressions
blocking the pass-manager work).

Yeah. Last night after I went home I was thinking to myself that this
really seems like the obvious path forward. I will start implementing this.

-- Sean Silva


Conceptually, the reference graph should also include variable nodes. With
variable nodes introduced, the quadratic behavior mentioned above can be
avoided.

Yeah, this is what I was trying to get at with the statement "In the
current LazyCallGraph, this would require adding some sort of notion of
hyperedge."
But you are right: from an implementation perspective of a call graph data
structure that is trying to model the "ref graph" of LazyCallGraph, it is
cleaner to have nodes that are not functions than to introduce a notion of
hyperedge.


A very high-level comment: why do we need to update the call graph on the
fly? Can we have more general support for iterative SCC pass invocation?

something like:

1) build the callgraph
2) cache the post-order traversal order

3) if the order list is empty -- done
4) traversal: invoke function passes for each function in the order (from
step 2 or 5). The call graph gets updated on the fly (with new edges, or new
nodes for cloned functions)
5) update the function traversal order with the new nodes and new edges
created in 4)
6) go to step 3).

(sorry for the delayed reply... this is a very poignant question / example)

From the discussion with Chandler, I think he wants to provide more
guarantees to function passes about the visitation order. He will need to
explain his exact concerns. But IIRC the essence of one of the issues is
captured in the example I gave in 2.b) in the OP:

"These function passes are e.g. able to DCE a call, removing an edge from
the CG. This converts, say, a CG SCC which is a cycle graph (like
a->b->c->a) into a path graph (a->b->c, with no edge back to a)."

The issue that I remember Chandler brought up is that before deleting an
edge from an SCC, the nodes in the SCC are "unordered" w.r.t. each other;
but after deleting an edge from the CG which splits the SCC, there is an
order.
In other words, that the cached post-order may no longer be a post-order
traversal of the new graph after a function pass runs. For example, in the
cached post-order traversal we may have entered the SCC `a->b->c->a` via an
edge `x->b` and so our cached post-order traversal visits the functions in
the order `a, then c, then b`. If a function pass visits `c` and deletes
the call `c->a`, then "a, then c, then b" is not a valid post-order for
visiting the functions.

Looking at the explanation I gave, I think there isn't really a problem
here. The cached post-order can only be invalidated for functions that have
already been visited.

Speaking more broadly about the algorithm you just described, did you
intend to omit an SCC visitation step? The goal of the CGSCC pass manager
is the ability to visit an SCC (e.g. inliner visits an SCC), then
immediately run function passes to simplify the result of inlining. Only
after the simplification has occurred do we visit the parent SCC. By
running the simplifications in lock-step with the inliner SCC visitation we
give parent SCC's a more accurate view of the real cost of inlining a
function from a child SCC.

Issues similar to 2.a) (i.e. adding edges to the CG) also affect the
visitation order of SCC's (not just functions). For example, we visit an
SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass
on that SCC, which may add an edge (e.g. promote indirect call to direct
call) that may enlarge the SCC. Do we continue running the remaining
function passes? Do we re-visit this new enlarged SCC? (if so, when?) These
are the types of questions that motivated this post (hence the name
"Intended behavior of CGSCC pass manager.").

-- Sean Silva


As background for what is below, the LazyCallGraph tracks two graphs:
the "call graph" and the "ref graph".
Conceptually, the call graph is the graph of direct calls, where
indirect calls and calls to external functions do not appear (or are
connected to dummy nodes). The ref graph is basically the graph of all
functions transitively accessible based on the globals/constants/etc.
referenced by a function (e.g. if a function `foo` references a vtable that
is defined in the module, there is an edge in the ref graph from `foo` to
every function in the vtable).
The call graph is a strict subset of the ref graph.

Chandler described that he had a major breakthrough in that the CGSCC
pass manager only had to deal with 3 classes of modifications that can
occur:
- a pass may e.g. propagate a load of a function pointer into an
indirect call, turning it into a direct call. This requires adding an edge
in the CG but not in the ref graph.
- a pass may take a direct call and turn it into an indirect call. This
requires removing an edge from the CG, but not in the ref graph.
- a pass may delete a direct call. This removes an edge in the CG and
also in the ref graph.

From the perspective of the CGSCC pass manager, these operations can
affect the SCC structure. Adding an edge might merge SCC's and deleting an
edge might split SCC's. Chandler mentioned that apparently the issues of
splitting and merging SCC's within the current infrastructure are actually
quite challenging and lead to e.g. iterator invalidation issues, and that
is what he is working on.

(
The ref graph is important to guide the overall SCC visitation order
because it basically represents "the largest graph that the CG may turn
into due to our static analysis of this module". I.e. no transformation we
can statically make in the CGSCC passes can ever cause us to need to merge
SCC's in the ref graph.
)

I have a couple overall questions/concerns:

1. The ref graph can easily go quadratic. E.g.

typedef void (*fp)();
fp funcs[] = {
  &foo1,
  &foo2,
  ...
  &fooN
};
void foo1() { funcs[something](); }
void foo2() { funcs[something](); }
...
void fooN() { funcs[something](); }

One real-world case where this might come about is in the presence of
vtables.

The existing CGSCC pass manager does not have this issue AFAIK because
it does not consider the ref graph.

Does anybody have any info/experience about how densely connected the
ref graph can get in programs that might reasonably be fed to the compiler?

I can state that almost all call graphs of compilers include edges for
indirect calls and external functions, so they are already quadratic in
this sense.

If what you state is correct, and we don't have a conservatively correct
call graph, that would be ... interesting.

Mehdi clarified downthread the situation (at least for me).

Now that I look more closely at the comments in
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/LazyCallGraph.h
I see that it intentionally does not model a call graph in this sense. I.e.
- in a traditional CG, an edge means something like "at runtime there may
be a call from A->B"
- in the LazyCallGraph an edge (a "ref edge" as it calls it) represents
something like "during optimization of this module, we may discover the
existence of a direct call from A->B". There is also a distinguished
subgraph of the ref graph (which I think LazyCallGraph calls just the "call
graph") which represents the actual direct calls that are present in the
module currently.

The comments in LazyCallGraph.h are quite good. The existing
CallGraph.h doesn't touch on this up front in its comments in quite the
same way, but it does at least say that it models indirect and external
calls with two external nodes, as Mehdi mentioned:

https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/CallGraph.h#L30
So its edges don't represent "at runtime there may be a call from A->B".
But since it doesn't maintain a "ref graph" I'm not sure what the edges
exactly represent.

I thought of it as the edges are "there is a direct call from A -> B".
Which is a subset of "at runtime there may be a call from A->B".

I think that with all this discussion, it is important to distinguish that
(I think) there is no "correctness" issue at stance (we won't miscompile
anything)

This is a good point worth explicitly noting (and hopefully there are in
fact no passes relying on it to mean "at runtime there may be a call from
A->B").

, but there may be missed optimizations in some cases. I think the current
scheme catches most cases and when it does not we are just missing
potential inlining. The question may be how many (more) cases we really
need to catch with the new pass manager?

I think this only affects inlining of mutually recursive functions. Most
functions are not mutually recursive [citation needed] so I'm not too
worried.
(The main performance-critical case that I can think of using mutual
recursion would be parsers).

-- Sean Silva

What about devirtualization? Or even simply cases where an indirect call
can be folded to a direct call?
Now you are processing the caller without having processed the callee.
I think the current way of handling this is to add the new callee to the
SCC (even though it is not actually participating in it) and continue to
iterate on this SCC. Sounds like a hack, but seems to work in practice.

Mehdi

Hi Sean,

Sean Silva wrote:
> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>
> (this is partially a summary of some discussions that happened at the
> last LLVM bay area social, and partially a
> discussion about the direction of the CGSCC pass manager)

Thanks for writing this up! This sort of thing is very helpful for
people who don't attend the social, or had to leave early. :)

> Chandler described that he had a major breakthrough in that the CGSCC
> pass manager only had to deal with 3 classes of
> modifications that can occur:
> - a pass may e.g. propagate a load of a function pointer into an
> indirect call, turning it into a direct call. This
> requires adding an edge in the CG but not in the ref graph.
> - a pass may take a direct call and turn it into an indirect call. This
> requires removing an edge from the CG, but not
> in the ref graph.
> - a pass may delete a direct call. This removes an edge in the CG and
> also in the ref graph.

At what granularity are we modeling these things? E.g. if SimplifyCFG
deletes a basic block, will we remove call edges that start from that
block?

Note there is a fourth kind of modification that isn't modeled here:
devirtualization, in which we transform IR like

  %fnptr = compute_fnptr(%receiver, i32 <method id>)
  call %fnptr(%receiver)

to (via a pass that understands the semantics of compute_fnptr):

  call some.specific.Class::method(%receiver)

However, at this time I don't think modeling "out of thin air"
devirtualization of this sort is important, since nothing upstream
does things like this (we do have ModulePasses downstream that do
this).

> From the perspective of the CGSCC pass manager, these operations can
> affect the SCC structure. Adding an edge might
> merge SCC's and deleting an edge might split SCC's. Chandler mentioned
> that apparently the issues of splitting and
> merging SCC's within the current infrastructure are actually quite
> challenging and lead to e.g. iterator invalidation
> issues, and that is what he is working on.
>
> (
> The ref graph is important to guide the overall SCC visitation order
> because it basically represents "the largest graph
> that the CG may turn into due to our static analysis of this module".
> I.e. no transformation we can statically make in
> the CGSCC passes can ever cause us to need to merge SCC's in the ref
> graph.
> )

Except in the above "out of thin air" devirtualization case.

I think cross-function store forwarding can also be problematic:

void foo(fnptr* bar) {
  if (bar)
    (*bar)();
}

void baz() { foo(null); }

void caller() {
  fnptr *t = malloc();
  *t = baz;
  foo(t);
}

// RefGraph is
// caller -> baz
// caller -> foo
// baz -> foo

Now the RefSCCs are {foo}, {caller}, {baz} but if we forward the store
to *t in caller into the load in foo (and are smart about the null
check) we get:

void foo(fnptr* bar) {
  if (bar)
    baz();
}

void baz() { foo(null); }

void caller() {
  fnptr *t = malloc();
  *t = baz;
  foo(t);
}

// RefGraph is
// foo -> baz
// baz -> foo
// caller -> foo
// caller -> baz

and now the RefSCCs are {foo, baz}, {caller}

In both of these examples you gave (out-of-thin-air devirtualization and
forwarding into a callee), is the contract of a CGSCC pass being violated?
(I believe the contract is that a CGSCC pass cannot invalidate the
analyses of a different SCC:
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/CGSCCPassManager.h#L104
)

If not, then Chandler's analysis was missing a case (beyond the 3 cases I
mentioned which he had convinced himself were the only possible ones).

But again, I think this is fine since nothing upstream does this at
this time; and we can cross that bridge when we come to it.

> 2. What is the intended behavior of CGSCC passes when SCC's are split or
> merged? E.g. a CGSCC pass runs on an SCC (e.g.
> the inliner). Now we run some function passes nested inside the CGSCC
> pass manager (e.g. to simplify things after
> inlining). Consider:
>
> a) These function passes are e.g. now able to devirtualize a call,
> adding an edge to the CG, forming a larger CG SCC. Do
> you re-run the CGSCC pass (say, the inliner) on this larger SCC?
>
> b) These function passes are e.g. able to DCE a call, removing an edge
> from the CG. This converts, say, a CG SCC which
> is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no
> edge back to a). The inliner had already visited
> a, b, and c as a single SCC. Now does it have to re-visit c, then b,
> then a, as single-node SCC's?

Okay, so this is the same question I wrote above: "At what granularity
are we modeling these things?", phrased more eloquently. :)

> One way that I have found it useful to think about this is in terms of
> the visitation during Tarjan's SCC algorithm.
> I'll reference the pseudocode in the Wikipedia article "Tarjan's
> strongly connected components algorithm".
> Inside the "strongconnect" routine when we have identified an SCC (the
> true branch of the `if (v.lowlink = v.index)` test)
> we can visit stack[v.index:stack.size()] as an SCC. This may or may not
> invalidate some things on the stack (the
> variable `S` in the pseudocode) and we may need to fix it up (e.g.
> inlining deleted a function, so we can't have an
> entry on the stack). Then, we can run function passes as we pop
> individual functions off the stack, but it is easier to
> think about IMO than merging of SCC data structures: if we add edges to
> the CG then we have to do more DFS on the new
> edges and if we delete edges then the DFS order of the stack gives us
> certain guarantees.

I'm not sure how this will be easier. E.g. consider

X -> A
A -> B
B -> C
C -> B
B -> A

Now you're going to pop A,B,C together, as an SCC; and you start
optimizing them, and the edge from B to A falls out. Don't you have
the same problem now (i.e. we've removed an edge from an SCC we're
iterating over)? Perhaps I misunderstood your scheme?

I wasn't really proposing a specific algorithm, but rather a different way
to think about the problem and what semantics we really want (hence the
thread title) and are implementable.

A good example of what I mean is David's example of caching the post-order
traversal: thinking in terms of a raw graph algorithm can provide some
insights into the actual invariants that are (or can be) maintained.
Whereas that example was in terms of just a post-order CG walk, to think
about the whole CGSCC pass manager visitation, thinking in terms of
something like Tarjan's algorithm seems likely to provide insight about the
invariants both on the function order and the SCC order.

(btw, http://www.webgraphviz.com/ helped me visualize your graph; handy!)

-- Sean Silva



What about devirtualization? Or even simply cases where an indirect call
can be folded to a direct call?
Now you are processing the caller without having processed the callee.
I think the current way of handling this is to add the new callee to the
SCC (even though it is not actually participating in it) and continue to
iterate on this SCC. Sounds like a hack, but seems to work in practice.

Ah, yeah. Having a more principled solution to this is one of the main
motivations Chandler mentioned. I forgot about this when writing the OP.
I think this is precisely the thing that the "ref graph" solves. I.e. you
are constrained by the ref graph SCC's and so this kind of unordered
visitation cannot occur (pending closer study of Sanjoy's counterexamples).

-- Sean Silva