Linker option to dump dependency graph

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

Thoughts?

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

Thoughts?

Back when I worked on the linker I pretty much always had a way to dump a graphviz dot file to look at things. Pretty much every graph library/tool can read dot files, and they are easy to hack up a parser for. You can also add attributes to nodes and edges to store arbitrary data.

As for what to put it in, it really depends on how detailed it needs to be. Should symbols and sections be collapsed together? Should it include relocation types? Symbol types/binding/size/etc?

  • Michael Spencer

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

Thoughts?

Back when I worked on the linker I pretty much always had a way to dump a graphviz dot file to look at things. Pretty much every graph library/tool can read dot files, and they are easy to hack up a parser for. You can also add attributes to nodes and edges to store arbitrary data.

That’s an interesting idea.

As for what to put it in, it really depends on how detailed it needs to be. Should symbols and sections be collapsed together? Should it include relocation types? Symbol types/binding/size/etc?

Maybe everything? We can for example emit all symbols and input sections first, and then emit a graph as the second half of the output. E.g.

Symbols:

Sections:

Graph:
1 2 3 // 1st section depends on 3rd section via 2nd symbol
5 1 4 // likewise

I suppose it’s a question of if we want users to need to also read the inputs if they want things like section size and other section/symbol attributes. It would be pretty trivial to include that data as long as we have a format/syntax for it.

dot supports listing nodes first with attributes and then referring to them by name later when listing edges.

  • Michael Spencer

Hello,

I think outputting a dependency graph is a good idea and would enable
some offline analysis. I think that there is some advantage to
building some of the simpler ones in, particularly those that would
need heavy annotations to the dependency graph, in particular unless
we write a sample analysis tool that ships with the release, many
users are going to miss out on useful features as they aren't going to
have the time to build one. I've put some comments inline:

Hi,

I've heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I'd like to start a discussion as to what we can do with it and how to make it possible. I'm also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I'm heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

- Figure out why some library or an object file gets linked.

Arm's proprietary linker has a very helpful feature in verbose mode
where it will report on object loading: global/weak definitions and
global/weak references. For libraries you'd get a message like
selecting member.o from library.a to define symbol S. This resulted in
quite an effective trace of the linker output that could answer most
"why did this library and object file get loaded question?" One thing
a dependency graph might not capture is the order in which events
occur, this can be very useful when debugging problems caused by
library selection order.

- Finding a candidate to eliminate dependency by finding a "weak" link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

- Understanding which of new dependencies increase the executable size the most, compare to a previous build.

Arm's linker, being focused on embedded systems has a useful feature
that summarises the amount of content taken from each object broken
down into code, ro-data, rw-date etc. This can be helpful in the face
of comdat group elimination and optimisations such as garbage
collection and ICF that can be difficult to predict from a dependency
graph. It is true that this information could be added as attributes
but again it may just be easier to write a simple analysis pass over
the output in the linker.

- Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn't actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I'm thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don't have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (<from input section>, <symbol>, <to input section>), which represents an edge, to reconstruct a graph.

Thoughts?

Back when I worked on the linker I pretty much always had a way to dump a graphviz dot file to look at things. Pretty much every graph library/tool can read dot files, and they are easy to hack up a parser for. You can also add attributes to nodes and edges to store arbitrary data.

That's an interesting idea.

As for what to put it in, it really depends on how detailed it needs to be. Should symbols and sections be collapsed together? Should it include relocation types? Symbol types/binding/size/etc?

Maybe everything? We can for example emit all symbols and input sections first, and then emit a graph as the second half of the output. E.g.

Symbols:
  <list of symbols>
Sections:
  <list of sections>
Graph:
1 2 3 // 1st section depends on 3rd section via 2nd symbol
5 1 4 // likewise

I suppose it's a question of if we want users to need to also read the inputs if they want things like section size and other section/symbol attributes. It would be pretty trivial to include that data as long as we have a format/syntax for it.

dot supports listing nodes first with attributes and then referring to them by name later when listing edges.

- Michael Spencer

I've experimented with dot files for this type of thing in the past.
The difficulty is that they get too large to be realistically viewed
very quickly. At that point you need to write scripts to process the
output and in that case you may as well use JSON or XML, which I guess
could easily be processed into dot files. To summarise, I think we may
be able to do quite well with some very simple extra analysis in LLD,
a machine readable dependency graph would also be very useful for the
more complex cases.

Peter

Peter,

Thank you for sharing your experiences and thoughts!

Hello,

I think outputting a dependency graph is a good idea and would enable
some offline analysis. I think that there is some advantage to
building some of the simpler ones in, particularly those that would
need heavy annotations to the dependency graph, in particular unless
we write a sample analysis tool that ships with the release, many
users are going to miss out on useful features as they aren’t going to
have the time to build one. I’ve put some comments inline:

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

Arm’s proprietary linker has a very helpful feature in verbose mode
where it will report on object loading: global/weak definitions and
global/weak references. For libraries you’d get a message like
selecting member.o from library.a to define symbol S. This resulted in
quite an effective trace of the linker output that could answer most
“why did this library and object file get loaded question?” One thing
a dependency graph might not capture is the order in which events
occur, this can be very useful when debugging problems caused by
library selection order.

Slightly offtopic, but if that’s option is useful, maybe we should implement that to lld?

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

Arm’s linker, being focused on embedded systems has a useful feature
that summarises the amount of content taken from each object broken
down into code, ro-data, rw-date etc. This can be helpful in the face
of comdat group elimination and optimisations such as garbage
collection and ICF that can be difficult to predict from a dependency
graph. It is true that this information could be added as attributes
but again it may just be easier to write a simple analysis pass over
the output in the linker.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

Thoughts?

Back when I worked on the linker I pretty much always had a way to dump a graphviz dot file to look at things. Pretty much every graph library/tool can read dot files, and they are easy to hack up a parser for. You can also add attributes to nodes and edges to store arbitrary data.

That’s an interesting idea.

As for what to put it in, it really depends on how detailed it needs to be. Should symbols and sections be collapsed together? Should it include relocation types? Symbol types/binding/size/etc?

Maybe everything? We can for example emit all symbols and input sections first, and then emit a graph as the second half of the output. E.g.

Symbols:

Sections:

Graph:
1 2 3 // 1st section depends on 3rd section via 2nd symbol
5 1 4 // likewise

I suppose it’s a question of if we want users to need to also read the inputs if they want things like section size and other section/symbol attributes. It would be pretty trivial to include that data as long as we have a format/syntax for it.

dot supports listing nodes first with attributes and then referring to them by name later when listing edges.

  • Michael Spencer

I’ve experimented with dot files for this type of thing in the past.
The difficulty is that they get too large to be realistically viewed
very quickly. At that point you need to write scripts to process the
output and in that case you may as well use JSON or XML, which I guess
could easily be processed into dot files. To summarise, I think we may
be able to do quite well with some very simple extra analysis in LLD,
a machine readable dependency graph would also be very useful for the
more complex cases.

I agree with the summary. I believe there are some analyses that are popular and useful for a large number of users, and that kind of analyses should probably be implemented directly to lld, so that the analyses are usable out of the box. Other analyses would be implemented as offline analyses.

As to the file format, I don’t have a strong preference, but if the output could easily become too large that graphviz can’t practically handle, the dot format may not make sense that much. Perhaps JSON is something that people are most familiar with, so I’m inclined to it. (But file format is not the most important topic – what we’ll output is more important.)

Hello,

I think outputting a dependency graph is a good idea and would enable
some offline analysis. I think that there is some advantage to
building some of the simpler ones in, particularly those that would
need heavy annotations to the dependency graph, in particular unless
we write a sample analysis tool that ships with the release, many
users are going to miss out on useful features as they aren’t going to
have the time to build one. I’ve put some comments inline:

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

Arm’s proprietary linker has a very helpful feature in verbose mode
where it will report on object loading: global/weak definitions and
global/weak references. For libraries you’d get a message like
selecting member.o from library.a to define symbol S. This resulted in
quite an effective trace of the linker output that could answer most
“why did this library and object file get loaded question?” One thing
a dependency graph might not capture is the order in which events
occur, this can be very useful when debugging problems caused by
library selection order.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

Arm’s linker, being focused on embedded systems has a useful feature
that summarises the amount of content taken from each object broken
down into code, ro-data, rw-date etc. This can be helpful in the face
of comdat group elimination and optimisations such as garbage
collection and ICF that can be difficult to predict from a dependency
graph. It is true that this information could be added as attributes
but again it may just be easier to write a simple analysis pass over
the output in the linker.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

Thoughts?

Back when I worked on the linker I pretty much always had a way to dump a graphviz dot file to look at things. Pretty much every graph library/tool can read dot files, and they are easy to hack up a parser for. You can also add attributes to nodes and edges to store arbitrary data.

That’s an interesting idea.

As for what to put it in, it really depends on how detailed it needs to be. Should symbols and sections be collapsed together? Should it include relocation types? Symbol types/binding/size/etc?

Maybe everything? We can for example emit all symbols and input sections first, and then emit a graph as the second half of the output. E.g.

Symbols:

Sections:

Graph:
1 2 3 // 1st section depends on 3rd section via 2nd symbol
5 1 4 // likewise

I suppose it’s a question of if we want users to need to also read the inputs if they want things like section size and other section/symbol attributes. It would be pretty trivial to include that data as long as we have a format/syntax for it.

dot supports listing nodes first with attributes and then referring to them by name later when listing edges.

  • Michael Spencer

I’ve experimented with dot files for this type of thing in the past.
The difficulty is that they get too large to be realistically viewed
very quickly. At that point you need to write scripts to process the
output and in that case you may as well use JSON or XML, which I guess
could easily be processed into dot files. To summarise, I think we may
be able to do quite well with some very simple extra analysis in LLD,
a machine readable dependency graph would also be very useful for the
more complex cases.

Peter

Indeed they often get too large for graphviz, but there are a few other dot viewers that can handle huge graphs, such as Gephi. I often use these when graphs get too large for graphviz.

  • Michael Spencer

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

A couple of points:

  • Symbols are not the only kind of dependency edge from one section to another – there’s also SHF_LINK_ORDER. Maybe we can just call the edge “SHF_LINK_ORDER” in that case.

  • Do we want to mark up the GC roots in some way? I imagine that we could do that with a synthetic node that represents the GC root, and then have all roots include edges from it. With my proposal for partitioning, perhaps we could have one GC root node per partition.

Peter

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

A couple of points:

  • Symbols are not the only kind of dependency edge from one section to another – there’s also SHF_LINK_ORDER. Maybe we can just call the edge “SHF_LINK_ORDER” in that case.

  • Do we want to mark up the GC roots in some way? I imagine that we could do that with a synthetic node that represents the GC root, and then have all roots include edges from it. With my proposal for partitioning, perhaps we could have one GC root node per partition.

I think we should mark up the GC root in some way. One thing to note is that not only input sections but also symbols can be GC root. In terms of the graph, both edge and vertex should have a “GC root” bit.

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

A couple of points:

  • Symbols are not the only kind of dependency edge from one section to another – there’s also SHF_LINK_ORDER. Maybe we can just call the edge “SHF_LINK_ORDER” in that case.

  • Do we want to mark up the GC roots in some way? I imagine that we could do that with a synthetic node that represents the GC root, and then have all roots include edges from it. With my proposal for partitioning, perhaps we could have one GC root node per partition.

I think we should mark up the GC root in some way. One thing to note is that not only input sections but also symbols can be GC root. In terms of the graph, both edge and vertex should have a “GC root” bit.

You can represent both with a synthetic GC root vertex, though? e.g.

GC root --[ExportedFunction]–> .text.ExportedFunction
GC root --[.init_array]–> .init_array.InitializedGlobal

Peter

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

A couple of points:

  • Symbols are not the only kind of dependency edge from one section to another – there’s also SHF_LINK_ORDER. Maybe we can just call the edge “SHF_LINK_ORDER” in that case.

  • Do we want to mark up the GC roots in some way? I imagine that we could do that with a synthetic node that represents the GC root, and then have all roots include edges from it. With my proposal for partitioning, perhaps we could have one GC root node per partition.

I think we should mark up the GC root in some way. One thing to note is that not only input sections but also symbols can be GC root. In terms of the graph, both edge and vertex should have a “GC root” bit.

You can represent both with a synthetic GC root vertex, though? e.g.

GC root --[ExportedFunction]–> .text.ExportedFunction
GC root --[.init_array]–> .init_array.InitializedGlobal

I think that should work, but I’m not sure if this is easier to handle than adding a bit to each vertex/edge.

Hi,

I’ve heard people say that they want to analyze dependencies between object files at the linker level so that they can run a whole-program analysis which cannot be done at the compiler that works for one compilation unit at a time. I’d like to start a discussion as to what we can do with it and how to make it possible. I’m also sharing my idea about how to make it possible.

Dependency analyses
First, let me start with a few examples of analyses I’m heard of or thinking about. Dependencies between object files can be represented as a graph where vertices are input sections and edges are symbols and relocations. Analyses would work on the dependency graph. Examples of analyses include but not limited to the following:

  • Figure out why some library or an object file gets linked.

  • Finding a candidate to eliminate dependency by finding a “weak” link to a library. We can for example say the dependency to a library is weak if the library in the graph can be unreachable if we remove N edges from the graph (which is likely to correspond to removing N function calls from the code), where N is a small number.

  • Understanding which of new dependencies increase the executable size the most, compare to a previous build.

  • Finding bad or circular dependencies between sub-components.

There would be many more analyses you want to run at the linker input level. Currently, lld doesn’t actively support such analyses. There are a few options to make the linker emit dependency information (e.g. --cref or -Map), but the output of the options is not comprehensive; you cannot reconstruct a dependency graph from the output of the options.

Dumping dependency graph
So, I’m thinking if it would be desirable to add a new feature to the linker to dump an entire dependency graph in such a way that a graph can be reconstructed by reading it back. Once we have such feature, we can link a program with the feature enabled and run any kind of dependency analysis on the output. You can save dumps to compare to previous builds. You can run any number of analyses on a dump, instead of invoking the linker for each analysis.

I don’t have a concrete idea about the file output format, but I believe it is essentially enough to emit triplets of (, , ), which represents an edge, to reconstruct a graph.

A couple of points:

  • Symbols are not the only kind of dependency edge from one section to another – there’s also SHF_LINK_ORDER. Maybe we can just call the edge “SHF_LINK_ORDER” in that case.

  • Do we want to mark up the GC roots in some way? I imagine that we could do that with a synthetic node that represents the GC root, and then have all roots include edges from it. With my proposal for partitioning, perhaps we could have one GC root node per partition.

I think we should mark up the GC root in some way. One thing to note is that not only input sections but also symbols can be GC root. In terms of the graph, both edge and vertex should have a “GC root” bit.

You can represent both with a synthetic GC root vertex, though? e.g.

GC root --[ExportedFunction]–> .text.ExportedFunction
GC root --[.init_array]–> .init_array.InitializedGlobal

I think that should work, but I’m not sure if this is easier to handle than adding a bit to each vertex/edge.

It seems like it would make graph reachability queries a little easier to express. Like if you wanted to find out why section “foo” is in your program, you could ask “what’s the path from ‘GC root’ to ‘foo’”. But with a bit on vertices/edges you need to ask “what’s the path from some vertex/edge with the ‘GC root’ bit set to ‘foo’”. I don’t have much actual experience with graph query languages, though.

Peter

To summarise, I think we may
be able to do quite well with some very simple extra analysis in LLD,
a machine readable dependency graph would also be very useful for the
more complex cases.

Strongly agree. The linker based dependency graph would be very useful for Uefi firmware. Below are my usage examples:
1. I need to detect the redundant code in my firmware, and I once wrote a analysis tool to compare the IR level symbols and call graph info before any optimization and after full optimization (e.g. LTO). But the IR level info does not support assembly code info well. So, there are many dependency information missing and false positive in my analysis tool. It will be more sound if the linker can help output complete and accurate dependency graph for final executable.
2. I need a tool to analyze and track the firmware module accurate dependency for build cache soundness. Build performance is now a pain point in our CI system because every patch need to verify on many build targets in our side. We hope to enable the build cache (both module level and file level) to accelerate the build time. For module level build cache enabling, a very important problem is how to know the module's accurate dependency efficiently. I'm looking forward to the linker based dependency graph feature.

Thanks
Steven

+1 for graphviz dot format, so that it can be consumed by any one of many existing graph visualization tools.

I hacked up a patch to make lld output a dependency graph in the graphviz “dot” format.

https://gist.github.com/rui314/4eab9f328a5568b682d11c84d328cdaa – this is a patch, which is just visiting all input sections and relocations. Note that this is far from completion but just a proof-of-concept.

https://gist.github.com/rui314/5e85c559835ecddad46dcf02fe3ffafc is a result of static-linking a “hello world” program.

https://rui314.github.io/hello.svg – I rendered the above dot file with graphviz sfdp engine. The rendered graph is too large and very hard to read. Apparently, I need a better visualization tool.

You might have realized this already but it’s probably not a good idea to use InputSection::Relocations for this because that ends up missing anything that becomes a dynamic relocation. I reckon that the code should be doing exactly what MarkLive.cpp is doing.

Peter

Hi Rui,

https://gist.github.com/rui314/5e85c559835ecddad46dcf02fe3ffafc is a result of static-linking a “hello world” program.

I cannot see the main() function symbol in the dependency result. Shouldn’t the dependency result include all linked symbols?

Thanks

Steven

One thing a dependency graph might not capture is the order in which events occur, this can be very useful when debugging problems caused by library selection order.

The event stream sounds like a more fine-grained --trace (-t).

(<from input section>, <symbol>, <to input section>)

In --no-gc-sections mode and in some analysis, the file name part of
the input section should be good enough.

section size and other section/symbol attributes

If such customization is favored and the complexity isn't a big issue,
it can probably be implemented as format specifiers (I'm thinking of
printf, ps -o, date, ...). The design of
GitHub - Juniper/libxo: The libxo library allows an application to generate text, XML, JSON, and HTML output using a common set of function calls. The application decides at run time which output style should be produced. can be used for reference.

We shall flesh out the possible vertex/edge types and additional
information that users may expect.

Just wanted to check in on this - did your patches make it past the prototype phase?

No I didn’t.