Thanks everyone for the insightful comments and suggestions! I scoped this a bit more and prepared a draft implementation plan. Any feedback is welcome.
TL;DR: we can add some components to LLVM as generic libraries and make them useful for other tools:
- DWARF Write - there is a library for parsing DWARF objects, but no generic library for writing them.
- DWARF Graph - a set of classes and routines for building a directed graph that represents the DWARF of a compile unit, an object file, or multiple files.
Other components are specific to dwarfreduce:
- Passes that transform a DWARF graph.
- Reduction logic.
- Interface with the test program.
DWARF reduce algorithm
1. Input DWARF: read DWARF from an object file, or from a separate file in the case of split DWARF.
2. Build a graph where nodes represent DIEs and edges are dependencies (references) between them.
3. Run a sequence of passes to remove some nodes from the graph.
4. Output DWARF: write the graph back in the same format as the original.
5. Output metadata: write additional information attached to nodes.
6. Call a test program:
   - If it succeeds, go to (3).
   - If the test program fails, roll back the changes made to the graph, mark the affected nodes as “interesting”, go to (3), and take this new information into account when selecting nodes to remove.
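The iterate/roll-back loop in step (6) can be sketched in isolation. This is a minimal model, not the actual tool: nodes are plain integers, the user's test program is a predicate, and a snapshot of the node set stands in for graph rollback; all names here are hypothetical.

```cpp
#include <cassert>
#include <set>

// Hypothetical minimal model of the reduction loop. A real implementation
// would operate on a DWARF graph and invoke an external test program.
struct Graph {
  std::set<int> Nodes;       // surviving DIEs
  std::set<int> Interesting; // nodes a failed test told us to keep
};

// Stand-in for running the user-provided test program on the reduced DWARF.
using TestFn = bool (*)(const std::set<int> &Nodes);

// One reduction step: try to remove a candidate node, keep the change only
// if the test program still succeeds; otherwise roll back and mark the node
// "interesting" so later passes skip it.
bool reduceStep(Graph &G, int Candidate, TestFn StillInteresting) {
  if (G.Interesting.count(Candidate))
    return false;                   // feedback from an earlier failed run
  std::set<int> Snapshot = G.Nodes; // cheap rollback for this sketch
  G.Nodes.erase(Candidate);
  if (StillInteresting(G.Nodes))
    return true;                    // commit the removal
  G.Nodes = Snapshot;               // roll back the graph
  G.Interesting.insert(Candidate);  // remember for future passes
  return false;
}
```

In the real tool the snapshot would likely be replaced by an undo log or by re-reading the last known-good DWARF, since copying a large graph on every step could be expensive.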
Input DWARF
Use the existing LLVM API for reading a DWARF object: llvm/DebugInfo/DWARF/DWARFContext.h, DWARFObject.h, and others.
DWARFContext parses the input and allows iterating over compile units recursively and accessing individual DWARFDie objects.
It should also be possible to run reduction on multiple input object files. This can be done in a script, separate from the main llvm-dwarfreduce program.
Build a DWARF Graph
DWARF Graph is a directed graph with nodes representing DWARFDie objects such as types, functions, arguments, etc. Edges represent dependencies between them: a record depends on fields and member functions, functions depend on arguments, arguments depend on types, etc. Cycles are possible when a node indirectly references itself (a linked list type, for example, contains a pointer to itself).
Nodes are created by iterating recursively over DWARFDie objects of a compile unit.
Edges (dependencies) may be a bit tricky:
- We can find direct dependencies by checking DIE attributes with DWARFFormValue::FC_Reference values. These can be dereferenced to find the offset of the dependency DIE.
- If any other kinds of dependencies are possible (I don’t know), we have to find them as well. If we miss any, the output DWARF will be invalid.
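The edge-discovery step can be illustrated with a simplified model. This sketch does not use the LLVM API: instead of walking DWARFDie attributes and checking for FC_Reference form classes, each mock DIE simply carries the already-resolved target offsets of its reference-class attributes; `Die`, `buildEdges`, and `EdgeMap` are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical simplified DIE: a real implementation would iterate over the
// attributes of a DWARFDie and collect those whose form class is
// DWARFFormValue::FC_Reference, dereferencing them to target offsets.
struct Die {
  uint64_t Offset;                  // section offset identifying the DIE
  std::vector<uint64_t> RefOffsets; // targets of reference-class attributes
};

// Edges point from a DIE to every DIE it depends on.
using EdgeMap = std::map<uint64_t, std::vector<uint64_t>>;

EdgeMap buildEdges(const std::vector<Die> &Dies) {
  EdgeMap Edges;
  for (const Die &D : Dies) {
    Edges[D.Offset]; // ensure every node appears, even ones with no edges
    for (uint64_t Target : D.RefOffsets)
      Edges[D.Offset].push_back(Target); // direct dependency edge
  }
  return Edges;
}
```

Note that a self-referencing DIE (the linked-list example above) simply produces an edge from a node to itself, which the graph representation must tolerate.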
llvm::DirectedGraph, DGNode, and DGEdge are used to implement the DWARF Graph. With llvm::GraphTraits and llvm::DOTGraphTraits it is possible to reuse LLVM graph algorithms and graphviz output support.
Although graphviz output is quite easy to produce, visualization may be problematic due to the size of DWARF graphs even for simple programs. Visualizing large graphs is a difficult problem in its own right, and there are various methods and tools that help with it.
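To show how simple the DOT output itself is, here is a sketch. The real tool would go through llvm::DOTGraphTraits rather than emit text by hand; `toDot` and the plain adjacency map are stand-ins for this example only.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical DOT emitter for a dependency graph keyed by node ID.
// Produces text that graphviz (dot) can render directly.
std::string toDot(const std::map<int, std::vector<int>> &Edges) {
  std::ostringstream OS;
  OS << "digraph dwarf {\n";
  for (const auto &[Node, Deps] : Edges) {
    OS << "  n" << Node << ";\n"; // one vertex per DIE
    for (int Dep : Deps)
      OS << "  n" << Node << " -> n" << Dep << ";\n"; // dependency edge
  }
  OS << "}\n";
  return OS.str();
}
```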
DWARF reduction passes
Passes transform a DWARF graph by removing nodes and edges, taking information from previous passes and test runs into account.
The sequence of passes is important. Due to the size of a typical graph, we first need to run passes with low computational complexity. These passes should remove a significant portion of the graph and allow passes with high complexity to execute in reasonable time.
Passes should be configurable via options, so a user can limit or guide them to some degree when the generic algorithm fails.
The following passes may be useful:
1. Remove leaf nodes. A leaf node is formed when a DIE is not referenced by any other DIE. These can be removed without breaking the DWARF structure. The pass can iterate until there are no more leaf nodes (i.e. only nodes that form cycles remain).
2. Remove edges. Different types of edges (DIE references) may be eliminated or substituted. Typed pointers can be substituted with void pointers, fields can be replaced by an untyped “padding” field, friend references can be removed, etc. Removing such edges may break cycles and allow more leaf nodes to be removed in pass (1).
3. Remove sub-graphs. Remove a node (the root) and every node that depends on it. Selecting a root can be tricky. Ideally we want to identify clusters of interconnected nodes and remove these clusters one by one, but such algorithms have high complexity. We can use heuristics instead - the goal is to make progress with reduction, not to find the most optimal way of doing it. Feedback from the test program can be used to improve accuracy.
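Pass (1) can be sketched as a fixpoint iteration over a plain adjacency map. This is an illustrative model, not the DirectedGraph-based implementation: nodes are integers, edges map a node to its dependencies, and a node is a leaf when no live node references it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Edges map each node to the nodes it depends on (its references).
using Graph = std::map<int, std::vector<int>>;

// Hypothetical sketch of the leaf-removal pass: repeatedly delete nodes
// that no other live node references, until only cycles remain.
std::set<int> removeLeaves(Graph G) {
  bool Changed = true;
  while (Changed) {
    Changed = false;
    // Collect every node that some live node references.
    std::set<int> Referenced;
    for (const auto &[Node, Deps] : G)
      for (int Dep : Deps)
        if (G.count(Dep))
          Referenced.insert(Dep);
    // Delete unreferenced (leaf) nodes; deleting them may expose new leaves.
    for (auto It = G.begin(); It != G.end();) {
      if (!Referenced.count(It->first)) {
        It = G.erase(It);
        Changed = true;
      } else {
        ++It;
      }
    }
  }
  std::set<int> Survivors;
  for (const auto &[Node, Deps] : G)
    Survivors.insert(Node);
  return Survivors;
}
```

On a graph with a chain feeding into a cycle, the chain peels away one layer per iteration and only the cycle survives, matching the "only nodes that form cycles remain" termination condition above.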
Output DWARF
Unlike DWARF input, a nice standalone LLVM library for writing DWARF files does not exist (yet). There are at least two independent implementations: one in AsmPrinter, another in DWARFLinker.
AsmPrinter uses its own version of the DWARF classes internally (as opposed to the DWARF reading API), and there is no generic library interface - it works with LLVM IR metadata as part of regular CodeGen.
DWARFLinker uses the same API that we use to read DWARF, but the implementation is tied to the linker. We can refactor it into a generic library and use it in llvm-dwarfreduce to output a modified graph.
Metadata
Nodes of a DWARF graph (DIEs) can have metadata attached to them.
When removing a node causes the test program to fail, we can mark the node as “interesting” and avoid removing it during subsequent passes.
DWARF DIEs are identified by their offset from the beginning of the section, so these identifiers change whenever we modify the DWARF. We can attach a persistent ID as metadata to track a node between iterations.
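The persistent-ID idea can be sketched as a small offset-to-ID table that survives rewrites. The class and method names here are invented for the example; the real metadata would hang off the graph nodes.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical persistent node IDs: DIE offsets change every time the DWARF
// is rewritten, so each node gets a stable ID once, and after a rewrite the
// new offsets are re-bound to the existing IDs.
class NodeIds {
  std::map<uint64_t, uint64_t> OffsetToId; // current offset -> stable ID
  uint64_t NextId = 0;

public:
  // Assign a fresh ID to a node first seen at this offset.
  uint64_t assign(uint64_t Offset) { return OffsetToId[Offset] = NextId++; }

  uint64_t lookup(uint64_t Offset) const { return OffsetToId.at(Offset); }

  // After rewriting the DWARF, rebind a surviving node from its old offset
  // to the new one; the ID (and any metadata keyed on it, such as the
  // "interesting" mark) is preserved across iterations.
  void rebind(uint64_t OldOffset, uint64_t NewOffset) {
    uint64_t Id = OffsetToId.at(OldOffset);
    OffsetToId.erase(OldOffset);
    OffsetToId[NewOffset] = Id;
  }
};
```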
Test program
The test program is provided by the user. It checks whether the modified DWARF is still “interesting” (for example, whether it still triggers a bug). If it is not, we have to roll back the graph to the previous state, apply the feedback to the algorithm, and try again.
If the test program shows that the modified DWARF is fine, we can use it as the input and perform another iteration of reduction.
It is important to balance the run time of the reduction algorithm against the run time of the test program.
The interface between dwarfreduce and the test program is still TBD. We can output the DWARF separately and let the test program integrate it back into the source binary, or we can try to modify the source binary in place.