[GSoC] Implement a single updater class for Dominators - Final Report

Hello everyone,

During this summer I was working on implementing a single updater
class for Dominators.

Note 1): If you are just concern about how this can affect your
development, please see section 4 - Actions.
Note 2): The full length report, which details the contribution,
challenges can be found in [1].

1. Background
The API for updating the dominator and postdominator tree provided by
LLVM before this project was fragmented. Different functions affecting
these data structures had to decide whether update them eagerly or to
perform updates on the DominatorTree using a special wrapper class
DeferredDominance to enable lazy updates and deduplication. Passes and
utils also needed to handle some corner cases (e.g., explicitly remove
forward-unreachable nodes from the PostDominatorTree before deletion)
when they updated dominators.

Thus, there was a strong need to implement a single updater class for
abstracting away the tree update strategies and which trees are
actually being updated to provide a uniform and clean interface for
pass developers to update dominators. Moreover, after implementing the
new updater class, it is also feasible to implement a timing method on
it to take a closer look than before on where is beneficial to improve
the compilation time. The new updater class is also able to offer
developers an easier way to test whether it is profitable to preserve
the DominatorTree along the optimization pipeline combing with
automatic CFG snapshot and diff algorithms implemented in this new
updater.

2. Tasks completed
The work during this summer can be divided into the following four parts:

a) Explore the drawbacks of the DomintorTree interface, cleanup the
codebase, implement the new DomTreeUpdater interface and fit the
DomTreeUpdater in all scenarios.

The DomTreeUpdater is designed and implemented [2] to have the
functionality of the previous DominatorTree wrapper class
DeferredDominance. It can also work the same as a raw DominatorTree
and PostDominatorTree. First, it can provide 12 ways of updating
dominators. ({None, DomTree, PostDomTree, Both Trees} x {Eager, Lazy,
(Auto - proposed/not merged)}) It greatly expands and reuses the
previous interface of only 3~6 update strategies (often {None,
DomTree, Lazy} + {None, DomTree, Eager}) provided by utils in
Local.cpp and BasicBlockUtils.cpp. Second, it does not require users
to write redundant calls and worry about corner cases mentioned in
[3]. It also enables the utils to be cleaner by combining the
parameters previously used by three different classes (i.e.
DeferredDominance, DominatorTree, PostDominatorTree) into one and
removing unnecessary branching logic. Furthermore, it exposes more
function (e.g., callbacks on BasicBlock deletion) than its predecessor
(DeferredDominance) and is less confusing to users. (e.g.,
recalculate() under Lazy UpdateStrategy is not related to the internal
state of the updater any more)

b) Transform codebase to use the new DomTreeUpdater interface to make
updating dominators more consistent and uniform.

For converting existing passes to use the new interface, now all the
interfaces that previously accepted DeferredDominance and all passes
that used DeferredDominace or both DomTree and PostDomTree are
migrated to use the new interface [4]. It can enable passes to
preserve PostDominatorTree easily in the future and make utils easier
to maintain.

c) Implement timing methods to find out where is profitable to improve
the compilation time. Explore and preserve the DominatorTree and
PostDominatorTree along the LLVM optimization pipeline in various
passes.

After I implemented the DomTreeUpdater class, I began implementing a
timer and several statistics to find out where the most time is
consumed by the DominatorTree related calculation. After the timing
patch [5] is ready, I discovered that recalculation takes most of the
time in the dominatortree calculation (See [Statistics] Benchmarks on
Lazy UpdateStrategy [6]). Because of that, I began to look into how to
preserve the DominatorTree along the pipeline. I used the
“-debug-pass=Structure” argument to see where the domtree is
invalidated in the legacy pass manager and looked into the source code
of those passes to find out whether it is good to preserve the
dominators in it. I also wrote a tool to analyze where the
DominatorTree is used and invalidated in the new Pass Manager. Because
the new pass manager schedules passes dynamically, I used multiple
inputs to find out the results. I made several improvements to the
pipeline [7-9] and wrote a report ([Dominators] Preserving the
DominatorTree along the pipeline [10]).

d) Explore ways to make passes which cannot submit accurate updates to
use the DomTreeUpdater safely and ways to make developers of complex
passes to test whether it is profitable to preserve the DomintorTree
along the pipeline.

During the transformation of existing passes and transforms, Brian
Rzycki reminded me of that the JumpThreading pass relies on an
undocumented behavior of the predecessor of the DomTreeUpdater (i.e.
removing invalid updates which never actually happened). JumpThreading
Pass currently relies on some internal state of the DomTreeUpdater to
work. Because the pass itself is complicated to test thoroughly, and
DominatorTree updates is hard to get synced with the CFG in the pass
because of its complexity, the safest choice is to implement a method
which uses an algorithm to find out the difference between the CFG of
the previous state when the DominatorTree get initialized or updated
and the current state. The algorithm views the CFG as a set of edges
and the diff between two CFGs can then be calculated using set
difference.

I implemented this functionality in three different ways. The first
one named “Auto[1]” in [11] only needs the user to submit the
BasicBlocks going to be deleted. The second one named “Auto[2]” in
[12] needs the same information the user submits currently using the
Lazy UpdateStrategy. The third one named “Auto[3]” in [13] needs the
user to submit extra information before the CFG changes. I implemented
timing patches [16-18] on these three implementations. I benchmarked
them [19-21] and found the third one was almost in line with the
current implementation, which is the most promising. But due to the
lack of time before the project ends, its performance was not fully
examined to get merged into the LLVM codebase. I think the next step
after the project is to find out ways to optimize the storage time
used by this implementation to get it faster. Though the first
implementation is slower than the third one, I think it can serve as a
tool to test whether it is profitable to preserve the DominatorTree
Analysis in a complex pass. I implemented this idea on the SimplifyCFG
pass in [14]. The statistics [15] shows that currently it is not
profitable to do so and it deserves further investigation after the
project.

3. Future work
a) I mentioned in my initial proposal to implement a pruning algorithm
to prune updates to the PostDominatorTree by getting information from
the up-to-date DominatorTree to improve the compilation time. After
benchmarking [6], I found most time spent by the dominator tree
calculation is reconstruction of the (forward) dominator tree,
problems with updating dominators in JumpThreading exposed at the same
time and we need to figure out a smarter way to update the
DominatorTree in JumpThreading and save time consumed by
reconstruction. After a discussion with my mentor, we decided to look
into ways to improve the compilation time by preserving the
DominatorTree / PostDominatorTree along the pipeline and explore ways
on CFG snapshotting and diffing with hints-based update strategy.

b) The JumpThreading Pass in the current LLVM codebase heavily relies
on the internal state of the DomTreeUpdater which makes it hard and
tricky to update the DominatorTree and the PostDominatorTree in this
pass.

The JumpThreading Pass can submit an update which is maybe already
known by the DominatorTree which makes updates cancel with each other
inside the DomTreeUpdater which results in a DominatorTree getting
unsynced with the current CFG.

This problem can be solved by applying the Auto UpdateStrategy patches
mentioned above. We need to continue to work on improving the
compilation time performance before those patches go upstream.

c) From the report I created [10], several passes still invalidates
the DominatorTree and the PostDominatorTree in the new pass manager.
It is possible to improve the compilation time by preserving analysis
in those passes.

d) The SimplifyCFG Pass consumes most of the time of the DominatorTree
calculation by invalidating it several times along the LLVM
optimization pipeline.

A work-in-progress patch [14] has been created to experimentally
preserve the DominatorTree in the SimplifyCFG. Though it suffers a
worse performance than the baseline (the current implementation in
LLVM revision 339227), it is valuable to look into why it takes so
long to update the DominatorTree after SimplifyCFG runs and whether it
can be improved.

4. Actions
a) For pass developers:
If you want to update the DominatorTree lazily or do deduplication
before submitting updates, please rebase your patch using
DeferredDominance class to use the DomTreeUpdater class.

Current passes preserving the DominatorTree by updating it
incrementally and using utils which accept the DomTreeUpdater provided
by Local.cpp and BasicBlockUtils.cpp can gain free PostDominatorTree
preservation ability by migrating from raw DominatorTree updating API
to the DomTreeUpdater.

If you are currently writing a complex pass and wondering whether it
is beneficial to preserve the DominatorTree, you can decide it by
having a try using [11] and [16] and see the time consumed by “DomTree
Calculation”. See [14] for an example.

b) For developers working on utils (Local.cpp and BasicBlockUtils.cpp):
Please make functions updating the DominatorTree using the incremental
way via the DomTreeUpdater. It is needed to check whether it is
compatible with both DominatorTree and PostDominatorTree under both
updating strategies (Eager/Lazy).

5. Acknowledgements
I would like to offer my special thanks to my mentor Jakub Kuderski
for patiently guiding me on this project and providing me with
valuable reviews and advice. I would also like to thank Brian Rzycki
for advising me and reviewing my patches. Thanks to all LLVM community
members for their support during the project.

I would like to continue maintaining the DomTreeUpdater after the project :).

Best,
Chijun Sima

[1] https://docs.google.com/document/d/1hubvaiQxQSBR9H9TOnH1nrtSusEr3vg5sXVLGj7xrHE/edit?usp=sharing
[2] https://reviews.llvm.org/D48383
[3] http://lists.llvm.org/pipermail/llvm-dev/2018-June/123883.html
[4] https://reviews.llvm.org/D48967
[5] https://reviews.llvm.org/D50300
[6] https://reviews.llvm.org/P8098
[7] https://reviews.llvm.org/D48914
[8] https://reviews.llvm.org/D49982
[9] https://reviews.llvm.org/D49988
[10] http://lists.llvm.org/pipermail/llvm-dev/2018-August/125185.html
[11] https://reviews.llvm.org/D50302
[12] https://reviews.llvm.org/D50308
[13] https://reviews.llvm.org/D50311
[14] https://reviews.llvm.org/D50305
[15] https://reviews.llvm.org/P8097
[16] https://reviews.llvm.org/D50303
[17] https://reviews.llvm.org/D50309
[18] https://reviews.llvm.org/D50313
[19] https://reviews.llvm.org/P8099
[20] https://reviews.llvm.org/P8100
[21] https://reviews.llvm.org/P8101