[GSOC 2018] Implement a single updater class for Dominators

Hello,

I’m an undergraduate student studying CS in the South China University
of Technology.

I have been using clang compiler and related tools since I started
studying C++ and I would like to work on LLVM in this year’s GSoC. I
am interested in “Implement a single updater class for Dominators”.
[1] I have achieved a bronze medal in the 2017 ACM-ICPC Asia Xian
Regional Contest [2] (being a member of the team “Charizard”) thus I
think I have some knowledge on basic tree/graph algorithms and data
structures.

I have gone through most of the LLVM Kaleidoscope tutorial and I have
watched the video of the presentation “Dominator Trees and incremental
updates that transcend time” presented on the 2017 LLVM Developers’
Meeting. I have also started to understand the algorithm mentioned in
the comments of the code related to the dominator tree. I have created
a Bugzilla account and I am now working on a small bug related to
syntax warning (Sema).

I am wondering if there are some other bugs or materials I can go over
in order to achieve a better understanding on the LLVM’s codebase (or
the project mentioned above)? Which part of the codebase do you
recommend to view to learn more about the usage and drawbacks of the
current dominator tree related API? I would appreciate it.

Best regards,

Chijun Sima

[1] https://llvm.org/OpenProjects.html#llvm_domtree_updater
[2] https://icpc.baylor.edu/regionals/finder/asia-xian-2017/standings

Hi Chijun,

Thanks for your interest in the project.

I have gone through most of the LLVM Kaleidoscope tutorial and I have
watched the video of the presentation “Dominator Trees and incremental
updates that transcend time” presented on the 2017 LLVM Developers’
Meeting. I have also started to understand the algorithm mentioned in
the comments of the code related to the dominator tree.

This sounds like a good start, but I don’t think there’s strong need to understand the Depth Based Search algorithm in detail – grasping the main idea behind it should be enough. If you want to have some rough idea on how it works you can try to run domtree tests with -debug-only=dom-tree-builder.
To run just the IR tests for dominators, you can do: ./unittests/IR/IRTests --gtest_filter=DominatorTree*. Additionally, you can also play with Holder.F->viewCFG() and DT->viewGraph() to see what’s going on inside the tests.

I am wondering if there are some other bugs or materials I can go over
in order to achieve a better understanding on the LLVM’s codebase (or
the project mentioned above)? Which part of the codebase do you
recommend to view to learn more about the usage and drawbacks of the
current dominator tree related API? I would appreciate it.

I think that the programmer’s manual is extremely useful when you want to get more familiar with the code base: http://llvm.org/docs/ProgrammersManual.html. I would recommend to focus on how to interact with IR, llvm’s data structures, and debug utilities.
The most important files related to the projects are probably: GenericDomTreeConstruction.h, GenericDomTree.h, Dominators.h, Dominators.cpp.

If you want to see the incremental API in action, take a look at some simpler transforms like loop deletion and aggressive dead code elimination (ADCE). The problems the project mentions can be seen in various ‘helper’/‘utils’ functions used by transforms. Take a look at these code reviews, especially the functions that can accept DT* and/or DDT*:
https://reviews.llvm.org/D41302
https://reviews.llvm.org/D40146
https://reviews.llvm.org/D42804

I think that it’s worth adding that a better API is only one part the project; the second, perhaps bigger and more fun, is to figure out how to prune redundant PostDomTree updates when we have a fully-updated DomTree available.

Feel free to ask more questions should you have any – I and others will be more than happy to clarify.

Best,
Kuba

Hi Kuba,

Thanks for your advice in your previous letter.

During last week, I have read the documents on Doxygen and the source
code of the DomTreeBase/DomTree/PostDomTree/DeferredDominance class, I
believe now I have a much better understanding on the relationship
between these classes and how DeferredDominance class performs lazy
updates. I have also learnt the current usage and drawbacks of the
fragmented API by looking into how several ‘transform util’ functions
use these APIs and running the IR tests. Furthermore, I have read the
code reviews you mentioned to get the idea of previous approaches on
implementing the updater class and so on.

I have done some early sketch on the API of the new updater class.

From my current understanding, to solve the fragmentation problem of

the API, the new class first, need to maintain the DomTree and
PostDomTree class and deprecate the DeferredDominance class. This
approach can save redundant code on calling the update function on
both DomTree and PostDomTree. Second, the API should be something
looks like DT.update(updateStrategy::lazy,updateKind::delete,A,B) or
DT.applyUpdates(updateStrategy::lazy,{{ updateKind::delete,A,B }}).
Third, the API can auto flush() when a query happens, which can make
the code easier to maintain. Forth, the updater class can hold an
optional PostDomTree and make its status synchronize with the
corresponding DomTree (can use lazy updates until query happens) and
use information from the DomTree to prune useless updates.

After I read the comments on D41302[1], in which Dave(dmgreen)
mentions there is a need to preserve the PostDomTree in the future. I
am wondering whether we need to preserve a PostDomTree all the time in
the new class? Moreover, as mentioned in the open projects page[2],
the API should be able to specify which trees are actually being
updated (none, only DomTree, only PostDomTree, both), while I doubt
that if the updater initially holds both the DomTree and PostDomTree
for example, it is inappropriate to update only the PostDomTree, which
will just break the synchronization between the two trees (DomTree and
PostDomTree). Please correct me if I am wrong. :smiley:

I am preparing to submit the proposal to GSoC this week. At the same
time, I will try to figure out something about the way to prune
redundant PostDomTree updates.

I am looking forward to your reply.

Regards,

Chijun Sima

[1] https://reviews.llvm.org/D41302?id=127155#inline-361616

[2] https://llvm.org/OpenProjects.html#llvm_domtree_updater

Hi Chijun,

Great, seems like you did a lot of progress and understand the issues quite well!

I have done some early sketch on the API of the new updater class.

From my current understanding, to solve the fragmentation problem of
the API, the new class first, need to maintain the DomTree and
PostDomTree class and deprecate the DefferredDominance class.

There are a couple of possibilities here. First, we can come up with a new updater class and gradually replace DefferredDominance and and the basic incremental update API with it (.applyUpdates()). During this process all the classes can coexists which should make it easier to make the transition smooth instead of risking breaking multiple places in LLVM at once. The other approach would be to modify DefferredDominance and make the transition happen in place, which should be also doable. I think that for prototyping it would be easier to clone DefferedDominance and replace one use of it at a time with the new updater class.

Second, the API should be something
looks like DT.update(updateStrategy::lazy,updateKind::delete,A,B) or
DT.applyUpdates(updateStrategy::lazy,{{ updateKind::delete,A,B }}).

Having update strategies as enums sounds very reasonable, but I’d rather we didn’t teach the DT to do it. I think the existing interface (and the header file) is already pretty large and it should make sense to have a separate class for the updates that the DT and PDT don’t know about at all. I suspect that the only thing that might be required would be to expose some additional functions from DT and PDT for update pruning, but I’m not sure about it at this point.

Third, the API can auto flush() when a query happens, which can make
the code easier to maintain. Forth, the updater class can hold an
optional PostDomTree and make its status synchronize with the
corresponding DomTree (can use lazy updates until query happens) and
use information from the DomTree to prune useless updates.

Sounds good. One issue we might hit and need to be care about is not to have both the updater class and DT in a single function somewhere in the codebase. In such cases it is possible to have some pending updates that will eventually change DT, but do .dominates() queries on the outdated tree. However, this problem exists even today to some extend (with DT and DDT) and I don’t think it’s very serious.

After I read the comments on D41302[1], in which Dave(dmgreen)
mentions there is a need to preserve the PostDomTree in the future. I
am wondering whether we need to preserve a PostDomTree all the time in
the new class?

I think that at some point we realized that preserving PDT everywhere is quite challenging with the current data structures in LLVM. The issue is that for postdominators you have to access predecessors, which requires you to track uses instead of just looking up the last instruction within a single basic block. What’s more, changes anywhere near terminator instructions in IR can cause PDT to look completely different every time you do it. To do it efficiently, you have to be very lazy about updates and have a good strategy for pruning. This project should be a huge step towards it, but I think that we have to start with optional PDT anyway.
Danny and Chandler (cc’d) should remember more about the different directions that were considered here.

Moreover, as mentioned in the open projects page[2],
the API should be able to specify which trees are actually being
updated (none, only DomTree, only PostDomTree, both), while I doubt
that if the updater initially holds both the DomTree and PostDomTree
for example, it is inappropriate to update only the PostDomTree, which
will just break the synchronization between the two trees (DomTree and
PostDomTree). Please correct me if I am wrong. :smiley:

My intention in the proposal was to say that the updater class can hold (or not) two kinds of domtrees – DT and PDT, and the updates should affect whatever is currently available. I hope this is what you are asking about. Breaking synchronization between two trees inside the updater class sounds like something we really want to avoid.

Sincerely,
Jakub

Hi Kuba,

Thanks for your clarification on the project in the previous letter.

I have submitted a proposal draft at the GSoC website, the draft has
been shared with the LLVM organization. I will appreciate it if you
can give me some advice on the proposal. This draft can be viewed by
the organization. (If you do not have access, please mail me, and I
will give you the link.)

I am looking forward to your reply.

Regards,
Chijun

Hi Chijun,

I left you my feedback in the doc (+ some nitpicks). Overall, it looks very solid and shows you understand the problem well and know what to expect.
Your proposed timeline is reasonable and I don’t see any major things to improve there. Don’t hesitate to reach out if you have any questions related to my comments (either here or in the document).

Thanks,
Kuba

Hi Kuba,

Thanks for your feedback. I have made some improvements on the
proposal according to your feedback and clarify something on the time
availability. I left comments on the questions you asked in the doc.
Please check it out.

Thanks,
Chijun

Hi Kuba,

Thanks for your comments on the PDT update pruning algorithms part in
my proposal. I have modified my proposal according to your feedback.
Please check it out. :smiley:

Regards,
Chijun