Adding more analysis algorithms to Clang?

Hi,

There are some program analysis algorithms in clang/lib/Analysis directory. I am wandering if there are more algorithms to come or not. Are there some people working on this? What kind of algorithms are welcome to be part of Clang?

I am asking this because I want to help on this part. But I am not sure if source-to-source transformation algorithms are welcome in this community, since many optimization work is done in the LLVM backend. Currently I have implemented to CFG analysis algorithms (building the dominance tree and finding the strong connected components in CFG). I plan to implement more data flow analysis or even pointer analysis algorithms. If these algorithms are useful to people in this community, I would love to submit a patch.

I am new to Clang and this community. Hopefully these questions are not naive to you.

Thanks.

Hi Guoping,

Additions to libAnalysis have largely been demand-driven. I think we are fine with general contributions to this library that meeting the following criteria:

(1) Meet the Clang coding conventions and quality expectations.

(2) Are testable ("make test"), and tests are included in clang/test.

(3) Performance is acceptable and measurable, or at least documented when algorithms are known to be expensive.

(4) APIs are well-documented, which can come in the form of good doxygen comments.

Source-to-source transformations are welcome, and the Clang codebase does contains tools that are source-to-source rewriters. For example, libRewrite provides low-level functionality for doing textual rewrites of source files using Clang SourceLocations. Other analysis algorithms are useful as well; we just prefer that they can be regression tested, and the APIs are well-documented (especially if they don't have any real clients in the codebase at present).

Cheers,
Ted

Dear Ted

I added two control flow analysis algorithms: One builds the dominance tree for CFG. The other one finds all SCC components within the CFG. Attached is the patch. I integrated these algorithms within the CFG and CFGBlock classes. This is the first time I submit something, so I have some questions:

1 While I would really love to contribute, I am not sure if the code quality is enough. Please let me know any problems with the code so I shall fix them.
2 These algorithms shall work on any CFG. I am not sure what kind of special test cases should be added. I do wrote an example (in tools/clang/examples) to illustrate how to use these two algorithms. Is it necessary to submit the example into the examples/ directory?
3 I tried my best to follow the code standards and comments. Also, in the comments, I listed the key reference papers which described thorough details of these algorithms. They are the state of the art algorithms as far as I know, and very efficient, although I am not sure my implementation is good or not.

I am looking forward to feedback. I shall fix problems until it meets the quality for actual commit.
Thanks.

CFG.cpp.patch (5.62 KB)

CFG.h.patch (2.89 KB)

  1. Use “unsigned int” instead of just “unsigned”
  2. Shouldnt "CFGStack.pop_back_val(); " on line 45 be outside the “if (AllVisited)” statement?
  3. The check on line 76 is unnecessary. If you are at line 76 then the check at line 71 was false which means that the check at 76 is true. So I guess replace “else if” with “else”.
  4. typo on line 52

Arjun

Thanks Arjun for your feedback. I updated these files (attached).
Ted, I am looking forward to your comments.

CFG.cpp.patch (5.65 KB)

CFG.h.patch (2.89 KB)

Hi Guoping,

Here are some comments mostly based on review of buildDominanceTree().

This should be extensively tested. I think unit tests(clang/test/unittests) could be a good fit in this case. Please, make tests a part of your next patch.

It would be nice if you could provide more expressive API for users of the algorithms. For example, provide reverse post order iterator which can be used by multiple clients instead of using the RPOQueue local variable to store the results. Similarly for the dominator info, it would be nice to add some API which allows querying if one block dominates another one… LLVM already has these algorithms implemented. I am not sure if some reuse is possible or not:
http://llvm.org/docs/doxygen/html/PostOrderIterator_8h_source.html
http://llvm.org/docs/doxygen/html/classllvm_1_1DominatorTree.html

Other comments:

  • A single patch file would be easier to review.

  • If the following if statement needed? (Can a block have null successors?)

  • if(CFGBlock *B = *I) {
  • Is VisitVector needed? Maybe you could use the BlockRPO of the successors instead.

  • Shouldn’t Dominator fields be initialized to 0?

  • 2 components/functions should be pulled out of buildDominanceTree(): buildReversePostorder(), intersect().

  • I think it would be cleaner to include the paper references as part of method comments/not fields comments.

  • “That is, dead code CFG blocks are ignored by this algorithm” - I’d remove this comment. The first sentence is clear enough. (I think you are talking about unreachable code here, and dominance analysis are not powerful enough to find all unreachable code.)

Microscopic style issues:

  • Comments should use proper punctuation.

  • You need to insert a space after a comma and before {. For example:
    RPOQueue.push_back(item,BlkBVC); → RPOQueue.push_back(item, BlkBVC);
    if(!VisitVector[BlockID]){ → if (!VisitVector[BlockID]) {

  • CFGBlock * getSCCRoot() → CFGBlock *getSCCRoot()

Cheers,
Anna.

Hi Guoping,

First, I appreciate your interest in pushing this code back to the open source tree. I want to start with some high-level issues, as I think there needs to be some significant restructuring before we can really get into the details.

My main concern right now is the factoring of the code. None of this analysis logic should be added to the CFG class itself, but instead built on top of it. For example, the LiveVariables analysis builds on top of the CFG; it doesn’t embed information inside it. Instead, we should see a separate API entry point that, given a CFG, returns the analysis results.

Put another way, I should never expect to see analysis logic, like dominators, in CFG.h and CFG.cpp.

There are a variety of reasons for this. The first is modularity and purity. The CFG is just in the business of modeling control-flow. No more, and no less. That makes it API contract easy (or easier) to understand.

The second is just good software engineering. If we add 10 new analyses, do they need to get embedded into the CFG? Besides cluttering the API, that potentially means that some clients of the CFG need to pay the cost of having that logic their even if they don’t use. For example, the compiler itself won’t use the dominance analysis in the CFG, but if the analysis gets buried in the CFG API then it potentially must get linked in. Moreover, I should be able to extend the set of analyses possibly by providing new libraries or plugins; requiring that they are buried in the CFG doesn’t allow this flexibility.

Before I am really going to review this patch in detail, you will need to split your logic out of the CFG, and provide new classes and entry points for clients to use. That’s absolutely a prerequisite before it can get checked in.

My other concern is that there is no testing logic. We absolutely must have a way to test these analyses. We have very high quality standards in Clang, both in terms of code quality but also in terms of functionality. Without any tests, there is no reason to assume that these analyses really work at all.

What I propose is that you have similar testing to what we do with the LiveVariables analysis. There is a static analysis check, LiveVariablesDumper (lib/StaticAnalyzer/Checkers/DebugCheckers.cpp) that simply dumps out the live variables information for a CFG. That information can then be regression tested using FileCheck (there are many examples in the “test” directory of using FileCheck to examine output from a tool). While I’m not proposing that your analyses be part of the static analyzer proper, we can use the analyzer to test your analyses in this way. What I would then expect would be a set of tests for each of your analyses that regression tested the expected results of the analysis. Providing a way to dump these analysis results is not only good for regression testing, but also important just for debugging the behavior of these analyses in general.

The regression testing is also a prerequisite for this code getting checked into the Clang repository.

Cheers,
Ted

Hi, Ted, Anna, Arjun, etc

Thanks you all for your comments. Now I think at least I know the direction on how to improve the quality of the code. I shall fix these issues in my next patch.

Hi, Anna

You are right. Since LLVM already has dominatorTree implementation, it’s not very interesting to build the wheel again.

Hi, Anna

You are right. Since LLVM already has dominatorTree implementation, it’s not very interesting to build the wheel again.

LLVM CFG and Clang front end CFG are different. I am not sure if the LLVM dominator tree can be integrated to work with Clang CFG (most likely not). Though, it could be worthwhile to check how it is implemented, both from algorithmic as well as API design points of view.

I think bringing those algorithms into Clang would be useful.

Cheers,
Anna.

Hello

I re-engineered the dominance algorithm. Attached are the cpp/h files and a test input file. There are still several things:
A. This implementation is done based on various comments pointed out by Ted, Anna, Arjun, et al. While I tried my best to provide a much better version this time, it still may not meet the quality to be a part of Clang. Please let me know if there are any issues on the code. I shall continue to improve it until it’s good enough.

B. The dominance algorithm need two basic things. One is the post order iterator of a general graph. As Anna pointed out, Clang already has this facility. So I use it directly to traverse the CFG nodes in reverse post order. The other one is the LLVM’s implementation of DominatorTree. This implementation is different from the one in VMCore on two aspects: 1 The dominator implementation in LLVM VMCore is closely coupled with the LLVM IR infrastructure (basic blocks, instructions, etc). The attached implementation mainly suits the CFG graph of Clang. 2. The LLVM implementation used the classical Lengauer-Tarjan algorithm, published in 1979. The attached implementation is based on an improved 2001 paper. For more details please refer to the comments of the code.

C. Shall I also provide the test code for the algorithm? While I know I need to test code, but I am not sure the standard procedures. Please let me know.
Thanks.

Dominators.cpp (7.58 KB)

Dominators.h (2.42 KB)

domtest.c (662 Bytes)

Hi Guoping,

Yes, you should include a test with your patch. The test should be a part of the clang regression tests - it should be automatically tested when we run $make test. As for how to approach testing here, see the suggestion made by Ted in one of the previous emails to use testing approach similar to how we test LiveVariables analysis. This is not going to be trivial but it is absolutely necessary.

As for the format of the patch, please see “Making a Patch” in

http://llvm.org/docs/DeveloperPolicy.html

Thanks!
Anna.

Hi,

Attached is the revised patch for dominance tree analysis. I added necessary test code. To run the test: clang -cc1 -analyze -analyzer-checker=debug.DumpDominators input-file.c
Two things:
1 The patch file is generated by running svn diff under the llvm/tools/clang directory. I do not know why this patch only contains information for existing file modifications, not including new added files (dominators.cpp (should be in clang/lib/Analysis/), dominators.h (should be in clang/include/clang/Analysis/Analyses/), domtest.c (should be in clang/test/Analysis)). So I included the patch in together with the three new files as attached.

  1. While I tried my best to test the algorithm, I am not sure to what extent can this process be satisfactory enough. Do I need to add more test inputs in domtest.c? Or what else should I do to ensure more thorough testing?

Please let me know if there are any issues. I shall fix them. Thanks.

Dominators.cpp (7.58 KB)

Dominators.h (2.42 KB)

domtest.c (662 Bytes)

dominators.patch (1.94 KB)

Hi, clang

Thanks Joerg and Arash. Attached is the new patch with everything contained. Please let me know if there are any further issues. I shall continue working to improve it.

–Guoping

2011/10/18 Joerg Sonnenberger <joerg@britannica.bec.de>

dominators.patch (13.6 KB)

Hi Guoping,

To automate testing, you should check the expected output of the dominators analysis checker with FileCheck. Inside the test, you specify the command which should be applied on it as well as the expected output. It will get tested when you run the following from the clang build directory (see the Testing section on http://clang.llvm.org/hacking.html):
$ make test
or if you only want to test the tests from the analysis folder:
$ TESTDIRS=Analysis make test

Your test is probably going to look similar to the following example, which tests CFG builder:
test/Analysis/initializers-cfg-output.cpp

You can find the documentation on FileCheck here: http://llvm.org/docs/TestingGuide.html The document might be a bit outdated but it does a good job explaining the concepts.

Cheers,
Anna.

Hi

Thanks Anna. Thanks for all who help point issues in my code!

The FileCheck logic has been added. Now I believe the code is at least in a reasonable shape. Attached is the revised patch. Please let me know if there are additional issues. I shall fix them.

dominators.patch (15.3 KB)

Hi Guoping,

This looks great to me. My main concern is that we are copy-pasting TopologicallySortedCFG and CFGBlockSet yet again. This is a very bad trend. Before this patch goes in, I’d like to see that common code refactored. I’ll look into fixing that today.

Cheers,
Ted

Hi, Ted

TopologicallySortedCFG implements a post ordering of all CFG nodes. Maybe renaming it into POSortedCFG is better. Similarly, some algorithms may need something like pre-ordered CFG. I believe more analysis algorithms may need either of these. It may be nice to have such a class as an basic functionality in the infrastructure. Analysis algorithms will just use this basic functionality, rather than implement/copy-paste such classes in a somewhat ad-hoc way.

I completely agree. I’ll look into wiring up something like POSortedCFG right now.

Hi Guoping,

I’ve added PostOrderCFGView. Can you regenerate your patch using this class?

Cheers,
Ted