[analyzer][GSoC] Implementing a dataflow framework for the Clang Static Analyzer

Hi!

This summer, I indent to participate in Google Summer of Code, during which I’d like to work on the Clang Static Analyzer. In some not-so-recent discussions, the lack of a dataflow analysis infrastructure was mentioned, which I think would be a great addition, and I believe certainly a “tough enough” task.

Why I believe that this would be great, is that it would open the window for a variety of new analyses, like “points to” analysis, better dead code detection, and a variety of new checks that couldn’t be implemented with symbolic execution and/or AST matching.

Do you think it’s realistic to implement this within a summer? Do you have maybe some other, related but smaller projects that is a better fit for GSoC?

So far, I’ve spent the better part of one and a half years on the Clang Static Analyzer. I’ve implemented a path sensitive checker called UninitializedObjectChecker, and an AST based clang-tidy check called bugprone-throw-keyword-missing. Some other projects that didn’t see the light of day lead me to learn a lot about BugReporter and bug path construction. Lately, I’ve spent most of my time on the frontend of the analyzer, maintaining checker registration and command line option handling, during which I got to see and learn from a lot of checker implementations, how path sensitive analysis was implemented, and where a new analysis type could possibly fit in.

In order to prepare for this task, I’ve already started reading some literature (most notably Engineering a Compiler from Keith Cooper and Linda Torczon), and started attending related courses at university. If you could recommend some other sources, they would be most welcome. :slight_smile:

Would greatly appreciate any feedback!

Cheers,
Kristóf Umann

Hey!

First of all, yup, we're totally doing GSoC this year! I'll do my best
to finally force myself to update the open projects list soon with a
funny random idea that some folks around there are interested in :slight_smile:
Now, also i'm alone these days since George changed jobs; Gabor, are
you interested in / capable of helping out again? 'Cause they'll most
likely only let me pick one student, while we already have a few
people looking into participating :slight_smile:

Data flow... Well, this one's definitely wanted, but that's a big one,
definitely bigger than any of the projects that we've tried so far :slight_smile:
I guess such project would aim at allowing data-flow checkers that
don't have to write down transfer functions for every kind of
statement in every checker, but instead rely on the engine to handle
common basic effects of statements. If done properly, it pretty much
means writing a new Static Analyzer, which is more complicated than
the old one because it also needs its state merge operations defined.

There may be "poor-man's" solutions to this, which would allow us to
have a bit of progress without developing too much complex machinery.
One such solution is to introduce careful (conservative) tracking of
dropped execution paths in the Static Analyzer; it would allow finding
bugs that require analysis of all paths as long as the Analyzer is
known to have explored the function in its entirety. In practice
that'd be a lot of dropped coverage (much more than a proper
solution), but it'd still probably find *something*, while allowing us
to re-use most of the infrastructure. This, of course, is a dead-end
in the long run - if we ever want to regain the lost coverage, we'd
have to start from scratch.

So my overall feel here is that i'd love to hear a specific proposal,
but i suspect that if you go for doing this properly, one summer would
only be enough to scratch the surface. If you want to have a feel of
what it'd take, i'd recommend poking our existing data flow analyses.
Eg., what would it take to remove my C++ "object under construction"
liveness hacks by improving live variables analysis? You might also
want to write a few analyses of your own and see if you can generalize
something out of them. Eg., if you write an analysis that checks
whether a function is pure, we can instantly use it to improve our
conservative evaluation of pure functions (eg., produce the same
return value symbol every time it's called with the same parameters
and skip invalidations - we may even try to do it across translation
units because such summary is easy to serialize).

Hi!

I think I could have the same level of commitment I had last summer. If you find that sufficient, I am happy to help with the mentoring!

I do agree with Artem, if you just started to familiarize yourself with the topic, I would suggest starting with implementing some algorithms from scratch (or tweak existing ones). While improving existing algorithms is always nice, and it is good to get rid of some technical debt, it is sometimes harder to write a good report in the end.

Some random ideas that I would love to see implemented:

  • Points to analysis would have a number of uses like more precise loop widening, call graph construction, ability to experiment with other invalidation schemes
  • Static backward slicing could help us shorten bug paths and filter events
  • For some checks even just a set of dominators/post dominators for each basic block would be useful, see Kristof’s uninitialized value check’s “guarded member” heuristic
  • Maybe some better measure of function complexity for the analyzer’s inlining decision? As far as I remember we use the number of basic blocks now.
  • Maybe some static edge frequency estimate for the CFG to be able to experiment with other exploration strategies? Taking the most or least frequently taken edge leading to an unvisited node sounds like a fun experiment, but nothing guarantee that the end result will actually be useful.

I think it is also ok to write multiple proposals, but you should aim for quality over quantity.

Regards,
Gabor

An unrelated humble/funny project is now officially up on Yay! I mean, regardless of how the final setup would look, i’d be happy to see you intervene :slight_smile: Whoa, this one sounds really interesting and also very impactful. We’re right now pretty bad at figuring out which events are relevant and which aren’t, and it constantly bothers me (one of the latest examples is ). Maybe we could even poke that as a separate smaller project? Another good thing about this smaller project is that i don’t know how to implement that, but it sounds like you do - which means that we could co-mentor it as part of GSoC (:

Artem Dergachev <noqnoqneo@gmail.com> ezt írta (időpont: 2019. márc. 7., Cs, 1:38):

An unrelated humble/funny project is now officially up on http://llvm.org/OpenProjects.html#analyze-llvm

Hi!

I think I could have the same level of commitment I had last summer. If you find that sufficient, I am happy to help with the mentoring!

Yay! I mean, regardless of how the final setup would look, i’d be happy to see you intervene :slight_smile:

I do agree with Artem, if you just started to familiarize yourself with the topic, I would suggest starting with implementing some algorithms from scratch (or tweak existing ones). While improving existing algorithms is always nice, and it is good to get rid of some technical debt, it is sometimes harder to write a good report in the end.

Some random ideas that I would love to see implemented:

  • Points to analysis would have a number of uses like more precise loop widening, call graph construction, ability to experiment with other invalidation schemes
  • Static backward slicing could help us shorten bug paths and filter events

Whoa, this one sounds really interesting and also very impactful. We’re right now pretty bad at figuring out which events are relevant and which aren’t, and it constantly bothers me (one of the latest examples is https://bugs.llvm.org/show_bug.cgi?id=40913#c4). Maybe we could even poke that as a separate smaller project? Another good thing about this smaller project is that i don’t know how to implement that, but it sounds like you do - which means that we could co-mentor it as part of GSoC (:

Currently, this is the one topic that interests me the most, it sounds like an achievable, but “tough enough” task for a summer.

  • For some checks even just a set of dominators/post dominators for each basic block would be useful, see Kristof’s uninitialized value check’s “guarded member” heuristic

Before I make a former proposal however, I’ll implement this as an exercise, because (according to Gábor) this is a project that could be implemented in a week(ish). First, I’ll add some code to my checker, and see what it takes to make this functionality available to every checker – this should give me a good idea on how this entire project could look like.

Thank you so much for the encouragement by the way, I’m super excited to work on this! :slight_smile: