[GSOC] Static Analyzer infrastructure improvement, BodyFarm

Hello clang-devel,

I am a student, and I would like to participate in Google Summer of Code 2014. I am mainly interested in the Static Analyzer and I would like to make infrastructural improvements. I have some experience in writing checkers and other clang-based tools. Right now I am an intern at a company where my job is to implement checkers that verify whether their code follows their design rules.

In my work, one of my biggest obstacles has been that the static analyzer lacks the ability to do cross translation unit analysis. This is the reason why I am very interested in BodyFarm.

There is an open project to model standard library functions using BodyFarm to make the analysis more precise. I would like to help get BodyFarm working.

Furthermore, I would like to make BodyFarm something more general. Coverity does its analysis in two steps. First it builds a model, and then it uses that model when it does the analysis. I want to make it possible for checker writers to do a preliminary run to collect some definitions that can be used during the analysis. This would provide checker writers with some limited cross translation unit support, which would be a great improvement in my opinion.
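
A minimal sketch of the two-step idea described above, under loud assumptions: none of these types or functions exist in Clang, the "fact" collected per function is a single hypothetical `mayReturnNull` flag, and a real implementation would work on ASTs rather than on hand-filled structs.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; none of these types exist in Clang.
struct FunctionDef {
  std::string name;
  bool mayReturnNull;  // a fact a preliminary run could derive from the body
};

struct CallSite {
  std::string callee;
  bool resultCheckedBeforeUse;
};

struct TranslationUnit {
  std::vector<FunctionDef> definitions;
  std::vector<CallSite> calls;
};

// The model maps a function name to the fact collected about it.
using Model = std::map<std::string, bool>;

// Step 1: preliminary run over every translation unit, collecting facts.
Model buildModel(const std::vector<TranslationUnit> &units) {
  Model model;
  for (const auto &tu : units)
    for (const auto &def : tu.definitions)
      model[def.name] = def.mayReturnNull;
  return model;
}

// Step 2: analyze one translation unit, consulting the model for callees
// whose definitions may live in another translation unit.
std::vector<std::string> analyze(const TranslationUnit &tu,
                                 const Model &model) {
  std::vector<std::string> warnings;
  for (const auto &call : tu.calls) {
    auto it = model.find(call.callee);
    if (it != model.end() && it->second && !call.resultCheckedBeforeUse)
      warnings.push_back("unchecked result of " + call.callee);
  }
  return warnings;
}
```

Callees that are missing from the model are silently skipped here; a real checker would have to decide how conservative to be in that case.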

What are your opinions? Is there someone who is willing to mentor this project? What are the chances it will get accepted?

Thanks in advance,

Gábor Horváth

Hello,

I don't know whether your topic is a good one (it sounds great to me, but I'm
a newbie), but I thought that GSoC was reserved for "unemployed" students.
Since you said you are already an intern, I think that prevents you from
doing GSoC.

I hope I'm wrong, but I'm in the same situation: I have been a student for 5
years, and every summer I can't participate because I always have an
internship to do.

Regards.

Hello,

As far as I know a GSoC project should be the primary focus, so it is like
a full time job. I can pause my internship status for the summer if I get
accepted to participate in the project.

Cheers,
Gábor

Hello,

Ok, I didn’t expect that you could do that, sorry.

Regards.

That's correct. Otherwise we'd need to drop such a student as soon as
possible (even before midterm). This already happened in the past,
btw.

Hi Gábor,

Extending BodyFarm to model a wide variety of APIs would be quite useful. Even with the lack of cross translation unit analysis, there will also be a set of core APIs whose source is unavailable when analyzing a project. Having good models for those APIs could be invaluable for specific contexts. Moreover, the synthesized body can be optimized for the task of static analysis rather than mirroring the actual implementation details, which can contribute additional complexity for the analyzer to reason about.

With the two-phase analysis, what you are basically suggesting is that we implement summary-based analysis using BodyFarm. That’s an interesting approach, and it’s one we have privately discussed in the past. One advantage of that approach is that you could possibly generate models from one codebase (e.g., some library providing an API) and then use that model for analyzing other codebases. Summary-based analysis runs into possible limitations here when you need to iterate to a fixed point to generate the summary, or the summary needs to be context-sensitive, but those are refinements we can explore over time.

More generally, having a good way to get models into BodyFarm without hand-coding AST construction logic would be useful. Thus I see two possible (complementary) projects:

1. Implement a two-phased analysis, like you suggest, where models are created automatically from analysis and fed into subsequent analysis passes. This would require defining a fair amount of infrastructure to generate models, save and read them from disk, and the tooling support needed to do the two-phase analysis.

2. Provide an easy way for people to author models. A natural approach would be that someone could write the model in source code and its AST gets turned into a pre-baked model. This could be something as simple as writing a dumper that translates AST elements into the current AST construction logic in BodyFarm, or some other canned representation we can load from disk (pre-baked ASTs are a little tricky to just load within an existing AST for the translation unit we are analyzing).
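
To illustrate option 2, a model written as ordinary source might look like the sketch below. It follows the compare-and-swap semantics that BodyFarm currently synthesizes by hand for the OSAtomicCompareAndSwap family. The function name `model_compare_and_swap` and the `int` parameter types are illustrative assumptions, and atomicity is deliberately left out, since the model only needs to expose the visible effect to the analyzer.

```cpp
// Hypothetical model file content; the name and types are illustrative and
// are not part of Clang.
bool model_compare_and_swap(int oldValue, int newValue, int *theValue) {
  // Only the observable behavior is modeled: swap when the current value
  // matches the expected one and report whether the swap happened.
  if (oldValue == *theValue) {
    *theValue = newValue;
    return true;
  }
  return false;
}
```

The dumper described in option 2 would then turn the AST of such a function into whatever representation BodyFarm loads.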

Both of these are fantastic projects. I’m a bit concerned that #1 may be a bit ambitious for a single GSoC project, whereas #2 provides much of the infrastructure needed for #1 but can be broken down into smaller pieces that are incremental and generally useful. One attractive thing about #2 is that it could possibly allow us to remove all the existing BodyFarm models that are hand-coded in Clang itself and just marshal them in from model files.

I’d be happy to mentor you on this project. The key, however, would be to identify incremental milestones and scope the project out so that it could be feasible to achieve reasonable progress over a GSoC period.

Cheers,
Ted

Hi Ted,

I am glad that you could mentor me. I agree that the first option may be too ambitious for this summer, so I would like to work on the second one. I like the approach where one could write the model as source code. However, I think that in the long term the textual representation of models may not be appropriate if we would like to use this feature as a foundation for summary-based analysis. I think it would be very useful to have an API to serialize some parts of the AST of a translation unit to a model file that can be loaded later. That API could be utilized by a future project that implements two-phased analysis.
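
As a rough sketch of the shape such a save/load API could take: everything below is hypothetical and not Clang code. The `Summary` record stands in for the serialized AST parts, and the line-based format is only a placeholder chosen to keep the example short.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical summary record; a real model file would hold AST fragments.
struct Summary {
  bool mayReturnNull;
  bool hasSideEffects;
};

using ModelFile = std::map<std::string, Summary>;

// Write one "name flag flag" line per function.
void saveModel(std::ostream &out, const ModelFile &model) {
  for (const auto &entry : model)
    out << entry.first << ' ' << entry.second.mayReturnNull << ' '
        << entry.second.hasSideEffects << '\n';
}

// Read records back until the stream is exhausted or a record is malformed.
ModelFile loadModel(std::istream &in) {
  ModelFile model;
  std::string name;
  Summary s{};
  while (in >> name >> s.mayReturnNull >> s.hasSideEffects)
    model[name] = s;
  return model;
}
```

A later two-phase analysis could call something like `saveModel` at the end of the first pass and `loadModel` at the start of the second.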

Should we discuss the possible milestones and the scope privately, or is it better to discuss them here on the mailing list? Am I supposed to define those milestones myself? Sorry if the answers are obvious; this is the first time I’m taking part in GSoC.

Thanks,

Gábor

(I’ve responded to Gábor off list)