Hi Gábor,
Extending BodyFarm to model a wide variety of APIs would be quite useful. Even setting aside the lack of cross-translation-unit analysis, there will still be a set of core APIs whose source is unavailable when analyzing a project. Having good models for those APIs could be invaluable in specific contexts. Moreover, a synthesized body can be optimized for the task of static analysis rather than mirroring the actual implementation details, which can add complexity for the analyzer to reason about.
With the two-phase analysis, what you are basically suggesting is that we implement summary-based analysis using BodyFarm. That’s an interesting approach, and it’s one we have privately discussed in the past. One advantage of that approach is that you could possibly generate models from one codebase (e.g., some library providing an API) and then use those models when analyzing other codebases. Summary-based analysis runs into possible limitations when you need to iterate to a fixed point to generate the summary, or when the summary needs to be context-sensitive, but those are refinements we can explore over time.
More generally, having a good way to get models into BodyFarm without hand-coding AST construction logic would be useful. Thus, I see two possible (complementary) projects:
1. Implement a two-phase analysis, as you suggest, where models are created automatically from analysis and fed into subsequent analysis passes. This would require defining a fair amount of infrastructure to generate models and to save and read them from disk, plus the tooling support needed to run the two-phase analysis.
2. Provide an easy way for people to author models. A natural approach would be that someone could write the model in source code and its AST gets turned into a pre-baked model. This could be something as simple as writing a dumper that translates AST elements into the current AST construction logic in BodyFarm, or some other canned representation we can load from disk (pre-baked ASTs are a little tricky to just load within an existing AST for the translation unit we are analyzing).
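
To make #2 a bit more concrete, here is a rough sketch of what a model written as ordinary source might look like. The function name, signature, and choice of a compare-and-swap API are assumptions for illustration only; nothing here is an existing Clang or BodyFarm interface, and how such a file would actually be loaded is exactly the open question.

```cpp
#include <cstdint>

// Hypothetical model of a 32-bit compare-and-swap API, written as plain
// source instead of hand-coded AST construction logic. The name and
// signature are illustrative assumptions, not an existing interface.
//
// The model deliberately ignores atomicity and memory barriers. It keeps
// only the behavior an analyzer would need to reason about: the comparison,
// the conditional store, and the return value.
bool model_compare_and_swap32(std::int32_t oldValue, std::int32_t newValue,
                              std::int32_t *theValue) {
  if (*theValue == oldValue) {
    *theValue = newValue;
    return true;
  }
  return false;
}
```

The point of the sketch is that the author only writes the simplified semantics; the dumper or loader described in #2 would be responsible for turning the resulting AST into something BodyFarm can use.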
Both of these are fantastic projects. I’m a bit concerned that #1 may be too ambitious for a single GSoC project, whereas #2 provides much of the infrastructure needed for #1 and can be broken down into smaller, incremental pieces that are each generally useful. One attractive thing about #2 is that it could possibly allow us to remove all the existing BodyFarm models that are hand-coded in Clang itself and just marshal them in from model files.
I’d be happy to mentor you on this project. The key, however, would be to identify incremental milestones and scope the project so that reasonable progress is feasible within a GSoC period.
Cheers,
Ted