Lazily parsing additional source files

Hi!

I am working on a Google Summer of Code project to improve the Clang Static Analyzer. For that project it would be essential to parse external source files and inject the resulting AST into the translation unit that is being compiled. The external files would contain definitions that are looked up on demand. The goal is to avoid any runtime cost when no lookup is required. So basically I want to add new code lazily to an existing AST, after parsing is done, by injecting new source code.

Moreover some type information may not be available in those external source files, so type information in the translation unit that is being analyzed should be utilized.

What do you think would be the most efficient and elegant way to approach this problem?

Thanks in advance,

Gábor

Hi, Gábor. If you look far back in the SVN history you can see sketches of where we tried this, with an unimplemented concept of “marshalling” used to get data from one ASTContext to another. As I remember, it didn’t go very far because it turns out it’s very difficult to actually match up types and decls from different translation units.

Trying to parse new code could have better luck, though you’d probably have to change the way things are currently set up to not count the main source file as ended. You could still run into trouble if there are, say, static functions with the same name in the other TU, though.

I’m not sure what you mean by “some type information may not be available in those external source files”. You can’t actually parse C code fully without type information, because certain constructs are ambiguous otherwise.

The approach we’ve considered before is to come up with some AST-agnostic “summary” of a function, like “the first parameter is never modified even though it’s passed as non-const, and the second parameter is always the return value”. A more advanced form of this would allow checkers to store information this way as well. Then this summary information could be “applied” at a call site (using the declaration in the primary TU), without having to worry about making the ASTs match up. This summary information could also be persisted, meaning that when you reanalyze the same project you wouldn’t have to generate the summaries all over again.
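As a rough illustration of the summary idea above, here is a minimal C++ sketch. It is a toy model only: `FunctionSummary`, `SummaryTable`, `CallSite`, `applySummary`, and `isArgPreserved` are hypothetical names invented for this example, not Clang or analyzer APIs, and symbolic values are represented as plain strings.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, AST-agnostic description of one function's behavior.
struct FunctionSummary {
    // Indices of parameters the callee never modifies, even if non-const.
    std::vector<std::size_t> unmodifiedParams;
    // Index of the parameter that is always the return value, if any.
    std::optional<std::size_t> returnedParam;
};

// Summaries are keyed by function name only, so they could be persisted and
// reused without matching AST nodes across translation units.
using SummaryTable = std::map<std::string, FunctionSummary>;

// Toy model of a call site: the callee name and one symbolic value per argument.
struct CallSite {
    std::string callee;
    std::vector<std::string> argValues;
};

// Applies a summary at a call site. Returns the symbolic return value when
// the summary ties it to an argument, or std::nullopt when nothing is known.
std::optional<std::string> applySummary(const SummaryTable &table,
                                        const CallSite &call) {
    auto it = table.find(call.callee);
    if (it == table.end())
        return std::nullopt;  // no summary: the caller must stay conservative
    const FunctionSummary &summary = it->second;
    if (summary.returnedParam &&
        *summary.returnedParam < call.argValues.size())
        return call.argValues[*summary.returnedParam];
    return std::nullopt;
}

// True if a summary guarantees that the argument at `index` is not modified.
bool isArgPreserved(const SummaryTable &table, const CallSite &call,
                    std::size_t index) {
    auto it = table.find(call.callee);
    if (it == table.end())
        return false;
    for (std::size_t param : it->second.unmodifiedParams)
        if (param == index)
            return true;
    return false;
}
```

Because the table is keyed by name and parameter position, applying it needs only the declaration visible in the primary translation unit, which is the property the summary approach relies on.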

Of course, you don’t have to do things this way. I’m just concerned that C is very much structured around the notion of translation units, and that it will be very difficult to handle code outside of that context.

If you have any specific questions, I’ll try to answer them fairly promptly. Anna should be coming back soon, too.
Jordan

Hi Jordan,

The goal here is to have BodyFarm synthesize function bodies from source code, rather than from ASTs hand-crafted inside BodyFarm. BodyFarm obviously depends on the types being available for parameters and such, and I think the same requirements would apply here as well. Thus the idea is to lazily create ASTs from source when we ask BodyFarm to synthesize the body of a function, and utilize all the existing declared types, language options, etc., from the current ASTContext. If some of the dependencies cannot be resolved, a reasonable solution would be to fall back and fail to synthesize a body.
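The control flow described here (synthesize on first request, reuse the types already declared, and give up cleanly when a dependency is missing) can be sketched in standalone C++. This is a toy model under stated assumptions: `ModelSource` and `LazyBodySynthesizer` are hypothetical names, bodies and types are plain strings, and nothing here is the real BodyFarm or ASTContext interface.

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a model file: the source text of a body plus the
// type names that must already be declared in the current translation unit.
struct ModelSource {
    std::string bodyText;
    std::vector<std::string> requiredTypes;
};

class LazyBodySynthesizer {
public:
    LazyBodySynthesizer(std::map<std::string, ModelSource> models,
                        std::set<std::string> declaredTypes)
        : models_(std::move(models)),
          declaredTypes_(std::move(declaredTypes)) {}

    // Returns the synthesized body, or std::nullopt if there is no model or a
    // required type is missing. Each function is attempted at most once;
    // failures are cached as well as successes.
    std::optional<std::string> getBody(const std::string &function) {
        auto cached = cache_.find(function);
        if (cached != cache_.end())
            return cached->second;
        ++parseAttempts_;
        std::optional<std::string> result = synthesize(function);
        cache_.emplace(function, result);
        return result;
    }

    // Number of times synthesis was actually attempted (cache misses).
    int parseAttempts() const { return parseAttempts_; }

private:
    std::optional<std::string> synthesize(const std::string &function) const {
        auto it = models_.find(function);
        if (it == models_.end())
            return std::nullopt;
        for (const std::string &type : it->second.requiredTypes)
            if (declaredTypes_.count(type) == 0)
                return std::nullopt;  // unresolved dependency: fall back
        return it->second.bodyText;
    }

    std::map<std::string, ModelSource> models_;
    std::set<std::string> declaredTypes_;
    std::map<std::string, std::optional<std::string>> cache_;
    int parseAttempts_ = 0;
};
```

No work happens until `getBody` is called, so a translation unit that never asks for a body pays nothing beyond constructing the tables.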

Ted

Ah, okay…I thought this was for arbitrary other files in the project. Doing this for synthesized bodies seems much more plausible!

Jordan

Hi Jordan,

I have some questions. First of all I do not really know yet if it would be better to create a new compiler instance to parse the model files or reuse the existing compiler instance.

In either case the current FrontendAction logic does not fit this scenario. I could create a new class derived from FrontendAction that uses an existing ASTContext to parse a file. However, the BeginSourceFile, Execute, and EndSourceFile methods are not virtual (and they contain logic that prevents using FrontendAction as a base class for this task). I think they might be non-virtual on purpose. Do you think it would be OK to change those methods to be virtual?
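For readers unfamiliar with the design in question: non-virtual public entry points that call protected virtual hooks are usually called the non-virtual interface pattern, and Clang's FrontendAction exposes hooks such as ExecuteAction in this style. The standalone sketch below is a toy model of that pattern, not Clang's class; `ActionBase`, `ParseModelFileAction`, and every method name in it are hypothetical.

```cpp
#include <string>
#include <vector>

// Toy model of the non-virtual interface pattern; not Clang's real class.
class ActionBase {
public:
    virtual ~ActionBase() = default;

    // Non-virtual driver: it fixes the order of the steps and owns the
    // bookkeeping, so subclasses cannot skip or reorder them.
    bool run(const std::string &file) {
        log_.push_back("begin " + file);
        if (!beginHook(file)) {
            log_.push_back("abort " + file);
            return false;
        }
        executeHook();
        endHook();
        log_.push_back("end " + file);
        return true;
    }

    const std::vector<std::string> &log() const { return log_; }

protected:
    // Customization points for subclasses.
    virtual bool beginHook(const std::string &) { return true; }
    virtual void executeHook() = 0;
    virtual void endHook() {}

    void note(const std::string &message) { log_.push_back(message); }

private:
    std::vector<std::string> log_;
};

// Hypothetical subclass standing in for "parse a model file into an
// existing context"; it only overrides the hooks.
class ParseModelFileAction : public ActionBase {
protected:
    bool beginHook(const std::string &file) override { return !file.empty(); }
    void executeHook() override { note("parse into existing context"); }
};
```

The trade-off is the one raised in the question: the driver's built-in logic is protected from subclasses, which is helpful when it must always run and limiting when a subclass needs to bypass it.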

Thanks,
Gábor

Hi!

I am working on a Google Summer of Code project to improve the Clang
Static Analyzer. In that project it would be essential to parse external
source files and inject AST into the translation unit that is being
compiled. The external files would contain definitions that are being looked
up. The goal would be to avoid runtime cost if no lookup is required. So
basically I want to add new code lazily to an existing AST after parsing is
done by injecting new source code.

Isn't that exactly what modules solve?

Cheers,
/Manuel

Thanks, that is a good point. Do you know what the current status of module support for C++ is? Is it mature enough to be used? If it is not mature enough yet, would it be worth my while to start working on it (that is, would it take less effort to make modules work with regular translation units than to create my own lazy parsing logic)?

Cheers,

Gábor

+richardsmith, the C++-modules-man

Hi!

Do you know if the module system is also capable of handling modules that are not self-contained? That is, would it be possible to use the type information from the translation unit that is being parsed, and to compile and load a module that contains no type definitions (assuming all the required type information is available in the translation unit at the point when the module needs to be loaded)?

Thanks,

Gábor

It'd theoretically be possible (you could take your existing AST, emit it
as a module, then build another module starting with an import of the first
one), but I'm not really sure what the benefit would be. You wouldn't be
able to reuse the things you parsed, since they depend on the particular
translation unit you're within.

It sounds like all you really want is to parse, then, at the end of the
translation unit, scan the parsed translation unit for declarations that
you'd like to implicitly define, and parse those declarations. Something
similar to injecting a '#pragma clang define_needed_decls' at the end of
the TU and teaching the lexer to translate that pragma into the right set
of #includes would seem reasonable. (You'll need to be a little careful
that lookahead doesn't cause you to pick too few things, but other than
that it should be OK.)
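The end-of-TU scan suggested above can be modelled in a few lines of standalone C++. This is only a sketch under stated assumptions: `ParsedTU`, `includesForNeededDecls`, and the function-to-model-file table are hypothetical, the pragma itself is the hypothetical one named in the message, and real lexer integration is not shown.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy view of a parsed translation unit: which functions were declared and
// which of them already have a definition.
struct ParsedTU {
    std::set<std::string> declared;
    std::set<std::string> defined;
};

// Returns the #include lines that could be substituted for the hypothetical
// end-of-TU pragma: one line per model file that provides a body for a
// function that is declared but not defined.
std::vector<std::string> includesForNeededDecls(
    const ParsedTU &tu,
    const std::map<std::string, std::string> &modelFiles) {
    std::vector<std::string> lines;
    std::set<std::string> emitted;  // avoid including the same file twice
    for (const std::string &name : tu.declared) {
        if (tu.defined.count(name) != 0)
            continue;  // the TU already provides a body
        auto it = modelFiles.find(name);
        if (it == modelFiles.end())
            continue;  // no model available for this function
        if (emitted.insert(it->second).second)
            lines.push_back("#include \"" + it->second + "\"");
    }
    return lines;
}
```

Functions with no model simply produce no include line, so the translation unit is left unchanged for them.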

I think Richard answered this question pretty well, but I think these are fundamentally different problems. We’re talking about lazily synthesizing function bodies on demand, in the context of the existing translation unit. For flexibility, we want the definitions of these function bodies to adapt to the specifics of the translation unit, e.g. the definitions of the types used to define the parameter and return value. That means the source bodies are not self-contained modules.

There also seems to be complexity here when using a system that already uses modules. Consider a case, like OS X, where the APIs are accessible from modules. Say there is a function, foo(), whose declaration is vended by a module, and that’s how it gets imported into a translation unit. At analysis time, however, we want a body for that function, even if one was not provided. We then want to conjure up a function body for foo(), even though the declaration of foo() was already pulled from a real module and for all intents and purposes is considered part of that module, not our fake one that contains our faux definition for foo().

FWIW, this lazy conjuring of function bodies is what we do today in BodyFarm. We just do it in an ad hoc way by creating the ASTs by hand.