Modifying an AST and Sema's dependency on an AST consumer

Hi there,

   I'm trying to do a few weird things with clang. The general scheme
is that I want to generate ASTs from my C/C++ code and keep them
around to modify or extract some information later. Then, at some
point, I want to feed the ASTs to CodeGen, play with the modules and
then link them together and run. I have encountered something that I
do not completely understand: Sema needs an ASTConsumer because when
it's done with the class definition processing it calls
ASTConsumer::HandleTagDeclDefinition which is implemented in CodeGen
and seems to do some important work. Shouldn't CodeGen be able to
figure out by itself if it has got all the nodes constituting a class
definition? Otherwise the user has to emulate the Sema's behavior.

   The second thing I'm trying to do is an insertion of some function
declaration nodes to the TranslationUnit that is being created
(implicitly declare some functions that are defined in in another
file/TU/module and will be linked to the one being processed). For
example I have:

test01.c:

double doSth( double param ) {
   return 2.0*param+3.0;
}

test02.c:

#include <stdio.h>

int main( int argv, char** argv ) {
   printf( "%lf\n", doSth( 17.0 ) );
   return 0;
}

   I want to parse test01.c and get all the function definitions,
"convert" them to function declarations, and insert to the translation
unit/ast context of the test02.c. I can extract the definitions and
create the declaration nodes using FunctionDecl::Create, I enter new
function types to the ASTContext and I think that I update all the
tables in it. Yet still I seem to experience some strange memory
issues when the modified TU gets processed by the CodeGen. I think I
have missed something essential. Can somebody please explain me the
steps necessary to do something like this?

Cheers,
   Lukasz

I sorted it out. The problem was the object ownership. It seemed
natural to me that the identifier table should be owned by the
ASTContext, but it is owned by the Preprocessor object which I deleted
after I was done with the parsing.

   Lukasz

tables in it. Yet still I seem to experience some strange memory
issues when the modified TU gets processed by the CodeGen.

I sorted it out. The problem was the object ownership. It seemed
natural to me that the identifier table should be owned by the
ASTContext, but it is owned by the Preprocessor object which I deleted
after I was done with the parsing.

The IdentifierTable should be owned by ASTContext (as you expected).

This is a bug (that I thought we fixed a long time ago...apparently not).

snaroff

Just to clarify, the Preprocessor owns IdentifierTable because (a) one needs the IdentifiedTable when preprocessing/lexing a file and (b) one can use a Preprocessor object without constructing ASTs.

I don't think this is a bug. See my other email.

Hey Ted,

I read your email but I'm still not convinced there isn't an issue here (though it certainly isn't critical to deal with now).

No doubt that running the preprocessor standalone will require an IdentifierTable.

When a preprocessor is being run in a larger context (where AST's are being generated), it makes more sense (to me) if the "larger context" owns the IdentifierTable. In this instance, ASTContext.

while ASTContext is more long-lived.

snaroff

However, it is perfectly reasonable for a client to want to use ASTs
and delete the Preprocessor. Does this mean we should have another
context object?

- Daniel

That's a fair argument, as a Preprocessor is more of a "worker" object that is used to do a single pass over the source in a translation unit. One possible solution is to have the IdentifierTable object be passed-by-reference to the constructor of Preprocessor with the assumption that the ownership of the IdentifiedTable lies elsewhere.

I'm not certain, however, what the performance implications would be of such a change. After mucking around in the Preprocessor for the last couple of weeks I'm convinced that even small changes can have an unanticipated adverse effect on performance.

Understood. I don't think we need to act on this now.

snaroff

However, it is perfectly reasonable for a client to want to use ASTs
and delete the Preprocessor.

Absolutely.

Does this mean we should have another
context object?

That's a possibilty, but what would such a context object represent? My main concern is that at some point a proliferation of such objects becomes more cumbersome than the benefit they provide.

Does this mean we should have another
context object?

That's a possibilty, but what would such a context object represent? My
main concern is that at some point a proliferation of such objects becomes
more cumbersome than the benefit they provide.

I'm not exactly sure what it means, I imagine it would be a high level
context which captures any information that should be available
whenever a client is using parts of the clang API. This doesn't
necessarily mean a proliferation of contexts, for example it may make
sense for this object to hold the language options or target
information. This is no different than forcing the clients to allocate
an identifier table up front, it just moves it to a common place.

- Daniel

However, it is perfectly reasonable for a client to want to use ASTs
and delete the Preprocessor.

Yes.

Does this mean we should have another
context object?

No.

ASTContext should own the identifier info table. If you're using the preprocessor without sema, you have to manage the lifetime of the identifier info table yourself.

-Chris

Ok, many thanks for your replies. For the prototype I'm building it's
not a critical issue right now, but still it would be great to have it
fixed.

What about my other problem? I had only a very brief look at this part
of the code but I can see that you have probably done that for
performance reasons. I mean, Sema seems to have all the structures to
determine when it's done with the class processing efficiently. My
problem is that I need to push my TUs through CodeGen after Sema is
long gone.

   Lukasz

Given our current interfaces, I think the simplest approach to this
would be to add a something like a generic ASTStreamer class which
takes an AST and a consumer and streams the AST through the consumer.
This avoids the need for CodeGen or other consumers to be aware that
they are being called differently.

- Daniel

Or to make AST consumers dependent only on ASTs. Interface clarity vs
efficiency in the main use case, that's always a tough choice :slight_smile:

   Lukasz

Or to make AST consumers dependent only on ASTs. Interface clarity vs
efficiency in the main use case, that's always a tough choice :slight_smile:

Indeed. However, with a class which adapts an AST for use by an AST
consumer this is effectively the same API (although the consumer
doesn't get a chance to optimize for random access).

We definitely do not want to require parsing the entire AST before
doing code generation. Although we don't use this facility yet, we
hope to eventually support function-at-a-time code generation as a way
to reduce memory consumption and hopefully improve -O0 time.

- Daniel

I was not suggesting that. I was just saying that the consumer
could probably figure out on it's own when to call
  HandleTagDeclDefinition (this is the only method I could spot that
requires Sema to be aware of the consumer), but I understand that it
comes for a price. And yes, I agree than an ASTStreamer or just a
method ASTConsumer::HandleEntireTU should be a reasonable solution in
the current state of things.

   Ok, thanks for your answers! Everything looks a lot clearer to me now.

   Lukasz