Returning original parsing state after parsing arbitrary tokens

Assume that a program did the following:

1) Called EnterTokenStream with arbitrary tokens (followed by a Semicolon
and Eof)
2) Called setSuppressAllDiagnostics(true) to keep from emitting errors (in
case the next step fails)
3) Called ParseStatement to produce a StmtResult from the additional tokens
(which may or may not produce a valid result)
4) Lexed tokens (if necessary) to remove any unlexed added tokens
5) Called setSuppressAllDiagnostics(false) to restore error handling

What variables and/or other states should be saved to return the
Preprocessor/Parser/Lexer to the state they were in before step 1?

It's impossible to completely save/restore the state of semantic
analysis short of using fork() or equivalent. You might be able to
get away with an approximation depending on your exact requirements,
though.

At a higher level, what are you trying to do?

-Eli

This is a slight exaggeration; that said, parsing a statement in the
general case can change essentially arbitrary pieces of the AST and
the lookup data-structures in Sema, particularly in C++ when template
instantiation is involved. Even in C there are bits stored in the AST
like whether a declaration is used which can't be easily
saved/restored.

-Eli

The sub-goal is to parse the parameters of function-like macros to produce
StmtResults without altering the normal parsing. Although not every macro
parameter will be well formed, most will likely produce valid results after
calling ParseStatement.

The goal is to facilitate source-to-source translation by walking the AST.
Maintaining the original parameters for function-like macros is desirable in
this task.

(I realize that having a valid StmtResult is only half of the equation in
determining whether the parameter makes sense to parse in isolation. The next
task is to expand the macro to a canonical form using the obtained
StmtResult, and compare it against a canonical form of the "normally parsed",
already expanded macro.)

My code to parse macro parameters works on simple cases, but not when
certain headers are included, because of parsing state changes.

Would you recommend I resort to fork() and messaging, or is there a more
elegant solution to keep the parsing state pristine?

Hi,

That use case is particularly interesting to us, too. We want to be able to support run-time discovery of dependent libraries and their automatic loading.
We want to be able to load a library and parse its headers on a failed lookup, then retry the lookup (which will now succeed) and continue the original parsing. This requires saving the current state of the parser (and/or Sema?) just after the 'failed' lookup, then processing the #include of the headers, and finally restoring the previous state.

Vassil

Hmm... the problem is, semantic analysis is really not designed to
deal with something like this. Even if you ignore template
instantiation, an arbitrary declaration still inserts things into the
AST which can't be easily removed. If you can reject input code which
contains any declarations before it hits semantic analysis, and don't
need to deal with input code that uses templates, that should suppress
most of the ripple effects I can think of, though.

-Eli

Your help is greatly appreciated! I have a fair working knowledge of the AST
once it's produced, but I admit my ignorance when it comes to what's
involved with the parsing and production of the AST.

So, it appears that some ways to attack this problem are:

*1) PushState() and PopState():* /[Grade: ---]/ Although technically
possible, based on our discussion this appears unlikely without a major
overhaul.

*2) fork():* /[Grade: +]/ I modified my code to use fork(), and it works as
you would expect: arbitrary tokens can be parsed with no alteration of
the parent AST. However, the resultant StmtResult is isolated in the child
process.

  a) Could the StmtResult be "deep copied" to the parent process somehow?
(I'm not even sure what would be involved in creating an "inter-process
compatible" StmtResult.)
  
  b) It might be possible to do the necessary source-to-source transformation
in the child process and pass that back to the parent process, but this
would not be ideal.

*3) Identify and reject tokens that lead to declarations or use templates:*
/[Grade: ---]/ I'm not sure I know how to do this without actually parsing
the tokens. (I'm not sure I know how to do this even *after* parsing
the tokens.)

*4) fork() for verification:* /[Grade: ++]/ It should be possible to use a
forked child process to identify valid StmtResults that don't modify the
AST, and then tell the Parent process to "go ahead" and parse the valid
token streams when appropriate. This seems to be a promising avenue. So,
given a valid StmtResult in the child process, it would be necessary to
determine if parsing in the parent process would alter the parent AST in an
adverse way.
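
The process mechanics of option 4 can be sketched as follows. This is only a minimal illustration, assuming a POSIX system: the helper name succeedsInChild and the trial callback are hypothetical stand-ins for "enter the token stream and call ParseStatement", and nothing here is Clang API. The child reports a yes/no verdict through its exit status, and any state the trial touches stays in the child. It does not answer the harder question of whether repeating the parse in the parent would alter the parent AST adversely.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <functional>

// Hypothetical helper (not part of Clang): run `trial` in a forked child and
// return its verdict. Whatever `trial` modifies is modified only in the
// child's copy of the address space.
bool succeedsInChild(const std::function<bool()> &trial) {
  pid_t pid = fork();
  if (pid < 0)
    return false; // fork failed: treat as "not verified"

  if (pid == 0) {
    // Child: run the speculative work and encode the verdict in the exit
    // status. _exit() skips atexit handlers and stdio flushing, so buffers
    // inherited from the parent are not written out a second time.
    _exit(trial() ? 0 : 1);
  }

  // Parent: wait for the child and decode the verdict.
  int status = 0;
  if (waitpid(pid, &status, 0) < 0)
    return false;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In the intended use, the callback would wrap the speculative parse, and the parent would parse the same tokens itself only after a true verdict. A child that crashes is reported as a failure, because WIFEXITED is false in that case.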

Any recommendations on a good methodology for a new function "bool
StmtResultCausesAdverseRipples(StmtResult &SR);"?

Thanks again for your help!

You might want to have a look at ASTImporter and how LLDB uses it.
Vassil

Because I would be copying from one process to another, I think I would have
to rewrite ASTImporter to use something similar to a pipe mechanism that
doesn't have any pointers or references to process memory. I'm not sure that
I am capable of that.
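
For comparison, passing pointer-free data back through a pipe is straightforward when the data is plain text, such as already-transformed source from option 2b. The sketch below assumes a POSIX system; captureFromChild and the produce callback are hypothetical names, and it deliberately does not attempt to serialize AST nodes, which is the hard part discussed above. It also ignores EINTR for brevity.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <functional>
#include <string>

// Hypothetical helper (not part of Clang): run `produce` in a forked child
// and copy the text it returns into `out` in the parent via a pipe.
bool captureFromChild(const std::function<std::string()> &produce,
                      std::string &out) {
  int fds[2];
  if (pipe(fds) != 0)
    return false;

  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }

  if (pid == 0) {
    // Child: do the work, then write the bytes to the pipe.
    close(fds[0]);
    const std::string text = produce();
    size_t off = 0;
    while (off < text.size()) {
      ssize_t n = write(fds[1], text.data() + off, text.size() - off);
      if (n <= 0)
        _exit(1); // write error: report failure via the exit status
      off += static_cast<size_t>(n);
    }
    close(fds[1]);
    _exit(0);
  }

  // Parent: read until end-of-file before waiting, so a child that writes
  // more than the pipe buffer holds cannot block forever.
  close(fds[1]);
  out.clear();
  char buf[4096];
  ssize_t n;
  while ((n = read(fds[0], buf, sizeof buf)) > 0)
    out.append(buf, static_cast<size_t>(n));
  close(fds[0]);

  int status = 0;
  if (waitpid(pid, &status, 0) < 0)
    return false;
  return n == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Only bytes cross the process boundary here, so nothing in the result refers to the child's memory. That is what makes text easy to return and a StmtResult hard to return.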

The most promising solution to me seems to be option #4 from my previous
post.

Of course, it's possible that I'm misunderstanding something... it's been
known to happen :-)

Ah, I see. Yes, it would be difficult (if possible at all) to implement copying nodes from another process.
Vassil