What's the best way to...?

Hi,

I'm looking to determine if some given code fragment is "complete" -
and by that I mean that, for example, it can be completely enclosed by
an AST node if parsed in-context.

For example,

printf("%d\n",

is not complete by my definition.

But "printf("%d\n", i);" is. Similarly, "if (blah) {" is not. But "if
(blah) { doStuff(); }" is. Or likewise "int fib(int n)" is not
complete - but with a semicolon at the end, its a function
declaration, or with a function body following, it's a function
definition - and thus complete.

I'm looking at what's the best way to do such an analysis using the
clang libraries. I have some ideas, but I thought I'd ask the list to
make sure I'm not missing some better way...

One way I thought about is using the Lexer to lex it into a list of
tokens, then use a stack to match {}, , etc. If the code fragment is
balanced, then look at the last non-comment token and check if its
either a "}" or a ";", in which case the fragment is complete.
Otherwise it's not.

Obviously, this technique is just an approximation - and there's
probably edge cases it does not cover. Ideally, it would be nice to
re-use logic from the Parser - but looking at the public APIs - I
didn't see an easy way to accomplish this.

Any suggestions would be appreciated. Thanks.

-Alexei

Alexei Svitkine wrote:

Hi,

I'm looking to determine if some given code fragment is "complete" -
and by that I mean that, for example, it can be completely enclosed by
an AST node if parsed in-context.

For example,

printf("%d\n",

is not complete by my definition.

But "printf("%d\n", i);" is. Similarly, "if (blah) {" is not. But "if
(blah) { doStuff(); }" is. Or likewise "int fib(int n)" is not
complete - but with a semicolon at the end, its a function
declaration, or with a function body following, it's a function
definition - and thus complete.

I'm looking at what's the best way to do such an analysis using the
clang libraries. I have some ideas, but I thought I'd ask the list to
make sure I'm not missing some better way...
  

Very wild idea, and probably not practicable: put a layer between lexer
and parser. Usually it just forwards tokens, but when the lexer runs out
of tokens, you do a stack trace to find out what part of the parser
called you. Then you know what the parser would now expect.

Of course, that probably requires careful analysis of the parser and
some huge state tables, not to mention that stack tracing as part of the
normal program flow is simply impracticable.

Sebastian

Very wild idea, and probably not practicable: put a layer between lexer
and parser. Usually it just forwards tokens, but when the lexer runs out
of tokens, you do a stack trace to find out what part of the parser
called you. Then you know what the parser would now expect.

Hmm.. Perhaps if the Parser was made to pass a constant that says what
type of thing is being parsed to the Lexer when asking for the next
token, the stack trace voodoo can be avoided. The Lexer would then
need to be decoupled from the Parser to make this possible... perhaps
the Parser would use a LexerProvider or something which just wraps the
Lexer and takes the extra parameter. Then I could derive from
LexerProvider and intercept that...

Seems like it would require a lot of (small) changes to the Parser to
do this... and I can understand that there might be resistance against
this approach since it would add a level of indirection and thus
decrease performance (though a smart compiler/linker could optimize
that out?).

-Alexei

though a smart compiler/linker could optimize that out?

I meant inline.

-Alexei

Hi Alexei,

Alexei Svitkine wrote:

Hi,

I'm looking to determine if some given code fragment is "complete" -
and by that I mean that, for example, it can be completely enclosed by
an AST node if parsed in-context.

For example,

printf("%d\n",

is not complete by my definition.

But "printf("%d\n", i);" is. Similarly, "if (blah) {" is not. But "if
(blah) { doStuff(); }" is. Or likewise "int fib(int n)" is not
complete - but with a semicolon at the end, its a function
declaration, or with a function body following, it's a function
definition - and thus complete.
  
Here's an idea, not sure how well it will work or whether it suits your purposes.

The preprocessor has a "FileChanged" callback, look in "Lex/PPCallbacks.h".
The idea is that you "intercept" that callback (using Preprocessor::setPPCallbacks) and enter the code fragment using something like:

  llvm::MemoryBuffer *SB = llvm::MemoryBuffer::getMemBufferCopy(buf, buf_len); // buf contains the code fragment string
  unsigned FileID = SourceMgr.createFileIDForMemBuffer(SB);
  PP.EnterSourceFile(FileID, 0); // PP is the Preprocessor

Then you call "Parser::ParseTopLevelDecl".
If the code fragment is incomplete, the FileChanged callback will be invoked (the preprocessor will close the "file" trying to consume more tokens), otherwise ParseTopLevelDecl will return the AST node (or null in case of error).

Also, you will probably need to make some Parser methods like "ParseExpression" public so that you can use them.

-Argiris