Relexing more than one token

Once we have obtained a SourceLocation (or a SourceRange), we'd like to relex
the preprocessed stream to check for the presence of some tokens.

An example of use would be to check if the int type in an AST
declaration was written with "signed" or not.

We are able to relex a single token from a given SourceLocation, but we
haven't found a way to use a SourceRange to relex all the tokens
included in the range.

Is there a way?

Do you have hints for an alternative way to accomplish the same aim?

We don't have a great way to do this right now. The basic problem is that you could have something like this:

   foo bar baz

In order to lex from "foo" to "baz", you need to know (e.g.) if bar is a macro that expands to zero (or many) tokens. From an arbitrary point in an ASTConsumer, you don't have this information, because the macros could be undef'd etc.
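To make the ambiguity concrete, here is a toy sketch in plain C++. It is not Clang code: `MacroTable` and `expand` are hypothetical names invented for this illustration, and the expansion is deliberately non-recursive. It only shows that the same three words produce a different number of tokens depending on how bar happens to be defined at that point.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical toy model: a macro table maps a macro name to the tokens it
// expands to. Real preprocessing is far more involved than this.
using MacroTable = std::map<std::string, std::vector<std::string>>;

// Split `text` on whitespace and replace each word that names a macro with
// its replacement tokens (one level only, no rescanning).
std::vector<std::string> expand(const std::string &text,
                                const MacroTable &macros) {
  std::vector<std::string> out;
  std::istringstream in(text);
  std::string word;
  while (in >> word) {
    auto it = macros.find(word);
    if (it == macros.end())
      out.push_back(word);  // ordinary token, kept as-is
    else
      out.insert(out.end(), it->second.begin(), it->second.end());
  }
  return out;
}
```

With no macros defined, `expand("foo bar baz", {})` yields three tokens. If bar is defined to expand to nothing, the result has two tokens; if it expands to `x + y`, the result has five. Without the macro state at that point, there is no way to tell which case applies.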

However, not all hope is lost. It is very reasonable for an ASTConsumer to construct ASTs for a translation unit *AND* then preprocess the whole file again to get the tokens in a big vector. Given that, you could map from the AST node to an index in the vector, then scan around in the vector of tokens looking for what you want.

This is the idea of how the TokenRewriter works: it assumes that the clients of the rewriter have an AST or something else, and it works on a big array of tokens that it separately preprocesses.
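The "big vector" idea can be sketched roughly as follows. This is a toy model in plain C++, not the TokenRewriter implementation: the `Tok` struct, `indexOf`, and `rangeContains` are hypothetical, and a plain unsigned offset stands in for a source location. It assumes the tokens were recorded in lexing order from a single file, so the vector is sorted by location.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a lexed token: an offset playing the role of a
// source location, plus the token spelling.
struct Tok {
  unsigned loc;
  std::string text;
};

// Find the index of the token that starts at `loc`. Assumes `toks` is sorted
// by `loc`. Returns toks.size() if no token starts exactly there.
std::size_t indexOf(const std::vector<Tok> &toks, unsigned loc) {
  auto it = std::lower_bound(
      toks.begin(), toks.end(), loc,
      [](const Tok &t, unsigned l) { return t.loc < l; });
  if (it == toks.end() || it->loc != loc)
    return toks.size();
  return static_cast<std::size_t>(it - toks.begin());
}

// Report whether any token whose location lies in [begin, end] is spelled
// `word`. `begin` must be the start of a token; otherwise nothing is scanned
// and the result is false.
bool rangeContains(const std::vector<Tok> &toks, unsigned begin, unsigned end,
                   const std::string &word) {
  assert(begin <= end);
  for (std::size_t i = indexOf(toks, begin);
       i < toks.size() && toks[i].loc <= end; ++i)
    if (toks[i].text == word)
      return true;
  return false;
}
```

For the tokens of `signed int x; int y;`, calling `rangeContains` with the range of the first declaration and the word "signed" returns true, and with the range of the second declaration it returns false.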

-Chris

Chris Lattner wrote:

We don't have a great way to do this right now. The basic problem is
that you could have something like this:

  foo bar baz

In order to lex from "foo" to "baz", you need to know (e.g.) if bar is a
macro that expands to zero (or many) tokens. From an arbitrary point in
an ASTConsumer, you don't have this information, because the macros
could be undef'd etc.

Given that I can currently obtain the related token from any
SourceLocation, one possibility would be to store in class SLocEntry a
reference to the next token's SourceLocation in the preprocessed stream.
It should not be too hard to implement, but it means adding 32 bits to
each SLocEntry and keeping all of the translation unit's source locations
in memory.

However, not all hope is lost. It is very reasonable for an ASTConsumer
to construct ASTs for a translation unit *AND* then preprocess the whole
file again to get the tokens in a big vector. Given that, you could map
from the AST node to an index in the vector, then scan around in the
vector of tokens looking for what you want.

How would you map the AST node to the vector of relexed tokens?

In order to lex from "foo" to "baz", you need to know (e.g.) if bar is a
macro that expands to zero (or many) tokens. From an arbitrary point in
an ASTConsumer, you don't have this information, because the macros
could be undef'd etc.

Given that I can currently obtain the related token from any
SourceLocation, one possibility would be to store in class SLocEntry a
reference to the next token's SourceLocation in the preprocessed stream.
It should not be too hard to implement, but it means adding 32 bits to
each SLocEntry and keeping all of the translation unit's source locations
in memory.

We can't do that: SourceLocation has to stay 32 bits; it is very pervasive.

However, not all hope is lost. It is very reasonable for an ASTConsumer
to construct ASTs for a translation unit *AND* then preprocess the whole
file again to get the tokens in a big vector. Given that, you could map
from the AST node to an index in the vector, then scan around in the
vector of tokens looking for what you want.

How would you map the AST node to the vector of relexed tokens?

Just compare the source locations.

-Chris

Chris Lattner wrote:

In order to lex from "foo" to "baz", you need to know (e.g.) if bar is a
macro that expands to zero (or many) tokens. From an arbitrary point in
an ASTConsumer, you don't have this information, because the macros
could be undef'd etc.

Given that I can currently obtain the related token from any
SourceLocation, one possibility would be to store in class SLocEntry a
reference to the next token's SourceLocation in the preprocessed stream.
It should not be too hard to implement, but it means adding 32 bits to
each SLocEntry and keeping all of the translation unit's source locations
in memory.

We can't do that: SourceLocation has to stay 32 bits; it is very pervasive.

The size of SourceLocation would not change; the link to the next
SourceLocation would be added to SLocEntry (similarly to IncludeLoc and
SpellingLoc).

However, not all hope is lost. It is very reasonable for an ASTConsumer
to construct ASTs for a translation unit *AND* then preprocess the whole
file again to get the tokens in a big vector. Given that, you could map
from the AST node to an index in the vector, then scan around in the
vector of tokens looking for what you want.

How would you map the AST node to the vector of relexed tokens?

Just compare the source locations.

Do you mean that if I preprocess the whole translation unit again, I'd
get the same opaque SourceLocation ID?

We can't do that: SourceLocation has to stay 32 bits; it is very pervasive.

The size of SourceLocation would not change; the link to the next
SourceLocation would be added to SLocEntry (similarly to IncludeLoc and
SpellingLoc).

That just moves the cost to another place. Adding 32 bits per token is too much.

However, not all hope is lost. It is very reasonable for an ASTConsumer
to construct ASTs for a translation unit *AND* then preprocess the whole
file again to get the tokens in a big vector. Given that, you could map
from the AST node to an index in the vector, then scan around in the
vector of tokens looking for what you want.

How would you map the AST node to the vector of relexed tokens?

Just compare the source locations.

Do you mean that if I preprocess the whole translation unit again, I'd
get the same opaque SourceLocation ID?

Hopefully. If not, you can do a "deep comparison" along the lines of SourceManager::isBeforeInTranslationUnit.
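As a rough illustration of comparing by position rather than by opaque ID, here is a toy model in plain C++. It is not the Clang API: the `Loc` struct and `samePlace` are hypothetical, and the sketch ignores macro expansions and include chains, which a real comparison in the style of SourceManager::isBeforeInTranslationUnit has to handle.

```cpp
#include <string>

// Hypothetical decomposed location. `id` models the opaque value, which may
// differ between two preprocessing runs; `file` and `offset` model where the
// token actually is.
struct Loc {
  unsigned id;
  std::string file;
  unsigned offset;
};

// "Deep" equality: two locations denote the same place if they name the same
// file and the same offset in it, regardless of their opaque IDs.
bool samePlace(const Loc &a, const Loc &b) {
  return a.file == b.file && a.offset == b.offset;
}
```

Under this model, a location recorded in the AST and one produced by the second preprocessing pass match whenever they point at the same offset in the same file, even if the two runs handed out different IDs.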

-Chris

Chris Lattner wrote:

However, not all hope is lost. It is very reasonable for an ASTConsumer
to construct ASTs for a translation unit *AND* then preprocess the whole
file again to get the tokens in a big vector. Given that, you could map
from the AST node to an index in the vector, then scan around in the
vector of tokens looking for what you want.

How would you map the AST node to the vector of relexed tokens?

Just compare the source locations.

Do you mean that if I preprocess the whole translation unit again, I'd
get the same opaque SourceLocation ID?

Hopefully. If not, you can do a "deep comparison" along the lines of
SourceManager::isBeforeInTranslationUnit.

Instead of preprocessing the whole translation unit again, wouldn't it be
more effective to make Preprocessor::Lex() a virtual method and let
applications use a derived class of Preprocessor whose Lex method calls
Preprocessor::Lex() and then saves a copy of each token in the big vector
you mentioned above?

Yes, but that would impose an unacceptable performance penalty on the normal case. Also, you can't subclass "Preprocessor".

-Chris