from source location to token?

dear cfe list,
I am implementing an AST visitor, and I occasionally need to know the exact
token that generated an AST node.

I can get the source location using the virtual function getSourceRange(),
the clang::SourceLocation can be used with the SourceManager, but I can't
find a natural way to get the clang::Token or the IdentifierInfo.

Or must I use const char* clang::SourceManager::getCharacterData() and lex
again?

Thanks for help.
pb

Hi Paolo,

I'm not sure what you are trying to do, however here is some info...

- The ASTs don't currently store Tokens (they are transient), and we have no mechanism for associating an AST with a list/stream of Tokens.
- The NamedDecl AST stores a DeclarationName (which is commonly an IdentifierInfo).
- The DeclRefExpr AST stores a NamedDecl.

You can, as you figured out, get a character stream from a SourceLocation (and lex/parse again).

It would be interesting to know a little bit more about what you are doing. In general, re-lexing/re-parsing isn't very desirable (though it may be appropriate if it's uncommon).

HTH,

snaroff

Hi Steve,

there are program analyses that, for instance:

1) need to reason on an exact representation of floating-point literals
    (any approximation may result in an unsound analysis);
2) need to reason on the textual representation that was used in the
    program also for integer literals (for example, there are coding
    rules that forbid the use of octal constants: the analyzer should
    flag their use in the source program).

All the best,

    Roberto

Hi Roberto,

Makes sense. We could consider adding a "LiteralName" to IntegerLiteral/FloatingLiteral if this becomes commonplace (it may help doing more precise pretty printing as well).

snaroff

Roberto wrote:

there are program analyses that, for instance:

1) need to reason on an exact representation of floating-point literals
    (any approximation may result in an unsound analysis);

Hi Roberto,

You don’t need the original Token to get this. Clang’s source location information is so precise that (given a SourceLocation) you can re-lex any token later. This property is actually inherent to how we handle SourceRanges: the source range points to the start of the first and last tokens in the range. To render through the end of the last token, the diagnostics machinery re-lexes it, which gives exactly the original spelling and length.

You can see this in action with the command line driver. Consider this program:

struct s;
void foo(struct s *s) {
*s + 0.12321e-42;
}

-fsyntax-only prints:

t.c:7:7: error: invalid operands to binary expression (‘struct s’ and ‘double’)
*s + 0.12321e-42;
~~ ^ ~~~~~~~~~~~

The way it gets the end of the fp literal is to relex the token with code in Lexer::MeasureTokenLength.

Given a SourceLocation and the length of the token (as returned by MeasureTokenLength) you can get the exact original spelling (including trigraphs and escaped newlines, beware). The pointer to the start of the string is obtained with:

const char *StrData = SourceMgr.getCharacterData(Loc);
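To make the idea concrete, here is a minimal, self-contained sketch of measuring a numeric literal by scanning forward from a character pointer. It is not Clang's Lexer::MeasureTokenLength: the helper names measureNumberLength and numberSpelling are hypothetical, the scan only follows the C preprocessing-number shape, and trigraphs and escaped newlines are deliberately ignored.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Clang): returns the length of the C
// preprocessing number that starts at `start`, or 0 if none starts there.
// Assumes `start` points into a NUL-terminated buffer.
// Simplification: trigraphs and escaped newlines are not handled.
std::size_t measureNumberLength(const char *start) {
  const char *p = start;
  // A preprocessing number begins with a digit, or '.' followed by a digit.
  if (*p == '.' && std::isdigit(static_cast<unsigned char>(p[1])))
    ++p;
  if (!std::isdigit(static_cast<unsigned char>(*p)))
    return 0;
  ++p;
  for (;;) {
    unsigned char c = static_cast<unsigned char>(*p);
    // A sign is part of the number only right after an exponent marker.
    bool afterExponent =
        p[-1] == 'e' || p[-1] == 'E' || p[-1] == 'p' || p[-1] == 'P';
    if ((c == '+' || c == '-') && afterExponent)
      ++p;
    else if (std::isalnum(c) || c == '_' || c == '.')
      ++p;
    else
      break;
  }
  return static_cast<std::size_t>(p - start);
}

// Hypothetical helper: copies out the exact spelling of the number.
std::string numberSpelling(const char *start) {
  return std::string(start, measureNumberLength(start));
}
```

Applied to a pointer at the start of the literal in the example above, numberSpelling would yield the text 0.12321e-42 exactly as written, with no floating-point rounding involved.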

2) need to reason on the textual representation that was used in the
    program also for integer literals (for example, there are coding
    rules that forbid the use of octal constants: the analyzer should
    flag their use in the source program).

Sure, Clang can handle this sort of thing with no problem.
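As an illustration of the octal check, once the original spelling of an integer literal has been recovered, a sketch along these lines could classify it. The function name isOctalLiteralSpelling is hypothetical, and treating a bare 0 as exempt is an assumption borrowed from common coding rules, not something stated in this thread.

```cpp
#include <string>

// Hypothetical helper: returns true if `spelling` (the exact source text of
// a C integer literal) is an octal constant other than plain zero.
// Assumption: a bare "0" is exempt, as many coding rules allow it.
bool isOctalLiteralSpelling(std::string spelling) {
  // Drop any integer suffix (u, U, l, L in any combination).
  while (!spelling.empty()) {
    char last = spelling[spelling.size() - 1];
    if (last == 'u' || last == 'U' || last == 'l' || last == 'L')
      spelling.erase(spelling.size() - 1);
    else
      break;
  }
  // Octal constants start with '0' and have at least one more digit.
  if (spelling.size() < 2 || spelling[0] != '0')
    return false;
  // Every remaining character must be an octal digit; this rejects
  // hexadecimal ("0x1F") and binary ("0b101") spellings as well.
  for (std::size_t i = 1; i < spelling.size(); ++i)
    if (spelling[i] < '0' || spelling[i] > '7')
      return false;
  return true;
}
```

The check works on the spelling alone, so it needs no access to the evaluated value of the literal.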

-Chris

I still have a problem. Consider the following program:

#define N_OF_ELEM 037;
int a = N_OF_ELEM;

Using the method you suggest, the textual representation I get for the
integer literal on the right-hand side of the initialization is `N_OF_ELEM'.

It seems Lexer::MeasureTokenLength() automatically changes the SourceLocation
using clang::SourceManager::getLogicalLoc(). In my case it shouldn't.

What is the best way to obtain the token text after preprocessing?

thanks
pb

Use "Loc = SourceManager.getPhysicalLoc(Loc)" before calling that code,

-Chris

Clang's AST does not seem to remember the original expression that gave
the size of a constant array. The clang::ConstantArrayType class has no
method like the getSizeExpr() of clang::VariableArrayType.

Is there a way to know the exact expression even for constant arrays?

I'd like to check how the constant was obtained. Moreover, I'd like
to see whether any overflow occurred.

E.g., on my machine:

source_file.c:
int main() {
    int b[10000000 * 1000000];
}

$ clang -ast-print source_file.c
typedef struct __va_list_tag __builtin_va_list[1];

int main() {
  int b[1316134912];
}

or even more intriguing:

source_file2.c:
int main() {
    int b[1000000 * 1000000];
}

$ clang -ast-print source_file2.c
typedef struct __va_list_tag __builtin_va_list[1];
te.c:3:11: error: array size is negative
    int b[1000000 * 1000000];
          ^~~~~~~~~~~~~~~~~

int main() {
  int b[3567587328];
}

1 diagnostic generated.

Simply checking whether the expression really evaluates to the value stored
in the AST would catch serious programming errors like these.
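The two sizes printed above are consistent with the products being truncated to 32 bits. A minimal sketch of that arithmetic, assuming a 32-bit int (the struct and function names are hypothetical and not part of Clang):

```cpp
#include <cstdint>
#include <limits>

// Hypothetical result type: the exact product, its low 32 bits, and
// whether the exact product is representable as a 32-bit signed int.
struct WrappedProduct {
  std::int64_t exact;
  std::uint32_t low32;
  bool fitsInInt32;
};

// Multiplies in 64 bits so the exact value is available, then shows what
// survives a truncation to 32 bits.
WrappedProduct multiplyAs32Bit(std::int32_t a, std::int32_t b) {
  WrappedProduct r;
  r.exact = static_cast<std::int64_t>(a) * b;
  // Conversion to an unsigned type is modular, so this keeps the low 32 bits.
  r.low32 = static_cast<std::uint32_t>(r.exact);
  r.fitsInInt32 = r.exact >= std::numeric_limits<std::int32_t>::min() &&
                  r.exact <= std::numeric_limits<std::int32_t>::max();
  return r;
}
```

With this, multiplyAs32Bit(10000000, 1000000).low32 is 1316134912 and multiplyAs32Bit(1000000, 1000000).low32 is 3567587328, matching the two array sizes shown by -ast-print. The second value has its top bit set, so it is negative when read as a signed 32-bit int, which fits the "array size is negative" diagnostic.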

pb

I did that, and it worked well until the last update of clang.
But after updating to release 62317 I get this error:
‘class clang::SourceManager’ has no member named ‘getPhysicalLoc’

what is the correct solution now?

pb

Chris just renamed getPhysicalLoc to getSpellingLoc.

Sebastian