Problems and more problems with the python bindings


I'm currently writing a tool to do static analysis using the Python
bindings and one of the first tasks is to write a simple C-to-C

For this task, the AST object cindex generates should return more
information than it currently does (for example, the operator for
BINARY_OPERATOR/UNARY_OPERATOR cursors, the corresponding value for
literals, etc.).

The hack I'm using to retrieve the information not exposed by the
Python bindings relies on the patches provided by Manuel Holtgrewe
[1], which expose the libclang tokenization APIs in the Python
bindings. My code does the following:

1) Retrieve the location of the expression (the start and the end
offsets).
2) Read from the provided file the corresponding chunk of C code.
3) Tokenize this string and extract from it the interesting part.
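
The three steps above can be sketched in plain Python. This is only an
illustration: `slice_expression`, `tokenize` and `find_operator` are
hypothetical helpers, and the regular-expression tokenizer is a stand-in
for the libclang tokenization API exposed by the patches, which is not
reproduced here. The offsets are assumed to be character offsets into
the file contents, as a cursor extent would report them.

```python
import re

# Hypothetical stand-in for the libclang tokenizer: it only knows
# identifiers, integer literals and a handful of C punctuators.
TOKEN_RE = re.compile(r"""
    (?P<identifier>[A-Za-z_]\w*)
  | (?P<literal>\d+)
  | (?P<punctuation>==|!=|<=|>=|&&|\|\||[-+*/%<>=!&|()])
""", re.VERBOSE)


def slice_expression(source, start, end):
    """Step 2: cut the chunk of C code between two offsets (end exclusive)."""
    return source[start:end]


def tokenize(chunk):
    """Step 3a: split the chunk into (kind, spelling) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(chunk)]


def find_operator(tokens):
    """Step 3b: return the first punctuation token that is not a parenthesis."""
    for kind, spelling in tokens:
        if kind == "punctuation" and spelling not in ("(", ")"):
            return spelling
    return None


source = "if ( 0 == 1 )"
# Step 1 would take these offsets from the cursor; here they are hand-picked.
chunk = slice_expression(source, 5, 11)
operator = find_operator(tokenize(chunk))
```

With the hand-picked offsets, `chunk` is `0 == 1` and `operator` is
`==`. Everything depends on the offsets from step 1 being right.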

However, this approach doesn't work, for several reasons. Two of them:

1) The start and end offset of the expression returned by the python
bindings may be wrong. For example, the following expression:

  if ( 0 == 1 && 2 == 3 )
    // stuff

...will report 3 BINARY_OPERATOR expressions (which is correct) but with
wrong start and end offsets:

CursorKind.BINARY_OPERATOR <SourceLocation file '../tests/test7.c', line 3, column 10>
CursorKind.BINARY_OPERATOR <SourceLocation file '../tests/test7.c', line 3, column 10>
CursorKind.BINARY_OPERATOR <SourceLocation file '../tests/test7.c', line 3, column 20>

The first and second BINARY_OPERATOR cursors both report the same start
and end offsets (those of the complete parenthesized expression), which
is wrong.
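
When nested expressions report the same start, the start alone cannot
tell them apart. One possible workaround, sketched below in plain
Python, is to look for the operator only in the gap between the two
operands. The name `operator_between` and the hand-computed offsets are
hypothetical, and the sketch assumes the extents of the operands
themselves are reported correctly, which may not hold.

```python
def operator_between(source, left_extent, right_extent):
    """Return the text between two operand extents, stripped of whitespace.

    Each extent is a (start, end) pair of offsets into `source`, with
    `end` exclusive. In a real tool they would come from the child
    cursors; here they are supplied by hand.
    """
    left_end = left_extent[1]
    right_start = right_extent[0]
    return source[left_end:right_start].strip()


source = "if ( 0 == 1 && 2 == 3 )"

# Hand-computed extents of the two operands of the outer expression.
left_start = source.index("0 == 1")
right_start = source.index("2 == 3")
left = (left_start, left_start + len("0 == 1"))
right = (right_start, right_start + len("2 == 3"))

outer_operator = operator_between(source, left, right)  # "&&"
```

The sketch ignores comments and parentheses that may sit between the
operands, so the gap would still need to be tokenized in real code.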

2) It simply doesn't work with macros. If I expect an INTEGER_LITERAL, I
look for a TokenKind.LITERAL token among the tokens returned by the
clang tokenization API. However, if the source contains a macro (like
__LINE__, for example) rather than a literal, I don't read the correct
literal, because I'm reading the raw file without preprocessing (the
preprocessed buffer, if I'm not wrong, is not available).

Well, I think my mail is very long already, so here go my questions:

1) Is there an easy way to retrieve this information (the relevant
operator for BINARY_OPERATOR, UNARY_OPERATOR, *_LITERAL, etc...) using
the current python bindings?

2) If not (as seems to be the case), can somebody tell me where to look
in Clang's source code to expose this information?

PS: Sorry for the long e-mail.

Thanks in advance,
Joxean Koret

I can't speak for whether there is a better way this could be done
through the cursor interface. However, it would be possible to map that
cursor to a token by obtaining the cursor's location then querying for
the token at that source location.
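
As a rough illustration of that lookup, the sketch below finds the
token covering a given offset by bisection. The `Token` tuple and the
`token_at` function are hypothetical and do not correspond to anything
in the bindings; a real implementation would take both the tokens and
the location from libclang.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical token record: start offset in the file plus spelling.
Token = namedtuple("Token", ["offset", "spelling"])


def token_at(tokens, offset):
    """Return the token whose text covers `offset`, or None.

    `tokens` must be sorted by start offset and must not overlap.
    """
    starts = [t.offset for t in tokens]
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    tok = tokens[i]
    if offset < tok.offset + len(tok.spelling):
        return tok
    return None


# Tokens of "0 == 1 && 2 == 3", with offsets relative to that string.
tokens = [Token(0, "0"), Token(2, "=="), Token(5, "1"), Token(7, "&&"),
          Token(10, "2"), Token(12, "=="), Token(15, "3")]

hit = token_at(tokens, 8)   # falls inside "&&"
miss = token_at(tokens, 9)  # whitespace between "&&" and "2"
```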

Unfortunately, the current Python bindings don't support the token API.
However, I have a preliminary implementation at
that does. I'm planning to submit the patch once it is complete (I just
have tests to write). Feel free to use my code until something official