Python bindings questions

Hi all,

I'm writing a simple tool that generates one XML file based on the AST
given by the clang Python bindings. The problem is that I don't know how
to get the operator used, the constant values, etc... I mean, given this
simple C code:

void foo(void)
{
  int i = 29;
}

I first get a VAR_DECL cursor object with displayname "i" and, then, I
get an INTEGER_LITERAL cursor object but I don't know how to get the
value "29". The same goes for BINARY_OPERATOR, etc...

So, my question is: how do I get this information with the python
bindings? Sorry if it's a dumb question!

Thanks in advance!
Joxean Koret

There are two solutions I can come up with: (1) extract the string covered by the extent of the INTEGER_LITERAL, (2) annotate the tokens.
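
For (1), a minimal sketch, assuming you still have the source buffer at hand and read the extent as byte offsets (in cindex.py these are cursor.extent.start.offset and cursor.extent.end.offset). The helper name extent_text is made up for illustration, and fake_cursor only mimics those attributes so the snippet runs without libclang:

```python
from types import SimpleNamespace


def extent_text(source: bytes, cursor) -> str:
    """Return the source text covered by cursor.extent.

    The offsets are byte offsets, so slice the raw bytes first and
    decode afterwards.
    """
    start = cursor.extent.start.offset
    end = cursor.extent.end.offset
    return source[start:end].decode("utf-8")


def fake_cursor(start: int, end: int):
    """Stand-in for a clang.cindex.Cursor (illustration only)."""
    return SimpleNamespace(
        extent=SimpleNamespace(
            start=SimpleNamespace(offset=start),
            end=SimpleNamespace(offset=end),
        )
    )


SOURCE = b"void foo(void)\n{\n  int i = 29;\n}\n"
literal_start = SOURCE.index(b"29")
literal = fake_cursor(literal_start, literal_start + 2)
print(extent_text(SOURCE, literal))
```

With a real cursor you would pass the INTEGER_LITERAL cursor itself and the bytes of the file it came from.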

(2) requires wrapping of the libclang tokenization API that is not available in the clang SVN's python wrapper yet (I think). I have an experimental version of this living at

https://github.com/holtgrewe/linty/blob/master/clang/cindex.py

The main "problem" is that tokens are not managed as single objects but as arrays. Additionally, the tokens can be annotated back to their AST cursors but this information lives in an additional array.

My take on a pythonic wrapper was to write a TokenCollection class and some more icing. This new API part works pretty well for me.
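
To make the shape of such a wrapper concrete, here is a rough sketch; it is not the code in the repository above. It assumes the tokenizer hands back one array of tokens and the annotation step a second, parallel array of cursors. The names Token and TokenCollection, their fields, and the sample data are invented for the example, and plain strings stand in for the libclang objects:

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class Token:
    """One token paired with the kind of cursor it was annotated with."""
    spelling: str
    cursor_kind: str


class TokenCollection:
    """Presents two parallel arrays as a sequence of Token objects."""

    def __init__(self, spellings: Sequence[str], cursor_kinds: Sequence[str]):
        if len(spellings) != len(cursor_kinds):
            raise ValueError("token and cursor arrays must have equal length")
        self._spellings = spellings
        self._cursor_kinds = cursor_kinds

    def __len__(self) -> int:
        return len(self._spellings)

    def __getitem__(self, index: int) -> Token:
        return Token(self._spellings[index], self._cursor_kinds[index])

    def __iter__(self) -> Iterator[Token]:
        for i in range(len(self)):
            yield self[i]


# Hand-written sample data, not actual libclang output.
tokens = TokenCollection(
    ["int", "i", "=", "29", ";"],
    ["VAR_DECL", "VAR_DECL", "VAR_DECL", "INTEGER_LITERAL", "DECL_STMT"],
)
literals = [t.spelling for t in tokens if t.cursor_kind == "INTEGER_LITERAL"]
print(literals)
```

The point of the design is that callers only ever see single Token objects; the two arrays stay an implementation detail of the collection.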

I have not yet asked for people's opinion on this list since I wanted to find out how this works out for my projects. If anyone is interested in doing a review, they are welcome to it.

HTH,
Manuel

Hi,

Thanks Manuel. However, your approach doesn't always work. For example,
for a CursorKind.CSTYLE_CAST_EXPR object it will return the complete
casting line (i.e., "(const unsigned char *)var"), for
CursorKind.COMPOUND_STMT objects it will return the complete compound
statement (which may happily be the complete function you're analysing),
etc... You need to code very ugly hacks in order to get the real data
you want which, IMHO, should be exposed in an easier way.
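
One such hack, sketched under assumptions: for a binary operator, the operator itself is whatever source text lies between the end of the first child's extent and the start of the second child's extent. The function below takes those two byte offsets directly (with the bindings they would come from the children's extent.end.offset and extent.start.offset). The name operator_between is made up, and the approach ignores comments or macros sitting between the operands:

```python
def operator_between(source: bytes, left_end: int, right_start: int) -> str:
    """Return the text between two operand extents, stripped of whitespace."""
    return source[left_end:right_start].decode("utf-8").strip()


SOURCE = b"if ( i == 0 || i == 1 )"
left = b"i == 0"
right = b"i == 1"
# Offsets computed by hand here; a real tool would read them from the
# two child cursors of the BINARY_OPERATOR cursor.
left_end = SOURCE.index(left) + len(left)
right_start = SOURCE.index(right)
print(operator_between(SOURCE, left_end, right_start))
```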

Apart from that, there is also some information still not available (not
related to your code but to the main code). For example, the following C
code:

void foo(int i)
{
  if ( i == 0 || i == 1 )
    stuff();
}

and this one:

void foo(int i)
{
  if ( i == 0 && i == 1 )
    stuff();
}

...will return the same AST, because the "&&" or "||" operator doesn't appear in it.
But, curiously, in this case:

void foo(int i)
{
  if ( i && i == 1 )
    stuff();
}

...it correctly returns the && operator. It seems to be a bug. In case
any of you want to take a look, I've attached the script I wrote and a
simple C test file (BTW, it uses the code Manuel Holtgrewe provided in
his previous e-mail, available here [1]).

[1] https://github.com/holtgrewe/linty/blob/master/clang/cindex.py

Thanks & Regards,
Joxean Koret

clang2xml.py (4.01 KB)

Hmmm. I just noticed this after I wrote an implementation for the token
API myself:

https://github.com/indygreg/clang/tree/python_features/bindings/python/clang

I worked around the problem you mentioned by copying the individual
CXToken structs into separate Python objects then calling
clang_disposeTokens() on the original memory before returning the newly
obtained tokens. This relies on the current implementation of those APIs
where the allocated CXToken array is merely itself a shallow copy of
underlying data. Although it seems to violate an unstated API contract,
it does work and is much cleaner on the Python side.
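
In ctypes terms, the copy step looks roughly like the sketch below. The struct layout follows CXToken in clang-c/Index.h (unsigned int_data[4]; void *ptr_data). Everything else is an assumption for illustration: copy_tokens is a made-up name, a locally allocated array stands in for the one clang_tokenize() returns, and memset stands in for clang_disposeTokens(). Whether the copies remain usable after the real dispose call depends on the implementation detail described above:

```python
import ctypes


class CXToken(ctypes.Structure):
    """Mirrors libclang's CXToken layout: four unsigned ints and a pointer."""
    _fields_ = [
        ("int_data", ctypes.c_uint * 4),
        ("ptr_data", ctypes.c_void_p),
    ]


def copy_tokens(array, count):
    """Copy each struct out of a C array into its own Python-owned object."""
    return [CXToken.from_buffer_copy(array[i]) for i in range(count)]


# Stand-in for the array clang_tokenize() would allocate.
count = 3
original = (CXToken * count)()
for i in range(count):
    original[i].int_data[0] = i + 1

copies = copy_tokens(original, count)

# Stand-in for clang_disposeTokens(): wipe the original memory.
ctypes.memset(ctypes.addressof(original), 0, ctypes.sizeof(original))

print([tok.int_data[0] for tok in copies])
print([original[i].int_data[0] for i in range(count)])
```

The copies keep their values after the original buffer is wiped, which is what lets each token live as a separate Python object.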

Greg

Hi,

I don't know if this is the correct place to submit patches. Anyway,
attached is a patch for Clang's Python bindings that adds support for
the tokenization API (the code was written by Manuel Holtgrewe) and also
adds a property called 'expression' to the cursor object. This property
retrieves the operator or the literal expression (the only CursorKinds I
have added support for at the moment). This way, nobody needs to code
horrible hacks in order to get those expressions.

@Manuel: I hope it's OK that I submitted your code to the list along
with mine in the patch.

Regards,
Joxean Koret

clang.patch (13.2 KB)

Thank you very much! Your patch helped a lot!
I'm actually not using the Python bindings, just libclang, but your
clang_getCursorExpression() should be committed to 3.2.

Best regards,
Alexey Kovalevsky