Why doesn't LibClang's cursor faithfully reflect the node information of the source file's AST?

Hi, I’m currently working on a project that batch processes .i files generated from .c and .cpp files. As we all know, a .i file contains many more lines than its source file because of what the preprocessor emits, but when we use -ast-dump to print the AST of a .i file, each node still records the original line numbers. For example:

The image above is a screenshot of an AST dump produced from a .i file. You can see that the AST node at line 3140 faithfully records a function declaration with the correct line range 219-355.

As I understand it (which may be incorrect), the LibClang cursor traverses the AST nodes, so I expected the location information it returns via functions such as clang_getExpansionLocation() and clang_getSpellingLocation() to be consistent with what is recorded in the AST node.

Everything works well for .c and .cpp files. However, when I use LibClang to process .i files and print the cursor’s location, it returns the line number from the .i file (for example, 177749 in the 3rd image) instead of the line number from the source file as recorded in the AST node.

The clang_getPresumedLocation() function I found in the CXSourceLocation.h file solved the problem for some simpler .i files by returning the positions recorded in the # ... line markers. However, not every line in the .i file directly follows one of these markers, so this function does not seem to be a complete solution to my problem.

Oddly enough, I once used the CSA Checker in another project to process .i files directly, and it didn’t create this problem, but printed the original line numbers in the explodedGraph nodes.

So what I’d like to know is: Are cursors in LibClang one-to-one with AST nodes? Is there any way for the user to directly access the information in the AST nodes?

The cursors in libclang are thin wrappers around the AST nodes. There are private APIs in CXCursor.h like getCursorVariableRef() that can convert from a CXCursor to an AST node, but due to being private APIs, they may not be suitable for your needs.

I would recommend using presumed locations. The line markers tell the compiler “pretend line number N starts here”, so the fact that code doesn’t immediately follow the line markers isn’t an issue. e.g.,

# 123 "dummy.c"
int var; // Presumed location is dummy.c:123

void func(void); // Presumed location is dummy.c:125

# 456 "derp.c"
int other; // Presumed location is derp.c:456

If you’re finding the presumed locations you get back are wrong, then I think we’d need to see a more concrete example of what the .i file contents are and what presumed locations you’re getting back that appear to be incorrect.
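
To make the line-marker rule concrete, here is a minimal sketch in plain C that maps a physical line of preprocessed text to its presumed file and line. It is a toy model written for illustration, not Clang's implementation: the names PresumedLoc and presumed_location are hypothetical, it only recognises markers of the form # N "file", and it ignores the optional flags that can follow the file name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: the file name and line number announced by the
 * nearest preceding line marker, adjusted for the lines in between. */
typedef struct {
    char file[256];
    unsigned line;
} PresumedLoc;

/* Toy model only (not Clang's code). Walks `text` line by line and tracks
 * `# N "file"` markers. `physical_line` is 1-based. Returns 0 on success and
 * -1 if the line does not exist or is itself a line marker. */
static int presumed_location(const char *text, const char *initial_file,
                             unsigned physical_line, PresumedLoc *out)
{
    char cur_file[256];
    unsigned cur_line = 1; /* presumed number of the line being examined */
    unsigned phys = 1;     /* physical number of the line being examined */
    const char *p = text;

    snprintf(cur_file, sizeof cur_file, "%s", initial_file);

    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);
        char buf[512];
        char name[256];
        unsigned n = 0;
        int is_marker;

        /* Copy the current line into a NUL-terminated buffer for sscanf. */
        size_t copy = len < sizeof buf - 1 ? len : sizeof buf - 1;
        memcpy(buf, p, copy);
        buf[copy] = '\0';

        is_marker = (sscanf(buf, " # %u \"%255[^\"]\"", &n, name) == 2);

        if (phys == physical_line) {
            if (is_marker)
                return -1; /* a marker is not a source line */
            snprintf(out->file, sizeof out->file, "%s", cur_file);
            out->line = cur_line;
            return 0;
        }

        if (is_marker) {
            /* "Pretend line number n starts here": the NEXT line is line n. */
            snprintf(cur_file, sizeof cur_file, "%s", name);
            cur_line = n;
        } else {
            cur_line++; /* blank lines and ordinary lines both count */
        }

        if (!eol)
            break;
        p = eol + 1;
        phys++;
    }
    return -1;
}
```

Fed the snippet above, with the first marker on physical line 1, this model reports dummy.c:123 for physical line 2, dummy.c:125 for physical line 4, and derp.c:456 for physical line 7. The lines between markers keep counting from the last marker, which is why code does not need to follow a marker immediately.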

Thank you for your response! I continued to verify and found that the PresumedLocation itself is fine, and that the cause of my problem is in the AST.

If it doesn’t bother you, I’d like to follow up with you on this question: I found that if there is a function declared like this in the .cpp file, the AST exported from the .cpp file is able to successfully interpret it as a function declaration node:

However, after I converted the .cpp file to a .i file, the corresponding function can no longer be found in the AST, and this datatype doesn’t seem to be highlighted in the source file either.

I’m curious what caused this.

My guess is that the issue is that the declaration is invalid.

and thus something is dropping it, either when preprocessing to a file or when trying to take the preprocessed code and form a new AST from it.

Thanks for the answer. I found that the problem with the .i file seems to be that types such as compute::kernel are not correctly resolved to a valid int, and I will look for other ways to fix it. Thanks again for your help :slight_smile:

By the way, I’ve actually been very curious as to why the cursor used in libclang’s design isn’t strictly traversing abstract syntax tree nodes. Or rather, why not expose the node data structure to the user side.

Is it because of the execution cost of the traversal algorithms, or something else? And do other syntax tree traversal tools in the Clang frontend, such as RecursiveASTVisitor, have such a feature?

It is traversing the AST: https://github.com/llvm/llvm-project/blob/6d66440c50e3047b5ce152830d4ccae381c7c2bf/clang/tools/libclang/CursorVisitor.h#L70 (the base classes are ones we commonly use when traversing the AST).

We don’t expose the AST nodes directly through this interface because the libclang interface is intended to be stable while our C++ interface is constantly changing. This allows folks to use libclang without having to constantly modify their code as trunk evolves.

Thanks for replying! :slight_smile:
